vmhost-x86-copr01.rdu-cc.fedoraproject.org DOWN #11950
Reference
infra/tickets#11950
IDRAC claims:
The machine cannot be turned on.
Yeah, this happened yesterday night... I wasn't able to file a ticket on it then, so many thanks for doing so.
We will need to engage Dell tomorrow and see what can be done... ;(
Metadata Update from @phsmoura:
I think @dkirwan, @jnsamyak and @patrikp are going to work on this one and try to get Dell on the line to fix things. ;)
Working with Dell tech support to resolve.
Latest update:
The RH tech is unavailable until the week after June 10th to reseat the OCP card on the server. We are blocked on the Dell tech support steps until this work is carried out.
James Gibson has responded to me this afternoon, he will be in the RDU2-CC datacenter and can take care of reseating this OCP card.
James messaged that he discovered an issue with a PSU. Updated Dell with information.
So what's the current status here? I see the machine is up. Do we need to replace the bad PSU, or?
Yup, just going through the tech support process once more, and will try to get them to replace the damaged PSU.
Metadata Update from @dkirwan:
Updated internal ticket for James Gibson, to carry out the next troubleshooting steps requested by Dell.
James swapped the PSUs on Friday, and the server also boots correctly with the other PSU, so this is not a PSU issue on its own. Updated Dell tech support with the information. Should get another update from them today.
Checking to see if there is an issue external to the server, perhaps with the UPS.
Asking James Gibson to check:
OK, looks like we're fully back in action: the 2nd PSU is connected, and we've connected to power from different power sockets. Seems like it may have been a faulty power outlet?
Metadata Update from @dkirwan:
Metadata Update from @praiskup:
We are offline again.
Cannot be turned on 🤷
Yeah, it has some weird power error. ;(
"The system board Pfault fail-safe voltage is outside of range."
Seems like this box is just cursed. ;(
@dkirwan can you look at this and get onsite/Dell folks to work on it?
Spoke with Dell, and with James Gibson in RH. Opened tickets and got the following troubleshooting steps to carry out next:
James Gibson has configured the server with minimal config. Server is showing orange light, gathering logs and reopening case with Dell to troubleshoot further.
Got caught here with tickets timing out and closing, Flock, and PTO, so I'm having to open a new ticket with Dell support 😿
Some new support steps to follow. Reaching out to James Gibson to try to arrange for someone to handle it.
James will get to visit later this week, currently in RDU3.
James has carried out the steps today. The orange light is still on the chassis; logs were captured and uploaded to Dell.
Dell has responded with an appointment date.
We have successfully placed the order for the parts replacement. The appointment is scheduled for 9/9/2024.
Dell engineer has replaced the hardware and the system now appears to be up and running although with different networking config. Will look into getting external network access restored asap to this machine now.
We should just need to update the ansible vars and re-run the playbook for the main interfaces... the mgmt is static, so shouldn't matter.
The network device with MAC f4:02:70:d0:05:00 is still present - that's the one we use for networking, so I think everything is OK right now (Copr allocated VMs, and the system is utilized). The changed devices are not used, I think: https://pagure.io/fedora-infra/ansible/c/79ee807af52a0ed12ef7d6588d39f3198d917b9f?branch=main
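For anyone wanting to repeat the "is that MAC still there" check after a hardware swap, here is a minimal sketch. It assumes a Linux host exposing interfaces under `/sys/class/net`; the function name `find_ifaces_by_mac` is made up for illustration and is not part of any playbook in the ansible repo.

```python
from pathlib import Path


def find_ifaces_by_mac(mac, sysfs_net="/sys/class/net"):
    """Return a sorted list of interface names whose MAC equals `mac`.

    `sysfs_net` is a parameter only so the function can be pointed at a
    test directory; on a real host the default is the usual sysfs path.
    """
    wanted = mac.strip().lower()
    root = Path(sysfs_net)
    if not root.is_dir():
        return []  # not Linux, or sysfs not mounted
    matches = []
    for entry in sorted(root.iterdir()):
        try:
            current = (entry / "address").read_text().strip().lower()
        except OSError:
            continue  # e.g. plain files such as bonding_masters
        if current == wanted:
            matches.append(entry.name)
    return matches


if __name__ == "__main__":
    # MAC taken from the comment above.
    names = find_ifaces_by_mac("f4:02:70:d0:05:00")
    print(names if names else "no interface with that MAC")
```

It returns a list rather than a single name because on a VM host a bridge commonly reports the same MAC as the physical NIC enslaved to it, so more than one match is not necessarily an error.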
Thank you for the help here! ❤️
Metadata Update from @praiskup:
Hm, for the record, I spent a while in the ssh command line and the system alerts:
That doesn't look good. I guess it didn't crash though?
Perhaps we need a bios/firmware update?
I'll update the firmware on there and see if that improves anything so.
Metadata Update from @dkirwan:
This server is already running the latest BIOS version apparently.
Need to get the server updated to RHEL 8.10 in order to install the Dell iDRAC Service Module (iSM) utility, so we can gather host bundle logs from iDRAC for Dell tech support.
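A quick pre-check along these lines could confirm whether a host is already at that release before scheduling the upgrade. This is only a sketch: the 8.10 threshold is the one mentioned in this ticket, not something taken from Dell's support matrix, and the helper names are hypothetical.

```python
from pathlib import Path


def parse_os_release(text):
    """Parse os-release style KEY=VALUE lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def version_tuple(version_id):
    """Turn '8.10' into (8, 10); assumes a purely numeric VERSION_ID."""
    return tuple(int(part) for part in version_id.split("."))


def meets_minimum(os_release_text, minimum=(8, 10)):
    """True if the text describes RHEL at or above `minimum`."""
    fields = parse_os_release(os_release_text)
    if fields.get("ID") != "rhel":
        return False
    return version_tuple(fields.get("VERSION_ID", "0")) >= minimum


if __name__ == "__main__":
    path = Path("/etc/os-release")
    if path.is_file():
        print(meets_minimum(path.read_text()))
    else:
        print("no /etc/os-release on this system")
```

Comparing tuples instead of strings matters here, since as strings "8.10" would sort before "8.9".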
Do it when you have time for it; copr will re-start the builds that were taken by this box. If you want to be super gentle on Copr users, let us know ~3 hours in advance, we'll deallocate the machine.
Hi @praiskup, can I give you 3 hours' notice now? I'll take care of this upgrade later this afternoon!
System upgraded to RHEL 8.10, and the Dell iDRAC Service Module (iSM) service is installed.
So we are waiting on Dell here?
No, this is currently with me. I need to figure out the connection between iDRAC and the iSM service module so that it can gather logs. Once it is capable of gathering the host logs, I'll be able to approach Dell tech support and re-engage on further troubleshooting.
ok, cool. Thanks for the update.
Any news here?
So the iSM service is installed and running:
It can see the OS to iDRAC Pass-through ethernet device being enabled and disabled, but it's not actually creating an ethernet device on the system. Might require a module to be enabled... need to do some more research.
Hi @praiskup, would it be possible to upgrade this system to RHEL 9.0 without affecting the Copr workloads? Wondering if this, along with the latest Dell iDRAC iSM on RHEL 9.0, might unblock me here.
Any news here? I think we can upgrade it... it's a bit tricky in the env it's in now, but it might be easier once it's moved over to the new datacenter later this year...
Sorry I missed the replies:
Shouldn't be a problem.
No, the system seems to work.
+1
Just to keep the ticket updated....
This box has been migrated to RDU3 (so now it's called vmhost-x86-copr01.rdu3.fedoraproject.org). We had it online briefly while testing, but the 10G NIC appears to have failed, so we need to get that sorted before we can bring this online.
Once we do, it'll have RHEL 10, if that helps with the iSM stuff...
The machine is up in RDU3 now with a fresh RHEL 10 install and a new network card... and seems OK so far.