vmhost-x86-copr01.rdu-cc.fedoraproject.org DOWN #11950

Closed
opened 2024-05-27 07:08:55 +00:00 by praiskup · 50 comments
Member

IDRAC claims:

The system board Pfault fail-safe voltage is outside of range. 	Sun 26 May 2024 23:48:49
The OCP PG voltage is outside of range. 	Sun 26 May 2024 23:48:42

The machine can not be turned on.

Owner

Yeah, this happened last night... I wasn't able to file a ticket on it then, so many thanks for doing so.

We will need to engage Dell tomorrow and see what can be done... ;(

Member

Metadata Update from @phsmoura:

  • Issue priority set to: Waiting on Assignee (was: Needs Review)
  • Issue tagged with: medium-gain, medium-trouble, ops
Owner

I think @dkirwan and @jnsamyak and @patrikp are going to work on this one and try and get dell on the line to fix things. ;)

Member

Working with Dell tech support to resolve.

Latest update:

  • updating idrac firmware
  • failing to update server firmware
  • Requesting RH tech in datacenter reseat an "OCP card" on the server.
Member

The RH tech is unavailable until the week after June 10th to reseat the OCP card on the server, so we are blocked on the Dell tech support steps until that work is carried out.

Member

James Gibson responded to me this afternoon; he will be in the RDU2-CC datacenter and can take care of reseating this OCP card.

Member

James messaged that he discovered an issue with a PSU. Updated Dell with information.

It's a bad PSU, removed one, reboots itself
Remove the other, boots just fine
Owner

So what's the current status here? I see the machine is up; do we need to replace the bad PSU, or?

Member

Yup, just going through the tech support process once more, and will try to get them to replace the damaged PSU.

Member

Metadata Update from @dkirwan:

  • Issue assigned to dkirwan
Member

Updated internal ticket for James Gibson, to carry out the next troubleshooting steps requested by Dell.

Member

James swapped the PSUs on Friday, and the server also boots up correctly with the other PSU, so this is not a PSU issue on its own. Updated Dell tech support with the information; should get another update from them today.

Member

Checking to see if there is an issue external to the server, perhaps with the UPS.

Asking James Gibson to check:

  • Are the PSUs set to redundant?
  • When both are plugged in at the same time, are they plugged into the same outlet/UPS?
  • If so, can we test by plugging them into different outlets/UPSes?
Member

OK, looks like we're fully back in action: the 2nd PSU is connected, and the two PSUs are now powered from different power sockets. Seems like it may have been a faulty power outlet?

Member

Metadata Update from @dkirwan:

  • Issue close_status updated to: Fixed
  • Issue status updated to: Closed (was: Open)
Author
Member

Metadata Update from @praiskup:

  • Issue status updated to: Open (was: Closed)
Author
Member

We are offline again.

Author
Member

Can not be turned on 🤷

Owner

Yeah, it has some weird power error. ;(

"The system board Pfault fail-safe voltage is outside of range."

Seems like this box is just cursed. ;(

Owner

@dkirwan can you look at this and get onsite/dell folks to work on it?

Member

Spoke with Dell, and James Gibson in RH. Opened tickets and got the following troubleshooting steps to carry out next:

1. Power the server down.
2. Disconnect the server from all power cables and network cables.
3. Hold down the power button continuously for at least 10 seconds.
4. Insert the power cables and network cables back into the system.
5. Wait about 2 minutes before powering on the server for iDRAC to be refreshed.
6. Power the system on.

If the issue persists after the flea power drain, please perform the following:

Reseat the OCP card and perform another flea drain.

Once that is performed please let us know the results. If the server still doesn't turn on, we'll have to perform a minimum to POST.

The components mentioned below are the minimum configuration to POST:

● One processor (CPU) in socket processor 1
● One memory module (DIMM) in socket A1
● One power supply unit
● System board + LOM card + RIO 

Everything else must be disconnected / removed (please take pictures to confirm the configuration).
Member

James Gibson has configured the server with minimal config. Server is showing orange light, gathering logs and reopening case with Dell to troubleshoot further.

Member

Got caught here with tickets timing out and closing, Flock, and PTO, so I'm having to open a new ticket with Dell support 😿

Some new support steps to follow. Reaching out to James Gibson to try to arrange someone to handle it.

1. Check the electrical environment for any external voltage issues.
2. Update the Firmware
   - iDRAC with Lifecycle Controller to v7.10.50.10
   - BIOS to v2.16.2 (System restart is required)
3. Swap the Network Card from Slot 1 to Slot 2
   - Collect a TSR at this point to identify if the errors are following the Card or the Slot.
   - If the error persists, please connect one card at a time and check if the error persists.
   - Keep a note of which card is throwing the errors when connected.
   - Network Cards are located in the Riser 2.
4. Perform a flea power drain
   - Power the server down.
   - Disconnect server from all power cables, Network cables.
   - Hold down the power button continuously for at least 10 seconds.
   - Insert power cables and network cables back to the system.
   - Wait about 2 minutes before powering on the server for iDRAC to be refreshed.
   - Power the system on.
Member

James will get to visit later this week, currently in RDU3.

Member

James has carried out the steps today; the orange light is still on on the chassis. Logs captured and uploaded to Dell.

Member
  • James Gibson has managed to swap the network cards as requested.
  • Unfortunately the server is still showing a yellow light, with the same voltage warnings in the iDRAC logs.
  • Uploaded the logs to Dell once more, and now they are going to replace the OCP card, MB and the CPU.
  • Contacted James to get the information required to have a Dell engineer come out and perform this hardware swap.
Member

Dell has responded with an appointment date.

We have successfully placed the order for the parts replacement. The appointment is scheduled for 9/9/2024.

Member

The Dell engineer has replaced the hardware, and the system now appears to be up and running, although with a different networking config. Will look into getting external network access to this machine restored ASAP.

Owner

We should just need to update the ansible vars and re-run the playbook for the main interfaces... the mgmt is static, so shouldn't matter.

Author
Member

The network device with mac f4:02:70:d0:05:00 is still present - that's the one we use for networking; so I think everything is OK right now (Copr allocated VMs, and system is utilized).

The changed devices are not used I think: https://pagure.io/fedora-infra/ansible/c/79ee807af52a0ed12ef7d6588d39f3198d917b9f?branch=main
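
One way to double-check that the interface with a given MAC survived the hardware swap is to scan sysfs. The sketch below is only illustrative: the helper name `find_iface_by_mac` and its `sys_net` parameter are made up for this example, and it assumes the standard Linux layout where each interface exposes its MAC in `/sys/class/net/<iface>/address`.

```python
from pathlib import Path


def find_iface_by_mac(mac, sys_net="/sys/class/net"):
    """Return the names of interfaces whose MAC equals `mac` (case-insensitive)."""
    root = Path(sys_net)
    if not root.is_dir():
        # Not a Linux host, or sysfs is not mounted: nothing to report.
        return []
    wanted = mac.strip().lower()
    matches = []
    for iface in sorted(root.iterdir()):
        try:
            current = (iface / "address").read_text().strip().lower()
        except OSError:
            # Interface disappeared or has no readable address; skip it.
            continue
        if current == wanted:
            matches.append(iface.name)
    return matches


if __name__ == "__main__":
    # MAC quoted in the comment above.
    print(find_iface_by_mac("f4:02:70:d0:05:00"))
```

An empty list would mean no interface currently carries that MAC, which is the case the Ansible variables would need to be updated for.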

Thank you for the help here! ❤️

Author
Member

Metadata Update from @praiskup:

  • Issue close_status updated to: Fixed
  • Issue status updated to: Closed (was: Open)
Author
Member

Hm, for the record, I spent a while in the ssh command line and the system printed these alerts:

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: Corrected error, no action required.

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: CPU:0 (17:31:0) MC255_STATUS[-|CE|-|AddrV|-|-|-|-|-]: 0x940000000000009f

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: Error Addr: 0x0000001a1c188820

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: PPIN: 0x02b49efcae18c05e

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: IPID: 0x0000000000000000

Message from syslogd@vmhost-x86-copr01 at Sep 12 07:31:04 ...
 kernel:[Hardware Error]: cache level: L3/GEN, tx: RESV
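
For reading the `MC255_STATUS` value above, here is a minimal sketch that decodes the generic top bits of an x86 machine-check status register. It only covers the common flags in bits 63 down to 57 (VAL, OVER, UC, EN, MISCV, ADDRV, PCC); the function name is made up for this example, and the vendor-specific fields that the kernel also prints are deliberately left out.

```python
# Generic x86 machine-check status flags (bit position -> name).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error (clear means corrected)
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Map each generic flag name to True/False for a 64-bit status value."""
    flags = {
        name: bool((status >> bit) & 1)
        for bit, name in MCI_STATUS_FLAGS.items()
    }
    # Convenience field: a valid record without UC set is a corrected error.
    flags["corrected"] = flags["VAL"] and not flags["UC"]
    return flags


if __name__ == "__main__":
    # Status value taken from the kernel message above.
    decoded = decode_mci_status(0x940000000000009F)
    print(sorted(name for name, is_set in decoded.items() if is_set))
```

For the value in this log, UC is clear and ADDRV is set, which agrees with the `CE` and `AddrV` markers in the kernel line and with "Corrected error, no action required."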
Owner

That doesn't look good. I guess it didn't crash though?

Perhaps we need a bios/firmware update?

Member

I'll update the firmware on there and see if that improves anything so.

Member

Metadata Update from @dkirwan:

  • Issue status updated to: Open (was: Closed)
Member

This server is already running the latest bios version apparently.

Member

Need to get the server updated to RHEL 8.10 in order to install the Dell iDRAC Service Module (iSM) utility, so we can gather host bundle logs from iDRAC for Dell tech support.

Author
Member

Do it when you have time for it; copr will re-start the builds that were taken by this box. If you want to be super gentle on Copr users, let us know ~3 hours in advance, we'll deallocate the machine.

Member

Hi @praiskup, can I give you 3 hours' notice now? I'll take care of this upgrade later this afternoon!

Member

System upgraded to RHEL 8.10, and the Dell Service Module iSM service is installed.

Owner

So we are waiting on dell here?

Member

No, this is currently with me. I need to figure out the connection between iDRAC and the iSM service module so that it can gather logs. Once it is able to gather the host logs, I'll be able to go back to Dell tech support and re-engage on further troubleshooting.

Owner

ok, cool. Thanks for the update.

Owner

Any news here?

Member

So the ISM service is installed and running:

systemctl status dcismeng.service:
EventCategory="Audit" EventSeverity="info" IsPastEvent="false" language="en-US"] The iDRAC Service Module is started on the operating system (OS) of server.     
8194" EventCategory="Audit" EventSeverity="warn" IsPastEvent="false" language="en-US"] The iDRAC Service Module is unable to discover iDRAC from the operating system of the server.
8194" EventCategory="Audit" EventSeverity="warn" IsPastEvent="false" language="en-US"] The iDRAC Service Module is running with Limited Functionality Mode hence some features are unavailable. Possible reasons are: 1) OS-to-BMC Passthrough setting in iDRAC is disabled 2) USBNIC interface on the host OS does not have a configured IP address.
 The iDRAC Service Module detected an OS to iDRAC Pass-through in the disabled mode. Enable the OS to iDRAC Pass-through (USB NIC) or reinstall iSM.

lsusb:
Bus 001 Device 009: ID 413c:a102 Dell Computer Corp. iDRAC Virtual NIC                                                 
                                                       
dmesg: 
[5088820.001846] usb 1-1.3: USB disconnect, device number 9
[5088826.065569] usb 1-1.3: new high-speed USB device number 10 using xhci_hcd
[5088826.160723] usb 1-1.3: New USB device found, idVendor=413c, idProduct=a102, bcdDevice= 3.16
[5088826.160732] usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[5088826.160735] usb 1-1.3: Product: iDRAC Virtual NIC USB Device
[5088826.160738] usb 1-1.3: Manufacturer: Dell(TM)
[5088826.160740] usb 1-1.3: SerialNumber: 5678

It can see the OS to iDRAC Pass-through ethernet device being enabled and disabled, but it's not actually creating an ethernet device on the system. Might require a module to be enabled... need to do some more research.
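
To make the USB side of this check repeatable, here is a small sketch that parses `lsusb`-style output and reports whether the iDRAC virtual NIC is enumerated. The USB ID `413c:a102` is taken from the output above; the helper name `find_usb_devices` is made up for this example. It only confirms that the USB device is visible, not that the kernel created a network interface for it.

```python
import re

# USB vendor:product ID of the iDRAC virtual NIC, as shown by lsusb above.
IDRAC_USB_ID = "413c:a102"

# Matches lines such as "Bus 001 Device 009: ID 413c:a102 Some description".
LSUSB_LINE = re.compile(
    r"^Bus (\d+) Device (\d+): ID ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})\s*(.*)$"
)


def find_usb_devices(lsusb_output, usb_id=IDRAC_USB_ID):
    """Return a list of dicts for every lsusb line matching `usb_id`."""
    found = []
    for line in lsusb_output.splitlines():
        match = LSUSB_LINE.match(line.strip())
        if match and match.group(3).lower() == usb_id.lower():
            found.append(
                {
                    "bus": match.group(1),
                    "device": match.group(2),
                    "description": match.group(4).strip(),
                }
            )
    return found


if __name__ == "__main__":
    sample = "Bus 001 Device 009: ID 413c:a102 Dell Computer Corp. iDRAC Virtual NIC"
    print(find_usb_devices(sample))
```

On the host itself, the text could come from `subprocess.run(["lsusb"], capture_output=True, text=True).stdout`; a non-empty result combined with no matching network interface would match the symptom described here.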

Member

Hi @praiskup, would it be possible to upgrade this system to RHEL 9.0 without affecting the copr workloads? Wondering if that, together with the latest Dell iDRAC iSM on RHEL 9.0, might unblock me here...

Owner

Any news here? I think we can upgrade it... it's a bit tricky in the env it's in now, but it might be easier once it's moved over to the new datacenter later this year...

Author
Member

Sorry I missed the replies:

> Hi @praiskup would it be possible to upgrade this system to RHEL 9.0 without affecting the copr workloads

Shouldn't be a problem.

> Any news here?

No, the system seems to work.

> it's a bit tricky in the env it's in now, but it might be easier once it's moved over to the new datacenter later this year...

+1

Member

Just to keep the ticket updated....

This box has been migrated to RDU3 (so now it's called vmhost-x86-copr01.rdu3.fedoraproject.org). We had it online briefly while testing, but the 10G NIC appears to have failed, so we need to get that sorted before we can bring this online.

Once we do, it'll have RHEL 10, if that helps with ISM stuff...

Owner

The machine is up in rdu3 now with a fresh rhel10 install and new network card... and seems ok so far.

kevin closed this issue 2026-01-30 22:08:55 +00:00