Xen1 crash #477

Closed
opened 2008-03-29 18:50:24 +00:00 by toshio · 3 comments

== Summary ==

  • xen1 crashed at 2008-03-29 08:49 UTC.
  • The buildsystem was down until ~2008-03-29 11:50 UTC.
  • Other services were load balanced or not serving end users, so no other outage was visible.
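
For reference, the timeline arithmetic can be sketched as below. The crash and recovery times come from the summary, and the 9:50 UTC detection time comes from the log of actions in this ticket; the variable names are illustrative only, and the recovery time is approximate.

```python
from datetime import datetime, timedelta

# Times taken from this ticket (all UTC).
crash = datetime(2008, 3, 29, 8, 49)
noticed = datetime(2008, 3, 29, 9, 50)
restored = datetime(2008, 3, 29, 11, 50)  # approximate ("~" in the summary)

time_to_detect = noticed - crash
outage = restored - crash

print("time to detect:", time_to_detect)
print("approximate buildsystem outage:", outage)
```

Roughly a third of the outage passed before anyone noticed koji was down.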

== Log of Actions ==

  • Found out koji was down at 09:50 UTC
  • Found that koji1/publictest8 wasn't running anywhere and neither was puppet1
  • Started koji1 and puppet1 on xen2
    • puppet1 came up
    • koji1's root filesystem mounted itself read-only
  • Rebooted koji1
    • Manual fsck was needed and required a root password
  • Found that xen1 didn't have any guests running and had a configuration for both koji1 and puppet1
  • Logs showed koji1, puppet1, and others had been running on xen1 and that xen1 had crashed at 08:49 UTC
  • Shut down koji1 and puppet1 on xen2
  • Brought up koji1, puppet1, app4, proxy4, fas2, smtp, publictest3, and noc1 on xen1
    • koji1 still required root for fsck
    • publictest3 did not restart because Xen reported that there was not enough memory
  • Called Dennis, who had the root password and was able to fsck koji1
  • Dennis ran xm memset so that publictest3 could be brought back up as well
  • Outage over
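
The "which guests are not running" check from the steps above could be scripted roughly as follows. This is only a sketch: the guest list is taken from this ticket, but the sample `xm list`-style text is hypothetical (made-up IDs, memory sizes, and times), and the function names are invented for illustration. It parses text rather than calling `xm`, so it runs anywhere.

```python
# Guests that this ticket says were brought back up on xen1.
EXPECTED_GUESTS = ["koji1", "puppet1", "app4", "proxy4",
                   "fas2", "smtp", "publictest3", "noc1"]


def running_guests(xm_list_output):
    """Return the domain names found in `xm list`-style text.

    Skips the header row and the Domain-0 entry.
    """
    names = set()
    for line in xm_list_output.strip().splitlines()[1:]:
        fields = line.split()
        if fields and fields[0] != "Domain-0":
            names.add(fields[0])
    return names


def missing_guests(expected, xm_list_output):
    """Return the expected guests that do not appear in the listing."""
    running = running_guests(xm_list_output)
    return [guest for guest in expected if guest not in running]


# Hypothetical listing, not captured from xen1.
SAMPLE = """\
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     1024     2 r-----    120.5
puppet1                                    1     1024     1 -b----     35.2
app4                                       2     2048     2 -b----     40.1
"""

print(missing_guests(EXPECTED_GUESTS, SAMPLE))
```

Running a check like this against every xen host would have shown sooner that koji1 and puppet1 were not running anywhere.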

== Possible Causes ==
/var/log/messages and /var/log/dmesg on xen1 have been saved to ~root/outage-2008-03-29.messages and ~root/outage-2008-03-29.dmesg in case someone can pull useful information from them later.

/var/log/messages around the crash::

{{{
Mar 29 08:45:35 xen1 puppetd[19997]: Finished catalog run in 21.25 seconds
Mar 29 08:46:57 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 08:49:38 xen1 syslogd 1.4.1: restart.
Mar 29 08:49:38 xen1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
}}}

The iscsi timeout could be a red herring, since we see these timeouts throughout the logs, but on reboot we do have multiple failures to connect to iscsi before it finally succeeds::
{{{
[Others like this]
Mar 29 08:50:47 xen1 iscsid: Nop-out timedout after 15 seconds on connection 1:0 state (3). Dropping session.
Mar 29 08:50:51 xen1 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Mar 29 08:50:53 xen1 iscsid: connect failed (113)
Mar 29 08:50:53 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 08:50:56 xen1 iscsid: connect failed (113)
Mar 29 08:50:58 xen1 iscsid: connect failed (113)
[...]
Mar 29 08:51:02 xen1 iscsid: connection2:0 is operational after recovery (3 attempts)
Mar 29 08:51:02 xen1 iscsid: connection1:0 is operational after recovery (4 attempts)
}}}
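
To judge whether the iscsi timeouts are a red herring, it may help to count them across the saved log rather than eyeballing excerpts. The sketch below tallies the three kinds of iscsid message seen above; the sample text is the excerpt from this ticket with the elided lines left out, so the counts cover only the visible lines. The pattern names and the function name are made up for illustration.

```python
import re
from collections import Counter

# Visible lines from the excerpt above; elided lines are not reconstructed.
LOG = """\
Mar 29 08:50:47 xen1 iscsid: Nop-out timedout after 15 seconds on connection 1:0 state (3). Dropping session.
Mar 29 08:50:51 xen1 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Mar 29 08:50:53 xen1 iscsid: connect failed (113)
Mar 29 08:50:53 xen1 iscsid: Nop-out timedout after 15 seconds on connection 2:0 state (3). Dropping session.
Mar 29 08:50:56 xen1 iscsid: connect failed (113)
Mar 29 08:50:58 xen1 iscsid: connect failed (113)
Mar 29 08:51:02 xen1 iscsid: connection2:0 is operational after recovery (3 attempts)
Mar 29 08:51:02 xen1 iscsid: connection1:0 is operational after recovery (4 attempts)
"""

# Illustrative labels for the message types that appear in the excerpt.
PATTERNS = {
    "nop_out_timeout": re.compile(r"iscsid: Nop-out timedout"),
    "connect_failed": re.compile(r"iscsid: connect failed"),
    "recovered": re.compile(r"iscsid: connection\d+:\d+ is operational after recovery"),
}


def count_iscsi_events(text):
    """Count how many lines match each iscsid message pattern."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


counts = count_iscsi_events(LOG)
print(dict(counts))
```

Pointed at the full saved messages file and grouped by hour or day, the same tally would show whether the timeouts cluster around the crash or occur at a steady background rate.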

This is likely related to: https://bugzilla.redhat.com/show_bug.cgi?id=429469

Side note: we've got two NICs listening as the iscsi target. We can start looking into adjusting the timeout settings in /etc/iscsid as well as getting multipath set up properly.

Closing this; xen1 seems to have calmed down quite a bit with the newer 5.2 kernel.
