CentOS 5.5 - kernel panic - Help!

Mon Jun 7 10:49:57 EDT 2010

On Mon, Jun 7, 2010 at 10:14 AM, Tim Callaghan <tmcallaghan-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:
> Jarod/All,
>
> Thanks for the advice, I'll try and get something hooked up so I can
> see the entirety of the kernel panic.
>
> In the meantime, what are the potential implications of running with
> the NMI Watchdog turned off?

Data corruption. You could get a legit NMI that doesn't get handled,
and some hardware is left in a funky state, but the machine keeps
going and your data gets corrupted. Or the machine might simply hang.
But it's possible things will work just fine, fsvo fine. If I'm
thinking clearly, upstream kernels don't enable the NMI watchdog by
default, as it does occasionally trip on a false positive. However,
given your report that the same hardware was locking up w/Ubuntu, I'm
guessing these aren't false positives. That's assuming it's not
enabled by default on Ubuntu, though. I'm not sure whether it is or
not; I just know that Red Hat explicitly decided to enable it by
default for its enterprise distributions.

For more fun reading:

http://www.mjmwired.net/kernel/Documentation/nmi_watchdog.txt
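
If you want to see how the watchdog is currently set on a given box,
something like the following rough sketch might help. It assumes
Python 3 (not the stock Python on CentOS 5), and the function names
are made up for illustration. It looks for an nmi_watchdog= token on
the kernel command line and pulls the per-CPU NMI counts out of
/proc/interrupts. Treating the last nmi_watchdog= occurrence as the
effective one is an assumption on my part.

```python
def nmi_watchdog_setting(cmdline):
    """Return the nmi_watchdog= value from a kernel command line string,
    or None if the option is absent (i.e. the distro default applies)."""
    value = None
    for token in cmdline.split():
        if token.startswith("nmi_watchdog="):
            # Assumption: if the option is repeated, the last one is effective.
            value = token.split("=", 1)[1]
    return value


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # newer kernels append a text description
                counts.append(int(field))
            return counts
    return []


def report():
    """Print the current setting and NMI counts (hypothetical helper)."""
    with open("/proc/cmdline") as f:
        print("nmi_watchdog on command line:", nmi_watchdog_setting(f.read()))
    with open("/proc/interrupts") as f:
        print("NMI counts per CPU:", nmi_counts(f.read()))
```

Calling report() a few seconds apart shows whether the NMI counts are
climbing, which is a hint that the watchdog is actually ticking.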

> On Fri, Jun 4, 2010 at 11:25 AM, Jarod Wilson <jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org> wrote:
>> On Fri, Jun 4, 2010 at 9:59 AM, Tim Callaghan <tmcallaghan-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:
>>> I'm trying to track down the source of a kernel panic that I see once
>>> or twice a week on one of my CentOS machines, specifics:
>>>  CentOS 5.5
>>>  ASUS p6x58d Motherboard
>>>  Intel i7-920
>>>  3 x Corsair X3 2GB
>>>  WD 1TB 6Gb SATA - OS drive
>>>  Intel X-25M SSD - data drive for DB benchmarking
>>>
>>> After running for a few days, the machine fails to respond to ping.
>>> When I look at the console I see "kernel panic - not syncing - nmi
>>> watchdog".  Nothing is logged to /var/log/messages; when I open the
>>> file after the lockup I see the usual log information plus all the
>>> new restart info.
>>>
>>> I used this machine prior with Ubuntu 9.10 and 10.04 and would
>>> occasionally experience lock-ups as well, I just never tried to track
>>> them down when running Ubuntu.  I need it to be stable now.
>>>
>>> I'm considering booting with "nmi_watchdog=0" but concerned that I'll
>>> just be masking a real issue.
>>>
>>> Any ideas?
>>
>> I would hook up a serial console on the machine to see if you can
>> capture more info about the panic. Another option would be to set up
>> kdump, so you (hopefully) get a vmcore file dumped when the machine
>> panics. If you can capture a full backtrace and/or a vmcore, I'd file
>> them in a bug at bugzilla.redhat.com, against Red Hat Enterprise Linux
>> 5. Red Hat does look at bugs reported by CentOS users, though it
>> likely wouldn't get as high a priority as a bug from a paying
>> customer, but you get what you pay for. :)
>>
>> --
>> Jarod Wilson
>> jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org
>>
>

-- 
Jarod Wilson
jarod-ajLrJawYSntWk0Htik3J/w at public.gmane.org