Server goes catatonic after a few days

Server goes catatonic after a few days

Alex Regan
Hi,
I have a Fedora 29 system in our colo that's a few years old now and
it goes catatonic and stops responding after a few days. It's
happened a few times now, even with different kernels, so I suspect
it's a memory or hardware problem.

Is it possible to run memtest without having physical access to the
machine to insert a USB stick or CDROM?

After the machine reboots (via IPMI access), there's nothing in the
logs and no abrt-cli info on a kernel crash or other info I can find
about why it died.

What else can I do to troubleshoot this without having to drive to the
colo to check on it?

The last entry from journalctl just before it stopped responding was
just a regular nrpe entry, unrelated to the crash.

I've pasted the current dmesg output here:
http://pasted.co/4b700ee1

Any ideas greatly appreciated.
_______________________________________________
users mailing list -- [hidden email]
To unsubscribe send an email to [hidden email]
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@...

Re: Server goes catatonic after a few days

Terry Barnaby

You could try memtester, which is a user-level memory tester. Install it with "dnf install memtester" and run it with, for example, "memtester 1g".
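A minimal sketch of how that could be run over SSH on the remote machine, assuming root access (memtester needs it to lock the memory it tests). The helper names test_size_mb and run_memtester are made up for illustration, and testing 50% of MemAvailable for 3 loops is an arbitrary choice, not a recommendation from this thread. Bear in mind that a user-level tester can only exercise memory it is able to allocate, so it cannot cover memory held by the kernel or other processes.

```shell
#!/bin/sh
# Hypothetical helper: print a test size in MiB, computed as a percentage
# of the MemAvailable value found in a meminfo-style file.
#   $1 = path to the file (normally /proc/meminfo)
#   $2 = percentage of MemAvailable to test
test_size_mb() {
    awk -v pct="$2" '/^MemAvailable:/ { printf "%d\n", $2 * pct / 100 / 1024 }' "$1"
}

# Hypothetical wrapper: run memtester on half of the available memory
# for 3 loops. Must be run as root.
run_memtester() {
    size_mb=$(test_size_mb /proc/meminfo 50)
    memtester "${size_mb}M" 3
}

# Usage (as root, ideally inside tmux or screen so that a dropped SSH
# session does not kill the test):
#   run_memtester
```

memtester exits with a non-zero status if a test fails, so the wrapper's exit code can be checked afterwards.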

I also have a server that occasionally dies. It started doing this late last year under Fedora 27. I wasn't sure whether the cause was a particular kernel change or the hardware, but I replaced the motherboard, memory and CPU at that time, as it was about 5 years old. However, it has still crashed occasionally with the new hardware and a fresh install of Fedora 29.

The last /var/log/messages entries around the crash were:

Jan 8 10:42:52 king mosquitto[1435]: 1546944172: New connection from 192.168.202.30 on port 1883.
Jan 8 10:42:52 king mosquitto[1435]: 1546944172: New client connected from 192.168.202.30 as DVES_00B2F8 (c1, k10, u'DVES_USER').
Jan 8 10:43:13 king mosquitto[1435]: 1546944193: Client DVES_00B2F8 has exceeded timeout, disconnecting.
Jan 8 10:43:13 king mosquitto[1435]: 1546944193: Socket error on client DVES_00B2F8, disconnecting.
#########################################################################################################################################################################################################
Jan 8 18:03:39 king kernel: microcode: microcode updated early to revision 0xc6, date = 2018-04-17
Jan 8 18:03:39 king kernel: Linux version 4.19.9-300.fc29.x86_64 ([hidden email]) (gcc version 8.2.1 20181105 (Red Hat 8.2.1-5) (GCC)) #1 SMP Thu Dec 13 17:25:01 UTC 2018
Jan 8 18:03:39 king kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.19.9-300.fc29.x86_64 root=UUID=5d3007f8-fa92-4fe6-98a8-e812b680198f ro rd.auto LANG=en_GB.UTF-8
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
Jan 8 18:03:39 king kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'

The "#" characters were in fact NUL (0x00) bytes, which is strange.
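A quick way to check whether a log contains such bytes is to count them. This is only a sketch: the function name count_nul_bytes is made up for illustration, and /var/log/messages is simply the file mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the number of NUL (0x00) bytes in a file.
#   $1 = path to the file to inspect
count_nul_bytes() {
    # Delete every byte except NUL, then count what is left.
    # The final tr strips the padding some wc implementations add.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Usage:
#   count_nul_bytes /var/log/messages
```

A result greater than zero in a plain-text log means part of the file is not text, and it may be worth comparing where those bytes sit with the time of the crash.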


I do wonder if there is an obscure kernel bug somewhere. This server has an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz and is doing DVB recording amongst other work. My other servers seem fine, though.

Terry

On 06/01/2019 22:15, Alex wrote:
> Is it possible to run memtest without having physical access to the
> machine to insert a USB stick or CDROM?