  1. #1
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Join Date
    Jun 2002
    Location
    Ohio
    Posts
    4,827
    Mentioned
    142 Post(s)
    Tagged
    0 Thread(s)

    Linux Server is randomly halting

    I've got a weird issue with one of my test servers. It has happened twice now. Originally, I thought it was related to my UPS battery going bad (and it probably was then), but last night, while completely unplugged from the UPS (as I haven't ordered a new battery yet), my test server went down around 1:00 AM.

    So far, I have no idea why. I've checked the syslog and the kern.log (along with the apache logs), but nothing indicates a direct failure that would cause a system halt.

    I've just installed munin and munin-node to create system graphs (described here). What else should I be checking? The CPU and MB temperatures look good on reboot, and I've never seen them get high even under major load.
    Be sure to congratulate xMog on earning April's Member of the Month
    Go ahead and blame me, I still won't lose any sleep over it
    My Blog | My Technical Notes

  2. #2
    SitePoint Enthusiast PromptSpace's Avatar
    Join Date
    Jun 2012
    Posts
    96
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    In my experience, it's usually overheating. I had similar issues a few years back on a tower server; replacing the thermal compound solved the issue for me.
    Host4Geeks | Hosting powered by geeks, enhanced with a scoop of
    love and awesomeness!!

    The goal as a company is to have customer service that's not
    just great, but legendary -Henry Ford

  3. #3
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Quote: Originally Posted by PromptSpace
    In my experience, it's usually overheating. I had similar issues a few years back on a tower server; replacing the thermal compound solved the issue for me.

    Alright, I'll have to give that a try. I'm hoping munin will point me in that direction, as it keeps a nice graph of the HD and MB temps, so hopefully the next time it occurs I can verify whether the heat was excessive.
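
Since munin samples on a fixed schedule, a spike in the last minute or two before a halt can be missed. As a complement, here is a minimal Python sketch of a temperature logger that forces every reading to disk. It assumes the kernel exposes sensors under /sys/class/thermal (many boards only expose them through hwmon or lm-sensors, in which case the path will differ); the file name temps.log, the 30-second interval, and the function names are all made up for illustration.

```python
import glob
import os
import time


def read_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "thermal_zone*", "temp"))):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as fh:
                # The kernel reports millidegrees Celsius in this file.
                temps[zone] = int(fh.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # zone exists but cannot be read; skip it
    return temps


def format_line(temps, now=None):
    """Build one log line: local timestamp followed by zone=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = " ".join(f"{zone}={value:.1f}" for zone, value in sorted(temps.items()))
    return f"{stamp} {readings}"


def log_temps(logfile="temps.log", interval=30, samples=None,
              base="/sys/class/thermal"):
    """Append readings to logfile; samples=None means run until stopped."""
    taken = 0
    with open(logfile, "a") as fh:
        while samples is None or taken < samples:
            fh.write(format_line(read_temps(base)) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push to disk so a hard halt keeps the last line
            taken += 1
            time.sleep(interval)
```

Calling log_temps() from a screen session would leave the last reading before a halt in temps.log, which can then be compared with the munin graphs.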

  4. #4
    Community Advisor silver trophy

    Join Date
    Nov 2006
    Location
    UK
    Posts
    2,521
    Mentioned
    37 Post(s)
    Tagged
    1 Thread(s)
    If it turns out not to be heat related, it might be worth doing a memory integrity test as well.

  5. #5
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Quote: Originally Posted by EastCoast
    If it turns out not to be heat related, it might be worth doing a memory integrity test as well.

    In case the heat is degrading the memory? Or...

  6. #6
    Foozle Reducer ServerStorm's Avatar
    Join Date
    Feb 2005
    Location
    Burlington, Canada
    Posts
    2,699
    Mentioned
    89 Post(s)
    Tagged
    2 Thread(s)
    Quote: Originally Posted by cpradio
    In case the heat is degrading the memory? Or...

    Heat isn't likely to be degrading the memory; if it is memory, then it is a bad stick of RAM. Did you add extra hard drives or any other power-hungry equipment to this machine?
    ictus==""

  7. #7
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Quote: Originally Posted by ServerStorm
    Heat isn't likely to be degrading the memory; if it is memory, then it is a bad stick of RAM. Did you add extra hard drives or any other power-hungry equipment to this machine?

    Nope, quite the opposite, actually. I removed several, as I moved the data to a NAS that is mounted via smbfs.

  8. #8
    SitePoint Author silver trophybronze trophy
    wwb_99's Avatar
    Join Date
    May 2003
    Location
    Washington, DC
    Posts
    10,576
    Mentioned
    4 Post(s)
    Tagged
    0 Thread(s)
    smbfs could be your problem actually -- smb implementations are uneven, especially when talking about cheapo NAS boxes. I'd prefer iSCSI or at least NFS.

  9. #9
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Quote: Originally Posted by wwb_99
    smbfs could be your problem actually -- smb implementations are uneven, especially when talking about cheapo NAS boxes. I'd prefer iSCSI or at least NFS.

    Unfortunately, the NAS I am using doesn't properly support NFS without a LOT of tinkering (bootstrapping it, installing it, etc.). I'm not properly motivated to go through that process just yet, and I have other Linux machines with the same SMBFS mount that are not having a reboot issue.

  10. #10
    Foozle Reducer ServerStorm's Avatar
    Hi cpradio,

    If it is not a power issue, then one other thing to check is the capacitors on the motherboard. If their tops are bulging rather than flat, it could also be a capacitor issue. I've seen this on a number of older servers/workstations.

  11. #11
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Quote: Originally Posted by ServerStorm
    Hi cpradio,

    If it is not a power issue, then one other thing to check is the capacitors on the motherboard. If their tops are bulging rather than flat, it could also be a capacitor issue. I've seen this on a number of older servers/workstations.

    Something I'll keep in mind. The MB is probably 7 years old. It does get a lot of load every day between specific hours (load tests and major processing are done to produce static data -- enough that all 4 cores sit at 90-98% for several hours). I rarely reboot this machine; it usually runs for months until a patch requires a reboot or the power goes out (since my UPS is still without a battery), so it is very obvious when it goes down.

    It has only gone down twice in the past 6-8 months on its own. I've restarted it maybe once during that time frame.

  12. #12
    Foozle Reducer ServerStorm's Avatar
    Quote: Originally Posted by cpradio
    Something I'll keep in mind. The MB is probably 7 years old. It does get a lot of load every day between specific hours (load tests and major processing are done to produce static data -- enough that all 4 cores sit at 90-98% for several hours). I rarely reboot this machine; it usually runs for months until a patch requires a reboot or the power goes out (since my UPS is still without a battery), so it is very obvious when it goes down.

    It has only gone down twice in the past 6-8 months on its own. I've restarted it maybe once during that time frame.

    A board capacitor issue would show up far more frequently than twice in several months; it will crash a machine (blue-screen or server fault) within half a day of work, or even less. This doesn't sound like the issue. If it were, you would need to replace the motherboard.

    Have you looked at any of the persistent services on the machine to see if they are running routines at intervals that coincide with the crashes? Anything in the logs?

    It still smells like a hardware problem, so have you run any hardware diagnostics yet to see if anything comes up?

  13. #13
    Hosting Team Leader silver trophybronze trophy
    cpradio's Avatar
    Quote: Originally Posted by ServerStorm
    A board capacitor issue would show up far more frequently than twice in several months; it will crash a machine (blue-screen or server fault) within half a day of work, or even less. This doesn't sound like the issue. If it were, you would need to replace the motherboard.

    Have you looked at any of the persistent services on the machine to see if they are running routines at intervals that coincide with the crashes? Anything in the logs?

    It still smells like a hardware problem, so have you run any hardware diagnostics yet to see if anything comes up?

    Nothing in the logs is indicative of the problem yet (including the apache logs, the logs of the services that generate a lot of load, etc.). I gave up looking at the logs because none of them showed any problems around the time I suspect the server halted (other than that being their last entry before the halt, with the next entry being the startup).
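
Even when the logs show no error, the silence itself can pin down when the halt happened: the last entry before the gap and the first entry after it bracket the outage. Below is a rough Python sketch of that idea. It assumes the classic syslog timestamp prefix ("Apr 12 00:58:40 ..."), which carries no year, so the year has to be supplied; the 10-minute threshold and the function names are arbitrary choices for illustration, and a quiet machine may show harmless gaps of that size.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse a classic syslog prefix such as 'Apr 12 00:58:40'; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(minutes=10)):
    """Return (last_entry_before, first_entry_after) pairs where the log went silent."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # continuation line or unknown format
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


def gaps_in_file(path, year, threshold=timedelta(minutes=10)):
    """Convenience wrapper: scan a log file on disk for silent periods."""
    with open(path, errors="replace") as fh:
        return find_gaps(fh, year, threshold)
```

Running gaps_in_file("/var/log/syslog", year) and comparing the start of each gap against cron schedules, the daily load window, and the temperature graphs is one way to look for a pattern across the two halts.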

    Hardware seems to be good. There is only 1 primary hard drive that the operating system boots from (it was replaced earlier this year, in February I think, because the prior one was failing). The server then mounts a couple of USB drives for backup purposes, and an SMBFS network share from a NAS (for both backup purposes and for reading a few specific files).

    Memory looks to be good; I usually have screen with at least one session running htop so I can quickly check vitals and memory usage. Swap usage is always low, the network is always available, it doesn't run a GUI so there is no video to deal with, CPU and MB temps are always decent when I look (hoping munin will show otherwise the next time it occurs, though!), and there are no funny noises (so nothing sounds like it is failing or being overworked).

    I'm not convinced SMBFS is the culprit, because that was a recent conversion, so the prior failure a few months ago would have had nothing to do with it.

    I do know one of the USB drives unmounts or goes unresponsive fairly regularly (I need to replace it, but just haven't had the time), so I have an automated script that runs every hour to check whether it is unresponsive and, if it is, to remount it. Several months ago I thought that script was the issue, but removing it from cron didn't resolve anything: the halt happened again within days (I believe that one was a hardware issue, and I replaced a few drives). It then didn't happen again until last week.
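
For anyone following along, an hourly check like the one described above might look roughly like this in Python. This is only a sketch and not the actual script from this thread: the function names are invented, it assumes the drive has an /etc/fstab entry so that a plain "mount <path>" works, and it assumes a lazy unmount (umount -l) is acceptable. A process stuck on a truly dead device may not be killable, so the timeout is a best effort.

```python
import os
import subprocess


def is_responsive(mountpoint, timeout=10):
    """True if the path is a mount point and a directory listing finishes in time."""
    if not os.path.ismount(mountpoint):
        return False
    try:
        subprocess.run(["ls", mountpoint], stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout, check=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return False
    return True


def remount(mountpoint, run=subprocess.run):
    """Lazy-unmount, then mount again using the fstab entry for this path."""
    run(["umount", "-l", mountpoint], check=False)
    return run(["mount", mountpoint], check=False).returncode == 0


def check_and_remount(mountpoint, log=print, run=subprocess.run):
    """Return 'ok', 'remounted' or 'failed', logging whenever action is taken."""
    if is_responsive(mountpoint):
        return "ok"
    log(f"{mountpoint} unresponsive, remounting")
    return "remounted" if remount(mountpoint, run=run) else "failed"
```

Writing a timestamped line each time a remount happens makes it possible to check later whether remounts cluster around the times the server went down.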

    I'm just at a loss, because 1) it happens on rare occasions, and 2) none of the logs indicate why it happened, whether a halt command was issued, etc.

    Granted, this could have been a power surge (small enough to kill my server and not affect any other appliances), but I have several servers located against this wall, all without a UPS right now, and only this one went down. That is not impossible, but it seems an improbable cause.

    I might never know exactly what happened. I was just looking for other indicators that everyone else might consider and that I may have overlooked, and I am getting plenty of suggestions on that. Thanks, everyone. Hat tip to @PromptSpace for the suggestion on overheating; that got me to find munin (awesome utility!).

  14. #14
    SitePoint Enthusiast PromptSpace's Avatar
    Glad to help

