nvidia - How to diagnose multiple Linux system failures?

2014-04-05
  • JESii

    I am having several major problems with my Linux machine: Kubuntu 13.10 (recent clean install), MSI m'board (AMD Phenom II X3 720; 8 GB RAM), NVidia GeForce GT 630 video card (using kernel module 319.60). The motherboard was replaced roughly 3 months ago -- a local 'repair' shop fried the original motherboard while 'diagnosing' what turned out to be a software problem. CPU, RAM, and video card all tested good, so they went into the new m'board.

    Problems started about 6-8 weeks ago, with only small, occasional issues, each of which I started off trying to resolve individually.

    1. Occasionally, application windows hang, and all that is displayed is a flat, grey screen with all window decorations gone. I worked around this problem by turning off the kwin desktop effects after I saw the following message flash by:

      kwin desktop effects restarted... due to graphics reset

    2. tar backups have failed to complete for the last three weeks: first a "crc verify error", then a hung system, and finally a hung gzip.

    3. Numerous dmesg messages like: "BUG: CPU#2: Soft lockup in tar". After researching this problem, I'm not sure this is a bug at all... just heavy tar/gzip CPU usage?

    4. Google Chrome randomly and frequently crashing tabs with an "Aw, Snap" message. Google Enterprise team suggested a V8 engine error, but also hinted at hardware issues.

    I'm trying to get a handle on what's going wrong and what to do to diagnose and resolve the problems. I'm guessing hardware? And if so, which component is most likely to be causing the problem and how do I isolate that? I'm going to be running a memtest86+, based on another post here.
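
    To judge whether the soft lockup messages in item 3 are tied to one CPU or one program, the dmesg output can be tallied. The following is a minimal sketch, not a definitive tool. It assumes the kernel's usual wording, "BUG: soft lockup - CPU#2 stuck for 22s! [tar:1234]", which differs slightly from the message as quoted above; the function name count_soft_lockups is made up for illustration.

```python
import re
from collections import Counter

# Assumed format of the kernel's soft lockup line, for example:
#   BUG: soft lockup - CPU#2 stuck for 22s! [tar:1234]
# Adjust the pattern if your kernel words the message differently.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for \d+s! \[(.+):(\d+)\]"
)


def count_soft_lockups(lines):
    """Count soft lockup reports per (cpu, process name) pair."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            cpu = int(match.group(1))
            process = match.group(2)
            counts[(cpu, process)] += 1
    return counts
```

    One way to use it is to save the output of dmesg to a file and pass the file's lines to count_soft_lockups. Lockups spread across several CPUs and unrelated programs would point away from any single application.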

  • Answers
  • Andrew Schulman

    Intermittent problems like these are hard to diagnose, but it does smell like a hardware problem.

    memtest86+ is a good idea. Also, are you monitoring your CPU and other mainboard temperatures? I believe that overheating can cause intermittent glitches like the ones you're seeing. When your mainboard was replaced, the heatsink may not have been reconnected well to the CPU (or chipset). A cheap thing to try would be to replace the thermal interface layer between the CPU and heatsink.
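
    Temperature monitoring can be scripted as well. The sketch below assumes the kernel exposes thermal zones under /sys/class/thermal, with each temp file holding millidegrees Celsius; not every board does, in which case a tool such as lm-sensors is the usual alternative. The function names and the 80 C limit are placeholders for illustration, not recommended values for this CPU.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone name: temperature in degrees C} for each readable zone."""
    readings = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            # Skip zones that cannot be read or do not hold a number.
            continue
        readings[temp_file.parent.name] = millidegrees / 1000.0
    return readings


def zones_above(readings, limit_c=80.0):
    """Return only the zones at or above limit_c (an arbitrary example limit)."""
    return {zone: temp for zone, temp in readings.items() if temp >= limit_c}
```

    Calling read_thermal_zones() every minute or so during a tar backup and logging the result would show whether the failures line up with temperature spikes.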

    If that doesn't work, then you're probably going to have to start swapping out hardware components to rule each one out. Start with the video card if you have a spare, but it sounds as though you may have failures in multiple systems, which suggests a bad mainboard. Sorry.

  • JESii

    Bad memory! All these symptoms were apparently caused by bad memory. Memtest86+ reported errors in 5 locations (8Gigs, 2x4). Replaced with new memory, Memtest86+ ran clean for two full passes. Now running for over 12 hours and no Chrome problems, tar backup ran just fine and verified.
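
    For anyone who wants to script the "verified" step for a gzip-compressed backup, here is a minimal sketch, roughly what gzip -t does: it decompresses the whole file and relies on the CRC check at the end of the stream. It only shows that the archive is internally consistent, not that it matches the original files. The function name gzip_is_intact is made up for illustration.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at path decompresses without errors."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end; the CRC is checked when the stream finishes.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers CRC mismatches and also unreadable or missing files.
        return False
    return True
```

    With bad memory, the same archive may pass one run and fail the next, so an inconsistent result is itself a hint to test the RAM.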

    What I learned: Multiple symptoms? Test memory.


  • Related Question

    crash - how to diagnose a hard system seizure? Dell+Ubuntu
  • rob

    I've got Ubuntu 9.10 on a Dell Vostro 420 desktop, a little over a year old, which I use for plain vanilla work stuff (email, web, terminal, text editor). Every now and then, at totally random times, it completely freezes on me. Hard. Mouse and keyboard stop working, cursor stops blinking, clock stops moving. All I can do is hold down the power button on the front of the box to shut it off.

    Sometimes it happens after several months of continuous uptime; sometimes it happens a few minutes after a reboot, while all I've done is open a terminal to look at log files, or maybe firefox to do a google search. Each time, there is nothing at all in /var/log/messages at the time of the crash. This makes it seem like a hardware problem, and indeed a few months ago I opened the box and wiggled everything and the problem went away for a while. But now it's back. I went in and checked everything, took out each RAM card and reseated. No luck. I ran all the system diagnostics (the long version) and everything passed with flying colors. Something is messed up in this box, but without any useful logs or failed tests, how in the world am I going to find it? And of course, Dell's not gonna help me cause I went and replaced Windows with Ubuntu.

    What steps would you take next to track down this problem?


  • Related Answers
  • Janne Pikkarainen

    Here's a checklist I always follow in situations similar to yours:

    • Keep an eye on the temperature. Last time I had this kind of problem, I put a temperature graph on my KDE 4.x desktop and quickly saw that the slowdowns/hangs were strictly related to temperature. After I opened up the laptop and cleaned out the dust, everything started to work.

    • Are the fans working OK? Check the fan rotation speed.

    • Is some application suddenly and very rapidly eating up all the available RAM? See the HD activity and memory usage via your favourite application - sar, Gnome system monitor, mrtg, whatever.

    • If you have desktop effects enabled, try to disable them and see if the problem is related to 3d acceleration. And if you have 3d enabled, you might try to cause the crash with some 3d torturing, for example by installing & playing tuxracer (or ppracer, whatever it's called today).

    • If the hangs are completely random, suspect the power supply/battery. My Dell Latitude D830, which I got back in late 2007, has already had one battery replaced. In my case the battery just died one night - it did not recharge at all and the laptop was blinking some strange lights, but I would not be surprised if a malfunctioning battery caused sudden lockups.

    And as mentioned, flaky HDs can cause all kinds of funny side-effects. Try smartctl -a /dev/sda (or whatever your HD is).
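
    If you want to check the smartctl output automatically, the sketch below pulls the raw values out of the attribute table and flags a few attributes commonly associated with failing disks. It assumes the usual ATA table layout (ID#, ATTRIBUTE_NAME, ..., RAW_VALUE in the tenth column); the function names and the choice of watched attributes are illustrative assumptions, and a clean result does not prove the drive is healthy.

```python
# Attributes often watched for signs of a failing disk (an illustrative
# selection, not an official list).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text):
    """Map attribute name to raw value from smartctl's attribute table."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attributes[fields[1]] = int(fields[9])
    return attributes


def suspicious_attributes(attributes, watched=WATCHED):
    """Return the watched attributes whose raw value is greater than zero."""
    return {name: attributes[name] for name in watched if attributes.get(name, 0) > 0}
```

    The text passed to parse_smart_attributes would be the saved output of the smartctl command above.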

  • James T

    Bad hard drives can cause freezes. Check your S.M.A.R.T status and post it up. Be aware that many drives become flaky and fail without any sign in the S.M.A.R.T status. Is the hard drive light on solid when it freezes? You can try running from a live CD for a while to see if you can reproduce the freeze. If it is not reproducible from a live CD, you're probably looking at a flaky hard drive. Keeping an eye on the system temps might also provide some clues. Does it crash more when the weather is warm? Since you don't see anything in the messages log, it does not sound like a software issue.

  • nik

    You can set up a persistent Ubuntu install on a USB flash drive (8-16 GB will do fine).
    Then start using that for a while and access your data from the hard drive.
    Change your BIOS boot settings to try the USB first and then the hard disk
    (and remember to avoid keeping any other USB devices plugged in. It may take a few tries to find which of your USB ports is probed first; if you keep the Ubuntu USB plugged in there, I think other USB devices will not be attempted at boot time).

    You can use micro-USB flash drives (like this Transcend T3 model) if form-factor is a problem.

    While you continue your normal work from this USB-booted Ubuntu,
    watch for the problem to recur.
    Since the harddisk is not in the path, any problems related to it will be bypassed.