ubuntu - Linux: Random kernel oops in mmap functions

2014-04-06
  • shanet

    For the last month or so I've been getting seemingly random kernel oopses. I've started to notice a pattern: from looking at the traces, the call trace always involves mmap functions.

    Whenever one of these happens, the process that it occurred under (Chromium in the trace below) hangs and trying to kill it with SIGKILL only results in the kill command hanging as well. To return stability to the system I have to completely power off the box and reboot.

    Until a recent kernel update, the computer would just randomly turn off completely. No warnings and nothing in the logs. Thankfully, that seems to have stopped.

    Question: Is this indicative of a hardware problem? Mmap failures suggest RAM problems (though I ran memtest for 12+ hours with no errors). Or is this really just a bug in the kernel? If so, what can I do about it?

    $ uname -a
    Linux [name] 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
    

    Trace from dmesg:

    [252563.113569] BUG: unable to handle kernel paging request at 0000020000000018
    [252563.113589] IP: [<ffffffff811619e0>] vma_interval_tree_insert+0x30/0x90
    [252563.113607] PGD 0 
    [252563.113612] Oops: 0000 [#1] SMP 
    [252563.113620] Modules linked in: serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common xts hidp pci_stub vboxpci(OF) vboxnetadp(OF) vboxnetflt(OF) vboxdrv(OF) vmw_vsock_vmci_transport vsock vmw_vmci parport_pc ppdev rfcomm bnep binfmt_misc usblp x86_pkg_temp_thermal kvm_intel kvm eeepc_wmi asus_wmi sparse_keymap snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel joydev snd_hda_codec btusb bluetooth cdc_acm snd_hwdep snd_pcm microcode snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer psmouse snd serio_raw mei_me mei lpc_ich soundcore mac_hid coretemp lp parport dm_crypt raid10 raid456 async_memcpy async_raid6_recov async_pq async_xor async_tx xor hid_generic raid6_pq raid0 multipath linear hid_logitech_dj usbhid hid raid1 mxm_wmi radeon crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit ttm ahci libahci drm_kms_helper e1000e drm video ptp pps_core wmi
    [252563.113870] CPU: 3 PID: 13428 Comm: Chrome_IOThread Tainted: GF          O 3.11.0-15-generic #23-Ubuntu
    [252563.113890] Hardware name: ASUS All Series/MAXIMUS VI HERO, BIOS 0224 04/25/2013
    [252563.113906] task: ffff88079bc9aee0 ti: ffff880768020000 task.ti: ffff880768020000
    [252563.113922] RIP: 0010:[<ffffffff811619e0>]  [<ffffffff811619e0>] vma_interval_tree_insert+0x30/0x90
    [252563.113943] RSP: 0018:ffff880768021d90  EFLAGS: 00010206
    [252563.113954] RAX: 0000020000000000 RBX: ffff8806d7f4c980 RCX: 0000000000000000
    [252563.113969] RDX: ffff88079bb7bd70 RSI: ffff88079bb7bd70 RDI: ffff88038fa57c38
    [252563.113984] RBP: ffff880768021d98 R08: 000000000000007f R09: 0000000000000000
    [252563.114000] R10: ffff88038fa57c38 R11: 00007f3f14132000 R12: ffff88038fa57c38
    [252563.114015] R13: ffff880100babae8 R14: ffff880100babaf0 R15: ffff88079bb7bd88
    [252563.114030] FS:  00007f3f4fffe700(0000) GS:ffff88081ecc0000(0000) knlGS:0000000000000000
    [252563.114047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [252563.114059] CR2: 0000020000000018 CR3: 00000007ed0b8000 CR4: 00000000001407e0
    [252563.114074] Stack:
    [252563.114079]  ffffffff8116b698 ffff880768021dd8 ffffffff8116c275 ffff880100babac8
    [252563.114097]  ffff880100babaf0 00007f3f140b2000 ffff880100babae8 ffff8806daf9fd00
    [252563.114114]  ffff880100babac8 ffff880768021e60 ffffffff8116e77c ffff8806daf9fd00
    [252563.114132] Call Trace:
    [252563.114139]  [<ffffffff8116b698>] ? __vma_link_file+0x48/0x80
    [252563.114153]  [<ffffffff8116c275>] vma_link+0x75/0xc0
    [252563.114164]  [<ffffffff8116e77c>] mmap_region+0x48c/0x610
    [252563.114177]  [<ffffffff8116ec05>] do_mmap_pgoff+0x305/0x3c0
    [252563.114190]  [<ffffffff8115a3fd>] vm_mmap_pgoff+0x8d/0xc0
    [252563.114202]  [<ffffffff8116d253>] SyS_mmap_pgoff+0x1d3/0x270
    [252563.114215]  [<ffffffff81017402>] SyS_mmap+0x22/0x30
    [252563.114227]  [<ffffffff816f721d>] system_call_fastpath+0x1a/0x1f
    [252563.114240] Code: 48 8b 47 08 48 2b 07 49 89 fa 4c 8b 8f 98 00 00 00 48 89 f2 31 c9 48 c1 e8 0c 4d 8d 44 01 ff eb 27 66 2e 0f 1f 84 00 00 00 00 00 <4c> 39 40 18 73 04 4c 89 40 18 4c 3b 48 40 48 8d 48 08 48 8d 50 
    [252563.114312] RIP  [<ffffffff811619e0>] vma_interval_tree_insert+0x30/0x90
    [252563.114327]  RSP <ffff880768021d90>
    [252563.114335] CR2: 0000020000000018
    [252563.117845] ---[ end trace eb82b12e51fc5733 ]---
    
  • Answers
  • MariusMatutiae

    Since you have already run memtest for a sufficient amount of time, the most obvious hardware suspect has been ruled out. I take it that you have checked whether the line

     BUG: unable to handle kernel paging request at 0000020000000018
    

    carries the same or a different address every time, right?
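
    If you have several oopses in your logs, a small script can answer that question for you. The following is only a minimal sketch, assuming the messages look like the ones in your trace; the function name summarize_oopses and the SAMPLE text are made up for illustration, and in practice you would feed it the output of dmesg or the contents of your kernel log.

```python
import re
from collections import Counter

# "BUG: unable to handle kernel paging request at <address>"
FAULT_RE = re.compile(
    r"BUG: unable to handle kernel paging request at ([0-9a-fA-F]+)"
)
# "IP: [<address>] symbol+0xoff/0xlen"
# The \b keeps "RIP:" and "EIP:" lines from matching a second time.
IP_RE = re.compile(r"\bIP: \[<[0-9a-fA-F]+>\] (\S+)")


def summarize_oopses(log_text):
    """Count fault addresses and faulting symbols found in kernel log text."""
    addresses = Counter()
    symbols = Counter()
    for line in log_text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            addresses[fault.group(1).lower()] += 1
        ip = IP_RE.search(line)
        if ip:
            symbols[ip.group(1)] += 1
    return addresses, symbols


# Two lines copied from the trace in the question.
SAMPLE = """\
[252563.113569] BUG: unable to handle kernel paging request at 0000020000000018
[252563.113589] IP: [<ffffffff811619e0>] vma_interval_tree_insert+0x30/0x90
"""

if __name__ == "__main__":
    addresses, symbols = summarize_oopses(SAMPLE)
    for address, count in addresses.most_common():
        print(f"fault address {address}: {count} time(s)")
    for symbol, count in symbols.most_common():
        print(f"faulting instruction in {symbol}: {count} time(s)")
```

    If the same address or the same function keeps turning up, that is worth mentioning in any bug report you file.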

    I cannot help you with this report, but may I suggest you use Apport to collect information on your crashes? Apport is Ubuntu's official package for collecting data on crashes and bugs; you can find a good Intro here.

    You need to activate it: edit /etc/apport/crashdb.conf as root (with sudo), find this line,

      'problem_types': ['Bug', 'Package'],
    

    and add a hash symbol (#) at its beginning. Apport will then produce a full trace of the call that generated the crash. There is no need to worry about ulimit in recent versions of Ubuntu either, since Apport is able to work around the core-size limit even if it is set to 0.

    By and large, the best thing to do is to upload the crash report to Launchpad; Apport does this automatically. Yet there is some info that may be helpful even to the inexperienced user. The Intro referenced above states:

    Some fields warrant further details:
    
    SegvAnalysis: when examining a Segmentation Fault (signal 11), Apport attempts to review the exact machine instruction that caused the fault, and checks the program counter, source, and destination addresses, looking for any virtual memory address (VMA) that is outside an allocated range (as reported in the ProcMaps attachment).
    
    SegvReason: a VMA can be read from, written to, or executed. On a SegFault, one of these 3 CPU actions has taken place at a given VMA that is either not allocated, or lacks permissions to perform the action. For example:
    
    SegvReason: reading NULL VMA would mean that a NULL pointer was most likely dereferenced while reading a value.
    
    SegvReason: writing unknown VMA would mean that something was attempting to write to the destination of a pointer aimed outside of allocated memory. (This is sometimes a security issue.)
    
    SegvReason: executing writable VMA [stack] would mean that something was causing code on the stack to be executed, but the stack (correctly) lacked execute permissions. (This is almost always a security issue.)
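
    To pull just these fields out of a report, something like the following may help. It is a rough sketch that assumes the report is plain text made of "Key: value" lines, with the continuation lines of multi-line values indented by one space. The function name read_report_fields is my own, and SAMPLE_REPORT is a hypothetical excerpt written for illustration, not a real report.

```python
def read_report_fields(report_text, wanted=("SegvAnalysis", "SegvReason")):
    """Extract selected fields from Apport-style "Key: value" report text.

    Continuation lines (those starting with a space) are appended to the
    wanted field that precedes them.
    """
    fields = {}
    current = None  # name of the wanted field being collected, if any
    for line in report_text.splitlines():
        if line.startswith(" "):
            if current is not None:
                fields[current].append(line[1:])
            continue
        name, separator, value = line.partition(":")
        if separator and name in wanted:
            current = name
            value = value.strip()
            fields[current] = [value] if value else []
        else:
            current = None
    return {name: "\n".join(lines) for name, lines in fields.items()}


# Hypothetical excerpt, written for illustration; not taken from a real report.
SAMPLE_REPORT = """\
ProblemType: Crash
Signal: 11
SegvAnalysis:
 Segfault happened at: (instruction elided)
 source address: (elided)
SegvReason: reading NULL VMA
"""

if __name__ == "__main__":
    for name, value in read_report_fields(SAMPLE_REPORT).items():
        print(f"{name}:")
        for text in value.splitlines():
            print(f"  {text}")
```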
    

    In the past, this has allowed me to pinpoint a program with a bug (VirtualBox) which caused the crashes. After a full purge and re-install, the problem evaporated. I just wish you the same luck.
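
    One more thing you can read off the oops itself is the "Tainted:" field, which records things like out-of-tree or force-loaded modules. The sketch below decodes those letters. The table reflects my understanding of the kernel's tainted-kernels documentation, so treat the descriptions as an assumption and check them against the documentation for your kernel version; decode_taint is a name I made up.

```python
import re

# Flag meanings as I understand them from the kernel's documentation on
# tainted kernels; verify against the docs for your kernel version.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "G": "no proprietary module was loaded",
    "F": "module was force-loaded",
    "S": "hardware is out of specification",
    "R": "module was force-unloaded",
    "M": "machine check exception occurred",
    "B": "bad page was referenced",
    "U": "taint requested by userspace",
    "D": "kernel has already oopsed or hit a BUG",
    "A": "ACPI table was overridden",
    "W": "kernel issued a warning",
    "C": "staging driver was loaded",
    "I": "workaround for a firmware bug was applied",
    "O": "out-of-tree module was loaded",
    "E": "unsigned module was loaded",
    "L": "soft lockup occurred",
    "K": "kernel was live-patched",
}


def decode_taint(oops_line):
    """Return {flag: description} for the 'Tainted:' field of an oops line."""
    match = re.search(r"Tainted: ([A-Z ]+)", oops_line)
    if not match:
        return {}
    return {
        flag: TAINT_FLAGS.get(flag, "unknown flag")
        for flag in match.group(1)
        if flag != " "
    }


# Line copied from the trace in the question.
SAMPLE_LINE = (
    "[252563.113870] CPU: 3 PID: 13428 Comm: Chrome_IOThread "
    "Tainted: GF          O 3.11.0-15-generic #23-Ubuntu"
)

if __name__ == "__main__":
    for flag, meaning in decode_taint(SAMPLE_LINE).items():
        print(f"{flag}: {meaning}")
```

    A taint flag does not prove that a particular module is at fault; it only tells you what has been loaded, which can help decide what to remove first when testing.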


  • Related Question

    ubuntu - What do I do when I get a Linux kernel bug?
  • user2898

    I just bought a tiny computer called a fit-pc2 which came with a somewhat customized Ubuntu 9.10 installation. uname -a reports:

    Linux 2.6.31-34-fitpc2 #7 SMP Thu Apr 22 17:43:26 IDT 2010 i686 GNU/Linux
    

    It seems that after several hours of running with heavy network load, all networking ceases and I get the following in kern.log:

    BUG: unable to handle kernel paging request at ff09dfc0
    IP: [<c0150300>] kthread_should_stop+0x10/0x20
    *pde = 00000000 
    Oops: 0000 [#1] SMP 
    last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb1/idVendor
    Modules linked in: binfmt_misc ppdev sbc_fitpc2_wdt snd_usb_audio snd_usb_lib i2c_isch sch_gpio snd_seq_dummy snd_hda_intel snd_pcm_oss snd_seq_oss snd_seq_midi snd_rawmidi snd_mixer_oss snd_seq_midi_event snd_seq snd_pcm snd_timer snd_page_alloc snd_seq_device iptable_filter ip_tables x_tables snd_hwdep lpc_sch snd psmouse rt2860sta(C) uvcvideo video pl2303 soundcore mfd_core output videodev v4l1_compat lirc_igorplugusb lirc_dev serio_raw lp parport usbhid r8169 mii iegd_mod drm agpgart
    
    Pid: 16, comm: kblockd/1 Tainted: G         C (2.6.31-34-fitpc2 #7) SBC-FITPC2
    EIP: 0060:[<c0150300>] EFLAGS: 00010246 CPU: 1
    EIP is at kthread_should_stop+0x10/0x20
    EAX: ff09dfc4 EBX: c180cbac ECX: 0109d000 EDX: f709df98
    ESI: f709df98 EDI: c180cba0 EBP: f709dfb8 ESP: f709df90
     DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process kblockd/1 (pid: 16, ti=f709c000 task=f7084b60 task.ti=f709c000)
    Stack:
     c014c14d c180cba4 00000000 f7084b60 c0150770 f709dfa4 f709dfa4 f7023ef4
    <0> c180cba0 c014c0d0 f709dfe0 c015047c 00000000 00000000 00000000 f709dfcc
    <0> f709dfcc c0150400 00000000 00000000 00000000 c0103ce7 f7023ef4 00000000
    Call Trace:
     [<c014c14d>] ? worker_thread+0x7d/0xe0
     [<c0150770>] ? autoremove_wake_function+0x0/0x40
     [<c014c0d0>] ? worker_thread+0x0/0xe0
     [<c015047c>] ? kthread+0x7c/0x90
     [<c0150400>] ? kthread+0x0/0x90
     [<c0103ce7>] ? kernel_thread_helper+0x7/0x10
    Code: a6 8b 55 0c 8d 4d e0 89 f8 89 34 24 e8 7a fd ff ff 89 c3 eb 92 90 90 90 90 90 90 55 64 a1 00 80 76 c0 8b 80 70 02 00 00 89 e5 5d <8b> 40 fc c3 8d b6 00 00 00 00 8d bf 00 00 00 00 55 ba d7 86 62 
    EIP: [<c0150300>] kthread_should_stop+0x10/0x20 SS:ESP 0068:f709df90
    CR2: 00000000ff09dfc0
    ---[ end trace 06004df70b9cf435 ]---
    BUG: unable to handle kernel paging request at ff09dfc8
    IP: [<c0521bc8>] _spin_lock_irqsave+0x18/0x30
    *pde = 00000000 
    Oops: 0002 [#2] SMP 
    last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb1/idVendor
    Modules linked in: binfmt_misc ppdev sbc_fitpc2_wdt snd_usb_audio snd_usb_lib i2c_isch sch_gpio snd_seq_dummy snd_hda_intel snd_pcm_oss snd_seq_oss snd_seq_midi snd_rawmidi snd_mixer_oss snd_seq_midi_event snd_seq snd_pcm snd_timer snd_page_alloc snd_seq_device iptable_filter ip_tables x_tables snd_hwdep lpc_sch snd psmouse rt2860sta(C) uvcvideo video pl2303 soundcore mfd_core output videodev v4l1_compat lirc_igorplugusb lirc_dev serio_raw lp parport usbhid r8169 mii iegd_mod drm agpgart
    
    Pid: 16, comm: kblockd/1 Tainted: G      D  C (2.6.31-34-fitpc2 #7) SBC-FITPC2
    EIP: 0060:[<c0521bc8>] EFLAGS: 00010086 CPU: 1
    EIP is at _spin_lock_irqsave+0x18/0x30
    EAX: 00000100 EBX: ff09dfc8 ECX: 00000286 EDX: ff09dfc8
    ESI: f7084b60 EDI: ff09dfc4 EBP: f709dd88 ESP: f709dd88
     DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process kblockd/1 (pid: 16, ti=f709c000 task=f7084b60 task.ti=f709c000)
    Stack:
     f709dda4 c0127c0b 00000082 00000001 ff09dfc4 f7084b60 00000000 f709ddd0
    <0> c0137fd2 00000086 f70954c4 00000000 f7098480 f709ddf0 f7094fc0 f7084b60
    <0> 00000000 00000009 f709ddf0 c013c3f8 00000001 c1807c60 f709ddf0 f7084b60
    Call Trace:
     [<c0127c0b>] ? complete+0x1b/0x60
     [<c0137fd2>] ? mm_release+0x52/0xf0
     [<c013c3f8>] ? exit_mm+0x18/0x110
     [<c013c6db>] ? do_exit+0xfb/0x2e0
     [<c013998a>] ? print_oops_end_marker+0x2a/0x30
     [<c0522aab>] ? oops_end+0x8b/0xd0
     [<c011eac4>] ? no_context+0xb4/0xd0
     [<c011eb1d>] ? __bad_area_nosemaphore+0x3d/0x1a0
     [<c0133a56>] ? load_balance_newidle+0x96/0x320
     [<c011ec92>] ? bad_area_nosemaphore+0x12/0x20
     [<c0524106>] ? do_page_fault+0x2f6/0x380
     [<c012cc30>] ? finish_task_switch+0x50/0xe0
     [<c0523e10>] ? do_page_fault+0x0/0x380
     [<c0522006>] ? error_code+0x66/0x70
     [<c0523e10>] ? do_page_fault+0x0/0x380
     [<c0150300>] ? kthread_should_stop+0x10/0x20
     [<c014c14d>] ? worker_thread+0x7d/0xe0
     [<c0150770>] ? autoremove_wake_function+0x0/0x40
     [<c014c0d0>] ? worker_thread+0x0/0xe0
     [<c015047c>] ? kthread+0x7c/0x90
     [<c0150400>] ? kthread+0x0/0x90
     [<c0103ce7>] ? kernel_thread_helper+0x7/0x10
    Code: 00 00 00 55 89 e5 f0 83 28 01 79 05 e8 02 ff ff ff 5d c3 55 89 c2 89 e5 9c 58 8d 74 26 00 89 c1 fa 90 8d 74 26 00 b8 00 01 00 00 <f0> 66 0f c1 02 38 e0 74 06 f3 90 8a 02 eb f6 89 c8 5d c3 90 8d 
    EIP: [<c0521bc8>] _spin_lock_irqsave+0x18/0x30 SS:ESP 0068:f709dd88
    CR2: 00000000ff09dfc8
    ---[ end trace 06004df70b9cf436 ]---
    Fixing recursive fault but reboot is needed!
    

    This seems to happen at least once a day. How do I even begin to debug this?


  • Related Answers
  • user42539

    I got the exact same problem when I recompiled the Lenny stock kernel with Atom CPU and RTL8168c/8111c NIC enabled.

    After reverting to the "ubuntu" kernel provided by Compulab, the kernel messages went away. However, the NIC still drops the connection when the link is under moderate load, which is a real PITA for headless servers, which is, after all, what these boxes are designed for!

    I would advise anyone planning to run Linux on these boxes to look elsewhere, since they are very unstable on Linux. Better to buy a standard mini-atx setup, since Compulab seems about as interested in supporting Linux as ATI or FireDTV have been for the past decade.

  • William Hilsum

    Simply put, you have narrowed it down to heavy networking use, so I would say, look for a different/better/newer network driver as the problem most likely lies there.

    If you can't, I would go into the BIOS and disable the network interface (if it is built in), buy yourself a bog-standard non-Realtek card, and see if you have better luck with that instead.
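
    To see which driver each interface is currently using before you go hunting for a replacement, a sketch along these lines should do. It assumes the usual sysfs layout, where /sys/class/net/<interface>/device/driver is a symlink to the bound driver; the function name interface_drivers is made up for illustration.

```python
import os


def interface_drivers(sysfs_net="/sys/class/net"):
    """Map each network interface to the name of its bound kernel driver."""
    drivers = {}
    for interface in sorted(os.listdir(sysfs_net)):
        driver_link = os.path.join(sysfs_net, interface, "device", "driver")
        # Virtual interfaces such as lo have no device/driver link; skip them.
        if os.path.islink(driver_link):
            drivers[interface] = os.path.basename(os.path.realpath(driver_link))
    return drivers


if __name__ == "__main__":
    if os.path.isdir("/sys/class/net"):
        for interface, driver in interface_drivers().items():
            print(f"{interface}: {driver}")
```

    The module list in your oops includes r8169, the in-kernel Realtek Ethernet driver, so that is the name I would expect to see reported for the wired interface.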

  • vtest

    If you paid money for a computer that comes with a custom Linux distribution installed, bug the vendor. I believe it's their job to handle it or pass the bug report upstream.

    If you're on your own, it's a different story. You have many choices: change hardware, try building another kernel version, try debugging the issue yourself if you are skilled enough.