rcu_sched detected stall on CPU

We have seen multiple rcu_sched stall messages on a customer device, after which it crashes or hangs. In this state, the device is not accessible via SSH or 3G.
The kernel version is 3.2.54. The message "rcu_sched detected stall on CPU 0" is repeated many times; what does this indicate? The device exhibits this crash during a power-cycling test. acpower_isr()/poe_isr() are used to update the AC power status/PoE status during each switch-over. Could this be causing the issue (for example, a lock that is never released)?



 Backtrace: 
[<c4011504>] (dump_backtrace+0x0/0x110) from [<c43924bc>] (dump_stack+0x18/0x1c)
r6:c962e080 r5:c96462e0 r4:c9ec4674 r3:c96429bc
[<c43924a4>] (dump_stack+0x0/0x1c) from [<c4082188>] (__rcu_pending+0x88/0x38c)
[<c4082100>] (__rcu_pending+0x0/0x38c) from [<c4083218>] (rcu_check_callbacks+0xe8/0x17c)
[<c4083130>] (rcu_check_callbacks+0x0/0x17c) from [<c4043ac4>] (update_process_times+0x40/0x64)
r8:23339c9a r7:00000000 r6:c6f06ae0 r5:00000000 r4:c8ac8000
r3:00010000
[<c4043a84>] (update_process_times+0x0/0x64) from [<c406513c>] (tick_sched_timer+0x9c/0xdc)
r7:c9ec44a0 r6:c8ac9dd8 r5:c8ac8000 r4:c9ec4598
[<c40650a0>] (tick_sched_timer+0x0/0xdc) from [<c405805c>] (__run_hrtimer+0xf4/0x1c8)
r9:c8ac9d20 r8:23339580 r6:c9ec44d8 r5:c9ec44a0 r4:c9ec4598
[<c4057f68>] (__run_hrtimer+0x0/0x1c8) from [<c4058db4>] (hrtimer_interrupt+0x124/0x288)
[<c4058c90>] (hrtimer_interrupt+0x0/0x288) from [<c40139e0>] (twd_handler+0x28/0x30)
[<c40139b8>] (twd_handler+0x0/0x30) from [<c407f880>] (handle_percpu_devid_irq+0xd0/0x150)
r4:0000001d r3:c40139b8
[<c407f7b0>] (handle_percpu_devid_irq+0x0/0x150) from [<c407be30>] (generic_handle_irq+0x34/0x48)
[<c407bdfc>] (generic_handle_irq+0x0/0x48) from [<c400e5e0>] (handle_IRQ+0x80/0xc0)
[<c400e560>] (handle_IRQ+0x0/0xc0) from [<c40081d0>] (asm_do_IRQ+0x10/0x14)
r5:20000013 r4:c4395234
[<c40081c0>] (asm_do_IRQ+0x0/0x14) from [<c400d738>] (__irq_svc+0x38/0x120)
Exception stack(0xc8ac9dd8 to 0xc8ac9e20)
9dc0: c96ae534 00000013
9de0: 00000001 00000001 c96ae52c c82385a0 00000001 00000001 00006000 d0800000
9e00: d0800000 c8ac9e2c c8ac9e30 c8ac9e20 c40d2f4c c4395234 20000013 ffffffff
[<c4395218>] (_raw_spin_lock+0x0/0x30) from [<c40d2f4c>] (alloc_vmap_area.clone.18+0xa8/0x2f8)
[<c40d2ea4>] (alloc_vmap_area.clone.18+0x0/0x2f8) from [<c40d3268>] (__get_vm_area_node.clone.19+0xcc/0x164)
[<c40d319c>] (__get_vm_area_node.clone.19+0x0/0x164) from [<c40d3bec>] (__vmalloc_node_range+0x5c/0x1d0)
[<c40d3b90>] (__vmalloc_node_range+0x0/0x1d0) from [<c40d3da0>] (__vmalloc_node+0x40/0x4c)
r8:c400de84 r7:00000080 r6:00b7a080 r5:0000465c r4:0000465c
[<c40d3d60>] (__vmalloc_node+0x0/0x4c) from [<c40d3ee4>] (vmalloc+0x30/0x3c)
[<c40d3eb4>] (vmalloc+0x0/0x3c) from [<c406de40>] (sys_init_module+0x5c/0x1878)
[<c406dde4>] (sys_init_module+0x0/0x1878) from [<c400dd00>] (ret_fast_syscall+0x0/0x30)
acpower_isr() [105]
poe_isr() [136]
INFO: rcu_sched detected stall on CPU 0 (t=204330 jiffies)
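
As a reading aid for a trace like the one above, here is a minimal Python sketch that pulls the function names out of the "X from Y" lines, separates the frames that were running when the timer interrupt arrived, and converts the t= value into seconds. The helper names (call_chain, interrupted_frames, stall_seconds) are hypothetical, and HZ=100 is only an assumption, since the device's CONFIG_HZ is not stated.

```python
import re

# One frame line of the backtrace, e.g.
# [<c40d3eb4>] (vmalloc+0x0/0x3c) from [<c406de40>] (sys_init_module+0x5c/0x1878)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] \((?P<callee>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\) "
    r"from \[<[0-9a-f]+>\] \((?P<caller>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\)"
)
STALL_RE = re.compile(r"detected stall on CPU (?P<cpu>\d+) \(t=(?P<t>\d+) jiffies\)")


def call_chain(trace):
    """Return function names from the innermost to the outermost frame."""
    chain = []
    for line in trace.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # register dumps, exception stack words, other log lines
        # Normally the callee repeats the previous caller; it differs at the
        # start of the trace and where the trace crosses an exception boundary.
        if not chain or chain[-1] != m.group("callee"):
            chain.append(m.group("callee"))
        chain.append(m.group("caller"))
    return chain


def interrupted_frames(chain, irq_entry="__irq_svc"):
    """Frames that were executing when the interrupt was taken."""
    if irq_entry not in chain:
        return chain
    return chain[chain.index(irq_entry) + 1:]


def stall_seconds(text, hz):
    """Convert the t=<jiffies> value of a stall report into seconds."""
    m = STALL_RE.search(text)
    if m is None:
        raise ValueError("no stall report found")
    return int(m.group("t")) / hz


# A few lines copied from the trace above.
SAMPLE = """\
[<c40081c0>] (asm_do_IRQ+0x0/0x14) from [<c400d738>] (__irq_svc+0x38/0x120)
Exception stack(0xc8ac9dd8 to 0xc8ac9e20)
[<c4395218>] (_raw_spin_lock+0x0/0x30) from [<c40d2f4c>] (alloc_vmap_area.clone.18+0xa8/0x2f8)
[<c40d3eb4>] (vmalloc+0x0/0x3c) from [<c406de40>] (sys_init_module+0x5c/0x1878)
INFO: rcu_sched detected stall on CPU 0 (t=204330 jiffies)
"""

print(" <- ".join(interrupted_frames(call_chain(SAMPLE))))
print(stall_seconds(SAMPLE, hz=100), "seconds, assuming HZ=100")
```

Everything before __irq_svc in the chain is the timer interrupt path that printed the report; the frames after it show what the CPU was doing when the tick arrived.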






  • You should specify which kernel version this is and, if you can, try another (newer) version to see whether the problem remains.
    – Patrick Mevzek
    Nov 28 '17 at 11:04

  • The kernel version is 3.2.54. Since this is a customer unit, I cannot test with another version.
    – Ravi
    Nov 28 '17 at 11:11

asked Nov 28 '17 at 10:51 by Ravi
edited Dec 7 '17 at 8:30

1 Answer

From the stack we can see that this CPU is spinning on a spinlock while trying to allocate memory (_raw_spin_lock inside alloc_vmap_area). More interestingly, this seems to happen while a new module is being loaded (sys_init_module), which just calls the module's initialisation code (through a pointer jump, which is why you don't see it in the stack trace).

This means that it is very likely either a kernel bug that is exercised when loading this module, or a bug in the module itself (probably the latter, since vmalloc is almost certainly called by the underlying module).

You need to find the module that is responsible for this bug: look at the processes stuck in D state when this happens, or use something like eBPF to trace new calls to module initialisation.
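
If no process-monitoring tool is installed, the D-state check can be approximated by reading /proc directly. The following is a minimal sketch, assuming a Python interpreter is available on the device (which may not be the case); parse_stat and processes_in_state are hypothetical helper names, and only the documented /proc/<pid>/stat layout is relied upon.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, right = stat_line.rsplit(")", 1)
    pid, comm = left.split(" (", 1)
    state = right.split()[0]
    return int(pid), comm, state


def processes_in_state(states="D", proc_root="/proc"):
    """List (pid, comm) for every process whose state letter is in `states`."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs mounted at proc_root
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while reading, or unexpected format
        if state in states:
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm in processes_in_state("D"):
        print(pid, comm)
```

Running this periodically and appending the output to a file on persistent storage would leave a record to inspect after a hang. Note that a task busy-waiting on a spinlock remains runnable, so passing "DR" instead of "D" may be needed to catch it.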






  • When this happens, I am not able to access the unit. How do I check the process status at that time?
    – Ravi
    Nov 29 '17 at 11:43

  • @Ravi You can use something like atop to do logging, then view processes in D state with atop -r, navigating to the desired time. Of course, in a volatile state like this, it is not guaranteed to work, but there's a reasonable chance that it will be able to continue.
    – Chris Down
    Nov 29 '17 at 12:06

  • Unfortunately, we don't have many such utilities on the unit.
    – Ravi
    Nov 30 '17 at 12:10

  • I have edited the crash logs into the question. Is this "rcu_sched detected stall on CPU 0" due to acpower_isr()/poe_isr()? Is _raw_spin_lock() holding back the CPU indefinitely? Unfortunately, there is no debug utility present on the controller that is in this bad state...
    – Ravi
    Dec 5 '17 at 4:53










answered Nov 28 '17 at 12:18 by Chris Down