剑客
Covering technology and the internet

Torvalds apologizes for a bug in Linux Kernel 4.8

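Linus Torvalds released the final version of Linux 4.8 on October 3. Shortly before the release he accepted a set of patches sent by Andrew Morton. The patches were meant to fix a bug that had existed since 3.15, but they caused a problem more serious than the original bug.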

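He apologized for this on the kernel mailing list, saying he had had high expectations of patches coming through Andrew and had assumed they were thoroughly tested, but that adding random BUG_ON() calls to the code showed the testing had been insufficient. He strongly criticized the practice of debugging with BUG_ON() back in 2002, yet the same thing has happened again nearly 15 years later.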

The email reads as follows:

BUG_ON() in workingset_node_shadows_dec() triggers

From: Linus Torvalds
Date: Tue Oct 04 2016 - 00:01:12 EST


I’m really sorry I applied that last series from Andrew just before doing the 4.8 release, because they cause problems, and now it is in 4.8 (and that buggy crap is marked for stable too).

In particular, I just got this

kernel BUG at ./include/linux/swap.h:276

and the end result was a dead kernel.

The bug that commit 22f2ac51b6d64 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()") purports to have fixed has apparently been there since 3.15, but the fix is clearly worse than the bug it tried to fix, since that original bug has never killed my machine!

I should have reacted to the damn added BUG_ON() lines. I suspect I will have to finally just remove the idiotic BUG_ON() concept once and for all, because there is NO F*CKING EXCUSE to knowingly kill the kernel.

Why the hell was that not a *warning*?

Yes, I’m grumpy. This went in very late in the release candidates, and I had higher expectations of things coming in through Andrew. Adding random BUG_ON()’s to code that clearly hasn’t had sufficient testing is *not* acceptable, and it’s definitely not acceptable to send that to me after rc8 unless it has gotten a *lot* of testing, which it clearly must not have had. Adding stable to the cc too to warn about this.

The full report is

kernel BUG at ./include/linux/swap.h:276!
invalid opcode: 0000 [#1] SMP
Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE
nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis
industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss
nfs_acl lockd grace sunrpc dm_crypt
CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
Hardware name: System manufacturer System Product Name/Z170-K, BIOS
1803 05/06/2016
task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
RIP: page_cache_tree_insert+0xf1/0x100
RSP: 0018:ffff8faa7f47bab0 EFLAGS: 00010046
RAX: 0000000000000001 RBX: ffff8faadfaf8c18 RCX: ffff8fa8737b5488
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fa8737b4b48
RBP: ffff8faa7f47bae8 R08: 0000000000000012 R09: ffff8fa8737b54b0
R10: 0000000000000040 R11: ffff8fa8737b54b0 R12: ffffea000b1ad580
R13: 0000000000000000 R14: ffff8faa7f47bb48 R15: ffffea000b1ad580
FS: 00007ffba3a61780(0000) GS:ffff8faaf6c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffba31a5430 CR3: 00000002c6d40000 CR4: 00000000003406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__add_to_page_cache_locked+0x12e/0x270
add_to_page_cache_lru+0x4e/0xe0
mpage_readpages+0x112/0x1d0
blkdev_readpages+0x1d/0x20
__do_page_cache_readahead+0x1ad/0x290
force_page_cache_readahead+0xaa/0x100
page_cache_sync_readahead+0x3f/0x50
generic_file_read_iter+0x5af/0x740
blkdev_read_iter+0x35/0x40
__vfs_read+0xe1/0x130
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x13/0x8f
Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48
83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f>
0b e8 88 68 ef ff 0f 1f 84 00
RIP page_cache_tree_insert+0xf1/0x100

and I hope somebody can see what is going wrong in there. The reason the machine *dies* from that thing is that we end up then immediately having a

BUG: unable to handle kernel paging request at ffffffffb70bdaa8
IP: blk_flush_plug_list+0x8b/0x250
Call Trace:
schedule+0x61/0x80
do_exit+0x8c8/0xae0
rewind_stack_do_exit+0x17/0x20

and then a

Fixing recursive fault but reboot is needed!

and the machine will never recover.

People who add random assert statements that kill machines should damn well not be let near the VM layer.

Johannes? Please make this your first priority. And in the meantime I will make that VM_BUG_ON() be a VM_WARN_ON_ONCE().

And dammit, if anybody else feels that they had done "debugging messages with BUG_ON()", I would suggest you

(a) rethink your approach to programming

(b) send me patches to remove the crap entirely, or make them real *DEBUGGING* messages, not "kill the whole machine" messages.

I’ve ranted against people using BUG_ON() for debugging in the past. Why the f*ck does this still happen? And Andrew – please stop taking those kinds of patches! Lookie here:

https://lwn.net/Articles/13183/

so excuse me for being upset that people still do this shit almost 15 years later.

Linus
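
The email turns on the difference between a check that kills the machine and one that only warns. The following is a minimal userspace sketch of that difference, not kernel code. Every name in it (SKETCH_BUG_ON, sketch_warn_on_once, struct sketch_node and the two decrement functions) is hypothetical and invented for illustration. It is only loosely modeled on the counter decrement named in the email's subject line, and it assumes the failing condition is a counter that is already zero.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for illustration only.
 * These are NOT the kernel's BUG_ON()/WARN_ON_ONCE() implementations. */

/* Fatal style: print a message and abort the whole process. */
#define SKETCH_BUG_ON(cond)                                          \
    do {                                                             \
        if (cond) {                                                  \
            fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__);   \
            abort();                                                 \
        }                                                            \
    } while (0)

/* Warning style: report only the first failure, then return whether
 * the condition held so that the caller can recover. */
static int sketch_warn_on_once(int cond, int *warned,
                               const char *file, int line)
{
    if (cond && !*warned) {
        *warned = 1;
        fprintf(stderr, "WARNING at %s:%d\n", file, line);
    }
    return cond;
}

/* Hypothetical node with a counter that must never underflow. */
struct sketch_node {
    unsigned int shadows;
};

/* Fatal variant: an unexpected zero count kills the program. */
void sketch_shadows_dec_fatal(struct sketch_node *node)
{
    SKETCH_BUG_ON(node->shadows == 0);
    node->shadows--;
}

/* Recoverable variant: warn once, skip the decrement, keep running.
 * Returns 0 on success and -1 if the count was already zero. */
int sketch_shadows_dec(struct sketch_node *node)
{
    static int warned; /* one flag for this call site */

    if (sketch_warn_on_once(node->shadows == 0, &warned,
                            __FILE__, __LINE__))
        return -1;
    node->shadows--;
    return 0;
}
```

Calling sketch_shadows_dec() on a node whose count is already zero prints a single warning and returns -1, leaving the count untouched. sketch_shadows_dec_fatal() would abort the process in the same situation, which is the userspace analogue of the dead kernel described in the email.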
