[Bloat] bufferbloat.net server having troubles?

Discussion:

Rich Brown

2018-03-31 18:26:18 UTC

I just went to bufferbloat.net and https://www.bufferbloat.net/projects/bloat/wiki/RRUL_Chart_Explanation/ and am receiving intermittent 502 & 522 errors. Is anyone else seeing this? Let me know if you need more details. Thanks.

Rich

Toke Høiland-Jørgensen

2018-03-31 19:50:45 UTC

Post by Rich Brown
I just went to bufferbloat.net and
https://www.bufferbloat.net/projects/bloat/wiki/RRUL_Chart_Explanation/
and am receiving intermittent 502 & 522 errors. Is anyone else seeing
this? Let me know if you need more details. Thanks.

Yeah, the box running the web server is having some issues with NULL
pointer dereferences in tcp_push() in the kernel crashing processes
running TCP. Haven't been able to figure out why :/

[332756.817052] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[332756.817072] IP: tcp_push+0x40/0x120
[332756.817075] PGD 0 P4D 0
[332756.817082] Oops: 0002 [#11] SMP PTI
[332756.817085] Modules linked in: fuse md4 nls_utf8 cifs ccm dns_resolver fscache wireguard(O) ip6_udp_tunnel udp_tunnel ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_mangle ip6_tables ipt_REJECT nf_reject_ipv4 xt_policy xt_set xt_hashlimit xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c crc32c_generic xt_TCPMSS xt_tcpudp iptable_mangle ip_set_hash_ip tun ip_set nfnetlink sunrpc snd_hda_codec_hdmi nvidia_drm(PO) nvidia_modeset(PO) nls_iso8859_1 nls_cp437 intel_rapl vfat nvidia(PO) fat uvcvideo videobuf2_vmalloc x86_pkg_temp_thermal snd_usb_audio intel_powerclamp videobuf2_memops videobuf2_v4l2 videobuf2_core coretemp joydev dcdbas videodev kvm_intel mousedev
[332756.817155] snd_usbmidi_lib iTCO_wdt input_leds snd_hda_codec_realtek snd_rawmidi evdev mei_wdt iTCO_vendor_support media snd_seq_device drm_kms_helper mac_hid dell_smm_hwmon snd_hda_codec_generic kvm drm irqbypass snd_hda_intel intel_cstate intel_rapl_perf snd_hda_codec pcspkr snd_hda_core agpgart ipmi_devintf snd_hwdep ipmi_msghandler snd_pcm syscopyarea sysfillrect sysimgblt fb_sys_fops snd_timer i2c_i801 snd mei_me soundcore mei lpc_ich shpchp wmi button vmmon(O) vmw_vmci vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) tcp_bbr sit tunnel4 ip_tunnel sg ip_tables x_tables ext4 crc16 mbcache jbd2 fscrypto algif_skcipher af_alg hid_generic usbhid hid arc4 rt2800usb rt2x00usb rt2800lib rt2x00lib led_class mac80211 cfg80211 rfkill dm_crypt dm_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel
[332756.817239] ghash_clmulni_intel pcbc xhci_pci xhci_hcd ehci_pci ahci ehci_hcd libahci aesni_intel aes_x86_64 crypto_simd glue_helper libata cryptd tg3 libphy scsi_mod e1000e usbcore ptp usb_common pps_core sch_fq
[332756.817267] CPU: 0 PID: 22603 Comm: nginx Tainted: P D O 4.14.29-1-lts #1
[332756.817270] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A15 01/04/2018
[332756.817273] task: ffff97ccd2914880 task.stack: ffffbe1e04cc4000
[332756.817279] RIP: 0010:tcp_push+0x40/0x120
[332756.817282] RSP: 0018:ffffbe1e04cc7d00 EFLAGS: 00010246
[332756.817286] RAX: 0000000000000000 RBX: ffff97cb98b43b80 RCX: 0000000000000001
[332756.817289] RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff97cb98b43b80
[332756.817292] RBP: 00000000000021f0 R08: 00000000000005a8 R09: 00000000000005a8
[332756.817295] R10: ffff97cb98b43cd8 R11: 0000000000000000 R12: 00000000000005a8
[332756.817298] R13: 0000000000000040 R14: ffff97cb98b43cd8 R15: 00000000ffffffe0
[332756.817302] FS: 00007f909a778b80(0000) GS:ffff97ceafc00000(0000) knlGS:0000000000000000
[332756.817305] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[332756.817308] CR2: 0000000000000038 CR3: 000000031172a001 CR4: 00000000001606f0
[332756.817311] Call Trace:
[332756.817320] tcp_sendmsg_locked+0xb10/0xe50
[332756.817328] ? sock_poll+0x70/0x90
[332756.817334] tcp_sendmsg+0x27/0x40
[332756.817339] sock_write_iter+0xa3/0x110
[332756.817347] __vfs_write+0x102/0x180
[332756.817353] vfs_write+0xad/0x1a0
[332756.817358] SyS_write+0x52/0xc0
[332756.817366] do_syscall_64+0x67/0x120
[332756.817379] RIP: 0033:0x7f909a1958b4
[332756.817382] RSP: 002b:00007ffff5e51c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[332756.817386] RAX: ffffffffffffffda RBX: 0000000000002f4c RCX: 00007f909a1958b4
[332756.817389] RDX: 0000000000002f4c RSI: 000055be7827c36c RDI: 0000000000000016
[332756.817392] RBP: 000055be78050340 R08: 0000000000000000 R09: 0000000000000000
[332756.817395] R10: 000055be7804eab0 R11: 0000000000000246 R12: 000055be7827c36c
[332756.817398] R13: 0000000000002f4c R14: 00007f909a778b00 R15: 000055be78272150
[332756.817401] Code: 00 48 8b 87 60 01 00 00 4c 8d 97 58 01 00 00 ba 00 00 00 00 41 89 f3 49 39 c2 48 0f 44 c2 41 81 e3 00 80 00 00 0f 85 9d 00 00 00 <80> 48 38 08 8b 97 74 06 00 00 89 97 7c 06 00 00 83 e6 01 74 0c
[332756.817461] RIP: tcp_push+0x40/0x120 RSP: ffffbe1e04cc7d00
[332756.817463] CR2: 0000000000000038
[332756.817469] ---[ end trace 797c2d8c9eead6f1 ]---

-Toke
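
For anyone reading the trace above: the value after `Oops:` is the x86 page-fault error code, and decoding it shows what kind of access faulted. The following is a minimal sketch, assuming the standard x86 bit layout (bit 0: protection violation vs. page not present, bit 1: write, bit 2: user mode, bit 3: reserved bit set, bit 4: instruction fetch). The name `decode_oops_code` is hypothetical and not part of any kernel tooling.

```python
# Bits of the x86 page-fault error code printed after "Oops:" in the log.
# Each entry is (mask, meaning if the bit is set, meaning if it is clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return a list of human-readable facts for an Oops error code string."""
    code = int(code_hex, 16)
    facts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            facts.append(if_set)
        elif if_clear is not None:
            facts.append(if_clear)
    return facts


# "Oops: 0002" from the trace above.
print(decode_oops_code("0002"))
# ['page not present', 'write access', 'kernel mode']
```

So the fault was a kernel-mode write to an unmapped page. That fits the rest of the dump: CR2 is 0x38 and RAX is 0, and the marked bytes `<80> 48 38 08` appear to decode as an 8-bit OR of 0x08 into the byte at offset 0x38 from RAX, i.e. a flag being set through a NULL pointer.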

Jonathan Morton

2018-03-31 19:54:53 UTC

Post by Toke Høiland-Jørgensen
Yeah, the box running the web server is having some issues with NULL
pointer dereferences in tcp_push() in the kernel crashing processes
running TCP. Haven't been able to figure out why :/

Maybe build/install a new kernel and reboot?

Possibility exists of hardware failure, too. Less likely, perhaps, but if you don't have ECC...

- Jonathan Morton

Eric Dumazet

2018-04-01 00:50:31 UTC

Post by Jonathan Morton

Maybe build/install a new kernel and reboot?
Possibility exists of hardware failure, too. Less likely, perhaps, but if you don't have ECC...

Nope, known bug on stable kernels.

Please upgrade or downgrade.

$ git log --oneline v4.14.31..v4.14.32 -- net/ipv4/tcp.c
e44c1733059c tcp: purge write queue upon aborting the connection

commit e44c1733059c69868e81f82eb09fcb6bbc492050
Author: Soheil Hassas Yeganeh <***@google.com>
Date: Tue Mar 6 17:15:12 2018 -0500

    tcp: purge write queue upon aborting the connection

    [ Upstream commit e05836ac07c77dd90377f8c8140bce2a44af5fe7 ]

    When the connection is aborted, there is no point in
    keeping the packets on the write queue until the connection
    is closed.

    Similar to a27fd7a8ed38 ('tcp: purge write queue upon RST'),
    this is essential for a correct MSG_ZEROCOPY implementation,
    because userspace cannot call close(fd) before receiving
    zerocopy signals even when the connection is aborted.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Soheil Hassas Yeganeh <***@google.com>
    Signed-off-by: Neal Cardwell <***@google.com>
    Reviewed-by: Eric Dumazet <***@google.com>
    Signed-off-by: Yuchung Cheng <***@google.com>
    Signed-off-by: David S. Miller <***@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <***@linuxfoundation.org>
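
To check whether a box already runs a 4.14.y release that contains this commit, something like the sketch below could be used. It assumes only what the `git log` range above shows, namely that the commit first appears in v4.14.32. The helper names are hypothetical, it is only meaningful for the 4.14 stable series, and it says nothing about which earlier release introduced the bug.

```python
import re

# First 4.14.y release containing the commit, per the git log range above.
FIX_RELEASE = (4, 14, 32)


def parse_release(release):
    """Extract (major, minor, patch) from a string such as '4.14.29-1-lts'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_fix_4_14(release):
    """True if a 4.14.y release string is at or beyond 4.14.32."""
    version = parse_release(release)
    if version[:2] != FIX_RELEASE[:2]:
        raise ValueError("this check only covers the 4.14.y series")
    return version >= FIX_RELEASE


# The kernel from the oops earlier in the thread.
print(has_fix_4_14("4.14.29-1-lts"))  # False
print(has_fix_4_14("4.14.32-1-lts"))  # True
```

On a live system the release string can be taken from `platform.release()` in the Python standard library, which returns the same value as `uname -r`.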

Toke Høiland-Jørgensen

2018-04-01 15:45:21 UTC

Post by Eric Dumazet

Post by Jonathan Morton

Maybe build/install a new kernel and reboot?
Possibility exists of hardware failure, too. Less likely, perhaps, but if you don't have ECC...

Nope, known bug on stable kernels.
Please upgrade or downgrade.

Ah, saw that in the changelog for 4.14.32, and was hoping it was a fix
for the issue I was seeing. But haven't had time to upgrade yet. Thanks
for confirming that it is! I'll go upgrade I guess :)

-Toke