cp3033 has been unreachable via the production interface since 2018-07-15 11:47:31. The mgmt interface is reachable and the console doesn't show anything out of the ordinary. After logging in, the dmesg log shows NIC issues.
Description
Event Timeline
root@cp3033:/var/log# ethtool -i eth0
driver: bnx2x
version: 1.712.30-0
firmware-version: FFV7.10.17 bc 7.10.11
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
root@cp3033:/var/log# ethtool eth0
Settings for eth0:
    Supported ports: [ FIBRE ]
    Supported link modes:   1000baseT/Full
                            10000baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: No
    Advertised link modes:  10000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Speed: Unknown!
    Duplex: Unknown! (255)
    Port: FIBRE
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: off
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x00000000 (0)
    Link detected: no
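For anyone wanting to check this state from a script rather than by eye, here is a minimal Python sketch that turns `ethtool <iface>` output into a dict and reports whether the link is down. The helper names `parse_ethtool` and `link_is_down` are made up for illustration, and the parsing assumes the plain `Key: value` layout shown in the paste above; it is not an existing tool.

```python
def parse_ethtool(text):
    """Parse 'Key: value' lines of `ethtool <iface>` output into a dict.

    Continuation lines without a colon (extra link modes) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so values like 0000:01:00.0 survive.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def link_is_down(fields):
    """Return True when ethtool reported 'Link detected: no'."""
    return fields.get("Link detected") == "no"


if __name__ == "__main__":
    sample = """Settings for eth0:
    Speed: Unknown!
    Duplex: Unknown! (255)
    Port: FIBRE
    Link detected: no
"""
    parsed = parse_ethtool(sample)
    print(parsed["Speed"], link_is_down(parsed))
```

On a live host the text would typically come from something like `subprocess.run(["ethtool", "eth0"], capture_output=True, text=True).stdout`.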
[10415964.660782] ------------[ cut here ]------------
[10415964.660790] WARNING: CPU: 13 PID: 34222 at /srv/kernel/linux/net/sched/sch_generic.c:316 dev_watchdog+0x226/0x230
[10415964.660793] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 6 timed out
[10415964.660793] Modules linked in: cdc_ether usbnet mii joydev hid_generic usbhid hid cpuid binfmt_misc esp6 xfrm6_mode_transport drbg ansi_cprng seqiv xfrm4_mode_transport cpufreq_conservative cpufreq_powersave cpufreq_userspace xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo 8021q garp mrp stp llc tcp_bbr sch_fq intel_rapl sb_edac ipmi_watchdog edac_core x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 ttm drm_kms_helper kvm dcdbas irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul iTCO_vendor_support evdev drm ghash_clmulni_intel pcspkr i2c_algo_bit mei_me lpc_ich mei shpchp mfd_core wmi button ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 jbd2 fscrypto mbcache raid1 md_mod sg sd_mod ahci libahci aesni_intel aes_x86_64 glue_helper lrw ehci_pci
[10415964.660847]  gf128mul bnx2x ablk_helper ptp ehci_hcd cryptd libata pps_core mdio libcrc32c usbcore crc32c_generic scsi_mod usb_common crc32c_intel
[10415964.660860] CPU: 13 PID: 34222 Comm: cache-worker Not tainted 4.9.0-0.bpo.6-amd64 #1 Debian 4.9.82-1~wmf1
[10415964.660861] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.0.4 08/28/2014
[10415964.660863]  0000000000000000 ffffffffa67305e5 ffff8fe9bf183e38 0000000000000000
[10415964.660865]  ffffffffa6479184 0000000000000006 ffff8fe9bf183e90 ffff8fc9b136c000
[10415964.660868]  000000000000000d ffff8fc9b1377100 000000000000005b ffffffffa64791ff
[10415964.660871] Call Trace:
[10415964.660872]  <IRQ>
[10415964.660878]  [<ffffffffa67305e5>] ? dump_stack+0x5c/0x77
[10415964.660882]  [<ffffffffa6479184>] ? __warn+0xc4/0xe0
[10415964.660884]  [<ffffffffa64791ff>] ? warn_slowpath_fmt+0x5f/0x80
[10415964.660888]  [<ffffffffa696e476>] ? tcp_retransmit_timer+0x286/0x890
[10415964.660891]  [<ffffffffa69369a6>] ? dev_watchdog+0x226/0x230
[10415964.660893]  [<ffffffffa6936780>] ? dev_deactivate_queue.constprop.27+0x60/0x60
[10415964.660898]  [<ffffffffa64e85b2>] ? call_timer_fn+0x32/0x130
[10415964.660899]  [<ffffffffa64e9385>] ? run_timer_softirq+0x1e5/0x440
[10415964.660902]  [<ffffffffa67398a4>] ? timerqueue_add+0x54/0xa0
[10415964.660904]  [<ffffffffa64ea808>] ? enqueue_hrtimer+0x38/0x90
[10415964.660909]  [<ffffffffa6a1617c>] ? __do_softirq+0x10c/0x2a2
[10415964.660911]  [<ffffffffa647f4b8>] ? irq_exit+0x98/0xa0
[10415964.660913]  [<ffffffffa6a15c14>] ? smp_apic_timer_interrupt+0x44/0x50
[10415964.660915]  [<ffffffffa6a14496>] ? apic_timer_interrupt+0x96/0xa0
[10415964.660916]  <EOI>
[10415964.660920]  [<ffffffffa64c5bb3>] ? native_queued_spin_lock_slowpath+0x113/0x190
[10415964.660922]  [<ffffffffa6a1245d>] ? _raw_spin_lock+0x1d/0x20
[10415964.660924]  [<ffffffffa64fb018>] ? futex_wake+0xc8/0x170
[10415964.660926]  [<ffffffffa64fd149>] ? do_futex+0x2d9/0xb40
[10415964.660930]  [<ffffffffa64257d9>] ? __switch_to+0x2c9/0x730
[10415964.660932]  [<ffffffffa64fda33>] ? SyS_futex+0x83/0x180
[10415964.660936]  [<ffffffffa6a0dd52>] ? schedule+0x32/0x80
[10415964.660939]  [<ffffffffa6403bd3>] ? do_syscall_64+0x93/0x1a0
[10415964.660941]  [<ffffffffa6a126b8>] ? entry_SYSCALL_64_after_swapgs+0x42/0xb0
[10415964.660942] ---[ end trace 17a2f2dfd85d5ced ]---
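The line that matters in the trace above is the `NETDEV WATCHDOG` one: the kernel's transmit watchdog fired because a TX queue on eth0 stopped completing packets. As a rough sketch of how one might pull those events out of a saved dmesg dump, assuming the message format matches the line in this paste (the function name `find_tx_timeouts` is hypothetical):

```python
import re

# Matches lines such as:
# [10415964.660793] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 6 timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(dmesg_text):
    """Return (timestamp, interface, driver, queue) for each watchdog timeout."""
    return [
        (float(m.group("ts")), m.group("iface"), m.group("driver"), int(m.group("queue")))
        for m in WATCHDOG_RE.finditer(dmesg_text)
    ]


if __name__ == "__main__":
    sample = (
        "[10415964.660790] WARNING: CPU: 13 PID: 34222 at "
        "/srv/kernel/linux/net/sched/sch_generic.c:316 dev_watchdog+0x226/0x230\n"
        "[10415964.660793] NETDEV WATCHDOG: eth0 (bnx2x): transmit queue 6 timed out\n"
    )
    for event in find_tx_timeouts(sample):
        print(event)
```

The timestamp is seconds since boot, so it has to be combined with the boot time to map it back to the wall-clock time of the outage.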
Mentioned in SAL (#wikimedia-operations) [2018-07-16T11:38:43Z] <vgutierrez> power cycle cp3033 - T199677
After a power cycle the server is behaving properly. Since it was already depooled, I'm not repooling it.
That sounds like a hang in the NIC, but I doubt we have any useful hardware diagnostics/logging on that level.
The host also shows that its power supplies are not redundant, which had a comment linking to T177403 -> T177228.
Its support has also expired (https://netbox.wikimedia.org/dcim/devices/831/).
Should we rather create a decom ticket for it?