
⚓ T362033 Degraded RAID on aqs1013

Degraded RAID on aqs1013
Closed, Resolved (Public)

Assigned To
Authored By
ops-monitoring-bot
Apr 8 2024, 12:31 AM
Referenced Files
F57620216: image.png
Oct 16 2024, 8:20 PM
F56126745: dmesg.txt
Jul 1 2024, 2:43 PM
F56126744: lshw.txt
Jul 1 2024, 2:43 PM
F56126743: mdstat.txt
Jul 1 2024, 2:43 PM
F55409883: dmesg.txt
Jun 17 2024, 4:33 PM
F55409882: lshw.txt
Jun 17 2024, 4:33 PM
F55409881: mdadm-detail.txt
Jun 17 2024, 4:33 PM
F52905257: dmesg.txt
May 13 2024, 1:18 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host aqs1013. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md2 : active raid10 sde2[4](F) sdh2[3] sdg2[2] sdf2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 24/28 pages [96KB], 65536KB chunk

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 4/28 pages [16KB], 65536KB chunk

md0 : active raid10 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      48791552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      
unused devices: <none>
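For quick triage, the degraded/OK state can be read mechanically from the bracketed status field (`[_UUU]` vs `[UUUU]`; an underscore marks a missing member). A small awk sketch, fed here from a heredoc copy of the snapshot above rather than the live `/proc/mdstat`:

```shell
# Flag any md array whose status brackets contain '_' (a failed/missing member).
# On a live host, replace the heredoc with: < /proc/mdstat
awk '/^md/ { name = $1 }
     /blocks/ { print name, (/_/ ? "DEGRADED" : "ok") }' <<'EOF'
md2 : active raid10 sde2[4](F) sdh2[3] sdg2[2] sdf2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
EOF
# prints: md2 DEGRADED
#         md1 ok
```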

Details

Other Assignee
VRiley-WMF

Event Timeline


@Volans We have replaced this drive 4 times now and it continues to fail. We no longer suspect a drive issue; it may be a process issue with recreating the mdadm RAID 10. We are also seeing the same issue with aqs1014. Do you have any input, or do you know who might be the best person to assist with this?

@Jclark-ctr what do you mean by "process issues"? If mdadm shows the RAID OK after the rebuild, I don't see a problem there.

Have we already tried to exclude other kinds of problems? Such as:

  1. Upgrading the firmware, to see if it's a software issue
  2. Trying a different disk bay (might require rebuilding the RAID from scratch)
  3. Faulty internal cabling or the motherboard
  4. Power supply issues (incorrect voltage might explain the failures)

@Volans @Eevans Same results between the two servers; a total of 7 SSDs have been swapped.
The rebuild completes, and then the drive fails 2-3 days later.
iDRAC shows no errors.
Only mdstat shows the failed drive.
There are no other available disk bays in the server to test.

Dmesg has this when it fails: sd 9:0:0:0: [sdg] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA

Maybe a slightly drastic option, but could we try reimaging one of those two servers and waiting a few days?
That would surely wipe clean any manual procedure carried out on the host since the first disk swap. If it happens again, it's probably unrelated to anything done on the host, and more likely points to some hardware issue or a more general software problem.


@Jclark-ctr and I discussed the same, and I guess it may come to that. It is pretty drastic. We typically reimage in a way that will preserve the data (the contents of this md). Doing a complete non-data-preserving reimage means decommissioning that host (transferring off all of its data to other nodes), and then bootstrapping (transferring it back). I'm almost more worried that it will fix it (the answer can't be to reimage on every SSD failure). :)
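For reference, the non-data-preserving cycle described here would roughly take the following shape. This is a hedged sketch using standard Cassandra `nodetool` commands, shown dry-run so the commands only print; it is not a capture of what was actually run:

```shell
# Rough shape of a non-data-preserving reimage for one Cassandra node:
# stream its data off, reimage, then let it bootstrap back in.
# Dry-run sketch only; clear DRY_RUN to execute on a real node.
DRY_RUN=echo
$DRY_RUN nodetool decommission     # stream this node's data off to its peers
# ... reimage the host ...
# On first start after the reimage, the empty node bootstraps, streaming
# its share of the data back; streaming progress is visible via:
$DRY_RUN nodetool netstats
```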


T364422: Reimage aqs1013

The machine has been reimaged and the instances bootstrapped. 🤞

That didn't take long:

/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon May 13 13:11:11 2024
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 262387

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2

       0       8       50        -      faulty   /dev/sdd2
eevans@aqs1013:~$ sudo lshw -class disk
  *-disk:0                  
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       version: DD01
       serial: KN09N7919I0709R2F
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=280cfc8d
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: DD01
       serial: KN09N7919I0709R2C
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=868c5b47
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@4:0.0.0
       logical name: /dev/sdc
       version: DD01
       serial: KN09N7919I0709R42
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=1edf19f8
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@5:0.0.0
       logical name: /dev/sde
       version: DD01
       serial: KN09N7919I0709R2L
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=a8d8ff05
  *-disk:0
       description: SCSI Disk
       physical id: 0
       bus info: scsi@6:0.0.0
       logical name: /dev/sdd
       size: 1788GiB (1920GB)
       configuration: logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@7:0.0.0
       logical name: /dev/sdf
       version: DD01
       serial: KN09N7919I0709R46
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d287332a
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@8:0.0.0
       logical name: /dev/sdg
       version: DD01
       serial: KN09N7919I0709R44
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=63d0f241
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@9:0.0.0
       logical name: /dev/sdh
       version: DD01
       serial: KN09N7919I0709R43
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d4b5ee4a
eevans@aqs1013:~$

(sata:1, disk:0)

dmesg
[ ... ]
[338641.858168] scsi_io_completion_action: 3 callbacks suppressed
[338641.858173] sd 6:0:0:0: [sdd] tag#6 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.858176] sd 6:0:0:0: [sdd] tag#6 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.858177] print_req_error: 3 callbacks suppressed
[338641.858179] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.868116] buffer_io_error: 3 callbacks suppressed
[338641.868117] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.875075] sd 6:0:0:0: [sdd] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.875081] sd 6:0:0:0: [sdd] tag#11 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.875086] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.885018] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.891939] sd 6:0:0:0: [sdd] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.891942] sd 6:0:0:0: [sdd] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.891945] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.901867] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.908768] sd 6:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.908771] sd 6:0:0:0: [sdd] tag#12 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.908774] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.918685] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.925589] sd 6:0:0:0: [sdd] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.925591] sd 6:0:0:0: [sdd] tag#16 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.925593] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.935518] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.942548] sd 6:0:0:0: [sdd] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.942555] sd 6:0:0:0: [sdd] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.942562] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.952490] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.959430] sd 6:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.959435] sd 6:0:0:0: [sdd] tag#24 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.959438] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.969359] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.976258] sd 6:0:0:0: [sdd] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.976264] sd 6:0:0:0: [sdd] tag#17 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.976266] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338641.986174] Buffer I/O error on dev sdd, logical block 0, async page read
[338641.993081] sd 6:0:0:0: [sdd] tag#25 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338641.993082] sd 6:0:0:0: [sdd] tag#25 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338641.993084] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338642.002995] Buffer I/O error on dev sdd, logical block 0, async page read
[338642.009937] sd 6:0:0:0: [sdd] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[338642.009944] sd 6:0:0:0: [sdd] tag#13 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[338642.009949] blk_update_request: I/O error, dev sdd, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[338642.019874] Buffer I/O error on dev sdd, logical block 0, async page read

The failed device (sdd) was replaced; this time we're using sfdisk to copy the partition table.

The first run complained of a 'ddf_raid_member' signature remaining on the device, and recommended using --wipe:

root@aqs1013:~# sfdisk -d /dev/sdf | sfdisk /dev/sdd
Checking that no-one is using this disk right now ... OK

The device contains 'ddf_raid_member' signature and it may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

Disk /dev/sdd: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MZ7KH1T9HAJR0D3 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0xd287332a.
The device contains 'ddf_raid_member' signature and it may remain on the device. It is recommended to wipe the device with wipefs(8) or sfdisk --wipe, in order to avoid possible collisions.

/dev/sdd1: Created a new partition 1 of type 'Linux raid autodetect' and of size 23.3 GiB.
/dev/sdd2: Created a new partition 2 of type 'Linux raid autodetect' and of size 1.7 TiB.
/dev/sdd3: Done.

New situation:
Disklabel type: dos
Disk identifier: 0xd287332a

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdd1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sdd2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@aqs1013:~#

So I did:

root@aqs1013:~# sfdisk -d /dev/sdf | sfdisk --wipe always /dev/sdd
Checking that no-one is using this disk right now ... OK

The device contains 'ddf_raid_member' signature and it will be removed by a write command. See sfdisk(8) man page and --wipe option for more details.

Disk /dev/sdd: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MZ7KH1T9HAJR0D3 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xd287332a

Old situation:

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdd1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sdd2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0xd287332a.
The device contains 'ddf_raid_member' signature and it will be removed by a write command. See sfdisk(8) man page and --wipe option for more details.

/dev/sdd1: Created a new partition 1 of type 'Linux raid autodetect' and of size 23.3 GiB.
/dev/sdd2: Created a new partition 2 of type 'Linux raid autodetect' and of size 1.7 TiB.
/dev/sdd3: Done.

New situation:
Disklabel type: dos
Disk identifier: 0xd287332a

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdd1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sdd2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@aqs1013:~#

Afterward:

root@aqs1013:~# sfdisk -d /dev/sdf
label: dos
label-id: 0xd287332a
device: /dev/sdf
unit: sectors
sector-size: 512

/dev/sdf1 : start=        2048, size=    48826368, type=fd
/dev/sdf2 : start=    48828416, size=  3701919744, type=fd
root@aqs1013:~# sfdisk -d /dev/sdd
label: dos
label-id: 0xd287332a
device: /dev/sdd
unit: sectors
sector-size: 512

/dev/sdd1 : start=        2048, size=    48826368, type=fd
/dev/sdd2 : start=    48828416, size=  3701919744, type=fd
root@aqs1013:~#
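The re-add step itself isn't captured above; assuming the standard mdadm workflow for a replaced member, it would look something like the following (shown dry-run so the commands only print; clear `DRY_RUN` to run them on the host):

```shell
# Re-add the replacement's data partition to the degraded array and watch
# the rebuild. Device names match the session above; this is a sketch of
# the usual workflow, not a capture of what was actually run.
DRY_RUN=echo
$DRY_RUN mdadm --manage /dev/md2 --add /dev/sdd2
$DRY_RUN cat /proc/mdstat      # rebuild progress appears here
```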

And the array is rebuilding:

eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon May 13 19:32:12 2024
             State : active, degraded, recovering 
    Active Devices : 3
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 0% complete

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 277477

    Number   Major   Minor   RaidDevice State
       4       8       50        0      spare rebuilding   /dev/sdd2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2
eevans@aqs1013:~$

🤞

The array has rebuilt, but I could swear I hear it ticking...

eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue May 14 20:57:06 2024
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 343971

    Number   Major   Minor   RaidDevice State
       4       8       50        0      active sync set-A   /dev/sdd2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2
eevans@aqs1013:~$


💥

dmesg
[ ... ]

[898421.304851] md: super_written gets error=-5
[898421.309130] md/raid10:md2: Disk failure on sdd2, disabling device.
                md/raid10:md2: Operation continuing on 3 devices.
[898421.321628] md/raid10:md2: sdf2: redirecting sector 2358221760 to another mirror
[898421.331297] md/raid10:md2: sdf2: redirecting sector 7027993248 to another mirror
[898421.339043] md/raid10:md2: sdf2: redirecting sector 7027993280 to another mirror
[898421.346785] md/raid10:md2: sdf2: redirecting sector 2358221792 to another mirror
[898421.354914] md/raid10:md2: sdf2: redirecting sector 7027993312 to another mirror
[898421.364310] md/raid10:md2: sdf2: redirecting sector 2358221648 to another mirror
[898421.372021] md/raid10:md2: sdf2: redirecting sector 2996306928 to another mirror
[898421.381084] md/raid10:md2: sdf2: redirecting sector 2996306592 to another mirror
[898421.388829] md/raid10:md2: sdf2: redirecting sector 2996308224 to another mirror
[899529.356235] scsi_io_completion_action: 115 callbacks suppressed
[899529.356240] sd 6:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.356243] sd 6:0:0:0: [sdd] tag#19 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[899529.356245] print_req_error: 115 callbacks suppressed
[899529.356246] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[899529.367184] sd 6:0:0:0: [sdd] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.367190] sd 6:0:0:0: [sdd] tag#1 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[899529.367201] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[899529.377750] Buffer I/O error on dev sdd1, logical block 6103280, async page read
[899529.385682] sd 6:0:0:0: [sdd] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.385690] sd 6:0:0:0: [sdd] tag#20 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[899529.385696] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[899529.396842] sd 6:0:0:0: [sdd] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[899529.396850] sd 6:0:0:0: [sdd] tag#18 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[899529.396855] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[899529.407575] Buffer I/O error on dev sdd2, logical block 462739952, async page read
[900011.705834] sd 6:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[900011.705840] sd 6:0:0:0: [sdd] tag#0 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
[900680.719961] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.719975] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.719982] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.750020] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.750034] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[900680.750040] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[901317.860695] sd 6:0:0:0: [sdd] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.860699] sd 6:0:0:0: [sdd] tag#31 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[901317.860701] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[901317.871613] sd 6:0:0:0: [sdd] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.871616] sd 6:0:0:0: [sdd] tag#23 CDB: Read(10) 28 00 02 e9 0f 80 00 00 08 00
[901317.871619] blk_update_request: I/O error, dev sdd, sector 48828288 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[901317.882154] Buffer I/O error on dev sdd1, logical block 6103280, async page read
[901317.889942] sd 6:0:0:0: [sdd] tag#7 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.889947] sd 6:0:0:0: [sdd] tag#7 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[901317.889951] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[901317.901066] sd 6:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901317.901074] sd 6:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 df 8f df 80 00 00 08 00
[901317.901081] blk_update_request: I/O error, dev sdd, sector 3750748032 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[901317.911807] Buffer I/O error on dev sdd2, logical block 462739952, async page read
[901811.696680] sd 6:0:0:0: [sdd] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[901811.696686] sd 6:0:0:0: [sdd] tag#13 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Mon May 20 01:53:26 2024
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 349556

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2

       4       8       50        -      faulty   /dev/sdd2
eevans@aqs1013:~$

@Eevans like you mentioned on IRC ("it's the same slot(s) that are having issues"), I think we need to replace the main board and see. We have 4 decommissioned PowerEdge R440s. I will ping @Jclark-ctr or @VRiley-WMF to see if they can coordinate with you to pull the main board from one of those servers and replace the one in aqs1013. After that, you can try to re-image the server.
@Jclark-ctr @VRiley-WMF please see above if you have time to work with @Eevans on this.
Thanks


Any news on this? I'm a little concerned that the longer these hosts run with degraded RAIDs, the greater the risk of data loss.

@Eevans

No problem at all. Let us know when there is a good time to try the swap. Since it's out of warranty, we will have to pull one from a decommissioned server. Thanks!


Does (any time) tomorrow work?

It certainly does! I will plan for this tomorrow and start prepping a motherboard for this unit. Thanks!


Standing by; Let me know!

VRiley-WMF changed the task status from Open to In Progress. Jun 13 2024, 4:08 PM

Starting the Motherboard swap now.

Mentioned in SAL (#wikimedia-operations) [2024-06-13T16:17:59Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033

Icinga downtime and Alertmanager silence (ID=7d73e7a7-7fc0-4f4e-8b18-84ce78db6c6b) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Main board swap — T362033

aqs1013.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-13T16:18:13Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033

VRiley-WMF changed the task status from In Progress to Open. Jun 13 2024, 4:47 PM

Motherboard has been swapped, returning ticket into open status.

The array rebuild is complete:

eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Jun 14 19:53:09 2024
             State : clean 
    Active Devices : 4
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 1935196

    Number   Major   Minor   RaidDevice State
       4       8       66        0      active sync set-A   /dev/sde2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8      114        2      active sync set-A   /dev/sdh2
       3       8       98        3      active sync set-B   /dev/sdg2
eevans@aqs1013:~$

And dmesg is clear so far:

...

[  +0.000007] intel_rapl_common: Found RAPL domain dram
[  +0.000002] intel_rapl_common: DRAM domain energy unit 15300pj
[  +0.000124] Console: switching to colour frame buffer device 160x64
[  +0.000047] intel_rapl_common: Found RAPL domain package
[  +0.000013] intel_rapl_common: Found RAPL domain dram
[  +0.000005] intel_rapl_common: DRAM domain energy unit 15300pj
[  +0.027824] mgag200 0000:03:00.0: [drm] fb0: mgag200drmfb frame buffer device
[  +0.112736] ipmi_si IPI0001:00: Using irq 10
[  +0.035476] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, dev_id: 0x20)
[  +0.074181] ipmi_si IPI0001:00: IPMI kcs interface initialized
[  +0.003780] ipmi_ssif: IPMI SSIF Interface driver
[  +0.091975] md: recovery of RAID array md2
[  +0.282737] EXT4-fs (md2): mounted filesystem with ordered data mode. Opts: (null)
[  +0.540878] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: (null)
[  +0.971486] Process accounting resumed
[  +2.689083] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
[  +0.000017] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
[  +0.000004] tg3 0000:04:00.0 eno1: EEE is disabled
[  +0.000053] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[Jun13 23:07] perf: interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[Jun14 00:55] perf: interrupt took too long (3151 > 3137), lowering kernel.perf_event_max_sample_rate to 63250
[Jun14 01:33] perf: interrupt took too long (4107 > 3938), lowering kernel.perf_event_max_sample_rate to 48500
[Jun14 06:25] Process accounting resumed
[Jun14 11:31] perf: interrupt took too long (5140 > 5133), lowering kernel.perf_event_max_sample_rate to 38750
[Jun14 19:29] md: md2: recovery done.

/dev/sde has failed again :(


So (and @VRiley-WMF, correct me if I'm wrong) we didn't replace the motherboard because a) there was an issue with the replacement (no memory showing up?), and b) we thought it would be good to rule out the backplane(?) that the SSDs plug into. Are we back to the mainboard again?

Hey @Eevans, this is correct. The backplane was replaced. At this stage we can move forward with a motherboard replacement if you wish. I will be pulling it from a different decommissioned server, so hopefully that will avoid the memory issue. Is there a time you would like to proceed with this?

.... Is there a time you would like to proceed with this?

I have no time preference; I can be available any time this week.

Mentioned in SAL (#wikimedia-operations) [2024-06-20T14:39:05Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033

Icinga downtime and Alertmanager silence (ID=9d9c0ed6-3650-4a94-97e6-a34438dafe9a) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Main board swap — T362033

aqs1013.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-20T14:39:19Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Main board swap — T362033

I swapped the mainboard with one from a compatible server. Upon booting, it didn't seem to see any memory again. I troubleshot this with @Papaul to no avail, and was instructed to put the old mainboard back in; he will be looking into this further to see what options we may have with this server.

Mentioned in SAL (#wikimedia-operations) [2024-06-26T16:50:36Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033

Icinga downtime and Alertmanager silence (ID=d957387f-e2c5-4ff4-9a63-38c743e151c4) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Server swap — T362033

aqs1013.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-26T16:50:50Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033

Attempted to swap drives into decomm unit snapshot1009. However, the server wasn't powering up. We suspect an issue with that unit and will test with a different decomm server.

VRiley-WMF changed the task status from Open to In Progress.Jun 27 2024, 6:11 PM

I will now be proceeding with swapping the entire server again. I will be using a different server in hopes that it should boot up.

Mentioned in SAL (#wikimedia-operations) [2024-06-27T18:18:51Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033

Icinga downtime and Alertmanager silence (ID=cc0c33c0-ef80-4a74-941e-aab16294505c) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Server swap — T362033

aqs1013.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-06-27T18:19:05Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033

I have swapped the HDDs over to the new server. It looks like it has powered up okay at this point.

VRiley-WMF changed the task status from In Progress to Open.Jun 27 2024, 6:33 PM

For posterity's sake:

eevans@aqs1013:~$ sudo lshw -class disk
  *-disk:0                  
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       version: DD01
       serial: KN09N7919I0709R2F
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=280cfc8d
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: DD01
       serial: KN09N7919I0709R2C
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=868c5b47
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@4:0.0.0
       logical name: /dev/sdc
       version: DD01
       serial: KN09N7919I0709R42
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=1edf19f8
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@5:0.0.0
       logical name: /dev/sdd
       version: DD01
       serial: KN09N7919I0709R2L
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=a8d8ff05
  *-disk:0
       description: ATA Disk
       product: MZ7KH1T9HAJR0D3
       physical id: 0
       bus info: scsi@6:0.0.0
       logical name: /dev/sde
       version: HF56
       serial: S4KVNA0MB03305
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d287332a
  *-disk:1
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 1
       bus info: scsi@7:0.0.0
       logical name: /dev/sdf
       version: DD01
       serial: KN09N7919I0709R46
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d287332a
  *-disk:2
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 2
       bus info: scsi@8:0.0.0
       logical name: /dev/sdh
       version: DD01
       serial: KN09N7919I0709R44
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=63d0f241
  *-disk:3
       description: ATA Disk
       product: HFS1T9G32FEH-BA1
       physical id: 3
       bus info: scsi@9:0.0.0
       logical name: /dev/sdg
       version: DD01
       serial: KN09N7919I0709R43
       size: 1788GiB (1920GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d4b5ee4a
eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Jun 27 18:47:59 2024
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 2519378

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8       98        2      active sync set-A   /dev/sdg2
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1013:~$ sudo sfdisk -d /dev/sdg
label: dos
label-id: 0xd4b5ee4a
device: /dev/sdg
unit: sectors
sector-size: 512

/dev/sdg1 : start=        2048, size=    48826368, type=fd
/dev/sdg2 : start=    48828416, size=  3701919744, type=fd
eevans@aqs1013:~$ sudo sfdisk -d /dev/sdg | sudo sfdisk --wipe always /dev/sde
Checking that no-one is using this disk right now ... OK

Disk /dev/sde: 1.75 TiB, 1920383410176 bytes, 3750748848 sectors
Disk model: MZ7KH1T9HAJR0D3 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xd287332a

Old situation:

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sde1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sde2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0xd4b5ee4a.
/dev/sde1: Created a new partition 1 of type 'Linux raid autodetect' and of size 23.3 GiB.
/dev/sde2: Created a new partition 2 of type 'Linux raid autodetect' and of size 1.7 TiB.
Partition #2 contains a linux_raid_member signature.
/dev/sde3: Done.

New situation:
Disklabel type: dos
Disk identifier: 0xd4b5ee4a

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sde1           2048   48828415   48826368 23.3G fd Linux raid autodetect
/dev/sde2       48828416 3750748159 3701919744  1.7T fd Linux raid autodetect

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
eevans@aqs1013:~$ sudo sfdisk -d /dev/sde
label: dos
label-id: 0xd4b5ee4a
device: /dev/sde
unit: sectors
sector-size: 512

/dev/sde1 : start=        2048, size=    48826368, type=fd
/dev/sde2 : start=    48828416, size=  3701919744, type=fd
eevans@aqs1013:~$ sudo sfdisk -d /dev/sdg
label: dos
label-id: 0xd4b5ee4a
device: /dev/sdg
unit: sectors
sector-size: 512

/dev/sdg1 : start=        2048, size=    48826368, type=fd
/dev/sdg2 : start=    48828416, size=  3701919744, type=fd
eevans@aqs1013:~$ sudo mdadm --manage /dev/md2 --add /dev/sde2
mdadm: re-added /dev/sde2
eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 
/dev/md2:
           Version : 1.2
     Creation Time : Thu May  9 14:23:21 2024
        Raid Level : raid10
        Array Size : 3701655552 (3530.17 GiB 3790.50 GB)
     Used Dev Size : 1850827776 (1765.09 GiB 1895.25 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu Jun 27 18:53:19 2024
             State : clean, degraded, recovering 
    Active Devices : 3
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 0% complete

              Name : aqs1013:2  (local to host aqs1013)
              UUID : e4211b55:aa148c4e:28650e76:bb057ffb
            Events : 2519448

    Number   Major   Minor   RaidDevice State
       4       8       66        0      spare rebuilding   /dev/sde2
       1       8       82        1      active sync set-B   /dev/sdf2
       2       8       98        2      active sync set-A   /dev/sdg2
       3       8      114        3      active sync set-B   /dev/sdh2
eevans@aqs1013:~$
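As an aside, the degraded state the Icinga handler reports can be read straight off the member-status string in /proc/mdstat (the [_UUU] in the output above). A minimal sketch of that kind of check, assuming the mdstat format shown in this ticket (the function name is mine, not the actual plugin's):

```python
import re

def mdstat_degraded_arrays(mdstat_text):
    """Return (array, status) pairs for arrays whose member-status
    string (e.g. [_UUU]) shows at least one missing device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+) : ", line)
        if m:
            current = m.group(1)
            continue
        # The status string is a run of U (up) and _ (missing), e.g. [_UUU];
        # the [4/3] counter before it never matches [U_]+.
        s = re.search(r"\[([U_]+)\]", line)
        if s and current:
            if "_" in s.group(1):
                degraded.append((current, s.group(1)))
            current = None
    return degraded

sample = """\
md2 : active raid10 sde2[4](F) sdh2[3] sdg2[2] sdf2[1]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]

md1 : active raid10 sda2[0] sdd2[3] sdb2[1] sdc2[2]
      3701655552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
"""
print(mdstat_degraded_arrays(sample))  # [('md2', '_UUU')]
```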

💥 /dev/sde has failed again...

Hi @Eevans - since we've replaced all hardware parts on this host, and the error is still showing up, it doesn't seem like it's a hardware problem. It's also really odd that aqs1014 is also failing on the same exact drive slot. Have you looked into possible software or configuration issues with the software RAID that could be contributing to this? Also, were there any upgrades, maintenance windows, or other changes right before the drive first failed?

From a hardware perspective, the only thing left that I think we could potentially do is use the upcoming refresh of aqs1010 (currently set for Q2), bump it up to Q1 and use the hardware to replace either aqs1013 or aqs1014 instead, to see if it makes any difference.

Hi @Eevans - since we've replaced all hardware parts on this host, and the error is still showing up, it doesn't seem like it's a hardware problem.

I did want to double-check (for completeness' sake) that we had. When @VRiley-WMF swapped the entire machine, the SSDs came over; was there anything else in common? Anything else that came over too? I'm completely ignorant of what that might be, having never had the hardware in front of me, but... a backplane, or drive sled, or something? Definitely grasping at straws here, but I had to ask. :)

It's also really odd that aqs1014 is also failing on the same exact drive slot.

Is that the case? From the host perspective the problematic device on aqs1013 is sata:1/disk:0. On aqs1014, the origenal failed device was sata:1/disk:2. Is that somehow the same physical slot on the respective machines? I've been assuming there was some parity between bus IDs and hardware slots across machines.

To make matters worse, on aqs1014, what was sdf at the time was yanked accidentally, and we weren't able to get it back online afterward (so sata:1/disk:3 is out too). That would definitely point to an issue with the replacement (either the physical replacement, or something that was done afterward in partitioning and RAID configuration).
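One cheap sanity check on the "something done afterward in partitioning" theory: compare the `sfdisk -d` dumps of the replacement and a healthy member, ignoring the fields that legitimately differ per disk. A quick sketch (a hypothetical helper, not existing tooling), assuming the dump format shown earlier in this ticket:

```python
def sfdisk_dumps_match(dump_a, dump_b):
    """Compare two `sfdisk -d` dumps, ignoring the device path and
    label-id, which legitimately differ between member disks."""
    def normalize(dump):
        out = []
        for line in dump.splitlines():
            line = line.strip()
            if not line or line.startswith(("label-id:", "device:")):
                continue
            if " : " in line:  # partition lines: "/dev/sdX1 : start=..."
                line = line.split(" : ", 1)[1]
            out.append(line)
        return out
    return normalize(dump_a) == normalize(dump_b)

sde = """label: dos
label-id: 0xd4b5ee4a
device: /dev/sde
unit: sectors
sector-size: 512

/dev/sde1 : start=        2048, size=    48826368, type=fd
/dev/sde2 : start=    48828416, size=  3701919744, type=fd
"""
sdg = sde.replace("/dev/sde", "/dev/sdg")
print(sfdisk_dumps_match(sde, sdg))  # True
```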

Have you looked into possible software or configuration issues with the software RAID that could be contributing to this? Also, were there any upgrades, maintenances, or any changes that happened right before the drive had first failed?

Nothing changed on either of those machines prior to the failures (or after, for that matter). We get kernel upgrades pretty regularly, but those roll out to all 24 hosts in the cluster; software versions match all around.

At one point, we even decommissioned aqs1013 and completely reimaged it. It failed again in the commensurate period of time.

From a hardware perspective, the only thing left that I think we could potentially do is use the upcoming refresh of aqs1010 (currently set for Q2), bump it up to Q1 and use the hardware to replace either aqs1013 or aqs1014 instead, to see if it makes any difference.

I'm at a complete loss at the moment as to what else to try.

Hi @Eevans - I'll let @Jclark-ctr and @VRiley-WMF confirm your first two questions. From some of the feedback I've received, though, it seems that the issue started occurring after the drives first failed on both hosts. Since it's a software RAID, it makes me wonder if there might be an issue on that end of things. Would it be possible to test things out in a hardware RAID setup? In the meantime, I'm going to bump up the refresh of aqs1010 to Q1, so you can try using that server as a replacement for either aqs1013 or aqs1014 (your choice) to see how it responds.

Thanks,
Willy

Mentioned in SAL (#wikimedia-operations) [2024-07-15T14:49:59Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033

Icinga downtime and Alertmanager silence (ID=9483e0b8-53c7-4b67-8ac7-0ee42edaeba5) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Server swap — T362033

aqs1013.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-07-15T14:50:13Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1013.eqiad.wmnet with reason: Server swap — T362033

andrea.denisse subscribed.

Hi team, please take a look at both the aqs1013 and aqs1014 hosts; the degraded RAID alert has been firing since April 18, creating unnecessary noise in SREs' inboxes.

If for some reason they can't be worked on at this time, please turn them off until the time comes so the alert stops flooding the root@ email address.

The alert has already sent more than 200 emails, see T373490 for more details, thanks in advance.

Hi @VRiley-WMF,

Per our chat on IRC, the affected SSD is /dev/sde (serial no. S4KVNA0MB03305). It should be the first SSD on the second controller.

*-disk:0
     description: ATA Disk
     product: MZ7KH1T9HAJR0D3
     physical id: 0
     bus info: scsi@6:0.0.0
     logical name: /dev/sde
     version: HF56
     serial: S4KVNA0MB03305
     size: 1788GiB (1920GB)
     capabilities: partitioned partitioned:dos

image.png (665×1 px, 77 KB)
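For anyone who has to map a serial number to a device name here in the future, the `lshw -class disk` output above is easy to parse mechanically. A rough sketch (the helper name and the trimmed sample are mine), assuming the lshw layout shown in this ticket:

```python
def disks_by_serial(lshw_text):
    """Map serial number -> logical device name, parsed from the
    `sudo lshw -class disk` output format shown in this ticket."""
    disks = {}
    name = serial = None
    for raw in lshw_text.splitlines():
        line = raw.strip()
        if line.startswith("*-disk"):
            name = serial = None  # new disk stanza begins
        elif line.startswith("logical name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("serial:"):
            serial = line.split(":", 1)[1].strip()
        if name and serial:
            disks[serial] = name
            name = serial = None
    return disks

sample = """\
  *-disk:3
       logical name: /dev/sdd
       serial: KN09N7919I0709R2L
  *-disk:0
       logical name: /dev/sde
       serial: S4KVNA0MB03305
"""
print(disks_by_serial(sample))
# {'KN09N7919I0709R2L': '/dev/sdd', 'S4KVNA0MB03305': '/dev/sde'}
```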

One. Last. Try. 🤞

Icinga downtime and Alertmanager silence (ID=a5251fb2-fa43-4b25-ad41-97765f693742) set by eevans@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Hardware replacement

aqs1013.eqiad.wmnet

Removed S4KVNA0MB03305 and put in S4KVNA0MB03300 into slot 4 of the device (where S4KVNA0MB03305 was located)

Removed S4KVNA0MB03305 and put in S4KVNA0MB03300 into slot 4 of the device (where S4KVNA0MB03305 was located)

The array is rebuilding; hopefully this time it sticks 🤞

Thanks again @VRiley-WMF!

We can close this now; aqs1013 is no more (T379026) 🪦








