We had a disk failure on one of our Xen servers at work last week, and what we thought would be a quick disk replacement turned into a small nightmare.

Our setup is fairly “simple”: two RAID1 arrays, consisting of sda1/sdb1 (/dev/md0, mounted at /) and sda3/sdb3 (/dev/md1, with LVM on top of it).

mdadm reported that sdb1 and sdb3 had failed, so we just had to identify which disk was sdb in the server and replace it. Well, it wasn’t easy to see which disk was which once we had opened the server, so we decided to boot the server again to look up the drives’ serial numbers (using hdparm -I /dev/sda, and comparing against the small barcode on the front of each disk).

Now the fun part starts. The contents of /proc/mdstat showed something like this after the reboot:

Personalities : [raid1]
md1 : active raid1 sdb3[0] sda3[1] (F)
      235400320 blocks [2/1] [U_]

md0 : active raid1 sdb1[0] (F) sda1[1]
      7815488 blocks [2/1] [_U]

unused devices: <none>

On md0, sdb1 is marked as failed, and on md1 it’s sda3, so one partition is marked failed on each drive. Here we made the big mistake: we decided to re-add sdb1 to md0 and sda3 to md1.
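In hindsight, the thing to stare at is which member carries the (F) flag in each array, because a re-added member gets overwritten from the other one. As a rough illustration, here is a minimal Python sketch that pulls the failed members out of mdstat-style text. The names failed_members, MEMBER and MDSTAT are my own, the sample text is modelled on the output above, and the parsing only assumes the "name[slot]" plus optional "(F)" layout; it is not a full parser for everything /proc/mdstat can contain.

```python
import re

# Sample text modelled on the output in this post; on a live system one
# would read the contents of /proc/mdstat instead.
MDSTAT = """\
Personalities : [raid1]
md1 : active raid1 sdb3[0] sda3[1] (F)
      235400320 blocks [2/1] [U_]

md0 : active raid1 sdb1[0] (F) sda1[1]
      7815488 blocks [2/1] [_U]

unused devices: <none>
"""

# A member looks like "sda3[1]", optionally followed by "(F)" when faulty.
# Whitespace before "(F)" is tolerated, so "sda3[1](F)" matches as well.
MEMBER = re.compile(r"(\w+)\[(\d+)\]\s*(\(F\))?")


def failed_members(mdstat_text):
    """Return {array name: [failed member, ...]} for each mdN line found."""
    result = {}
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        # Only lines of the form "mdN : ..." describe an array.
        if not sep or not re.fullmatch(r"md\d+", name):
            continue
        result[name] = [dev for dev, _slot, flag in MEMBER.findall(rest) if flag]
    return result


if __name__ == "__main__":
    for array, failed in sorted(failed_members(MDSTAT).items()):
        print(array, "failed:", ", ".join(failed) or "none")
```

Run against the sample, this reports sdb1 as the failed member of md0 and sda3 as the failed member of md1, which is exactly the "one failed partition per drive" situation that should have made us stop and think.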

While the RAID was syncing there were a lot of disk errors on sda1 and sda3, so we identified sda using its serial number, shut down the server, replaced the disk, booted, and everything looked fine.

Fast forward to the next day: we started receiving e-mails from customers saying data was missing from their sites, and they were missing data from the day the drive failed… Then it dawned on us: when we re-added sda3, it was overwritten with the old data from sdb3 :(. Only one thing to do: restore from backup.

Now the question is: why the hell was sda3 marked as failed after the reboot? It was on the good drive…