
Undoing a RAID With mdadm Without Reinstalling the Operating System

Sunday, May 8, 2022
Reading time 7 minutes

I currently have a dedicated server at Hetzner where I host a few personal things. A couple of days ago I needed to remove one of the server's two hard drives from the software RAID that comes configured by default in all Hetzner images, at least in the Debian-based Linux systems that can be installed through the Hetzner Robot.

Normally the RAID configuration these servers ship with is fine (keeping in mind that RAID provides redundancy, not backups), but I needed that second disk for something else, so I had to reconfigure the RAID so it no longer included it. The interesting part is that, unlike what Hetzner's official guide suggests, in this case I had to modify the RAID rather than destroy it, to avoid booting into Hetzner's rescue mode and to avoid pushing my luck with a full reinstall if I broke the whole software RAID.

This post is simply a write-up about how I achieved this, in case it’s useful in the future.

Note: if you read this and want to try it yourself, be aware that a bad configuration or a mistyped command can leave you with a software RAID that no longer works as it should, especially if you accidentally remove partitions from a different disk than the one you mean to work on. Hardware damage is not a concern in this kind of operation, but you can ruin the whole RAID and make it very hard to rescue the operating system.

Reviewing the Software Raid on the system

Before touching anything, it's important to find out where we stand in terms of RAID configuration. The following command gives us an overview of the current situation:

root@Debian-1100-bullseye-amd64-base ~ # lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop1     7:1    0 110.6M  1 loop  /snap/core/12834
loop2     7:2    0 111.6M  1 loop  /snap/core/12941
loop3     7:3    0  43.9M  1 loop  /snap/certbot/2035
loop4     7:4    0  43.9M  1 loop  /snap/certbot/1952
loop5     7:5    0  61.9M  1 loop  /snap/core20/1405
loop6     7:6    0  61.9M  1 loop  /snap/core20/1434
sda       8:0    0   3.6T  0 disk
├─sda1    8:1    0    16G  0 part
│ └─md0   9:0    0    16G  0 raid1 [SWAP]
├─sda2    8:2    0   512M  0 part
│ └─md1   9:1    0   511M  0 raid1 /boot
├─sda3    8:3    0     2T  0 part
│ └─md2   9:2    0     2T  0 raid1 /
├─sda4    8:4    0   1.7T  0 part
│ └─md3   9:3    0   1.7T  0 raid1 /home
└─sda5    8:5    0     1M  0 part
sdb       8:16   0   3.6T  0 disk
├─sdb1    8:17   0    16G  0 part
│ └─md0   9:0    0    16G  0 raid1 [SWAP]
├─sdb2    8:18   0   512M  0 part
│ └─md1   9:1    0   511M  0 raid1 /boot
├─sdb3    8:19   0     2T  0 part
│ └─md2   9:2    0     2T  0 raid1 /
├─sdb4    8:20   0   1.7T  0 part
│ └─md3   9:3    0   1.7T  0 raid1 /home
└─sdb5    8:21   0     1M  0 part

From this we can extract the following information:

  • We have two disks (sda and sdb), which have the same number of partitions.
  • In our software RAID there are 4 partitions: md0 (swap space), md1 (the boot partition, mounted at /boot), md2 (the system partition, mounted at /) and md3 (the home partition, mounted at /home).
  • The partitions mentioned in the previous point are configured as RAID 1. This means that every time something changes, the same data is synchronized to both disks, sda and sdb, so that in practice one disk is a mirror of the other.
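Everything we've just read from lsblk can also be confirmed through /proc/mdstat, which exposes the kernel's live view of the arrays. This is an optional quick check, not part of Hetzner's setup:

root@Debian-1100-bullseye-amd64-base ~ # cat /proc/mdstat

Each of md0 to md3 should be listed as an active raid1 with both member partitions (sdaX and sdbX) and a [UU] status, meaning both mirrors are currently in sync.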

Strategy to follow

In this case, what we're going to do is remove the second disk (sdb) from the RAID. For each RAID partition we have to mark its sdb member as faulty and remove it from the RAID, then resize the RAID so it's configured to use a single disk instead of the expected two; this has to be repeated for all 4 partitions (md0, md1, md2 and md3). At the end of these steps you'll be able to restart the server and have the second disk available, no longer linked to the RAID in any way, although it must be partitioned and formatted before it can be used.
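As a preview, and assuming the mdX-to-sdbX mapping described in the next section holds (md0 on sdb1, md1 on sdb2, md2 on sdb3, md3 on sdb4), the whole procedure condenses into a small loop like the following sketch. Each step is explained individually below, which is the safer way to go the first time:

# Sketch only: detach sdb from every array, verifying the mapping first with mdadm -D
for i in 0 1 2 3; do
    part="/dev/sdb$((i + 1))"
    mdadm "/dev/md$i" --fail "$part"
    mdadm "/dev/md$i" --remove "$part"
    mdadm --grow "/dev/md$i" --raid-devices=1 --force
done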

Step 1. Reviewing the partition in the raid

These steps must be repeated for each partition in the RAID. This example shows the procedure for md0, but the same has to be done for the other 3. Pay close attention to which is the RAID partition (mdX) and which is the physical disk partition (sdbX): normally md0 is backed by the first partition of the disk (sdb1), md1 by sdb2, and so on, but always verify it as shown below.

The first thing to do is check the status of the RAID partition we'll be working on, using the following command:

root@Debian-1100-bullseye-amd64-base ~ # mdadm -D /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Fri Nov 26 21:01:16 2021
        Raid Level : raid1
        Array Size : 16759808 (15.98 GiB 17.16 GB)
     Used Dev Size : 16759808 (15.98 GiB 17.16 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Sun May  8 09:32:50 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : rescue:0
              UUID : 252735b5:26dfa166:e5010586:ff4ac61e
            Events : 66

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

The most important information here is in the last two lines of the output, which show the partitions on each disk that make up md0. Since we're going to detach the sdb disk from the RAID, what we need to know is that the partition to remove from this array is sdb1.
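If you want to see at a glance which sdb partition backs each of the four arrays before touching anything, a quick (and entirely optional) way is to loop over them and filter the detail output for sdb:

root@Debian-1100-bullseye-amd64-base ~ # for md in /dev/md0 /dev/md1 /dev/md2 /dev/md3; do echo "$md"; mdadm -D "$md" | grep sdb; done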

Step 2. Marking the partition as faulty

If the RAID is healthy (no SMART errors have been detected), mdadm, the program that manages the software RAID, won't let us detach a partition just like that. To remove a physical partition from a RAID built with mdadm, we first have to mark that partition as failed; only then can we remove it from the RAID and detach it. The job is done when every partition associated with the RAID on that disk has been detached, at which point the disk is free.

To mark a partition as failed, we use the following command, substituting the RAID partition (in this case md0) and the disk partition we want to detach (in this case sdb1):

root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
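For reference, and assuming the same layout that lsblk showed above (md1 on sdb2, md2 on sdb3, md3 on sdb4), the equivalent commands for the remaining arrays would be:

root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md1 --fail /dev/sdb2
root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md2 --fail /dev/sdb3
root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md3 --fail /dev/sdb4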

Step 3. Removing the failed partition from the raid

Once we've told mdadm that partition sdb1 has failed, we can remove it from the RAID, which prepares everything to detach it for good later on. To remove it, we use this command:

root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
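Again assuming the md1/sdb2, md2/sdb3 and md3/sdb4 mapping from the lsblk output, the removal commands for the remaining arrays, once their partitions have been marked as failed, would be:

root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md1 --remove /dev/sdb2
root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md2 --remove /dev/sdb3
root@Debian-1100-bullseye-amd64-base ~ # mdadm /dev/md3 --remove /dev/sdb4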

Step 4. Configuring the raid to use only one disk

The final step for this partition is to resize the RAID so it knows it should use only one device; otherwise it would remain flagged as degraded. In mdadm terms, what we do is "grow" the RAID so that it expects a single device, which looks odd at first glance, but here is the command:

root@Debian-1100-bullseye-amd64-base ~ # mdadm --grow /dev/md0 --raid-devices=1 --force
raid_disks for /dev/md0 set to 1
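The same resize then has to be applied to the remaining arrays. After growing each one to a single device, re-running mdadm -D against it should report Raid Devices : 1 with no failed members:

root@Debian-1100-bullseye-amd64-base ~ # mdadm --grow /dev/md1 --raid-devices=1 --force
root@Debian-1100-bullseye-amd64-base ~ # mdadm --grow /dev/md2 --raid-devices=1 --force
root@Debian-1100-bullseye-amd64-base ~ # mdadm --grow /dev/md3 --raid-devices=1 --force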

Conclusion

Once these steps have been performed on every RAID partition, it's safe to restart the server. We're left with a single-device RAID, which may seem strange, but at least it means we didn't have to reinstall the operating system. To check the status of the new RAID, we can use either mdadm -D /dev/md0 (changing the partition number to inspect each one) or lsblk to list all storage devices on the system. In both cases we should see that the sdb disk is no longer associated with the RAID, which means it can now be partitioned, formatted and used at will.
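One extra precaution before reusing the disk, which isn't strictly part of the procedure above: the detached sdb partitions still carry their old mdadm superblocks, and clearing them prevents anything from trying to re-assemble them on a later boot. A minimal sketch, assuming sdb1 through sdb4 were the RAID members as on this server:

root@Debian-1100-bullseye-amd64-base ~ # mdadm --zero-superblock /dev/sdb1
root@Debian-1100-bullseye-amd64-base ~ # mdadm --zero-superblock /dev/sdb2
root@Debian-1100-bullseye-amd64-base ~ # mdadm --zero-superblock /dev/sdb3
root@Debian-1100-bullseye-amd64-base ~ # mdadm --zero-superblock /dev/sdb4

After that, sdb can be repartitioned and formatted like any other blank disk.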

Source

Linux Administration Tutorials

Tags: mdadm, Raid, Debian