A Bad Day for BTRFS Users

Posted on: 2022-02-01

About 10 days ago, on January 20th, I updated my Arch installation before going to bed. The update brought my kernel version up to 5.16.1.arch1-1.

After booting up my PC the next morning, I immediately experienced horrible system-wide stutter. htop showed a kernel thread called btrfs-cleaner eating up 100% of one CPU core. The first thing that came to my mind was my snapshots: I had Timeshift running in the background, taking disk snapshots on a daily and hourly basis. A peek into the Timeshift GUI showed that I indeed had 20+ snapshots piling up. I deleted most of them with the timeshift CLI. After a reboot, the constant lag was gone; what remained were lag spikes lasting only a few seconds, roughly every 15 minutes.
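For reference, the cleanup from the command line looks roughly like this (a sketch from memory; the snapshot name is just an example, check timeshift --help for the exact flags on your version):

sudo timeshift --list                                          # list all snapshots and their names
sudo timeshift --delete --snapshot '2022-01-15_22-00-01'       # delete a single snapshot by name
sudo timeshift --delete-all                                    # or wipe every snapshot at once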

Looking into this

Some time later, I decided to look further into the strange btrfs-cleaner process. I also noticed other kernel threads like btrfs-transaction and btrfs-endio occupying abnormal amounts of CPU time. A Google search turned up a bunch of old posts, some about disk fragmentation problems, some where the high CPU usage was solved by rebalancing, and so on. I thought, okay, this isn't a new issue and the fix should be relatively simple; I'll just try the suggestions one by one later.

Additionally, I found a newer thread on the Linux mailing list with the subject Massive I/O usage from btrfs-cleaner after upgrading to 5.16, which definitely fit what was happening. After skimming through it, the gist seemed to be that there is a bug in the defragmentation code, which gets stuck attempting to defrag a 1-byte file.

My attempted fixes

Combined with what I had picked up from all those old posts, I somehow interpreted this as "the autodefrag option doesn't work, and a manual defrag is needed". I remember it was around 9 PM when I typed btrfs filesystem defragment -rvf / into the console. After a few lines of output it seemed stuck, but the process was still running with some CPU usage. Figuring it was probably just doing its thing, I left the computer on overnight.
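For the record, here is what those flags do, as documented in btrfs-filesystem(8):

btrfs filesystem defragment -rvf /
#   -r  recurse into directories and defragment the files underneath
#   -v  verbose output, print files as they are processed
#   -f  flush data for each file before moving on to the next one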

The next day I woke up at 7:30 for classes. The defrag process was still running, and I was forced to do a hard poweroff because both Ctrl+C and SIGKILL did nothing. After powering on again, nothing had really changed. Classes lasted a few hours, and afterwards I continued my investigation. I tried running btrfs filesystem balance start -dusage=10 / and it actually did something: the balance finished after about 30 seconds, and the lag spikes were half gone, lasting much shorter and occurring less often. Running balance with -dusage=15 stopped the lag spikes completely and kept the CPU usage of btrfs-cleaner under 5%, which was acceptable for me. I pretty much expected another system update to fix the issue completely.
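The -dusage filter is what keeps these balances quick: it only rewrites data block groups that are at most N% full, instead of shuffling the entire filesystem. What I ran boils down to:

btrfs filesystem balance start -dusage=10 /   # rebalance only data block groups that are <=10% used
btrfs filesystem balance start -dusage=15 /   # then widen the filter to <=15%
btrfs balance status /                        # check on a running balance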

The horror

Last night, on January 31st, u/TueOct5 posted "PSA: Linux 5.16 has major regression in btrfs causing extreme IO load" on the self-hosted subreddit, stating that the bug combined with autodefrag had caused his system to write to his SSD constantly, accumulating 188 TB of writes in 10 days. My heart dropped the first time I read this, because I hadn't even considered SSD wear when I was trying to fix the problem, and I knew autodefrag was on in my setup. What's even worse is that the manual defrag I had run probably did even more damage than autodefrag. But it was already midnight, and I was exhausted from all the Chinese Lunar New Year festivities. Lying in bed, knowing my SSD had probably lost something like half of its lifespan, I figured dwelling on it wasn't going to do any good, and I should just rest. It was not a good sleep.

Now it's February 1st. I woke up in the morning and checked my /etc/fstab before even having breakfast.

# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a device; this may
# be used with UUID= as a more robust way to name devices that works even if
# disks are added and removed. See fstab(5).
#
# <file system>             <mount point>  <type>  <options>  <dump>  <pass>
UUID=2bc561d8-84b3-42b7-acef-b2f826854ffb /              btrfs   subvol=/@,defaults,noatime,space_cache,autodefrag,compress=lzo 0 1

And autodefrag sure is on. I removed it from all mount points, and after a reboot everything went back to normal: no more btrfs-cleaner eating up my CPU. Now, the most important question is: how much damage did it actually cause?
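If you just want to know whether autodefrag is active right now, the live mount options are quicker to check than fstab (a small sketch; the remount line should disable it on the spot, though I simply rebooted):

findmnt -no OPTIONS / | tr ',' '\n' | grep defrag   # prints "autodefrag" if the option is active
sudo mount -o remount,noautodefrag /                # disable it without rebooting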

Damage assessment

It took me about 15 minutes to figure out how to interpret the "Total LBAs Written" field of the SMART data. According to this StackExchange post, I should multiply the Total LBAs Written value by the logical block/sector size. Running smartctl -a /dev/sdc gives me both values.

=== START OF INFORMATION SECTION ===
Sector Size:      512 bytes logical/physical
...

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       79405462234

So I multiplied 512 bytes by 79405462234. My first attempt at the conversion gave me a figure around 37878 that made no sense as TB (in hindsight, it was in GiB), so in the end I used Samsung's own SSD management software (AUR) instead, which displays the total written directly.

---------------------------------------------------------------------------------------------------------------------------------------------
| Disk   | Path     | Model                     | Serial               | Firmware | Optionrom | Capacity | Drive  | Total Bytes | NVMe Driver |
| Number |          |                           | Number               |          | Version   |          | Health | Written     |             |
---------------------------------------------------------------------------------------------------------------------------------------------
| 2      | /dev/sdc | Samsung SSD 860 EVO 500GB | S4XDNF0MB39562P      | RVT03B6Q | N/A       |   465 GB | GOOD   | 36.98 TB    | N/A         |
---------------------------------------------------------------------------------------------------------------------------------------------
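Just to sanity-check the unit conversion, using the raw SMART value and the 512-byte sector size from the smartctl output above:

echo "79405462234 * 512" | bc                   # 40655596663808 bytes written in total
echo "scale=2; 79405462234 * 512 / 10^12" | bc  # ~40.65 (decimal terabytes)
echo "scale=2; 79405462234 * 512 / 2^40" | bc   # ~36.97 TiB, essentially the 36.98 TB above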

36 TB written on an SSD I've been using for almost 2 years definitely falls within the range of normal usage. Somehow I stopped the bug before it could wear out my drive and saved it from that fate. Just wow.

Takeaways

  1. BTRFS is probably not the most stable file system out there. Based on my experience playing with SBCs (Raspberry Pi and Chinese knockoffs), ext4 is certainly the best file system in terms of stability. For RAID setups, ZFS is a good choice, although resizing the pool can be a pain. The point is, BTRFS is not the best in terms of stability, so unless you actively use its extra features, just use something else.

  2. Cutting-edge updates probably aren't the best if you want stability. I'm absolutely moving to the LTS kernel after all this horror. In fact, I don't even know why I didn't do that in the first place; it's not like I'm actively playing with new kernel features or anything. So: LTS kernel, definitely a must for stability. Maybe even consider the hardened kernel for additional security.

Thanks for reading and happy new year.
