Bcache - Awesome and terrible at the same time

Should I shell out thousands of dollars for a decently big SSD, suffer terrible hard drive performance, or have to live with both and manually moving data between them?

Bcache gives me the best of both! Or at least so it claims…

That is until you actually start to use bcache, and realise that despite being stable for almost a decade, it isn’t really stable. Unless you like delving into the source code, you’re unlikley to find the pitfalls until after it’s eaten your data.

Data Eating Method 1

Use bcache on a desktop PC (for example Debian). Cleanly reboot PC. Yay - your data might be gone.

Since bcache is built on your hard disk and ssd, it requires both to work. However, the bcache default udev scripts allow your filesystem to be mounted and used while the SSD isn’t yet initialized on bootup. They helpfully mark the cache as ‘stale’. That doesn’t just slow down booting - it means data written on bootup goes straight to your disk, and then when the cache is added later, the cache might have a copy of the same out-of-date blocks. If your filesystem driver sees those blocks, it will barf, try to ‘correct’ the errors, and corrupt all your data. All because you just rebooted.

Data Eating Method 2

You check the man pages. They clearly state:

If a backing device has data in a cache somewhere, the /dev/bcache device won't be created until the cache shows up - particularly important if you have writeback caching turned on.

Excellent - Data Eating Method 1 above can’t happen. I must have imagined it. Lets restore from backup and try again.

NOPE! The documentation is wrong. The bcache device will be created in the STALE state

Data Eating Method 3

You know about data eating method 1 & 2, so you’re careful. You read the man pages and make sure the stop_when_cache_set_failed is set to AUTO. You need to do that before udev initializes or your data will be eaten - and no, it’s not a persistent option, so it’s going to need to be set on every boot. That should mean if the cache isn’t working and theres any dirty data in the cache, the device can’t be used so-as not to do any damage.

WRONG! AUTO confusingly means ‘gobble my data, but do it later rather than now’. Since AUTO doesn’t stop the cache device being reattached later (automatically through udev), you have the same issue as #1.

Data Eating Method 4

You read the source code of linux-kernel/drivers/md/bcache/super.c to find out about #3, and set the option to STOP_ALWAYS. That should stop it gobbling my data!

Nope - The whole option only applies after IO errors occur, not when the cache set is failed due to not yet being attached.

Data Eating Method 5

Use the extra safe writethrough mode. That way, all writes will always go to your hard disk, which will always be consistent. Performance is ruined, but it’s better than lost data.

NO! Even in writethrough mode, the cache can end up not being invalidated on writes (if it isn’t attached at mount time). So you have the same issue as #1.

Data Eating Method 6

Patch the code to never allow any reads and writes to stale cached devices unless a new special eat_my_data flag is set. Thats the only safe way. Build your own kernel module.

Ha Ha… Even that isn’t enough. There are a pile of race conditions that mean if IO to the cache device or backing device is already in progress when either device vanishes (for example a power failure), there is no way to track state. When the system comes back up, an in-progress write can appear to have completed, only to ‘undo’ again to a previous version of that block when the cached version is deleted. Your filesystem will hate that.

The solution

I don’t have one… I’m torn between “give up and spend the $2000 to move all my data to SSD’s that I should have done a week ago” and “patch bcache to invalidate the whole cache on any unclean mount, and use writeback mode despite bad performance”

Lesson learned - just cos Linus OK’ed it, doesn’t mean its good to use.

Olivers Place

Random stuff I do...