
Scaling Btrfs to petabytes in production: a 74% cost reduction story

Learn how Chronosphere reduced storage costs by 74% by scaling Btrfs to petabytes of time-series data in production on Google Cloud Platform.
Mar 18th, 2026 5:00am by

At Chronosphere, we saved 74% of our storage costs by moving petabytes of time-series data from ext4 to Btrfs, the copy-on-write filesystem for Linux. And if you’re interested, you can experimentally run Btrfs on Google Cloud Platform today. Below, in an expanded blog post based on a talk we gave at FOSDEM 2026, we unpack how we did it — without blowing up production.

What is Btrfs?

Linux supports multiple filesystems. A filesystem is the part of the operating system that governs how files are organized and accessed. Btrfs is one of several filesystems supported by the Linux kernel; beyond basic file organization and access, it offers features less commonly found in other filesystems, such as transparent checksumming, transparent compression, and copy-on-write. Btrfs is the filesystem we will be discussing in this blog post.

Step 1: savings from compression

To understand how compression works in Btrfs, let’s go through an example. The first 10^9 (one billion) bytes of an English Wikipedia dump (enwik9) take 1000MB (953MiB) uncompressed but, once transparently compressed by Btrfs, only 340MiB on disk:
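
A minimal sketch of how one might reproduce this measurement on a Btrfs volume (the device path and mount point are illustrative, not our production setup; requires root):

```shell
# Mount a Btrfs volume with transparent zstd compression (level 1).
mount -o compress=zstd:1 /dev/sdb1 /mnt/btrfs

# Copy the enwik9 dump in; Btrfs compresses it transparently on write.
cp enwik9 /mnt/btrfs/
sync

# compsize reports on-disk (compressed) vs. uncompressed size per file.
compsize /mnt/btrfs/enwik9
```

compsize prints a per-compression-type breakdown; its disk-usage column is what the drive actually stores.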

The file on disk is 35% of its actual size! To generalize: if we stored mostly English Wikipedia on this drive, we would need a disk approximately 3x smaller for the same amount of data.

Although enwik9 is compressed on disk, we can look for useful content in it without decompressing it:
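
For example, an ordinary grep works on the compressed file unchanged, because the kernel decompresses blocks on the fly (path illustrative, assuming enwik9 sits on a Btrfs mount):

```shell
# Count article titles in the dump; no explicit decompression step needed.
grep -c '<title>' /mnt/btrfs/enwik9
```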

At Chronosphere, we store many time-series metrics. Time-series metrics generally consist of two parts:

  1. Labels. E.g. service=roar_labels, environment=production. These compress well, since those are text and often repeated (e.g., every metric has an environment label).
  2. The actual metric values, i.e., floats or integers. They are already well-compressed in the application via float/integer encoding.
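
To see why label text compresses so well, here is a small standalone sketch, using gzip as a readily available stand-in for the zstd compression Btrfs actually uses (file paths and label values are illustrative):

```shell
# Generate 10,000 copies of a typical label set: highly repetitive text.
seq 1 10000 | sed 's/.*/service=roar_labels,environment=production/' > /tmp/labels.txt

# Compress and compare sizes; repeated label text shrinks dramatically.
gzip -kf /tmp/labels.txt
wc -c /tmp/labels.txt /tmp/labels.txt.gz
```

On repetitive label data like this, the compressed file is a tiny fraction of the original; random-looking float payloads would not shrink nearly as much.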

We downloaded a subset of our internal cluster data and did a similar compsize check to the above. The potential disk savings were in the 65-70% ballpark. Since we store petabytes of data on disks and disks were the single biggest expense in our cloud bill at the time, we decided Btrfs was worth a careful evaluation.

“At Chronosphere, we saved 74% of our storage costs by moving petabytes of time-series data from ext4 to Btrfs.”

Our disks could be shrunk similarly if we also compressed from within the application itself. However, transparent filesystem compression was advantageous for us, because:

  1. The compression ratio would be similar if we compressed it directly (i.e., no significantly higher space savings from application compression).
  2. We rely heavily on mmap for reading; changing this to use compressed data is doable, but would require significant changes to the database.
  3. A change in file format is disruptive, especially if all historical data needs to be read and written — that’s quite a lot of tooling to prepare and CPU cycles to process!

Given that Btrfs showed significant savings potential and required almost no changes to the database itself, we opted to evaluate it thoroughly.

Btrfs reputation

Once we saw the compression ratios and the potential mind-boggling savings, we started looking more deeply into Btrfs (and getting asked about it quite a bit). Btrfs was merged into the Linux kernel in 2008, but had a poor reputation for reliability in its early days; the author can attest to that. On the bright side, over the last decade Btrfs development has picked up, and there are plenty of good stories of large companies using it at scale. For me and many others at Chronosphere, Josef Bacik’s talk “Btrfs at Facebook” from the 2020 Open Source Summit highlighted Btrfs’s current reputation and helped change the organization’s perspective on it.

[Image: online comments on Btrfs’s current reputation]

Step 2: Google/k8s/database support

Chronosphere runs on Google-managed Kubernetes on GCP. At the time of writing, GCP supported only ext4 and XFS on Linux. Running a filesystem that is not on the official list requires a separate Container Storage Interface (CSI) driver, deployed on the hosts and registered with the k8s control plane.

We used two approaches to get a Btrfs volume for the database:

  1. Initially, we took a shortcut and managed the block device ourselves. On startup, the database would pick up a raw block device, format it, mount it, and use it.
  2. Once we were comfortable with the prototype, we forked the GCP CSI driver, added Btrfs support, and deployed the fork across our fleet as a separate provisioner.
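
The first approach can be sketched roughly like this (device path, mount point, and options are illustrative; this is not our actual startup code, and it requires root):

```shell
# On startup: take the raw block device handed to the pod, format it as
# Btrfs if it carries no filesystem yet, then mount it for the database.
DEV=/dev/sdb
if ! blkid "$DEV" >/dev/null 2>&1; then
    mkfs.btrfs -f "$DEV"
fi
mount -o discard=async,compress=zstd:1 "$DEV" /var/lib/db
```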

Once the database was moved to a compressed Btrfs filesystem, it did not blow up. We technically had something that could work. The compression ratio on the database was consistent with the original compsize experiments: 65-70%.

Filesystem conversion

Once the CSI driver was deployed, we needed a way to convert the existing disks to Btrfs. Initially, we considered btrfs-convert, but it has the following warning in its documentation:

Always consider whether mkfs and a file copy would be a better option than the in-place conversion, given what was said above.

To convert a filesystem, we used a workflow that:

  1. Creates a new target volume of the same parameters as the source, except fsType: btrfs.
  2. Copies all files from source to target.
  3. Shuts down the database.
  4. Synchronizes all files between the source and the target.
  5. Swaps the source and target, removes the “old” one.
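
The copy steps above can be sketched as follows (volume names and paths are illustrative; in practice, steps 1 and 5 are Kubernetes volume operations rather than shell commands):

```shell
# 2. First pass: bulk-copy while the database is still running.
rsync -a /mnt/source/ /mnt/target/

# 3. Stop the database (deployment-specific).

# 4. Final catch-up pass: sync the delta accumulated during the bulk copy;
#    --delete also removes files that disappeared from the source meanwhile.
rsync -a --delete /mnt/source/ /mnt/target/

# 5. Swap the volumes and retire the old one (again, a k8s-side operation).
```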

In fact, we already had a very similar workflow to shrink the database disks (because neither ext4 nor Google’s block device storage supports online disk shrinking). We reused the disk shrink workflow to convert between the file systems; it was a relatively straightforward change.

Step 2a: risks and potential issues

When we had the infrastructure ready and had proven it worked on a real database, a few possible risks were raised:

    1. Poor reputation for reliability from the early days. For anyone who was concerned about this, it was mostly addressed by watching Josef Bacik’s talk from 2020.
    2. Different IO behavior may have performance implications. We had a decade of experience running the database on ext4, but zero experience with anything else. Early performance testing did not show significant performance changes, so we had a gradual rollout plan to observe performance changes.
    3. Compression/decompression will consume extra CPU cycles, leaving less available for the database to perform other tasks.
    4. Handling of remaining disk space. Btrfs tracks available disk space differently than other filesystems, and statfs cannot be as trusted as, say, on ext4. We absolutely cannot ever run out of disk space, as this will cause data loss. This was a major concern to us, so we had to be extra conservative.
    5. Google does not support Btrfs. Why? All of the following are plausible:
      1. A Google customer tried Btrfs; it behaved poorly with Google’s block device offering, so it was not added to the “officially supported” list.
      2. Google’s kernel team advised against it due to its poor reliability track record.
      3. Nobody ever seriously asked for Btrfs support, so Google just never did it.
    6. There were surely more unknown unknowns.

Before detailing those unknown unknowns, let’s walk through our configuration.

Our configuration

We use Btrfs with the following settings:

  1. discard=async
  2. compress=zstd:1
  3. btrfs-allocation-data-bg_reclaim_threshold=90
  4. btrfs-bdi-read_ahead_kb=128

discard=async and compress=zstd:1 are documented in the btrfs manual. The other two settings are implemented in the CSI driver; they write the configured values to /sys/fs/btrfs/<…>/allocation/data/bg_reclaim_threshold and /sys/fs/btrfs/<…>/bdi/read_ahead_kb, respectively.
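
Put together, the effective configuration looks roughly like this (an illustrative sketch, not the CSI driver's actual code; requires root):

```shell
# Mount with async discard and level-1 zstd compression.
mount -o discard=async,compress=zstd:1 /dev/sdb /mnt/data

# Look up the filesystem UUID to address its sysfs directory.
FSID=$(findmnt -no UUID /mnt/data)

# Relocate data block groups once their usage falls below 90% (aggressive),
# and pin the Btrfs read-ahead down to 128KB.
echo 90  > "/sys/fs/btrfs/$FSID/allocation/data/bg_reclaim_threshold"
echo 128 > "/sys/fs/btrfs/$FSID/bdi/read_ahead_kb"
```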

We set bg_reclaim_threshold to 90 because that makes free-space accounting easy. This value is aggressive: with it, the difference between Device unallocated and statfs (as reported by btrfs filesystem usage) is less than 2%, which lets us safely account for free space.
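
To compare the two views of free space mentioned above, one can run (mount point illustrative):

```shell
# Btrfs's allocation-aware view: "Device unallocated", per-profile free space.
btrfs filesystem usage /mnt/data

# The statfs-based view that most tooling (df, monitoring agents) relies on.
df -h /mnt/data
```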

The unknown unknowns

Even with careful planning, we encountered surprises.

Surprise 1: Disk snapshot costs ballooned

Our first major surprise came from an unexpected direction: backup costs. We rely on block-level snapshots provided by GCP. With ext4, these incremental snapshots were small and cheap relative to the disk size. After moving to Btrfs, our snapshot costs exploded.

“While Btrfs reduced the disks by >50%, snapshot costs grew by more than 6x. … we were paying more for disk snapshots than for the actual disks!”

[Chart: snapshot costs]

We never narrowed down the culprit; instead, we accelerated a different project. Over the course of a few months, we transitioned from snapshot-based backups to file-based backups, making backups (and backup costs) completely filesystem-agnostic. That brought the cost of backups down to pre-Btrfs levels:

[Chart: backup costs]

Surprise 2: Reclaim causing massive IO on large deletes

A more alarming issue surfaced on a production tenant. We noticed that deleting a large volume of files would trigger a massive IO storm, causing our commit log queue to grow uncontrollably and threatening service stability.

[Chart: commit log queue growing after a large file deletion]

This became a production incident. The cause was traced to Btrfs’s background space reclamation logic.

[Chart: production incident following a large delete]

As shown in the screenshots above, some downsampled data was deleted. When many files are deleted in Btrfs, reclamation kicks in: Btrfs shuffles data on disk, roughly in proportion to the number of bytes removed. Today, we configure this mechanism via bg_reclaim_threshold, which is quite aggressive.

To mitigate IO storms during reclaim, we are eagerly waiting for the dynamic_reclaim tunable, which was merged in Linux v6.11, to propagate to LTS kernels (where we can turn it on). In the meantime, we will throttle deletes on the application side.

Surprise 3: Read ahead!

One day, we hit a performance wall with our read workloads. We saw p99 latency spikes that directly correlated with the disk’s read throughput being saturated.

[Chart: p99 latency spikes]

The issue was a subtle but critical kernel setting: read-ahead. Two different read-ahead settings matter here: one for the generic block device, and one specifically for the Btrfs filesystem via its Backing Device Info (BDI) interface.

  • /sys/block/<…>/queue/read_ahead_kb (default: 128KB)
  • /sys/fs/btrfs/<…>/bdi/read_ahead_kb (default: 4096KB)

The Btrfs-specific read-ahead default was 32 times larger than the block-device default. The Btrfs setting is more intelligent than the block-device one, since it understands the logical layout of files rather than just the physical block layout. However, for our workload, which involves many random-like reads, read-ahead of up to 4MB was causing extreme read amplification, pulling huge amounts of data into memory that we would never use.
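
A quick way to check both knobs on a running system (device name and mount point are illustrative):

```shell
# Generic block-device read-ahead (128KB by default).
cat /sys/block/sdb/queue/read_ahead_kb

# Btrfs BDI read-ahead (4096KB by default); this is the one that surprised us.
FSID=$(findmnt -no UUID /mnt/data)
cat "/sys/fs/btrfs/$FSID/bdi/read_ahead_kb"
```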

Running the project & timeline

We created a single “Btrfs master plan” that includes all the information for interested parties: savings potential, a rough timeline, the necessary development to productionize it, a migration path for existing databases, and risks. The “master plan” was very helpful in showing others that we understand, acknowledge, and are taking steps to mitigate the risks. Rough project timeline:

  1. 2024Q2: The compsize test on our data, plus ballpark calculations of how much it could save, was presented at a team offsite. This step was described earlier in this post.
  2. 2024Q3: hacked together the database support just enough to run on a development cluster. This proved the database did not fall over using our standard synthetic benchmarks.
    1. The success of this step elevated the project from “a team experiment” to something that a wider organization began to keep an eye on.
    2. Discussions started about Btrfs reliability, long-term support, and maintenance.
  3. 2024Q4: The infrastructure team productionized Btrfs support.
  4. 2025Q1: moved the smallest internal production cluster (meta-meta) to Btrfs.
  5. 2025Q2: moved the first production cluster to Btrfs. Mass migration commenced.
  6. 2025Q4:
    1. moved the last production cluster to Btrfs.
    2. Google lands the last Btrfs-enabling patch to their CSI driver. Btrfs is in production on the GCP side for everyone.

Where are we now?

Today, all of our time-series databases run Btrfs. That’s petabytes worth of storage (after compression!). This journey required significant engineering effort, and our infrastructure was for a time forked from the standard GCP offerings: we maintained our own builds of Google’s Container Storage Interface (CSI) driver to include Btrfs support. We continued to work with Google to upstream our changes, and that work has paid off: Btrfs has been enabled since Container-Optimized OS 125, and all of our suggested changes to the CSI driver have been merged upstream.

As it stands today, you should be able to fully use Btrfs with Google Kubernetes Engine 1.35 or later.

As a result of the migration, we saved 74% of disk costs due to compression alone. Now that we are fully on Btrfs, we are removing some in-application checksums, which, we believe, will bring additional cost savings for compute (CPUs).

Takeaways

Our experience provides a few key lessons. First, Btrfs is a viable and trustworthy filesystem for large-scale, single-volume enterprise deployments. Second, thanks to transparent compression, it shaved 74% off our disk costs, and it can shrink your disks too if you store lots of uncompressed data that is tricky for the application to compress itself.

Errata

If you watched the FOSDEM talk, there was one last question towards the end of the presentation:

Do you use deduplication?

When on stage, I thought the question was about replication. The answer is still “no,” but for other reasons — we do not use deduplication because we have a decent way to deduplicate shared data in the application without major changes and with barely any CPU or IO overhead.
