Showing posts with label fsync. Show all posts
Showing posts with label fsync. Show all posts

Wednesday, March 18, 2026

MariaDB innovation: binlog_storage_engine, 32-core server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). This work was done by Small Datum LLC and sponsored by the MariaDB Foundation.

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (32-core) server. Both servers use an SSD that has has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

While throughput improves with the InnoDB doublewrite buffer disabled, I am not suggesting people do that for production workloads without understanding the risks it creates.

tl;dr for a CPU-bound workload

  • throughput for write-heavy steps is larger with the InnoDB doublewrite buffer disabled
  • throughput for write-heavy steps is much larger with the binlog storage engine enabled
  • throughput for write-heavy steps is largest with both the binlog storage engine enabled and the InnoDB doublewrite buffer disabled. In this case it was up to 8.9X larger.
tl;dr for an IO-bound workload
  • see the tl;dr above
  • the best throughput comes from enabling the binlog storage engine and disabling the InnoDB doublewrite buffer and was 3.26X.
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 32-cores and 128G of RAM. Storage is 1 NVMe device with ext-4 and discard enabled. The OS is Ubuntu 24.04. AMD SMT is disabled. The SSD has high fsync latency.

I tried 4 my.cnf files:
The Benchmark

The benchmark is explained here. It was run with 12 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 300M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 16M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 4M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA. And from the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version $base is the result from the base version.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.63X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 2.80X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 8.13X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 1.14X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.93X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.61X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.03X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 8.90X larger for z12b_sync_dw0
Results: IO-bound

The summary is here.

For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPs than the storage devices can provide. So I ignore those steps.

Some of the improvements here are huge courtesy of storage with high fsync latency.

Throughput is much better with the binlog storage engine enabled when the InnoDB doublewrite buffer is also enabled. Comparing z12b_sync and z12c_sync (z12c_sync uses the binlog storage engine):
  • throughput for l.i0 (load in PK order) is 3.05X larger for z12c_sync
  • throughput for l.i1 (write-only, larger transactions) is 1.22X larger for z12c_sync
  • throughput for l.i2 (write-only, smaller transactions) is 1.58X larger for z12c_sync
There is a smaller benefit from only disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12b_sync_dw0:
  • throughput for l.i0 (load in PK order) is the same for z12b_sync and z12b_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 2.06X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 1.59X larger for z12b_sync_dw0
The largest benefits come from using the binlog storage engine and disabling the InnoDB doublewrite buffer. Comparing z12b_sync and z12c_sync_dw0:
  • throughput for l.i0 (load in PK order) is 3.01X larger for z12c_sync_dw0
  • throughput for l.i1 (write-only, larger transactions) is 3.26X larger for z12b_sync_dw0
  • throughput for l.i2 (write-only, smaller transactions) is 2.78X larger for z12b_sync_dw0









Tuesday, February 17, 2026

MariaDB innovation: binlog_storage_engine, small server, Insert Benchmark

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a similar small server. Both servers use an SSD that has has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.

tl;dr for a CPU-bound workload

  • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
  • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
tl;dr for an IO-bound workload
  • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
  • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
Builds, configuration and hardware

I compiled MariaDB 12.3.0 from source.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

I used 4 my.cnf files:
  • z12b
    • my.cnf.cz12b_c8r32 (z12b) is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
  • z12c
  • z12b_sync
  • z12c_sync
    • my.cnf.cz12c_sync_c8r32 (z12c_sync) is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
The Benchmark

The benchmark is explained here. It was run with 1 client for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for CPU-bound and 800M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for:
  • CPU-bound
    • all-versions - results for z12b, z12c, z12b_sync and z12c_sync
    • sync-only - results for z12b_sync vs 12c_sync
  • IO-bound
    • all-versions - results for z12b, z12c, z12b_sync and z12c_sync
    • sync-only - results for z12b_sync vs 12c_sync
The summary sections from the performance reports have 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version $base is the result from the base version. 

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summaries are here for all-versions and sync-only.
  • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
  • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
The second table from the summary section has been inlined below. That table shows relative throughput which is:
  • all-versions: (QPS for my config / QPS for z12b)
  • sync-only: (QPS for my config / QPS for z12b)
For all-versions
dbmsl.i0l.xl.i1l.i2qr100qp100qr500qp500qr1000qp1000
ma120300_rel_withdbg.cz12b_c8r321.001.001.001.001.001.001.001.001.001.00
ma120300_rel_withdbg.cz12c_c8r321.031.011.001.031.000.991.001.001.011.00
ma120300_rel_withdbg.cz12b_sync_c8r320.041.020.070.011.011.011.001.011.001.00
ma120300_rel_withdbg.cz12c_sync_c8r320.081.030.280.061.021.011.011.021.021.01

And for sync-only the relative QPS is:
  • all-versions: (QPS for my config / QPS for z12b_sync)
  • sync-only: (QPS for my config / QPS for z12b_sync)
dbmsl.i0l.xl.i1l.i2qr100qp100qr500qp500qr1000qp1000
ma120300_rel_withdbg.cz12b_sync_c8r321.001.001.001.001.001.001.001.001.001.00
ma120300_rel_withdbg.cz12c_sync_c8r321.751.013.996.831.011.011.011.011.031.01

Results: IO-bound

The summaries are here for all-versions and sync-only.
  • Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
  • When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
The second table from the summary section has been inlined below. That table shows relative throughput which is:
  • all-versions: (QPS for my config / QPS for z12b)
  • sync-only: (QPS for my config / QPS for z12b)
For all-versions
dbmsl.i0l.xl.i1l.i2qr100qp100qr500qp500qr1000qp1000
ma120300_rel_withdbg.cz12b_c8r321.001.001.001.001.001.001.001.001.001.00
ma120300_rel_withdbg.cz12c_c8r321.010.990.991.011.011.011.011.071.011.04
ma120300_rel_withdbg.cz12b_sync_c8r320.041.000.550.101.020.971.000.800.950.55
ma120300_rel_withdbg.cz12c_sync_c8r320.181.000.830.311.021.011.020.961.020.86

And for sync-only the relative QPS is:
  • all-versions: (QPS for my config / QPS for z12b_sync)
  • sync-only: (QPS for my config / QPS for z12b_sync)
dbmsl.i0l.xl.i1l.i2qr100qp100qr500qp500qr1000qp1000
ma120300_rel_withdbg.cz12b_sync_c8r321.001.001.001.001.001.001.001.001.001.00
ma120300_rel_withdbg.cz12c_sync_c8r324.741.001.502.991.001.041.021.201.081.57












Monday, February 16, 2026

MariaDB innovation: binlog_storage_engine

MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).

In this post I have results for the performance benefit from this when using storage that has a high fsync latency. This is probably a best-case comparison for the feature. A future post will cover the benefit on servers that don't have high fsync latency.

tl;dr

  • the performance benefit from this is excellent when storage has a high fsync latency
  • there is a small improvement (up to 6%) for write throughput when binlog_storage_engine is enabled but sync-on-commit is not enabled
  • my mental performance model needs to be improved. I gussed that throughput would increase by ~2X when using binlog_storage_engine relative to not using it but using sync_binlog=1 and innodb_flush_log_at_trx_commit=1. However the improvement is larger than 4X.
Some history

MongoDB has done this for years -- the replication log is stored in WiredTiger. 

Long ago there were requests for this feature from the Galera team, and I wonder if they will benefit from this now. I have been curious about the benefit of the feature, but long ago I was also wary of it because it can increase stress on InnoDB and back in the day InnoDB already struggled with high-concurrency workloads.

Long ago group commit didn't work for the binlog. The Facebook MySQL team did some work to fix that, and eventually. A Google search describes our work as the first and I found an old Facebook note that I probably wrote about the effort.

Builds, configuration and hardware

I compiled MariaDB 12.3.0 from source.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.

I used 4 my.cnf files:
  • z12b
    • my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
  • z12c
  • z12b_sync
  • z12c_sync
    • my.cnf.cz12c_sync_c8r32 is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks 
and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The benchmark is run with 1 client, 1 table and 50M rows. 

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.  

But here I only report results for the write-heavy tests.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than base version.  When it is < 1 then there might be a regression. 

I present results for:
  • z12b, z12c, z12b_sync and z12c_sync with z12b as the base version
  •  z12b_sync and z12c_sync with z12b_sync as the base version
Results: z12b, z12c, z12b_sync, z12c_sync

Summary:
  • z12c gets up to 6% more throughput than z12b but the CPU overhead per operation are similar for z12b and z12c
  • z12b_sync has the worst performance thanks to 2 fsyncs per commit
  • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and in the next section I mention that enabling binlog_storage_engine also reduces the CPU overhead.
  • some per-test data from iostat and vmstat is here
  • a representative sample of iostat collected at 1-second intervals during the update-inlist test is here. When comparing z12b_sync with z12c_sync
    • the fsync rate (f/s) is ~2.5X larger for z12c_sync vs z12b_sync (~690/s vs ~275/s) but fsync latency (f_await) is similar. So with binlog_storage_engine enabled MySQL is more efficient, and perhaps thanks to a lower CPU overhead, there is less work to do in between calls to fsync
Relative to: z12b
col-1 : z12c
col-2 : z12b_sync
col-3 : z12c_sync

col-1   col-2   col-3
1.06    0.01    0.05    delete
1.05    0.01    0.05    insert
1.01    0.12    0.47    read-write_range=100
1.01    0.10    0.44    read-write_range=10
1.03    0.01    0.11    update-index
1.02    0.02    0.12    update-inlist
1.05    0.01    0.06    update-nonindex
1.05    0.01    0.06    update-one
1.05    0.01    0.06    update-zipf
1.01    0.03    0.20    write-only

Results: z12b_sync, z12c_sync

Summary:
  • z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and below I mention that enabling binlog_storage_engine also reduces the CPU overhead.
  • some per-test data from iostat and vmstat is here and the CPU overhead per operation is much smaller with binlog_storage_engine -- see here for the update-inlist test. In general, when sync-on-commit is enabled then the CPU overhead with binlog_storage_engine enabled is between 1/3 and 2/3 of the overhead without it enabled.
Relative to: z12b_sync
col-1 : z12c_sync

col-1
6.40    delete
5.64    insert
4.06    read-write_range=100
4.40    read-write_range=10
7.64    update-index
7.17    update-inlist
5.73    update-nonindex
5.82    update-one
5.80    update-zipf
6.61    write-only

CPU-bound sysbench on a large server: Postgres 12 to 19 beta1

This has results from sysbench on a small server with Postgres versions 12 through 19 beta1. Sysbench is run with high concurrency (40 conne...