Small Datum: myrocks


Saturday, April 19, 2025

Battle of the Mallocators: part 2

This post addresses some of the feedback I received from my previous post on the impact of the malloc library when using RocksDB and MyRocks. Here I test:

MALLOC_ARENA_MAX with glibc malloc

See here for more background on MALLOC_ARENA_MAX. By default glibc can use too many arenas for some workloads (up to 8 X number_of_CPU_cores), so I tested it with 1, 8, 48 and 96 arenas.

compiling RocksDB and MyRocks with jemalloc specific code enabled

In my previous results I just set malloc-lib in my.cnf, which makes mysqld_safe use LD_PRELOAD to load your favorite malloc library implementation.
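For the MALLOC_ARENA_MAX item above, the arithmetic behind "too many arenas" can be sketched as follows. This is a minimal illustration that uses only the 8 X number_of_CPU_cores rule quoted in this post; the function name is hypothetical and the 48-core count is taken from the server used for these tests.

```python
def default_arena_limit(cpu_cores: int) -> int:
    """Default upper bound on glibc malloc arenas, per the 8 x cores rule quoted above."""
    return 8 * cpu_cores

CORES = 48  # core count of the test server in this post
limit = default_arena_limit(CORES)

# Compare each tested MALLOC_ARENA_MAX value with the default limit.
for tested in (1, 8, 48, 96):
    print(f"MALLOC_ARENA_MAX={tested}: {tested / limit:.1%} of the default limit ({limit})")
```

Even the largest value tested here (96) is a quarter of the default limit on this server.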

tl;dr: jemalloc

For mysqld with jemalloc enabled via malloc-lib (LD_PRELOAD) versus mysqld with jemalloc specific code enabled

performance, VSZ and RSS were similar

After setting rocksdb_cache_dump=0 in the binary with jemalloc specific code

performance is slightly better (excluding the outlier, the benefit is up to 3%)
peak VSZ is cut in half
peak RSS is reduced by ~9%

tl;dr: glibc malloc on a 48-core server

With 1 arena performance is lousy but the RSS bloat is mostly solved
With 8, 48 or 96 arenas the RSS bloat is still there
With 48 arenas there are still significant (5% to 10%) performance drops
With 96 arenas the performance drop was mostly ~2%

Building MyRocks with jemalloc support

This was harder than I expected. The first step was easy: I added these to the CMake command line. The first is for MyRocks and the second is for RocksDB. When the first is set, HAVE_JEMALLOC is defined in config.h. When the second is set, ROCKSDB_JEMALLOC is defined on the compiler command line.

-DHAVE_JEMALLOC=1
-DWITH_JEMALLOC=1

The hard part is that there were linker errors for unresolved symbols -- the open-source build was broken. The fix that worked for me is here. I removed libunwind.so and added libjemalloc.so in its place.

Running mysqld with MALLOC_ARENA_MAX

I wasn't sure if it was sufficient for me to set an environment variable when invoking mysqld_safe, so I just edited the mysqld_safe script to do that for me:

182a183,184
> cmd="MALLOC_ARENA_MAX=1 $cmd"
> echo Run :: $cmd
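One way to confirm that the variable actually reached mysqld, however it was set, is to read the process environment from /proc on Linux. The following is a minimal sketch and not something from the original setup: the helper names are hypothetical, and reading /proc/<pid>/environ requires running as the same user as mysqld or as root.

```python
def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE records used by /proc/<pid>/environ."""
    env = {}
    for record in raw.split(b"\0"):
        if b"=" in record:
            key, _, value = record.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env

def arena_max_for_pid(pid: int):
    """Return MALLOC_ARENA_MAX as the process saw it at exec time, or None (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read()).get("MALLOC_ARENA_MAX")

# Hypothetical sample in the same format as /proc/<pid>/environ.
sample = b"PATH=/usr/bin\0MALLOC_ARENA_MAX=1\0HOME=/root\0"
print(parse_environ(sample).get("MALLOC_ARENA_MAX"))
```

In practice arena_max_for_pid would be called with the pid of the running mysqld.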

Results: jemalloc

The jemalloc specific code in MyRocks and RocksDB is useful but most of it is not there to boost performance. The jemalloc specific code most likely to boost performance is here in MyRocks and is enabled when rocksdb_cache_dump=0 is added to my.cnf.

Results are here for 3 setups:

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128

This is the base case in the table below
This is what I used in my previous post. Here jemalloc is enabled by setting malloc-lib in my.cnf, which uses LD_PRELOAD

fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128

This is col-1 in the table below
MySQL with jemalloc specific code enabled at compile time

fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128

This is col-2 in the table below
MySQL with jemalloc specific code enabled at compile time and rocksdb_cache_dump=0 added to my.cnf

These results use the relative QPS, which is the following where the base case is jemalloc enabled via malloc-lib. When this value is larger than 1.0 then QPS is larger for the binary with jemalloc specific code than for the base case.

(QPS for col-1 or col-2) / (QPS for the base case)
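As a minimal sketch of how a relative QPS column can be computed from per-microbenchmark QPS values: the function name and the QPS numbers below are hypothetical and only illustrate the ratio of a tested configuration over a base configuration.

```python
def relative_qps(test_qps: dict, base_qps: dict) -> dict:
    """Return test/base QPS per microbenchmark, rounded to 2 digits as in the tables."""
    return {
        name: round(test_qps[name] / base_qps[name], 2)
        for name in base_qps
        if name in test_qps
    }

# Hypothetical QPS values, not measurements from this post.
base = {"point-query_range=100": 20000.0, "scan_range=100": 500.0}
test = {"point-query_range=100": 20200.0, "scan_range=100": 490.0}

rel = relative_qps(test, base)
for name, ratio in rel.items():
    print(f"{ratio:.2f} {name}")
```

A value above 1.0 means the tested configuration gets more QPS than the base, and a value below 1.0 means it gets less.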

From the results below:

results in col-1 are similar to the base case. So compiling in the jemalloc specific code didn't help performance.
results in col-2 are slightly better than the base case with one outlier (hot-points). So consider setting rocksdb_cache_dump=0 in my.cnf after compiling in jemalloc specific code.

Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128

col-2 : fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128

col-1 col-2

0.92 1.40 hot-points_range=100

1.00 1.01 point-query_range=100

1.01 1.02 points-covered-pk_range=100

0.94 1.03 points-covered-si_range=100

1.01 1.02 points-notcovered-pk_range=100

0.98 1.02 points-notcovered-si_range=100

1.01 1.03 random-points_range=1000

1.01 1.02 random-points_range=100

0.99 1.00 random-points_range=10

0.98 1.00 range-covered-pk_range=100

0.96 0.97 range-covered-si_range=100

0.98 0.98 range-notcovered-pk_range=100

1.00 1.02 range-notcovered-si_range=100

0.98 1.00 read-only-count_range=1000

1.01 1.01 read-only-distinct_range=1000

0.99 0.99 read-only-order_range=1000

1.00 1.00 read-only_range=10000

0.99 0.99 read-only_range=100

0.99 1.00 read-only_range=10

0.98 0.99 read-only-simple_range=1000

0.99 0.99 read-only-sum_range=1000

0.98 0.98 scan_range=100

1.01 1.02 delete_range=100

1.01 1.03 insert_range=100

0.99 1.01 read-write_range=100

1.00 1.01 read-write_range=10

1.00 1.02 update-index_range=100

1.02 1.02 update-inlist_range=100

1.01 1.03 update-nonindex_range=100

0.99 1.01 update-one_range=100

1.01 1.03 update-zipf_range=100

1.00 1.01 write-only_range=10000

The impact on VSZ and RSS is interesting. The tables below show the peak values for VSZ and RSS, in GB, from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). To save space I use abbreviated names for the binaries.

jemalloc.1

base case, fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_jemalloc_c32r128

jemalloc.2

col-1 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za4_c32r128
This has little impact on VSZ and RSS

jemalloc.3

col-2 above, fbmy8032_rel_o2nofp_end_je_241023_ba9709c9_971.za5_c32r128
This cuts peak VSZ in half and reduces peak RSS by 9%

Peak values for MyRocks with 10G buffer pool

alloc VSZ RSS RSS/10

jemalloc.1 45.6 12.2 1.22

jemalloc.2 46.0 12.5 1.25

jemalloc.3 20.2 11.6 1.16
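The post does not say how the peak values were sampled. One possible way on Linux, shown here as a sketch and not as the method used for these tables, is to read VmPeak (peak virtual size) and VmHWM (peak resident set size) from /proc/<pid>/status. The function name and the sample text are hypothetical, and the conversion assumes the kernel's kB values are multiples of 1024 bytes.

```python
def peak_memory_gb(status_text: str) -> dict:
    """Extract peak VSZ (VmPeak) and peak RSS (VmHWM) in GB from /proc/<pid>/status text."""
    wanted = {"VmPeak": "peak_vsz_gb", "VmHWM": "peak_rss_gb"}
    result = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            kb = int(rest.split()[0])  # value is followed by the unit, for example "kB"
            result[wanted[name]] = kb / (1024 * 1024)
    return result

# Hypothetical excerpt in the format of /proc/<pid>/status.
sample_status = "Name:\tmysqld\nVmPeak:\t 2097152 kB\nVmHWM:\t 1048576 kB\n"
peaks = peak_memory_gb(sample_status)
print(peaks)
```

For a live process the input would come from reading /proc/<pid>/status for the mysqld pid at the end of the benchmark.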

Results: MALLOC_ARENA_MAX

The binaries tested are:

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128

base case in the table below

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128

col-1 in the table below
uses MALLOC_ARENA_MAX=1

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128

col-2 in the table below
uses MALLOC_ARENA_MAX=8

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128

col-3 in the table below
uses MALLOC_ARENA_MAX=48

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128

col-4 in the table below
uses MALLOC_ARENA_MAX=96

These results use the relative QPS, which is the following where $N is 1, 8, 48 or 96. When this value is larger than 1.0 then QPS is larger with MALLOC_ARENA_MAX set than with the glibc malloc default.

(QPS with MALLOC_ARENA_MAX=$N) / (QPS with glibc malloc defaults)

From the results below:

performance with 1 or 8 arenas is lousy
performance drops some (often 5% to 10%) with 48 arenas
performance drops ~2% with 96 arenas

Relative to: fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_c32r128

col-1 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_1arena_c32r128

col-2 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_8arena_c32r128

col-3 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_48arena_c32r128

col-4 : fbmy8032_rel_o2nofp_end_241023_ba9709c9_971.za4_glibcmalloc_96arena_c32r128

col-1 col-2 col-3 col-4

0.89 0.78 0.72 0.78 hot-points_range=100

0.23 0.61 0.96 0.98 point-query_range=100

0.31 0.86 0.96 1.01 points-covered-pk_range=100

0.24 0.87 0.95 1.01 points-covered-si_range=100

0.31 0.86 0.97 1.01 points-notcovered-pk_range=100

0.20 0.86 0.97 1.00 points-notcovered-si_range=100

0.35 0.79 0.96 1.01 random-points_range=1000

0.30 0.87 0.96 1.01 random-points_range=100

0.23 0.67 0.96 0.99 random-points_range=10

0.06 0.48 0.92 0.96 range-covered-pk_range=100

0.14 0.52 0.97 0.99 range-covered-si_range=100

0.13 0.46 0.91 0.97 range-notcovered-pk_range=100

0.23 0.87 0.96 1.01 range-notcovered-si_range=100

0.23 0.76 0.97 0.99 read-only-count_range=1000

0.56 1.00 0.96 0.97 read-only-distinct_range=1000

0.20 0.47 0.90 0.94 read-only-order_range=1000

0.68 1.04 1.00 1.00 read-only_range=10000

0.21 0.76 0.98 0.99 read-only_range=100

0.19 0.70 0.97 0.99 read-only_range=10

0.21 0.58 0.94 0.98 read-only-simple_range=1000

0.19 0.57 0.95 1.00 read-only-sum_range=1000

0.53 0.98 1.00 1.01 scan_range=100

0.30 0.81 0.98 1.00 delete_range=100

0.50 0.92 1.00 1.00 insert_range=100

0.23 0.72 0.97 0.98 read-write_range=100

0.20 0.67 0.96 0.98 read-write_range=10

0.33 0.88 0.99 1.00 update-index_range=100

0.36 0.76 0.94 0.98 update-inlist_range=100

0.30 0.85 0.98 0.99 update-nonindex_range=100

0.86 0.98 1.00 1.01 update-one_range=100

0.32 0.86 0.98 0.98 update-zipf_range=100

0.27 0.80 0.97 0.98 write-only_range=10000

Using 1 arena prevents RSS bloat but comes at a huge cost in performance. If I had more time I would have tested for 2, 4 and 6 arenas but I don't think glibc malloc + RocksDB are meant to be.

Peak values for MyRocks with 10G buffer pool

alloc VSZ RSS RSS/10

default 46.1 36.2 3.62

arena = 1 15.9 14.1 1.41

arena = 8 32.6 27.7 2.77

arena = 48 35.2 29.2 2.92

arena = 96 39.3 32.5 3.25
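The last column of the table above is simple arithmetic, sketched here with the peak RSS values copied from the table. The variable names are hypothetical and the values are assumed to be in the same unit as the 10G buffer pool.

```python
BUFFER_POOL = 10  # MyRocks buffer pool size, same unit as the RSS values

# Peak RSS per configuration, copied from the table above.
peak_rss = {
    "default": 36.2,
    "arena = 1": 14.1,
    "arena = 8": 27.7,
    "arena = 48": 29.2,
    "arena = 96": 32.5,
}

# Ratio of peak RSS to buffer pool size, rounded as in the table.
ratios = {name: round(rss / BUFFER_POOL, 2) for name, rss in peak_rss.items()}
for name, ratio in ratios.items():
    print(f"{name}: peak RSS is {ratio:.2f}X the buffer pool size")
```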

Friday, April 11, 2025

Battle of the Mallocators

If you use RocksDB and want to avoid OOM then use jemalloc or tcmalloc and avoid glibc malloc. That was true in 2015 and remains true in 2025 (see here). The problem is that RocksDB can be an allocator stress test because it does an allocation (calls malloc) when a block is read from storage and then does a deallocation (calls free) on eviction. These allocations have very different lifetimes as some blocks remain cached for a long time and that leads to much larger RSS than expected when using glibc malloc. Fortunately, jemalloc and tcmalloc are better at tolerating that allocation pattern without making RSS too large.
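The allocation pattern described above can be illustrated with a toy model. This is a sketch under loud assumptions: ToyBlockCache is a hypothetical class, not RocksDB's block cache, and Python's own memory management hides the allocator effects. It only shows one allocation per miss, one free per eviction, and blocks with very different lifetimes.

```python
from collections import OrderedDict

class ToyBlockCache:
    """Toy LRU cache: one allocation per miss, one free per eviction."""

    def __init__(self, capacity_blocks: int, block_size: int = 4096):
        self.capacity = capacity_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # block_id -> buffer, oldest first
        self.allocs = 0
        self.frees = 0

    def read(self, block_id: int) -> bytearray:
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # hit: mark as most recently used
            return self.blocks[block_id]
        buf = bytearray(self.block_size)  # miss: a fresh allocation per block read
        self.allocs += 1
        self.blocks[block_id] = buf
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
            self.frees += 1
        return buf

cache = ToyBlockCache(capacity_blocks=2)
for block_id in [1, 2, 1, 3, 1, 4]:  # block 1 is hot, the others are short-lived
    cache.read(block_id)
print(cache.allocs, cache.frees, list(cache.blocks))
```

In this run the hot block stays cached while the cold blocks around it are allocated and freed, which is the mix of lifetimes that the paragraph above blames for RSS growth with glibc malloc.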

I have yet to notice a similar problem with InnoDB, in part because it does a few large allocations at process start for the InnoDB buffer pool and it doesn't do malloc/free per block read from storage.

There was a recent claim from a MySQL performance expert, Dimitri Kravtchuk, that either RSS or VSZ can grow too large with InnoDB and jemalloc. I don't know all of the details for his setup and I failed to reproduce his result on my setup. To be fair, I show here that VSZ for InnoDB + jemalloc can be larger than you might expect but that isn't a problem, it is just an artifact of jemalloc that can be confusing. But RSS for jemalloc with InnoDB is similar to what I get from tcmalloc.

tl;dr

For glibc malloc with MyRocks I get OOM on a server with 128G of RAM when the RocksDB buffer pool size is 50G. I might have been able to avoid OOM by using between 30G and 40G for the buffer pool. On that host I normally use jemalloc with MyRocks and a 100G buffer pool.
With respect to peak RSS

For InnoDB the peak RSS with all allocators is similar and peak RSS is ~1.06X larger than the InnoDB buffer pool.
For MyRocks the peak RSS is smallest with jemalloc, slightly larger with tcmalloc and much too large with glibc malloc. For (jemalloc, tcmalloc, glibc malloc) it was (1.22, 1.31, 3.62) times larger than the 10G MyRocks buffer pool. I suspect those ratios would be smaller for jemalloc and tcmalloc had I used an 80G buffer pool.

For performance, QPS with jemalloc and tcmalloc is slightly better than with glibc malloc

For InnoDB: [jemalloc, tcmalloc] get [2.5%, 3.5%] more QPS than glibc malloc
For MyRocks: [jemalloc, tcmalloc] get [5.1%, 3.0%] more QPS than glibc malloc

Prior art

I have several blog posts on using jemalloc with MyRocks.

October 2015 - MyRocks with glibc malloc, jemalloc and tcmalloc
April 2017 - Performance for large, concurrent allocations
April 2018 - RSS for MyRocks with jemalloc vs glibc malloc
August 2023 - RocksDB and glibc malloc
September 2023 - A regression in jemalloc 4.4.0 and 4.5.0 (too-large RSS)
September 2023 - More on the regression in jemalloc 4.4.0 and 4.5.0
October 2023 - Even more on the regression in jemalloc 4.4.0 and 4.5.0

Builds, configuration and hardware

I compiled upstream MySQL 8.0.40 from source for InnoDB. I also compiled FB MySQL 8.0.32 from source for MyRocks. For FB MySQL I used source as of October 23, 2024 at git hash ba9709c9 with RocksDB 9.7.1.

The server is an ax162-s from Hetzner with 48 cores (AMD EPYC 9454P), 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 with SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

For malloc the server uses:

glibc

version 2.35-0ubuntu3.9

tcmalloc

provided by libgoogle-perftools-dev and apt-cache show claims this is version 2.9.1
enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so in my.cnf

jemalloc

provided by libjemalloc-dev and apt-cache show claims this is version 5.2.1-4ubuntu1
enabled by malloc-lib=/usr/lib/x86_64-linux-gnu/libjemalloc.so in my.cnf

The configuration files are here for InnoDB and for MyRocks. For InnoDB I used an 80G buffer pool. I tried to use a 50G buffer pool for MyRocks but with glibc malloc there was OOM so I repeated all tests with a 10G buffer pool. I might have been able to avoid OOM with MyRocks and glibc malloc by using a buffer pool between 30G and 40G -- but I didn't want to spend more time figuring that out when the real answer is to use jemalloc or tcmalloc.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement.

The tests run with 16 tables and 50M rows/table. There are 256 client threads and each microbenchmark runs for 1200 seconds. Normally I don't run with (client threads / cores) >> 1 but I do so here to create more stress and to copy what I think Dimitri had done.

Normally when I run sysbench I configure it so that the test tables fit in the buffer pool (block cache) but I don't do that here because I want MyRocks to do IO, as allocations per storage read create much drama for the allocator.

The command line to run all tests is: bash r.sh 16 50000000 1200 1200 md2 1 0 256

Peak VSZ and RSS

The tables below show the peak values for VSZ and RSS, in GB, from mysqld during the benchmark. The last column is the ratio (peak RSS / buffer pool size). I am not sure it is fair to compare these ratios between InnoDB and MyRocks from this work because the buffer pool size is so much larger for InnoDB. Regardless, RSS is more than 3X larger than the MyRocks buffer pool size with glibc malloc and that is a problem.

Peak values for InnoDB with 80G buffer pool

alloc VSZ RSS RSS/80

glibc 88.2 86.5 1.08

tcmalloc 88.1 85.3 1.06

jemalloc 91.5 87.0 1.08

Peak values for MyRocks with 10G buffer pool

alloc VSZ RSS RSS/10

glibc 46.1 36.2 3.62

tcmalloc 15.3 13.1 1.31

jemalloc 45.6 12.2 1.22

Performance: InnoDB

From the results here, QPS is mostly similar between tcmalloc and jemalloc but there are a few microbenchmarks where tcmalloc is much better than jemalloc and those are highlighted.

The results for read-only_range=10000 are an outlier (tcmalloc much faster than jemalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches /operation (cs/o) are much larger for jemalloc than for tcmalloc.

These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.

(QPS with $allocator) / (QPS with glibc malloc)

Relative to results with glibc malloc

col-1 : results with tcmalloc

col-2 : results with jemalloc

col-1 col-2

0.99 1.02 hot-points_range=100

1.05 1.04 point-query_range=100

0.96 0.99 points-covered-pk_range=100

0.98 0.99 points-covered-si_range=100

0.96 0.99 points-notcovered-pk_range=100

0.97 0.98 points-notcovered-si_range=100

0.97 1.00 random-points_range=1000

0.95 0.99 random-points_range=100

0.99 0.99 random-points_range=10

1.04 1.03 range-covered-pk_range=100

1.05 1.07 range-covered-si_range=100

1.04 1.03 range-notcovered-pk_range=100

0.98 1.00 range-notcovered-si_range=100

1.02 1.02 read-only-count_range=1000

1.05 1.07 read-only-distinct_range=1000

1.07 1.12 read-only-order_range=1000

1.28 1.09 read-only_range=10000

1.03 1.05 read-only_range=100

1.05 1.08 read-only_range=10

1.08 1.07 read-only-simple_range=1000

1.04 1.03 read-only-sum_range=1000

1.02 1.02 scan_range=100

1.01 1.00 delete_range=100

1.03 1.01 insert_range=100

1.02 1.02 read-write_range=100

1.03 1.03 read-write_range=10

1.01 1.02 update-index_range=100

1.15 0.98 update-inlist_range=100

1.06 0.99 update-nonindex_range=100

1.03 1.03 update-one_range=100

1.02 1.01 update-zipf_range=100

1.18 1.05 write-only_range=10000

Performance: MyRocks

From the results here, QPS is mostly similar between tcmalloc and jemalloc with a slight advantage for jemalloc but there are a few microbenchmarks where jemalloc is much better than tcmalloc and those are highlighted.

The results for hot-points below are odd (jemalloc is a lot faster than tcmalloc) and from vmstat metrics here I see that CPU/operation (cpu/o) and context switches /operation (cs/o) are both much larger for tcmalloc.

These results use the relative QPS, which is the following where $allocator is tcmalloc or jemalloc. When this value is larger than 1.0 then QPS is larger with tcmalloc or jemalloc.

(QPS with $allocator) / (QPS with glibc malloc)

Relative to results with glibc malloc

col-1 : results with tcmalloc

col-2 : results with jemalloc

col-1 col-2

0.68 1.00 hot-points_range=100

1.04 1.04 point-query_range=100

1.09 1.09 points-covered-pk_range=100

1.00 1.09 points-covered-si_range=100

1.09 1.09 points-notcovered-pk_range=100

1.10 1.12 points-notcovered-si_range=100

1.08 1.08 random-points_range=1000

1.09 1.09 random-points_range=100

1.05 1.10 random-points_range=10

0.99 1.07 range-covered-pk_range=100

1.01 1.03 range-covered-si_range=100

1.05 1.09 range-notcovered-pk_range=100

1.10 1.09 range-notcovered-si_range=100

1.07 1.05 read-only-count_range=1000

1.00 1.00 read-only-distinct_range=1000

0.98 1.04 read-only-order_range=1000

1.03 1.03 read-only_range=10000

0.96 1.03 read-only_range=100

1.02 1.04 read-only_range=10

0.98 1.07 read-only-simple_range=1000

1.07 1.09 read-only-sum_range=1000

1.02 1.02 scan_range=100

1.05 1.03 delete_range=100

1.11 1.07 insert_range=100

0.96 0.97 read-write_range=100

0.94 0.95 read-write_range=10

1.08 1.04 update-index_range=100

1.08 1.07 update-inlist_range=100

1.09 1.04 update-nonindex_range=100

1.04 1.04 update-one_range=100

1.07 1.04 update-zipf_range=100

1.03 1.02 write-only_range=10000

Thursday, January 9, 2025

Sysbench performance over time for InnoDB and MyRocks: part 4

This is part 4 in my (possibly) final series on performance regressions in MySQL using cached sysbench as the workload. For previous posts, see part 1, part 2 and part 3. This post covers performance differences between InnoDB in upstream MySQL 8.0.32, InnoDB in FB MySQL 8.0.32 and MyRocks in FB MySQL 8.0.32 using a server with 32 cores and 128G of RAM.

I don't claim that the MyRocks CPU overhead isn't relevant, but this workload (CPU-bound, database is cached) is a worst-case for it.

tl;dr

InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
Fixing bug 1506 is important for InnoDB in FB MySQL
MyRocks is ~30% slower than upstream InnoDB at low concurrency and ~45% slower at high, as it uses ~1.5X more CPU/query
For writes, MyRocks does worse at high concurrency than at low

Updates: For writes, MyRocks does worse at high concurrency than at low

I looked at vmstat metrics for the update-nonindex benchmark and the number of context switches per update is about 1.2X larger for MyRocks vs InnoDB at high concurrency.

Then I looked at PMP stacks and MyRocks has more samples for commit processing. The top stacks are here. This should not be a big surprise because MyRocks does more work at commit time (pushes changes from a per-session buffer into the memtable). But I need to look at this more closely.

I browsed the code in Commit_stage_manager::enroll_for, which is on the call stack for the mutex contention, and it is kind of complicated. I am trying to figure out how many mutexes are locked in there and figuring that out will take some time.

Benchmark, Hardware

Much more detail on the benchmark and hardware is here. I am trying to avoid repeating that information in the posts that follow.

Results here are from the c32r128 server with 32 CPU cores and 128G of RAM. The benchmarks were repeated for 1 and 24 threads. On the charts below that is indicated by NT=1 and NT=24.

Builds

The previous post has more detail on the builds, my.cnf files and bug fixes.

The encoded names for these builds are:

my8032_rel_o2nofp

InnoDB from upstream MySQL 8.0.32

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971

FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1. This supports InnoDB and MyRocks.

fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1 with patches applied for bugs 1473, 1481, 1482 and 1506. This supports InnoDB and MyRocks.

The my.cnf files are:

my.cnf.cz11a_$x for InnoDB from upstream MySQL for c8r16, c8r32, c24r64, c32r128
my.cnf.cia1_$x for InnoDB from FB MySQL for c8r16, c8r32, c24r64, c32r128
my.cnf.cza2_$x for MyRocks from FB MySQL for c8r16, c8r32, c24r64, c32r128

Relative QPS

The charts and summary statistics that follow use a number that I call the relative QPS (rQPS) where:

rQPS is: (QPS for my version) / (QPS for base version)
base version is InnoDB from upstream MySQL 8.0.32 (my8032_rel_o2nofp)
my version is one of the other versions

Results

The microbenchmarks are split into three groups: point queries, range queries, writes. The tables below have summary statistics for InnoDB and MyRocks using the relative QPS (the same data as the charts).

Results are provided in two formats: charts and summary statistics. The summary statistics tables have the min, max, average and median relative QPS per group (group = point, range and writes).
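The per-group summary statistics can be sketched as follows. This is a minimal illustration, assuming the relative QPS values have already been grouped by microbenchmark type; the function name and the sample values are hypothetical and are not results from this post.

```python
from statistics import mean, median

def summarize(rqps_by_group: dict) -> dict:
    """Compute min, max, average and median relative QPS for each group."""
    summary = {}
    for group, values in rqps_by_group.items():
        summary[group] = {
            "min": min(values),
            "max": max(values),
            "average": round(mean(values), 2),
            "median": round(median(values), 2),
        }
    return summary

# Hypothetical relative QPS values per group.
example = {
    "point": [0.89, 0.91, 0.96],
    "range": [0.63, 0.82, 0.93, 0.90],
}

stats = summarize(example)
for group, row in stats.items():
    print(group, row["min"], row["max"], row["average"], row["median"])
```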

The spreadsheets and charts are also here. I don't know how to prevent the microbenchmark names on the x-axis from getting truncated in the png files I use here but they are easier to read on the spreadsheet.

The charts use NT=1, NT=16 and NT=24 to indicate whether sysbench was run with 1, 16 or 24 threads. The charts and table use the following abbreviations for the DBMS versions:

fbinno-nofix

InnoDB from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971

fbinno-somefix

InnoDB from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

myrocks-nofix

MyRocks from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971

myrocks-somefix

MyRocks from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

Summary statistics: InnoDB

Summary:

InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
Fixing bug 1506 is important for InnoDB in FB MySQL

1 thread

fbinno-nofix	min	max	average	median
point	0.89	0.96	0.92	0.91
range	0.63	0.93	0.82	0.82
writes	0.86	0.98	0.89	0.88

fbinno-somefix	min	max	average	median
point	0.92	1.00	0.96	0.95
range	0.89	0.96	0.91	0.91
writes	0.89	0.99	0.92	0.92

24 threads

fbinno-nofix	min	max	average	median
point	0.92	0.96	0.94	0.94
range	0.62	0.96	0.81	0.82
writes	0.84	0.94	0.88	0.87

fbinno-somefix	min	max	average	median
point	0.94	0.99	0.97	0.98
range	0.78	0.99	0.89	0.91
writes	0.86	0.95	0.90	0.88

Summary statistics: MyRocks

Summary:

MyRocks does better at low concurrency than at high. The fix might be as simple as enabling the hyper clock block cache
MyRocks is ~30% slower than upstream InnoDB at low concurrency and ~45% slower at high
For writes, MyRocks does worse at high concurrency than at low

1 thread

myrocks-nofix	min	max	average	median
point	0.52	0.75	0.66	0.68
range	0.37	0.72	0.60	0.60
writes	0.65	1.21	0.79	0.73

myrocks-somefix	min	max	average	median
point	0.51	0.79	0.68	0.70
range	0.43	0.76	0.62	0.61
writes	0.66	1.23	0.80	0.74

24 threads

myrocks-nofix	min	max	average	median
point	0.40	0.76	0.49	0.43
range	0.40	0.71	0.58	0.60
writes	0.44	1.37	0.65	0.55

myrocks-somefix	min	max	average	median
point	0.48	0.77	0.55	0.51
range	0.43	0.71	0.60	0.60
writes	0.45	1.39	0.66	0.55

Results: c32r128 with InnoDB and point queries

Summary

InnoDB from FB MySQL is no worse than 10% slower than upstream

Results: c32r128 with MyRocks and point queries

Summary

at low concurrency the worst case for MyRocks is the set of tests that do point lookups on secondary indexes, because that uses a range scan rather than a point lookup on the LSM tree, which means that bloom filters cannot be used
at high concurrency the difference between primary and secondary index queries is less significant. Perhaps this is dominated by mutex contention from the LRU block cache and would be solved by using hyper clock

Results: c32r128 with InnoDB and range queries

Summary

the worst case for InnoDB from FB MySQL are the long range scans and fixing bug 1506 will be a big deal

Results: c32r128 with MyRocks and range queries

Summary

while long range scans are the worst case here, bug 1506 is not an issue as that is InnoDB-only

Results: c32r128 with InnoDB and writes

Summary

results are stable here, InnoDB from FB MySQL is no worse than ~10% slower than upstream but results at high concurrency are a bit worse than at low

Results: c32r128 with MyRocks and writes

Summary

MyRocks does much better than InnoDB for update-index because it does blind writes rather than RMW for non-unique secondary index maintenance
otherwise, MyRocks does worse at high concurrency than at low

Sysbench performance over time for InnoDB and MyRocks: part 3

This is part 3 in my (possibly) final series on performance regressions in MySQL using cached sysbench as the workload. For previous posts, see part 1 and part 2. This post covers performance differences between InnoDB in upstream MySQL 8.0.32, InnoDB in FB MySQL 8.0.32 and MyRocks in FB MySQL 8.0.32 using a server with 24 cores and 64G of RAM.

I don't claim that the MyRocks CPU overhead isn't relevant, but this workload (CPU-bound, database is cached) is a worst-case for it.

tl;dr

InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
MyRocks is ~35% slower than InnoDB from upstream as it uses ~1.5X more CPU/query
Fixing bug 1506 is important for InnoDB in FB MySQL
For writes, MyRocks does worse at high concurrency than at low
while MyRocks does much better than InnoDB for update-index at 1 thread, that benefit goes away at 16 threads

Benchmark, Hardware

Much more detail on the benchmark and hardware is here. I am trying to avoid repeating that information in the posts that follow.

Results here are from the c24r64 server with 24 CPU cores and 64G of RAM. The benchmarks were repeated for 1 and 16 threads. On the charts below that is indicated by NT=1 and NT=16.

Builds

The previous post has more detail on the builds, my.cnf files and bug fixes.

The encoded names for these builds are:

my8032_rel_o2nofp

InnoDB from upstream MySQL 8.0.32

fbmy8032_rel_o2nofp_end_241023_ba9709c9_971

FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1. This supports InnoDB and MyRocks.

fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

FB MySQL 8.0.32 at git hash ba9709c9 (as of 2024/10/23) using RocksDB 9.7.1 with patches applied for bugs 1473, 1481, 1482 and 1506. This supports InnoDB and MyRocks.

The my.cnf files are:

my.cnf.cz11a_$x for InnoDB from upstream MySQL for c8r16, c8r32, c24r64, c32r128
my.cnf.cia1_$x for InnoDB from FB MySQL for c8r16, c8r32, c24r64, c32r128
my.cnf.cza2_$x for MyRocks from FB MySQL for c8r16, c8r32, c24r64, c32r128

Relative QPS

The charts and summary statistics that follow use a number that I call the relative QPS (rQPS) where:

rQPS is: (QPS for my version) / (QPS for base version)
base version is InnoDB from upstream MySQL 8.0.32 (my8032_rel_o2nofp)
my version is one of the other versions

Results

Results are provided in two formats: charts and summary statistics. The summary statistics tables have the min, max, average and median relative QPS per group (group = point, range and writes).

The charts use NT=1, NT=16 and NT=24 to indicate whether sysbench was run with 1, 16 or 24 threads. The charts and table use the following abbreviations for the DBMS versions:

fbinno-nofix

InnoDB from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971

fbinno-somefix

InnoDB from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

myrocks-nofix

MyRocks from fbmy8032_rel_o2nofp_end_241023_ba9709c9_971

myrocks-somefix

MyRocks from fbmy8032_rel_o2nofp_241023_ba9709c9_971_bug1473_1481_1482_1506

Summary statistics: InnoDB

Summary:

InnoDB from FB MySQL is no worse than ~10% slower than InnoDB from upstream
Fixing bug 1506 is important for InnoDB in FB MySQL

1 thread

fbinno-nofix	min	max	average	median
point	0.88	1.01	0.94	0.95
range	0.68	0.97	0.83	0.83
writes	0.86	0.95	0.90	0.89

fbinno-somefix	min	max	average	median
point	0.94	1.05	0.97	0.96
range	0.88	1.03	0.92	0.91
writes	0.88	0.96	0.92	0.93

16 threads

fbinno-nofix	min	max	average	median
point	0.93	0.96	0.94	0.94
range	0.65	0.95	0.83	0.85
writes	0.88	0.94	0.91	0.91

fbinno-somefix	min	max	average	median
point	0.94	0.97	0.95	0.95
range	0.85	0.96	0.91	0.91
writes	0.89	0.96	0.92	0.91

Summary statistics: MyRocks

Summary

MyRocks does better at low concurrency than at high. The fix might be as simple as enabling the hyper clock block cache
MyRocks is ~35% slower than upstream InnoDB
For writes, MyRocks does worse at high concurrency than at low

1 thread

myrocks-nofix	min	max	average	median
point	0.46	0.78	0.67	0.70
range	0.48	0.73	0.63	0.64
writes	0.65	1.49	0.81	0.73

myrocks-somefix	min	max	average	median
point	0.46	0.78	0.66	0.69
range	0.51	0.73	0.65	0.64
writes	0.66	1.54	0.82	0.74

16 threads

myrocks-nofix	min	max	average	median
point	0.52	0.77	0.63	0.63
range	0.46	0.73	0.63	0.61
writes	0.51	1.01	0.67	0.61

myrocks-somefix	min	max	average	median
point	0.55	0.79	0.63	0.62
range	0.53	0.74	0.65	0.65
writes	0.50	1.01	0.67	0.62

Results: c24r64 with InnoDB and point queries

Summary

results are stable here, InnoDB from FB MySQL is no worse than 10% slower than upstream

Results: c24r64 with MyRocks and point queries

Summary

the worst case for MyRocks is the set of tests that do point lookups on secondary indexes, because that uses a range scan rather than a point lookup on the LSM tree, which means that bloom filters cannot be used

Results: c24r64 with InnoDB and range queries

Summary

the worst case for InnoDB from FB MySQL are the long range scans and fixing bug 1506 will be a big deal

Results: c24r64 with MyRocks and range queries

Summary

while long range scans are the worst case here, bug 1506 is not an issue as that is InnoDB-only

Results: c24r64 with InnoDB and writes

Summary

results are stable here, InnoDB from FB MySQL is no worse than ~10% slower than upstream

Results: c24r64 with MyRocks and writes

Summary

while MyRocks does much better than InnoDB for update-index at 1 thread, that benefit goes away at 16 threads. It does better at update-index because it does blind writes rather than RMW for non-unique secondary index maintenance. Perhaps the issue at high concurrency is memory system stalls because this server has 2 sockets.
