Small Datum: leveldb

Showing posts with label leveldb. Show all posts

Wednesday, October 14, 2020

Comments on: The Unwritten Contract of Solid State Drives

The Unwritten Contract of Solid State Drives by Jun He, Sudarsun Kannan, Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau was published in EuroSys 2017. The paper is interesting and worth reading. I hope for more papers on this topic and recently shared a review on another paper from UW-Madison on this topic. This tries to be a brief review of the paper. I only review papers that I think are useful to me and others.

I didn't notice this paper until recently. While I wish I didn't miss interesting papers like that, it is hard for me to keep up with so many great papers getting published. Regardless, I assume that the RocksDB team is open to discussions with people doing research in this space. I am too and this post was half-serious.

5 rules

The paper presents five rules that are frequently true for an SSD and explains that all rules aren't true for all SSDs. Following the rules often leads to better SSD performance.

The objective functions in the paper are throughput and write amplification. Quality of service (QoS) wasn't included and maximizing throughput can hurt QoS. Many applications want maximum throughput that respects QoS where QoS can be defined by p99 and worst-case response times. So I hope that a future paper includes QoS. The paper explains that it focuses on internal states of the SSD rather than end-to-end performance, but QoS can also be measured via internal states because that is where many stalls happen (read delayed because slower program or erase in progress).

There is also a difference between using all device throughput that a workload needs vs using all of the device throughput. Saturating device throughput isn't a goal. Getting as much throughput as needed for a workload is a goal. A device offers several things -- capacity, throughput and low-latency operations. It is hard to saturate both capacity and throughput while not hurting QoS.

The rules are:

Request scale - issue large requests or concurrent small requests to get more throughput and benefit from internal parallelism of the SSD
Locality - access with locality (temporal and/or spatial). Possible benefits include reduced translation cache misses, reduced data cache misses and reduced write-amp from flash GC
Aligned sequentiality - write files sequentially using large write requests. For an FTL that can do hybrid block mapping this requires fewer entries in the translation table.
Grouping by death time - do writes so that all data in an erase block is discarded at the same time.
Uniform lifetime - structure data so that all writes have a similar lifetime. Note that this is stricter version of grouping by death time and unlikely to be followed by a DBMS except for workloads that are write-once and then read-only.

The paper states that an SSD might not require following all of the rules. Alas, it can be difficult to determine which rules should be followed and for a given device, or even a given firmware version.

A summary of the Linux block IO layer is useful background reading.

5 Rules and an LSM

Can an LSM follow these rules?

Request scale

Compaction reads with RocksDB are a (compressed) block at a time and the uncompressed block size is likely ~4k, ~8k or ~16k. The block sizes are approximate, usually < the target size and not aligned. The compaction_readahead_size option in RocksDB can be set to get large read requests otherwise you rely on filesystem readahead and it is risky to make read_ahead_kb large because that is per-device.
Compaction writes with RocksDB are a page at a time. Each SST is written sequentially and fsync is called when the SST is full (usually ~64M). Async write-behind can be enabled via the bytes_per_sync to avoid a storm of writes on fsync and to get large write requests of size bytes_per_sync.
WAL writes can use the wal_bytes_per_sync option
For user reads, many concurrent users are the easy way to get concurrent small reads. I am less certain about options for doing prefetch or getting parallel IO requests for a user query. Back when I did small data at FB getting concurrent IO for one user request wasn't a priority because the focus was sharing a device across many users.

Locality

Compaction reads & writes respect this as they read & write files sequentially.
Whether user reads respect this depends on the workload.

Aligned sequentiality - compaction writes respect this
Grouping by death time

I know this via the multistream effort and even collaborated on an evaluation. The results in my eval weren't significant so I should read papers by others to understand where it was a big deal.
Grouping by space requires an API like multistream. With that it is easy to group for an LSM. The upper/smaller levels of the LSM tree have shorter lifetimes than the lower/larger levels.
It is usually impossible to do this without an API. Because I don't know how many erase blocks are concurrently open. Nor do I know the size of an erase block. I risk using the wrong name as I am not an expert on this but I mean the logical erase block that is managed by the FTL not the physical erase block that is per NAND chip. The logical erase block is striped over many NAND chips for a few reasons, including to prevent the loss of data when a NAND chip fails. This structure is rarely explained by vendors.

Uniform lifetime - I doubt any DBMS can follow this except for data that is write once.

Workloads

The paper then runs workloads for SQLite, LevelDB and RocksDB using SSD simulators to determine whether the rules are followed. The workloads used db_bench for RocksDB however there weren't enough details on the command line options or RocksDB configuration. One of the comments mentions that most of the activity was between the L0 and L1 in the LSM tree and I wonder if the databases were too small and didn't have enough LSM tree levels.

Observations

The paper then has many interesting observations on whether the rules were respected and explains cases where they weren't. Things that prevent the rules from being followed include the DBMS, the filesystem and the workload. I have comments on some of the observations

Observation 1 - Application log structure increases the scale of write size. This is expected with an LSM. I appreciate that this paper distinguishes between IO patterns for the application (LSM writes SST files sequentially) and for the device (concurrent compaction means that a device interleaves large write requests from multiple files).
Observation 2 - The scale of read requests is low. It is hard to know whether this is a function of the system software (bottlenecks & missing features) or the workload (working set cached, not many concurrent users).
Observation 3 - RocksDB doesn't consume enough device bandwidth. This is hard to appreciate without more details on the benchmark setup. Note that I don't care about saturating device throughput unless QoS is also a goal.
Observation 4 - Frequent data barriers limit request scale. The obvious one is fsync/fdatasync although concurrency usually overcomes that. For reads, prefetching and readahead can overcome but this wasn't a priority for the workloads I cared about because they had many concurrent users. There are things that can be done, and might already be in RocksDB, to get more concurrent IO reads to a device when apps have low concurrency.
Observation 5 - Linux buffered I/O limits request scale. Fortunately, RocksDB can also run with O_DIRECT. I wasn't clear on the claim that read_ahead_kb limits the size of a read. Perhaps this is true for reads smaller than read_ahead_kb. Perhaps I can ask the block IO expert. For compaction, the compaction_readahead_size option can get larger reads for compaction. For scans I am not sure if RocksDB has a way to get readahead (multi-page reads) when needed.
Observation 8 - be wary of deferring discard. The variety of discard performance across devices is large and not well documented. I hope more research is done on this.
Observation 9 - SSDs demand accurate and aggressive prefetching. I am wary of inaccurate and aggressive prefetching. An LSM can benefit from prefetching for compaction reads (RocksDB has an option for that). Prefetching for an LSM during user reads is interesting. For a point query prefetching the lower levels of the LSM tree can be wasted work when the data is found in the upper levels of the tree. Prefetching for a scan is easier because data from all levels must be read. However, the amount of readahead to use during a scan is more complicated because the LSM usually isn't told how large the scan will be (first 10 rows vs all rows). There might be options in RocksDB to get parallel reads on behalf of a single user query.
Observation 17 - application log structuring does not reduce garbage collection. I am skeptical about this. Also, the paper shows that it doesn't prevent GC. It does not show that it doesn't reduce it. I have results from 2016 to show that flash GC write-amp from MyRocks was less than from InnoDB. Of course, that is specific to one device. Regardless I appreciate that the paper explains some of the issues including that a device interleaves large write requests from many files with concurrent compaction. I am wary of the misleading sequential writes message that is repeated in too many papers.

Tuesday, January 22, 2019

Less "mark" in MySQL benchmarking

My goal for the year is more time learning math and less time running MySQL benchmarks. I haven't done serious benchmarks for more than 12 months. It was a great experience but I want to learn new things. MySQL 8.0.14 has been released with fixes for a serious bug I found via the insert benchmark. I won't confirm whether it has been fixed. I hope someone else does.

My tests and methodology are described in posts for sysbench, linkbench and the insert benchmark. I hope the upstream distros (MySQL, MariaDB, Percona) repeat my tests and methodology and I am happy to answer questions about that. I even have inscrutable shell scripts that make it easy to run the tests. Despite being a lousy example of how to use Bash, they are portable enough to run on my home and work hardware.

Monday, January 21, 2019

Optimal configurations for an LSM and more

I have been trying to solve the problem of finding an optimal LSM configuration for a given workload. The real problem is larger than that, which is to find the right index structure and the right configuration for a given workload. But my focus is RocksDB so I will start by solving for an LSM.

This link is to slides that summarizes my effort. I have expressed the problem to be solved using differentiable functions to express the cost that is to be minimized. The cost functions have a mix of real and integer valued parameters for which values must be determine to minimize the cost. I have yet to solve the functions, but I am making progress and learning more math. This might be a constrained optimization problem and Lagrange Multipliers might be useful. The slides are from a talk I am about to present at the MongoDB office in Sydney where several WiredTiger developers are based. I appreciate that Henrik Ingo set this up.

My work has things in common with the excellent work by Harvard DASlab lead by Stratos Idreos. I have years of production experience on my side, they have many smart and ambitious people on their side. There will be progress. I look forward to more results from their Data Calculator effort. And I have learned a lot from the Monkey and Dostoevsky papers by Niv Dayan et al.

Tuesday, January 15, 2019

Geek code for LSM trees

This is a link to slides from my 5-minute talk at the CIDR 2019 Gong Show. The slides are a brief overview of the geek code for LSM trees. If you click on the settings icon in the slide show you can view the speaker notes which have links to blog posts that have more details. I also pasted the links below. Given time I might add to this post, but most of the content is in my past blog posts. Regardless I think there is more to be discovered about performant, efficient and manageable LSM trees.

The key points are there are more compaction algorithms to discover, we need to make it easier to describe them and compaction is a property of a level, not of the LSM tree.

Links to posts with more details:

Describing tiered and leveled compaction
Number of levels that minimized write amplification
Combining tiered and leveled compaction
Tiered vs leveled, why not both
Name that compaction algorithm
Original LSM paper that got this started
Review of SlimDB with references to the first tiered compaction, Stepped Merge

Thursday, January 10, 2019

LSM math: fixing mistakes in my last post

My last post explained the number of levels in an LSM that minimizes write amplification using 3 different estimates for the per-level write-amp. Assuming the per-level growth factor is w then the 3 estimates were approximately w, w+1 and w-1 and named LWA-1, LWA-2 and LWA-3 in the post.

I realized there was a mistake in that post for the analysis of LWA-3. The problem is that the per-level write-amp must be >= 1 (and really should be > 1) but the value of w-1 is <= 1 when the per-level growth factor is <= 2. By allowing the per-level write-amp to be < 1 it easy to incorrectly show that a huge number of levels reduces write-amp as I do for curve #3 in this graph. While I don't claim that (w-1) or (w-1)/2 can't be a useful estimate for per-level write-amp in some cases, it must be used with care.

Explaining LWA-3

The next challenge is to explain how LWA-3 is derived. That comes from equation 12 on page 9 of the Dostoevsky paper. Start with the (T-1)/(K+1) term and with K=1 then this is (T-1)/2. T in the paper is the per-level growth factor so this is the same as (w-1)/2. The paper mentions that this is derived using an arithmetic series but does not show the work. I show my work but was not able to reproduce that result.

Assume that the per-level growth factor is w, all-to-all compaction is used and the LSM tree has at least 3 levels. When full L1 has size 1, L2 has size w and L3 has size w*w. There are four derivations below - v1, v2, v3, v4. The results are either w/2 or (w+1)/2 which doesn't match (w-1)/2 from the paper. Fortunately, my previous post shows how to minimize total write-amp assuming the per-level write-amp is w/2 or (w+1)/2. I will contact the author to figure out what I am missing.

The analysis below is for merges from L1 to L2, but it holds for merges from Ln to Ln+1. I think that v1 and v2 are correct and their estimate for per-level write-amp is (w+1)/2. As explained below I don't think that v3 or v4 are correct, their estimate for per-level write-amp is w/2.

I have yet to explain how to get (w-1)/2.

v1

Assume that merges are triggered from Ln to Ln+1 when a level is full -- L1 has size 1, L2 has size w, L3 has size w*w. A level is empty immediately after it is merged into the next level. So L2 gets full, then is merged into L3 and becomes empty, then slowly gets larger as L1 is merged into it w times. The per-level write-amp from this is (w+1)/2.

* merges into L2 write output of size 1, 2, ..., w
* then L2 is full
* sum of that sequence -> w*(w+1)/2
* average value is sum/w -> (w+1)/2

1) Moving data of size 1 from L1 to L2 writes (w+1)/2 on average
2) Therefore per-level write-amp for L1 -> L2 is (w+1)/2

Note that per-level write-amp is (avg merge output to Ln / size of Ln-1)
* avg merge output to L2 is (w+1)/2
* size of Ln-1 is 1

v2

Assume that merges are triggered from Ln to Ln+1 when a level is almost full -- L1 has size 1 * (w-1)/w, L2 has size w * (w-1)/w, L3 has size (w*w) * (w-1)/w. The trigger conditions can be reduced to L1 has size (w-1)/w, L2 has size (w-1) and L3 has size w*(w-1).

This assumes that w merges are done from L1 to L2 for L2 to go from empty to full. Each merge adds data of size (w-1)/w because L1:L2 merge is triggered when L1 has that much data. Thus L2 has size (w-1) after w merges into it at which point L2:L3 merge can be done. The per-level write-amp from this is the same as it was for v1.

* merges into L2 write output of size (w-1)/w * [1, 2, ..., w]
* then L2 is full
* sum of that sequence -> (w-1)/w * w*(w+1)/2 = (w-1)(w+1)/2
* average value is sum/w -> (w-1)(w+1)/(2*w)

As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)
* avg merge output to L2 = (w-1)(w+1)/(2*w)
* size of L1 = (w-1)/w

start with: ( (w-1)(w+1)/(2*w) ) / ( (w-1)/w )
simplify to: (w+1)/2

v3

Merges are triggered the same as for v1 but I assume that only w-1 merges are done from Ln to Ln+1 rather than w. Ln+1 won't be full at the end of that, for example L2 would have size w-1 rather than the expected size w. But I was curious about the math. The per-level write-amp is w/2.

* merges into L2 write output of size 1, 2, ..., w-1
* sum of that sequence -> (w-1)*w/2
* average value is sum/(w-1) -> w/2

1) Moving data of size 1 from L1 to L2 writes w/2 on average
2) Therefore per-level write-amp for L1 -> L2 is w/2

v4

Merges are triggered the same as for v2. But as with v3, only w-1 merges are done into a level. Again I don't think this is correct because a level won't have enough data to trigger compaction at that point. The per-level write-amp here is the same as for v3.

* merges into L2 write output of size (w-1)/w * [1, 2, ..., w-1]
* sum of that sequence -> (w-1)/w * (w-1)*w/2 = (w-1)(w-1)/2
* average value is sum/(w-1) -> (w-1)/2

As from v1, per-level write-amp is (avg merge output to Ln / size of Ln-1)
* avg merge output to L2 = (w-1)/2
* size of L1 = (w-1)/w

start with: ( (w-1)/2 ) / ( (w-1)/w )
simplify to: w/2

Wednesday, January 9, 2019

LSM math: revisiting the number of levels that minimizes write amplification

I previously used math to explain the number of levels that minimizes write amplification for an LSM tree with leveled compaction. My answer was one of ceil(ln(T)) or floor(ln(T)) assuming the LSM tree has total fanout = T where T is size(database) / size(memtable).

Then I heard from a coworker that the real answer is less than floor(ln(T)). Then I heard from Niv Dayan, first author of the Dostoevsky paper, that the real answer is larger than ceil(ln(T)) and the optimal per-level growth factor is ~2 rather than ~e.

All of our answers are correct. We have different answers because we use different functions to estimate the per-level write-amp. The graph of the functions for total write-amp using the different cost functions is here and you can see that the knee in the curve occurs at a different x value for two of the curves and the third curve doesn't appear to have a minimum.

While working on this I learned to love the Lambert W function. But I wonder whether I made the math below for LWA-2 harder than necessary. I am happy to be corrected. I appreciate the excellent advice on Quora: here, here and here. The online graphing calculator Desmos is another great resource.

Math

I use differentiable functions to express the total write-amp as a function of the number of levels, then determine the value (number of levels) at which the first derivative is zero as that might be the global minimum. Constants, variables and functions below include:

T - total fanout, = size(database) / size(memtable)
n - number of levels in the LSM tree
LWA, LWA-x - function for the per-level write-amp
TWA, TWA-x - function for the total write-amp, = n * LWA
w - per-level growth factor, = T^(1/n) for all levels to minimize write-amp

The function for total write-amp has the form: TWA = n * LWA where n is the number of levels and LWA is the per-level write-amp. LWA is a function of T and n. The goal is determine the value of n at which TWA is minimized. While n must be an integer the math here doesn't enforce that and the result should be rounded up or down to an integer. T is a constant as I assume a given value for total fanout. Here I use T=1024.

I wrote above that the 3 different answers came from using 3 different estimates for the per-level write-amp and I label these LWA-1, LWA-2 and LWA-3. When w is the per-level growth factor then the per-level write-amp functions are:

LWA-1 = w -- I used this to find that the best n = ceil(ln(T)) or floor(ln(T))
LWA-2 = w + 1 -- with this the best n is less than that found with LWA-1
LWA-3 = (w - 1) / 2 -- with this the best n is greater than that found with LWA-1

I can also state the per-level write-amp functions directly with T and n. I didn't above to make it easier to see the differences.

LWA-1 = T^(1/n)
LWA-2 = T^(1/n) + 1
LWA-3 = (T^(1/n) - 1) / 2

Explaining LWA

First I explain LWA-1 and LWA-2. Compacting 1 SST from Ln to Ln+1 requires merging 1 SST from Ln with ~w SSTs from Ln+1 where w=10 by default with RocksDB. The output will be between w and w+1 SSTs. If the output is closer to w then LWA-1 is correct. If the output is closer to w+1 then LWA-2 is correct. This paper explains why the per level write-amp is likely to be less than w. Were I to use f*w where f < 1 for LWA-1 then the math still holds. Maybe that is a future blog post.

LWA-3 assumes that all-to-all compaction is used rather than some-to-some. I explain the difference here. RocksDB/LevelDB leveled uses some-to-some but all-to-all is interesting. With all-to-all when compaction from Ln to Ln+1 finishes then Ln is empty and slowly gets full after each merge into it. Assume the per-level growth factor is w and Ln-1, Ln and Ln+1 are full at sizes 1, w and w*w. Then Ln becomes full after w merges from Ln-1 and those write output of size 1, 2, ..., w-1, w. The sum of the first w integers is w(w+1)/2. Divide this by w to get the averge -- (w+1)/2. However above LWA-3 is (w-1)/2 not (w+1)/2. I will explain that in another blog post. Note that in LWA-3 the numerator, w-1, is more interesting than the denominator, 2. Dividing by any constant doesn't change where the minimum occurs assuming there is a minimum and that is visible on this graph that shows the impact of dividing by 2 on the total write-amp.

Read on to understand the impact of using w-1, w or w+1 as the function for per-level write-amp. The difference might be more significant than you expect. It surprised me.

Minimizing TWA

This graph shows the total write-amp for LWA-1, LWA-2 and LWA-3. I call the total write-amp TWA-1, TWA-2 and TWA-3. Two of the curves, for TWA-1 and TWA-2, appear to have a minimum. One occurs for x between 4 and 6, the other for x between 6 and 8. The third curve, for TWA-3, doesn't appear to have a minimum and is decreasing as x (number of levels) grows.

The next graph uses the first derivative for the total write-amp functions, so it is for TWA-1', TWA-2' and TWA-3'. A global minimum for TWA-x can occur when TWA-x' = 0 and from the graph TWA-1'=0 when x=6.931 and TWA-2'=0 when x=5.422 which matches the estimate from the previous paragraph. From the graph it appears that TWA-3' approaches zero as x gets large but is never equal to zero.

The next step is to use math to confirm what is visible on the graphs.

Min write-amp for LWA-1

See my previous post where I show that n = ln(T) minimizes total write-amp if n isn't limited to an integer and then the per-level growth factor is e. Since the number of levels must be an integer then one of ceil(ln(T)) or floor(ln(T)) minimized total write-amp.

Min write-amp for LWA-2

I can reuse some of the math from my previous post. But this one is harder to solve.

# wa is the total write-amp
# n is the number of levels
# t is the total fanout
wa = n * ( t^(1/n) + 1 )
wa = n*t^(1/n) + n

# the difference between this and the previous post is '+1'
wa' = t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) + 1
wa' = t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1

At this point the difference between this and the previous post is '+1'. But wait this starts to get interesting.
# critical point for this occurs when wa' = 0
t^(1/n) - (1/n) * ln(t) * t^(1/n) + 1 = 0

# multiply by t^(-1/n)
1 - (1/n) * ln(t) + t^(-1/n) = 0

# move some terms to RHS
t^(-1/n) = (1/n) ln(t) - 1
# use ln on LHS and RHS to get rid of '^(1/n)'
ln ( t^(-1/n) ) = ln( (1/n) * ln(t) - 1 )
(-1/n) ln(t) = ln( (1/n) * ln(t) - 1

I got stuck here but eventually made progress.

# let a = (1/n) ln(t) and rewrite
-a = ln(a - 1)

# let x=a-1, a=x+1 and rewrite
-(x+1) = ln(x)

# do e^LHS = e^RHS
e^-(x+1) = e^ln(x)
e^-x * e^-1 = x

# multiply LHS and RHS by e^x
e^-1 = e^x * x

# e^-1 -> (1/e)
(1/e) = e^x * x

At last I can use Lambert W function!

# Given: e^x * x = K, then x = W(K)
x = W(e^-1) ~= 0.27846

# because a=x+1
a ~= 1.27846

# a = (1/n) ln(t) -> n = (1/a) ln(t), t=1024
n = 1/1.27846 * ln(1024)

# The value for n that minimizes total write-amp
# from the graph I claimed that n=5.422. this is close
n = 5.4217

Min write-amp for LWA-3

Update-1 - I think I made a few mistakes here. So you can stop reading until update-2 arrives.

Update-2 - this post explains my mistake and uses math to estimate that per-level write-amp = (w+1)/2 when all-to-all compaction is used. I am still unable to derive (w-1)/2.

I started to work on this without paying attention to the curve for LWA-3'. From the graph it appears to converge to 0 but is always less than 0, TWA-3 is decreasing as x, number of levels, gets large. Therefore make the number of levels as large as possible, 2M or 2B, to minimize total write-amp as visible in this graph.

But more levels in the LSM tree comes at a cost -- more read-amp. And the reduction in write-amp is small when the number of levels increases from 20 to 200 to 2000 to 2M. Again, this is visible in the graph. Besides, if you really want less write-amp then use tiered compaction rather than leveled with too many levels.

The other consideration is the minimal per-level growth factor that should be allowed. If the min per-level growth factor is 2. Then then that occurs when the number of levels, n, is:

# assume total fanout is 1024
2^n = 1024
log2(2^n) = log2(1024)
n = log2(1024) = 10

Alas the total fanout isn't always a power of 2. Given that the number of levels must be an integer then the goal is to use the smallest number of levels such that the per-level growth factor >= 2. Therefore when x isn't limited to an integer there is no answer -- just make x as large as possible (1M, 1B, etc) in which case the per-level growth factor converges to 1 but is always greater than 1.

The above can be repeated where the constraint is either the max number of levels or a different value for the min per-level growth factor (either <2 or >2). Regardless, if LWA-3 is the cost function then total write-amp is minimized by using as many levels as possible subject to these constraints.

Below is some math for LWA-3 and LWA-3'.

# wa is the total write-amp
# n is the number of levels
# t is the total fanoutwa = n * ( t^(1/n) - 1 ) / 2
wa = (n*t^(1/n) - n ) / 2

# the big difference between this and the previous post is '+1'
wa' = [ t^(1/n) + n * ln(t) * t^(1/n) * (-1) * (1/n^2) - 1 ] / 2
wa' = [ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2

# determine when wa' = 0[ t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 ] / 2 = 0

# multiply LHS and RHS by 2t^(1/n) - (1/n) * ln(t) * t^(1/n) - 1 = 0# multiply LHS and RHS by t^(-1/n)
1 - (1/n) * ln(t) - t^(-1/n) = 0

# move last term to RHS
1 - (1/n) * ln(t) = t^(-1/n)

# probably a good idea to stop here
# LHS is likely to be <0 so can't use ln(LHS) = ln(RHS)

Monday, January 7, 2019

Define "better"

Welcome to my first rant of 2019, although I have written about this before. While I enjoy benchmarketing from a distance it is not much fun to be in the middle of it. The RocksDB project has been successful and thus becomes the base case for products and research claiming that something else is better. While I have no doubt that other things can be better I am wary about the definition of better.

There are at least 3 ways to define better when evaluating database performance. The first, faster is better, ignores efficiency, the last two do not. I'd rather not ignore efficiency. The marginal return of X more QPS eventually becomes zero while the benefit of using less hardware is usually greater than zero.

Optimize for throughput and ignore efficiency (faster is better)
Get good enough performance and then optimize for efficiency
Get good enough efficiency and then optimize for throughput

Call to action

I forgot to include this before publishing. Whether #1, #2 or #3 is followed I hope that more performance results include details on the HW consumed to create that performance. How much memory and disk space were used? What was the CPU utilization? How many bytes were read from and written to storage? How much random IO was used? I try to report both absolute and relative values where relative values are normalized by the transaction rate.

Thursday, January 3, 2019

Review of LSM-based Storage Techniques: A Survey

Chen Luo and Mike Carey published a wonderful survey of research on LSM algorithms. They know about LSM because the AsterixDB project includes an LSM. They did a great job explaining the LSM space, telling a coherent story and summarizing relevant papers. Reading this paper was a good use of my time and I found a few more papers to read in their references.

I have read a few papers, including TRIAD, with ideas on reducing write-amp for the smaller levels of the LSM tree. I think this could be done for RocksDB by merging and remerging immutable memtables -- this is similar in spirit to subcompactions for the L0. With a large immutable memtable there would be one less level in the LSM tree. This is an alternative to having an L0, and maybe an L1, that are not made durable. In all cases the cost is a longer MTTR because WAL replay must be done. In all cases there is an assumption that the non-durable levels (large immutable memtables or L0/L1) are in memory.

This is a small complaint from me that I have made in the past. The paper states that an LSM eliminates random IO when making things durable. I prefer to claim that it reduces random IO. With leveled compaction each step merges N (~11) SSTs to generate one steam of output. So for each step there is likely a need to seek when reading the ~11 input streams and writing the output stream. Then compaction steps usually run concurrently when the ingest rate is high so there are more seeks. Then the WAL must be written -- one more stream and a chance for more seeks. Finally user queries are likely to read from storage causing even more seeks. Fortunately, there will be fewer seeks per insert/update/delete compared to a B-Tree.

The paper has a short history of compaction describing pure-tiered and pure-leveled. But these are rarely used in practice. The original LSM paper implemented pure-leveled. LevelDB and RocksDB use a hybrid approach with tiered for the L0 followed by leveled for the remaining levels. Pure-tiered was introduced by the Stepped Merge paper. Using tiered for all levels has a large space-amplification, much larger than 1, because the max level is tiered and that is too much wasted space for many workloads. Tiered in RocksDB and other popular LSM engines can be configured to use leveled compaction into the max level to get a space-amp less than 2, ignoring transient space-amp during compaction into the max level. Pure-tiered was a great choice for Stepped Merge because that was a cache for bulk-loading a data warehouse rather than a full copy of the database. While I think that RocksDB leveled and RocksDB tiered are examples of tiered+leveled, I don't want to rename them.

I appreciate that the paper makes clear that trade-offs must be considered when evaluating benchmarks. Many things can support higher write rates than RocksDB with leveled compaction, including RocksDB with tiered compaction. But that comes at a cost in memory, read and/or space amplification. Some papers could do a better job of documenting those costs.

The cost analysis in section 2.3 is limited to IO costs. I look forward to coverage of CPU costs in future LSM research. The read penalty for an LSM compared to a B-Tree is usually worse for CPU than for IO. The paper uses partitioned and non-partitioned where I use all-to-all and some-to-some to explain the compaction approaches. RocksDB implements some-to-some for leveled and all-to-all for tiered. The paper does a nice job explaining why the per-level write-amp should be less for all-to-all than some-to-some, ignoring write skew. Note that in production the per-level write-amp is almost always less than the per-level growth factor and this paper from Hyeontaek Lim explains why.

For the read IO costs, the paper counts logical IOs rather than physical IOs. Logical IOs are easier to estimate because caches mean that many logical IOs don't cause a physical IO and smaller levels in the LSM tree are usually in cache. There are two ways to consider the cost for a range query -- long vs short range queries or the cost of range seek vs range next. The paper uses the first, I use the second. Both are useful.

I appreciate that the author noticed this. I realize there is pressure to market research and I am not offering to try and reproduce benchmark results, but I have been skeptical about some of the comparisons I see where the base case is InnoDB or RocksDB.

These improvements have mainly been evaluated against a default (untuned) configuration of LevelDB or RocksDB, which use the leveling merge policy with size ratio 10. It is not clear how these improvements would compare against a well-tuned LSM-tree.

The discussion in 3.3.1 on pipelining compaction is interesting but RocksDB already does pipelining. With buffered IO there is support for async read-ahead and async write-behind. Note that the read and write phases can also be CPU-heavy if the cost for decompression on read and compression on write are included, even when the wonderful zstd and lz4 algorithms are used.

A few more comments:

RocksDB has limited support for fractional cascading (from SST to SST). See 3.4.2.
With key-value separation, GC could merge log segments to generate longer ordered log segments over time. This would reduce the range read penalty. See 3.4.2.
LHAM might be the first time-series optimized compaction strategy. See 3.5.
Non-unique secondary index maintenance is already read-free in MyRocks. It has a copy of the row prior to index maintenance, because SQL semantics or because this was an insert. Write-optimized SQL engines can add support for read-free change statements in some cases but that usually means SQL semantics (like modified row count) will be broken. See 3.7.2.
MyRocks already collects statistics during compaction. See 3.7.3.

Monday, January 30, 2017

Compaction stalls: something to make better in RocksDB

In previous results that I shared for the insert benchmark it was obvious that MyRocks throughput is steady when the workload transitions from in-memory to IO-bound. The reason is that non-unique secondary index maintenance is read-free for MyRocks so there are no stalls for storage reads of secondary index pages. Even with the change buffer, InnoDB eventually is slowed by storage reads and by page writeback.

It was less obvious that MyRocks has more variance on both the in-memory and IO-bound insert benchmark tests. I try to be fair when explaining storage engine performance so I provide a few more details here and results for InnoDB in MySQL 5.7.10 & 5.6.26 along with MyRocks from our fork of MySQL 5.6. The binlog was enabled for all tests, fsync-on-commit was disabled and 16 clients inserted 500m or 2b rows into 16 tables in PK order. Each table has 3 secondary indexes. For MyRocks I made one configuration change from the earlier result. I changed level0_slowdown_writes_trigger from 10 to 20 to reduce compaction stalls. This has a potential bad side effect of making queries slower, but this test was insert-only.

For both graphs the y-axis is the average insert rate per 5-second interval and the x-axis is the interval number. I used mstat to collect the data.

The goal is to match InnoDB in MySQL 5.7 in quality of service while providing much better throughput. We have some work to do. On the bright side, this is an opportunity for someone to make RocksDB better.

In-memory

The first graph is for the in-memory workload where all data is cached by the storage engine, nothing is read from storage, but many storage writes are done to persist the changes. InnoDB with MySQL 5.7 is much faster than with 5.6. It also has the least variance. MyRocks has the most variance and that is largely from compaction stalls when there are too many files in level 0 of the LSM tree.

IO-bound

The second graph is for the IO-bound workload. The graph ends first for MyRocks because it sustains the highest average insert rate. But it also has a thick line because of variance. InnoDB from MySQL 5.6 also has a lot of variance.

Tuesday, June 9, 2015

RocksDB & ForestDB via the ForestDB benchmark: cached database

For this test I use a database smaller than RAM so it should remain cached even after space-amplification occurs. Tests were repeated with both a disk array and SSD as the database still needs to be made durable and some engines do more random IO for that. Tests were also run for N=10 and N=20 where that is the number of threads to use for the tests with concurrent readers or writers. The test server has 24 hyperthread cores. All tests used a database with 56M documents and ~100G block cache. All tests also set the ForestDB compaction threshold to 25%.

Disk array, 10 threads

This configuration used a disk array and 10 user threads (N=10) for the concurrency tests. Unlike the IO-bound/disk test the load here was faster for RocksDB. Had more documents been inserted eventually ForestDB would be faster.

RocksDB continues to be faster on the write-only tests (ows.1, ows.n, owa.1, owa.n). I did not spend much time trying to explain the difference.

For the point query test ForestDB is faster at 1 thread but much slower at 10 threads. I think the problem is mutex contention on the block cache. I present stack traces at the end of the post to explain this. For the range query tests RocksDB is always faster because ForestDB has to do more work to get the data and RocksDB benefits from a clustered index.

operations/second for each step

RocksDB ForestDB

load 133336 86453

ows.1 4649 2623

ows.n 11479 8339

pqw.1 63387 102204

pqw.n 531653 397048

rqw.1 503364 24404

rqw.n 4062860 205627

pq.1 99748 117481

pq.n 829935 458360

rq.1 774101 292723

rq.n 5640859 1060490

owa.1 75059 28082

owa.n 73922 27092

The command lines are below. The config used 8 files for ForestDB.

bash rall.sh 56000000 log /disk 102400 64 10 600 3600 1000 1 rocksdb 20 no 1

bash rall.sh 56000000 log /disk 102400 64 10 600 3600 1000 1 fdb 20 no 8

SSD, 10 threads

This configuration used an SSD and 10 user threads (N=10) for the concurrency tests. The results are similar to the results above for the disk array with a few exceptions. RocksDB does worse on the load and write-only tests because the disk array has more IO throughput.

operations/second for each step

RocksDB ForestDB

load 46895 86328

ows.1 2899 2131

ows.n 10054 6665

pqw.1 63371 102881

pqw.n 525750 389205

rqw.1 515309 23648

rqw.n 4032487 203822

pq.1 99894 115806

pq.n 819258 454507

rq.1 756546 294490

rq.n 5708140 1074295

owa.1 30469 22305

owa.n 29563 20671

The command lines are below. The config used 8 files for ForestDB.

bash rall.sh 56000000 log /ssd1 102400 64 10 600 3600 1000 1 rocksdb 20 no 1

bash rall.sh 56000000 log /ssd1 102400 64 10 600 3600 1000 1 fdb 20 no 8

SSD, 20 threads

This configuration used an SSD and 20 user threads (N=20) for the concurrency tests. RocksDB makes better use of the extra concurrency in the workload. In some cases throughput for ForestDB was already limited by mutex contention with N=10 and did not improve here.

operations/second for each step

RocksDB ForestDB

load 46357 85053

ows.1 2987 2082

ows.n 13595 7263

pqw.1 62684 102648

pqw.n 708154 354919

rqw.1 510009 24122

rqw.n 5958109 253565

pq.1 100403 117666

pq.n 1227031 387373

rq.1 761143 294078

rq.n 8337013 1013277

owa.1 30487 22219

owa.n 28972 21487

The command lines are below. The config used 8 files for ForestDB.

bash rall.sh 56000000 log /ssd1 102400 64 20 600 3600 1000 1 rocksdb 20 no 1

bash rall.sh 56000000 log /ssd1 102400 64 20 600 3600 1000 1 fdb 20 no 8

Stack traces and nits

I used PMP to get stack traces to explain performance for some tests. I have traces mutex contention and one other problem. I ended up reproducing one problem by reloading 1B docs into a database.

This thread stack shows mutex contention while creating iterators for range queries. This stack trace was common during the range query tests.

This thread stack shows mutex contention on the block cache during range queries. I am not certain, but I think this was from the point query tests.

This has 3 stack traces to show the stall on the commit code path where disk reads are done. That was a problem for ows.1, ows.n, owa.1 and owa.n.

Monday, June 8, 2015

RocksDB & ForestDB via the ForestDB benchmark: IO-bound and SSD

This test is similar to the previous result except the database was on an SSD device rather than disk. The SSD is a 400G Intel s3700 with 373G of usable space. The device can do more random IO and less sequential IO than the disk array. I ran an extra process to take 82G of RAM via mlock from the 144G RAM server to make it easier to have a database larger than RAM but fit on the SSD.

The test used a database with 600M documents. I reduced the compaction threshold for ForestDB from 50% to 25% to reduce the worst case space-amplification from 2 to 4/3 and get more data into the test database. This change isn't reflected in the configuration template I published in github. RocksDB was configured with an 8G block cache versus 16G for ForestDB. Otherwise the configuration was similar to the IO-bound/disk test.

Results

The difference in load performance is much wider here than on the disk array. I assume that write-amplification was the problem for RocksDB.

The difference in ows.1 and ows.n here is smaller than on the disk array. If ForestDB is doing random disk reads on the commit code path than the impact is much less for SSD because disk read latency is smaller. But RocksDB is still much faster for ows.1, ows.n, owa.1 and owa.n.

RocksDB continues to be faster for the point query tests (pqw.1, pqw.n, pq.1, pq.n). The difference is larger for the single threaded tests and I assume that ForestDB continues to do more disk reads per query. RocksDB is still much faster on the range query tests as explained in the previous post.

Unlike the test with the disk-array, the ForestDB tests with 1 writer thread were able to sustain 1000 writes/second as configured via the rate limit.

operations/second for each step

RocksDB ForestDB

load 24540 81297

ows.1 3616 1387

ows.n 10727 2029

pqw.1 3601 1805

pqw.n 22448 14069

rqw.1 30477 1419

rqw.n 214060 13134

pq.1 3969 2878

pq.n 24562 19133

rq.1 30621 3805

rq.n 230673 23009

owa.1 24742 1967

owa.n 22692 2319

I had to repeat this test several times to find good values for the number of documents in the database and the compaction threshold for ForestDB. I was using 50% at first for the threshold and the database was doubling in size. That doubling, 2X space amplification, is expected with the threshold set to 50% so I reduced it to 25% which should have a worst case space amp of 4/3.

Unfortunately, with 64 database files and one compaction thread the worst case space amplification can be worse than theory predicts. All database files can trigger compaction at the same point in time, but only one will be compacted at a time by the one compaction thread. So others will get much more dead data than configured by the threshold.

I expect ForestDB to do much better when it supports concurrent compaction threads. The results below show the database size per test. There is more variance in the database size with ForestDB especially during the ows.1 and ows.n tests. This can make it harder to most of the available space on a storage device.

Size in GB after each step

RocksDB ForestDB

load 151 228

ows.1 149 340

ows.n 155 353

pqw.1 155 316

pqw.n 155 290

rqw.1 156 262

rqw.n 156 277

pq.1 156 276

pq.n 156 277

rq.1 156 277

rq.n 156 277

owa.1 166 282

owa.n 177 288

Command lines

Command lines for the tests are:

bash rall.sh 600000000 log /ssd1 8192 64 10 600 3600 1000 1 rocksdb 20 no 1

bash rall.sh 600000000 log /ssd1 16384 64 10 600 3600 1000 1 fdb 20 no 64

RocksDB & ForestDB via the ForestDB benchmark: IO-bound and disks

This has performance results for RocksDB and ForestDB using the ForestDB benchmark. The focus for this test is an IO-bound workload with a disk array. The database is about 3X larger than RAM. The server has 24 hyperthread cores, 144G of RAM and 6 disks (10k RPM SAS) using HW RAID 0. Background reading is in a previous post.

While RocksDB does much better in the results here I worked on this to understand differences in performance rather than to claim that RocksDB is superior. Hopefully the results here will help make ForestDB better.

Test setup

The test pattern was described in the previous post. Here I use shorter names for each of the tests:

load - Load
ows.1 - Overwrite-sync-1
ows.n - Overwrite-sync-N
pqw.1 - Point-query-1-with-writer
pqw.n - Point-query-N-with-writer
rqw.1 - Range-query-1-with-writer
rqw.n - Range-query-N-with-writer
pq.1 - Point-query-1
pq.n - Point-query-N
rq.1 - Range-query-1
rq.n - Range-query-N
owa.1 - Overwrite-async-1
owa.n - Overwrite-async-N

I used these command lines with my fork of the ForestDB benchmark:

bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 rocksdb 20 no 1

bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 fdb 20 no 64

The common options include:

load 2B documents
use 32G for the database cache. The server has 144G of RAM.
use N=10 for the tests with concurrency
use a 600 second warmup and then run for 3600 seconds
limit the writer thread to 1000/second for the with-writer tests
range queries fetch ~20 documents
do not use periodic_commit for the load

The RocksDB specific options include:

use a 64M write buffer for all tests
use one LSM tree

The ForestDB specific options include:

use 64 database files to reduce the max file size. This was done to give compaction a better chance of keeping up and to avoid temporarily doubling the size of the database during compaction.

Test results

The first result is the average throughput during the test as the operations/second rate. I have written previously about benchmarketing vs benchmarking and average throughput leaves out the interesting bits like response time variance. Alas, my time to write this is limited too.

ForestDB is slightly faster for the load. Even with rate limiting RocksDB incurs too much IO debt during this load. I don't show it here but the compaction scores for levels 0, 1 and 2 in the LSM were higher than expected given the rate limits I used. We have work-in-progress to fix that.

For the write-only tests (ows.1, ows.n, owa.1, owa.n) RocksDB is much faster than ForestDB. From the rates below it looks like ForestDB might be doing a disk read per write because I can get ~200 disk reads / second from 1 thread. I collected stack traces from other tests that showed disk reads in the commit code path so I think that is the problem here. I will share the stack traces in a future post.

RocksDB does much better on the range query tests (rqw.1, rqw.n, rq.1, rq.n). With ForestDB data for adjacent keys is unlikely to be adjacent in the database file unless it was loaded in that order and not updated after the load. So range queries might do 1 disk seek per document. With RocksDB we can assume that all data was in cache except for the max level of the LSM. And for the max level data for adjacent keys is adjacent in the file. So RocksDB is unlikely to do more than 1 disk seek per short range scan.

I don't have a good explanation for the ~2X different in point query QPS (pqw.1, pqw.n, pq.1, pq.n). The database is smaller with RocksDB, but not small enough to explain this. For pq.1, the single-threaded point-query test, both RocksDB and ForestDB were doing ~184 disk reads/second with similar latency of ~5ms/read. So ForestDB was doing almost 2X more disk reads / query. I don't understand ForestDB file structures well enough to explain that.

It is important to distinguish between logical and physical IO when trying to explain RocksDB IO performance. Logical IO means that a file read is done but the data is in the RocksDB block cache or OS cache. Physical IO means that a file read is one and the data is not in cache. For this configuration all levels before the max level of the LSM are in cache for RocksDB and some of the max level is in cache as the max level has 90% of the data.

For the tests that used 1 writer thread limited to 1000 writes/second RocksDB was able to sustain that rate. For ForestDB the writer thread only did ~200 writes/second.

operations/second for each step

RocksDB ForestDB

load 58137 69579

ows.1 4251 289

ows.n 11836 295

pqw.1 232 123

pqw.n 1228 654

rqw.1 3274 48

rqw.n 17770 377

pq.1 223 120

pq.n 1244 678

rq.1 2685 206

rq.n 16232 983

owa.1 56846 149

owa.n 49078 224

I looked at write-amplification for the ows.1 test. I measured the average rates for throughput and write-KB/second from iostat and divide the IO rate by the throughput as write-KB/update. The IO write-rate per update is about 2X higher with RocksDB.

throughput write-KB/s write-KB/update

RocksDB 4252 189218 44.5

ForestDB 289 6099 21.1

The next result is the size of the database at the end of each test step. Both were stable for most tests but RocksDB had trouble with the owa.1 and owa.n tests. These tests used threshold=50 for ForestDB which allows for up to 2X space amplification per database file. There were 64 database files. But we don't see 2X growth in this configuration.

Size in GB after each step

RocksDB ForestDB

load 498 776

ows.1 492 768

ows.n 500 810

pqw.1 501 832

pqw.n 502 832

rqw.1 502 832

rqw.n 503 832

pq.1 503 832

pq.n 503 832

rq.1 503 832

rq.n 503 832

owa.1 529 832

owa.n 560 832

RocksDB & ForestDB via the ForestDB benchmark, part 1

ForestDB was announced with some fanfare last year including the claim that it was faster than RocksDB. I take comparisons to RocksDB as a compliment, and then get back to work. This year I met the author of the ForestDB algorithm. He is smart and there are more talented people on the ForestDB effort. It would be great to see a MongoDB engine for ForestDB.

ForestDB

I used the ForestDB benchmark to evaluate RocksDB and ForestDB. Results from the tests will be shared in future posts. Here I describe my experience running the tests. I forked the ForestDB benchmark as I made general and RocksDB-specific improvements to it.

ForestDB is tree-structured for the index and log-structured for the data. Everything is copy-on-write so there is one database and no log. This is kind of like Bitcask except Bitcask is hash structured for the index and the index isn't persistent. The write-amplification vs space-amplification tradeoff is configurable via the threshold parameter that sets the percentage of dead data allowed in the database file before the compaction thread will copy out live data into a new file. For a uniform distribution of writes to keys, space-amp is 2 and write-amp is 2 with threshold=50%. Reducing threshold to 25% means that space-amp is 4/3 and write-amp is 4. With leveled compaction in RocksDB we usually get higher write-amplification and lower space-amplification compared to ForestDB.

The ForestDB database can be sharded into many files per process. I am not sure if transactions can span files. There is only one compaction thread regardless of the number of files and during my tests it was easy for compaction to fall behind. Single-threaded compaction might be stalled by file reads, decompression after file reads and compression before file writes. I think that for every key read during compaction, the index must be checked to confirm the key is live, so this can be another reason for compaction to fall behind. I assume work can be done to make this better.

ForestDB benchmark client

I forked the ForestDB benchmark client. When I get more time I will try to push changes upstream. Here I describe changes I made to it using the git commit log:

e30129 - link with tcmalloc to get better performance and change the library link order
1c4ecb - accept benchmark config filename on the command line
3b3a3c - add population:compaction_wait option to determine whether to wait for IO to settle after the load. Their benchmark results include some discussion of whether the test should wait for IO to settle after doing a load. While I don't think the test should wait for all IO to stop, I think it is important to penalize an engine that incurs too much IO debt during the load phase which will make queries that follow much slower. This is still work-in-progress for RocksDB although setting the options hard_rate_limit and soft_rate_limit make this much less of an issue.
aac5ba - add population:load option to determine whether the population (load) step should be done. This replaces a command line flag.
aa467b - adds the function couchstore_optimize_for_load to let an engine optimize for load performance and the RocksDB version uses the vector memtable for loads and otherwise uses the skiplist memtable. This also adds the function couchstore_set_wal to enable/disable redo logging. Finally this adds a better configuration in all cases for RocksDB. I know that it isn't easy to configure RocksDB. The changes include:

set max_bytes_for_level_base to 512M. It is 10M by default. This is the size of the L1.
set target_file_size_base to 32M. It is 2M by default. This is the size of the files used for levels L1 and larger.
enable level_compaction_dynamic_level_bytes. This is a big improvement to the algorithm for leveled compaction. We need a blog post to explain it.
set stats_dump_period_sec to 60. The default is 600 and I want more frequent stats.
set block_size to 8kb. The default is 4kb before compression and I don't want to do ~2kb disk reads. Nor do I want a block with only 4 1kb docs as that means the block index is too large.
set format_version to 2. The default is 0 and the RocksDB block cache can waste memory when this is 0. This might deserve a blog post from the RocksDB team.
set max_open_files to 50000. The default is 5000 and I want to cache all database files when possible.

3934e4 - set hard_rate_limit to 3.0 and soft_rate_limit to 2.5. The default is to not use rate limits. When set these limit how much IO debt a level can incur and then user writes are throttled when the debt is too large. Debt in this case is when the level has too much data. The size of L1 should be max_bytes_for_level_base. Setting the soft_limit to 2.5 means that writes are delayed when it gets 2.5X of the target size. Setting the hard rate_limit to 3.0 means that writes are stopped when it gets to 3.0X of the target size. We have work-in-progress to make throttling much better as it can lead to intermittent stalls (bad p99) for write-heavy workloads. This also disables compression for levels 0 and 1 in the LSM to reduce the chance of compaction stalls.
a6f8d8 - this reduces the duration for which a spin lock is held in the benchmark client. I was getting odd hangs while running valgrind and this change fixed the problem. AFAIK the change is correct too.
cfce06 - use a different RNG seed per thread. Prior to this change all threads start with the same seed and can then generate the same sequence of key values. This can be really bad when trying to get IO-bound databases as it inflates the block cache hit rate. The LevelDB benchmark client has a different version of this problem. It uses a different seed per thread, but the seed per thread is constant. So if you restart the benchmark client then thread N generates the same sequence of keys that it generated on the last run. I fixed this in RocksDB (see the --seed option for db_bench).
5a4de1 - add a script to generate benchmark configuration files for different workload patterns. See the next section for a description of the patterns.
2b85fa - disable RocksDB WAL when operation:write_type=async is used. I reverted this part of the diff later. This also sets level0_slowdown_writes_trigger to 12 and level0_stop_writes_trigger to 16. The defaults were larger.
eae3fa - adds a helper script to run the pattern of tests. This is described in Benchmark pattern.
871e63, e9909a - ForestDB wasn't ever doing fsync/fdatasync during the load phase. This fixes that. With these changes write-amplification is much larger for ForestDB when periodic_commit is enabled and write_type=sync. These changes provide the following behavior for existing config options:

population:periodic_commit - for ForestDB when set there is a commit per insert batch and when not set there is a commit at the end of the load. This has not changed. For RocksDB there is always one write batch per insert batch, but when set the WAL is used and when not set the WAL is not used. This does not control whether fsync/fdatasync are used. When periodic_commit is used then a load might be restartable when it fails in the middle. When not used, then a failure means load must start over.
operation:write_type - when set to sync then fsync/fdatasync are done per commit. This is a change for ForestDB but not for RocksDB.

Benchmark pattern

The script rall.sh runs tests using a pattern that is interesting to me. The pattern is listed below. The first step loads N documents and this takes a variable amount of time. The steps that follow run for a fixed amount of time. All of the tests used uniform random RNG to generate keys. The script also uses "numactl --interleave=all" given that I was using dual socket servers and a lot of RAM in many cases. The script also collects iostat and vmstat to help explain some performance differences.

Load - this is the populate step. The PK value is hashed from an increasing counter (from 1 to N) so the load isn't in PK order. The test is configured to not wait for IO to settle when the load ends and if the engine has too much IO debt then the tests that follow can suffer.
Overwrite-sync-1 - uses one writer thread to update docs. It does an fsync on commit.
Overwrite-sync-N - like Overwrite-sync-1 but uses N writer threads.
Point-query-1-with-writer - uses 1 reader thread that does point queries and 1 writer thread that updates docs. There is a rate limit for the writer. Read performance with some writes in progress is more realistic than without writes in progress especially for write-optimized database engines like RocksDB. More info on that topic is here.
Point-query-N-with-writer - like Point-query-1-with-writer but uses N reader threads.
Range-query-1-with-writer - like Point-query-1-with-writer but does range queries
Range-query-N-with-writer - like Range-query-1-with-writer but uses N reader threads
Point-query-1 - like Point-query-1-with-writer, but without a writer thread
Point-query-N - like Point-query-N-with-writer, but without a writer thread
Range-query-1 - like Range-query-1-with-writer, but without a writer thread
Range-query-N - like Range-query-N-with-writer, but without a writer thread
Overwrite-async-1 - like Overwrite-sync-1 but does not do fsync-on-commit. This was moved to the end of the test sequence because it frequently made compaction get too far behind in ForestDB
Overwrite-async-N - like Overwrite-sync-N but does not do fsync-on-commit. Also moved to the end like Overwrite-async-1.

Benchmark configuration

My test scripts use a template for the benchmark configuration that is specialized for each run. The gen_config.sh script has a few test-specific changes especially for the load test. Changes that I made from one of the published configurations include:

set db_config:bloom_bits_per_key to 10 to use bloom filters with RocksDB
set threads:seed_per_thread to 1 to use different RNG seeds per thread
set body_length:compressibility=50
use operation:batch_distribution=uniform. Eventually I will try others.
use operation:batchsize_distribution=uniform with batchsize 1 (read_batchsize_*_bound) for point queries and a configurable batchsize (iterate_batchsize_*_bound) for range queries. In the Benchmark debugging section I will explain why I do this.

Benchmark debugging

I started with an IO-bound configuration and data on disk and ForestDB was getting about 1.5X more QPS than RocksDB for a read-only point-query workload. I had two servers with identical hardware and this result was reproduced on both servers. So it was time to debug. The test used one reader thread, so this result wasn't ruined by using the same RNG seed for all threads.

I started by confirming that the ratio of disk reads per query were similar for RocksDB and ForestDB. Each did about 1.5 queries per disk read courtesy of cached data. However the average disk read latency reported by iostat was about 1.5X larger for RocksDB. Why was the disk faster for ForestDB?

Next I used strace to see the system calls to read blocks. I learned that page reads for data were 4kb and aligned with ForestDB. They were usually less than 4kb and not aligned with RocksDB. Both use buffered IO. From iostat the average read size was 4kb for ForestDB and about 8kb for RocksDB. Many of the unaligned almost 4kb reads crossed alignment boundaries so the unaligned ~4kb read required an 8kb read from disk. This could be a performance problem on a fast SSD but there is not much difference between 4kb and 8kb reads from a disk array and I re-confirmed that via tests with fio. I also used blktrace to get more details about the IO pattern and learned that my servers were using the CFQ IO scheduler. I switched to deadline.

At this point I am stuck. Somehow ForestDB is getting faster reads from disk or the read pattern wasn't as random as I thought it would be. So I read more of the benchmark client and saw that point reads were done in a batch where the key for the first document per batch was selected at random (call it N) and then the keys for the rest of the batch were N+1, N+2, N+3, etc. This gives ForestDB an advantage when only one database file is used because with M files the reads in one batch might use different files. The load and query code both share a function that converts an integer to a key, so N here is an integer and the key is something that includes a hash of the integer. This means that the key for N=3 is not adjacent to the key for N=4. However, the data file for ForestDB is log structured and immediately after the load the data for N=3 is adjacent to the data for N=4 unless updates were done after the load to either of the documents.

At last I understood the cause. ForestDB was benefiting from fetching data that was adjacent in the file. The benefit came from caching either in the HW RAID device or disks. I changed the test configuration to use batchsize=1 for point queries and then ForestDB and RocksDB began to experience the same average read latency from the disk array.