[RFC] Asynchronous MultiGet by anand1976 · Pull Request #7647 · facebook/rocksdb

anand1976 · 2020-11-06T19:52:17Z

This is a draft PR that implements support for MultiGet using asynchronous IO. The goal is to parallelize file reads in an LSM level by issuing the reads asynchronously for all the files overlapping a MultiGet batch in that level. The user-facing MultiGet API still behaves synchronously; Version::MultiGet takes care of parallelizing the reads and waiting for the results. At a high level, the design is as follows:

A new API, MultiReadAsync, is defined in FSRandomAccessFile. It takes a completion callback and calls it with an array of FSReadResponse when the reads are done. Completion can be driven either by polling (via a newly defined Poll API in FileSystem) or from another thread or a signal handler. MultiReadAsync also returns a type-erased unique_ptr that can hold a pointer to whatever context the callee wants to maintain. RocksDB guarantees that the unique_ptr stays alive at least until the completion callback is called.
To internally track the status of asynchronous reads and wait on them in RocksDB, a set of new primitives is introduced in utilities/async/future.h. The Promise and Future primitives allow functions in RocksDB to return an incomplete value that will be fulfilled at some point in the future. They are similar to std::promise and std::future, but more efficient because they minimize context switches when waiting on multiple futures. This is accomplished by introducing a CollectAll function that's similar to folly::collectAll.
Each actor in the processing of MultiGet in an SST file (TableCache, BlockBasedTableReader, RandomAccessFileReader and PosixRandomAccessFile) needs to maintain some context for the request. To minimize the memory allocation overhead of maintaining that context, utilities/async/context_pool.h provides ThreadLocalContextPool, which maintains per-thread caches of context objects that can be reused.
Add a MultiGetAsync interface in TableReader. The async MultiGet is implemented as a chain of continuations. Version::MultiGet initially calls MultiGetAsync, which can return a continuation callback. The callback, in turn, can return the next callback in the chain, and so on. As of now, there is only one callback in the chain because BlockBasedTable only uses async reads for data blocks, but this can be transparently extended to partitioned filter blocks and partitioned index blocks in the future.
Split MultiGet and RetrieveMultipleBlocks in BlockBasedTable into MultiGetAsync/MultiGetAsyncStage2 and RetrieveMultipleBlocks/RetrieveMultipleBlocksStage2 respectively. Similarly, split MultiRead in RandomAccessFileReader into MultiReadAsync/MultiReadAsyncStage2.
Define a new option, allow_async, in ReadOptions so that async IO can be disabled. The block-based table reader's MultiGetAsync can return synchronously or asynchronously depending on the option.
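
For readers who want a concrete picture of the callback-plus-polling flow described above, here is a minimal single-threaded sketch. It is not the code in this PR: ToyFileSystem, ReadRequest, ReadResponse, ReadCallback and DemoMultiGet are hypothetical stand-ins, and the real MultiReadAsync and Poll signatures may well differ. The sketch only illustrates the split into a stage that issues reads and a second stage that runs as the completion callback once the caller polls.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are NOT the RocksDB types from this PR.
struct ReadRequest {
  size_t offset;
  size_t len;
};
struct ReadResponse {
  size_t offset;
  std::string data;
};
using ReadCallback = std::function<void(std::vector<ReadResponse>)>;

// Toy file system: MultiReadAsync only queues a batch, Poll completes it.
class ToyFileSystem {
 public:
  explicit ToyFileSystem(std::string contents)
      : contents_(std::move(contents)) {}

  // "Stage 1": record the batch and return immediately.
  void MultiReadAsync(std::vector<ReadRequest> reqs, ReadCallback cb) {
    pending_.push_back(Pending{std::move(reqs), std::move(cb)});
  }

  // Completes every queued batch, invoking its callback ("stage 2").
  // Returns the number of batches completed.
  size_t Poll() {
    size_t completed = 0;
    while (!pending_.empty()) {
      Pending p = std::move(pending_.front());
      pending_.pop_front();
      std::vector<ReadResponse> out;
      for (const ReadRequest& r : p.reqs) {
        out.push_back(ReadResponse{r.offset, contents_.substr(r.offset, r.len)});
      }
      p.cb(std::move(out));
      ++completed;
    }
    return completed;
  }

 private:
  struct Pending {
    std::vector<ReadRequest> reqs;
    ReadCallback cb;
  };
  std::string contents_;
  std::deque<Pending> pending_;
};

// Issues two reads, then polls; the lambda plays the role of the
// continuation that consumes the data blocks.
std::vector<std::string> DemoMultiGet() {
  ToyFileSystem fs("aaaabbbbcccc");
  std::vector<std::string> values;
  std::vector<ReadRequest> reqs = {{0, 4}, {8, 4}};
  fs.MultiReadAsync(std::move(reqs),
                    [&values](std::vector<ReadResponse> resps) {
                      for (ReadResponse& r : resps) {
                        values.push_back(std::move(r.data));
                      }
                    });
  assert(values.empty());  // nothing completes until the caller polls
  fs.Poll();
  return values;
}
```

In this toy model the caller still looks synchronous from the outside (DemoMultiGet returns the values), which mirrors how the user-facing MultiGet stays synchronous while the reads underneath are issued as a batch.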

TODO:

Remove the synchronous MultiGet implementations in TableCache and BlockBasedTable after running benchmarks to verify that there is no performance regression when allow_async is false.
Fix some MultiGet stats in Version.
Add row cache support to TableCache::MultiGetAsync.
Write some unit tests for BlockBasedTable and RandomAccessFileReader.

darionyaphet · 2020-11-09T02:01:34Z

Need alignment?

ltamasi

Thanks @anand1976 for the PR! I haven't had a chance to do a real deep dive yet but it seems to me that this is exactly the kind of problem coroutines (i.e. folly::coro) could help with. Specifically, I believe coroutines could eliminate quite a bit of complexity here: they would enable us to have a more natural control flow instead of having to use futures, callbacks, and logic split into multiple stages/methods; also, they would eliminate the need to manually manage context objects (by virtue of coroutine frames). The obvious catch is compiler/platform support: language support for coroutines is a C++20 feature, and folly::coro currently only works with clang I believe. Still, I feel it's something to consider.

ghost

At a high level, the threading model is scatter-gather, and each thread is IO-heavy.
Using a new thread for each IO does not scale well. If we use a coroutine-based solution such as fibers, then even with N IO requests we do not need N threads; we can dispatch them onto M system threads (M < N).

But the implementation depends on Linux AIO, so we do not actually manage the threads for asynchronous IOs. According to the AIO documentation: The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. In this case, the current future-based design looks good to me.

@anand1976 Is this why you take the future approach?

anand1976 · 2020-12-01T20:49:39Z

@ltamasi I agree that coroutines fit this problem very well. I don't know if we'll be moving to C++20 anytime soon though, given that RocksDB tries to be compatible with the broader community.

anand1976 · 2020-12-01T21:03:22Z

> At a high level, the threading model is scatter-gather, and each thread is IO-heavy.
> Using a new thread for each IO does not scale well. If we use a coroutine-based solution such as fibers, then even with N IO requests we do not need N threads; we can dispatch them onto M system threads (M < N).
>
> But the implementation depends on Linux AIO, so we do not actually manage the threads for asynchronous IOs. According to the AIO documentation: The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. In this case, the current future-based design looks good to me.
>
> @anand1976 Is this why you take the future approach?

@Cheng-Chang The underlying FileSystem may implement it in different ways. PosixFileSystem uses Linux io_uring, which is a little simpler to deal with since we can poll for completion. However, remote filesystems like Warm Storage may use the M x N model. It's more efficient than POSIX AIO, but may still incur high overhead by causing a wake-up and context switch per IO completion.
I used the Future approach because it allows more fine-grained control over how we handle completions. We can collect all completions and issue a single wake-up or, in the future, we could collect N completions for more overlap of IO and compute.
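
The "collect all completions and issue a single wake-up" idea can be sketched roughly as below. This is an illustration only, not the CollectAll from utilities/async/future.h: MakeBatchCompletion and DemoWakeupCount are hypothetical names, and the on_all_done callback stands in for whatever would actually wake the waiting thread. Each IO completion decrements a shared counter, and only the completion that brings the counter to zero triggers the wake-up.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical helper; NOT the CollectAll introduced by this PR.
// Returns a callback to invoke once per finished IO. Only the invocation
// that brings the count to zero runs on_all_done, so a waiter would be
// woken once per batch rather than once per IO.
std::function<void()> MakeBatchCompletion(size_t num_ios,
                                          std::function<void()> on_all_done) {
  assert(num_ios > 0);
  auto remaining = std::make_shared<std::atomic<size_t>>(num_ios);
  return [remaining, done = std::move(on_all_done)] {
    // fetch_sub returns the previous value, so 1 means "this was the last".
    if (remaining->fetch_sub(1, std::memory_order_acq_rel) == 1) {
      done();
    }
  };
}

// Simulates num_ios completions and counts how many wake-ups were issued.
int DemoWakeupCount(size_t num_ios) {
  int wakeups = 0;
  std::function<void()> io_done =
      MakeBatchCompletion(num_ios, [&wakeups] { ++wakeups; });
  for (size_t i = 0; i < num_ios; ++i) {
    io_done();
  }
  return wakeups;
}
```

Collecting N completions at a time instead, as mentioned for the future, would amount to firing the callback whenever the counter crosses a multiple of N rather than only at zero.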

anand1976 · 2020-12-01T21:05:06Z

@ltamasi Maybe another way of looking at it is to ask whether there's anything we can do to make the transition easier if/when RocksDB does move to C++20.

ltamasi · 2020-12-03T18:42:15Z

> @ltamasi I agree that coroutines fit this problem very well. I don't know if we'll be moving to C++20 anytime soon though, given that RocksDB tries to be compatible with the broader community.

One (admittedly, not so nice) way of dealing with that would be to make this a C++20 only feature (using conditional compilation).

P.S. Seems to me we already have a C++20 build set up on Travis.
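
As a rough illustration of the conditional-compilation idea, a build could key off the standard feature-test macro for language-level coroutines and fall back to the callback path otherwise. This is only a sketch under assumptions: DEMO_HAS_COROUTINES, ReadPath and SelectReadPath are hypothetical names and are not part of RocksDB or of this PR.

```cpp
#include <cassert>

// __cpp_impl_coroutine is the standard feature-test macro that compilers
// predefine when language-level coroutine support is enabled.
#if defined(__cpp_impl_coroutine) && __cpp_impl_coroutine >= 201902L
#define DEMO_HAS_COROUTINES 1
#else
#define DEMO_HAS_COROUTINES 0
#endif

// Hypothetical enum naming the two implementations a build could pick from.
enum class ReadPath { kCoroutine, kCallback };

// Resolved at compile time: the coroutine path on toolchains that support
// coroutines, the futures/callbacks path everywhere else.
constexpr ReadPath SelectReadPath() {
#if DEMO_HAS_COROUTINES
  return ReadPath::kCoroutine;
#else
  return ReadPath::kCallback;
#endif
}
```

The cost of this approach is that both code paths have to be maintained and tested, which is presumably why it is described above as not so nice.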

anand1976 requested a review from a user November 6, 2020 19:52

facebook-github-bot added the CLA Signed label Nov 6, 2020

anand1976 requested a review from ltamasi November 6, 2020 20:09

darionyaphet reviewed Nov 9, 2020

Comment thread src.mk Outdated

ltamasi reviewed Nov 20, 2020

ghost reviewed Dec 1, 2020

anand76 added 3 commits January 27, 2021 11:32

Async MultiGet

776d3ef

Fix no IOURING build

c978352

Rebase

bc449db

anand1976 force-pushed the async_read branch from 8f5eff8 to bc449db Compare January 27, 2021 19:56

anand1976 wants to merge 3 commits into facebook:main from anand1976:async_read
