
[RFC] Asynchronous MultiGet#7647

Open
anand1976 wants to merge 3 commits into facebook:main from anand1976:async_read

@anand1976

Contributor

This is a draft PR that implements support for MultiGet using asynchronous IO. The goal is to parallelize file reads in an LSM level by issuing the reads asynchronously for all the files overlapping a MultiGet batch in a level. The user MultiGet API still behaves synchronously. However, Version::MultiGet takes care of parallelizing it and waiting for the results. At a high level, the design is as follows:

  1. A new API, MultiReadAsync, is defined in FSRandomAccessFile. It takes a completion callback and calls it with an array of FSReadResponse when the reads are done. Completion can be driven by polling (via a newly defined Poll API in FileSystem), or it can happen in another thread or a signal handler. MultiReadAsync also returns a type-erased unique_ptr that can hold a pointer to whatever context the callee wants to maintain. RocksDB guarantees that the unique_ptr stays alive at least until the completion callback is called.
  2. To internally track the status of asynchronous reads and wait on them in RocksDB, a set of new primitives is introduced in utilities/async/future.h. The Promise and Future primitives allow functions in RocksDB to return an incomplete value that will be fulfilled at some point in the future. They are similar to std::promise and std::future, but are more efficient because they minimize context switches when waiting on multiple futures. This is accomplished by introducing a CollectAll function that's similar to folly::collectAll.
  3. Each actor in the processing of MultiGet in a SST file (TableCache, BlockBasedTableReader, RandomAccessFileReader and PosixRandomAccessFile) needs to maintain some context for the request. In order to minimize the memory allocation overhead for maintaining the context, utilities/async/context_pool.h provides ThreadLocalContextPool that maintains per-thread caches of context objects that can be re-used.
  4. Add a MultiGetAsync interface in TableReader. The async MultiGet is implemented as a chain of continuations. Version::MultiGet initially calls MultiGetAsync, which can return a continuation callback. The callback, in turn, can return the next callback in the chain, and so on. As of now, there is only one callback in the chain because BlockBasedTable only uses async reads for data blocks, but it can transparently extend this to partitioned filter blocks and partitioned index blocks in the future.
  5. Split MultiGet and RetrieveMultipleBlocks in BlockBasedTable into MultiGetAsync/MultiGetAsyncStage2 and RetrieveMultipleBlocks/RetrieveMultipleBlocksStage2 respectively. Similarly, split MultiRead in RandomAccessFileReader into MultiReadAsync/MultiReadAsyncStage2.
  6. Define a new option allow_async in ReadOptions so that async IO can be disabled. The block based table reader MultiGetAsync can return synchronously or asynchronously depending on the option.
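
As a rough illustration of item 1 (not the code in this PR), the callback-plus-poll shape could look something like the sketch below. Every name here (ToyFile, ToyReadRequest, ToyReadResponse, ToyHandle, ReadTwoRanges) is a hypothetical stand-in rather than a RocksDB type, the "file" is just an in-memory string, and the real MultiReadAsync and Poll signatures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the request/response types; not RocksDB's.
struct ToyReadRequest {
  uint64_t offset;
  size_t len;
};
struct ToyReadResponse {
  bool ok;
  std::string data;
};

using ToyCallback = std::function<void(const std::vector<ToyReadResponse>&)>;
// Type-erased handle: the deleter knows the concrete type, the caller does not.
using ToyHandle = std::unique_ptr<void, void (*)(void*)>;

class ToyFile {
 public:
  explicit ToyFile(std::string contents) : contents_(std::move(contents)) {}

  // Queues the batch and returns immediately. The caller must keep the
  // returned handle alive at least until the callback has run.
  ToyHandle MultiReadAsync(std::vector<ToyReadRequest> reqs, ToyCallback cb) {
    auto* p = new Pending{std::move(reqs), std::move(cb)};
    queue_.push_back(p);
    return ToyHandle(p, &ToyFile::DeletePending);
  }

  // Completes every queued batch on the calling thread and returns how many
  // callbacks were invoked.
  size_t Poll() {
    std::vector<Pending*> ready;
    ready.swap(queue_);
    for (Pending* p : ready) {
      std::vector<ToyReadResponse> out;
      for (const ToyReadRequest& r : p->reqs) {
        bool ok = r.offset <= contents_.size();
        out.push_back(
            {ok, ok ? contents_.substr(r.offset, r.len) : std::string()});
      }
      p->cb(out);
    }
    return ready.size();
  }

 private:
  struct Pending {
    std::vector<ToyReadRequest> reqs;
    ToyCallback cb;
  };
  static void DeletePending(void* raw) { delete static_cast<Pending*>(raw); }

  std::string contents_;
  std::vector<Pending*> queue_;  // non-owning; the handle owns each Pending
};

// Example driver: issue one batch of two reads, then poll for completion.
std::string ReadTwoRanges() {
  ToyFile file("hello async world");
  std::string joined;
  ToyHandle handle = file.MultiReadAsync(
      {{0, 5}, {12, 5}}, [&joined](const std::vector<ToyReadResponse>& rs) {
        for (const ToyReadResponse& r : rs) joined += r.data + "|";
      });
  assert(joined.empty());  // nothing completes until Poll() runs
  file.Poll();
  return joined;
}
```

In this sketch the caller owns the per-request context through the opaque handle, which mirrors the lifetime guarantee described in item 1: the handle must outlive the completion callback.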

TODO:

  • Remove the MultiGet implementations in TableCache and BlockBasedTable after running benchmarks to verify no performance regression if allow_async is false.
  • Fix some MultiGet stats in Version.
  • Add row cache support to TableCache::MultiGetAsync.
  • Write some unit tests for BlockBasedTable and RandomAccessFileReader.
@anand1976 anand1976 requested a review from a user November 6, 2020 19:52
@anand1976 anand1976 requested a review from ltamasi November 6, 2020 20:09
Comment thread src.mk Outdated

Contributor

Need alignment?

@ltamasi ltamasi left a comment

Contributor

Thanks @anand1976 for the PR! I haven't had a chance to do a real deep dive yet but it seems to me that this is exactly the kind of problem coroutines (i.e. folly::coro) could help with. Specifically, I believe coroutines could eliminate quite a bit of complexity here: they would enable us to have a more natural control flow instead of having to use futures, callbacks, and logic split into multiple stages/methods; also, they would eliminate the need to manually manage context objects (by virtue of coroutine frames). The obvious catch is compiler/platform support: language support for coroutines is a C++20 feature, and folly::coro currently only works with clang I believe. Still, I feel it's something to consider.

@ghost ghost left a comment


From a high level, the threading model is scatter-gather, and each thread is IO heavy.
Using a new thread for each IO does not scale well. If we use a coroutine-based solution such as fibers, then even with N IO requests we do not need N threads; we can dispatch them onto M system threads (M < N).

But the implementation depends on Linux AIO, so we do not actually manage the threads for asynchronous IOs. According to the aio documentation: "The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly." In this case, the current future-based design looks good to me.

@anand1976 Is this why you took the future approach?

@anand1976

Contributor Author

@ltamasi I agree that coroutines fit this problem very well. I don't know if we'll be moving to C++20 anytime soon though, given that RocksDB tries to be compatible with the broader community.

@anand1976

Contributor Author

> From a high level, the threading model is scatter-gather, and each thread is IO heavy.
> Using a new thread for each IO does not scale well. If we use a coroutine-based solution such as fibers, then even with N IO requests we do not need N threads; we can dispatch them onto M system threads (M < N).
>
> But the implementation depends on Linux AIO, so we do not actually manage the threads for asynchronous IOs. According to the aio documentation: "The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly." In this case, the current future-based design looks good to me.
>
> @anand1976 Is this why you took the future approach?

@Cheng-Chang The underlying FileSystem may implement it in different ways. PosixFileSystem uses Linux io_uring, which is a little simpler to deal with since we can poll for completion. However, remote filesystems like Warm Storage may use the M x N model. It's more efficient than POSIX AIO, but may still cause high overhead by resulting in a wake-up and context switch per IO completion.
I used the Future approach as it allows more fine-grained control over how we handle completions. We can collect all completions and issue a single wake-up, or in the future, we could collect N completions for more overlap of IO and compute.
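
To make the "collect all completions and issue a single wake-up" idea concrete, here is a minimal sketch under stated assumptions. ToyGather and CountWakeups are hypothetical names, not the Promise/Future/CollectAll code in utilities/async/future.h; the sketch only shows the counting trick, where the last completion to arrive is the only one that runs the continuation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical gather object: N completions in, one continuation out.
class ToyGather {
 public:
  ToyGather(size_t expected, std::function<void()> on_all_done)
      : remaining_(expected), on_all_done_(std::move(on_all_done)) {}

  // Called once per finished IO; safe to call from different threads.
  void Complete() {
    // fetch_sub returns the previous value, so 1 means this was the last one.
    if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      on_all_done_();
    }
  }

 private:
  std::atomic<size_t> remaining_;
  std::function<void()> on_all_done_;
};

// Example driver: counts how often the waiter would be woken for a batch.
// Assumes num_reads > 0; with zero reads the continuation never fires.
int CountWakeups(size_t num_reads) {
  int wakeups = 0;
  ToyGather gather(num_reads, [&wakeups] { ++wakeups; });
  for (size_t i = 0; i < num_reads; ++i) gather.Complete();
  return wakeups;
}
```

In a real implementation the continuation would presumably fulfill a promise or signal a condition variable, so the thread blocked in MultiGet is woken once per batch rather than once per IO. Waiting for only the first N completions would amount to starting the counter at N instead of the batch size.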

@anand1976

Contributor Author

@ltamasi Maybe another way of looking at it is if there's anything we can do to make the transition easier if/when RocksDB does move to C++20.

@ltamasi

ltamasi commented Dec 3, 2020

Contributor

> @ltamasi I agree that coroutines fit this problem very well. I don't know if we'll be moving to C++20 anytime soon though, given that RocksDB tries to be compatible with the broader community.

One (admittedly, not so nice) way of dealing with that would be to make this a C++20 only feature (using conditional compilation).

P.S. Seems to me we already have a C++20 build set up on Travis.
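
A minimal sketch of what such conditional compilation might look like, using the standard C++20 feature-test macros `__cpp_impl_coroutine` and `__cpp_lib_coroutine`. The TOY_HAS_COROUTINES macro and the MultiGetPathName function are hypothetical and only illustrate the gating; they are not part of RocksDB.

```cpp
#include <cassert>
#include <string>

// <version> is where C++20 library feature-test macros live; guard the
// include so that older toolchains still compile this file.
#if defined(__has_include)
#if __has_include(<version>)
#include <version>
#endif
#endif

// Hypothetical switch: on only when both the compiler and the standard
// library report coroutine support.
#if defined(__cpp_impl_coroutine) && defined(__cpp_lib_coroutine)
#define TOY_HAS_COROUTINES 1
#else
#define TOY_HAS_COROUTINES 0
#endif

// Hypothetical dispatch: reports which MultiGet path a build would take.
std::string MultiGetPathName(bool allow_async) {
#if TOY_HAS_COROUTINES
  if (allow_async) return "coroutine";
#else
  (void)allow_async;  // async path compiled out on pre-C++20 builds
#endif
  return "synchronous";
}
```

With this shape, a pre-C++20 build silently falls back to the synchronous path, so the feature stays optional for users on older compilers.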


4 participants