[RFC] Asynchronous MultiGet #7647
ltamasi
left a comment
Thanks @anand1976 for the PR! I haven't had a chance to do a real deep dive yet but it seems to me that this is exactly the kind of problem coroutines (i.e. folly::coro) could help with. Specifically, I believe coroutines could eliminate quite a bit of complexity here: they would enable us to have a more natural control flow instead of having to use futures, callbacks, and logic split into multiple stages/methods; also, they would eliminate the need to manually manage context objects (by virtue of coroutine frames). The obvious catch is compiler/platform support: language support for coroutines is a C++20 feature, and folly::coro currently only works with clang I believe. Still, I feel it's something to consider.
ghost
left a comment
From a high level, the threading model is scatter-gather, and each thread is IO heavy.
Using a new thread for each IO does not scale well. With a coroutine-based solution such as fibers, N IO requests do not need N threads; they can be dispatched onto M system threads (M < N).
But the implementation depends on Linux AIO, so we do not actually manage the threads for asynchronous IOs. According to the aio(7) man page: "The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly." In this case, the current future-based design looks good to me.
@anand1976 Is this why you took the future approach?
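To make the "N requests on M threads" point concrete, here is a minimal sketch using plain `std::thread` workers rather than fibers. Everything in it is hypothetical and not part of the PR: `FakeBlockingRead` stands in for a blocking file read, and `ScatterGather` is an illustrative name.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one blocking read; returns a fake result.
static int FakeBlockingRead(std::size_t request_id) {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
  return static_cast<int>(request_id) * 2;
}

// Scatter num_requests reads across num_threads workers, then gather.
std::vector<int> ScatterGather(std::size_t num_requests,
                               std::size_t num_threads) {
  assert(num_threads > 0);
  std::vector<int> results(num_requests, -1);
  std::atomic<std::size_t> next{0};  // index of the next unclaimed request
  std::vector<std::thread> workers;
  workers.reserve(num_threads);
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&results, &next, num_requests] {
      // Each worker keeps claiming requests until none are left, so the
      // thread count is independent of the request count.
      for (std::size_t i = next.fetch_add(1); i < num_requests;
           i = next.fetch_add(1)) {
        results[i] = FakeBlockingRead(i);  // each slot has a single writer
      }
    });
  }
  for (auto& w : workers) w.join();  // gather: wait for all workers
  return results;
}
```

Calling `ScatterGather(8, 3)` services eight requests with three threads. Each worker still blocks for the duration of a read, which is the cost that fibers or kernel-level async IO would avoid.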
@ltamasi I agree that coroutines fit this problem very well. I don't know if we'll be moving to C++20 anytime soon though, given that RocksDB tries to be compatible with the broader community.
@Cheng-Chang The underlying
@ltamasi Maybe another way of looking at it is if there's anything we can do to make the transition easier if/when RocksDB does move to C++20.
One (admittedly, not so nice) way of dealing with that would be to make this a C++20 only feature (using conditional compilation). P.S. Seems to me we already have a C++20 build set up on Travis.
This is a draft PR that implements support for MultiGet using asynchronous IO. The goal is to parallelize file reads in an LSM level by issuing the reads asynchronously for all the files overlapping a MultiGet batch in a level. The user `MultiGet` API still behaves synchronously. However, `Version::MultiGet` takes care of parallelizing it and waiting for the results. At a high level, the design is as follows:

- `MultiReadAsync` is defined in `FSRandomAccessFile` that takes a completion callback and calls it with an array of `FSReadResponse` when the reads are done. The completion could be done either by polling (via a newly defined `Poll` API in `FileSystem`), or could be in another thread or signal handler. `MultiReadAsync` also returns a type erased `unique_ptr` that can contain a pointer to whatever context the callee wants to maintain. RocksDB guarantees the lifetime of the `unique_ptr` at least till the completion callback is called.
- `utilities/async/future.h`. The `Promise` and `Future` primitives allow functions in RocksDB to return an incomplete value that will be fulfilled at some point in the future. They are similar to `std::promise` and `std::future`, but more efficient by minimizing context switches when waiting on multiple futures. This is accomplished by introducing a `CollectAll` function that's similar to `folly::collectAll`.
- `MultiGet` in a SST file (`TableCache`, `BlockBasedTableReader`, `RandomAccessFileReader` and `PosixRandomAccessFile`) needs to maintain some context for the request. In order to minimize the memory allocation overhead for maintaining the context, `utilities/async/context_pool.h` provides `ThreadLocalContextPool` that maintains per-thread caches of context objects that can be re-used.
- `MultiGetAsync` interface in `TableReader`. The async MultiGet is implemented as a chain of continuations. `Version::MultiGet` initially calls `MultiGetAsync`, which can return a continuation callback. The callback, in turn, can return the next callback in the chain and so on. As of now, there is only one callback in the chain as `BlockBasedTable` only uses async reads for data blocks, but it can transparently extend this to partitioned filter blocks and partitioned index blocks in the future.
- Split `MultiGet` and `RetrieveMultipleBlocks` in `BlockBasedTable` into `MultiGetAsync`/`MultiGetAsyncStage2` and `RetrieveMultipleBlocks`/`RetrieveMultipleBlocksStage2` respectively. Similarly, split `MultiRead` in `RandomAccessFileReader` into `MultiReadAsync`/`MultiReadAsyncStage2`.
- `allow_async` in `ReadOptions` to disable async IO. The block based table reader `MultiGetAsync` can return synchronously or asynchronously depending on the option.

TODO:

- `MultiGet` implementations in `TableCache` and `BlockBasedTable` after running benchmarks to verify no performance regression if `allow_async` is `false`.
- `Version`.
- `TableCache::MultiGetAsync`.
- `BlockBasedTable` and `RandomAccessFileReader`.
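
The continuation-chain design described above can be illustrated with a small, self-contained sketch. This is not the PR's code: `ToyFileSystem`, `Continuation`, `LookupStage`, `ToyMultiGetAsync` and `ToyMultiGet` are hypothetical names, the "file system" completes reads only when polled, and the real `MultiReadAsync`/`Poll` signatures are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy file system (hypothetical): reads are queued, then completed by Poll().
class ToyFileSystem {
 public:
  using ReadCallback = std::function<void(std::string)>;
  void ReadAsync(std::string block, ReadCallback cb) {
    pending_.push({std::move(block), std::move(cb)});
  }
  // Completes every queued read on the calling thread; returns the count.
  int Poll() {
    int completed = 0;
    while (!pending_.empty()) {
      auto req = std::move(pending_.front());
      pending_.pop();
      req.second("data:" + req.first);  // invoke the completion callback
      ++completed;
    }
    return completed;
  }

 private:
  std::queue<std::pair<std::string, ReadCallback>> pending_;
};

// One link in the chain: Run() returns the next link, or nullptr when done.
struct Continuation {
  virtual ~Continuation() = default;
  virtual std::unique_ptr<Continuation> Run() = 0;
};

// Stage 2: consumes the block contents that the completed reads filled in.
class LookupStage : public Continuation {
 public:
  LookupStage(std::shared_ptr<std::vector<std::string>> blocks,
              std::vector<std::string>* values)
      : blocks_(std::move(blocks)), values_(values) {
    assert(values_ != nullptr);
  }
  std::unique_ptr<Continuation> Run() override {
    for (const auto& b : *blocks_) values_->push_back("value-from-" + b);
    return nullptr;  // end of the chain
  }

 private:
  std::shared_ptr<std::vector<std::string>> blocks_;
  std::vector<std::string>* values_;
};

// Stage 1: issue all reads for one file without waiting, return stage 2.
std::unique_ptr<Continuation> ToyMultiGetAsync(
    ToyFileSystem& fs, const std::vector<std::string>& block_names,
    std::vector<std::string>* values) {
  auto blocks = std::make_shared<std::vector<std::string>>(block_names.size());
  for (std::size_t i = 0; i < block_names.size(); ++i) {
    fs.ReadAsync(block_names[i], [blocks, i](std::string data) {
      (*blocks)[i] = std::move(data);
    });
  }
  return std::make_unique<LookupStage>(blocks, values);
}

// Synchronous driver: start every file's chain, then alternate Poll() and
// Run() until no chain has a continuation left.
std::vector<std::string> ToyMultiGet(
    ToyFileSystem& fs, const std::vector<std::vector<std::string>>& files) {
  std::vector<std::string> values;
  std::vector<std::unique_ptr<Continuation>> chains;
  for (const auto& f : files) chains.push_back(ToyMultiGetAsync(fs, f, &values));
  while (!chains.empty()) {
    fs.Poll();  // wait for the reads issued by the previous stage
    std::vector<std::unique_ptr<Continuation>> next;
    for (auto& c : chains) {
      if (auto n = c->Run()) next.push_back(std::move(n));
    }
    chains = std::move(next);
  }
  return values;
}
```

In this sketch the caller sees a blocking `ToyMultiGet`, while the reads for all files are outstanding at the same time. Adding a further stage (for example, for partitioned index blocks) would only require `LookupStage::Run` to issue more reads and return another `Continuation`.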