Introduce xxhash and start the migration of the codebase#1805
Introduce xxhash and start the migration of the codebase#1805Adenilson wants to merge 2 commits into
Conversation
|
Hi @Adenilson! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks! |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
|
FarmHash, and both spooky hash versions are used elsewhere outside of folly, so can't really be removed. I also suspect that this will be very hard to actually land due to the number of tests internally that are likely inadvertently dependent on the hash value as a result of expecting a specific order from various unordered maps or sets. |
|
Hello there, Orvid! Thanks for reading the patch, I appreciate it. I don't expect the current patch to land as it is now. The primary objective was to kick start a discussion on two subjects: hashing on Folly and JSON parsing. For the first subject, it would be helpful to better define what are the 'internal tests' that may rely on an immutable hash. Could you share a bit more about it? I wonder if it would make sense to add a few unit tests to ensure no breakage of Folly internal client code would happen when changes are made in the upstream project. To help gauge the dimension of the problem, it would be pretty helpful to understand if the internal tests you mentioned call directly on the deprecated hashes or if they rely on using a common API entry point (e.g. hasher::operator()). If a majority of client code falls into the first case, it should be simple to handle (i.e. keep the deprecated hashes in the If instead they use a common entry point, an alternative would be to introduce the performant and potentially mutable hash (i.e. when a new and faster one is published, it would be updated) under a proper class/namespace (e.g. hasher::FastHash()) and keep the deprecated hash around for older client code that can't be updated for some reason. That was the approach that I introduced in Chromium (https://source.chromium.org/chromium/chromium/src/+/main:base/hash/hash.h;l=38;bpv=1) successfully. I would assume that if the dimension of the work is not massive, the best engineering approach is to migrate the 'internal tests' to become invariant when it comes to the selected hash. Is that reasonable? Another valid concern that you shared was related to unordered maps/sets. Before submitting the current patch, I took the care of running all the > 3000 tests (including the 'SLOW' and 'HANGING' tests) and observed no regressions. If there is interest on improving the current state of hashing in Folly, we could implement the migration in a series of patches to avoid any possible breakage and have a smoother integration with 'internal' client code. Starting with hopefully less risky classes (e.g. IOBuf) and moving to some that may have some attrition (e.g. maps/sets). What are you thoughts on the subject? Moving to the next subject, JSON parsing. I started by having a look on the JSON parsing class and noticed that the main unit test was commented out (apparently related to a MSVC bug?). To get it building again was a one liner fix, I proceeded next to use as an input file a JSON file generated by retrieving the comments from my own Facebook account (I named the file 'input.json'). I wonder if there is a corpus of sample JSON files that are relevant for Folly client code?). The reason I ask about a corpus is that I observed diverging ordering of hot spots depending on the input JSON file (e.g. twitter.json vs FB comments sample file mentioned above). A perf profile showed that one of the top 8 hot spots (alongside parseValue, parseString, F14Table, dynamic::operator=) while parsing JSON was relate to hashing. That lead me to have a look on which hashes and where they were deployed in the code base. There may be good potential for performance gains in JSON parsing by leveraging SIMD optimizations. I've done some previous work on vectorizing zlib for Chromium and it is quite surprising the performance gains achievable by working with vectors instead of a scalar approach. The good news is that there is previous work done on the area (e.g. simdjson). Would Folly be open to adopt SIMD specific optimizations in its JSON parser or maybe just migrate to simdjson (if it has all the required features)? |
|
Incremental migration is likely the right way to go, but is not something that is viable to do from outside of FB unfortunately, as the scale is likely massive given just how much depends on Folly, but there are likely very few things that rely on the hash being immutable for functional correctness, so can be fixed up. Evaluating json performance is also a bit difficult to do externally unfortunately, as we don't have any dedicated benchmarks for folly::json as far as I know |
|
Adding some people who have being active on the hashes folder: @yfeldblum, @terrelln Any input on the ideas I shared above? @Orvid: I've just added on this patch an initial perf utest to stress JSON parsing performance. Would something similar be acceptable (independent from any hashing work)? |
|
So the idea of a new interface for a faster hasher would look similar to (draft, incomplete): Meanwhile, I'm moving the changes on the Json unit test (i.e. re-enable compilation + add new unit test) to a new merge request. |
|
The small changes on the json utest is on #1812. |
|
Updated patch, changes: I would appreciate feedback if the selected three classes are a safe choice to start the migration. |
|
Fixed typo in the CL msg. |
|
I'll import this version to see how much breaks, but I am working on another version of this internally that should be much easier to actually roll out. We'll likely want to go a slightly different route to adding the xxhash dependency than we've done in the past for the other hashing algorithms, since xxhash is actually packaged for distribution in various distros, so adding it as a dependency shouldn't be too hard. |
|
@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@Orvid touching the base to ask how the internal testing went and how much breakage the current version of the patch caused (hopefully none...). I can next update the patch to leverage the xxhash library available in the system instead of importing it into the repository. That being said, xxhash supports optimizations using latest SIMD instructions on target CPUs (e.g. https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L3124). If we know for sure that Folly would be running always on >= Haswell, it could be potentially interesting to import the code into Folly and define the XXH_VECTOR macro to leverage AVX2. The issue I see with using a xxhash library available in the system is that Linux distributions generally build their packages with a lower common denominator (e.g. SSE2). Another idea that may be worth exploring is to use CMake to fetch the latest xxhash version from github and build/link with it (I recall that 'FetchContent' could be use used for this case). What you think? |
|
I failed to realize I hadn't updated our internal build system to account for the new source files from this PR, so the current results are that everything broke, but I've just updated the internal copy of this PR with the changes for that build system to work, so we'll see what is actually broken in a few hours. The more complete version of this swap that I've been working on internally does break a number of things, but I have plans for how to get it rolled out. As for where to get xxhash from, internally we need to ensure everything is using the same version, so we can't have a copy in tree, but that doesn't stop us from having the header inline its definitions and linking it statically. We'll probably want to have getdeps attempt to find the system package for the open source builds, but if xxhash's inline definitions are enabled, it will use whatever instruction sets are enabled when building Folly, so there isn't anything special that's needed to enable AVX. |
26d4f1f to
68f78f3
Compare
|
@Orvid Please see the updated version using the system's provided xxhash library as suggested. By the way, do you have the results of the previous version after the required internal build fixes? |
|
After the previous fixes there are still a couple of errors, but I suspect those are just flaky. |
|
@Orvid did you have time to verify the latest version of the patch (i.e. where we use the xxhash lib available in the system as suggested)? |
|
I got two questions: b) Is there a need to also add a new file in the manifest folder (i.e. build/fbcode_builder/manifests) to get/fetch xxhash (something like: https://gist.github.com/Adenilson/43a4a7e0acdfdaeff38f1fc3e2103241)? |
|
Sorry for the delay. The new build file that finds xxhash doesn't need to be duplicated, but it can live exclusively in Adding support via the manifest is required in order for the open source CI to actually start building with xxhash available. |
|
@Orvid updated patch addressing the placement of the build file plus updating the manifest. Anything else you like to see changed? |
|
Friendly ping on the subject? |
|
|
||
| find_package(Xxhash MODULE) | ||
| set(FOLLY_HAVE_LIBXXHASH ${XXHASH_FOUND}) | ||
| if(XXHASH_FOUND) |
There was a problem hiding this comment.
In a second thought, the build will fail if xxhash is not found. Therefore is preferable to fail/abort on the configure stage, before build.
b92c024 to
d952df9
Compare
The requirements for a good hash is a proper balance between speed and hashing quality (i.e. avoiding collisions), while ideally being performant on varied CPU architectures. Currently one of the best (if not the best) hashes available is xxhash (https://github.com/Cyan4973/xxHash) developed by Yann Collet, being between 1.6x to 2x faster than SpookyHashV2 while keeping similar or superior hashing quality on both x86 and ARM. We will link to the system's provided xxhash library instead of having a static and stale copy of xxhash source code into the repository. A new build file (i.e. findXxhash.cmake) and the manifest were updated accordingly. This patch introduces xxhash and migrates only three classes (IPAddressV6, IOBuf, RecordIO) to use it, the idea is to implement the change gradually in small steps.
Since xxhash is required to build, it makes sense to fail early during the configure stage before starting a build.
|
Sorry about the silence here, this PR appears to be good to go, but it likely isn't worth you spending more time to keep it updated, since there's no reasonable way for me to be able to merge it directly. This change will be happening though, and the changes in this PR will be a part of that, especially the open source build setup, but I need to go about it a slightly different route, since even including xxhash in a public header seems to cause issues, and I had already been considering adding a wrapper anyways since there are faster ways to hash short strings on x86_64 than what xxhash does. |
The requirements for a good hash is a proper balance between speed
and hashing quality (i.e. avoiding collisions), while ideally being
performant on varied CPU architectures.
Currently one of the best (if not the best) hashes available is
xxhash (https://github.com/Cyan4973/xxHash) developed by Yann Collet,
being between 1.6x to 2x faster than SpookyHashV2 while keeping similar
or superior hashing quality on both x86 and ARM.
This patch introduces xxhash and migrates the majority of the code
base to use it, while also removing some unused/dead code (e.g.
FarmHash, SpookyHashV1, etc) along the way.
TODO: the only remaining class to migrate is RecordIO.