Introduce xxhash and start the migration of the codebase by Adenilson · Pull Request #1805 · facebook/folly

Adenilson · 2022-06-27T19:29:34Z

The requirements for a good hash is a proper balance between speed
and hashing quality (i.e. avoiding collisions), while ideally being
performant on varied CPU architectures.

Currently one of the best (if not the best) hashes available is
xxhash (https://github.com/Cyan4973/xxHash) developed by Yann Collet,
being between 1.6x to 2x faster than SpookyHashV2 while keeping similar
or superior hashing quality on both x86 and ARM.

This patch introduces xxhash and migrates the majority of the code
base to use it, while also removing some unused/dead code (e.g.
FarmHash, SpookyHashV1, etc) along the way.

TODO: the only remaining class to migrate is RecordIO.

facebook-github-bot · 2022-06-27T19:29:38Z

Hi @Adenilson!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot · 2022-06-27T20:54:35Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

facebook-github-bot · 2022-06-27T20:54:37Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Orvid · 2022-06-28T21:55:55Z

FarmHash, and both spooky hash versions are used elsewhere outside of folly, so can't really be removed.

I also suspect that this will be very hard to actually land due to the number of tests internally that are likely inadvertently dependent on the hash value as a result of expecting a specific order from various unordered maps or sets.

Adenilson · 2022-06-30T00:14:48Z

Hello there, Orvid!

Thanks for reading the patch, I appreciate it.

I don't expect the current patch to land as it is now.

The primary objective was to kick start a discussion on two subjects: hashing on Folly and JSON parsing.

For the first subject, it would be helpful to better define what are the 'internal tests' that may rely on an immutable hash.

Could you share a bit more about it?

I wonder if it would make sense to add a few unit tests to ensure no breakage of Folly internal client code would happen when changes are made in the upstream project.

To help gauge the dimension of the problem, it would be pretty helpful to understand if the internal tests you mentioned call directly on the deprecated hashes or if they rely on using a common API entry point (e.g. hasher::operator()).

If a majority of client code falls into the first case, it should be simple to handle (i.e. keep the deprecated hashes in the
repository and new code could instead use better/faster hashes).

If instead they use a common entry point, an alternative would be to introduce the performant and potentially mutable hash (i.e. when a new and faster one is published, it would be updated) under a proper class/namespace (e.g. hasher::FastHash()) and keep the deprecated hash around for older client code that can't be updated for some reason.

That was the approach that I introduced in Chromium (https://source.chromium.org/chromium/chromium/src/+/main:base/hash/hash.h;l=38;bpv=1) successfully.

I would assume that if the dimension of the work is not massive, the best engineering approach is to migrate the 'internal tests' to become invariant when it comes to the selected hash.

Is that reasonable?

Another valid concern that you shared was related to unordered maps/sets. Before submitting the current patch, I took the care of running all the > 3000 tests (including the 'SLOW' and 'HANGING' tests) and observed no regressions.

If there is interest on improving the current state of hashing in Folly, we could implement the migration in a series of patches to avoid any possible breakage and have a smoother integration with 'internal' client code.

Starting with hopefully less risky classes (e.g. IOBuf) and moving to some that may have some attrition (e.g. maps/sets).

What are you thoughts on the subject?

Moving to the next subject, JSON parsing.

I started by having a look on the JSON parsing class and noticed that the main unit test was commented out (apparently related to a MSVC bug?).

To get it building again was a one liner fix, I proceeded next to use as an input file a JSON file generated by retrieving the comments from my own Facebook account (I named the file 'input.json').

I wonder if there is a corpus of sample JSON files that are relevant for Folly client code?).

The reason I ask about a corpus is that I observed diverging ordering of hot spots depending on the input JSON file (e.g. twitter.json vs FB comments sample file mentioned above).

A perf profile showed that one of the top 8 hot spots (alongside parseValue, parseString, F14Table, dynamic::operator=) while parsing JSON was relate to hashing.

That lead me to have a look on which hashes and where they were deployed in the code base.

There may be good potential for performance gains in JSON parsing by leveraging SIMD optimizations.

I've done some previous work on vectorizing zlib for Chromium and it is quite surprising the performance gains achievable by working with vectors instead of a scalar approach.

The good news is that there is previous work done on the area (e.g. simdjson).

Would Folly be open to adopt SIMD specific optimizations in its JSON parser or maybe just migrate to simdjson (if it has all the required features)?

Orvid · 2022-07-05T21:18:06Z

Incremental migration is likely the right way to go, but is not something that is viable to do from outside of FB unfortunately, as the scale is likely massive given just how much depends on Folly, but there are likely very few things that rely on the hash being immutable for functional correctness, so can be fixed up.

Evaluating json performance is also a bit difficult to do externally unfortunately, as we don't have any dedicated benchmarks for folly::json as far as I know

Adenilson · 2022-07-05T21:26:00Z

Adding some people who have being active on the hashes folder: @yfeldblum, @terrelln

Any input on the ideas I shared above?

@Orvid: I've just added on this patch an initial perf utest to stress JSON parsing performance. Would something similar be acceptable (independent from any hashing work)?

Adenilson · 2022-07-06T00:41:01Z

So the idea of a new interface for a faster hasher would look similar to (draft, incomplete):
https://gist.github.com/Adenilson/2111f3a1f33b671798a380b67dd36c9a

Meanwhile, I'm moving the changes on the Json unit test (i.e. re-enable compilation + add new unit test) to a new merge request.

Adenilson · 2022-07-06T00:41:46Z

The small changes on the json utest is on #1812.

Adenilson · 2022-07-07T18:34:35Z

Updated patch, changes:
a) Keep the deprecated hashes (i.e. Farmhash, SpookyHash, etc).
b) Changes on the JSON unit test were split to the other independent merge request #1812.
c) Migrate only three classes (IPAddressV6, IOBuf, RecordIO) that hopefully should be lower (?) risk.

I would appreciate feedback if the selected three classes are a safe choice to start the migration.

Adenilson · 2022-07-07T18:36:36Z

Fixed typo in the CL msg.

Orvid · 2022-07-08T22:53:30Z

I'll import this version to see how much breaks, but I am working on another version of this internally that should be much easier to actually roll out.

We'll likely want to go a slightly different route to adding the xxhash dependency than we've done in the past for the other hashing algorithms, since xxhash is actually packaged for distribution in various distros, so adding it as a dependency shouldn't be too hard.

facebook-github-bot · 2022-07-08T22:53:43Z

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Adenilson · 2022-07-20T00:18:55Z

@Orvid touching the base to ask how the internal testing went and how much breakage the current version of the patch caused (hopefully none...).
:-)

I can next update the patch to leverage the xxhash library available in the system instead of importing it into the repository.

That being said, xxhash supports optimizations using latest SIMD instructions on target CPUs (e.g. https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L3124).

If we know for sure that Folly would be running always on >= Haswell, it could be potentially interesting to import the code into Folly and define the XXH_VECTOR macro to leverage AVX2.

The issue I see with using a xxhash library available in the system is that Linux distributions generally build their packages with a lower common denominator (e.g. SSE2).

Another idea that may be worth exploring is to use CMake to fetch the latest xxhash version from github and build/link with it (I recall that 'FetchContent' could be use used for this case).

What you think?

Orvid · 2022-07-20T01:25:01Z

I failed to realize I hadn't updated our internal build system to account for the new source files from this PR, so the current results are that everything broke, but I've just updated the internal copy of this PR with the changes for that build system to work, so we'll see what is actually broken in a few hours. The more complete version of this swap that I've been working on internally does break a number of things, but I have plans for how to get it rolled out.

As for where to get xxhash from, internally we need to ensure everything is using the same version, so we can't have a copy in tree, but that doesn't stop us from having the header inline its definitions and linking it statically. We'll probably want to have getdeps attempt to find the system package for the open source builds, but if xxhash's inline definitions are enabled, it will use whatever instruction sets are enabled when building Folly, so there isn't anything special that's needed to enable AVX.

Adenilson · 2022-07-21T01:27:05Z

@Orvid Please see the updated version using the system's provided xxhash library as suggested.

By the way, do you have the results of the previous version after the required internal build fixes?

Orvid · 2022-07-21T02:09:57Z

After the previous fixes there are still a couple of errors, but I suspect those are just flaky.

Adenilson · 2022-07-27T20:54:45Z

@Orvid did you have time to verify the latest version of the patch (i.e. where we use the xxhash lib available in the system as suggested)?

Adenilson · 2022-08-04T00:33:42Z

I got two questions:
a) Does the new build file that finds xxhash (i.e. FindXxhash.cmake) got be duplicated in 'build/fbcode_builder/cmake'?

b) Is there a need to also add a new file in the manifest folder (i.e. build/fbcode_builder/manifests) to get/fetch xxhash (something like: https://gist.github.com/Adenilson/43a4a7e0acdfdaeff38f1fc3e2103241)?

Orvid · 2022-08-11T23:03:02Z

Sorry for the delay.

The new build file that finds xxhash doesn't need to be duplicated, but it can live exclusively in build/fbcode_builder/cmake if desired. That directory is synced across every FB repo that is using the getdeps build system for open-source dependency management, and can be used by any project.

Adding support via the manifest is required in order for the open source CI to actually start building with xxhash available.

Adenilson · 2022-08-12T20:49:41Z

@Orvid updated patch addressing the placement of the build file plus updating the manifest.

Anything else you like to see changed?

Adenilson · 2022-08-22T23:37:15Z

Friendly ping on the subject?

Adenilson · 2022-09-07T23:53:53Z


+find_package(Xxhash MODULE)
+set(FOLLY_HAVE_LIBXXHASH ${XXHASH_FOUND})
+if(XXHASH_FOUND)


In a second thought, the build will fail if xxhash is not found. Therefore is preferable to fail/abort on the configure stage, before build.

The requirements for a good hash is a proper balance between speed and hashing quality (i.e. avoiding collisions), while ideally being performant on varied CPU architectures. Currently one of the best (if not the best) hashes available is xxhash (https://github.com/Cyan4973/xxHash) developed by Yann Collet, being between 1.6x to 2x faster than SpookyHashV2 while keeping similar or superior hashing quality on both x86 and ARM. We will link to the system's provided xxhash library instead of having a static and stale copy of xxhash source code into the repository. A new build file (i.e. findXxhash.cmake) and the manifest were updated accordingly. This patch introduces xxhash and migrates only three classes (IPAddressV6, IOBuf, RecordIO) to use it, the idea is to implement the change gradually in small steps.

Since xxhash is required to build, it makes sense to fail early during the configure stage before starting a build.

Orvid · 2022-09-14T22:35:45Z

Sorry about the silence here, this PR appears to be good to go, but it likely isn't worth you spending more time to keep it updated, since there's no reasonable way for me to be able to merge it directly.

This change will be happening though, and the changes in this PR will be a part of that, especially the open source build setup, but I need to go about it a slightly different route, since even including xxhash in a public header seems to cause issues, and I had already been considering adding a wrapper anyways since there are faster ways to hash short strings on x86_64 than what xxhash does.

facebook-github-bot added the CLA Signed label Jun 27, 2022

Adenilson force-pushed the xxhash01 branch from c24459f to 3602eeb Compare July 7, 2022 18:31

Adenilson force-pushed the xxhash01 branch from 3602eeb to a5ee33f Compare July 7, 2022 18:35

Adenilson mentioned this pull request Jul 7, 2022

[json] Re-enable compilation of JsonTest and add two perf tests #1812

Closed

Adenilson changed the title ~~Introduce xxhash and migrate most of the codebase~~ Jul 7, 2022

Adenilson force-pushed the xxhash01 branch 2 times, most recently from 26d4f1f to 68f78f3 Compare July 21, 2022 01:23

Adenilson commented Jul 21, 2022

View reviewed changes

Comment thread folly/io/IOBuf.cpp Outdated

Adenilson force-pushed the xxhash01 branch from 68f78f3 to 2cf36f4 Compare July 21, 2022 01:28

Adenilson force-pushed the xxhash01 branch from 2cf36f4 to 4bfd8ca Compare July 27, 2022 20:49

Adenilson force-pushed the xxhash01 branch from 4bfd8ca to 98dc4b4 Compare August 3, 2022 23:58

Adenilson force-pushed the xxhash01 branch from 98dc4b4 to 517dcd2 Compare August 12, 2022 20:48

Adenilson force-pushed the xxhash01 branch from 517dcd2 to 8470ad4 Compare September 7, 2022 22:39

Adenilson commented Sep 7, 2022

View reviewed changes

Adenilson force-pushed the xxhash01 branch 2 times, most recently from b92c024 to d952df9 Compare September 13, 2022 17:43

Adenilson Cavalcanti added 2 commits September 14, 2022 08:15

[xxhash] Make configure fail if xxhash not found

9b9ac65

Since xxhash is required to build, it makes sense to fail early during the configure stage before starting a build.

Adenilson force-pushed the xxhash01 branch from d952df9 to 9b9ac65 Compare September 14, 2022 15:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Introduce xxhash and start the migration of the codebase#1805

Introduce xxhash and start the migration of the codebase#1805
Adenilson wants to merge 2 commits into
facebook:mainfrom
Adenilson:xxhash01

Adenilson commented Jun 27, 2022

facebook-github-bot commented Jun 27, 2022

facebook-github-bot commented Jun 27, 2022

facebook-github-bot commented Jun 27, 2022

Orvid commented Jun 28, 2022

Adenilson commented Jun 30, 2022

Orvid commented Jul 5, 2022

Adenilson commented Jul 5, 2022

Adenilson commented Jul 6, 2022

Adenilson commented Jul 6, 2022

Adenilson commented Jul 7, 2022

Adenilson commented Jul 7, 2022

Orvid commented Jul 8, 2022

facebook-github-bot commented Jul 8, 2022

Adenilson commented Jul 20, 2022

Orvid commented Jul 20, 2022

Uh oh!

Adenilson commented Jul 21, 2022

Orvid commented Jul 21, 2022

Adenilson commented Jul 27, 2022

Adenilson commented Aug 4, 2022

Orvid commented Aug 11, 2022

Adenilson commented Aug 12, 2022

Adenilson commented Aug 22, 2022

Adenilson Sep 7, 2022

Adenilson Sep 8, 2022

Orvid commented Sep 14, 2022

Labels

3 participants

Uh oh!

Conversation

Adenilson commented Jun 27, 2022

facebook-github-bot commented Jun 27, 2022

Action Required

Process

facebook-github-bot commented Jun 27, 2022

facebook-github-bot commented Jun 27, 2022

Orvid commented Jun 28, 2022

Adenilson commented Jun 30, 2022

Orvid commented Jul 5, 2022

Adenilson commented Jul 5, 2022

Adenilson commented Jul 6, 2022

Adenilson commented Jul 6, 2022

Adenilson commented Jul 7, 2022

Adenilson commented Jul 7, 2022

Orvid commented Jul 8, 2022

facebook-github-bot commented Jul 8, 2022

Adenilson commented Jul 20, 2022

Orvid commented Jul 20, 2022

Uh oh!

Adenilson commented Jul 21, 2022

Orvid commented Jul 21, 2022

Adenilson commented Jul 27, 2022

Adenilson commented Aug 4, 2022

Orvid commented Aug 11, 2022

Adenilson commented Aug 12, 2022

Adenilson commented Aug 22, 2022

Adenilson Sep 7, 2022

Choose a reason for hiding this comment

Adenilson Sep 8, 2022

Choose a reason for hiding this comment

Orvid commented Sep 14, 2022

Labels

3 participants