Skip to content

Introduce xxhash and start the migration of the codebase#1805

Open
Adenilson wants to merge 2 commits into
facebook:mainfrom
Adenilson:xxhash01
Open

Introduce xxhash and start the migration of the codebase#1805
Adenilson wants to merge 2 commits into
facebook:mainfrom
Adenilson:xxhash01

Conversation

@Adenilson

Copy link
Copy Markdown

The requirements for a good hash is a proper balance between speed
and hashing quality (i.e. avoiding collisions), while ideally being
performant on varied CPU architectures.

Currently one of the best (if not the best) hashes available is
xxhash (https://github.com/Cyan4973/xxHash) developed by Yann Collet,
being between 1.6x to 2x faster than SpookyHashV2 while keeping similar
or superior hashing quality on both x86 and ARM.

This patch introduces xxhash and migrates the majority of the code
base to use it, while also removing some unused/dead code (e.g.
FarmHash, SpookyHashV1, etc) along the way.

TODO: the only remaining class to migrate is RecordIO.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

Hi @Adenilson!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot

Copy link
Copy Markdown
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot

Copy link
Copy Markdown
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@Orvid

Orvid commented Jun 28, 2022

Copy link
Copy Markdown
Contributor

FarmHash, and both spooky hash versions are used elsewhere outside of folly, so can't really be removed.

I also suspect that this will be very hard to actually land due to the number of tests internally that are likely inadvertently dependent on the hash value as a result of expecting a specific order from various unordered maps or sets.

@Adenilson

Copy link
Copy Markdown
Author

Hello there, Orvid!

Thanks for reading the patch, I appreciate it.

I don't expect the current patch to land as it is now.

The primary objective was to kick start a discussion on two subjects: hashing on Folly and JSON parsing.

For the first subject, it would be helpful to better define what are the 'internal tests' that may rely on an immutable hash.

Could you share a bit more about it?

I wonder if it would make sense to add a few unit tests to ensure no breakage of Folly internal client code would happen when changes are made in the upstream project.

To help gauge the dimension of the problem, it would be pretty helpful to understand if the internal tests you mentioned call directly on the deprecated hashes or if they rely on using a common API entry point (e.g. hasher::operator()).

If a majority of client code falls into the first case, it should be simple to handle (i.e. keep the deprecated hashes in the
repository and new code could instead use better/faster hashes).

If instead they use a common entry point, an alternative would be to introduce the performant and potentially mutable hash (i.e. when a new and faster one is published, it would be updated) under a proper class/namespace (e.g. hasher::FastHash()) and keep the deprecated hash around for older client code that can't be updated for some reason.

That was the approach that I introduced in Chromium (https://source.chromium.org/chromium/chromium/src/+/main:base/hash/hash.h;l=38;bpv=1) successfully.

I would assume that if the dimension of the work is not massive, the best engineering approach is to migrate the 'internal tests' to become invariant when it comes to the selected hash.

Is that reasonable?

Another valid concern that you shared was related to unordered maps/sets. Before submitting the current patch, I took the care of running all the > 3000 tests (including the 'SLOW' and 'HANGING' tests) and observed no regressions.

If there is interest on improving the current state of hashing in Folly, we could implement the migration in a series of patches to avoid any possible breakage and have a smoother integration with 'internal' client code.

Starting with hopefully less risky classes (e.g. IOBuf) and moving to some that may have some attrition (e.g. maps/sets).

What are you thoughts on the subject?

Moving to the next subject, JSON parsing.

I started by having a look on the JSON parsing class and noticed that the main unit test was commented out (apparently related to a MSVC bug?).

To get it building again was a one liner fix, I proceeded next to use as an input file a JSON file generated by retrieving the comments from my own Facebook account (I named the file 'input.json').

I wonder if there is a corpus of sample JSON files that are relevant for Folly client code?).

The reason I ask about a corpus is that I observed diverging ordering of hot spots depending on the input JSON file (e.g. twitter.json vs FB comments sample file mentioned above).

A perf profile showed that one of the top 8 hot spots (alongside parseValue, parseString, F14Table, dynamic::operator=) while parsing JSON was relate to hashing.

That lead me to have a look on which hashes and where they were deployed in the code base.

There may be good potential for performance gains in JSON parsing by leveraging SIMD optimizations.

I've done some previous work on vectorizing zlib for Chromium and it is quite surprising the performance gains achievable by working with vectors instead of a scalar approach.

The good news is that there is previous work done on the area (e.g. simdjson).

Would Folly be open to adopt SIMD specific optimizations in its JSON parser or maybe just migrate to simdjson (if it has all the required features)?

@Orvid

Orvid commented Jul 5, 2022

Copy link
Copy Markdown
Contributor

Incremental migration is likely the right way to go, but is not something that is viable to do from outside of FB unfortunately, as the scale is likely massive given just how much depends on Folly, but there are likely very few things that rely on the hash being immutable for functional correctness, so can be fixed up.

Evaluating json performance is also a bit difficult to do externally unfortunately, as we don't have any dedicated benchmarks for folly::json as far as I know

@Adenilson

Copy link
Copy Markdown
Author

Adding some people who have being active on the hashes folder: @yfeldblum, @terrelln

Any input on the ideas I shared above?

@Orvid: I've just added on this patch an initial perf utest to stress JSON parsing performance. Would something similar be acceptable (independent from any hashing work)?

@Adenilson

Copy link
Copy Markdown
Author

So the idea of a new interface for a faster hasher would look similar to (draft, incomplete):
https://gist.github.com/Adenilson/2111f3a1f33b671798a380b67dd36c9a

Meanwhile, I'm moving the changes on the Json unit test (i.e. re-enable compilation + add new unit test) to a new merge request.

@Adenilson

Copy link
Copy Markdown
Author

The small changes on the json utest is on #1812.

@Adenilson

Copy link
Copy Markdown
Author

Updated patch, changes:
a) Keep the deprecated hashes (i.e. Farmhash, SpookyHash, etc).
b) Changes on the JSON unit test were split to the other independent merge request #1812.
c) Migrate only three classes (IPAddressV6, IOBuf, RecordIO) that hopefully should be lower (?) risk.

I would appreciate feedback if the selected three classes are a safe choice to start the migration.

@Adenilson

Copy link
Copy Markdown
Author

Fixed typo in the CL msg.

@Adenilson Adenilson changed the title Introduce xxhash and migrate most of the codebase Jul 7, 2022
@Orvid

Orvid commented Jul 8, 2022

Copy link
Copy Markdown
Contributor

I'll import this version to see how much breaks, but I am working on another version of this internally that should be much easier to actually roll out.

We'll likely want to go a slightly different route to adding the xxhash dependency than we've done in the past for the other hashing algorithms, since xxhash is actually packaged for distribution in various distros, so adding it as a dependency shouldn't be too hard.

@facebook-github-bot

Copy link
Copy Markdown
Contributor

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@Adenilson

Copy link
Copy Markdown
Author

@Orvid touching the base to ask how the internal testing went and how much breakage the current version of the patch caused (hopefully none...).
:-)

I can next update the patch to leverage the xxhash library available in the system instead of importing it into the repository.

That being said, xxhash supports optimizations using latest SIMD instructions on target CPUs (e.g. https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L3124).

If we know for sure that Folly would be running always on >= Haswell, it could be potentially interesting to import the code into Folly and define the XXH_VECTOR macro to leverage AVX2.

The issue I see with using a xxhash library available in the system is that Linux distributions generally build their packages with a lower common denominator (e.g. SSE2).

Another idea that may be worth exploring is to use CMake to fetch the latest xxhash version from github and build/link with it (I recall that 'FetchContent' could be use used for this case).

What you think?

@Orvid

Orvid commented Jul 20, 2022

Copy link
Copy Markdown
Contributor

I failed to realize I hadn't updated our internal build system to account for the new source files from this PR, so the current results are that everything broke, but I've just updated the internal copy of this PR with the changes for that build system to work, so we'll see what is actually broken in a few hours. The more complete version of this swap that I've been working on internally does break a number of things, but I have plans for how to get it rolled out.

As for where to get xxhash from, internally we need to ensure everything is using the same version, so we can't have a copy in tree, but that doesn't stop us from having the header inline its definitions and linking it statically. We'll probably want to have getdeps attempt to find the system package for the open source builds, but if xxhash's inline definitions are enabled, it will use whatever instruction sets are enabled when building Folly, so there isn't anything special that's needed to enable AVX.

@Adenilson Adenilson force-pushed the xxhash01 branch 2 times, most recently from 26d4f1f to 68f78f3 Compare July 21, 2022 01:23
Comment thread folly/io/IOBuf.cpp Outdated
@Adenilson

Copy link
Copy Markdown
Author

@Orvid Please see the updated version using the system's provided xxhash library as suggested.

By the way, do you have the results of the previous version after the required internal build fixes?

@Orvid

Orvid commented Jul 21, 2022

Copy link
Copy Markdown
Contributor

After the previous fixes there are still a couple of errors, but I suspect those are just flaky.

@Adenilson

Copy link
Copy Markdown
Author

@Orvid did you have time to verify the latest version of the patch (i.e. where we use the xxhash lib available in the system as suggested)?

@Adenilson

Copy link
Copy Markdown
Author

I got two questions:
a) Does the new build file that finds xxhash (i.e. FindXxhash.cmake) got be duplicated in 'build/fbcode_builder/cmake'?

b) Is there a need to also add a new file in the manifest folder (i.e. build/fbcode_builder/manifests) to get/fetch xxhash (something like: https://gist.github.com/Adenilson/43a4a7e0acdfdaeff38f1fc3e2103241)?

@Orvid

Orvid commented Aug 11, 2022

Copy link
Copy Markdown
Contributor

Sorry for the delay.

The new build file that finds xxhash doesn't need to be duplicated, but it can live exclusively in build/fbcode_builder/cmake if desired. That directory is synced across every FB repo that is using the getdeps build system for open-source dependency management, and can be used by any project.

Adding support via the manifest is required in order for the open source CI to actually start building with xxhash available.

@Adenilson

Copy link
Copy Markdown
Author

@Orvid updated patch addressing the placement of the build file plus updating the manifest.

Anything else you like to see changed?

@Adenilson

Copy link
Copy Markdown
Author

Friendly ping on the subject?

Comment thread CMake/folly-deps.cmake Outdated

find_package(Xxhash MODULE)
set(FOLLY_HAVE_LIBXXHASH ${XXHASH_FOUND})
if(XXHASH_FOUND)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a second thought, the build will fail if xxhash is not found. Therefore is preferable to fail/abort on the configure stage, before build.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Adenilson Adenilson force-pushed the xxhash01 branch 2 times, most recently from b92c024 to d952df9 Compare September 13, 2022 17:43
Adenilson Cavalcanti added 2 commits September 14, 2022 08:15
The requirements for a good hash is a proper balance between speed
and hashing quality (i.e. avoiding collisions), while ideally being
performant on varied CPU architectures.

Currently one of the best (if not the best) hashes available is
xxhash (https://github.com/Cyan4973/xxHash) developed by Yann Collet,
being between 1.6x to 2x faster than SpookyHashV2 while keeping similar
or superior hashing quality on both x86 and ARM.

We will link to the system's provided xxhash library instead of having
a static and stale copy of xxhash source code into the repository. A new
build file (i.e. findXxhash.cmake) and the manifest were updated
accordingly.

This patch introduces xxhash and migrates only three classes (IPAddressV6,
IOBuf, RecordIO) to use it, the idea is to implement the change gradually
in small steps.
Since xxhash is required to build, it makes sense to fail early
during the configure stage before starting a build.
@Orvid

Orvid commented Sep 14, 2022

Copy link
Copy Markdown
Contributor

Sorry about the silence here, this PR appears to be good to go, but it likely isn't worth you spending more time to keep it updated, since there's no reasonable way for me to be able to merge it directly.

This change will be happening though, and the changes in this PR will be a part of that, especially the open source build setup, but I need to go about it a slightly different route, since even including xxhash in a public header seems to cause issues, and I had already been considering adding a wrapper anyways since there are faster ways to hash short strings on x86_64 than what xxhash does.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

3 participants