Skip to content

switch ddup hashing algorithm to xxHash#538

Open
theAeon wants to merge 5 commits into
hpc:mainfrom
theAeon:fast-hash
Open

switch ddup hashing algorithm to xxHash#538
theAeon wants to merge 5 commits into
hpc:mainfrom
theAeon:fast-hash

Conversation

@theAeon

@theAeon theAeon commented Oct 18, 2022

Copy link
Copy Markdown

improves performance massively and moots #535

Signed-off-by: Andrew Robbins andrew@robbinsa.me

Signed-off-by: Andrew Robbins andrew@robbinsa.me
revert strcpy, fix library linking

properly print hash digest strings

fix segfault on crash

housekeeping

Signed-off-by: Andrew Robbins andrew@robbinsa.me
fixed cmake for vendored xxhash

Signed-off-by: Andrew Robbins andrew@robbinsa.me
@theAeon

theAeon commented Jul 6, 2023

Copy link
Copy Markdown
Author

Just a quick update on the viability of this:

it seems that xxhash runs so quickly that the threads start clobbering each others' output in the outfile. might need a bit of parallelism reworking that wasn't necessary w/ a secure hash like sha256.

@jeking3 jeking3 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you meant by timing output changes?

jimk@D6JP26W0MF mpifileutils % make docker-shell                                                
docker run -it --rm mpifileutils:latest bash
root@b1e3304c1d0a:/workspace# ls
root@b1e3304c1d0a:/workspace# ls /usr/local/src
dtcmp-1.1.5  libcircle-0.3.0  lwgrp-1.0.6  mpifileutils
root@b1e3304c1d0a:/workspace# ddup /usr/local/src
[2026-06-08T21:25:58] Walking /usr/local/src
[2026-06-08T21:25:58] Walked 1080 items in 0.008 secs (138840.538 items/sec) ...
[2026-06-08T21:25:58] Walked 1080 items in 0.008 seconds (137812.969 items/sec)
/usr/local/src/lwgrp-1.0.6/config/stamp-h1 fff313aac45a7bb1
/usr/local/src/lwgrp-1.0.6/m4/ltversion.m4 03e46b630b4c4c28
/usr/local/src/dtcmp-1.1.5/config/stamp-h1 fff313aac45a7bb1
/usr/local/src/dtcmp-1.1.5/m4/ltversion.m4 03e46b630b4c4c28

It should be okay to have the different files output to stdout like this but separated.

I implemented your changes in a branch in my repo with the docker container changes and your change, showing how someone would use platform-provided libxxhash instead of a submodule. This builds clean.

https://github.com/hpc/mpifileutils/compare/main...jeking3:mpifileutils:fast-hash-docker-build?expand=1

If we can fix the issues up in this PR to match maybe we can get the maintainers to commit this and sever the openssl connection the project currently has. A bigger concern would be that this changes the actual ddup output format from a SHA256 hash to a XXHASH, so that could be a breaking change for some folks.

Comment thread src/ddup/ddup.c
XXH64_hash_t digest = XXH3_64bits_digest(state_ptr);
XXH64_canonical_t digest_canon;
XXH64_canonicalFromHash(&digest_canon, digest);
char digest_string[XXH3_DIGEST_LENGTH];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stack Corruption: XXH3_DIGEST_LENGTH is 8 but the string is going to be a hex version with a trailing NUL so we need this to use the * 2 + 1 just like the old code.

Comment thread .gitmodules
@@ -0,0 +1,4 @@
[submodule "xxhash"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend sticking to declared external dependencies for an upstream PR. debian-latest has xxhash for development and runtime - this project has no submodules and adding one complicates things. I'm adding a docker container build, for example, and would be able to build your PR including these fixes on debian:latest without the submodule.

Comment thread src/ddup/ddup.c
SHA256_CTX ctx_tmp;
memcpy(&ctx_tmp, ctx_ptr, sizeof(ctx_tmp));
SHA256_Final((unsigned char*)(ptr + 1), &ctx_tmp);
XXH64_hash_t result = XXH3_64bits_digest(state_ptr);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
XXH64_hash_t result = XXH3_64bits_digest(state_ptr);
XXH64_hash_t result = XXH3_64bits_digest(state_ptr);
memcpy(ptr + 1, &result, sizeof(result));

Without storing the result, I don't think anything would de-dupe.

Comment thread src/ddup/CMakeLists.txt
@@ -1,2 +1,2 @@
MFU_ADD_TOOL(ddup)
TARGET_LINK_LIBRARIES(ddup ${OPENSSL_LIBRARIES})
TARGET_LINK_LIBRARIES(ddup xxhash) No newline at end of file

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using inlined, this is unnecessary. There is no link library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants