switch ddup hashing algorithm to xxHash#538
Conversation
Signed-off-by: Andrew Robbins andrew@robbinsa.me
revert strcpy, fix library linking properly print hash digest strings fix segfault on crash housekeeping Signed-off-by: Andrew Robbins andrew@robbinsa.me
fixed cmake for vendored xxhash Signed-off-by: Andrew Robbins andrew@robbinsa.me
|
Just a quick update on the viability of this: it seems that xxhash runs so quickly that the threads start clobbering each others' output in the outfile. might need a bit of parallelism reworking that wasn't necessary w/ a secure hash like sha256. |
jeking3
left a comment
There was a problem hiding this comment.
Is this what you meant by timing output changes?
jimk@D6JP26W0MF mpifileutils % make docker-shell
docker run -it --rm mpifileutils:latest bash
root@b1e3304c1d0a:/workspace# ls
root@b1e3304c1d0a:/workspace# ls /usr/local/src
dtcmp-1.1.5 libcircle-0.3.0 lwgrp-1.0.6 mpifileutils
root@b1e3304c1d0a:/workspace# ddup /usr/local/src
[2026-06-08T21:25:58] Walking /usr/local/src
[2026-06-08T21:25:58] Walked 1080 items in 0.008 secs (138840.538 items/sec) ...
[2026-06-08T21:25:58] Walked 1080 items in 0.008 seconds (137812.969 items/sec)
/usr/local/src/lwgrp-1.0.6/config/stamp-h1 fff313aac45a7bb1
/usr/local/src/lwgrp-1.0.6/m4/ltversion.m4 03e46b630b4c4c28
/usr/local/src/dtcmp-1.1.5/config/stamp-h1 fff313aac45a7bb1
/usr/local/src/dtcmp-1.1.5/m4/ltversion.m4 03e46b630b4c4c28
It should be okay to have the different files output to stdout like this but separated.
I implemented your changes in a branch in my repo with the docker container changes and your change, showing how someone would use platform-provided libxxhash instead of a submodule. This builds clean.
If we can fix the issues up in this PR to match maybe we can get the maintainers to commit this and sever the openssl connection the project currently has. A bigger concern would be that this changes the actual ddup output format from a SHA256 hash to a XXHASH, so that could be a breaking change for some folks.
| XXH64_hash_t digest = XXH3_64bits_digest(state_ptr); | ||
| XXH64_canonical_t digest_canon; | ||
| XXH64_canonicalFromHash(&digest_canon, digest); | ||
| char digest_string[XXH3_DIGEST_LENGTH]; |
There was a problem hiding this comment.
Stack Corruption: XXH3_DIGEST_LENGTH is 8 but the string is going to be a hex version with a trailing NUL so we need this to use the * 2 + 1 just like the old code.
| @@ -0,0 +1,4 @@ | |||
| [submodule "xxhash"] | |||
There was a problem hiding this comment.
I would recommend sticking to declared external dependencies for an upstream PR. debian-latest has xxhash for development and runtime - this project has no submodules and adding one complicates things. I'm adding a docker container build, for example, and would be able to build your PR including these fixes on debian:latest without the submodule.
| SHA256_CTX ctx_tmp; | ||
| memcpy(&ctx_tmp, ctx_ptr, sizeof(ctx_tmp)); | ||
| SHA256_Final((unsigned char*)(ptr + 1), &ctx_tmp); | ||
| XXH64_hash_t result = XXH3_64bits_digest(state_ptr); |
There was a problem hiding this comment.
| XXH64_hash_t result = XXH3_64bits_digest(state_ptr); | |
| XXH64_hash_t result = XXH3_64bits_digest(state_ptr); | |
| memcpy(ptr + 1, &result, sizeof(result)); |
Without storing the result, I don't think anything would de-dupe.
| @@ -1,2 +1,2 @@ | |||
| MFU_ADD_TOOL(ddup) | |||
| TARGET_LINK_LIBRARIES(ddup ${OPENSSL_LIBRARIES}) | |||
| TARGET_LINK_LIBRARIES(ddup xxhash) No newline at end of file | |||
There was a problem hiding this comment.
If using inlined, this is unnecessary. There is no link library.
improves performance massively and moots #535
Signed-off-by: Andrew Robbins andrew@robbinsa.me