fix(prover): make shard and reduce tasks idempotent under re-delivery #2848

Merged: tamirhemo merged 5 commits into main from farhad/idempotent-task-redelivery on Jun 19, 2026

@Farhad-Shabani Farhad-Shabani commented Jun 18, 2026

Contributor

Problem

The cluster delivers tasks at-least-once: the same task_id can run again via dead-worker requeue (30s heartbeat timeout) or fair-share preemption requeue. But a prover task deletes its input artifact on success ("no longer needed"). So a re-run tries to download an input that a prior run already deleted → NotFound → TaskError::Fatal → fail_proof → the whole request goes Unfulfillable, even though a valid proof already exists.

Observed on mainnet: a ProveShard uploaded its proof; fair-share preemption requeued the same task a few ms later; the re-run on another worker 404'd on the deleted input and failed the proof. The duplicate-delivery path produced the same failure repeatedly.
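The failure mode can be reproduced in miniature. The sketch below is illustrative only: it assumes a plain `HashMap` as the artifact store, and `naive_prove_shard` and this cut-down `TaskError` are hypothetical stand-ins, not the code in this repository. It shows the pre-fix ordering, where the first delivery succeeds and the re-delivery fails because the input is already gone.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the task error type (hypothetical).
#[derive(Debug, PartialEq)]
pub enum TaskError {
    Fatal(String),
}

/// Pre-fix behaviour, sketched: download the input, upload the output,
/// delete the input, and treat a missing input as fatal with no
/// "was this already done?" check.
pub fn naive_prove_shard(
    store: &mut HashMap<String, Vec<u8>>,
    input_key: &str,
    output_key: &str,
) -> Result<(), TaskError> {
    let input = store
        .get(input_key)
        .cloned()
        .ok_or_else(|| TaskError::Fatal(format!("input {} not found", input_key)))?;

    // Placeholder for real proving: the "proof" is just a copy of the input.
    store.insert(output_key.to_string(), input);

    // Eager reclamation of the input.
    store.remove(input_key);
    Ok(())
}
```

Calling `naive_prove_shard` twice with the same keys returns `Ok(())` the first time and `Err(TaskError::Fatal(..))` the second time, even though the output from the first run is still in the store.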

Fix

Two parts, so re-execution is harmless and prompt reclamation is kept:

  1. Idempotent consumers. On an input-download failure, if the input is gone and the task's output already exists, a prior run already did the work — report success instead of failing. Applied to ProveShard (prover/core.rs) and recursion reduce (node/full/init.rs, with ReduceTaskRequest::already_reduced / RangeProofs::any_proof_missing).

  2. Delete inputs last. A consumer deletes its input only after every output is durable. ProveShard produces two outputs (output + deferred_output); the input delete was moved after the deferred upload completes, so "input gone" means the task is fully done — not partway. (This also fixes a latent bug: a deferred-upload failure previously returned an error after the input was already deleted, so the retry 404'd.)

The eager delete stays (prompt reclamation — important on memory-bounded stores); (1) and (2) make it safe.
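A minimal sketch of how (1) and (2) fit together, under stated assumptions: `ArtifactStore`, `ShardTask`, `Outcome`, and `run_shard` are hypothetical names for illustration, the store is an in-memory map, and the "proof" is placeholder data. Checking both outputs before reporting success is also an assumption of this sketch; the actual code in prover/core.rs may structure the check differently.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the artifact store.
#[derive(Default)]
pub struct ArtifactStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl ArtifactStore {
    pub fn upload(&mut self, key: &str, data: Vec<u8>) {
        self.blobs.insert(key.to_string(), data);
    }
    pub fn download(&self, key: &str) -> Option<Vec<u8>> {
        self.blobs.get(key).cloned()
    }
    pub fn exists(&self, key: &str) -> bool {
        self.blobs.contains_key(key)
    }
    pub fn delete(&mut self, key: &str) {
        self.blobs.remove(key);
    }
}

/// Simplified stand-in for the task error type (hypothetical).
#[derive(Debug, PartialEq)]
pub enum TaskError {
    Fatal(String),
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// This run did the work.
    Proved,
    /// A prior run already produced every output.
    AlreadyDone,
}

/// Artifact keys for one shard task (hypothetical layout).
pub struct ShardTask {
    pub input: String,
    pub output: String,
    pub deferred_output: String,
}

pub fn run_shard(store: &mut ArtifactStore, task: &ShardTask) -> Result<Outcome, TaskError> {
    let input = match store.download(&task.input) {
        Some(bytes) => bytes,
        // Part 1: input gone but outputs present means a prior run finished.
        None if store.exists(&task.output) && store.exists(&task.deferred_output) => {
            return Ok(Outcome::AlreadyDone);
        }
        // Input gone and outputs missing: a genuine failure.
        None => return Err(TaskError::Fatal(format!("input {} not found", task.input))),
    };

    // Placeholder for real proving work.
    let proof: Vec<u8> = input.iter().rev().copied().collect();
    store.upload(&task.output, proof);
    store.upload(&task.deferred_output, vec![input.len() as u8]);

    // Part 2: reclaim the input only after every output is stored.
    store.delete(&task.input);
    Ok(Outcome::Proved)
}
```

With this ordering, a re-delivered task whose input has been reclaimed returns `Outcome::AlreadyDone` instead of a fatal error, while a task whose input is missing and whose outputs were never written still fails.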

github-actions Bot commented Jun 18, 2026

Contributor
| Test | Old | New | Diff |
| --- | ---: | ---: | ---: |
| secp256k1_program_test_recover_rand_lte_100 | 5533240 | 5532101 | -0.0206 % |
| bn_test_bn_test_g1_double_100 | 735518 | 735518 | 0.0000 % |
| sha_test_sha3_expected_digest_lte_100_times | 1198911 | 1198895 | -0.0013 % |
| sha_test_sha2_v0_10_6_expected_digest_lte_100_times | 1358647 | 1360552 | 0.1402 % |
| k256_test_schnorr_verify | 5718420 | 5716491 | -0.0337 % |
| k256_test_verify_rand_lte_100 | 11759246 | 11757190 | -0.0175 % |
| sha_test_sha2_v0_9_9_expected_digest_lte_100_times | 1271817 | 1263972 | -0.6168 % |
| rustcrypto_bigint_test_bigint_mul_mod_special | 1789627 | 1789627 | 0.0000 % |
| bls12_381_tests_test_bls_add_100 | 10376534 | 10376534 | 0.0000 % |
| curve25519_dalek_test_decompressed_noncanonical | 7851 | 7851 | 0.0000 % |
| sha_test_sha2_v0_10_8_expected_digest_lte_100_times | 1354933 | 1353540 | -0.1028 % |
| curve25519_dalek_test_add_then_multiply | 3151770 | 3191343 | 1.2556 % |
| curve25519_dalek_test_ed25519_verify | 13354341 | 13352552 | -0.0134 % |
| curve25519_dalek_ng_test_zero_mul | 107715 | 107715 | 0.0000 % |
| bls12_381_tests_test_inverse_fp2_100 | 2230063 | 2230063 | 0.0000 % |
| secp256k1_program_test_verify_v0_30_0_rand_lte_100 | 17127451 | 17103046 | -0.1425 % |
| sha_test_sha2_v0_10_9_expected_digest_lte_100_times | 1360615 | 1355582 | -0.3699 % |
| k256_test_recover_high_hash_high_recid | 2202553 | 1802907 | -18.1447 % |
| bn_test_bn_test_g1_msm_edge | 411941 | 411941 | 0.0000 % |
| curve25519_dalek_test_zero_msm | 83313 | 83313 | 0.0000 % |
| curve25519_dalek_ng_test_zero_msm | 125094 | 125094 | 0.0000 % |
| bls12_381_tests_test_inverse_fp_100 | 1205483 | 1205483 | 0.0000 % |
| bn_test_bn_test_g1_add_100 | 998825 | 998832 | 0.0007 % |
| curve25519_dalek_test_decompressed_expected_value | 4641641 | 4519289 | -2.6360 % |
| rustcrypto_bigint_test_bigint_mul_add_residue | 1751130 | 1751130 | 0.0000 % |
| rust_crypto_rsa_test_pkcs_verify_100 | 29141122 | 29133390 | -0.0265 % |
| secp256k1_program_test_verify_rand_lte_100 | 17089466 | 17086084 | -0.0198 % |
| p256_test_recover_high_hash_high_recid | 5788691 | 5206203 | -10.0625 % |
| bls12_381_tests_test_bls_double_100 | 6310230 | 6310230 | 0.0000 % |
| bn_test_bn_test_fq_partial_ord | 186343 | 186343 | 0.0000 % |
| bn_test_bn_test_g1_mul_zero | 48333 | 48333 | 0.0000 % |
| curve25519_dalek_test_zero_mul | 71736 | 71736 | 0.0000 % |
| keccack_test_expected_digest_lte_100 | 1723815 | 1724263 | 0.0260 % |
| p256_test_recover_rand_lte_100 | 15754964 | 15764860 | 0.0628 % |
| curve25519_dalek_ng_test_decompressed_noncanonical | 195347 | 195347 | 0.0000 % |
| bn_test_bn_test_fr_inverse_100 | 822631 | 822631 | 0.0000 % |
| bn_test_bn_test_fq_sqrt_100 | 804031 | 804031 | 0.0000 % |
| bn_test_bn_test_g1_add_neg | 299591 | 299549 | -0.0140 % |
| bls12_381_tests_test_sqrt_fp_100 | 903335 | 903462 | 0.0141 % |
| bn_test_bn_test_fq_inverse_100 | 805631 | 805631 | 0.0000 % |
| p256_test_verify_rand_lte_100 | 11994638 | 12008651 | 0.1168 % |
| k256_test_recover_rand_lte_100 | 4581112 | 4585784 | 0.1020 % |
| p256_test_recover_pubkey_infinity | 97138 | 97138 | 0.0000 % |
| secp256k1_program_test_recover_v0_30_0_rand_lte_100 | 5474094 | 5484390 | 0.1881 % |
| curve25519_dalek_ng_test_add_then_multiply | 3669003 | 3341471 | -8.9270 % |
| bls12_381_tests_test_sqrt_fp2_100 | 1824892 | 1938905 | 6.2477 % |
| k256_test_recover_pubkey_infinity | 102032 | 102032 | 0.0000 % |
| k256_test_point_ops_edge_cases | 32652 | 32652 | 0.0000 % |

@tamirhemo

Contributor

I see the need for the fix, but the implementation seems a bit messy. Maybe there is a cleaner way to do this, with some sort of method on the raw proof requests that checks whether the outputs exist?

deferred_upload_handle.await.map_err(|e| TaskError::Fatal(e.into()))??;
}

// Reclaim the input record only after every output (`output` and the

This change makes sense

@tamirhemo tamirhemo merged commit e5e2706 into main Jun 19, 2026
11 checks passed
@tamirhemo tamirhemo deleted the farhad/idempotent-task-redelivery branch June 19, 2026 21:33