fix(task-graph,storage): cache restart-resume + SharedInMemory sync barrier#552
Merged
Conversation
…nerated ids Restart-resume of run-private cache entries relies on a stable cache identity per task. When config.id is not supplied, Task.constructor mints a fresh v4 UUID per process, so a crash + restart would mint new ids and orphan every prior row. This adds Task.hasDeterministicId() (false for v4-UUID-shaped ids) and makes CacheCoordinator key private entries by task type as a best-effort fallback when the id is autogenerated. RunPrivateCacheRepo logs a single warn per process the first time the fallback engages, pointing operators at pinning config.id to enable exact crash-resume. Adds a regression test that seeds a prior run's row under the type-keyed fallback, then verifies a fresh graph with autogen ids reuses it; also pins the warn-once behavior and confirms pinned-id tasks bypass the fallback.
…sync barrier 64591e3 dropped the await on syncFromOtherTabs() in setupDatabase, making the initial cross-tab sync fire-and-forget. Callers that immediately put/get after setupDatabase() resolves now race other tabs' replayed rows. Restores the await as the default and gates it on a new awaitTabSync option (default true) so latency-sensitive startup paths can still opt out explicitly. The internal sync state now exposes a settle promise that resolves on either a SYNC_RESPONSE apply or the SYNC_TIMEOUT fallback, so no caller can deadlock waiting for a peer that never answers. Adds a regression test that pre-populates a peer tab, then asserts the first get on a freshly initialized tab sees the peer's row. Covers the opt-out and no-peer-timeout paths.
Coverage Report
File CoverageNo changed files found. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two targeted fixes against
origin/mainthat landed in recent refactors and silently degraded restart-resume / cross-tab sync semantics. Each fix ships with a regression test underpackages/test/.Fix 1 — Critical: private cache key broke crash-restart-resume
RunPrivateCacheReporows are keyed__run:${runId}::${taskId}.Task.iddefaults touuid4()whenconfig.idis not supplied (seepackages/task-graph/src/task/Task.ts), so after a crash the same graph mints fresh UUIDs on construction and every prior run-private row is orphaned — the entire point of the private cache tier (restart-survival) silently fails.Changes:
Task.hasDeterministicId()(declared onITaskState) — returnsfalsefor canonical v4-UUID-shaped ids,truefor any other non-empty string id (i.e. the caller pinned one).CacheCoordinator.cacheIdentityKeykeys private entries bytask.idonly whenhasDeterministicId(); otherwise it keys bytask.typeso a restart with new autogenerated ids still finds prior rows for the same task type within the samerunId.RunPrivateCacheRepo.noteFallbackKey()fires a singleconsole.warnper process the first time the fallback is engaged ("pin task.id to enable crash-resume"); pinned-id graphs stay silent. Gated by a static boolean so happy-path operation does not log.RunPrivateCacheRepodoc comment to spell out: durable restart-resume requires deterministic task ids; without one the cache is best-effort intra-process only.packages/test/src/test/task-graph-cache/RunPrivateCacheKeyFallback.test.ts:Fix 2 — High:
SharedInMemoryTabularStorage.setupDatabaselost its tab-sync barrierCommit
64591e3dropped theawaitonsyncFromOtherTabs()insidesetupDatabase, turning startup into fire-and-forget. Any caller that immediately reads/writes aftersetupDatabase()resolves now races other tabs' replayed rows.Changes:
syncFromOtherTabs()now installs asyncSettledpromise that resolves either on aSYNC_RESPONSEapply or on theSYNC_TIMEOUTfallback (so a peerless tab never deadlocks).setupDatabase({ awaitTabSync?: boolean })defaults totrue(restoring previous behaviour) and awaits the sync barrier before applying migrations. Latency-sensitive callers can opt out with{ awaitTabSync: false }.packages/test/src/test/storage-tabular/SharedInMemoryTabularStorageSyncBarrier.test.ts:BroadcastChannel, then asserts the firstgeton a freshly initialized tab sees the peer's row.awaitTabSync: falseopt-out path resolves promptly and still functions.Test plan
bun run build:types(turbo, 36 packages) — all greenbunx vitest run packages/test/src/test/task-graph-cache/— 15 files / 58 tests passbunx vitest run packages/test/src/test/storage-tabular/— 19 files / 1075 tests pass (13 pre-existing skips)bunx vitest run packages/test/src/test/task-graph/ packages/test/src/test/task-graph-output-cache/— 51 files / 637 tests passbunx vitest run packages/test/src/test/task/— 72 files / 1117 tests pass (24 pre-existing skips)origin/mainwithout the fixes.Generated by Claude Code