fix: DRSLTR dispatch, SPAC sponsor span verification, resolver test + RFC-9112 Content-Length#127
Merged
Merged
Conversation
…ot a prospectus) Form_DRS.forms advertised DRSLTR but FORM_TO_EXTRACTOR_ID intentionally did not map it (DRSLTR is a SEC correspondence letter, not a registration prospectus). Every `sec fetch form <cik> DRSLTR` attempt threw "No extractor registered for form 'DRSLTR'" after advertising support. Drop from Form_DRS.forms; add invariant test pinning that DRSLTR is not in ALL_FORMS_MAP. DRSLTR remains a valid EDGAR form name in types/edgar/, just no longer claimed by Form_DRS as a registration prospectus.
…ntegrity bypass
`parseInt(len, 10)` accepts trailing garbage like "123abc" → 123, so a
malformed Content-Length header could either crash the size-mismatch
assertion with a misleading error or, worse, silently match a truncated
body whose advertised length happened to align with what we received.
Replace it with a strict /^\d+$/ regex check on the trimmed header and
clamp values above MAX_SAFE_INTEGER to undefined so the integrity
assertion uses exact integer comparison.
Adds three new tests covering the malformed ("123abc"), whitespace-padded
(" 10 "), and negative ("-5") cases.
The previous implementation only held the per-key mutex around the find-or-create step, then released it before resolving the alias. That left a window where two parallel `resolve()` calls on the same natural key could split: the first caller would observe the canonical row, release the mutex, and start the alias lookup; the second caller would then re-enter, see the same canonical row, and race ahead with its own alias lookup. If an operator added an alias between the create and the first caller's alias lookup, the two callers could legitimately return different ids for the same observation. Move the `canonical*AliasRepo.resolve(candidateId)` call inside the `entry.mutex.lock` callback for both PersonResolver and CompanyResolver, so the queued caller observes both the canonical row and its alias resolution and concurrent resolves converge to the same final id even when one races an alias rewrite. Tightens the JSDoc and adds an "alias resolution is serialised with create" test to each resolver test file that stubs the alias resolver with a 10ms sleep and asserts both parallel resolves return the same alias-target id and only one canonical row is minted.
Copilot review on PR #125 flagged that the prior 'invalid header → totalBytes undefined' fallback silently skipped the size-mismatch integrity check, defeating the very protection the original fix was meant to add. Switch to fail-closed semantics: when a Content-Length header is present but not a pure non-negative integer (or exceeds MAX_SAFE_INTEGER), throw an Invalid Content-Length error instead of treating the response as having no advertised length. The truly-absent case (chunked transfer) still streams to completion with no integrity assertion. Tests updated to assert the throw, matching their 'rejects' naming.
…the race The race tests added in PR #125 stubbed aliasRepo.resolve to return a constant regardless of input, so both pre-fix and post-fix code paths produced the same final ids and the tests signaled nothing. Replace with stubs that track concurrent in-flight aliasRepo.resolve calls. Pre-fix code releases the per-key mutex before the alias lookup, so two parallel resolves drive maxInFlight to 2. Post-fix code holds the alias lookup inside the mutex, so maxInFlight is bounded at 1. Verified by checking out origin/main:src/resolver/PersonResolver.ts and running the rewritten test — it fails with `Expected: 1 / Received: 2` on the pre-fix code, then passes once the fix is reapplied.
PR #125's strict /^\d+$/ regex rejected "100, 100" — the value that Headers.append produces when a real CDN (CloudFront/Akamai/ELB) sends two Content-Length: 100 lines. RFC 9112 §6.3 explicitly allows the recipient to combine equal duplicates. Bootstrap downloads are operator- triggered and not auto-retried, so this surfaced as confusing failures on previously-working downloads. Split on ',', trim each, require each to match the integer regex, then require all parts equal. Mismatched values are still rejected (genuine protocol error). Single-value behavior is byte-for-byte identical.
…re persisting processFormS1's SPAC sponsor extraction passed LLM-returned rows to SpacSponsorLinkRepo with only a confidence-floor check. The extractor input may concatenate management / beneficial-ownership / related-party section text (when the dedicated "The Sponsor" heading is absent), so LLM hallucination of company names from director bios could write permanent fact-claims keyed to the issuer CIK. Extend runSection with an optional verifyRow hook (and matching unverifiedAllDetail / unverifiedPartialDetail) so: - when every confident row fails verification, the section dead-letters as UNVERIFIED_SOURCE_SPAN instead of MODEL_EMPTY / LOW_CONFIDENCE_ALL - when some rows fail verification, the surviving rows persist normally and a "<sectionName>-partial" dead-letter is recorded for triage Sponsor section opts in with spanAppearsIn — NFKC + curly-quote + whitespace + case normalization, 3-char minimum. Bump S-1 extractor 1.0.0 -> 1.1.0; the version-slot machinery will re-run S-1 against existing filings under stricter verification.
dc3b343 to
e0c526e
Compare
…fail)
The verifyRow callback's `r.source_span` access broke TRow inference in
the SPAC sponsor runSection call: contextual typing of the verifyRow
parameter pinned TRow to the bare `{ confidence: number }` constraint,
which then broke the persist callback's `r.legal_name` / `r.common_name`
/ `r.source_span` access. CI (tsc --noEmit) caught it; local `bun test`
runs no type-check so the rebase landed it.
Derive SpacSponsorRow from `Awaited<ReturnType<typeof extractSpacSponsors>>`
and pass it as the explicit type argument to runSection so the persist /
verifyRow callbacks see the full row shape. No runtime change.
11 tasks
sroussey
added a commit
that referenced
this pull request
Jun 8, 2026
…lization (#129) * fix(resolver): serialize alias lookup inside per-key mutex for FamilyResolver The shared FamilyResolver acquired the per-key mutex, ran find-or-create, released, then ran resolveAlias() OUTSIDE the lock. Two parallel resolve() calls on the same family name could split if an alias was installed between the create and the alias-lookup — one returned the alias target, the other the pre-alias id. This is the same race PR #127 fixed for PersonResolver and CompanyResolver in commit 8fe6fd0; applying the same fix here so SponsorFamilyResolver and UnderwriterFamilyResolver benefit. Adds regression tests in both family-resolver test files that wrap aliasRepo.resolve with an in-flight counter and assert maxInFlight === 1 under two parallel resolves on the same key (pre-fix code drives it to 2). * fix(resolver): pin FamilyResolver normalize-case to UPPER; lock in via FamilyResolver.test.ts The shared normalizeFamilyName introduced in commit 960d707 silently flipped the family-name fold from UPPER (the pre-refactor SponsorFamilyResolver convention) to lower. CanonicalSponsorFamilyRepo.findByResolverAndName runs an exact-match storage query with no case fold, so any existing UPPER canonical_sponsor_family.normalized_name rows became unreachable, and operator-installed aliases keyed on those canonical ids would silently orphan. Reverts the fold to UPPER and pins it via a new dedicated FamilyResolver.test.ts that covers UPPER output, case-insensitivity, internal-whitespace collapse, and empty/whitespace-only input. Also updates three test fixtures (CanonicalSponsorFamilyRepo, CanonicalUnderwriterFamilyRepo, underwriterFamily command tests) that hardcoded lowercase normalized_name literals so they match the live convention. * fix(resolver): flip stale lowercase test literal + JSDoc to UPPER (#130) PR #129 pinned FamilyResolver normalization to UPPER but two callers were missed: - Form_S_1.storage.underwriters.test.ts: findByResolverAndName lookup still used "goldman sachs", returning undefined and crashing the next line's family! dereference. Updated to "GOLDMAN SACHS" to match the three other test files already flipped in PR #129. - SponsorFamilyResolver.normalizeSponsorFamilyName JSDoc still described lower-casing; the function delegates to normalizeFamilyName which now UPPER-cases. Updated the doc to reflect actual behavior and the canonical_sponsor_family.normalized_name column it matches against. UnderwriterFamilyResolver was checked and has no equivalent stale doc. Co-authored-by: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Four fixes. Two are corrections to PR #125 (cherry-picked here on top of
origin/main); two are new fixes that targetmaindirectly.Drop
DRSLTRfromForm_DRS.forms(targetsmain)Form_DRS.formsadvertisedDRSLTRbutFORM_TO_EXTRACTOR_IDintentionally did not map it (DRSLTR is a SEC correspondence letter, not a registration prospectus). Everysec fetch form <cik> DRSLTRattempt threw"No extractor registered for form 'DRSLTR'"after Form_DRS advertised support. Dropped fromForm_DRS.formsand added an invariant test pinning thatDRSLTRis not inALL_FORMS_MAP.Verify SPAC sponsor
source_spanbefore persisting (targetsmain)processFormS1's SPAC sponsor extraction passed LLM-returned rows toSpacSponsorLinkRepowith only a confidence-floor check. The extractor input concatenates management / beneficial-ownership / related-party section text (no dedicated "Sponsor" section), so LLM hallucination of company names from director bios could write permanent fact-claims keyed to the issuer CIK.Added a
source_spanverbatim verification pass (NFKC + curly-quote + whitespace + case normalization). Drops any returned row whosesource_spanis not present in the sponsor section text:UNVERIFIED_SOURCE_SPAN."spac-sponsors-partial"dead-letter for triage.Bumped S-1 extractor default
1.0.0→1.1.0; version-slot machinery re-runs S-1 against existing filings under stricter verification.Make resolver alias-race test actually discriminate the race (correction to PR #125)
PR #125's alias-serialisation test stubbed
aliasRepo.resolveto return a constant id regardless of input, so both the pre-fix code path (alias lookup outside the mutex) and the post-fix code path (alias lookup inside the mutex) produced identical final ids — the test signaled nothing.Replaced with a stub that tracks concurrent in-flight
aliasRepo.resolvecalls:mutex.release()→ two parallel resolves drivemaxInFlightto 2.maxInFlightis bounded at 1.Verified by checking out
origin/main:src/resolver/PersonResolver.tsand running the rewritten test — fails withExpected: 1 / Received: 2on pre-fix code, passes once the fix is reapplied.Accept RFC 9112 duplicate-equal Content-Length values (correction to PR #125)
PR #125's strict
/^\d+$/regex rejected"100, 100"— the valueHeaders.appendproduces when a real CDN (CloudFront / Akamai / ELB) sends twoContent-Length: 100lines. RFC 9112 §6.3 explicitly allows the recipient to combine duplicate equal values. Because bootstrap downloads are operator-triggered and not auto-retried, this surfaced as confusing failures on previously-working downloads.Split on
,, trim each part, require each to match the integer regex, then require all parts equal. Mismatched values are still rejected (genuine protocol error). Single-value behavior is byte-for-byte identical.Relationship to PR #125
PR #125's underlying intent (strict Content-Length parse + alias-resolution inside mutex) was correct. The three commits from PR #125 are cherry-picked here as the base, with the test correction and the Content-Length tolerance correction layered on top.
Maintainer should choose one:
Test plan
bun test src/storage/versioning/extractorIds.test.ts(17 pass)bun test src/sec/forms/registration-statements/s1/verifySourceSpan.test.ts— utility (12 pass)bun test src/sec/forms/registration-statements/s1/spacSponsor.e2e.test.ts— storage e2e (6 pass)bun test src/resolver/PersonResolver.test.ts src/resolver/CompanyResolver.test.ts(18 pass)bun test src/task/bootstrap/BootstrapDownloadTask.test.ts(13 pass)bun test src/sec/forms/(243 pass)bun test src/storage/(412 pass)bun testwhole repo (1042 pass)origin/main, confirmed new test fails withExpected: 1 / Received: 2.https://claude.ai/code/session_01EswfjUtvDWz8z1wQJspg2v
Generated by Claude Code