Skip to content

fix: DRSLTR dispatch, SPAC sponsor span verification, resolver test + RFC-9112 Content-Length#127

Merged
sroussey merged 9 commits into
mainfrom
claude/wonderful-hypatia-HGrPZ
Jun 7, 2026
Merged

fix: DRSLTR dispatch, SPAC sponsor span verification, resolver test + RFC-9112 Content-Length#127
sroussey merged 9 commits into
mainfrom
claude/wonderful-hypatia-HGrPZ

Conversation

@sroussey

@sroussey sroussey commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Summary

Four fixes. Two are corrections to PR #125 (cherry-picked here on top of origin/main); two are new fixes that target main directly.

Drop DRSLTR from Form_DRS.forms (targets main)

Form_DRS.forms advertised DRSLTR but FORM_TO_EXTRACTOR_ID intentionally did not map it (DRSLTR is a SEC correspondence letter, not a registration prospectus). Every sec fetch form <cik> DRSLTR attempt threw "No extractor registered for form 'DRSLTR'" after Form_DRS advertised support. Dropped from Form_DRS.forms and added an invariant test pinning that DRSLTR is not in ALL_FORMS_MAP.

Verify SPAC sponsor source_span before persisting (targets main)

processFormS1's SPAC sponsor extraction passed LLM-returned rows to SpacSponsorLinkRepo with only a confidence-floor check. The extractor input concatenates management / beneficial-ownership / related-party section text (no dedicated "Sponsor" section), so LLM hallucination of company names from director bios could write permanent fact-claims keyed to the issuer CIK.

Added a source_span verbatim verification pass (NFKC + curly-quote + whitespace + case normalization). Drops any returned row whose source_span is not present in the sponsor section text:

  • All rows dropped → dead-letter as UNVERIFIED_SOURCE_SPAN.
  • Some rows dropped → write the verified ones AND record a "spac-sponsors-partial" dead-letter for triage.

Bumped S-1 extractor default 1.0.01.1.0; version-slot machinery re-runs S-1 against existing filings under stricter verification.

Make resolver alias-race test actually discriminate the race (correction to PR #125)

PR #125's alias-serialisation test stubbed aliasRepo.resolve to return a constant id regardless of input, so both the pre-fix code path (alias lookup outside the mutex) and the post-fix code path (alias lookup inside the mutex) produced identical final ids — the test signaled nothing.

Replaced with a stub that tracks concurrent in-flight aliasRepo.resolve calls:

  • Pre-fix: alias lookup runs after mutex.release() → two parallel resolves drive maxInFlight to 2.
  • Post-fix: alias lookup runs inside the lock → maxInFlight is bounded at 1.

Verified by checking out origin/main:src/resolver/PersonResolver.ts and running the rewritten test — fails with Expected: 1 / Received: 2 on pre-fix code, passes once the fix is reapplied.

Accept RFC 9112 duplicate-equal Content-Length values (correction to PR #125)

PR #125's strict /^\d+$/ regex rejected "100, 100" — the value Headers.append produces when a real CDN (CloudFront / Akamai / ELB) sends two Content-Length: 100 lines. RFC 9112 §6.3 explicitly allows the recipient to combine duplicate equal values. Because bootstrap downloads are operator-triggered and not auto-retried, this surfaced as confusing failures on previously-working downloads.

Split on ,, trim each part, require each to match the integer regex, then require all parts equal. Mismatched values are still rejected (genuine protocol error). Single-value behavior is byte-for-byte identical.

Relationship to PR #125

PR #125's underlying intent (strict Content-Length parse + alias-resolution inside mutex) was correct. The three commits from PR #125 are cherry-picked here as the base, with the test correction and the Content-Length tolerance correction layered on top.

Maintainer should choose one:

Test plan

  • bun test src/storage/versioning/extractorIds.test.ts (17 pass)
  • bun test src/sec/forms/registration-statements/s1/verifySourceSpan.test.ts — utility (12 pass)
  • bun test src/sec/forms/registration-statements/s1/spacSponsor.e2e.test.ts — storage e2e (6 pass)
  • bun test src/resolver/PersonResolver.test.ts src/resolver/CompanyResolver.test.ts (18 pass)
  • bun test src/task/bootstrap/BootstrapDownloadTask.test.ts (13 pass)
  • bun test src/sec/forms/ (243 pass)
  • bun test src/storage/ (412 pass)
  • bun test whole repo (1042 pass)
  • Discrimination check: reverted resolver to origin/main, confirmed new test fails with Expected: 1 / Received: 2.

https://claude.ai/code/session_01EswfjUtvDWz8z1wQJspg2v


Generated by Claude Code

claude added 7 commits June 6, 2026 17:47
…ot a prospectus)

Form_DRS.forms advertised DRSLTR but FORM_TO_EXTRACTOR_ID intentionally
did not map it (DRSLTR is a SEC correspondence letter, not a registration
prospectus). Every `sec fetch form <cik> DRSLTR` attempt threw
"No extractor registered for form 'DRSLTR'" after advertising support.

Drop from Form_DRS.forms; add invariant test pinning that DRSLTR is not
in ALL_FORMS_MAP. DRSLTR remains a valid EDGAR form name in
types/edgar/, just no longer claimed by Form_DRS as a registration
prospectus.
…ntegrity bypass

`parseInt(len, 10)` accepts trailing garbage like "123abc" → 123, so a
malformed Content-Length header could either crash the size-mismatch
assertion with a misleading error or, worse, silently match a truncated
body whose advertised length happened to align with what we received.
Replace it with a strict /^\d+$/ regex check on the trimmed header and
clamp values above MAX_SAFE_INTEGER to undefined so the integrity
assertion uses exact integer comparison.

Adds three new tests covering the malformed ("123abc"), whitespace-padded
("  10  "), and negative ("-5") cases.
The previous implementation only held the per-key mutex around the
find-or-create step, then released it before resolving the alias. That
left a window where two parallel `resolve()` calls on the same natural
key could split: the first caller would observe the canonical row,
release the mutex, and start the alias lookup; the second caller would
then re-enter, see the same canonical row, and race ahead with its own
alias lookup. If an operator added an alias between the create and the
first caller's alias lookup, the two callers could legitimately return
different ids for the same observation.

Move the `canonical*AliasRepo.resolve(candidateId)` call inside the
`entry.mutex.lock` callback for both PersonResolver and CompanyResolver,
so the queued caller observes both the canonical row and its alias
resolution and concurrent resolves converge to the same final id even
when one races an alias rewrite.

Tightens the JSDoc and adds an "alias resolution is serialised with
create" test to each resolver test file that stubs the alias resolver
with a 10ms sleep and asserts both parallel resolves return the same
alias-target id and only one canonical row is minted.
Copilot review on PR #125 flagged that the prior 'invalid header →
totalBytes undefined' fallback silently skipped the size-mismatch
integrity check, defeating the very protection the original fix was
meant to add. Switch to fail-closed semantics: when a Content-Length
header is present but not a pure non-negative integer (or exceeds
MAX_SAFE_INTEGER), throw an Invalid Content-Length error instead of
treating the response as having no advertised length. The truly-absent
case (chunked transfer) still streams to completion with no integrity
assertion. Tests updated to assert the throw, matching their 'rejects'
naming.
…the race

The race tests added in PR #125 stubbed aliasRepo.resolve to return a
constant regardless of input, so both pre-fix and post-fix code paths
produced the same final ids and the tests signaled nothing.

Replace with stubs that track concurrent in-flight aliasRepo.resolve
calls. Pre-fix code releases the per-key mutex before the alias lookup,
so two parallel resolves drive maxInFlight to 2. Post-fix code holds
the alias lookup inside the mutex, so maxInFlight is bounded at 1.

Verified by checking out origin/main:src/resolver/PersonResolver.ts and
running the rewritten test — it fails with `Expected: 1 / Received: 2`
on the pre-fix code, then passes once the fix is reapplied.
PR #125's strict /^\d+$/ regex rejected "100, 100" — the value that
Headers.append produces when a real CDN (CloudFront/Akamai/ELB) sends
two Content-Length: 100 lines. RFC 9112 §6.3 explicitly allows the
recipient to combine equal duplicates. Bootstrap downloads are operator-
triggered and not auto-retried, so this surfaced as confusing failures
on previously-working downloads.

Split on ',', trim each, require each to match the integer regex, then
require all parts equal. Mismatched values are still rejected (genuine
protocol error). Single-value behavior is byte-for-byte identical.
…re persisting

processFormS1's SPAC sponsor extraction passed LLM-returned rows to
SpacSponsorLinkRepo with only a confidence-floor check. The
extractor input may concatenate management / beneficial-ownership /
related-party section text (when the dedicated "The Sponsor" heading
is absent), so LLM hallucination of company names from director bios
could write permanent fact-claims keyed to the issuer CIK.

Extend runSection with an optional verifyRow hook (and matching
unverifiedAllDetail / unverifiedPartialDetail) so:
- when every confident row fails verification, the section dead-letters
  as UNVERIFIED_SOURCE_SPAN instead of MODEL_EMPTY / LOW_CONFIDENCE_ALL
- when some rows fail verification, the surviving rows persist normally
  and a "<sectionName>-partial" dead-letter is recorded for triage

Sponsor section opts in with spanAppearsIn — NFKC + curly-quote +
whitespace + case normalization, 3-char minimum.

Bump S-1 extractor 1.0.0 -> 1.1.0; the version-slot machinery will
re-run S-1 against existing filings under stricter verification.
@sroussey sroussey force-pushed the claude/wonderful-hypatia-HGrPZ branch from dc3b343 to e0c526e Compare June 6, 2026 17:55
claude added 2 commits June 6, 2026 18:26
…fail)

The verifyRow callback's `r.source_span` access broke TRow inference in
the SPAC sponsor runSection call: contextual typing of the verifyRow
parameter pinned TRow to the bare `{ confidence: number }` constraint,
which then broke the persist callback's `r.legal_name` / `r.common_name`
/ `r.source_span` access. CI (tsc --noEmit) caught it; local `bun test`
runs no type-check so the rebase landed it.

Derive SpacSponsorRow from `Awaited<ReturnType<typeof extractSpacSponsors>>`
and pass it as the explicit type argument to runSection so the persist /
verifyRow callbacks see the full row shape. No runtime change.
@sroussey sroussey changed the title fix: HIGH-priority — DRSLTR dispatch, SPAC sponsor span verification, resolver test + RFC-9112 Content-Length fix: DRSLTR dispatch, SPAC sponsor span verification, resolver test + RFC-9112 Content-Length Jun 7, 2026
@sroussey sroussey merged commit 8fe6fd0 into main Jun 7, 2026
1 check passed
sroussey added a commit that referenced this pull request Jun 8, 2026
…lization (#129)

* fix(resolver): serialize alias lookup inside per-key mutex for FamilyResolver

The shared FamilyResolver acquired the per-key mutex, ran find-or-create,
released, then ran resolveAlias() OUTSIDE the lock. Two parallel resolve()
calls on the same family name could split if an alias was installed between
the create and the alias-lookup — one returned the alias target, the other
the pre-alias id. This is the same race PR #127 fixed for PersonResolver
and CompanyResolver in commit 8fe6fd0; applying the same fix here so
SponsorFamilyResolver and UnderwriterFamilyResolver benefit.

Adds regression tests in both family-resolver test files that wrap
aliasRepo.resolve with an in-flight counter and assert maxInFlight === 1
under two parallel resolves on the same key (pre-fix code drives it to 2).

* fix(resolver): pin FamilyResolver normalize-case to UPPER; lock in via FamilyResolver.test.ts

The shared normalizeFamilyName introduced in commit 960d707 silently flipped
the family-name fold from UPPER (the pre-refactor SponsorFamilyResolver
convention) to lower. CanonicalSponsorFamilyRepo.findByResolverAndName runs
an exact-match storage query with no case fold, so any existing UPPER
canonical_sponsor_family.normalized_name rows became unreachable, and
operator-installed aliases keyed on those canonical ids would silently
orphan.

Reverts the fold to UPPER and pins it via a new dedicated FamilyResolver.test.ts
that covers UPPER output, case-insensitivity, internal-whitespace collapse,
and empty/whitespace-only input. Also updates three test fixtures
(CanonicalSponsorFamilyRepo, CanonicalUnderwriterFamilyRepo,
underwriterFamily command tests) that hardcoded lowercase normalized_name
literals so they match the live convention.

* fix(resolver): flip stale lowercase test literal + JSDoc to UPPER (#130)

PR #129 pinned FamilyResolver normalization to UPPER but two callers were
missed:

- Form_S_1.storage.underwriters.test.ts: findByResolverAndName lookup still
  used "goldman sachs", returning undefined and crashing the next line's
  family! dereference. Updated to "GOLDMAN SACHS" to match the three other
  test files already flipped in PR #129.
- SponsorFamilyResolver.normalizeSponsorFamilyName JSDoc still described
  lower-casing; the function delegates to normalizeFamilyName which now
  UPPER-cases. Updated the doc to reflect actual behavior and the
  canonical_sponsor_family.normalized_name column it matches against.
  UnderwriterFamilyResolver was checked and has no equivalent stale doc.

Co-authored-by: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants