S-1 SPAC classification + sponsor entities, DRS support, and fetch-layer dead-lettering #124

Merged
sroussey merged 22 commits into main from claude/relaxed-shannon-3cdWq on Jun 5, 2026
@sroussey sroussey commented Jun 4, 2026

Implements two related streams on top of the S-1 extraction foundation (#123).

Stream 1 — Fetch-layer dead-lettering

ProcessAccessionDocFormTask now splits three failure domains: primary-doc resolution → PRIMARY_DOC_UNRESOLVED; body fetch → FETCH_ERROR; and parse/store → unchanged record-and-rethrow. The first two are filing-level: they are recorded and swallowed so batch runs continue, and the entries join the version-gated retry sweep. This applies to all forms; the fetch is isolated behind an overridable runFetch() seam so the failure paths are unit-tested without the network.
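
The three failure domains can be sketched roughly as follows. This is a minimal illustration, not the task's actual code: only the two dead-letter codes and the runFetch() seam come from the description above. The class name, the deadLetters and stored arrays, and all method signatures are assumptions made for the example.

```typescript
type DeadLetterCode = "PRIMARY_DOC_UNRESOLVED" | "FETCH_ERROR";

interface DeadLetter {
  accession: string;
  code: DeadLetterCode;
  message: string;
}

// Safe message extraction: a non-Error throw still yields a string.
function errorMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Hypothetical stand-in for the real task; names are illustrative only.
class FilingTaskSketch {
  readonly deadLetters: DeadLetter[] = [];
  readonly stored: string[] = [];

  protected resolvePrimaryDoc(accession: string, primaryDoc: string | null): string {
    if (!primaryDoc) throw new Error(`no primary document for ${accession}`);
    return `${accession}/${primaryDoc}`;
  }

  // Overridable seam: tests subclass and replace this instead of using the network.
  protected async runFetch(url: string): Promise<string> {
    throw new Error(`no network in this sketch: ${url}`);
  }

  protected parseAndStore(body: string): void {
    if (body.length === 0) throw new Error("empty body");
    this.stored.push(body);
  }

  async process(accession: string, primaryDoc: string | null): Promise<{ success: boolean }> {
    // Domain 1: primary-doc resolution. Record and swallow.
    let url: string | null = null;
    try {
      url = this.resolvePrimaryDoc(accession, primaryDoc);
    } catch (err) {
      this.deadLetters.push({ accession, code: "PRIMARY_DOC_UNRESOLVED", message: errorMessage(err) });
    }
    if (url === null) return { success: false };

    // Domain 2: body fetch. Record and swallow.
    let body: string | null = null;
    try {
      body = await this.runFetch(url);
    } catch (err) {
      this.deadLetters.push({ accession, code: "FETCH_ERROR", message: errorMessage(err) });
    }
    if (body === null) return { success: false };

    // Domain 3: parse/store. Errors propagate to the caller
    // (the real task also records them before rethrowing).
    this.parseAndStore(body);
    return { success: true };
  }
}
```

A test can subclass the sketch and override runFetch() to throw, then check that process() resolves with success: false and that a FETCH_ERROR entry was recorded, all without touching the network.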

Stream 2 — SPAC classification + sponsor entities (+ DRS)

  • Shared registration parser: Form_S_1.parse() and Form_DRS.parse() parse the full-submission .txt <SEC-HEADER> (SIC/CIK/name/date) and select the primary <DOCUMENT> body. DRS and DRS/A ride the same S-1 extractor; registration forms are fetched as .txt. DRSLTR is excluded because it is correspondence.
  • Deterministic SPAC classification: S1ClassificationRepo records is_spac = (header SIC === 6770) with a classifier_source seam (sgml-header | sic-unknown) for a future AI classifier. The header SIC is immutable/point-in-time, so a de-SPAC'd company still reads as a SPAC from its original filing.
  • Two-tier sponsors: the legal sponsor rides the existing company tier (provenance-tagged); the common name resolves to a new first-class CanonicalSponsorFamily tier (own sponsor-family resolver kind, bootstrapped at 1.0.0, + alias table). SponsorFamilyMembership (legal-sponsor ↔ family) and SpacSponsorLink (issuer CIK → sponsor → family) yield the durable cross-SPAC identifier ("all SPACs backed by X"), including distinct legal sponsors that share a common name.
  • CLI: sec canonical sponsor-family alias|alias-list and an alias-aware sec spac by-family <name>.
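
As a rough illustration of the header parse and the deterministic classification rule, here is a minimal sketch. It assumes the human-readable `LABEL: value` layout of the full-submission header, with the SIC code in square brackets. The function names, the HeaderFacts shape, and the regular expressions are illustrative assumptions; only the `SIC === 6770` rule and the two classifier_source values come from the description above.

```typescript
interface HeaderFacts {
  sic: number | null;
  cik: string | null;
  name: string | null;
}

// Pull a few fields out of the <SEC-HEADER> block of a full-submission .txt.
function parseSecHeader(submission: string): HeaderFacts {
  const block = /<SEC-HEADER>([\s\S]*?)<\/SEC-HEADER>/.exec(submission);
  const header = block?.[1] ?? "";

  // Read one "LABEL: value" line; tabs and spaces around the value are ignored.
  const field = (label: string): string | null => {
    const hit = new RegExp(`^[ \\t]*${label}:[ \\t]*(.+?)[ \\t]*$`, "m").exec(header);
    return hit?.[1] ?? null;
  };

  // Assumed layout: "STANDARD INDUSTRIAL CLASSIFICATION: BLANK CHECKS [6770]".
  const sicText = field("STANDARD INDUSTRIAL CLASSIFICATION");
  const sicMatch = sicText === null ? null : /\[(\d{4})\]/.exec(sicText);
  const sicDigits = sicMatch?.[1];

  return {
    sic: sicDigits === undefined ? null : Number(sicDigits),
    cik: field("CENTRAL INDEX KEY"),
    name: field("COMPANY CONFORMED NAME"),
  };
}

const SPAC_SIC = 6770; // blank-check companies

// Deterministic rule: SPAC iff the header SIC is 6770.
function classifyFromHeader(sic: number | null): {
  is_spac: boolean;
  classifier_source: "sgml-header" | "sic-unknown";
} {
  return {
    is_spac: sic === SPAC_SIC,
    classifier_source: sic === null ? "sic-unknown" : "sgml-header",
  };
}
```

Because the rule reads only the filing's own header, re-running it on an old filing gives the same answer even after the company has de-SPAC'd, which matches the point-in-time behavior described above.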

Notes

  • Blank-DB rollout: no extractor version bump; the sponsor-family resolver kind is bootstrapped fresh.
  • TDD throughout; every new repo registered in DefaultDI + TestingDI + setupAllDatabases. 1,014 tests pass, tsc clean.
  • Includes fixes from a deep review pass (degenerate-sponsor-row handling, safe error extraction, composite indexes, indexed alias lookup, parser hardening).

Specs & plans: workglow-dev/prd companion PR.

https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh


Generated by Claude Code

claude added 20 commits June 4, 2026 19:03
…forms as .txt

- Form_S_1 and Form_DRS now delegate to parseRegistrationSubmission()
- FORM_TO_EXTRACTOR_ID maps DRS and DRS/A to the S-1 extractor
- ProcessAccessionDocFormTask fetches registration prospectus forms (S-1/DRS family) as full-submission .txt so the SGML header and DOCUMENT selection work correctly
- switch block extended to route DRS/DRS/A through processFormS1
- Tests updated: Form_S_1.test checks header+html shape; dead-letter test uses form D (not S-1) for PRIMARY_DOC_UNRESOLVED path; s1 end-to-end test wraps HTML in DOCUMENT envelope and fetches as .txt

Creates src/storage/classification/ with S1ClassificationSchema, S1ClassificationRepo,
and a passing unit test. Registers the token and in-memory/SQLite backends in all
three DI files (DefaultDI, TestingDI, setupAllDatabases).

Inserts a filing-level S1Classification row in processFormS1 immediately
after the issuer observeCompany call. headerSic and isSpac are declared at
function-body scope so subsequent tasks can read them.

…ily' query

Adds src/commands/sponsorFamily.ts with spacIssuersByFamilyName (alias-aware
union across target + all variant family ids) and registerSponsorFamilyCommands
(sec canonical sponsor-family alias|alias-list, sec spac by-family). Wired into
src/commands/index.ts after addCanonicalCommands so the canonical subcommand is
already registered when the sponsor-family subgroup is attached.

https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
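
The "alias-aware union across target + all variant family ids" described in this commit can be sketched as below. This is an in-memory illustration under assumptions: the SponsorLink shape, the function name, and the array-based lookups are hypothetical stand-ins for the real repos, and aliases are taken to be single-hop (the alias repo test in this PR mentions chain rejection).

```typescript
interface FamilyAlias {
  alias_canonical_id: string;
  target_canonical_id: string;
}

// Hypothetical flattened view of a SpacSponsorLink row.
interface SponsorLink {
  issuer_cik: string;
  family_id: string;
}

function spacIssuersByFamilyId(
  familyId: string,
  aliases: FamilyAlias[],
  links: SponsorLink[],
): string[] {
  // Follow the alias, if any, to the surviving target family.
  const target =
    aliases.find((a) => a.alias_canonical_id === familyId)?.target_canonical_id ?? familyId;

  // Union the target with every variant that was merged into it.
  const familyIds = new Set<string>([target]);
  for (const a of aliases) {
    if (a.target_canonical_id === target) familyIds.add(a.alias_canonical_id);
  }

  // Collect distinct issuer CIKs linked to any id in the union.
  const issuers = new Set<string>();
  for (const link of links) {
    if (familyIds.has(link.family_id)) issuers.add(link.issuer_cik);
  }
  return Array.from(issuers).sort();
}
```

Querying by either the alias id or the target id returns the same issuer list, which is the property the by-family command relies on after two families have been merged.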
… review

- skip blank-named sponsor rows instead of throwing in the resolver and
  aborting the whole section; dead-letter only when none are usable
- sponsor catch uses safe message extraction (non-Error throws can't escape)
- composite indexes for canonical_sponsor_family + sponsor_family_membership
  (match the company tier and the actual (resolver_version, name) lookups)
- by-family query uses indexed listByTarget instead of scanning all aliases
- parser no-DOCUMENT fallback strips past </SEC-HEADER>
- tests: blank-sponsor skip; registration form + null primary_doc -> FETCH_ERROR

https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
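
The parser item above ("no-DOCUMENT fallback strips past </SEC-HEADER>") can be illustrated with a small sketch. It assumes the primary document is the first <DOCUMENT> block and that its body sits inside <TEXT>…</TEXT>; the function name and that selection rule are assumptions, not the repo's actual parseRegistrationSubmission logic.

```typescript
function selectPrimaryBody(submission: string): string {
  // Normal case: take the first <DOCUMENT> envelope.
  const doc = /<DOCUMENT>([\s\S]*?)<\/DOCUMENT>/.exec(submission);
  if (doc) {
    const block = doc[1] ?? "";
    // Prefer the <TEXT> payload; fall back to the whole block if it is missing.
    const text = /<TEXT>([\s\S]*?)<\/TEXT>/.exec(block);
    return (text?.[1] ?? block).trim();
  }

  // Fallback: no <DOCUMENT> envelope. Drop everything up to and including
  // </SEC-HEADER> so header lines are never parsed as prospectus text.
  const marker = "</SEC-HEADER>";
  const end = submission.indexOf(marker);
  return (end >= 0 ? submission.slice(end + marker.length) : submission).trim();
}
```

Without the fallback, a submission that lacks the envelope would hand the header block to the section extractors along with the body.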

Copilot AI left a comment


Pull request overview

Adds two new capabilities on top of the existing S-1 extractor pipeline: (1) fetch-layer dead-lettering so batch runs can continue past filing-level fetch failures, and (2) registration-prospectus enhancements including SGML <SEC-HEADER> parsing (S-1 + DRS), deterministic SPAC classification, and sponsor-family canonical entities + CLI queries.

Changes:

  • Split ProcessAccessionDocFormTask failure domains into primary-doc resolution vs body fetch (dead-letter + swallow) vs parse/store (record + rethrow), and fetch registration-prospectus forms as full-submission .txt.
  • Add SGML-header parsing + primary <DOCUMENT> selection shared by S-1 and DRS; map DRS/DRS-A to the S-1 extractor.
  • Introduce SPAC sponsor-family canonical tier (families, aliases, memberships, issuer links), deterministic SIC-based SPAC classification storage, and CLI commands for aliasing and “SPACs by sponsor family”.

Reviewed changes

Copilot reviewed 46 out of 46 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/task/forms/ProcessAccessionDocFormTask.ts | Fetch-layer dead-lettering + .txt fetch for S-1/DRS |
| src/task/forms/ProcessAccessionDocFormTask.s1.test.ts | Update S-1 task test fixture for .txt submission |
| src/task/forms/ProcessAccessionDocFormTask.deadletter.test.ts | New tests for primary-doc/fetch failure dead-lettering |
| src/storage/versioning/extractorIds.ts | Map DRS/DRS-A to S-1 extractor |
| src/storage/versioning/extractorIds.test.ts | Tests for DRS mapping + DRSLTR exclusion |
| src/storage/versioning/componentRegistry.test.ts | Assert sponsor-family resolver registration |
| src/storage/versioning/bootstrapComponentVersions.test.ts | Seed sponsor-family resolver version 1.0.0 |
| src/storage/classification/S1ClassificationSchema.ts | New table schema for filing-level SPAC classification |
| src/storage/classification/S1ClassificationRepo.ts | Repo wrapper for classification storage |
| src/storage/classification/S1ClassificationRepo.test.ts | Repo save/get test coverage |
| src/storage/canonical/SponsorFamilyMembershipSchema.ts | Schema for sponsor-company ↔ family membership |
| src/storage/canonical/SponsorFamilyMembershipRepo.ts | Repo for membership record/list/delete |
| src/storage/canonical/SponsorFamilyMembershipRepo.test.ts | Membership repo idempotency + list test |
| src/storage/canonical/SpacSponsorLinkSchema.ts | Schema for issuer → sponsor → family per filing |
| src/storage/canonical/SpacSponsorLinkRepo.ts | Repo for link save/clear/list-by-family |
| src/storage/canonical/SpacSponsorLinkRepo.test.ts | Link repo save/clear/list tests |
| src/storage/canonical/CanonicalSponsorFamilySchema.ts | Canonical sponsor-family entity schema |
| src/storage/canonical/CanonicalSponsorFamilyRepo.ts | Repo for family create/find/list/delete |
| src/storage/canonical/CanonicalSponsorFamilyRepo.test.ts | Repo create/find test coverage |
| src/storage/canonical/CanonicalSponsorFamilyAliasRepo.ts | Alias repo for sponsor-family merges + target lookups |
| src/storage/canonical/CanonicalSponsorFamilyAliasRepo.test.ts | Alias resolve + chain rejection test |
| src/storage/canonical/CanonicalAliasSchemas.ts | Add sponsor-family alias schema + token |
| src/sec/html/mock_data/s1/SOURCES.md | Document fixture scope (S-1/DRS, .htm + synthetic .txt) |
| src/sec/html/mock_data/s1/drs_1848507_000119312521066104.txt | Synthetic DRS full-submission fixture |
| src/sec/forms/registration-statements/s1/spacSponsorSchema.ts | Structured-output schema for sponsor extraction |
| src/sec/forms/registration-statements/s1/spacSponsor.e2e.test.ts | End-to-end SPAC sponsor → family → issuer tests |
| src/sec/forms/registration-statements/s1/sectionExtractors.ts | Add sponsor extractor prompt + structured call |
| src/sec/forms/registration-statements/s1/sectionExtractors.test.ts | Unit test for sponsor extractor wiring |
| src/sec/forms/registration-statements/s1/parseSubmission.ts | Parse <SEC-HEADER> + primary <DOCUMENT> selection |
| src/sec/forms/registration-statements/s1/parseSubmission.test.ts | Tests for header parsing + DRS document selection |
| src/sec/forms/registration-statements/s1/drsFixture.test.ts | Fixture-based DRS .txt parsing test |
| src/sec/forms/registration-statements/Form_S_1.ts | S-1 parse now uses full-submission parser |
| src/sec/forms/registration-statements/Form_S_1.test.ts | Update S-1 parse test expectations (header + body) |
| src/sec/forms/registration-statements/Form_S_1.storage.ts | Persist SIC classification; resolve sponsors/families/links |
| src/sec/forms/registration-statements/Form_S_1.storage.test.ts | Update tests for new parsed shape + classification |
| src/sec/forms/registration-statements/Form_DRS.ts | DRS parse uses shared submission parser; doc updates |
| src/resolver/SponsorFamilyResolver.ts | New resolver for sponsor-family IDs + alias following |
| src/resolver/SponsorFamilyResolver.test.ts | Sponsor-family resolution + alias-following tests |
| src/resolver/resolverIds.ts | Add sponsor-family to resolver id registry |
| src/resolver/resolverIds.test.ts | Update resolverIds test |
| src/config/TestingDI.ts | Register new storages/tokens for tests (classification/family/link) |
| src/config/setupAllDatabases.ts | Ensure new tables are setup in DB bootstrap |
| src/config/DefaultDI.ts | Register persistent storages for new schemas |
| src/commands/sponsorFamily.ts | CLI: sponsor-family alias/list + SPAC by-family query |
| src/commands/sponsorFamily.test.ts | Tests for alias-aware by-family query helper |
| src/commands/index.ts | Register sponsor-family command group |


Comment on lines +310 to +329
if (parseError === undefined) {
  try {
    await runRepo.recordRun({
      cik: cik!,
      accession_number: accessionNumber,
      form: form!,
      extractor_id: extractorId,
      extractor_version: extractorVersion,
      slot_at_run: slotAtRun,
      success: true,
      error: null,
    });
  } catch (recordErr) {
    console.error(
      `Failed to record extractor_runs row for ${cik}/${accessionNumber}@${extractorId}:${extractorVersion}:`,
      recordErr
    );
  }
  return { success: true };
}
Comment on lines +19 to +25
/**
 * The single source of truth for the sponsor-family natural key. Strips
 * suffixes / collapses whitespace via {@link normalizeCompanyName}, then
 * upper-cases so matching is case-insensitive. Every caller that looks up a
 * family by name (resolver, CLI query, alias commands) MUST use this so keys
 * line up. Returns "" when the name normalizes to nothing.
 */
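
A minimal sketch of the natural-key idea in this doc comment, under stated assumptions: the stand-in below only trims and collapses whitespace, whereas the repo's real normalizeCompanyName may do more (a later commit in this PR corrects the suffix-stripping wording). Both function names here are hypothetical.

```typescript
// Stand-in only: NOT the repo's normalizeCompanyName.
function normalizeCompanyNameStandIn(name: string): string {
  return name.trim().replace(/\s+/g, " ");
}

// One shared key function so the resolver, CLI query, and alias commands agree.
function sponsorFamilyKeySketch(name: string): string {
  return normalizeCompanyNameStandIn(name).toUpperCase();
}
```

Routing every lookup through a single key function is what keeps a name typed at the CLI lined up with the key the resolver stored.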
Comment on lines +81 to +87
export const CanonicalSponsorFamilyAliasSchema = Type.Object({
  alias_canonical_id: Type.String({ maxLength: 36 }),
  target_canonical_id: Type.String({ maxLength: 36 }),
  reason: TypeNullable(Type.String({ maxLength: 1024 })),
  created_at: Type.String(),
  created_by: TypeNullable(Type.String({ maxLength: 128 })),
});
Comment on lines +18 to +24
export const CanonicalSponsorFamilySchema = Type.Object({
  canonical_sponsor_family_id: Type.String({ maxLength: 36, description: "UUID v4" }),
  resolver_version: Type.String({ maxLength: 32 }),
  display_name: TypeNullable(Type.String({ maxLength: 512 })),
  normalized_name: Type.String({ maxLength: 512 }),
  created_at: Type.String(),
});
Comment on lines +14 to +23
export const S1ClassificationSchema = Type.Object({
  extractor_id: Type.String({ maxLength: 16 }),
  accession_number: Type.String({ maxLength: 25 }),
  cik: TypeNullable(TypeSecCik()),
  sic: TypeNullable(Type.Integer()),
  sic_description: TypeNullable(Type.String({ maxLength: 256 })),
  is_spac: Type.Boolean(),
  classifier_source: Type.String({ maxLength: 32, description: "sgml-header | sic-unknown | ai" }),
  created_at: Type.String(),
});

- resolve filing-level fetch dead-letters (section_name "") on a subsequent
  successful run so the retry sweep doesn't reprocess fixed filings after a bump
  (+ test); clean up the fake model in the s1 task afterEach so multiple tests
  in the file don't collide
- add ISO-8601/UUID/semver field descriptions to S1Classification,
  CanonicalSponsorFamily, and CanonicalSponsorFamilyAlias schemas for parity
  with the person/company tables
- correct the normalizeSponsorFamilyName doc comment's suffix-stripping wording

https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
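
The first item in this commit (resolving filing-level dead letters once a later run succeeds) might look roughly like the sketch below. Only the convention that section_name "" marks a filing-level entry comes from the commit message; the row shape, the resolved flag, and both function names are assumptions for illustration.

```typescript
// Hypothetical dead-letter row; the real table likely has more columns.
interface DeadLetterRow {
  accession_number: string;
  section_name: string; // "" marks a filing-level entry
  code: string;
  resolved: boolean;
}

const FILING_LEVEL = "";

// Called after a successful run: clear filing-level entries for that filing.
function resolveFilingLevelDeadLetters(rows: DeadLetterRow[], accession: string): number {
  let resolvedCount = 0;
  for (const row of rows) {
    if (
      row.accession_number === accession &&
      row.section_name === FILING_LEVEL &&
      !row.resolved
    ) {
      row.resolved = true;
      resolvedCount++;
    }
  }
  return resolvedCount;
}

// The retry sweep only considers filings that still have unresolved entries.
function retrySweepCandidates(rows: DeadLetterRow[]): string[] {
  const pending = rows.filter((r) => !r.resolved).map((r) => r.accession_number);
  return Array.from(new Set(pending)).sort();
}
```

Section-level entries (non-empty section_name) are deliberately left alone here, since the commit message only describes the filing-level case.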
…S-1 extractor

Foreign private issuers file F-1 (the S-1 equivalent) with the same prospectus
structure, so they ride the shared registration-submission parser and the S-1
extraction pipeline:
- map F-1, F-1/A, F-1MEF -> S-1 extractor; add to REGISTRATION_PROSPECTUS_FORMS
  (.txt fetch) and the dispatch switch
- Form_F_1 / Form_F_1MEF parse() delegate to parseRegistrationSubmission
- synthetic foreign-issuer F-1 .txt fixture (Cayman SPAC, SIC 6770) + tests;
  the foreign SEC-HEADER still carries ASSIGNED-SIC + CIK, so the parser and SPAC
  classification work unchanged

https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
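
The form routing described across these commits can be sketched as a lookup plus a fetch-mode check. The form names come from the commit messages; the extractor id value, the `Sketch`-suffixed constants, and the two helper functions are assumptions, and the real FORM_TO_EXTRACTOR_ID and REGISTRATION_PROSPECTUS_FORMS may contain more entries.

```typescript
const S1_EXTRACTOR_ID_SKETCH = "s1"; // hypothetical id value

// Forms named in this PR that ride the S-1 extractor.
const FORM_TO_EXTRACTOR_SKETCH = new Map<string, string>([
  ["S-1", S1_EXTRACTOR_ID_SKETCH],
  ["DRS", S1_EXTRACTOR_ID_SKETCH],
  ["DRS/A", S1_EXTRACTOR_ID_SKETCH],
  ["F-1", S1_EXTRACTOR_ID_SKETCH],
  ["F-1/A", S1_EXTRACTOR_ID_SKETCH],
  ["F-1MEF", S1_EXTRACTOR_ID_SKETCH],
]);

// Registration prospectus forms are fetched as the full-submission .txt
// so the SGML header and DOCUMENT selection are available to the parser.
const REGISTRATION_FORMS_SKETCH = new Set<string>([
  "S-1", "DRS", "DRS/A", "F-1", "F-1/A", "F-1MEF",
]);

function extractorForForm(form: string): string | null {
  // Unmapped forms (for example DRSLTR, which is correspondence) return null.
  return FORM_TO_EXTRACTOR_SKETCH.get(form) ?? null;
}

function fetchModeForForm(form: string): "full-submission-txt" | "primary-document" {
  return REGISTRATION_FORMS_SKETCH.has(form) ? "full-submission-txt" : "primary-document";
}
```

Keeping DRSLTR out of the map, rather than special-casing it in the dispatcher, means the exclusion is covered by the same lookup test as the inclusions.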
@sroussey sroussey merged commit da16031 into main Jun 5, 2026
1 check passed
sroussey pushed a commit that referenced this pull request Jun 5, 2026
@sroussey sroussey deleted the claude/relaxed-shannon-3cdWq branch June 5, 2026 18:10