S-1 SPAC classification + sponsor entities, DRS support, and fetch-layer dead-lettering #124
Merged
…VED, FETCH_ERROR, parse rethrow)
…forms as .txt
- Form_S_1 and Form_DRS now delegate to parseRegistrationSubmission()
- FORM_TO_EXTRACTOR_ID maps DRS and DRS/A to the S-1 extractor
- ProcessAccessionDocFormTask fetches registration prospectus forms (S-1/DRS family) as full-submission .txt so the SGML header and DOCUMENT selection work correctly
- switch block extended to route DRS and DRS/A through processFormS1
- Tests updated: Form_S_1.test checks header+html shape; dead-letter test uses form D (not S-1) for PRIMARY_DOC_UNRESOLVED path; s1 end-to-end test wraps HTML in DOCUMENT envelope and fetches as .txt
Creates src/storage/classification/ with S1ClassificationSchema, S1ClassificationRepo, and a passing unit test. Registers the token and in-memory/SQLite backends in all three DI files (DefaultDI, TestingDI, setupAllDatabases).
Inserts a filing-level S1Classification row in processFormS1 immediately after the issuer observeCompany call. headerSic and isSpac are declared at function-body scope so subsequent tasks can read them.
…ily' query
Adds src/commands/sponsorFamily.ts with spacIssuersByFamilyName (alias-aware union across target + all variant family ids) and registerSponsorFamilyCommands (sec canonical sponsor-family alias|alias-list, sec spac by-family). Wired into src/commands/index.ts after addCanonicalCommands so the canonical subcommand is already registered when the sponsor-family subgroup is attached.
https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
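The alias-aware union mentioned in this commit can be sketched as below. This is only an illustration under assumptions: `FamilyLinkSketch`, `spacIssuersByFamilySketch`, and the in-memory `aliasToTarget` map are hypothetical stand-ins, and the sketch starts from a family id rather than a name, skipping the name lookup that the real `spacIssuersByFamilyName` would perform.

```typescript
// Hypothetical row shape (not from the PR): one issuer -> family link per filing.
interface FamilyLinkSketch {
  issuerCik: string;
  familyId: string;
}

function spacIssuersByFamilySketch(
  familyId: string,
  aliasToTarget: ReadonlyMap<string, string>, // alias family id -> target family id
  links: readonly FamilyLinkSketch[]
): string[] {
  // Follow the alias (single hop) so a variant id resolves to its merge target.
  const target = aliasToTarget.get(familyId) ?? familyId;

  // Union of the target and every variant id that points at it.
  const familyIds = new Set<string>([target]);
  aliasToTarget.forEach((aliasTarget, aliasId) => {
    if (aliasTarget === target) familyIds.add(aliasId);
  });

  // De-duplicate issuers that appear under more than one variant id.
  const issuers = new Set<string>();
  for (const link of links) {
    if (familyIds.has(link.familyId)) issuers.add(link.issuerCik);
  }
  return Array.from(issuers).sort();
}
```

Querying by either the alias id or the target id returns the same issuer set, which is the property the commit describes as "alias-aware".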
… review
- skip blank-named sponsor rows instead of throwing in the resolver and aborting the whole section; dead-letter only when none are usable
- sponsor catch uses safe message extraction (non-Error throws can't escape)
- composite indexes for canonical_sponsor_family + sponsor_family_membership (match the company tier and the actual (resolver_version, name) lookups)
- by-family query uses indexed listByTarget instead of scanning all aliases
- parser no-DOCUMENT fallback strips past </SEC-HEADER>
- tests: blank-sponsor skip; registration form + null primary_doc -> FETCH_ERROR
https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
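The first two items of this commit (skip blank-named sponsors, and extract a message safely from any thrown value) can be sketched as follows. The names `SponsorRowSketch`, `selectUsableSponsors`, and `safeMessage` are hypothetical. The dead-letter condition shown here (rows were returned but none had a usable name) is one reading of "dead-letter only when none are usable", not the confirmed behaviour.

```typescript
// Hypothetical shape of one extracted sponsor row.
interface SponsorRowSketch {
  name: string | null;
}

// Non-Error throws (strings, plain objects) still yield a usable message.
function safeMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

function selectUsableSponsors(rows: readonly SponsorRowSketch[]): {
  usable: SponsorRowSketch[];
  deadLetter: boolean;
} {
  // Blank or whitespace-only names are skipped instead of aborting the section.
  const usable = rows.filter((row) => (row.name ?? "").trim() !== "");
  // Assumption: dead-letter only when rows existed but none were usable.
  return { usable, deadLetter: rows.length > 0 && usable.length === 0 };
}
```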
Pull request overview
Adds two new capabilities on top of the existing S-1 extractor pipeline: (1) fetch-layer dead-lettering so batch runs can continue past filing-level fetch failures, and (2) registration-prospectus enhancements including SGML <SEC-HEADER> parsing (S-1 + DRS), deterministic SPAC classification, and sponsor-family canonical entities + CLI queries.
Changes:
- Split ProcessAccessionDocFormTask failure domains into primary-doc resolution vs body fetch (dead-letter + swallow) vs parse/store (record + rethrow), and fetch registration-prospectus forms as full-submission .txt.
- Add SGML-header parsing + primary <DOCUMENT> selection shared by S-1 and DRS; map DRS and DRS/A to the S-1 extractor.
- Introduce SPAC sponsor-family canonical tier (families, aliases, memberships, issuer links), deterministic SIC-based SPAC classification storage, and CLI commands for aliasing and “SPACs by sponsor family”.
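The failure-domain split in the first bullet can be sketched roughly as below. This is a minimal illustration, not the task's code: `FilingStepsSketch`, `processFilingSketch`, `describeError`, and the `DeadLetterSketch` shape are hypothetical, and the steps are synchronous only to keep the sketch short. The two error codes are the ones named in this PR.

```typescript
type DeadLetterCode = "PRIMARY_DOC_UNRESOLVED" | "FETCH_ERROR";

// Hypothetical record shape; the real dead-letter row is not shown on this page.
interface DeadLetterSketch {
  accessionNumber: string;
  code: DeadLetterCode;
  message: string;
}

// Hypothetical seams standing in for the task's resolve / fetch / parse+store steps.
interface FilingStepsSketch {
  resolvePrimaryDoc(accessionNumber: string): string;
  fetchBody(url: string): string;
  parseAndStore(body: string): void;
}

function describeError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

function processFilingSketch(
  accessionNumber: string,
  steps: FilingStepsSketch,
  deadLetters: DeadLetterSketch[]
): { success: boolean } {
  let url: string;
  try {
    url = steps.resolvePrimaryDoc(accessionNumber);
  } catch (err) {
    // Domain 1: record and swallow so the batch keeps going.
    deadLetters.push({ accessionNumber, code: "PRIMARY_DOC_UNRESOLVED", message: describeError(err) });
    return { success: false };
  }

  let body: string;
  try {
    body = steps.fetchBody(url);
  } catch (err) {
    // Domain 2: same treatment, different code.
    deadLetters.push({ accessionNumber, code: "FETCH_ERROR", message: describeError(err) });
    return { success: false };
  }

  // Domain 3: parse/store errors are not swallowed; they propagate to the caller.
  steps.parseAndStore(body);
  return { success: true };
}
```

Only the first two domains produce dead-letter entries; a parse or store failure still surfaces as a thrown error, matching the "record + rethrow" wording above (the recording half is omitted here).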
Reviewed changes
Copilot reviewed 46 out of 46 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/task/forms/ProcessAccessionDocFormTask.ts | Fetch-layer dead-lettering + .txt fetch for S-1/DRS |
| src/task/forms/ProcessAccessionDocFormTask.s1.test.ts | Update S-1 task test fixture for .txt submission |
| src/task/forms/ProcessAccessionDocFormTask.deadletter.test.ts | New tests for primary-doc/fetch failure dead-lettering |
| src/storage/versioning/extractorIds.ts | Map DRS and DRS/A to S-1 extractor |
| src/storage/versioning/extractorIds.test.ts | Tests for DRS mapping + DRSLTR exclusion |
| src/storage/versioning/componentRegistry.test.ts | Assert sponsor-family resolver registration |
| src/storage/versioning/bootstrapComponentVersions.test.ts | Seed sponsor-family resolver version 1.0.0 |
| src/storage/classification/S1ClassificationSchema.ts | New table schema for filing-level SPAC classification |
| src/storage/classification/S1ClassificationRepo.ts | Repo wrapper for classification storage |
| src/storage/classification/S1ClassificationRepo.test.ts | Repo save/get test coverage |
| src/storage/canonical/SponsorFamilyMembershipSchema.ts | Schema for sponsor-company ↔ family membership |
| src/storage/canonical/SponsorFamilyMembershipRepo.ts | Repo for membership record/list/delete |
| src/storage/canonical/SponsorFamilyMembershipRepo.test.ts | Membership repo idempotency + list test |
| src/storage/canonical/SpacSponsorLinkSchema.ts | Schema for issuer → sponsor → family per filing |
| src/storage/canonical/SpacSponsorLinkRepo.ts | Repo for link save/clear/list-by-family |
| src/storage/canonical/SpacSponsorLinkRepo.test.ts | Link repo save/clear/list tests |
| src/storage/canonical/CanonicalSponsorFamilySchema.ts | Canonical sponsor-family entity schema |
| src/storage/canonical/CanonicalSponsorFamilyRepo.ts | Repo for family create/find/list/delete |
| src/storage/canonical/CanonicalSponsorFamilyRepo.test.ts | Repo create/find test coverage |
| src/storage/canonical/CanonicalSponsorFamilyAliasRepo.ts | Alias repo for sponsor-family merges + target lookups |
| src/storage/canonical/CanonicalSponsorFamilyAliasRepo.test.ts | Alias resolve + chain rejection test |
| src/storage/canonical/CanonicalAliasSchemas.ts | Add sponsor-family alias schema + token |
| src/sec/html/mock_data/s1/SOURCES.md | Document fixture scope (S-1/DRS, .htm + synthetic .txt) |
| src/sec/html/mock_data/s1/drs_1848507_000119312521066104.txt | Synthetic DRS full-submission fixture |
| src/sec/forms/registration-statements/s1/spacSponsorSchema.ts | Structured-output schema for sponsor extraction |
| src/sec/forms/registration-statements/s1/spacSponsor.e2e.test.ts | End-to-end SPAC sponsor → family → issuer tests |
| src/sec/forms/registration-statements/s1/sectionExtractors.ts | Add sponsor extractor prompt + structured call |
| src/sec/forms/registration-statements/s1/sectionExtractors.test.ts | Unit test for sponsor extractor wiring |
| src/sec/forms/registration-statements/s1/parseSubmission.ts | Parse <SEC-HEADER> + primary <DOCUMENT> selection |
| src/sec/forms/registration-statements/s1/parseSubmission.test.ts | Tests for header parsing + DRS document selection |
| src/sec/forms/registration-statements/s1/drsFixture.test.ts | Fixture-based DRS .txt parsing test |
| src/sec/forms/registration-statements/Form_S_1.ts | S-1 parse now uses full-submission parser |
| src/sec/forms/registration-statements/Form_S_1.test.ts | Update S-1 parse test expectations (header + body) |
| src/sec/forms/registration-statements/Form_S_1.storage.ts | Persist SIC classification; resolve sponsors/families/links |
| src/sec/forms/registration-statements/Form_S_1.storage.test.ts | Update tests for new parsed shape + classification |
| src/sec/forms/registration-statements/Form_DRS.ts | DRS parse uses shared submission parser; doc updates |
| src/resolver/SponsorFamilyResolver.ts | New resolver for sponsor-family IDs + alias following |
| src/resolver/SponsorFamilyResolver.test.ts | Sponsor-family resolution + alias-following tests |
| src/resolver/resolverIds.ts | Add sponsor-family to resolver id registry |
| src/resolver/resolverIds.test.ts | Update resolverIds test |
| src/config/TestingDI.ts | Register new storages/tokens for tests (classification/family/link) |
| src/config/setupAllDatabases.ts | Ensure new tables are setup in DB bootstrap |
| src/config/DefaultDI.ts | Register persistent storages for new schemas |
| src/commands/sponsorFamily.ts | CLI: sponsor-family alias/list + SPAC by-family query |
| src/commands/sponsorFamily.test.ts | Tests for alias-aware by-family query helper |
| src/commands/index.ts | Register sponsor-family command group |
Comment on lines +310 to +329
    if (parseError === undefined) {
      try {
        await runRepo.recordRun({
          cik: cik!,
          accession_number: accessionNumber,
          form: form!,
          extractor_id: extractorId,
          extractor_version: extractorVersion,
          slot_at_run: slotAtRun,
          success: true,
          error: null,
        });
      } catch (recordErr) {
        console.error(
          `Failed to record extractor_runs row for ${cik}/${accessionNumber}@${extractorId}:${extractorVersion}:`,
          recordErr
        );
      }
      return { success: true };
    }
Comment on lines +19 to +25
    /**
     * The single source of truth for the sponsor-family natural key. Strips
     * suffixes / collapses whitespace via {@link normalizeCompanyName}, then
     * upper-cases so matching is case-insensitive. Every caller that looks up a
     * family by name (resolver, CLI query, alias commands) MUST use this so keys
     * line up. Returns "" when the name normalizes to nothing.
     */
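A function matching this doc comment might look like the sketch below. The real `normalizeCompanyName` is not shown on this page, so `normalizeCompanyNameSketch` is a hypothetical stand-in with assumed rules (trim, collapse whitespace, drop one trailing corporate suffix). A later commit in this PR corrects the comment's suffix-stripping wording, so the actual suffix handling may differ.

```typescript
// Assumed suffix list for illustration only; not taken from the repository.
const TRAILING_SUFFIX = /[\s,]+(LLC|L\.L\.C\.|INC\.?|CORP\.?|LTD\.?|LP|L\.P\.)$/i;

// Hypothetical stand-in for normalizeCompanyName.
function normalizeCompanyNameSketch(name: string): string {
  const collapsed = name.trim().replace(/\s+/g, " ");
  return collapsed.replace(TRAILING_SUFFIX, "").trim();
}

// Natural key: normalized name, upper-cased so matching is case-insensitive.
// Returns "" when the input is empty or normalizes to nothing.
function normalizeSponsorFamilyNameSketch(name: string | null | undefined): string {
  if (!name) return "";
  return normalizeCompanyNameSketch(name).toUpperCase();
}
```

Routing every lookup through one function like this is what keeps the resolver, the CLI query, and the alias commands agreeing on the same key.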
Comment on lines +81 to +87
    export const CanonicalSponsorFamilyAliasSchema = Type.Object({
      alias_canonical_id: Type.String({ maxLength: 36 }),
      target_canonical_id: Type.String({ maxLength: 36 }),
      reason: TypeNullable(Type.String({ maxLength: 1024 })),
      created_at: Type.String(),
      created_by: TypeNullable(Type.String({ maxLength: 128 })),
    });
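How an alias table of this shape supports single-hop resolution can be sketched with an in-memory map. `AliasTableSketch` is hypothetical and is not the repository's `CanonicalSponsorFamilyAliasRepo`. The file summary mentions a chain-rejection test, but which direction(s) the real repo rejects is an assumption here; this sketch rejects both.

```typescript
// In-memory stand-in: alias_canonical_id -> target_canonical_id.
class AliasTableSketch {
  private readonly targetByAlias = new Map<string, string>();

  add(aliasId: string, targetId: string): void {
    if (aliasId === targetId) throw new Error("alias cannot target itself");
    // Reject chains so that resolve() never needs more than one hop.
    if (this.targetByAlias.has(targetId)) throw new Error("target is itself an alias");
    if (Array.from(this.targetByAlias.values()).includes(aliasId)) {
      throw new Error("alias id is already a merge target");
    }
    this.targetByAlias.set(aliasId, targetId);
  }

  // Returns the merge target for an alias, or the id itself when it is not aliased.
  resolve(id: string): string {
    return this.targetByAlias.get(id) ?? id;
  }

  // All variant ids that were merged into the given target.
  listByTarget(targetId: string): string[] {
    return Array.from(this.targetByAlias.entries())
      .filter(([, target]) => target === targetId)
      .map(([alias]) => alias);
  }
}
```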
Comment on lines +18 to +24
    export const CanonicalSponsorFamilySchema = Type.Object({
      canonical_sponsor_family_id: Type.String({ maxLength: 36, description: "UUID v4" }),
      resolver_version: Type.String({ maxLength: 32 }),
      display_name: TypeNullable(Type.String({ maxLength: 512 })),
      normalized_name: Type.String({ maxLength: 512 }),
      created_at: Type.String(),
    });
Comment on lines +14 to +23
    export const S1ClassificationSchema = Type.Object({
      extractor_id: Type.String({ maxLength: 16 }),
      accession_number: Type.String({ maxLength: 25 }),
      cik: TypeNullable(TypeSecCik()),
      sic: TypeNullable(Type.Integer()),
      sic_description: TypeNullable(Type.String({ maxLength: 256 })),
      is_spac: Type.Boolean(),
      classifier_source: Type.String({ maxLength: 32, description: "sgml-header | sic-unknown | ai" }),
      created_at: Type.String(),
    });
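The deterministic rule that fills `is_spac` and `classifier_source` can be sketched as below. `classifyFromHeaderSic` and `ClassificationSketch` are hypothetical names. Treating a missing SIC as `is_spac: false` with source `sic-unknown` is an assumption based on `is_spac` being a non-nullable boolean; the `ai` source is left out because no AI classifier exists yet.

```typescript
const SPAC_SIC = 6770; // SIC 6770 is the "blank check" industry code

type ClassifierSource = "sgml-header" | "sic-unknown";

// Subset of the schema's columns that the rule decides.
interface ClassificationSketch {
  sic: number | null;
  is_spac: boolean;
  classifier_source: ClassifierSource;
}

function classifyFromHeaderSic(headerSic: number | null): ClassificationSketch {
  if (headerSic === null) {
    // Assumption: no SIC in the header means "not known to be a SPAC".
    return { sic: null, is_spac: false, classifier_source: "sic-unknown" };
  }
  return {
    sic: headerSic,
    is_spac: headerSic === SPAC_SIC,
    classifier_source: "sgml-header",
  };
}
```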
- resolve filing-level fetch dead-letters (section_name "") on a subsequent successful run so the retry sweep doesn't reprocess fixed filings after a bump (+ test); clean up the fake model in the s1 task afterEach so multiple tests in the file don't collide
- add ISO-8601/UUID/semver field descriptions to S1Classification, CanonicalSponsorFamily, and CanonicalSponsorFamilyAlias schemas for parity with the person/company tables
- correct the normalizeSponsorFamilyName doc comment's suffix-stripping wording
https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
…S-1 extractor
Foreign private issuers file F-1 (the S-1 equivalent) with the same prospectus structure, so they ride the shared registration-submission parser and the S-1 extraction pipeline:
- map F-1, F-1/A, F-1MEF -> S-1 extractor; add to REGISTRATION_PROSPECTUS_FORMS (.txt fetch) and the dispatch switch
- Form_F_1 / Form_F_1MEF parse() delegate to parseRegistrationSubmission
- synthetic foreign-issuer F-1 .txt fixture (Cayman SPAC, SIC 6770) + tests; the foreign SEC-HEADER still carries ASSIGNED-SIC + CIK, so the parser and SPAC classification work unchanged
https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
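The form routing described across these commits can be sketched as a lookup table. The identifiers below are hypothetical, and the extractor id value is a placeholder because the real id string is not shown. In the PR, REGISTRATION_PROSPECTUS_FORMS is a separate list, whereas this sketch derives the ".txt fetch" decision from the same map for brevity. Only forms named on this page are listed.

```typescript
// Placeholder id; the real S-1 extractor id string does not appear on this page.
const S1_EXTRACTOR_ID_SKETCH = "S-1";

// Forms that this PR says ride the S-1 extractor. DRSLTR is deliberately absent
// because it is correspondence, not a prospectus.
const FORM_TO_EXTRACTOR_ID_SKETCH = new Map<string, string>([
  ["S-1", S1_EXTRACTOR_ID_SKETCH],
  ["DRS", S1_EXTRACTOR_ID_SKETCH],
  ["DRS/A", S1_EXTRACTOR_ID_SKETCH],
  ["F-1", S1_EXTRACTOR_ID_SKETCH],
  ["F-1/A", S1_EXTRACTOR_ID_SKETCH],
  ["F-1MEF", S1_EXTRACTOR_ID_SKETCH],
]);

function extractorIdForForm(form: string): string | undefined {
  return FORM_TO_EXTRACTOR_ID_SKETCH.get(form);
}

// Registration prospectus forms are fetched as the full-submission .txt so the
// SGML header is available to the parser.
function fetchesFullSubmissionTxt(form: string): boolean {
  return extractorIdForForm(form) === S1_EXTRACTOR_ID_SKETCH;
}
```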
sroussey pushed a commit that referenced this pull request on Jun 5, 2026
Implements two related streams on top of the S-1 extraction foundation (#123).
Stream 1: Fetch-layer dead-lettering
ProcessAccessionDocFormTask now splits three failure domains:
- primary-doc resolution → PRIMARY_DOC_UNRESOLVED
- body fetch → FETCH_ERROR
- parse/store → unchanged record-and-rethrow
The first two are filing-level, recorded + swallowed so batch runs continue and the entries join the version-gated retry sweep. Applies to all forms; the fetch is isolated behind an overridable runFetch() seam so the failure paths are unit-tested without the network.
Stream 2: SPAC classification + sponsor entities (+ DRS)
- Form_S_1.parse() / Form_DRS.parse() parse the full-submission .txt <SEC-HEADER> (SIC/CIK/name/date) and select the primary <DOCUMENT> body. DRS and DRS/A ride the same S-1 extractor; registration forms fetch the .txt. DRSLTR is excluded (correspondence).
- S1ClassificationRepo records is_spac = (header SIC === 6770) with a classifier_source seam (sgml-header | sic-unknown) for a future AI classifier. The header SIC is immutable/point-in-time, so a de-SPAC'd company still reads as a SPAC from its original filing.
- CanonicalSponsorFamily tier (own sponsor-family resolver kind, bootstrapped at 1.0.0, + alias table). SponsorFamilyMembership (legal-sponsor ↔ family) and SpacSponsorLink (issuer CIK → sponsor → family) yield the durable cross-SPAC identifier ("all SPACs backed by X"), including distinct legal sponsors that share a common name.
- sec canonical sponsor-family alias|alias-list and an alias-aware sec spac by-family <name>.
Notes
- sponsor-family resolver kind is bootstrapped fresh.
- tsc clean.
Specs & plans: workglow-dev/prd companion PR.
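The parser step described under Stream 2 (read SIC and CIK from the header, then pick the primary document body) can be sketched as below. This is an illustration under assumptions, not parseRegistrationSubmission itself: `parseSubmissionSketch` is a hypothetical name, the header is assumed to carry either SGML-style tags (`<ASSIGNED-SIC>`, `<CIK>`) or the labelled lines found in EDGAR full-submission text files, and "primary" is taken to mean the first `<DOCUMENT>` whose `<TYPE>` equals the form type.

```typescript
interface ParsedSubmissionSketch {
  sic: number | null;
  cik: string | null;
  body: string;
}

function parseSubmissionSketch(txt: string, formType: string): ParsedSubmissionSketch {
  const headerClose = "</SEC-HEADER>";
  const headerEnd = txt.indexOf(headerClose);
  const header = headerEnd >= 0 ? txt.slice(0, headerEnd) : "";

  // Accept either the tag form or the labelled form (both are assumptions here).
  const sicMatch =
    /<ASSIGNED-SIC>\s*(\d{3,4})/.exec(header) ??
    /STANDARD INDUSTRIAL CLASSIFICATION:[^\n]*\[(\d{3,4})\]/.exec(header);
  const cikMatch =
    /<CIK>\s*(\d{1,10})/.exec(header) ?? /CENTRAL INDEX KEY:\s*(\d{1,10})/.exec(header);

  // Select the first <DOCUMENT> whose <TYPE> line equals the form type.
  const docPattern = /<DOCUMENT>([\s\S]*?)<\/DOCUMENT>/g;
  let body: string | null = null;
  for (let m = docPattern.exec(txt); m !== null; m = docPattern.exec(txt)) {
    const type = /<TYPE>([^\r\n]*)/.exec(m[1])?.[1]?.trim();
    if (type === formType) {
      const text = /<TEXT>([\s\S]*?)<\/TEXT>/.exec(m[1]);
      body = (text ? text[1] : m[1]).trim();
      break;
    }
  }

  // No matching <DOCUMENT>: fall back to everything past the header.
  if (body === null) {
    body = (headerEnd >= 0 ? txt.slice(headerEnd + headerClose.length) : txt).trim();
  }

  return {
    sic: sicMatch ? Number(sicMatch[1]) : null,
    cik: cikMatch ? cikMatch[1] : null,
    body,
  };
}
```

Because the SIC comes from the filing's own header, the value is fixed at filing time, which is why the description notes that a de-SPAC'd company still reads as a SPAC from its original filing.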
https://claude.ai/code/session_01F4rB93F3Ce3Yo31qqDY2Uh
Generated by Claude Code