project_no_training_data_for

The constraints (Rich’s verbatim, Sunday 19 April 2026):

“there will be no training data available for inherit, or inheritkit, so we need to build something that has no need for it”

“we must remember that tt and openinherit are ‘cold starts’ and bootstrapped - we cannot assume we will acquire a lot of customers to help train our own data”

Philosophical articulation (later same day, clarifying the above):

“i am an old-fashioned person. i believe that ‘solid’ data is better than ‘taking a chance’. a legal document can have no element of chance or uncertainty. i believe that most laws can be ‘codified’ i.e. definied programmatically. but i do understand the power and potential of llms. ref training data, i just wanted to ensure that there was no presumption in our plan of massive growth, as that would be a precarious foundation for the entire business”

This third quote reframes the first two. The architectural constraint is NOT a permanent ban on training data. The constraint is:

Legal documents demand zero chance in any layer producing succession determinations. Codification (Catala, Alloy, TLA+, Rego) is the correct paradigm for this. LLM-as-oracle at the canonical layer is unacceptable on legal-certainty grounds, not just technical-reliability grounds.
Most laws can be codified. The Catala-maximal position is philosophically endorsed — encode every rule that can be encoded; the residue (genuine interpretive “reasonable provision”) is small.
The business plan must not presume training-data growth. Option F cannot be load-bearing on a TT corpus that may never arrive. Architecture works Day-1 with zero training data.
LLMs are welcome where chance is verified-downstream or cosmetic: OCR (output human-reviewed via Pydantic validation), NL interface (translates Catala’s deterministic answer), draft-assist (Rich reviews), and residual principle-based reasoning (judgment-assisted, not machine-checked; narrow, particular-not-aggregate).
Training data — if it arrives — is fine. Synthetic from Catala rules, academic partnerships, opportunistic customer consent after SaaS growth are all acceptable inputs to an R&D track. They are never dependencies of the core integrity layer.
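The split between a zero-chance canonical layer and verified-downstream LLM adjuncts can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the names `ExtractedEstate`, `validate`, `approve` and `canonical_shares` are hypothetical, the equal-split rule is a toy placeholder (not real succession law, and not Catala), and stdlib dataclasses stand in for Pydantic validation so the example stays self-contained.

```python
from dataclasses import dataclass, replace
from fractions import Fraction


@dataclass(frozen=True)
class ExtractedEstate:
    """Facts produced by an adjunct (e.g. OCR); untrusted until validated and reviewed."""
    surviving_children: int
    human_reviewed: bool = False


def validate(raw: dict) -> ExtractedEstate:
    """Reject malformed adjunct output before it can reach the canonical layer."""
    n = raw.get("surviving_children")
    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
        raise ValueError(f"invalid surviving_children: {n!r}")
    return ExtractedEstate(surviving_children=n)


def approve(facts: ExtractedEstate) -> ExtractedEstate:
    """Record that a human has reviewed the extracted facts."""
    return replace(facts, human_reviewed=True)


def canonical_shares(facts: ExtractedEstate) -> list[Fraction]:
    """Toy deterministic rule (hypothetical, NOT real law): equal split among children."""
    if not facts.human_reviewed:
        raise PermissionError("facts must be human-reviewed before a determination")
    n = facts.surviving_children
    # Exact fractions keep the result free of floating-point rounding.
    return [Fraction(1, n)] * n if n else []
```

The design point is that no LLM call appears inside `canonical_shares`: the adjunct can only influence the result through facts that have passed both validation and review.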

Rich’s structured brainstorm answers (same day), recalibrated by the later philosophical articulation:

Architecture-scope permanent — architecture must not depend on TT training data ever; no Day 1 or Day N presumption of corpus availability (A1=a, recalibrated from “permanent ban” to “permanent not-presumed”).
None of these are architecture-dependencies — customer outputs (LegacyLists records), synthetic data from Catala, OSS legal corpora are NOT counted as architectural substitutes for codification (A2=none). Architecture must work without any of them. BUT if they arrive as R&D inputs, they’re welcome — just never load-bearing.
Off-the-shelf LLMs OK for narrow-adjunct layer (OCR + NL interface + statute-to-Catala drafting + user-facing explanations) — all three vendors (OpenAI, Anthropic, Google Gemini) behind an AI Gateway (B1+B2).
Commercial moat ranking (1 = primary): 1. SaaS product revenue (InheritWills / MyFamilyInherits / LegacyLists ARR); 2. InheritKit code (AGPL-3.0 + dual-commercial, iText playbook); 3. INHERIT standard brand + specification + governance (Apache 2.0, institutionally controlled); 4. Catala rulebase (per jurisdiction per family, proprietary commercial licence).
Acquirer-appeal: some acquirers will demand a custom LLM in the deal (D1=b). TT can pursue custom LLM as R&D optionality for more sophisticated tasks (not Day 1 dependency). Logic-canonical stack is the integrity story; custom LLM is the “future R&D” signal.
Acquirer profile: wealth-management / insurance / legal platform (D2).
AI-vendor commercial routes: MaaS (hosted Catala-logic API — no LLM) + commercial licence for embedded use both survive. DaaS (training-corpus bundle) dead.
PII redaction before external API calls is “highly likely” the right pattern (F2) — needs architecting.
Team for INHERIT v2 + InheritKit: Rich + Claude + Kindle-sourced library. No ML-ops contractor, no corpus-labelling solicitor, no external hires (G1). Budget OK (G2).
Partnership / OSS / academic corpora acceptable if they arrive (H1). Prefer real data over synthetic (H2). But none of these are assumed in baseline architecture.
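The AI Gateway (B1+B2) and the redact-before-external-call pattern (F2) could fit together roughly as follows. This is a sketch under stated assumptions, not a design: `Gateway`, `redact`, `restore` and the provider callables are hypothetical names, no real vendor SDK is called, and the two regexes are deliberately simplistic (real PII such as names and addresses would need far more than this).

```python
import re
from typing import Callable, Dict, Tuple

# Illustrative patterns only; these are assumptions, not a complete PII taxonomy.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace matches with placeholder tokens and return the token-to-original map."""
    mapping: Dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def substitute(match: "re.Match[str]", label: str = label) -> str:
            token = f"[{label}_{len(mapping)}]"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(substitute, text)
    return text, mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back into a vendor reply, locally."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


Provider = Callable[[str], str]  # hypothetical: one callable per vendor adapter


class Gateway:
    """Single choke point: every external call is redacted first."""

    def __init__(self, providers: Dict[str, Provider]) -> None:
        self._providers = providers

    def complete(self, vendor: str, prompt: str) -> str:
        safe_prompt, mapping = redact(prompt)
        reply = self._providers[vendor](safe_prompt)
        return restore(reply, mapping)
```

Because callers only see `Gateway.complete`, swapping or adding a vendor is a change to the `providers` mapping rather than to calling code, and redaction cannot be bypassed by accident.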

Stated after Option F was already committed + pushed. Modifies Option F; supersedes any prior memo or spec that assumed a TT-owned training corpus or TT-fine-tuned model.

What this means concretely:

No TT-owned training corpus. There will be no proprietary corpus of annotated succession scenarios across 21 jurisdictions. The 10-50k-scenario corpus planned in Option E / Option F Phase 3 will not be built.
No TT-fine-tuned LLM. Option E’s “TT-LLM-as-canonical-interpreter” and Option F’s Phase 3 QLoRA + SFT fine-tune on Mistral Nemo 12B are both dead. TT does not train or own fine-tuned model weights.
Off-the-shelf LLMs remain acceptable for narrow adjunct tasks. Vendor APIs (OpenAI, Anthropic, Google Gemini) or self-hosted open-weights (Qwen 2.5, Llama 3.1, Gemma 2, Mistral Nemo) as shipped, no fine-tune, are acceptable for OCR + natural-language interface + statute-to-Catala draft assist. The architecture depends on general-purpose LLMs, not on TT-specific model artefacts.

Why (implicit from Rich’s statement + architectural context):

Training data is expensive, slow, and politically/commercially hard to assemble for succession-law scenarios (privacy, consent, jurisdiction coverage).
Rich’s prior directive “logic trumps hallucinations” + “LLMs powerful for OCR but not for making decisions about a will via aggregation of other wills” already scoped LLMs to narrow tasks.
This new constraint eliminates the corpus+fine-tune tooling chain entirely — architecture becomes leaner, and the commercial moat shifts from model/corpus to rulebase/code.

How to apply:

Option F architecture revision:
- §1.3 (LLM narrow-adjunct layer) describes off-the-shelf vendor APIs, not fine-tuned TT weights.
- §1.7 Runtime: unchanged.
- §1.8 Infrastructure: Kubernetes-for-inference becomes optional (only if self-hosted open-weights are needed for privacy); most deployments call vendor APIs.
- §6 LOC: remove fine-tuning pipeline (~5,500 LOC) + vLLM orchestration (~2,000 LOC); F drops from ~189k to ~181k.
- §7 LLM-moat analysis: “three commercial assets” collapses to “one: Catala rulebase + InheritKit code licence”.
- Phased delivery: Phase 3 fine-tune is removed; former Phase 4 tasks move to Phase 3.
Scorecard rescoring (v1.1 → v1.2):
- Criterion 15 (LLM-moat potential) drops 2 points for every option that depended on TT-trained weights + corpus: D 5 → 3; E 5 → 3; F 4 → 2.
- Criterion 14 (Talent pool) rises 1 point where the ML-ops role is no longer needed: F 3 → 4.
- Criterion 16 (Acquirer-appeal) drops 1 point for fewer commercial assets: D 5 → 4; E 5 → 4; F 5 → 4.
- Net: F 70 → 68; D 66 → 63; E 60 → 57. F remains the highest-scoring option, 5 points ahead of D.
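The net figures in the rescoring above can be checked with a few lines of arithmetic. The dictionary layout and key names below are only an assumed way of writing the numbers down; the scores themselves are taken from the text.

```python
# Per-criterion changes (new score minus old score), as stated in the rescoring.
deltas = {
    "D": {"criterion_15": 3 - 5, "criterion_16": 4 - 5},
    "E": {"criterion_15": 3 - 5, "criterion_16": 4 - 5},
    "F": {"criterion_15": 2 - 4, "criterion_14": 4 - 3, "criterion_16": 4 - 5},
}

# Totals under scorecard v1.1.
v1_1 = {"D": 66, "E": 60, "F": 70}

# Totals under v1.2 are the old totals plus the sum of the changes.
v1_2 = {option: v1_1[option] + sum(deltas[option].values()) for option in v1_1}

print(v1_2)                   # {'D': 63, 'E': 57, 'F': 68}
print(v1_2["F"] - v1_2["D"])  # 5
```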
AI-vendor commercial thesis is partially preserved. The three-route framing in project_ai_vendor_commercial_model.md:
- “Training-data bundle licence-contingent” — DEAD. No corpus to license to OpenAI/Anthropic/Google/Meta/xAI.
- “Hosted metered API” — SURVIVES in modified form. It’s no longer a TT-fine-tuned-LLM API; it’s a hosted wrapper over Catala + InheritKit logic (deterministic rule evaluation as a service). Still billable per request.
- “Commercial licence for embedded use” — SURVIVES unchanged. Same iText-like AGPL-dual pattern. AI vendors license INHERIT schema + InheritKit code + Catala rulebase for embedding in their products.
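As a rough illustration of the surviving hosted-API route, deterministic rule evaluation billed per request might look like the sketch below. All names (`MeteredRuleService`, `placeholder_rule`, the `api_key` argument) are hypothetical, the rule is a stand-in rather than anything from the Catala rulebase or real law, and counting only successful evaluations as billable is an assumption.

```python
from collections import Counter
from typing import Any, Callable, Dict


def placeholder_rule(request: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for a compiled rule; not a real Catala rule or real law."""
    return {"is_minor": request["age"] < 18}


class MeteredRuleService:
    """Wraps a pure evaluation function and counts billable requests per API key."""

    def __init__(self, evaluate: Callable[[Dict[str, Any]], Dict[str, Any]]) -> None:
        self._evaluate = evaluate
        self.usage: Counter = Counter()

    def handle(self, api_key: str, request: Dict[str, Any]) -> Dict[str, Any]:
        # Evaluate first: if the request is invalid, the exception propagates
        # and the request is not metered.
        result = self._evaluate(request)
        self.usage[api_key] += 1
        return result
```

No model is involved anywhere in the request path, so the same request always yields the same answer, which is the property the hosted route is selling.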
Commercial moat shift. Was: corpus (DaaS-compounding) + fine-tuned model + Catala rulebase + InheritKit code. Now: Catala rulebase (per jurisdiction per family, proprietary commercial licence) + InheritKit code (AGPL-dual) + SaaS product revenue (InheritWills, MyFamilyInherits) + TT-hosted wrapper over Catala (hosted metered API). Fewer assets; stronger integrity narrative (logic-canonical with zero LLM dependency at the canonical layer).
Acquirer narrative is actually strengthened. Legal-sector acquirers (wealth management, insurance, legal platforms, Big Four) trust deterministic integrity more than LLM sophistication. “Logic-canonical with zero LLM dependency + off-the-shelf LLMs for UX only” is a simpler, more defensible pitch than “TT-trained LLM in the loop”. No hallucination risk to diligence. No training-data privacy/consent/GDPR liability. No ML-ops maintenance burden transferring to the acquirer.
Talent-pool implications. No ML-ops contractor needed. No fine-tuning engineer needed. The team is: Python + Catala + Rust (InheritKit SDK) + partner-solicitors authoring rulebase + Alloy/TLA+ awareness. Substantially narrower hiring profile; substantially lower burn rate.
Privacy + GDPR posture is actually improved. No TT-owned training corpus means no consent-tracking for corpus contributors, no right-to-be-forgotten-from-corpus liability, no re-training-after-deletion complexity. Calls to off-the-shelf vendor APIs are governed by the vendor’s terms; TT’s only data handling is per-customer (RaaS deployments, bounded, contract-specified).
What to revise going forward:
- option-F.md v1.0 → v1.1 (remove fine-tune references, reframe LLM as off-the-shelf, update moat + commercial routes + LOC)
- scorecard v1.1 → v1.2 (F column rescoring; D/E rescoring against same constraint; exec summary + recommendation revision)
- option-F-evidence-digest.md (update §6 Option F shape + §7 scoring expectation)
- InheritKit design spec (consistency check — any “TT LLM” references need reframing)
- AI-vendor commercial memo (if it exists as a separate spec): remove the DaaS route
- Roadmap spec if any phase references corpus or fine-tune

Related memories:

feedback_logic_trumps_hallucinations.md — Rich’s prior 18 April 2026 directive. This new constraint extends that principle by removing the TT-training option entirely. Logic remains canonical; LLM narrow-adjunct remains narrow; but now LLM adjunct is off-the-shelf only.
project_ai_vendor_commercial_model.md — DaaS route is dead; embedded-licence + hosted-API routes survive.
project_options_d_and_e_synthesis.md — the “acquirers will expect TT to be using our own LLMs in some capacity” directive is superseded. Acquirers will now expect TT to have a defensible logic-canonical standard, not an LLM moat.

Applies to: INHERIT v2 rebuild, Option F architecture, InheritKit SDK, LegacyLists (as InheritKit consumer), InheritWills.com, MyFamilyInherits.com, every future TT product. When in doubt: the architecture must work with off-the-shelf LLMs only, and the defensible commercial moat is rulebase + code + SaaS revenue.
