Tier 2 pgvector library index VALIDATED 2026-05-03T09:10 BST. Pre-spike infrastructure for the 22-spike Q&A-formulation suite. Built a pgvector-backed semantic retrieval substrate over 4 corpus kinds:

source_type        files  chunks
v6.6-extension        42     603
v6.6-reference        34    1446
eps-iota-t-file       12     605
cumulative-state       6     271
TOTAL                 94    2925

Pipeline: chunked at 500-char windows with 50-char overlap → embedded with fastembed BAAI/bge-small-en-v1.5 (384-dim; model cached during the 2026-05-03 install pre-flight) → batch-inserted into the library_chunks table in the inherit_spikes Postgres db, with an HNSW index (m=16, ef_construction=64 per S2.9 v1.1 lock).
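
The chunking step can be sketched as follows. This is a minimal illustration of fixed-size windows with overlap, not the code in index_library.py; the function name `chunk_text` and the handling of the final short chunk are assumptions.

```python
from typing import List


def chunk_text(text: str, window: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the defaults mirror the 500/50 settings in the
    pipeline description; the real indexer may treat edges differently.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap  # each window starts 450 chars after the previous one
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1000))
    parts = chunk_text(sample)
    print(len(parts), [len(p) for p in parts])
```

The last 50 characters of each chunk repeat as the first 50 of the next, so a sentence that straddles a window boundary still appears whole in at least one chunk as long as it is shorter than the overlap.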

Timing: 134s end-to-end (model load 0.23s; embed 127.54s @ 22.9 chunks/sec; insert 6.11s; 0 errors). Retrieval latency: 200-264ms per query including fastembed cold-load; sub-50ms once loaded.
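
As a quick consistency check on the timing figures, the arithmetic can be reproduced directly. The stage durations and the 2925-chunk total are taken from this note; the variable names are illustrative only.

```python
# Stage durations in seconds, as reported in the timing line.
stages = {"model_load": 0.23, "embed": 127.54, "insert": 6.11}
total_chunks = 2925  # TOTAL row of the corpus table

total_seconds = sum(stages.values())
embed_rate = total_chunks / stages["embed"]
embed_share = stages["embed"] / total_seconds

print(f"end-to-end: {total_seconds:.2f}s")
print(f"embed throughput: {embed_rate:.1f} chunks/sec")
print(f"embed share of wall time: {embed_share:.1%}")
```

Embedding dominates the run, so any future speed-up would most plausibly come from the embedding stage rather than from the insert.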

7/7 smoke queries returned canonically-relevant top results (sim 0.713-0.848):

  • “alignment axiom strength” → S2.5+S5+S7+S4 (0.745)
  • “owlready2 emitter LinkML” → cco-bfo-cumulative-state #23 (0.840)
  • “Catala stdlib resolution clerk” → catala-formal-verification #7 (0.723)
  • “Will follow-through” → weak match (0.713); the corpus holds no prior evidence on this topic, so a weak result is the correct behaviour
  • “RNRB taper” → S8 #22 (0.737)
  • “graph-RAG retrieval pgvector HNSW” → S2.9 #30 (0.799)
  • “Cedar symcc verification policy” → S2.10 #30 (0.848)

Query 7 (Cedar) also surfaced the CVC5 env var note in its result #5.
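
For readers interpreting the sim scores, the sketch below shows how a cosine-similarity ranking works in plain Python. It assumes the scores are cosine similarities (in pgvector this would typically be 1 minus the `<=>` cosine distance); this note does not state which metric query_library.py uses, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors; an assumption of this sketch
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Brute-force ranking of (label, vector) pairs by similarity to the query."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    toy = [("a", [0.0, 1.0]), ("b", [1.0, 0.0]), ("c", [1.0, 1.0])]
    print(top_k([1.0, 0.0], toy, k=2))
```

An HNSW index approximates this exhaustive ranking rather than computing it exactly, trading a small amount of recall for much lower query latency.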

Artefacts (durable):

  • ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/schema.sql — DDL
  • ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/index_library.py — indexer (idempotent on UNIQUE(source_path, chunk_idx))
  • ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/query_library.py — query CLI for spike-runners
  • ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/index_library.log — pipeline run log
  • ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/smoke_queries.log — 206-line evidence file
  • T-file: ~/off-github/library/projects/inherit/T-tier-2-pgvector-library-index-2026-05-03.md v1.0
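
The idempotency noted for index_library.py can be sketched as an upsert keyed on the UNIQUE(source_path, chunk_idx) constraint. This is an illustration under assumptions: the column names other than source_path and chunk_idx are guesses, and whether the real indexer overwrites (DO UPDATE) or skips (DO NOTHING) on conflict is not stated in this note.

```python
from typing import Sequence, Tuple

# Assumed column layout; only source_path and chunk_idx come from this note.
UPSERT_SQL = (
    "INSERT INTO library_chunks "
    "(source_type, source_path, chunk_idx, content, embedding) "
    "VALUES (%s, %s, %s, %s, %s::vector) "
    "ON CONFLICT (source_path, chunk_idx) DO UPDATE "
    "SET content = EXCLUDED.content, embedding = EXCLUDED.embedding"
)


def vector_literal(vec: Sequence[float]) -> str:
    """Render a vector in the '[x,y,z]' text form that pgvector accepts."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def upsert_params(
    source_type: str,
    source_path: str,
    chunk_idx: int,
    content: str,
    embedding: Sequence[float],
) -> Tuple[str, str, int, str, str]:
    """Build the parameter tuple for one row of UPSERT_SQL (hypothetical helper)."""
    return (source_type, source_path, chunk_idx, content, vector_literal(embedding))


if __name__ == "__main__":
    # With a DB-API driver such as psycopg, rows like this one could be passed
    # to cursor.executemany(UPSERT_SQL, rows); that wiring is not shown here.
    print(upsert_params("cumulative-state", "docs/example.md", 0, "text", [0.5, 1, -2.25]))
```

Because the conflict target matches the unique key, re-running the indexer over unchanged files leaves the row count the same, and re-running after an edit refreshes the affected chunks in place.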

Spike-runner usage (front-load this in subagent prompts for Spikes 1-22):

~/tools/inherit-spike-env/bin/python \
  ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/query_library.py \
  "<question>" --k 10 --type <optional-source-type-filter>

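For spike-runners that call the query CLI from Python instead of a shell, the invocation above could be wrapped roughly as below. The paths and the `--k` / `--type` flags are the ones given in this note; the helper names and the assumption that results arrive on stdout are illustrative.

```python
import os
import subprocess
from typing import List, Optional

PYTHON = "~/tools/inherit-spike-env/bin/python"
QUERY_CLI = "~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/query_library.py"


def build_query_argv(
    question: str, k: int = 10, source_type: Optional[str] = None
) -> List[str]:
    """Assemble the argv list for query_library.py (hypothetical helper)."""
    argv = [
        os.path.expanduser(PYTHON),
        os.path.expanduser(QUERY_CLI),
        question,
        "--k",
        str(k),
    ]
    if source_type is not None:
        argv += ["--type", source_type]  # optional source_type filter
    return argv


def query_library(
    question: str, k: int = 10, source_type: Optional[str] = None
) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_query_argv(question, k, source_type),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the question as a single list element avoids shell quoting problems when the question itself contains quotes or special characters.
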
Re-index trigger: any update to the v6.6 SEED, the 12 ε.ι T-files, or the 6 cumulative-state docs. Re-run index_library.py (idempotent).

Future enhancements (not blocking): structure-aware chunking (LinkML/markdown-section/Catala-scope boundaries instead of 500-char windows); BGE-base-en 768-dim model upgrade if recall degrades; per-jurisdiction filter support beyond source_type. None required for the 22-spike suite as-locked.
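
One of the enhancements listed above, markdown-section chunking, might look roughly like this. It is a sketch only: it splits on ATX headings (`#` through `######`), does not cover LinkML or Catala scope boundaries, and would mis-split on `#` lines inside fenced code blocks. The function name is hypothetical.

```python
import re
from typing import List

# An ATX heading: 1-6 '#' characters at the start of a line, then a space.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_markdown_sections(text: str) -> List[str]:
    """Split a markdown document so each chunk starts at a heading.

    Any text before the first heading becomes its own leading chunk.
    """
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep preamble text that precedes the first heading
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]


if __name__ == "__main__":
    doc = "intro\n# A\nbody a\n## B\nbody b\n"
    for section in split_markdown_sections(doc):
        print(repr(section))
```

Sections longer than the embedding model's useful input length would still need a secondary split, for which the existing 500-char windowing could serve as a fallback.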

Cross-references:

  • arch-state §13 NEW row added 2026-05-03 (to be referenced at post-Spike-1 lock time)
  • Plan v1.4 → v1.5 with §0 Tier 2 row marked complete
  • This is NOT a derisking spike per the 22-spike plan §0 task list; it is pre-spike enabling infrastructure surfaced via a /review-plan call. It follows the same logging contract as a spike, for consistency.