Tier 2 pgvector library index VALIDATED 2026-05-03T09:10 BST. Pre-spike infrastructure for the 22-spike Q&A-formulation suite. Built a pgvector-backed semantic retrieval substrate over 4 corpus kinds:
| source_type | files | chunks |
|---|---|---|
| v6.6-extension | 42 | 603 |
| v6.6-reference | 34 | 1446 |
| eps-iota-t-file | 12 | 605 |
| cumulative-state | 6 | 271 |
| TOTAL | 94 | 2925 |
Pipeline: chunked at 500-char windows with 50-char overlap → fastembed BAAI/bge-small-en-v1.5 (384-dim; cached 2026-05-03 install pre-flight) → batch-inserted into library_chunks table in inherit_spikes postgres db with HNSW index (m=16, ef_construction=64 per S2.9 v1.1 lock).
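The chunking step can be sketched as below. This is a minimal illustration, not the code in index_library.py: the function name `chunk_text` and the handling of the final short window are assumptions. Only the 500-char window and 50-char overlap come from the pipeline description.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the real indexer may treat the tail differently.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts 450 chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(1000))
    for idx, chunk in enumerate(chunk_text(sample)):
        print(idx, len(chunk))
```

Each window shares its last 50 characters with the start of the next one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.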
Timing: 134s end-to-end (model load 0.23s; embed 127.54s @ 22.9 chunks/sec; insert 6.11s; 0 errors). Retrieval latency: 200-264ms per query including fastembed cold-load; sub-50ms once loaded.
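The reported figures can be cross-checked with a few lines of arithmetic. The numbers below are copied from the table and timing line above; the variable names are illustrative only.

```python
# (files, chunks) per source_type, copied from the corpus table
corpus = {
    "v6.6-extension": (42, 603),
    "v6.6-reference": (34, 1446),
    "eps-iota-t-file": (12, 605),
    "cumulative-state": (6, 271),
}
# Stage timings in seconds, copied from the timing line
stages = {"model_load": 0.23, "embed": 127.54, "insert": 6.11}

total_files = sum(files for files, _ in corpus.values())
total_chunks = sum(chunks for _, chunks in corpus.values())
total_seconds = sum(stages.values())
embed_rate = total_chunks / stages["embed"]

print(f"files={total_files} chunks={total_chunks}")
print(f"end-to-end={total_seconds:.2f}s embed_rate={embed_rate:.1f} chunks/sec")
```

Embedding accounts for roughly 95% of the wall-clock time, so any speed-up work would start there rather than at the insert step.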
7/7 smoke queries returned canonically-relevant top results (sim 0.713-0.848):
- “alignment axiom strength” → S2.5+S5+S7+S4 (0.745)
- “owlready2 emitter LinkML” → cco-bfo-cumulative-state #23 (0.840)
- “Catala stdlib resolution clerk” → catala-formal-verification #7 (0.723)
- “Will follow-through” → honestly weak (no prior evidence; correct behaviour) (0.713)
- “RNRB taper” → S8 #22 (0.737)
- “graph-RAG retrieval pgvector HNSW” → S2.9 #30 (0.799)
- “Cedar symcc verification policy” → S2.10 #30 (0.848)

Query 7 also surfaced the CVC5 env var note in its result #5.
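The log does not state which metric the "sim" scores use. Assuming they are cosine similarities (pgvector's `<=>` operator returns cosine distance, i.e. one minus this value), a brute-force reference ranking can be sketched as follows. The function names and the toy vectors are hypothetical; the real lookups go through the HNSW index rather than a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=10):
    """Rank (label, vector) rows by similarity to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-dim vectors standing in for 384-dim embeddings
    rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    for sim, label in top_k([1.0, 0.0], rows, k=2):
        print(label, round(sim, 3))
```

A linear scan like this is only useful for spot-checking what the approximate index returns on a small sample.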
Artefacts (durable):
- ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/schema.sql: DDL
- ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/index_library.py: indexer (idempotent on UNIQUE(source_path, chunk_idx))
- ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/query_library.py: query CLI for spike-runners
- ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/index_library.log: pipeline run log
- ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/smoke_queries.log: 206-line evidence file
- T-file: ~/off-github/library/projects/inherit/T-tier-2-pgvector-library-index-2026-05-03.md v1.0
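The idempotency noted for the indexer rests on the UNIQUE(source_path, chunk_idx) constraint. The sketch below demonstrates the idea with the standard-library sqlite3 module as a stand-in for Postgres. The `content` column, the helper names, and the choice of `DO NOTHING` (rather than `DO UPDATE`) are assumptions; the actual schema.sql and index_library.py may differ.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the library_chunks table (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE library_chunks ("
        " source_path TEXT NOT NULL,"
        " chunk_idx INTEGER NOT NULL,"
        " content TEXT NOT NULL,"
        " UNIQUE (source_path, chunk_idx))"
    )
    return conn


def index_chunks(conn: sqlite3.Connection, rows) -> int:
    """Insert rows, skipping any (source_path, chunk_idx) already present.

    Returns the table's row count after the insert.
    """
    conn.executemany(
        "INSERT INTO library_chunks (source_path, chunk_idx, content)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT (source_path, chunk_idx) DO NOTHING",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM library_chunks").fetchone()[0]


if __name__ == "__main__":
    conn = make_demo_db()
    rows = [("docs/a.md", 0, "first chunk"), ("docs/a.md", 1, "second chunk")]
    print(index_chunks(conn, rows))  # first run inserts both rows
    print(index_chunks(conn, rows))  # re-run leaves the count unchanged
```

With `DO NOTHING`, a re-run would not refresh chunks whose text changed at the same index; an indexer that must pick up edits would need `DO UPDATE` or a delete-and-reinsert per source file.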
Spike-runner usage (front-load this in subagent prompts for Spikes 1-22):

    ~/tools/inherit-spike-env/bin/python \
    ~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/query_library.py \
    "<question>" --k 10 --type <optional-source-type-filter>

Re-index trigger: any update to the v6.6 SEED, the 12 ε.ι T-files, or the 6 cumulative-state docs. Re-run index_library.py (idempotent).
Future enhancements (not blocking): structure-aware chunking (LinkML/markdown-section/Catala-scope boundaries instead of 500-char windows); BGE-base-en 768-dim model upgrade if recall degrades; per-jurisdiction filter support beyond source_type. None required for the 22-spike suite as-locked.
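For spike-runners that call the query CLI from Python rather than a shell, the spike-runner usage command can be wrapped as sketched here. The interpreter path, script path, and the `--k` / `--type` flags are taken from that command; the function names are hypothetical, and because the CLI's output format is not documented in this note the wrapper returns raw stdout unparsed.

```python
import os
import subprocess

# Paths copied from the spike-runner usage command
PYTHON = os.path.expanduser("~/tools/inherit-spike-env/bin/python")
QUERY_CLI = os.path.expanduser(
    "~/tools/inherit-spike-corpora/tier-2-pgvector-library-index/query_library.py"
)


def build_query_command(question, k=10, source_type=None):
    """Assemble the argv list for query_library.py (hypothetical helper)."""
    cmd = [PYTHON, QUERY_CLI, question, "--k", str(k)]
    if source_type is not None:
        cmd += ["--type", source_type]  # optional source_type filter
    return cmd


def query_library(question, k=10, source_type=None):
    """Run the query CLI and return its stdout as text.

    Raises subprocess.CalledProcessError if the CLI exits non-zero.
    """
    result = subprocess.run(
        build_query_command(question, k, source_type),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the question as a single list element avoids shell quoting problems with questions that contain spaces or quotes.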
Cross-references:
- arch-state §13 NEW row added 2026-05-03 (to be referenced at post-Spike-1 lock time)
- Plan v1.4 → v1.5 with §0 Tier 2 row marked complete
- This is NOT a derisking spike per the 22-spike plan §0 task list; it is pre-spike enabling infrastructure surfaced via a /review-plan call. It is logged under the same contract as a spike for consistency.