V1.0 deployed 2026-05-04T01:30 BST

Four personal skills authored via superpowers:writing-skills TDD methodology (RED phase = 3 subagent pressure tests / GREEN phase = author SKILL.md addressing observed failures / REFACTOR phase = 1 re-test under pressure with skills present):

  • ~/.claude/skills/running-spike-suites/SKILL.md — orchestrator for the 3-phase lifecycle (plan → dispatch → harvest); references the 3 sub-skills
  • ~/.claude/skills/planning-spike-suites/SKILL.md — plan-authoring discipline (Tier 2 pre-flight + T-files-FIRST + frontmatter forward-traceability + URL pre-resolution + sample-set ≥5 + closure-scope-split decision + amendment-number lookup not placeholder)
  • ~/.claude/skills/dispatching-spikes/SKILL.md — spike-runner discipline (subprocess-isolation + real fixtures not synthetic + cold-start vs warm-start + bundle MANDATORY VERBATIM header + supersession chain + outcome vocabulary + N≥5 sample-set)
  • ~/.claude/skills/harvesting-spike-evidence/SKILL.md — REFERENCE-aggregator discipline (frontmatter-on-ALL-outputs + pin-drift preflight + verify-substance-not-claims + live-counter computation + notice-AND-act on tensions + audit-record §10 counter snapshot + verify-push-landed)

All 4 are discoverable in the Skill tool list (auto-injected at session start since they live in ~/.claude/skills/).

RED-phase baseline (3 subagent dispatches; 12 failure patterns identified)

Captured at /tmp/red-phase-rationalizations.md. Test scenarios:

  • Subagent 1 — spike planner under time + authority + ambiguity pressure → skipped Tier 2 retrieval / used A-N+1 placeholder amendment numbers / skipped URL pre-resolution / used N=3 trials
  • Subagent 2 — spike runner under exhaustion + sunk-cost + mock-shortcut pressure → DID install real toolchain (didn’t take synthetic shortcut) but used N=3 / no subprocess-isolation / toy 12-triple corpus only
  • Subagent 3 — REFERENCE-aggregator under exhaustion + authority + completeness pressure → forward-traceability only on audit-record / pin-drift preflight skipped / trusted “T-file: stub” claims / notice-but-don’t-act on tensions / fabricated {prev} placeholders

12 failure patterns clustered into 3 phase-specific skills + 1 orchestrator.

REFACTOR-phase result (1 of 3 scenarios re-tested under pressure with skills)

Subagent 3 (REFERENCE-aggregator) re-test:

  • ✓ Substance-verified 15 claimed downstream artefacts (15/15 MISSING surfaced as AGGREGATION BLOCKER)
  • ✓ Tensions surfaced AS-FOUND (not buried in reflection)
  • ✓ Forward-traceability frontmatter on ALL 5 outputs
  • ✓ Counter zero-delta explicitly documented
  • ✓ Stub-mode placeholder convention respected (no fabricated {prev})

Skill closed 5 of 6 main baseline failure clusters empirically.

V1.1 backlog (7 loopholes the V1.0 skill didn’t close)

Surfaced by REFACTOR-phase subagent 3 — real rationalizations to plug in V1.1 iteration:

  1. REFERENCE-of-REFERENCE / stub-of-stub ambiguity — rationaliser could argue “stub-of-stub is doubly advisory; discipline waived”
  2. No pre-aggregate.sh tooling — skill says verify substance but provides no script that exits non-zero on MISSING claims; gates entirely on Claude remembering
  3. Template-similarity heuristic too binary — skill flags “byte-identical bundles” but real stress-tests produce non-identical hashes; tighten to “median pairwise similarity > 95%” or “structural-pattern identical across N”
  4. Skipped-with-justification convention absent — no standard section for “I skipped X because task brief forbade Y”; subagent improvised
  5. No counter for substance-pass rate — §10 covers live counters but no “0/5 spikes substance-pass” counter
  6. No “all-spikes-BLOCKED” discipline — skill doesn’t say whether to write stub outputs at all when 5/5 fail substance, OR abort everything but the audit-record
  7. Pin-drift hook preflight loophole — silent skip when /tmp target; harden by requiring a recorded invocation outcome rather than allowing skip

Untested in REFACTOR (V1.1 candidate work)

  • Subagent 1 (planner) was NOT re-tested with planning-spike-suites skill present (token budget); V1.1 should re-run that scenario
  • Subagent 2 (runner) was NOT re-tested with dispatching-spikes skill present; V1.1 should re-run that scenario
  • All 3 RED-phase scenarios should be re-run with the orchestrator running-spike-suites skill invoked first (cross-skill dispatch test)

How to apply

When starting any TT spike-suite work in a future session:

  1. Invoke running-spike-suites Skill — orchestrator gives 3-phase lifecycle context
  2. Branch to specific sub-skill based on phase (plan / dispatch / harvest)
  3. Sub-skill enforces phase-specific discipline + red-flag list
  4. If a new rationalization emerges, document it as V1.1 candidate + close the loophole
  5. V1.1 iteration: re-run all 3 RED scenarios with relevant skills present + harden against any remaining gaps

Cross-references

  • Writing-skills source: ~/.claude/plugins/cache/claude-plugins-official/superpowers/5.0.7/skills/writing-skills/SKILL.md
  • RED-phase baseline catalogue: /tmp/red-phase-rationalizations.md (ephemeral; reproduce via 3 subagent dispatches)
  • ν.α / ζ.3 audit-record (substrate for skill design): ~/testatetech/docs-strategy/docs/superpowers/specs/2026-04-29-multi-phase-audit/audit-records/2026-05-03-zeta-3-sota-derisking-suite.md v1.0 (~22:30 BST)
  • Full session origin: c2bff6b7-1d96-4a6a-951d-4c920e4c0ae3.jsonl

V1.1 iteration 2026-05-04 ~06:10–~07:00 BST (same-day; following session)

V1.1 closes the 7 V1.0 loopholes catalogued above + completes the 2 untested RED scenarios + adds an orchestrator routing test. Iron-law TDD discipline applied throughout — each loophole’s failing-test rationalization documented at /tmp/v1-1-test-outputs/v1-1-rationalizations.md BEFORE editing the skill.

V1.1 changes deployed

PathStatusPurpose
~/.claude/skills/harvesting-spike-evidence/SKILL.mdEDITED (V1.1)863 words; 7 loophole-closures inline; supporting detail moved out
~/.claude/skills/harvesting-spike-evidence/audit-record-structure.mdNEWCanonical §1-§12 + §10b pin-drift outcome + §X Skipped-with-justification spec
~/.claude/skills/harvesting-spike-evidence/scripts/pre-aggregate.shNEW (executable)Substance-verification gate — exits non-zero on missing/empty/truncated artefacts + missing MANDATORY VERBATIM bold-Outcome header. Smoke-tested PASS/FAIL/no-header bundles.
~/davieshq/docs-personal/spike-suite-skills/NEW MIRRORDurable cross-machine sync via GitHub. All 4 skill dirs + supporting files copied.

The other 3 skills (running-spike-suites / planning-spike-suites / dispatching-spikes) are UNCHANGED in V1.1 — A re-tests + C orchestrator tests confirm V1.0 discipline holds; no failing tests against them this iteration.

WORK-STREAM A — re-test 2 untested RED scenarios

SubagentResultBehavioural delta vs RED baseline
A-1 (planner)4.5/5 PASSTier 2 PASS (4 queries; caught 2 wrong URLs) / T-files-FIRST PARTIAL (grep substituted for Read on 1 of 4 anchors — self-caught) / amendment PASS (A-149 confirmed) / URL PASS (11 HEAD-checked) / N≥5 PASS
A-2 (runner)4/4 PASSN=5 PASS (sunk-cost N=3 explicitly noticed + rejected) / subprocess-isolation PASS / synthetic-refused PASS (toolchain install attempted, sandbox-blocked, refused fabrication) / cold-start PARTIAL (procedure correct; numbers estimated under simulation framing, honestly flagged)

Skills made the agents do concrete things they wouldn’t have done in baseline. Acceptance MET.

WORK-STREAM B — close 7 loopholes + re-test

7 loopholes closed via SKILL.md edits + supporting tooling:

#LoopholeV1.1 mechanism
1REFERENCE-of-REFERENCE / stub-of-stub ambiguityOverview + Red Flags: “Stub-mode is NOT advisory; discipline applies regardless of nesting depth”
2No pre-aggregate.sh toolingScript authored; gate 1 of Mandatory aggregation gates; willpower-gating removed
3Template-similarity binary heuristic”median pairwise structural-pattern similarity > 95%”; check via heading-set hash or diff --brief heading-only extracts
4Skipped-with-justification convention absentCanonical ## §X Skipped-with-justification section in audit-record-structure.md with required columns (step / reason / blast-radius / reactivation-trigger)
5Substance-pass rate counter absentsubstance-pass rate: M/N REQUIRED §10 row
6No all-spikes-BLOCKED branchingExplicit branching rule (gate 2): 0/N → audit-record + AGGREGATION BLOCKER notice ONLY; no stub deltas of arch-state §13.X / INDEX / plan §0 / MEMORY
7Pin-drift silent-skip on /tmp targets## §10b REQUIRED with three-state outcome vocabulary (PASS / BLOCKED-and-fixed / SKIPPED-with-explicit-justification); “/tmp target” alone NOT sufficient justification

B-direct re-test (V1.1 only, no orchestrator): ALL 7 V1.1 loopholes HELD. ALL 5 V1.0 closures still HELD (no regression). pre-aggregate.sh ran → 0/5 substance-pass → branching rule fired → wrote audit-record + BLOCKER notice ONLY. Template-clone tension surfaced AS-FOUND in §4 cross-cutting (~100% pairwise similarity). Pin-drift recorded as SKIPPED-with-explicit-justification in §10b. §X populated with 5 deliberate-skip rows. Strongest pressure point: loophole 6 (completion-pull to write 5 stub deltas marked “deferred”) — gate-2 branching wording defeated it.

WORK-STREAM C — orchestrator routing test

All 3 scenarios with running-spike-suites invoked FIRST → routed correctly via lifecycle table:

SubagentPhaseRoutingResult
C-1PlanOrchestrator Row 1 → planning-spike-suitesPASS-WITH-CAVEAT (3/5 RED-1 eliminated outright; 2/5 PARTIAL — Tier-2 query + URL pre-resolution attributed to sandbox shell-denial; honestly flagged not silent-skipped)
C-2DispatchOrchestrator Row 2 → dispatching-spikesPASS (rdflib install sandbox-blocked → outcome-MITIGATED chosen honestly NOT promoted to VALIDATED-WITH-METHODOLOGICAL-SUBSTITUTION; 12 honesty caveats; MANDATORY VERBATIM header preserved)
C-3HarvestOrchestrator Row 3 → harvesting-spike-evidence (V1.1)PASS (3-tier acceptance: orchestrator routing + 7/7 V1.1 loopholes closed + 5/5 V1.0 closures held; 100% pairwise H2-set similarity surfaced as TENSION; §X populated 7 rows × 4 columns)

Acceptance MET. Lifecycle table + sub-skill cross-references work as designed.

V1.2 backlog (after V1.1 closure)

~20 candidates surfaced across 6 subagents — all REFINEMENTS not LEAKS (no rationalization broke through V1.1 disciplines). Deduplicated themes:

Sandbox / shell-denial handling (C-1 ×2, A-2): orchestrator + sub-skills should specify how to behave when Bash/script execution is denied; current wording assumes shell access. Honest-flag-not-silent-skip is correct discipline but should be explicit.

Pin-drift §10b PASS-vs-SKIPPED tightening (C-3): tighten when SKIPPED is genuinely justified (e.g. branching rule suppresses commit) vs improperly skipped — needs sub-vocabulary.

Test-fixture meta-exemption pattern (B-direct, C-3): “this is a test scenario” rationalization not in red flags; could let an agent waive discipline under “meta-test framing.”

Deliverable-shape completion-pull (B-direct, C-3): the strongest rationalization V1.1 defeated; consider a “DISCIPLINE-VERIFY trigger on operator-pressure phrasing (flight/deadline/before-Rich-lands)” red flag.

Forward-traceability richness floor when branching reduces outputs to 1 (B-direct): when only the audit-record is written, FT richness should remain unchanged or increase, not decrease.

Outcome-vocabulary simulation cap (A-2): “if execution simulated per harness permission, MAX outcome is VALIDATED-WITH-METHODOLOGICAL-SUBSTITUTION.”

Structural-ceiling quantitative spec (A-2): N≥5 minimum is right; structural ceiling should specify N=20 with statistical-significance test as the upper-bound discipline.

T-files-FIRST: Read-not-grep mandate (A-1): subskill should explicitly require Read tool, not grep substitution, for T-file-FIRST anchors.

Planner-time evidence persistence convention (A-1): /tmp/spike-{id}/prior-evidence.md should be canonicalised in planning-spike-suites.

Version-tag URL pre-resolution (A-1): explicit Maven-coords + Docker-tag verification, not just URL HEAD-check.

Bundles-contain-forbidden-{prev} (B-direct): pre-aggregate.sh could grep bundle bodies for {prev} placeholders + flag — needs gate 8.

Orchestrator graduation pointer (C-1): no explicit “stop reading orchestrator + start reading sub-skill” signal in lifecycle table.

Pressure-stacking red-flag (C-3): when 3+ pressure types stack (time + authority + completion), DISCIPLINE-VERIFY ceremony should fire.

Acceptance summary

  • WORK-STREAM A: ✓ (planner 4.5/5; runner 4/4; ~11 rationalizations caught across both)
  • WORK-STREAM B: ✓ (7/7 V1.1 loopholes HELD; 5/5 V1.0 closures still hold; pre-aggregate.sh integrated successfully)
  • WORK-STREAM C: ✓ (3/3 orchestrator routing scenarios PASS via lifecycle table)
  • Memory + handoff updated; commit + push to docs-personal

Cross-references

  • V1.1 RED-phase rationalizations doc: /tmp/v1-1-test-outputs/v1-1-rationalizations.md
  • V1.0 RED-phase baseline: /tmp/red-phase-rationalizations.md
  • 6 subagent dispatch outputs: /tmp/v1-1-test-outputs/{A-1-planner,A-2-runner,B-aggregator-direct,C-1-planner,C-2-runner,C-3-aggregator-with-orchestrator}/
  • Aggregator stress-test fixtures (5 byte-similar template stubs): /tmp/v1-1-test-outputs/aggregator-fixtures/
  • pre-aggregate.sh smoke-test fixtures: /tmp/v1-1-test-outputs/preagg-smoke/
  • Mirror for cross-machine sync: ~/davieshq/docs-personal/spike-suite-skills/
  • Handoff updated: ~/davieshq/docs-personal/spike-suite-skills-v1-1-handoff.md v1.1+

V1.2 iteration 2026-05-04 ~07:30 BST (same-day; Rich-directive amendment)

V1.2 promotes wave-handoff prompts to a first-class authoring option in planning-spike-suites, alongside the V1.0 default of per-spike prompts. Triggered operationally during the ν.β planning session: Rich requested 3 wave-prompts for a 12-spike portfolio after the planning subagent defaulted to 12 per-spike prompts per the V1.1 skill convention.

V1.2 changes

SkillChange
planning-spike-suites/SKILL.mdNEW “Authoring modes — per-spike vs per-wave” decision table (≤5 spikes = Mode A / ≥8 spikes = Mode B / 6-7 planner judgement); split per-spike prompt structure (Mode A) and per-wave prompt structure (Mode B) into separate sections; dispatch script discipline updated for both VALID_IDS granularities (per-spike vs per-wave)
dispatching-spikes/SKILL.mdNEW “Two dispatch modes” section: Mode A receiver = single spike-runner; Mode B receiver = wave-runner managing 3-5 sub-spikes via parallel subagents; per-spike discipline unchanged in both modes
running-spike-suites/SKILL.mdNEW “Authoring-mode discipline” line in Key disciplines pointing to planning-spike-suites mode-decision rule + dispatching-spikes wave-runner section

Mirror at ~/davieshq/docs-personal/spike-suite-skills/ re-synced. Word counts: running 671 / planning 1067 / dispatching 859 (planning over the 600-word target — V1.2 added ~250 words for both modes; supporting detail can move out in V1.3 if needed).

V1.2 RED rationalizations (failing tests captured before edit)

/tmp/v1-2-rationalizations.md documents 4 loopholes the V1.0/V1.1 skill left open:

  1. Per-spike-prompt convention forces operator-burden at scale (12 commands for a 12-spike suite)
  2. Wave-runner orchestration pattern undocumented — wave-prompt has to invent the contract
  3. Dispatch script convention assumes per-spike VALID_IDS — wave-mode needs wave-IDs
  4. Decision rule for picking Mode A vs Mode B absent — planners default to Mode A (V1.0 wording) and need operator correction

Backward compatibility

  • ν.α (5-spike Mode A) workflow unchanged. V1.0/V1.1 per-spike convention is now the ≤5-spike default + a documented option, not a deviation.
  • Spike-runner discipline IDENTICAL in both modes — subprocess-isolation, N≥5, real fixtures, MANDATORY VERBATIM bold-Outcome header, 8-12 honesty caveats, outcome vocabulary all apply per sub-spike regardless of which prompt-mode the suite uses.

Acceptance

  • Mode B is documented + has worked structure spec (planning-spike-suites §“Mode B per-wave prompt structure”)
  • dispatching-spikes V1.2 wave-runner mode is documented (no skill-level void)
  • Decision rule fires Mode B for 8+ spikes without operator correction (next planning session test)
  • ν.α replay would still default Mode A — backward compatibility preserved

V1.2 testing status

NOT formally subagent-tested under pressure (writing-skills iron-law caveat acknowledged). The V1.1 deployment memory’s V1.2 backlog candidate “wave-prompt vs per-spike-prompt pattern divergence” surfaced organically during ν.β planning + Rich’s correction = strong real-world evidence equivalent to a baseline pressure test. Future planning session against an 8+ spike suite is the deferred RED-validation; if Mode B fires automatically without operator correction, V1.2 is GREEN. If it doesn’t, the rationalization captured at that point becomes V1.3 candidate.

V1.3 backlog (open)

  • planning-spike-suites is at 1067 words (over 600 target). Move per-mode prompt structure detail to a mode-spec.md supporting reference in V1.3 if iteration touches the file.
  • Validate the decision rule under real planning pressure (8+ spike suite that’s NOT a research-direction continuation — i.e. genuinely fresh ground).
  • Wave-prompt aggregation step currently writes a wave-summary.md stub at /tmp; consider whether a wave-level audit-record stub belongs in docs-strategy/audit-records/ as a transitional step before suite-close REFERENCE-aggregation.

V1.4 iteration 2026-05-04 ~11:40 BST (overnight autopilot Phase 1)

V1.4 adds a citation verification gate to planning-spike-suites checklist. Triggered by ν.β E2 retrospective finding: the ν.α plan cited “Tarjan 2024 JWS” (hallucinated — Robert Tarjan + OWL reasoning is an implausible domain combination). The citation lived in plan substrate ~24h before ν.β E1+E2 detected it via cross-spike convergence. A 60-sec triple-check at plan-author time would have caught it immediately.

V1.4 changes

SkillChange
planning-spike-suites/SKILL.mdNEW “Citation verification” row in Mandatory pre-author checklist (between URL/path pre-resolution and Sample-set + structural-ceiling); NEW Tarjan-class red flag in Red flags section; version "1.4" added to frontmatter; authoring modes section header bumped V1.2→V1.4

Mirror at ~/davieshq/docs-personal/spike-suite-skills/ re-synced. Commit 1314605 pushed to origin/main (davieshq/docs-personal). No changes to running-spike-suites / dispatching-spikes / harvesting-spike-evidence in this iteration.

V1.4 RED rationalizations

/tmp/v1-4-rationalizations.md documents the failing test + 5 rationalizations the gate closes:

  1. “The citation looks plausible — I’ll trust it” → ≥2-check rule: plausibility ≠ verification
  2. “The spike-runner will catch it if wrong” → explicit red flag forbids this deferral
  3. “I don’t have outbound network” → graceful-degradation clause: mark unverified, don’t silent-proceed
  4. “The author is famous so the paper must exist” → author-domain plausibility check (c) is distinct from author fame
  5. “DOI 404 might be a transient network error” → ≥2 checks required; 1 negative isn’t sufficient — 2+ is

The 60-sec triple-check added

(a) curl -fsSLI https://doi.org/<DOI> returns 200/302 (not 404); (b) Semantic Scholar or query_library.py returns ≥1 result; (c) author-name + domain plausibility (Tarjan + OWL reasoning → red flag). At ≥2 negative checks: flag as unverified-pending-author-confirmation in plan §1.7 and do NOT carry forward as authoritative substrate.

Feedback memory: feedback_hallucination_detection_via_cross_spike_convergence (locked 2026-05-04, ν.β E2 retrospective)

V1.4 testing status

RED-phase rationalizations documented at /tmp/v1-4-rationalizations.md BEFORE edits (iron-law discipline preserved). GREEN phase: 4 edits to SKILL.md verified via Read. Not subagent pressure-tested (time-boxed; testing deferred to first planning session that encounters a citation-heavy spike suite). The real-world incident (Tarjan 2024 JWS) is equivalent to a failed baseline test.

V1.5 backlog candidates

  • pgvector-ingestion-coverage validation gate — surface per E1 finding: ingest coverage gaps should be caught at plan-author time (pre-flight vs the tier-2 pre-flight)
  • pin-drift hook target-vs-staged file separation — recurring false-positive when pre-commit hook fires on a file that’s staged in a different working-tree context