Phase 4 — run-to-run diffing and upstream summary
Goal
Compare the current upstream snapshot to the previous one so the pipeline can preserve durable cross-run lineage, reuse prior names where safe, identify only changed or new segments for LLM work, and emit compact machine-readable facts plus a human-readable upstream-change summary.
Script
scripts/diff-runs.js
Inputs
- --prev <runs/<id>/manifest.json>
- --next <runs/<id>/manifest.json>
- Phase 3 context artifacts for both runs
- top-level lineage/ durable store
Core model
- A segment is the ingest-level code unit emitted by Phase 1, not an individual variable or binding.
- Run-local segment IDs are not durable identity across runs.
- Phase 4 mints and carries durable lineage identity across runs.
- Every next segment gets a lineage record, including brand-new segments.
- Deletions are modeled as retained tombstones, not removed records.
- Split and merge relationships are explicit lineage edges, not implied labels.
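The core-model rules above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the record fields (lineageId, runLocalId, status, lastKnown), the counter-based ID format, and the helper names are all assumptions, since this plan does not specify them.

```javascript
// Hypothetical sketch: every `next` segment ends up with a lineage record,
// and segments that disappear become tombstones instead of being dropped.
// Field names and the ID format are illustrative assumptions.

function createMinter(prefix = "lin") {
  let counter = 0;
  return () => `${prefix}-${String(++counter).padStart(6, "0")}`;
}

// `matches` maps a next run-local segment ID to the lineage ID it inherits.
function assignLineage(prevRecords, nextSegments, matches, mint) {
  const records = [];
  const carried = new Set();

  for (const seg of nextSegments) {
    const inherited = matches.get(seg.id);
    if (inherited) carried.add(inherited);
    records.push({
      lineageId: inherited ?? mint(), // brand-new segments get a fresh ID
      runLocalId: seg.id,             // run-local, not durable identity
      status: "live",
    });
  }

  for (const prev of prevRecords) {
    if (prev.status === "tombstone") {
      records.push(prev); // earlier tombstones are retained as-is
    } else if (!carried.has(prev.lineageId)) {
      // Not carried forward: retire the lineage ID but keep the record.
      records.push({
        lineageId: prev.lineageId,
        status: "tombstone",
        lastKnown: { runLocalId: prev.runLocalId },
      });
    }
  }
  return records;
}
```

The run-local ID is kept only as a field on the record; the lineage ID is the sole identity that survives from one run to the next.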
Matching strategy
- exact normalizedHash
- exact shapeHash plus bounded similarity checks on source length
- deterministic fuzzy matching using segment kind, string literals, Phase 3 context packets, and export hints
- optional cheap-LLM assistance for ambiguous candidate ranking only; it is advisory and must not become the source of truth for matching, splitting, or diffing
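The deterministic tiers above could be ordered roughly as in the sketch below. It is a sketch under stated assumptions: the segment fields (id, sourceLength, kind, stringLiterals), the bound values, and the Jaccard-based fuzzy score are illustrative, and the fuzzy tier here uses only segment kind and string literals, leaving out Phase 3 context packets and export hints.

```javascript
// Hypothetical sketch of the tiered matching order. Field names, thresholds,
// and the scoring function are assumptions for illustration only.

// Source-length check with both a percentage bound and an absolute bound.
function withinLengthBounds(a, b, maxRatio = 0.2, maxAbs = 400) {
  const delta = Math.abs(a.sourceLength - b.sourceLength);
  const longer = Math.max(a.sourceLength, b.sourceLength, 1);
  return delta <= maxAbs && delta / longer <= maxRatio;
}

// Set overlap of two lists of string literals, in the range 0..1.
function jaccard(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  for (const item of setA) if (setB.has(item)) shared++;
  return shared / (setA.size + setB.size - shared);
}

function fuzzyScore(prev, next) {
  if (prev.kind !== next.kind) return 0;
  return jaccard(prev.stringLiterals, next.stringLiterals);
}

function findCandidates(nextSeg, prevSegs) {
  // Tier 1: exact normalizedHash.
  const exact = prevSegs.filter((p) => p.normalizedHash === nextSeg.normalizedHash);
  if (exact.length > 0) return { tier: "normalizedHash", candidates: exact };

  // Tier 2: exact shapeHash, gated by bounded source-length similarity.
  const shape = prevSegs.filter(
    (p) => p.shapeHash === nextSeg.shapeHash && withinLengthBounds(p, nextSeg)
  );
  if (shape.length > 0) return { tier: "shapeHash", candidates: shape };

  // Tier 3: fuzzy ranking, with an ID tie-break so the order is deterministic.
  const fuzzy = prevSegs
    .map((p) => ({ segment: p, score: fuzzyScore(p, nextSeg) }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score || a.segment.id.localeCompare(b.segment.id));
  return { tier: "fuzzy", candidates: fuzzy.map((c) => c.segment) };
}
```

Any LLM-assisted ranking would sit outside this function and could only reorder candidates the deterministic tiers already produced.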
Matching rules
- Reserve split and merged for high-confidence structural cases.
- When candidates remain contested, emit ambiguous instead of forcing a winner.
- Ambiguous segments still emit full evidence and can contribute passive downstream context.
- Ambiguous segments are excluded from automated lineage-dependent downstream actions until resolved.
- Similarity checks should use both percentage and absolute source-length bounds.
- Matching compares only the adjacent prev and next runs; transitive lineage can be added later if adjacent-run matching proves insufficient.
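One way to avoid forcing a winner is to require both a minimum score and a clear lead over the runner-up. The sketch below assumes scored candidates of the form { prevId, score }; the threshold values and the outcome and automationBlocked names are hypothetical.

```javascript
// Hypothetical sketch: pick a winner only when the top candidate clears a
// minimum score and leads the runner-up by a clear margin. Otherwise report
// the match as ambiguous and keep the full ranked list as evidence.
// All thresholds are illustrative assumptions.

function resolveMatch(candidates, { floor = 0.3, minScore = 0.8, minMargin = 0.15 } = {}) {
  const ranked = [...candidates].sort((a, b) => b.score - a.score);

  // Nothing plausible at all: the caller can treat the segment as new.
  if (ranked.length === 0 || ranked[0].score < floor) {
    return { outcome: "none", candidates: ranked };
  }

  const [best, runnerUp] = ranked;
  const margin = runnerUp ? best.score - runnerUp.score : Infinity;

  if (best.score >= minScore && margin >= minMargin) {
    return { outcome: "winner", winner: best.prevId, candidates: ranked };
  }

  // Contested: emit evidence, but block lineage-dependent automation.
  return { outcome: "ambiguous", candidates: ranked, automationBlocked: true };
}
```

A winner would still need to be classified as unchanged or modified by a later step; this function only decides whether a single match can be trusted.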
Output classifications
- unchanged
- modified
- new
- deleted
- split
- merged
- ambiguous
Deliverables
- runs/<next-id>/reports/changed-segments.json
- runs/<next-id>/relabel-queue.jsonl
- runs/<next-id>/reports/upstream-summary.json
- runs/<next-id>/reports/upstream-summary.md
- runs/<next-id>/reports/ambiguous-matches.json
- append-only lineage events written under top-level lineage/
Artifact shape
- changed-segments.json should be canonical per next segment, with explicit match evidence, confidence, lineage IDs, lineage family IDs where relevant, and split/merge links.
- Deleted segments should be emitted as retained tombstones carrying retired lineage IDs and last-known segment metadata.
- Ambiguous match artifacts should include ranked candidate matches and the exact evidence used to score them.
- Machine-readable outputs are the source of truth for later phases.
- Human-readable summary prose is derived from machine facts, not the reverse.
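To make the artifact shape concrete, the sketch below shows what one changed-segments.json entry and one tombstone might look like, plus a small consistency check. Every field name and value is a hypothetical placeholder; the actual schema is not fixed by this plan.

```javascript
// Hypothetical record shapes. Field names and values are assumptions.
const modifiedExample = {
  nextSegmentId: "seg-0042",
  classification: "modified",
  lineageId: "lin-000017",
  lineageFamilyId: null, // set only when the segment belongs to a split family
  confidence: 0.93,
  evidence: [{ signal: "shapeHash", detail: "exact match within length bounds" }],
  links: { splitFrom: null, mergedFrom: [] },
};

const tombstoneExample = {
  nextSegmentId: null,
  classification: "deleted",
  lineageId: "lin-000009", // retired, but still recorded
  lastKnown: { prevSegmentId: "seg-0031", kind: "function", sourceLength: 812 },
};

const CLASSIFICATIONS = new Set([
  "unchanged", "modified", "new", "deleted", "split", "merged", "ambiguous",
]);

// Returns a list of problems; an empty list means the record looks consistent.
function checkRecord(record) {
  const problems = [];
  if (!CLASSIFICATIONS.has(record.classification)) problems.push("unknown classification");
  if (!record.lineageId) problems.push("missing lineageId");
  if (record.classification === "deleted") {
    if (!record.lastKnown) problems.push("tombstone missing lastKnown metadata");
  } else {
    if (!record.nextSegmentId) problems.push("missing nextSegmentId");
    if (!Array.isArray(record.evidence)) problems.push("missing evidence list");
  }
  return problems;
}
```

A check like this could run before the artifact is written, so later phases can rely on the machine-readable output without re-validating it.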
Queueing rules
- Unchanged segments are excluded from the relabel queue.
- Deleted segments are excluded from the relabel queue.
- New and modified segments are eligible for the relabel queue.
- Ambiguous segments should go to match review, not directly into automated rename reuse.
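The queueing rules above amount to a routing step over classified records, sketched below. The function names and record shape are assumptions. The rules do not say how split and merged segments are queued, so this sketch deliberately leaves them unrouted.

```javascript
// Hypothetical sketch of routing records by classification.
const RELABEL_ELIGIBLE = new Set(["new", "modified"]);

function routeSegments(records) {
  const relabelQueue = [];
  const matchReview = [];
  for (const record of records) {
    if (record.classification === "ambiguous") {
      matchReview.push(record); // goes to match review, not rename reuse
    } else if (RELABEL_ELIGIBLE.has(record.classification)) {
      relabelQueue.push(record);
    }
    // unchanged and deleted are excluded; split and merged are not covered
    // by the stated rules and are left unrouted in this sketch.
  }
  return { relabelQueue, matchReview };
}

// Serialize records as JSON Lines: one JSON object per line.
function toJsonl(records) {
  return records.map((r) => JSON.stringify(r)).join("\n") + (records.length ? "\n" : "");
}
```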
Lineage model
- Durable lineage is stored in a top-level lineage/ directory.
- The lineage store is append-only.
- Corrections are expressed as superseding events, never mutation or deletion.
- Splits create a lineage family plus child lineage IDs.
- Merges create a new lineage node linked to multiple parents.
- Graph-oriented projections or adjacency indexes may be derived later, but raw lineage events remain canonical.
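The append-only and superseding-event rules can be illustrated with an in-memory log, as sketched below. The real store lives under lineage/ on disk; the event fields (type, familyId, parents, children, supersedes) and the sequence-number scheme are assumptions made for this example.

```javascript
// Hypothetical in-memory sketch of an append-only lineage log.
function createLineageLog() {
  const events = [];
  return {
    // The only write operation: there is no update or delete.
    append(event) {
      const stored = Object.freeze({ ...event, seq: events.length + 1 });
      events.push(stored);
      return stored;
    },
    all() {
      return events.slice(); // copy, so callers cannot alter the log
    },
  };
}

// Derived view: events named by a later `supersedes` are hidden from the
// projection but remain in the raw log, which stays canonical.
function project(events) {
  const superseded = new Set(
    events.filter((e) => e.supersedes != null).map((e) => e.supersedes)
  );
  return events.filter((e) => !superseded.has(e.seq));
}

const log = createLineageLog();
// A split creates a family plus child lineage IDs.
log.append({ type: "split", familyId: "fam-1", parent: "lin-1", children: ["lin-2", "lin-3"] });
// A merge creates a new node linked to multiple parents.
log.append({ type: "merge", lineageId: "lin-9", parents: ["lin-4", "lin-5"] });
// A correction supersedes event 2 instead of editing it.
log.append({ type: "merge", lineageId: "lin-9", parents: ["lin-4", "lin-6"], supersedes: 2 });

console.log(project(log.all()).map((e) => e.seq)); // seq 1 and 3 remain
console.log(log.all().length); // 3: the superseded event is still stored
```

Adjacency indexes or graph views could be rebuilt from log.all() at any time, which is why the raw events, not the projection, are treated as canonical.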
Summary requirements
- The primary audience for the summary is a human reviewer.
- The summary should describe changed capabilities or areas when detectable.
- It should note prompt changes, endpoint changes, feature additions, and important constant or behavior shifts when detectable.
- Such claims should be grounded in explicit evidence from Phase 3 context or other recorded Phase 4 signals.
- Weaker claims should be phrased as detected signals, not asserted as certainty.
- Avoid storing giant line-by-line ledgers.
- Provide enough detail to understand what new material is being sent to the LLM.
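One possible way to keep the summary prose derived from machine facts is to render it directly from the JSON, as in the sketch below. The facts structure (segments, claims with text, confidence, and evidence) and the assertAt threshold are hypothetical; the point is that weaker claims are phrased as detected signals and every claim cites its evidence.

```javascript
// Hypothetical sketch: render the human-readable summary purely from
// machine facts. Input shape and threshold are assumptions.
function renderSummary(facts, { assertAt = 0.9 } = {}) {
  // Compact counts per classification instead of a line-by-line ledger.
  const counts = {};
  for (const rec of facts.segments) {
    counts[rec.classification] = (counts[rec.classification] ?? 0) + 1;
  }

  const lines = ["Upstream change summary", ""];
  for (const name of Object.keys(counts).sort()) {
    lines.push(`- ${name}: ${counts[name]}`);
  }
  lines.push("");

  for (const claim of facts.claims) {
    // Below the threshold, phrase the claim as a signal, not a certainty.
    const prefix = claim.confidence >= assertAt ? "Changed" : "Detected signal";
    lines.push(`- ${prefix}: ${claim.text} (evidence: ${claim.evidence.join(", ")})`);
  }
  return lines.join("\n");
}
```

Because the function takes only the machine facts as input, regenerating upstream-summary.md from upstream-summary.json always yields the same text.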
Verification
- Diff two ingests of the same bundle and expect almost all unchanged.
- Edit a fixture slightly and verify only nearby segments classify as changed.
- Confirm unchanged and deleted segments are excluded from the LLM queue.
- Confirm ambiguous segments produce a review artifact with ranked candidates and evidence.
- Confirm lineage events are appended under lineage/ without deleting prior IDs or tombstones.