Compare commits

20 Commits

Author SHA1 Message Date
ada cd0fdadca1 only do single items at a time 2026-05-26 22:44:53 -06:00
ada a2d191d7a7 rename identify packages to new workflow name and only do single items at a time 2026-05-26 22:44:43 -06:00
ada 8fca79e968 doing individual items not entire groups 2026-05-26 19:49:10 -06:00
ada 2c4df3998a rules are list of options not strings 2026-05-26 19:23:19 -06:00
ada 121336a190 initial blueprint for dependency recovery design 2026-05-26 19:13:16 -06:00
ada 281c2a8a94 edits from review skill, merged policies, moved brand constructors 2026-05-25 20:15:26 -06:00
ada 57a691236e edits to seam review skill to use naming docs 2026-05-25 20:04:02 -06:00
ada 858bf67043 renaming variable and a little refactoring 2026-05-25 20:03:36 -06:00
ada 558f2a0ea9 adding seam review skill 2026-05-25 19:40:33 -06:00
ada 5ec088c250 finished refactoring the directory layout, no longer stubs 2026-05-25 02:36:08 -06:00
ada 4f378c3825 refactor to match the repo directory layout 2026-05-25 02:21:26 -06:00
ada 7cc819a589 refactor of ingest snapshot workflow 2026-05-25 01:49:37 -06:00
ada 11e0de528a telling agent to not skip refactor 2026-05-25 01:46:33 -06:00
ada fe93d2c8b4 assembly of ingest pipeline 2026-05-25 01:37:56 -06:00
ada d15e09504a design security scan and fixes, basically all trusted vs tainted input 2026-05-25 01:28:43 -06:00
ada 44d9d2065c ingest blueprint phase 2026-05-25 01:19:10 -06:00
ada af3e18758b discovery and core sketch for ingest snapshot 2026-05-25 01:06:43 -06:00
ada ea73f4814f recovery pipeline feature 2026-05-25 01:06:22 -06:00
ada ebeb802f88 initial decomposition 2026-05-25 01:06:00 -06:00
ada d1c0ff6332 added decompilation phase docs 2026-05-24 23:54:48 -06:00
64 changed files with 4456 additions and 2 deletions
@@ -26,6 +26,8 @@ F# is the primary design language and contract artifact for this phase. Assembly
Use TDD during assembly, but only after the design is frozen. The contract from the blueprint determines the test surface.
Follow the skill exactly; do not skip RED→GREEN→REFACTOR, and report each cycle.
**Constraints:**
- Read `design/workflows/<bounded-context-slug>/shared-model.fs` when present and `design/workflows/<bounded-context-slug>/<workflow-slug>/04-blueprint.fs` as input.
@@ -0,0 +1,163 @@
---
name: tdfddd-seam-review
description: "Evaluates how well a bounded context or seam follows TDFDDD bounded-seam review principles. Use when reviewing a bounded context, workflow seam, policy seam, or public API for intent-first naming, trust boundaries, reviewability, and seam quality."
---
# TDFDDD Seam Review
## Description
Evaluates whether a bounded context exposes strong bounded seams that are easy to review with high confidence.
Use this skill to review an existing bounded context, workflow seam, policy seam, or public API and produce a report about seam quality without changing behavior.
## When to Use This Skill
Activate when the user:
- asks to review a bounded context for clarity or seam quality
- wants to know why a context is hard to review
- wants an evaluation of workflow seams, policy seams, or public APIs
- suspects naming drift, leakage, or tangled responsibilities
- wants guidance before a bounded-seam refactor
## Core Function: The Seam Review Protocol
**Goal:** explain whether the code presents a reviewable bounded seam, why it is or is not working, and what the smallest high-value improvements would be.
**Required input:**
- target bounded context, files, or seam
- optional concern such as naming, trust boundaries, public API sprawl, or review load
If scope is unclear, infer a provisional seam under review and state it explicitly.
## Review Headspace
Treat names as review targets, not as automatically correct just because the initial model exists.
A seam is strong when a reviewer can understand caller intent, authority, invariants, and outcomes without reconstructing internals.
## Instructions
1. Read the relevant code before making any claim.
2. Identify the top bounded seam first.
- usually the context public API or the main workflow entrypoint
- state what command goes in and what event or failure comes out
3. Identify the inner review seams.
- pure policies
- state transitions or model operations
- trust boundaries where tainted input becomes trusted
4. Review the seam using the general bounded-seam questions:
- What caller intent does this seam grant?
- What invariant becomes true after crossing it?
- What tainted input becomes trusted here?
- What can still fail, and how is failure represented?
- What decisions are pure policy versus orchestration versus I/O?
- What internal details are leaking through the public API?
- If this seam were renamed by user goal instead of implementation shape, what would it be called?
5. Apply seam-type-specific questions.
### Workflow seam questions
- Given command X, under what rules does this workflow emit event Y or failure Z?
- Is the workflow mostly assembling decisions and capabilities, or is it hiding business rules inside orchestration?
- Does the workflow expose the business outcome more clearly than the implementation steps?
### Policy seam questions
- What exact business or security decision is being made?
- What inputs are sufficient for that decision?
- Are the reasons for approval or rejection explicit enough to review?
- Is the policy pure and deterministic?
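A policy seam that answers these questions well could look like the sketch below. It is only an illustration: `RenameProposal`, `decideRename`, the reason labels, and the `0.8` threshold are all hypothetical and do not come from any reviewed codebase. The point is that the decision is a pure function and every rejection carries explicit, reviewable reasons.

```typescript
// Hypothetical policy seam: every name and threshold here is illustrative.
type RenameProposal = {
  readonly proposedName: string;
  readonly confidence: number; // assumed range 0..1
  readonly existingNames: readonly string[];
};

type RejectionReason = "low-confidence" | "invalid-identifier" | "name-collision";

type RenameDecision =
  | { readonly kind: "accepted"; readonly acceptedName: string }
  | { readonly kind: "rejected"; readonly reasons: readonly RejectionReason[] };

const MIN_CONFIDENCE = 0.8; // assumed threshold, not taken from the source

// Pure and deterministic: no I/O, no clock, no randomness.
function decideRename(proposal: RenameProposal): RenameDecision {
  const reasons: RejectionReason[] = [];
  if (proposal.confidence < MIN_CONFIDENCE) {
    reasons.push("low-confidence");
  }
  if (!/^[A-Za-z_$][A-Za-z0-9_$]*$/.test(proposal.proposedName)) {
    reasons.push("invalid-identifier");
  }
  if (proposal.existingNames.indexOf(proposal.proposedName) !== -1) {
    reasons.push("name-collision");
  }
  return reasons.length === 0
    ? { kind: "accepted", acceptedName: proposal.proposedName }
    : { kind: "rejected", reasons };
}
```

Because the inputs are a plain value and the output is a tagged result, a reviewer can check the rule set without reading any orchestration code.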
### Public API questions
- Does the public surface expose one intent-first entrypoint or force the reviewer to reconstruct the context from many exports?
- Are internals importing back through the public barrel, making the seam circular or blurry?
- Does the API hide mechanics, or does it leak low-level helpers, transitional state names, or assembly details?
## What to Look For
### Strong seam signals
- intent-first workflow or context entrypoint
- explicit event and failure shapes
- pure policy decisions with explicit reasons
- visible trust boundaries
- narrow public API with clear authority
- internal modules importing direct local dependencies instead of re-entering `index.ts`
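A visible trust boundary is often expressed with a branded type and a single constructor. The following is a minimal sketch under assumptions: `SegmentId`, `parseSegmentId`, `describeSegment`, and the `seg-<8 hex chars>` format are invented for illustration only.

```typescript
// Hypothetical trust boundary: names and the ID format are illustrative.
type SegmentId = string & { readonly __brand: "SegmentId" };

type ParseResult =
  | { readonly ok: true; readonly value: SegmentId }
  | { readonly ok: false; readonly reason: string };

// The only place where a raw (tainted) string becomes a trusted SegmentId.
function parseSegmentId(raw: string): ParseResult {
  if (!/^seg-[0-9a-f]{8}$/.test(raw)) {
    return { ok: false, reason: "expected seg-<8 hex chars>" };
  }
  return { ok: true, value: raw as SegmentId };
}

// Downstream code accepts only the trusted type, so passing a plain
// string here is a compile-time error.
function describeSegment(id: SegmentId): string {
  return "segment " + id;
}
```

With this shape a reviewer can answer "what tainted input becomes trusted here?" by reading one function.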
### Weak seam signals
- state or type names that describe pipeline progress rather than reviewer intent
- barrels that re-export nearly everything
- internal files importing from the public API of their own context
- mixed concerns inside one module such as parsing, policy, model math, and artifact derivation
- service or workflow code hiding business rules in control flow
- reviewer needing broad context just to answer one contract question
## Output Format
Return a compact report with these sections:
```markdown
## Seam Under Review
- Scope: ...
- Top bounded seam: ...
- Inner review seams: ...
## What Makes This Easy or Hard to Review
- Strengths: ...
- Friction points: ...
- Trust-boundary visibility: ...
- Naming quality: ...
## Seam Evaluation
- Caller intent: ...
- Protected invariant: ...
- Tainted to trusted transition: ...
- Failure contract: ...
- Public API leakage: ...
- Policy / workflow / IO separation: ...
## Judgment
- Seam quality: strong | mixed | weak
- Main issue: ...
- Why review load is high or low: ...
## Smallest High-Value Improvements
- 1. ...
- 2. ...
- 3. ...
```
## Naming Guidance During Review
If naming is part of the problem:
1. state why the current name is weak
2. propose three alternatives when the name is important
3. choose the option that best preserves caller intent across likely model evolution
4. prefer intent-first names such as accepted, confirmed, authorized, selected, or emitted outcomes over vague ready/process labels when those better describe the protected invariant
## Constraints
- Do not write implementation code unless the user explicitly asks for a refactor.
- Do not demand extra layers if the current seam can be clarified with naming or API reduction.
- Do not treat every state-machine name as wrong; flag it only when it increases review load or hides the actual invariant.
- Do not weaken security or trust-boundary checks in the name of simplicity.
## Success Criteria
A good seam review should let a human reviewer answer:
- what the main seam is
- what it promises
- why it is easy or hard to review
- where tainted input enters and becomes trusted
- which names or exports are increasing review load
- what the smallest justified improvements are
@@ -0,0 +1,66 @@
# Feature Design Map: Recovery Pipeline
## Bounded Contexts
- `ingest-snapshot` — owns deterministic upstream bundle ingest, segment boundaries, canonical source projection, and run manifests.
- `dependency-recovery` — owns vendored package identification, dependency decisions, externalization, and bundled fallback preservation.
- `static-context-evidence` — owns deterministic context packets, binding graphs, and usage evidence for downstream consumers.
- `snapshot-lineage` — owns adjacent-run matching, durable lineage, change classification, relabel eligibility, and upstream summary facts.
- `iterative-naming` — owns relabel queue planning, batch execution handoff, wave reconciliation, safe rename acceptance, and naming memory updates.
- `codebase-regularization` — owns deterministic file placement, structural splitting, import/export reconstruction, and canonical editable tree emission.
- `maintained-transform-replay` — owns replay of long-lived maintained transforms and replay conflict reporting.
- `release-packaging` — owns release artifact assembly, provenance manifests, and publication-ready outputs.
## Feature Step to Workflow Slice Map
| Feature Step | Bounded Context | Workflow Slice | Notes |
| :----------- | :-------------- | :------------- | :---- |
| Ingest upstream bundle snapshot into deterministic recovery artifacts | `ingest-snapshot` | `deterministic-bundle-ingest` | Produces the canonical per-run source of truth used by all later slices. |
| Identify the next vendored package decision from one source | `dependency-recovery` | `identify-next-vendored-package-decision-from-source` | Consumes ingest artifacts, emits one dependency decision at a time, and signals when no more plausible candidates remain. |
| Replace accepted vendored packages with external dependencies while keeping fallbacks | `dependency-recovery` | `externalize-accepted-dependencies` | Depends on identified package decisions; unresolved packages stay bundled. |
| Extract deterministic context packets for each segment | `static-context-evidence` | `extract-segment-context` | Consumes ingest output after dependency treatment to emit machine-readable evidence. |
| Compare adjacent runs and classify lineage-aware changes | `snapshot-lineage` | `diff-adjacent-runs` | Consumes current and previous run manifests plus Phase 3 context. |
| Rank relabel candidates into deterministic queue packets | `iterative-naming` | `plan-relabel-queue` | Uses only new and modified segments from snapshot-lineage. |
| Execute queued relabel batches against the model provider in waves | `iterative-naming` | `execute-wave-batches` | Owns outbound API execution only; no naming decisions are applied here. |
| Evaluate responses, accept safe names, and update queue state | `iterative-naming` | `evaluate-and-apply-renames` | Reconciles at wave boundary and updates naming memory. |
| Emit the canonical editable recovered tree | `codebase-regularization` | `regularize-editable-tree` | Must preserve build-first while improving navigability. |
| Replay long-lived maintained transforms onto the regularized tree | `maintained-transform-replay` | `replay-maintained-transforms` | Carries durable local changes across upgrades. |
| Build release artifacts and publication metadata | `release-packaging` | `build-and-publish-artifacts` | Packages processed and unmodified artifacts for traceable release output. |
## Cross-Context Handoffs
- `ingest-snapshot` -> `dependency-recovery` via run manifest, segments, and canonical projection because vendored matching starts from deterministic ingest evidence.
- `ingest-snapshot` -> `static-context-evidence` via stable segment records because context extraction depends on canonical segment boundaries.
- `dependency-recovery` -> `static-context-evidence` via accepted externalization decisions and preserved fallbacks because context packets must describe the post-decision code surface.
- `static-context-evidence` -> `snapshot-lineage` via deterministic context packets because fuzzy matching and summary facts need machine-readable evidence.
- `snapshot-lineage` -> `iterative-naming` via relabel-eligible changed/new segments and ambiguity reports because only safe changed material should enter naming work.
- `iterative-naming` -> `codebase-regularization` via safely renamed generated source and naming memory because regularization should operate on the best accepted recovered names.
- `codebase-regularization` -> `maintained-transform-replay` via canonical editable tree and placement mappings because replay targets the regularized tree, not the pre-regularized source.
- `maintained-transform-replay` -> `release-packaging` via replay outcomes and transformed tree state because releases must reflect which maintained transforms were applied, skipped, or conflicted.
## Recommended Slice Order
1. `ingest-snapshot/deterministic-bundle-ingest` — all later slices depend on deterministic ingest artifacts and canonical segment boundaries.
2. `dependency-recovery/identify-next-vendored-package-decision-from-source` — shrinks the app-authored surface one source decision at a time before later evidence and naming work.
3. `dependency-recovery/externalize-accepted-dependencies` — completes dependency treatment before downstream evidence extraction.
4. `static-context-evidence/extract-segment-context` — provides deterministic evidence used by diffing, summaries, and transform anchoring.
5. `snapshot-lineage/diff-adjacent-runs` — identifies changed/new material and durable lineage needed for iterative naming.
6. `iterative-naming/plan-relabel-queue` — transforms changed material into deterministic naming work packets.
7. `iterative-naming/execute-wave-batches` — executes persisted batches without applying names yet.
8. `iterative-naming/evaluate-and-apply-renames` — applies only accepted names after wave reconciliation.
9. `codebase-regularization/regularize-editable-tree` — emits the canonical browsable tree once safe names are available.
10. `maintained-transform-replay/replay-maintained-transforms` — reapplies durable local changes onto the regularized tree.
11. `release-packaging/build-and-publish-artifacts` — packages the final tree and release metadata last.
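One way to sanity-check this order against the Cross-Context Handoffs list is to treat each handoff as an edge and confirm that every upstream context is scheduled before its downstream context. The sketch below is illustrative only: `handoffs`, `contextOrder`, and `findOrderViolations` are hypothetical names, and the slice order is collapsed to one entry per bounded context.

```typescript
// Handoff edges copied from the Cross-Context Handoffs list: [upstream, downstream].
const handoffs: ReadonlyArray<readonly [string, string]> = [
  ["ingest-snapshot", "dependency-recovery"],
  ["ingest-snapshot", "static-context-evidence"],
  ["dependency-recovery", "static-context-evidence"],
  ["static-context-evidence", "snapshot-lineage"],
  ["snapshot-lineage", "iterative-naming"],
  ["iterative-naming", "codebase-regularization"],
  ["codebase-regularization", "maintained-transform-replay"],
  ["maintained-transform-replay", "release-packaging"],
];

// Recommended slice order, collapsed to bounded contexts.
const contextOrder: readonly string[] = [
  "ingest-snapshot",
  "dependency-recovery",
  "static-context-evidence",
  "snapshot-lineage",
  "iterative-naming",
  "codebase-regularization",
  "maintained-transform-replay",
  "release-packaging",
];

// Returns every handoff whose upstream context is missing from the order
// or is not scheduled strictly before its downstream context.
function findOrderViolations(
  order: readonly string[],
  edges: ReadonlyArray<readonly [string, string]>
): string[] {
  const violations: string[] = [];
  edges.forEach(([from, to]) => {
    const fromIndex = order.indexOf(from);
    const toIndex = order.indexOf(to);
    if (fromIndex === -1 || toIndex === -1 || fromIndex >= toIndex) {
      violations.push(from + " -> " + to);
    }
  });
  return violations;
}
```

A check like this could be rerun whenever a handoff is added, so the recommended order and the handoff list do not drift apart.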
## Orchestration Notes
- The feature-level pipeline is linear by default, but review-needed findings do not automatically halt later safe slices in MVP.
- `iterative-naming` contains three slices inside one bounded context; only wave orchestration crosses those slice boundaries.
- Cross-context decisions stay at handoff seams: each slice makes decisions only over state owned by its context.
- Build-first remains the feature-level acceptance rule, especially across `codebase-regularization`, `maintained-transform-replay`, and `release-packaging`.
## Open Questions
- `static-context-evidence` will consume post-externalization source as its canonical input; if pre-externalization review becomes necessary later, treat it as a secondary review artifact rather than the main slice input.
- The release docs imply publication is optional; the exact publication handoff seam inside `release-packaging` is still open.
- Build verification is a hard invariant, but the repository-wide command set for that verification is not yet frozen in the design artifacts.
@@ -0,0 +1,95 @@
# Feature Discovery: Recovery Pipeline
## 1. Commands (User Intents)
- Pipeline operator wants to ingest an upstream bundle snapshot because they need a deterministic base for recovery work.
- Pipeline operator wants to identify and externalize vendored dependencies because they want to shrink the app-authored surface that later phases must understand.
- Pipeline operator wants to extract deterministic context because later phases need machine-readable evidence without relying on an LLM as the source of truth.
- Pipeline operator wants to diff the current snapshot against the previous snapshot because they want durable lineage, compact upstream summaries, and to avoid resending unchanged material for naming.
- Pipeline operator wants to iteratively relabel changed and new code because they want a more browsable recovered tree with readable names across modules, functions, locals, and parameters.
- Pipeline operator wants to regularize recovered output into a canonical editable tree because they care most about a browsable codebase.
- Pipeline operator wants the recovered tree to build because buildability is the current hard success invariant.
- Pipeline operator wants uncertain areas surfaced in manifests and reports because uncertainty should not block MVP progress.
- Pipeline operator wants manual runtime rescue patches captured as formal maintained transforms because repeated upgrades should become replayable.
- Pipeline operator wants to publish processed and unmodified artifacts with provenance because releases should remain traceable to the upstream snapshot.
## 2. Events (Domain Facts)
- Upstream snapshot ingested (payload: run ID, upstream snapshot identity, emitted manifest, emitted segments).
- Dependency candidate identified (payload: candidate package, evidence, recovered segment boundary).
- Dependency decision recorded (payload: accepted|rejected|unresolved, confidence, rationale, fallback reference).
- Context packet extracted (payload: segment ID, bindings, links, evidence, heuristics).
- Run diff completed (payload: unchanged|modified|new|deleted|split|merged|ambiguous classifications, lineage updates).
- Relabel candidate queued (payload: candidate ID, pass kind, evidence score, difficulty score, priority score).
- Batch wave executed (payload: wave ID, batch IDs, model/config, execution outcomes).
- Rename proposal evaluated (payload: accepted|deferred|stalled|exhausted outcomes, rejection reasons, counters).
- Accepted names applied (payload: candidate fields renamed, updated source/metadata, naming-memory updates).
- Regularized tree emitted (payload: canonical repo-root tree, regularization manifest, placement mappings).
- Review-needed artifact emitted (payload: phase, machine-readable report, concise human summary).
- Maintained transform replayed (payload: applied|conflict|skipped outcome, transform metadata, replay report).
- Release artifact set emitted (payload: processed-source artifact, unmodified-source artifact, release manifest, release notes).
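Events with enumerated outcomes like these map naturally onto discriminated unions. The sketch below models two of them; the outcome labels come from the list above, while the type names, field names, and `summarize` helper are assumptions made for illustration.

```typescript
// Field and type names are hypothetical; only the outcome labels come from the event list.
type DependencyDecisionRecorded = {
  readonly type: "DependencyDecisionRecorded";
  readonly outcome: "accepted" | "rejected" | "unresolved";
  readonly confidence: number;
  readonly rationale: string;
  readonly fallbackReference: string;
};

type MaintainedTransformReplayed = {
  readonly type: "MaintainedTransformReplayed";
  readonly outcome: "applied" | "conflict" | "skipped";
  readonly transformId: string;
};

type RecoveryEvent = DependencyDecisionRecorded | MaintainedTransformReplayed;

// Exhaustive handling: adding a new event type without a case fails to compile.
function summarize(e: RecoveryEvent): string {
  switch (e.type) {
    case "DependencyDecisionRecorded":
      return "dependency " + e.outcome + ": " + e.rationale;
    case "MaintainedTransformReplayed":
      return "transform " + e.transformId + " " + e.outcome;
    default: {
      const unreachable: never = e;
      return unreachable;
    }
  }
}
```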
## 3. Business Rules & Invariants
- Rule: The repo root is always the latest canonical editable recovered tree.
- Rule: Per-run artifacts, evidence, queue state, and review reports live under `runs/`.
- Rule: Buildability outranks readability; risky naming or regularization must not be accepted if it jeopardizes correctness.
- Rule: Runtime completeness is desirable but not required for MVP progression if the output still builds and remains browsable.
- Rule: Uncertainty should be surfaced in manifests and reports instead of silently guessed away.
- Rule: For MVP, review-needed states should not halt the entire pipeline if later phases can proceed safely.
- Rule: Later phases must consume deterministic machine-readable artifacts as source of truth.
- Rule: LLM output may assist naming and ambiguous ranking, but must not become the source of truth for deterministic structure, matching, or safety decisions.
- Rule: The root recovered tree is generated, not hand-maintained between runs.
- Rule: Upgrades should start from raw ingest, reuse deterministic prior evidence where valid, then replay maintained transforms.
- Rule: If manual fixes are needed because the code is not runnable, those fixes should become formal Phase 9 maintained transforms.
- Invariant: Build-first is the current formal verification bar for successful regularization/publishing.
- Invariant: If a more navigable regularization attempt breaks the build, the failed attempt must be surfaced for review rather than silently degraded.
- Invariant: Review surfacing must include both machine-readable artifacts and concise human-readable summaries.
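The build-first invariants above can be read as a small gate: an attempt that builds is emitted, and an attempt that fails is surfaced with both a machine-readable report and a short human summary. The following sketch assumes hypothetical names (`BuildResult`, `RegularizationOutcome`, `gateOnBuild`) and takes the build result as an input so the decision itself stays pure.

```typescript
// Hypothetical shapes; the actual build command set is not defined in the source.
type BuildResult = { readonly succeeded: boolean; readonly log: string };

type RegularizationOutcome =
  | { readonly kind: "tree-emitted"; readonly attemptLabel: string }
  | {
      readonly kind: "review-needed";
      // Machine-readable artifact.
      readonly report: {
        readonly phase: string;
        readonly attemptLabel: string;
        readonly buildLog: string;
      };
      // Concise human-readable summary.
      readonly summary: string;
    };

// A failed attempt is never silently degraded: it always becomes a review-needed outcome.
function gateOnBuild(attemptLabel: string, build: BuildResult): RegularizationOutcome {
  if (build.succeeded) {
    return { kind: "tree-emitted", attemptLabel };
  }
  return {
    kind: "review-needed",
    report: { phase: "codebase-regularization", attemptLabel, buildLog: build.log },
    summary: "Regularization attempt '" + attemptLabel + "' failed to build; surfaced for review.",
  };
}
```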
## 4. Edge Cases Handled
- Case: Dependency match confidence is low or colliding -> record as unresolved or review-needed instead of forcing externalization.
- Case: Vendored replacement may drift from bundled behavior -> preserve bundled fallback implementations for validation and safety.
- Case: Diff matching remains contested -> emit `ambiguous` artifacts and exclude those segments from automated lineage-dependent actions.
- Case: Rename candidates lack sufficient evidence -> keep them visible in queue state, defer and retry deterministically, then allow terminal `stalled` or `exhausted` outcomes rather than retrying forever.
- Case: Model response is low confidence, insufficiently specific, invalid, or collision-prone -> reject deterministically and feed structured reasons back into queue state.
- Case: A more readable split or placement would make the tree fail to build -> surface the failed regularization attempt for review.
- Case: Runtime behavior is incomplete after recovery -> allow manual rescue patches, but capture durable fixes as maintained transforms when they must persist across upgrades.
- Case: Publication fails after artifacts are built -> keep local built artifacts and separate publication failure from build failure.
- Case: Review-needed findings appear in MVP -> continue later safe phases while recording artifacts for later inspection.
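The contested-diff case can be sketched as a partition over classified segments. The classification labels come from the "Run diff completed" event; `ClassifiedSegment`, `isRelabelEligible`, and `partitionForNaming` are hypothetical names. The sketch assumes that only `modified` and `new` segments enter naming work and that `ambiguous` segments are routed to review; how `split` and `merged` segments are treated is not stated in the source, so here they are simply left out of both groups.

```typescript
type ChangeClassification =
  | "unchanged"
  | "modified"
  | "new"
  | "deleted"
  | "split"
  | "merged"
  | "ambiguous";

type ClassifiedSegment = {
  readonly segmentId: string;
  readonly classification: ChangeClassification;
};

// Assumption: only modified and new segments are relabel-eligible.
function isRelabelEligible(c: ChangeClassification): boolean {
  return c === "modified" || c === "new";
}

// Ambiguous segments are excluded from automated actions and kept for review.
function partitionForNaming(segments: readonly ClassifiedSegment[]): {
  eligible: ClassifiedSegment[];
  needsReview: ClassifiedSegment[];
} {
  const eligible: ClassifiedSegment[] = [];
  const needsReview: ClassifiedSegment[] = [];
  segments.forEach((s) => {
    if (isRelabelEligible(s.classification)) {
      eligible.push(s);
    } else if (s.classification === "ambiguous") {
      needsReview.push(s);
    }
  });
  return { eligible, needsReview };
}
```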
## 5. Candidate Bounded Contexts
- Ingest & Snapshot Evidence: owns deterministic bundle ingest, segment records, and canonical projections.
- Dependency Recovery: owns vendored package identification, confidence decisions, externalization, and fallback preservation.
- Static Context Evidence: owns deterministic context extraction artifacts and evidence packets.
- Snapshot Lineage & Change Detection: owns run-to-run matching, lineage, change classification, and upstream summaries.
- Iterative Naming: owns relabel queue planning, batch execution handoff, semantic acceptance, safe rename application, and naming memory.
- Codebase Regularization: owns deterministic file/folder placement, structural splitting, import/export reconstruction, and editable-tree emission.
- Maintained Transform Replay: owns deterministic replay of long-lived transforms and replay conflict reporting.
- Release Packaging: owns artifact packaging, provenance manifests, and optional publication.
## 6. Candidate Workflow Slices
- ingest-snapshot/deterministic-bundle-ingest: turn an upstream bundle into deterministic segment records and canonical source projection.
- dependency-recovery/identify-next-vendored-package-decision-from-source: recover the next plausible vendored candidate from one source and record one dependency decision or exhaustion.
- dependency-recovery/externalize-accepted-dependencies: replace accepted vendored code with npm imports while preserving fallbacks.
- static-context-evidence/extract-segment-context: emit canonical context packets and binding/link evidence.
- snapshot-lineage/diff-adjacent-runs: classify changes, mint lineage, and produce relabel queues plus upstream summaries.
- iterative-naming/plan-relabel-queue: compute candidate evidence, difficulty, priority, and batch-ready work items.
- iterative-naming/execute-wave-batches: send persisted batch artifacts to the model provider in parallel waves.
- iterative-naming/evaluate-and-apply-renames: validate wave results, accept safe names, update queue state, and refresh naming memory.
- codebase-regularization/regularize-editable-tree: produce the canonical repo-root tree with deterministic placement and mappings.
- maintained-transform-replay/replay-maintained-transforms: apply stored transforms safely and emit replay outcomes.
- release-packaging/build-and-publish-artifacts: package processed and unmodified artifacts with release metadata.
## 7. Shared Language Notes
- Preferred term: Recovery Pipeline = the full release-oriented workflow that turns an upstream bundle snapshot into a buildable, browsable recovered tree plus release artifacts.
- Preferred term: Recovered Tree = the canonical editable source tree emitted at repo root.
- Preferred term: Build-first = the current formal invariant that the recovered tree must build even if runtime completeness is still partial.
- Preferred term: Review-needed artifact = a machine-readable report plus concise human summary describing uncertainty, failure, or conflict that requires later inspection.
- Preferred term: Maintained Transform = a durable replayable change stored outside the numbered upstream-processing pipeline and reapplied in Phase 9.
- Preferred term: Naming Memory = accepted-name history reused to improve future relabel iterations.
- Avoid: “original repo layout” when you mean the deterministic regularized editable tree.
- Avoid: “runtime complete” when you only mean “buildable and browsable enough to inspect.”
@@ -0,0 +1,124 @@
# Design Status: Recovery Pipeline
## Feature
- Name: `Recovery Pipeline`
- Feature slug: `recovery-pipeline`
- Current phase: `Implementation Security Review`
- Overall status: `Assembly Complete`
- Security verification status: `Not Started`
- Current workflow slice: `ingest-snapshot/deterministic-bundle-ingest`
## Feature Artifacts
- [x] `design/feature/recovery-pipeline/discovery.md`
- [x] `design/feature/recovery-pipeline/design.md`
- [x] `design/feature/recovery-pipeline/status.md`
## Feature Discovery Gate
- [x] feature goal and actor intents captured
- [x] commands and events identified at feature level
- [x] business rules and invariants captured at feature level
- [x] edge cases captured at feature level
- [x] candidate bounded contexts identified
- [x] candidate workflow inventory identified
- [x] project-wide shared-language updates captured
- [x] approved for context and workflow decomposition
## Context & Workflow Decomposition Gate
- [x] bounded contexts confirmed
- [x] feature steps mapped to workflow slices
- [x] cross-context handoffs recorded
- [x] per-context shared-language files created or updated
- [x] workflow folders created with `01-decomposition.md`
- [x] recommended slice order recorded
- [ ] approved to begin slice discovery
## Workflow Slice Tracker
| Bounded Context | Workflow Slice | Slice Discovery | Core Sketch | Blueprint | Design Security | Assembly | Impl Security | Refactor | Notes |
| :-------------- | :------------- | :-------------- | :---------- | :-------- | :-------------- | :------- | :------------ | :------- | :---- |
| `ingest-snapshot` | `deterministic-bundle-ingest` | `Complete` | `Complete` | `Complete` | `Complete` | `Complete` | `Not Started` | `Not Started` | `Foundational source-of-truth slice.` |
| `dependency-recovery` | `identify-next-vendored-package-decision-from-source` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Shrinks app-authored surface one decision at a time.` |
| `dependency-recovery` | `externalize-accepted-dependencies` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Depends on package identification decisions.` |
| `static-context-evidence` | `extract-segment-context` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Produces deterministic evidence for downstream consumers.` |
| `snapshot-lineage` | `diff-adjacent-runs` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Owns lineage and changed/new segment routing.` |
| `iterative-naming` | `plan-relabel-queue` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Queue planning only.` |
| `iterative-naming` | `execute-wave-batches` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Outbound model execution only.` |
| `iterative-naming` | `evaluate-and-apply-renames` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Safe deterministic acceptance and application.` |
| `codebase-regularization` | `regularize-editable-tree` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Must preserve build-first invariant.` |
| `maintained-transform-replay` | `replay-maintained-transforms` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Carries maintained changes across upgrades.` |
| `release-packaging` | `build-and-publish-artifacts` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Not Started` | `Release-oriented output only.` |
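A tracker row can be read mechanically: the next phase for a slice is the first column that is not yet `Complete`. The sketch below is a hypothetical helper (`SliceRow`, `nextPhase`, and `ingestRow` are illustrative names), using the column headers and status values from the table above.

```typescript
// Phase columns in the order they appear in the tracker table.
const phaseColumns = [
  "Slice Discovery",
  "Core Sketch",
  "Blueprint",
  "Design Security",
  "Assembly",
  "Impl Security",
  "Refactor",
] as const;

type PhaseName = typeof phaseColumns[number];
type PhaseStatus = "Complete" | "Not Started";

type SliceRow = {
  readonly slice: string;
  readonly statuses: Readonly<Record<PhaseName, PhaseStatus>>;
};

// Returns the first phase that is not Complete, or "Slice Complete" if none remain.
function nextPhase(row: SliceRow): PhaseName | "Slice Complete" {
  const pending = phaseColumns.filter((phase) => row.statuses[phase] !== "Complete");
  const first = pending[0];
  return first === undefined ? "Slice Complete" : first;
}

// The `deterministic-bundle-ingest` row as recorded in the table.
const ingestRow: SliceRow = {
  slice: "ingest-snapshot/deterministic-bundle-ingest",
  statuses: {
    "Slice Discovery": "Complete",
    "Core Sketch": "Complete",
    "Blueprint": "Complete",
    "Design Security": "Complete",
    "Assembly": "Complete",
    "Impl Security": "Not Started",
    "Refactor": "Not Started",
  },
};
```

For `ingestRow`, `nextPhase` returns `Impl Security`, which lines up with the current phase recorded at the top of this status file.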
## Current Slice Gates
### Slice Discovery Gate
- [x] selected slice named explicitly
- [x] happy path captured
- [x] edge cases captured
- [x] business rules and invariants captured
- [x] handoff assumptions captured
- [x] context shared-language updates captured
- [x] approved for core sketch
### Core Sketch Gate
- [x] required state is explicit
- [x] command and events are explicit
- [x] policy signature is explicit
- [x] slice boundaries are explicit
- [x] no cross-context decision logic inside the slice
- [x] approved for blueprint
### Blueprint Gate
- [x] domain types make illegal states harder to express
- [x] shared concepts reused appropriately
- [x] policy is pure
- [x] reducer/apply shape is explicit
- [x] workflow contract is explicit
- [x] approved for design security review or assembly
### Design Security Gate
- [x] trust boundaries reviewed
- [x] authority and least privilege reviewed
- [x] sink and data-flow risks reviewed
- [x] blocking findings resolved or explicitly accepted
- [x] approved for assembly
### Assembly Gate
- [x] tests added
- [x] implementation completed
- [x] types pass
- [x] tests passing
- [x] effect AST checks run for modified Effect files
- [x] approved for implementation security review or next slice
### Implementation Security Gate
- [ ] implementation security review completed or explicitly deferred
- [ ] blocking findings resolved or explicitly accepted
- [ ] approved for refactor consideration or next slice
### Refactor Gate
- [ ] diagnosis completed if structural changes were needed
- [ ] execution completed if approved
- [ ] verification rerun after refactor
- [ ] slice complete
## Open Questions / Blockers
- Build-first is selected, but the exact build command set is still implementation-specific.
- The release docs imply publication is optional; the exact publication handoff seam inside `release-packaging` is still open.
## Context Handoff Notes
- Read first: `design/feature/recovery-pipeline/discovery.md`
- Current focus: `Implementation Security Review` for `ingest-snapshot/deterministic-bundle-ingest`
- Do not change: `Buildability outranks readability, repo root is the latest editable tree, review-needed states continue in MVP, and uncertainty is surfaced through manifests and reports.`
@@ -0,0 +1,32 @@
# Workflow Decomposition: Regularize Editable Tree
- Bounded context: `codebase-regularization`
- Workflow slug: `regularize-editable-tree`
- Trigger: `Regularize recovered source into canonical editable tree`
- Success outcome: Deterministic file placement, structural splits, and import/export reconstruction yield the canonical regularized tree at repo root.
## Inputs Owned by This Context
- structural split rules
- placement mapping rules
- import/export reconstruction rules
## Inputs Observed from Other Contexts
- safely renamed generated source from `iterative-naming`
- lineage or placement-stability hints from earlier phases when available
## Downstream Handoffs
- emit canonical editable tree and regularization manifests to `maintained-transform-replay`
- emit placement mappings and recovery-scaffolding counts for release summaries
## Slice Boundaries
- Decision logic is owned only by `codebase-regularization`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `iterative-naming/evaluate-and-apply-renames` so regularization uses the best accepted recovered names.
- Must preserve the build-first invariant before handoff.
@@ -0,0 +1,14 @@
# Shared Language: Codebase Regularization
## Context Meaning
- Owns deterministic reshaping of recovered source into the canonical editable tree that humans can browse and modify.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Regularized Tree` | Canonical editable source tree emitted from recovered source | `Regularized Tree` not `original repo layout` | The goal is navigability, not historical reconstruction. |
| `Placement Mapping` | Deterministic record of where recovered units are placed in the editable tree | `Placement Mapping` not `file map` | Emphasizes stable placement as a domain output. |
| `Recovery Scaffolding` | Wrapper or bridge kept only when deterministic reconstruction is insufficient | `Recovery Scaffolding` not `hack` | Makes exceptional structure explicit and reviewable. |
| `Structural Split` | Deterministic division of a coarse recovered unit into smaller modules | `Structural Split` not `file breakup` | Domain term for regularization reshaping. |
@@ -0,0 +1,38 @@
# Dependency Recovery Context
**Vendored Package**: third-party code embedded inside the upstream bundle and considered for recovery as an external dependency.
_Avoid_: library blob, bundled package blob
**Dependency Decision**: the context-owned determination that a vendored candidate is accepted, rejected, or unresolved with recorded evidence and rationale.
_Avoid_: match result, package guess
**Acceptance Threshold**: the configurable confidence score boundary at or above which a vendored candidate is accepted automatically.
_Avoid_: hard-coded cutoff, fixed confidence bar
**Rejected Candidate**: a vendored candidate whose evidence deterministically supports that it is not a package match worth externalizing.
_Avoid_: low-confidence maybe, unresolved miss
**Unresolved Candidate**: a vendored candidate that remains plausible but is below the acceptance threshold, colliding, ambiguous, or otherwise unsafe to accept.
_Avoid_: rejected maybe, ignored candidate
**Fallback Preservation**: keeping bundled code available when externalization is unsafe or unresolved so later validation can compare behaviors safely.
_Avoid_: leave old code around, dead backup code
**Externalization**: replacing accepted vendored code with an external dependency reference without deleting the original bundled implementation.
_Avoid_: strip dependency, remove vendor code
## Example dialogue
> **Developer:** "If a candidate scores below the configured threshold, do we reject it?"
> **Domain Expert:** "No. If it still looks plausible, it stays unresolved until stronger evidence appears or a reviewer decides otherwise."
>
> **Developer:** "When do we mark a candidate rejected?"
> **Domain Expert:** "Only when the evidence deterministically says it is not a vendored package match worth externalizing."
>
> **Developer:** "Does accepting a vendored package delete the bundled code?"
> **Domain Expert:** "No. Externalization still preserves the bundled implementation as fallback evidence."
## Flagged ambiguities
- "Low confidence" alone does not mean `rejected`; low-confidence but still plausible candidates are `Unresolved Candidates`.
- "Match result" was too scoring-shaped; the preferred term is `Dependency Decision` because this context owns a reviewer-facing decision with rationale.
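The distinction between `rejected` and `unresolved` can be sketched in TypeScript as follows. This is a minimal illustration, not the contract: the `CandidateAssessment` shape, its field names, and the order of the checks are assumptions made for the example.

```typescript
// Hypothetical shapes for illustration only; field names are assumptions.
type DecisionState = "accepted" | "rejected" | "unresolved";

interface CandidateAssessment {
  confidenceScore: number; // deterministic score of the strongest match
  deterministicallyNotAMatch: boolean; // counter-evidence proves this is not a vendored package
  colliding: boolean; // competing matches cannot be settled safely
  ambiguous: boolean; // evidence is plausible but inconclusive
}

function classifyCandidate(
  candidate: CandidateAssessment,
  acceptanceThreshold: number,
): DecisionState {
  // Rejection requires deterministic counter-evidence, never just a low score.
  if (candidate.deterministicallyNotAMatch) return "rejected";
  // Collisions and ambiguity are unsafe to accept, even above the threshold.
  if (candidate.colliding || candidate.ambiguous) return "unresolved";
  // At or above the configurable threshold: accept. Below it: stay unresolved.
  return candidate.confidenceScore >= acceptanceThreshold ? "accepted" : "unresolved";
}
```

A low score on its own therefore never produces `rejected`; it only keeps the candidate out of `accepted`.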
@@ -0,0 +1,31 @@
# Workflow Decomposition: Externalize Accepted Dependencies
- Bounded context: `dependency-recovery`
- Workflow slug: `externalize-accepted-dependencies`
- Trigger: `Externalize accepted vendored packages`
- Success outcome: Accepted vendored packages are replaced with external dependency references while unresolved or unsafe cases keep bundled fallbacks.
## Inputs Owned by This Context
- accepted dependency decisions
- fallback preservation rules
- externalization output records
## Inputs Observed from Other Contexts
- canonical source projection from `ingest-snapshot`
- accepted, rejected, and unresolved dependency decisions from `identify-next-vendored-package-decision-from-source`
## Downstream Handoffs
- emit post-externalization recovered source and fallback records to `static-context-evidence`
- emit dependency externalization results to later release summaries
## Slice Boundaries
- Decision logic is owned only by `dependency-recovery`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `dependency-recovery/identify-next-vendored-package-decision-from-source` before execution because only accepted decisions may be externalized.
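A minimal TypeScript sketch of this workflow's core rule, assuming a simplified decision shape: only accepted decisions produce externalization records, and each record keeps its fallback reference so the bundled implementation stays available. The `Decision` and `ExternalizationRecord` types and the `externalizeAccepted` name are hypothetical.

```typescript
// Hypothetical, simplified decision shape; the real contract carries more evidence.
type Decision =
  | { state: "accepted"; packageName: string; segmentIds: string[]; fallbackReference: string }
  | { state: "rejected"; segmentIds: string[] }
  | { state: "unresolved"; segmentIds: string[] };

interface ExternalizationRecord {
  packageName: string;
  replacedSegmentIds: string[];
  fallbackReference: string; // the bundled implementation is preserved, not deleted
}

function externalizeAccepted(decisions: Decision[]): ExternalizationRecord[] {
  const records: ExternalizationRecord[] = [];
  for (const decision of decisions) {
    // Rejected and unresolved candidates keep their bundled code untouched.
    if (decision.state !== "accepted") continue;
    records.push({
      packageName: decision.packageName,
      replacedSegmentIds: decision.segmentIds.slice(),
      fallbackReference: decision.fallbackReference,
    });
  }
  return records;
}
```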
@@ -0,0 +1,33 @@
# Workflow Decomposition: Identify Next Vendored Package Decision From Source
- Bounded context: `dependency-recovery`
- Workflow slug: `identify-next-vendored-package-decision-from-source`
- Trigger: `Identify next vendored package decision from source`
- Success outcome: The next plausible vendored candidate from one source yields either a single accepted, rejected, or unresolved dependency decision, or an explicit no-more-candidates outcome.
## Inputs Owned by This Context
- vendored candidate scoring rules
- dependency decision records
- confidence thresholds and rationale model
## Inputs Observed from Other Contexts
- run manifest from `ingest-snapshot`
- segment records and canonical projection from `ingest-snapshot`
- optional runtime traces treated only as tie-break evidence
## Downstream Handoffs
- emit dependency decisions and evidence to `externalize-accepted-dependencies`
- emit accepted, rejected, and unresolved decision manifests for later review and release summaries
## Slice Boundaries
- Decision logic is owned only by `dependency-recovery`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `ingest-snapshot/deterministic-bundle-ingest` first because vendored matching starts from deterministic ingest evidence.
- Precedes `dependency-recovery/externalize-accepted-dependencies` because externalization depends on accepted decisions.
@@ -0,0 +1,59 @@
# Slice Discovery: Identify Next Vendored Package Decision From Source
- Bounded context: `dependency-recovery`
- Workflow slug: `identify-next-vendored-package-decision-from-source`
## Happy Path
- The workflow receives deterministic ingest artifacts from `ingest-snapshot`, including the run manifest, segment records, and canonical source projection.
- The workflow inspects the source snapshot and deterministically recovers the next plausible vendored candidate boundary that has not already been decided.
- The workflow computes a deterministic confidence ranking for package matches on that next candidate by combining the configured evidence signals.
- The workflow compares the strongest candidate match against the configurable acceptance threshold.
- If the strongest candidate is at or above the acceptance threshold, the workflow records one `accepted` dependency decision.
- If the strongest candidate remains plausible but is below threshold, colliding, or ambiguous, the workflow records one `unresolved` dependency decision.
- If the candidate evidence deterministically supports that it is not a vendored package match, the workflow records one `rejected` dependency decision.
- The workflow emits one decision record with evidence, rationale, boundary notes, and replacement planning hints.
- When no plausible undecided vendored candidates remain for the source under the current rules, the workflow emits an explicit no-more-candidates outcome.
- Downstream `externalize-accepted-dependencies` consumes the accumulated accepted decisions for externalization while preserving unresolved and rejected records for review and summaries.
## Edge Cases
- A package is split across multiple obfuscated wrappers -> recover one vendored candidate boundary spanning the related segments and record the grouping rationale.
- Multiple package matches compete for the same segment group -> keep the stronger deterministic ranking, but if the competition remains plausible and unsafe to settle automatically, emit `unresolved` rather than forcing rejection.
- A candidate has some evidence but stays below the configured acceptance threshold -> emit `unresolved`.
- A candidate has strong counter-evidence that it is app-authored or otherwise not a vendored package match -> emit `rejected`.
- Runtime traces are unavailable -> continue with static evidence only.
- Runtime traces conflict with static evidence -> record the mixed provenance and prefer `unresolved` unless the conflict is resolved deterministically.
- A candidate package boundary is only partially recoverable -> record the partial boundary notes and keep the decision unresolved unless the recoverable portion still supports a deterministic accepted or rejected decision.
- Two runs use different threshold settings -> preserve the same confidence scores and evidence, but allow the configurable threshold to change which candidates become accepted.
## Business Rules & Invariants
- Rule: Confidence scoring is deterministic for identical ingest artifacts, evidence sources, and configuration.
- Rule: Acceptance uses a configurable threshold rather than a hard-coded cutoff.
- Rule: `Accepted` means the candidate score is at or above the configured acceptance threshold and the match is safe enough to externalize later.
- Rule: `Unresolved` is preferred over `Rejected` when a candidate remains plausible but is below threshold, colliding, ambiguous, or otherwise unsafe to settle automatically.
- Rule: `Rejected` is reserved for candidates whose evidence deterministically supports that they are not vendored package matches worth externalizing.
- Rule: The workflow records auditable evidence and rationale for every dependency decision.
- Rule: One vendored package candidate may map to multiple segments when the boundary recovery rationale supports it.
- Invariant: This slice identifies and records dependency decisions only; it does not externalize code.
- Invariant: Accepted, rejected, and unresolved candidates all remain visible in emitted manifests.
- Invariant: Changing the acceptance threshold must not require redesigning the manifest format.
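The determinism rule and the threshold invariant can be illustrated with a small TypeScript sketch. It assumes a weighted-sum scoring model and a package-name tie-break; the weights and the `scoreMatch` and `rankMatches` names are hypothetical, since evidence weighting is left as an implementation concern. Scoring takes no threshold argument, so changing the threshold cannot change scores or rankings.

```typescript
// A subset of the evidence signals named in the shared model.
type EvidenceSignal = "LicenseBanner" | "PreservedPackageName" | "AstShapeFingerprint" | "ByteSimilarity";

// Hypothetical weights, chosen only for this example.
const WEIGHTS: Record<EvidenceSignal, number> = {
  LicenseBanner: 40,
  PreservedPackageName: 30,
  AstShapeFingerprint: 20,
  ByteSimilarity: 10,
};

interface CandidateMatch {
  packageName: string;
  signals: EvidenceSignal[];
}

interface ScoredMatch extends CandidateMatch {
  confidenceScore: number;
}

function scoreMatch(match: CandidateMatch): ScoredMatch {
  // Count each signal once so duplicate evidence cannot inflate the score.
  const unique = match.signals.filter((signal, index, all) => all.indexOf(signal) === index);
  const confidenceScore = unique.reduce((sum, signal) => sum + WEIGHTS[signal], 0);
  return { ...match, confidenceScore };
}

function compareNames(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function rankMatches(matches: CandidateMatch[]): ScoredMatch[] {
  // Higher score first; ties fall back to package-name order so the ranking
  // never depends on input order.
  return matches
    .map(scoreMatch)
    .sort((a, b) => b.confidenceScore - a.confidenceScore || compareNames(a.packageName, b.packageName));
}
```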
## Required Decisions Owned by This Context
- Which evidence signals are combined into deterministic vendored-package confidence scoring.
- Which related segments belong to one vendored package candidate boundary.
- Whether a candidate becomes accepted, rejected, or unresolved.
- What evidence, rationale, ambiguity notes, and replacement planning hints must be persisted for downstream use.
## Handoff Assumptions
- `ingest-snapshot` provides deterministic run manifest, segment records, and canonical projection as the source of truth for candidate discovery.
- `externalize-accepted-dependencies` consumes only accepted dependency decisions for replacement work.
- Later review and release-summary seams consume accepted, rejected, and unresolved manifests without reopening this slice's decision logic.
- Optional runtime traces act only as additional evidence inside this context and do not override deterministic decision recording requirements.
## Open Questions
- None currently inside this slice; threshold tuning and evidence-weight configuration remain implementation concerns within this context.
@@ -0,0 +1,105 @@
# Core Sketch: Identify Next Vendored Package Decision From Source
- Bounded context: `dependency-recovery`
- Workflow slug: `identify-next-vendored-package-decision-from-source`
## Command
- `IdentifyNextVendoredPackageDecisionFromSource`
- Meaning: inspect one source snapshot, deterministically recover the next plausible vendored candidate if one remains, and record exactly one `Dependency Decision` or an explicit no-more-candidates outcome for later `Externalization`.
## Required State
State owned by `dependency-recovery` and required to decide this workflow:
- `VendoredCandidateDiscoveryRules`
  - allowed evidence signals for vendored-package discovery
  - grouping rules for segment and segment-group candidates
  - rationale rules for recovered package boundaries
- `ConfidenceScoringRules`
  - deterministic scoring weights or combination rules across evidence types
  - ranking rules for competing package matches on the same candidate boundary
  - tie-break rules when scores or evidence patterns compete
- `AcceptanceThresholdPolicy`
  - configurable `Acceptance Threshold`
  - rules for when plausible but below-threshold or colliding candidates remain `Unresolved Candidates`
  - rules for when counter-evidence is strong enough to produce a `Rejected Candidate`
- `DependencyDecisionRequirements`
  - required manifest fields for evidence summary, raw evidence references, confidence score, provenance, recovered boundary notes, ambiguity notes, replacement plan, and fallback reference
  - required auditability rules for later review and downstream handoffs
## Observed Inputs
Snapshots or handoffs read but not owned by this context:
- one source snapshot from `ingest-snapshot`, plus the current decision cursor for that source
- the relevant `Run Manifest` facts and canonical source projection from `ingest-snapshot`
- optional runtime traces from that source used only as additional or tie-break evidence
- optional registry, tarball, or CDN package evidence used to compare matches for the next recovered candidate
## Policy Signature (Pseudo)
```text
recoverNextCandidateBoundary :
    SourceSnapshot
    -> SourceDecisionCursor
    -> VendoredCandidateDiscoveryRules
    -> Result<NextCandidateBoundary, NoMoreVendoredCandidatesRemain>

scoreVendoredCandidate :
    NextCandidateBoundary
    -> CandidateEvidenceSources
    -> ConfidenceScoringRules
    -> RankedCandidateMatches

decideDependencyDecision :
    NextCandidateBoundary
    -> RankedCandidateMatches
    -> AcceptanceThresholdPolicy
    -> DependencyDecision

validateDecisionRecord :
    DependencyDecision
    -> DependencyDecisionRequirements
    -> Result<DependencyDecisionRecorded, DependencyDecisionRejected>

performNextVendoredPackageDecisionFromSource :
    IdentifyNextVendoredPackageDecisionFromSource
    -> DependencyRecoveryState
    -> Result<NextVendoredPackageDecisionFromSourceCompleted, VendoredPackageIdentificationHardStopped>
```
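The way these stages might chain can be sketched in TypeScript. This is an illustration under stated assumptions, not the contract: the `Result` shape, the `Stages` interface, and the use of plain strings and numbers for boundaries, decisions, and cursors are all hypothetical simplifications. It also assumes that exhaustion is reported as a normal outcome rather than a hard stop.

```typescript
// Hypothetical Result shape for illustration.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

type Outcome =
  | { kind: "completed"; decision: string; nextCursor: number }
  | { kind: "noMoreCandidates"; finalCursor: number };

type HardStop = { failedStage: string; reason: string };

// Each stage is injected so the chaining logic stays independent of evidence details.
interface Stages {
  recover: (cursor: number) => Result<{ boundary: string; nextCursor: number }, "noMoreCandidates">;
  score: (boundary: string) => Result<number[], HardStop>;
  decide: (boundary: string, scores: number[]) => Result<string, HardStop>;
  validate: (decision: string) => Result<string, HardStop>;
}

function performNextDecision(stages: Stages, cursor: number): Result<Outcome, HardStop> {
  const recovered = stages.recover(cursor);
  // Exhaustion is an expected outcome, not a failure.
  if (!recovered.ok) {
    return { ok: true, value: { kind: "noMoreCandidates", finalCursor: cursor } };
  }
  const { boundary, nextCursor } = recovered.value;

  // Any later stage failure short-circuits as a hard stop.
  const scored = stages.score(boundary);
  if (!scored.ok) return scored;
  const decided = stages.decide(boundary, scored.value);
  if (!decided.ok) return decided;
  const validated = stages.validate(decided.value);
  if (!validated.ok) return validated;

  return { ok: true, value: { kind: "completed", decision: validated.value, nextCursor } };
}
```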
## Events
### Success Event
- `NextVendoredPackageDecisionFromSourceCompleted`
  - run identity reference
  - source reference
  - next decision cursor
  - emitted single dependency decision
  - emitted decision record reference
  - evidence artifact references for that candidate
### Exhaustion Event
- `NoMoreVendoredCandidatesRemain`
  - run identity reference
  - source reference
  - final decision cursor
### Failure Event
- `VendoredPackageIdentificationHardStopped`
  - run identity when available
  - failed stage such as missing ingest artifacts, invalid candidate boundary inputs, or invalid decision-manifest requirements
  - failure reason
## Boundary Notes
- The `dependency-recovery` context decides the next vendored-package decision for one source per workflow invocation.
- Outer orchestration may repeat the workflow for the same source using the returned decision cursor until the slice emits `NoMoreVendoredCandidatesRemain`.
- This slice does not externalize accepted packages; that belongs to `dependency-recovery/externalize-accepted-dependencies`.
- `ingest-snapshot` remains the source of truth for run manifest, segment boundaries, and canonical projection; this slice must not reopen ingest decisions.
- Optional runtime traces and package-source comparisons act only as evidence inputs here and must not turn this slice into cross-context orchestration.
- Feature-level orchestration decides whether unresolved or review-needed outcomes slow later phases; this slice only records one audit-ready dependency decision at a time.
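The outer orchestration loop described above might look like the following TypeScript sketch. It assumes a hypothetical `step` function standing in for one workflow invocation, string cursors, and a `maxSteps` guard that is not part of the design; it simply repeats the workflow with the returned cursor until exhaustion or a hard stop.

```typescript
// Hypothetical result of one workflow invocation.
type StepResult =
  | { kind: "decision"; decision: string; nextCursor: string }
  | { kind: "noMoreCandidates"; finalCursor: string }
  | { kind: "hardStopped"; reason: string };

function collectDecisions(
  step: (cursor: string) => StepResult,
  initialCursor: string,
  maxSteps = 1000, // assumed guard against a cursor that never advances
): { decisions: string[]; finalCursor: string; stoppedReason?: string } {
  const decisions: string[] = [];
  let cursor = initialCursor;
  for (let i = 0; i < maxSteps; i++) {
    const result = step(cursor);
    if (result.kind === "noMoreCandidates") {
      return { decisions, finalCursor: result.finalCursor };
    }
    if (result.kind === "hardStopped") {
      return { decisions, finalCursor: cursor, stoppedReason: result.reason };
    }
    // Exactly one decision is recorded per invocation.
    decisions.push(result.decision);
    cursor = result.nextCursor;
  }
  return { decisions, finalCursor: cursor, stoppedReason: "step budget exhausted" };
}
```

Keeping the loop outside the slice matches the boundary note that the slice itself records only one decision at a time.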
@@ -0,0 +1,242 @@
module DependencyRecovery.IdentifyNextVendoredPackageDecisionFromSource
open DependencyRecovery.SharedModel
// 1. Primitives
type TaintedRunManifestReference = TaintedRunManifestReference of string
type TaintedSegmentRecordReference = TaintedSegmentRecordReference of string
type TaintedCanonicalProjectionReference = TaintedCanonicalProjectionReference of string
type TaintedRuntimeTraceReference = TaintedRuntimeTraceReference of string
type TrustedRunManifestReference = TrustedRunManifestReference of string
type TrustedSegmentRecordReference = TrustedSegmentRecordReference of string
type TrustedCanonicalProjectionReference = TrustedCanonicalProjectionReference of string
type TrustedRuntimeTraceReference = TrustedRuntimeTraceReference of string
type CandidateGroupingRule =
| GroupAdjacentSegments
| GroupStructurallyLinkedSegments
| GroupSharedLiteralClusters
| GroupSharedExportSurface
type BoundaryRationaleRule =
| RecordAdjacentGroupingRationale
| RecordStructuralLinkRationale
| RecordSharedLiteralRationale
| RecordExportSurfaceRationale
type ScoringRule =
| WeightLicenseBanner
| WeightPreservedPackageName
| WeightSourceMapHint
| WeightPreservedRequireString
| WeightCharacteristicLiteralSet
| WeightHelperSignature
| WeightAstShapeFingerprint
| WeightExportSurfaceSimilarity
| WeightDependencyGraphPosition
| WeightByteSimilarity
| WeightRuntimeExecutionTrace
type RankingRule =
| RankByTotalEvidenceWeight
| RankByEvidenceDiversity
| RankByBoundaryCoverage
| RankByRuntimeSupport
type TieBreakRule =
| PreferMoreSpecificPackageMatch
| PreferBroaderBoundaryCoverage
| PreferStaticEvidenceAgreement
| PreferStablePackageNameOrder
type UnresolvedRule =
| KeepBelowThresholdCandidatesUnresolved
| KeepCollidingCandidatesUnresolved
| KeepAmbiguousCandidatesUnresolved
| KeepConflictingEvidenceCandidatesUnresolved
type RejectedRule =
| RejectDeterministicallyAppAuthoredCandidates
| RejectDeterministicallyNonPackageCandidates
| RejectDeterministicallyContradictedCandidates
type RequiredManifestField =
| CandidatePackageNameField
| DecisionStateField
| ConfidenceScoreField
| EvidenceSummaryField
| RawEvidenceReferencesField
| MatchedSegmentIdsField
| RecoveredBoundaryNotesField
| ReplacementPlanField
| FallbackReferenceField
| EvidenceProvenanceField
| AmbiguityNotesField
type AuditabilityRule =
| RecordDecisionRationale
| RecordThresholdUsed
| RecordScoringInputs
| RecordCompetingMatches
| RecordDecisionTimestampOrder
// 2. Commands (Inputs)
type SourceDecisionCursor = SourceDecisionCursor of string
type TaintedSourceReference = TaintedSourceReference of string
type TrustedSourceReference = TrustedSourceReference of string
type TaintedCandidateBoundaryReference = TaintedCandidateBoundaryReference of string
type TrustedCandidateBoundaryReference = TrustedCandidateBoundaryReference of string
type IdentifyNextVendoredPackageDecisionFromSource = {
    runManifest: TaintedRunManifestReference
    canonicalProjection: TaintedCanonicalProjectionReference
    source: TaintedSourceReference
    decisionCursor: SourceDecisionCursor
    runtimeTraces: TaintedRuntimeTraceReference option
}

// 3. Observed inputs and owned state
type TrustedSourceInput = {
    runIdentity: RunIdentity
    runManifest: TrustedRunManifestReference
    canonicalProjection: TrustedCanonicalProjectionReference
    source: TrustedSourceReference
    decisionCursor: SourceDecisionCursor
    runtimeTraces: TrustedRuntimeTraceReference option
}

type VendoredCandidateDiscoveryRules = {
    allowedSignals: EvidenceSignal list
    groupingRules: CandidateGroupingRule list
    boundaryRationaleRules: BoundaryRationaleRule list
}

type ConfidenceScoringRules = {
    scoringRules: ScoringRule list
    rankingRules: RankingRule list
    tieBreakRules: TieBreakRule list
}

type AcceptanceThresholdPolicy = {
    acceptanceThreshold: ConfidenceScore
    unresolvedRules: UnresolvedRule list
    rejectedRules: RejectedRule list
}

type DependencyDecisionRequirements = {
    requiredManifestFields: RequiredManifestField list
    auditabilityRules: AuditabilityRule list
}

type DependencyRecoveryState = {
    candidateDiscoveryRules: VendoredCandidateDiscoveryRules
    confidenceScoringRules: ConfidenceScoringRules
    acceptanceThresholdPolicy: AcceptanceThresholdPolicy
    dependencyDecisionRequirements: DependencyDecisionRequirements
}
// 4. Events (Facts)
type NextCandidateBoundary = {
    source: TrustedSourceReference
    candidateBoundary: TrustedCandidateBoundaryReference
    nextCursor: SourceDecisionCursor
}

type DecisionRecordReference = DecisionRecordReference of string

type NextVendoredPackageDecisionFromSourceCompleted = {
    runIdentity: RunIdentity
    source: TrustedSourceReference
    nextCursor: SourceDecisionCursor
    dependencyDecision: DependencyDecision
    decisionRecord: DecisionRecordReference
    evidenceArtifacts: EvidenceReference list
}

type NoMoreVendoredCandidatesRemain = {
    runIdentity: RunIdentity
    source: TrustedSourceReference
    finalCursor: SourceDecisionCursor
}
type VendoredPackageIdentificationStage =
| CandidateInputParsingStage
| CandidateScoringStage
| DependencyDecisionStage
| DecisionRecordValidationStage
type VendoredPackageIdentificationFailureReason =
| MissingIngestArtifacts
| InvalidIngestArtifactReference
| InvalidCandidateBoundaryReference
| InvalidDecisionRecordRequirements
type VendoredPackageIdentificationHardStopped = {
    runIdentity: RunIdentity option
    failedStage: VendoredPackageIdentificationStage
    reason: VendoredPackageIdentificationFailureReason
}
// 5. State (Aggregate)
type DependencyIdentificationState =
| AwaitingVendoredPackageIdentification of DependencyRecoveryState
| NextVendoredPackageDecisionRecorded of NextVendoredPackageDecisionFromSourceCompleted
| VendoredCandidateDiscoveryExhausted of NoMoreVendoredCandidatesRemain
// 6. Parse and decision contracts
val parseSourceInput :
    IdentifyNextVendoredPackageDecisionFromSource
        -> Result<TrustedSourceInput, VendoredPackageIdentificationHardStopped>

val recoverNextCandidateBoundary :
    TrustedSourceInput
        -> VendoredCandidateDiscoveryRules
        -> Result<NextCandidateBoundary, NoMoreVendoredCandidatesRemain>

val scoreCandidateMatches :
    NextCandidateBoundary
        -> TrustedSourceInput
        -> ConfidenceScoringRules
        -> Result<CandidateMatch list, VendoredPackageIdentificationHardStopped>

val decideDependencyDecision :
    AcceptanceThresholdPolicy
        -> NextCandidateBoundary
        -> CandidateMatch list
        -> Result<DependencyDecision, VendoredPackageIdentificationHardStopped>

val validateDecisionRecord :
    DependencyDecisionRequirements
        -> DependencyDecision
        -> Result<DecisionRecordReference, VendoredPackageIdentificationHardStopped>

val decide :
    DependencyIdentificationState
        -> IdentifyNextVendoredPackageDecisionFromSource
        -> Result<NextVendoredPackageDecisionFromSourceCompleted, VendoredPackageIdentificationHardStopped>

val apply :
    DependencyIdentificationState
        -> NextVendoredPackageDecisionFromSourceCompleted
        -> DependencyIdentificationState

val workflow :
    IdentifyNextVendoredPackageDecisionFromSource
        -> Effect.Effect<Result<NextVendoredPackageDecisionFromSourceCompleted, VendoredPackageIdentificationHardStopped>>
@@ -0,0 +1,14 @@
# Shared Language: Dependency Recovery
## Context Meaning
- Owns decisions about which bundled code is third-party, how confident that decision is, and whether it can be externalized safely.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Vendored Package` | Third-party code embedded inside the upstream bundle | `Vendored Package` not `library blob` | Keeps ownership focused on package recovery. |
| `Dependency Decision` | Accepted, rejected, or unresolved determination about a vendored candidate | `Dependency Decision` not `match result` | Emphasizes reviewer-facing judgment with rationale. |
| `Fallback Preservation` | Keeping bundled code available when externalization is unsafe or unresolved | `Fallback Preservation` not `leave old code around` | Safety term for preserving executable behavior. |
| `Externalization` | Replacing accepted vendored code with an external dependency reference | `Externalization` not `strip dependency` | Avoids implying destructive removal. |
@@ -0,0 +1,69 @@
module DependencyRecovery.SharedModel
type RunIdentity = RunIdentity of string
type SegmentId = SegmentId of string
type CandidateBoundaryId = CandidateBoundaryId of string
type PackageName = PackageName of string
type ConfidenceScore = private ConfidenceScore of int
type EvidenceReference = EvidenceReference of string
type BoundaryNote = BoundaryNote of string
type AmbiguityNote = AmbiguityNote of string
type ReplacementPlan = ReplacementPlan of string
type FallbackReference = FallbackReference of string
type Rationale = Rationale of string
type EvidenceProvenance =
| Registry
| Tarball
| Cdn
| Runtime
| Static
| Mixed
type EvidenceSignal =
| LicenseBanner
| PreservedPackageName
| SourceMapHint
| PreservedRequireString
| CharacteristicLiteralSet
| HelperSignature
| AstShapeFingerprint
| ExportSurfaceSimilarity
| DependencyGraphPosition
| ByteSimilarity
| RuntimeExecutionTrace
type CandidateBoundary = {
    boundaryId: CandidateBoundaryId
    segmentIds: SegmentId list
    boundaryNotes: BoundaryNote list
}

type EvidenceSummary = {
    signals: EvidenceSignal list
    rawEvidence: EvidenceReference list
    provenance: EvidenceProvenance
    rationale: Rationale
}

type CandidateMatch = {
    packageName: PackageName
    confidenceScore: ConfidenceScore
    evidence: EvidenceSummary
    ambiguityNotes: AmbiguityNote list
}
type DependencyDecision =
| AcceptedDecision of CandidateBoundary * CandidateMatch * ReplacementPlan * FallbackReference
| RejectedDecision of CandidateBoundary * CandidateMatch
| UnresolvedDecision of CandidateBoundary * CandidateMatch list
@@ -0,0 +1,35 @@
# Ingest Snapshot Context
**Snapshot Ingest**: the deterministic intake of one upstream bundle snapshot into stable per-run recovery artifacts.
_Avoid_: import step, parse pass
**Run Identity**: the deterministic identity for one ingest run, derived from the upstream snapshot identity rather than manually assigned.
_Avoid_: ad hoc run id, operator-chosen id
**Trusted Bundle Location**: a bundle location that has been parsed and accepted for ingest use.
_Avoid_: raw bundle path, unchecked file location
**Verified Previous Run Manifest**: a prior run manifest that has passed schema and integrity checks before being used for continuity hints.
_Avoid_: trusted old manifest, reused manifest blob
**Segment Record**: one deterministic ingest-level code unit produced from any AST slice boundary that can be proven stably.
_Avoid_: chunk, guessed module
**Canonical Source Projection**: the normalized recovered code emitted from ingest for downstream phases to consume as source-of-truth input.
_Avoid_: formatted bundle, pretty output
## Example dialogue
> **Developer:** "Can Snapshot Ingest continue if the bundle does not parse?"
> **Domain Expert:** "No. Snapshot Ingest hard-stops because no trustworthy Segment Records can be emitted."
>
> **Developer:** "Who chooses the Run Identity?"
> **Domain Expert:** "The system derives Run Identity deterministically from the upstream snapshot identity."
>
> **Developer:** "What counts as a Segment Record?"
> **Domain Expert:** "Any AST slice boundary we can prove deterministically, not only wrapper modules."
## Flagged ambiguities
- "Module" is too narrow for this context because ingest may emit deterministic AST-slice boundaries that do not correspond to a bundler module wrapper.
- "Run ID" previously sounded operator-provided; resolved term is `Run Identity`, which is deterministically derived.
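The step from a raw bundle path to a `Trusted Bundle Location` can be sketched with a branded type in TypeScript. This is only an illustration: the acceptance checks below (non-empty, no NUL byte, no parent-directory traversal, `.js` suffix) are assumptions, since the source does not specify what makes a location acceptable, and the function names are hypothetical.

```typescript
// Branded type: ordinary strings cannot be passed where a trusted location is required.
type TrustedBundleLocation = string & { readonly __brand: "TrustedBundleLocation" };

type ParseResult =
  | { ok: true; location: TrustedBundleLocation }
  | { ok: false; reason: string };

// Hypothetical acceptance rules, chosen only for this example.
function parseBundleLocation(tainted: string): ParseResult {
  const trimmed = tainted.trim();
  if (trimmed.length === 0) return { ok: false, reason: "empty location" };
  if (trimmed.indexOf("\0") !== -1) return { ok: false, reason: "NUL byte in location" };
  if (trimmed.split(/[\\/]/).indexOf("..") !== -1) {
    return { ok: false, reason: "parent-directory traversal" };
  }
  if (!/\.js$/.test(trimmed)) return { ok: false, reason: "not a .js bundle" };
  // The cast is confined to this parser, the only place a trusted value is minted.
  return { ok: true, location: trimmed as TrustedBundleLocation };
}

// Downstream code accepts only the trusted type, so unchecked input fails to compile.
function describeIngestTarget(location: TrustedBundleLocation): string {
  return "ingesting " + location;
}
```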
@@ -0,0 +1,32 @@
# Workflow Decomposition: Deterministic Bundle Ingest
- Bounded context: `ingest-snapshot`
- Workflow slug: `deterministic-bundle-ingest`
- Trigger: `Ingest upstream bundle snapshot`
- Success outcome: Deterministic run artifacts exist for one upstream snapshot, including manifest, segment records, canonical projection, and summary.
## Inputs Owned by This Context
- upstream bundle snapshot selected for ingest
- run identity inputs
- deterministic segment-boundary rules
- emitted run manifest and segment records
## Inputs Observed from Other Contexts
- previous run manifest snapshot reference when reuse or continuity checks are helpful
- upstream snapshot metadata used later by release-packaging
## Downstream Handoffs
- emit run manifest, segment records, canonical source projection, and summary to `dependency-recovery`
- emit stable segment records to `static-context-evidence`
## Slice Boundaries
- Decision logic is owned only by `ingest-snapshot`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Foundational first slice for the feature; later slices depend on its artifacts.
@@ -0,0 +1,55 @@
# Slice Discovery: Deterministic Bundle Ingest
- Bounded context: `ingest-snapshot`
- Workflow slug: `deterministic-bundle-ingest`
## Happy Path
- Operator selects one upstream bundle snapshot for recovery.
- The workflow derives `Run Identity` deterministically from the upstream snapshot identity.
- The bundle parses successfully.
- The workflow detects deterministic AST slice boundaries for the snapshot.
- For each proven segment boundary, the workflow emits a `Segment Record` with source slice, AST node type, canonical source projection, and stable hashes.
- The workflow emits machine-readable ingest artifacts for the run, including `manifest.json` and `segments.jsonl`.
- The workflow may emit a human-readable summary, but success is defined by the machine artifacts.
- Downstream contexts consume the emitted manifest, segment records, and canonical source projection as the source of truth for later phases.
## Edge Cases
- Bundle does not parse -> hard stop the slice because no trustworthy ingest artifacts can be emitted.
- Bundle exceeds configured size or parse budget -> hard stop the slice before deeper ingest work to reduce resource-exhaustion risk.
- An apparent boundary cannot be proven deterministically -> do not emit it as a separate segment record; keep only proven AST slice boundaries.
- Identical upstream snapshot is ingested again -> derive the same run identity inputs and emit the same deterministic machine artifacts.
- Previous run manifest is available but not verified -> do not use it for continuity hints.
- Previous run manifest is available but continuity is weak -> ingest may observe it for continuity hints, but the current run artifacts still come only from the current snapshot and deterministic ingest rules.
- Human-readable summary generation fails -> slice still succeeds if machine-readable artifacts were emitted correctly.
## Business Rules & Invariants
- Rule: `Run Identity` is derived deterministically from upstream snapshot identity rather than chosen manually.
- Rule: A `Segment Record` may come from any deterministic AST slice boundary that can be proven stably.
- Rule: Ingest emits machine-readable artifacts as the source of truth for later phases.
- Rule: Human-readable summary output is optional relative to core ingest success.
- Rule: Bundle location input must be parsed into a trusted bundle location before ingest uses it.
- Rule: A previous run manifest must be verified before it may influence continuity hints.
- Invariant: If the bundle cannot be parsed, ingest hard-stops rather than emitting speculative artifacts.
- Invariant: The workflow does not guess segment boundaries that it cannot prove deterministically.
- Invariant: Identical upstream snapshot inputs produce identical deterministic ingest outputs.
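Deterministic `Run Identity` derivation can be illustrated with a small TypeScript sketch. It assumes the snapshot identity is available as a string and uses an FNV-1a-style 32-bit hash over UTF-16 code units purely as a stand-in; the real digest, the `run-` prefix, and the function names are hypothetical.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units.
// A stand-in for whatever digest the pipeline actually uses.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 via shifts, truncated to unsigned 32 bits.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// The identity depends only on the snapshot identity: no clock, no random input,
// no operator-chosen value.
function deriveRunIdentity(snapshotIdentity: string): string {
  return "run-" + fnv1a32(snapshotIdentity);
}
```

Because nothing but the snapshot identity feeds the derivation, ingesting the same snapshot twice yields the same identity.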
## Required Decisions Owned by This Context
- Whether the selected upstream snapshot is parseable enough to begin deterministic ingest.
- Which AST slice boundaries are proven enough to become `Segment Records`.
- Which machine-readable artifacts are required for ingest success.
- How deterministic run identity is derived from upstream snapshot identity.
## Handoff Assumptions
- `dependency-recovery` receives `manifest.json`, `segments.jsonl`, and canonical source projection as the authoritative ingest outputs.
- `static-context-evidence` receives stable segment records whose boundaries were decided only inside `ingest-snapshot`.
- `release-packaging` may later consume upstream snapshot identity recorded in the run manifest.
- Cross-run continuity hints from a previous manifest do not override the current slice's deterministic ingest decisions.
## Open Questions
- None currently inside this slice; broader build verification and publication-seam questions remain feature-level concerns.
@@ -0,0 +1,88 @@
# Core Sketch: Deterministic Bundle Ingest
- Bounded context: `ingest-snapshot`
- Workflow slug: `deterministic-bundle-ingest`
## Command
- `IngestUpstreamSnapshot`
- Meaning: start `Snapshot Ingest` for one selected upstream bundle snapshot so the recovery pipeline has deterministic source-of-truth artifacts for later phases.
## Required State
State owned by `ingest-snapshot` and required to decide this workflow:
- `SelectedSnapshot`
  - upstream snapshot identity
  - trusted bundle location derived from tainted ingest input
  - optional upstream metadata intended for later release provenance
- `RunIdentityRules`
  - deterministic derivation rules from upstream snapshot identity
  - collision policy for repeated ingest of the same snapshot identity
- `SegmentBoundaryRules`
  - deterministic AST slice boundary rules
  - proof rules for when a candidate boundary is strong enough to become a `Segment Record`
- `IngestArtifactRequirements`
  - required machine artifacts for success: `Run Manifest`, `segments.jsonl`, and `Canonical Source Projection`
  - optional human-readable summary artifact
## Observed Inputs
Snapshots or handoffs read but not owned by this context:
- optional verified previous `Run Manifest` reference used only for continuity hints
- upstream metadata needed later by `release-packaging`
## Policy Signature (Pseudo)
```text
deriveRunIdentity : SelectedSnapshot -> RunIdentity
parseBundleLocation :
  TaintedBundleInput -> Result<TrustedBundleLocation, IngestRejected>
validatePreviousRunManifest :
  RunManifest -> Result<VerifiedPreviousRunManifest, IngestRejected>
validateSnapshotIngest :
  IngestUpstreamSnapshot -> SelectedSnapshot -> Result<SnapshotReady, IngestRejected>
decideSegmentBoundaries :
  SnapshotReady -> SegmentBoundaryRules -> Result<NonEmptyList<SegmentRecord>, IngestRejected>
validateRequiredArtifacts :
  RunIdentity -> NonEmptyList<SegmentRecord> -> IngestArtifactRequirements -> Result<IngestArtifactsReady, IngestRejected>
performSnapshotIngest :
  IngestUpstreamSnapshot
  -> IngestSnapshotState
  -> Result<UpstreamSnapshotIngested, SnapshotIngestHardStopped>
```
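The `parseBundleLocation` signature turns tainted input into a trusted value or a rejection. A minimal sketch of that idea follows. The concrete checks (relative path only, no `..` segments, `.js` suffix, a fixed `bundles` root) are assumptions chosen for illustration; the design docs do not specify which checks make a location trusted.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Union


@dataclass(frozen=True)
class TrustedBundleLocation:
    path: str


@dataclass(frozen=True)
class IngestRejected:
    reason: str


def parse_bundle_location(
    tainted: str, allowed_root: str = "bundles"
) -> Union[TrustedBundleLocation, IngestRejected]:
    """Turn tainted input into a trusted location, or reject it.

    All checks below are hypothetical examples of trust rules.
    """
    if not tainted or "\x00" in tainted:
        return IngestRejected("empty or malformed bundle location")
    candidate = PurePosixPath(tainted)
    if candidate.is_absolute() or ".." in candidate.parts:
        return IngestRejected("bundle location escapes the allowed root")
    if candidate.suffix != ".js":
        return IngestRejected("bundle location is not a .js file")
    return TrustedBundleLocation(str(PurePosixPath(allowed_root) / candidate))


accepted = parse_bundle_location("app/bundle.js")
rejected = parse_bundle_location("../etc/passwd")
```

Because only `parse_bundle_location` constructs `TrustedBundleLocation` in this sketch, later steps that require the trusted type cannot receive raw input by accident, which mirrors the tainted-versus-trusted split in the contract.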
## Events
### Success Event
- `UpstreamSnapshotIngested`
  - run identity
  - upstream snapshot identity
  - emitted `Run Manifest` reference
  - emitted `Segment Record` set reference
  - emitted `Canonical Source Projection` reference
  - optional summary reference
### Failure Event
- `SnapshotIngestHardStopped`
  - upstream snapshot identity when available
  - failure reason
  - failed stage such as parse failure, boundary proof failure, or required artifact failure
## Boundary Notes
- The `ingest-snapshot` context decides only whether one snapshot can be deterministically ingested and which boundaries become `Segment Records`.
- It does not decide package identity, dependency externalization, context heuristics, lineage matching, naming, regularization, transform replay, or release publication.
- Observing a previous `Run Manifest` does not let this slice reuse or override current ingest decisions; cross-run lineage belongs to `snapshot-lineage`.
- A previous `Run Manifest` must be verified for schema and integrity before this slice may use it as a continuity hint.
- Human-readable summary generation is outside the hard success contract for this slice; required machine artifacts remain the source of truth.
- Feature-level orchestration decides when later phases may continue after downstream review-needed states; this slice only hard-stops when deterministic ingest itself is not trustworthy.
@@ -0,0 +1,105 @@
module IngestSnapshot.DeterministicBundleIngest
open IngestSnapshot.SharedModel
// 1. Primitives
type TaintedBundleInput = TaintedBundleInput of TaintedBundleLocation
type DerivedRunIdentity = DerivedRunIdentity of RunIdentity
type MaxBundleBytes = MaxBundleBytes of int64
type ParseBudget = ParseBudget of int64
type BoundaryProof = BoundaryProof of string
type RequiredArtifact =
    | RunManifestArtifact
    | SegmentRecordsArtifact
    | CanonicalProjectionArtifact
type IngestFailureReason =
    | BundleNotParseable
    | RunIdentityCouldNotBeDerived
    | PreviousRunManifestNotVerified
    | BundleTooLarge of MaxBundleBytes
    | ParseBudgetExceeded of ParseBudget
    | NoDeterministicBoundaryProven
    | RequiredArtifactMissing of RequiredArtifact
// 2. Commands (Inputs)
type IngestUpstreamSnapshot =
    { SnapshotIdentity: SnapshotIdentity
      BundleInput: TaintedBundleInput
      SnapshotMetadata: SnapshotMetadata option
      PreviousRunManifest: VerifiedPreviousRunManifest option }
// 3. Events (Facts)
type UpstreamSnapshotIngested =
    { RunManifest: RunManifest
      SegmentRecords: SegmentRecord list
      CanonicalProjectionPath: TrustedCanonicalProjectionPath
      SummaryPath: TrustedSummaryPath option }
type SnapshotIngestHardStopped =
    { SnapshotIdentity: SnapshotIdentity
      Reason: IngestFailureReason }
type Event =
    | UpstreamSnapshotIngested of UpstreamSnapshotIngested
// 4. Errors
type Error =
    | SnapshotIngestHardStopped of SnapshotIngestHardStopped
// 5. State (Aggregates)
type AwaitingSnapshotSelection =
    { RunIdentityRulesDescription: string
      BoundaryRulesDescription: string
      RequiredArtifacts: RequiredArtifact list
      MaxBundleBytes: MaxBundleBytes
      ParseBudget: ParseBudget }
type SnapshotReady =
    { SelectedSnapshot: SelectedSnapshot
      PreviousRunManifest: VerifiedPreviousRunManifest option
      RequiredArtifacts: RequiredArtifact list
      MaxBundleBytes: MaxBundleBytes
      ParseBudget: ParseBudget }
type DeterministicSegmentsReady =
    { RunIdentity: RunIdentity
      SelectedSnapshot: SelectedSnapshot
      PreviousRunManifest: VerifiedPreviousRunManifest option
      SegmentRecords: SegmentRecord list
      BoundaryProofs: BoundaryProof list
      RequiredArtifacts: RequiredArtifact list }
type State =
    | AwaitingSnapshotSelection of AwaitingSnapshotSelection
    | SnapshotReady of SnapshotReady
    | DeterministicSegmentsReady of DeterministicSegmentsReady
    | SnapshotIngested of UpstreamSnapshotIngested
// 6. Contract Signatures
val parseBundleLocation : TaintedBundleInput -> Result<TrustedBundleLocation, Error>
val deriveRunIdentity : SelectedSnapshot -> Result<DerivedRunIdentity, Error>
val validatePreviousRunManifest : RunManifest -> Result<VerifiedPreviousRunManifest, Error>
val validateSnapshotSelection : State -> IngestUpstreamSnapshot -> Result<SnapshotReady, Error>
val decideSegmentRecords : SnapshotReady -> Result<DeterministicSegmentsReady, Error>
val decide : State -> IngestUpstreamSnapshot -> Result<Event, Error>
val apply : State -> Event -> State
val workflow : IngestUpstreamSnapshot -> Effect.Effect<Result<Event, Error>>
@@ -0,0 +1,14 @@
# Shared Language: Ingest Snapshot
## Context Meaning
- Owns deterministic intake of an upstream bundle snapshot into stable per-run recovery artifacts.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Snapshot Ingest` | Deterministic intake of one upstream bundle snapshot into recovery artifacts | `Snapshot Ingest` not `import step` | Focus on the domain event, not the script shape. |
| `Run Manifest` | Canonical record that identifies one ingest run and its emitted artifacts | `Run Manifest` not `build metadata` | Downstream slices treat it as the primary run reference. |
| `Segment Record` | One deterministic ingest-level code unit with hashes and canonical source | `Segment Record` not `chunk` | Matches the Phase 1 unit of work. |
| `Canonical Source Projection` | Pretty-printed recovered code emitted from ingest for downstream phases | `Canonical Source Projection` not `formatted bundle` | Preferred when the artifact is used as domain evidence. |
@@ -0,0 +1,64 @@
module IngestSnapshot.SharedModel
// 1. Shared primitives
type SnapshotIdentity = SnapshotIdentity of string
type TaintedBundleLocation = TaintedBundleLocation of string
type TrustedBundleLocation = TrustedBundleLocation of string
type RunIdentity = RunIdentity of string
type SourceSpan =
    { StartOffset: int
      EndOffset: int }
type AstNodeKind = AstNodeKind of string
type RawHash = RawHash of string
type NormalizedHash = NormalizedHash of string
type ShapeHash = ShapeHash of string
type TrustedManifestPath = TrustedManifestPath of string
type TrustedSegmentsPath = TrustedSegmentsPath of string
type TrustedCanonicalProjectionPath = TrustedCanonicalProjectionPath of string
type TrustedSummaryPath = TrustedSummaryPath of string
// 2. Shared compounds
type SnapshotMetadata =
    { ReleaseNotesSource: string option
      CollectedAt: string option }
type SelectedSnapshot =
    { SnapshotIdentity: SnapshotIdentity
      BundleLocation: TrustedBundleLocation
      SnapshotMetadata: SnapshotMetadata option }
type SegmentHashes =
    { RawHash: RawHash
      NormalizedHash: NormalizedHash
      ShapeHash: ShapeHash }
type SegmentRecord =
    { SegmentId: string
      SourceSpan: SourceSpan
      AstNodeKind: AstNodeKind
      CanonicalSource: string
      Hashes: SegmentHashes }
type RunManifest =
    { RunIdentity: RunIdentity
      SnapshotIdentity: SnapshotIdentity
      ManifestPath: TrustedManifestPath
      SegmentsPath: TrustedSegmentsPath
      CanonicalProjectionPath: TrustedCanonicalProjectionPath
      SummaryPath: TrustedSummaryPath option }
// Declared after RunManifest because F# resolves type names in declaration order.
type VerifiedPreviousRunManifest = VerifiedPreviousRunManifest of RunManifest
@@ -0,0 +1,32 @@
# Workflow Decomposition: Evaluate and Apply Renames
- Bounded context: `iterative-naming`
- Workflow slug: `evaluate-and-apply-renames`
- Trigger: `Reconcile one completed wave and evaluate naming proposals`
- Success outcome: Safe accepted names are applied after wave reconciliation, queue state is updated, and naming memory is refreshed.
## Inputs Owned by This Context
- deterministic naming validation rules
- wave reconciliation rules
- naming memory update rules
## Inputs Observed from Other Contexts
- executed request and response artifacts from `execute-wave-batches`
- current generated source and context references from earlier slices
## Downstream Handoffs
- emit safely renamed generated source to `codebase-regularization`
- emit refreshed naming memory back into later `plan-relabel-queue` iterations
## Slice Boundaries
- Decision logic is owned only by `iterative-naming`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `iterative-naming/execute-wave-batches` because evaluation operates on persisted wave outcomes.
- Iterates with `iterative-naming/plan-relabel-queue` until naming work is complete or terminal.
@@ -0,0 +1,32 @@
# Workflow Decomposition: Execute Wave Batches
- Bounded context: `iterative-naming`
- Workflow slug: `execute-wave-batches`
- Trigger: `Execute one planned relabel wave`
- Success outcome: Batch request and response artifacts are persisted for one wave, with transport outcomes recorded, but no names are applied yet.
## Inputs Owned by This Context
- wave execution rules
- provider/model configuration for one wave
- transport retry and rate-limit handling
## Inputs Observed from Other Contexts
- batch-ready request artifacts from `plan-relabel-queue`
- queue snapshot references from the iterative-naming queue store
## Downstream Handoffs
- emit persisted request, response, and execution metadata to `evaluate-and-apply-renames`
- emit transport failure and retry records for review artifacts when needed
## Slice Boundaries
- Decision logic is owned only by `iterative-naming`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `iterative-naming/plan-relabel-queue` because execution consumes persisted queue packets.
- Precedes `iterative-naming/evaluate-and-apply-renames` because responses must exist before evaluation.
@@ -0,0 +1,33 @@
# Workflow Decomposition: Plan Relabel Queue
- Bounded context: `iterative-naming`
- Workflow slug: `plan-relabel-queue`
- Trigger: `Plan deterministic relabel work from changed segments`
- Success outcome: New and modified naming candidates are ranked into deterministic batch-ready work packets with attempt metadata.
## Inputs Owned by This Context
- relabel candidate ranking rules
- naming difficulty and priority scoring
- queue packet schema
## Inputs Observed from Other Contexts
- relabel-eligible changed/new segments from `snapshot-lineage`
- context packet references from `static-context-evidence`
- prior naming memory from previous iterative-naming cycles
## Downstream Handoffs
- emit batch-ready relabel request artifacts to `execute-wave-batches`
- persist queue state for later evaluation and feedback loops
## Slice Boundaries
- Decision logic is owned only by `iterative-naming`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `snapshot-lineage/diff-adjacent-runs` because unchanged and deleted segments must stay out of the queue.
- Precedes the other iterative-naming slices because it owns queue formation.
@@ -0,0 +1,14 @@
# Shared Language: Iterative Naming
## Context Meaning
- Owns the relabel queue, model batch execution handoff, safe acceptance of names, and feedback loops that improve later passes.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Relabel Candidate` | One binding or symbol that still needs a better recovered name | `Relabel Candidate` not `rename task` | Candidate remains undecided until evaluation. |
| `Wave` | Set of batches executed against one shared pre-wave queue snapshot | `Wave` not `parallel run` | Carries a specific reconciliation rule. |
| `Naming Attempt` | One candidate-level attempt to recover a better name | `Naming Attempt` not `retry` | Distinct from transport retries and batch attempts. |
| `Naming Memory` | Small reusable history of accepted names that can inform later relabel work | `Naming Memory` not `cache` | Reviewable memory, not opaque storage. |
@@ -0,0 +1,31 @@
# Workflow Decomposition: Replay Maintained Transforms
- Bounded context: `maintained-transform-replay`
- Workflow slug: `replay-maintained-transforms`
- Trigger: `Replay maintained transforms onto the regularized tree`
- Success outcome: Durable maintained transforms are applied, skipped, or surfaced as conflicts against the current regularized tree.
## Inputs Owned by This Context
- maintained transform metadata
- replay ordering and dependency rules
- conflict reporting rules
## Inputs Observed from Other Contexts
- canonical editable tree and placement mappings from `codebase-regularization`
- long-lived maintained transform store from stable metadata
## Downstream Handoffs
- emit replay outcomes and transformed tree state to `release-packaging`
- emit conflict reports for later review when replay is unsafe
## Slice Boundaries
- Decision logic is owned only by `maintained-transform-replay`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `codebase-regularization/regularize-editable-tree` because replay targets the canonical editable tree.
@@ -0,0 +1,14 @@
# Shared Language: Maintained Transform Replay
## Context Meaning
- Owns replay of long-lived locally maintained changes onto the latest regularized upstream-derived tree.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Maintained Transform` | Durable replayable local change stored outside the numbered pipeline phases | `Maintained Transform` not `manual patch` | Preferred repository-wide term. |
| `Replay Conflict` | Explicit report that a transform cannot be applied safely to the current tree | `Replay Conflict` not `failed patch` | Signals surfaced review rather than hidden failure. |
| `Transform Anchor` | Stable file, module, segment, or lineage reference used to target replay | `Transform Anchor` not `patch location` | Better expresses deterministic targeting. |
| `Replay Outcome` | Applied, skipped, or conflict result for one maintained transform | `Replay Outcome` not `patch status` | Shared term used by release-packaging too. |
@@ -0,0 +1,33 @@
# Workflow Decomposition: Build and Publish Artifacts
- Bounded context: `release-packaging`
- Workflow slug: `build-and-publish-artifacts`
- Trigger: `Assemble final release artifacts for one published version`
- Success outcome: Processed and unmodified release artifacts, release manifest, and publication-ready metadata are emitted for the same upstream snapshot identity.
## Inputs Owned by This Context
- release artifact assembly rules
- provenance manifest rules
- publication target metadata
## Inputs Observed from Other Contexts
- transformed canonical tree and replay outcomes from `maintained-transform-replay`
- compact upstream summary facts from `snapshot-lineage`
- dependency and regularization summary facts from earlier contexts
## Downstream Handoffs
- emit release artifact set and release manifest for publication or retryable publication handoff
- emit release notes grounded in recorded pipeline facts
## Slice Boundaries
- Decision logic is owned only by `release-packaging`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `maintained-transform-replay/replay-maintained-transforms` because releases are built from the post-replay tree.
- Depends on earlier summary facts because release notes and manifests must stay traceable.
@@ -0,0 +1,14 @@
# Shared Language: Release Packaging
## Context Meaning
- Owns final artifact assembly, provenance reporting, and publication-ready release outputs for the recovery pipeline.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Release Artifact Set` | Processed and unmodified outputs emitted for one published recovery version | `Release Artifact Set` not `build output` | Includes more than the installable package. |
| `Release Manifest` | Machine-readable provenance record for one published version | `Release Manifest` not `release notes json` | Canonical artifact identity document. |
| `Processed Source Artifact` | Release artifact built from the post-Phase-9 tree | `Processed Source Artifact` not `final code zip` | Distinguishes it from the unmodified artifact. |
| `Unmodified Source Artifact` | Traceability artifact derived from the same upstream snapshot without maintained transforms | `Unmodified Source Artifact` not `raw source zip` | Important for debugging and provenance. |
@@ -0,0 +1,31 @@
# Workflow Decomposition: Diff Adjacent Runs
- Bounded context: `snapshot-lineage`
- Workflow slug: `diff-adjacent-runs`
- Trigger: `Compare current run to previous run`
- Success outcome: Every next-run segment gets a lineage-aware change classification, queue eligibility facts, and upstream summary evidence.
## Inputs Owned by This Context
- lineage minting and append-only rules
- change classification rules
- ambiguity handling rules
## Inputs Observed from Other Contexts
- previous and current run manifests from `ingest-snapshot`
- previous and current context packets from `static-context-evidence`
## Downstream Handoffs
- emit changed-segment facts and relabel eligibility to `iterative-naming`
- emit ambiguous match review artifacts and upstream summary facts for release reporting
## Slice Boundaries
- Decision logic is owned only by `snapshot-lineage`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `static-context-evidence/extract-segment-context` for both adjacent runs because deterministic evidence supports fuzzy matching and summary claims.
@@ -0,0 +1,14 @@
# Shared Language: Snapshot Lineage
## Context Meaning
- Owns durable identity and adjacent-run change facts for recovered segments across upstream snapshots.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Lineage Record` | Durable identity record carried across runs for a recovered segment or family | `Lineage Record` not `persistent ID row` | Keeps the concept domain-first. |
| `Change Classification` | Deterministic label such as unchanged, modified, new, deleted, split, merged, or ambiguous | `Change Classification` not `diff flag` | Human-reviewable change fact. |
| `Retained Tombstone` | Durable record for a deleted segment lineage | `Retained Tombstone` not `deleted row` | Important because deletions remain part of history. |
| `Ambiguous Match` | Contested lineage candidate that must not be forced into a winner | `Ambiguous Match` not `best effort match` | Signals explicit review-needed state. |
@@ -0,0 +1,32 @@
# Workflow Decomposition: Extract Segment Context
- Bounded context: `static-context-evidence`
- Workflow slug: `extract-segment-context`
- Trigger: `Extract deterministic context packets for each segment`
- Success outcome: Each segment has machine-readable context evidence usable for diffing, relabel planning, summaries, and transform anchoring.
## Inputs Owned by This Context
- context packet schema
- binding graph extraction rules
- usage-hint extraction rules
## Inputs Observed from Other Contexts
- stable segment records from `ingest-snapshot`
- post-externalization recovered source view from `dependency-recovery`; this is the canonical input for context extraction
## Downstream Handoffs
- emit context packets, binding graphs, and context summary artifacts to `snapshot-lineage`
- emit deterministic evidence usable later by `iterative-naming` and `maintained-transform-replay`
## Slice Boundaries
- Decision logic is owned only by `static-context-evidence`.
- Cross-context orchestration belongs in feature-level notes.
## Dependencies
- Requires `ingest-snapshot/deterministic-bundle-ingest` because context is segment-scoped.
- Should run after `dependency-recovery` so later slices see the intended recovered code surface.
@@ -0,0 +1,14 @@
# Shared Language: Static Context Evidence
## Context Meaning
- Owns deterministic evidence packets that describe how each segment behaves without asking an LLM to infer structure.
## Preferred Terms
| Term | Meaning | Use this, not that | Notes |
| :--- | :------ | :----------------- | :---- |
| `Context Packet` | Machine-readable evidence emitted for one segment | `Context Packet` not `prompt context` | The packet exists before any model request. |
| `Binding Graph` | Deterministic record of local names and relationships in one segment | `Binding Graph` not `scope dump` | Preferred for reviewer-readable evidence. |
| `Usage Hint` | Deterministic clue from calls, assignments, returns, or literals | `Usage Hint` not `guess` | Keeps evidence separate from inference. |
| `Observed Origin` | Visible import-like or member-access source seen in code | `Observed Origin` not `dependency source` | Avoids confusion with dependency-recovery decisions. |
@@ -0,0 +1,18 @@
# Recovery pipeline phases
This directory contains the current phase documents for the release-oriented deobfuscation workflow.
## Main index
- [Phase overview](file:///home/user/git/amp-decompiled/docs/phases/phase-overview.md)
## Phase files
- [Phase 1 — deterministic ingest](file:///home/user/git/amp-decompiled/docs/phases/phase-1-deterministic-ingest.md)
- [Phase 2 — dependency identification and externalization](file:///home/user/git/amp-decompiled/docs/phases/phase-2-overview.md)
- [Phase 3 — context extraction](file:///home/user/git/amp-decompiled/docs/phases/phase-3-context-extraction.md)
- [Phase 4 — run-to-run diffing and upstream summary](file:///home/user/git/amp-decompiled/docs/phases/phase-4-run-to-run-diffing.md)
- [Phase 5 — iterative relabel queue planning and batching](file:///home/user/git/amp-decompiled/docs/phases/phase-5-iterative-relabel-queue-export.md)
- [Phase 6 — relabel API execution and wave scheduling](file:///home/user/git/amp-decompiled/docs/phases/phase-6-relabel-api-execution-and-wave-scheduling.md)
- [Phase 7 — iterative relabel evaluation, application, and queue feedback](file:///home/user/git/amp-decompiled/docs/phases/phase-7-iterative-relabel-evaluation-application-and-queue-feedback.md)
- [Phase 8 — deterministic codebase regularization](file:///home/user/git/amp-decompiled/docs/phases/phase-8-deterministic-codebase-regularization.md)
- [Phase 9 — derive and replay maintained transforms](file:///home/user/git/amp-decompiled/docs/phases/phase-9-patch-capture-and-replay.md)
- [Phase 10 — build and publish release artifacts](file:///home/user/git/amp-decompiled/docs/phases/phase-10-build-recovered-source-tree.md)
@@ -0,0 +1,39 @@
# Phase 1 — deterministic ingest
## Goal
Turn an upstream bundle snapshot into deterministic segment records and a canonical source projection that later phases can diff, name, and replay transforms against.
## Script
`scripts/ingest-bundle.js`
## Inputs
- `--input <bundle.js>`
- `--run-id <id>`
- `--previous <runs/<id>/manifest.json>` optional
- optional upstream snapshot metadata for release manifests
## Responsibilities
1. parse the bundle with Babel
2. detect segment or module boundaries deterministically
3. for each segment/module:
   - capture source slice and AST node type
   - generate canonical pretty-printed code
   - normalize identifiers for hashing where appropriate
   - compute `rawHash`, `normalizedHash`, and `shapeHash`
4. emit:
   - `runs/<run-id>/manifest.json`
   - `runs/<run-id>/segments.jsonl`
   - `runs/<run-id>/bundle.formatted.js`
   - `runs/<run-id>/reports/summary.md`
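The three per-segment hashes in step 3 can be pictured with a toy sketch. It is not the real ingest script: the phase doc says parsing is done with Babel, while this illustration swaps in a small regex tokenizer so it stays self-contained. The exact definitions used here (raw text, identifiers renamed by first appearance, literals collapsed for shape) are assumptions, as are the names `segment_hashes`, `TOKEN`, and `KEYWORDS`.

```python
import hashlib
import re

# Stand-in tokenizer (assumption): identifiers, numbers, quoted strings,
# then any single non-space character.
TOKEN = re.compile(r'''[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|"[^"]*"|'[^']*'|\S''')
IDENTIFIER = re.compile(r"[A-Za-z_$][\w$]*")
KEYWORDS = {"function", "return", "const", "let", "var", "if", "else"}


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def segment_hashes(source: str) -> dict:
    renamed = {}
    normalized, shape = [], []
    for tok in TOKEN.findall(source):
        if IDENTIFIER.fullmatch(tok) and tok not in KEYWORDS:
            # Rename identifiers by order of first appearance.
            renamed.setdefault(tok, f"id{len(renamed)}")
            normalized.append(renamed[tok])
            shape.append("ID")
        elif tok[0].isdigit() or tok[0] in "\"'":
            normalized.append(tok)  # literals survive normalization
            shape.append("LIT")     # but collapse in the shape hash
        else:
            normalized.append(tok)
            shape.append(tok)
    return {
        "rawHash": _sha(source),
        "normalizedHash": _sha(" ".join(normalized)),
        "shapeHash": _sha(" ".join(shape)),
    }


a = segment_hashes("function a(b){return b+1}")
b = segment_hashes("function x(y){return y+1}")
c = segment_hashes("function x(y){return y+2}")
```

Under these assumptions, `a` and `b` differ only in `rawHash` (a pure rename), while `b` and `c` share only `shapeHash` (a literal changed), which is the kind of gradation later matching phases can exploit.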
## Release-oriented requirements
- manifest must identify the upstream snapshot being ingested
- output must be deterministic so identical upstream snapshots reuse the same run identity inputs
- segment IDs should persist when matching is strong, but minting new IDs is acceptable when reshaping is severe
- ingest output is the base tree for later relabeling and transform replay
## Verification
- run ingest against a fixture bundle
- confirm segment count is non-zero
- confirm manifest and JSONL are emitted
- spot-check representative segments for stable formatting
@@ -0,0 +1,39 @@
# Phase 10 — build and publish release artifacts
## Goal
Build the final release artifact set from the regularized and maintained tree, include the unmodified upstream-derived source for traceability, and emit release metadata suitable for Git tags, Gitea Releases, and package publication.
## Script
`scripts/build-recovered-view.js`
## Outputs
- canonical editable source tree at repo root
- processed source release artifact from the post-Phase-9 tree
- unmodified upstream-derived source artifact for historical and debugging purposes
- release-ready output bundle or package contents when applicable
- `releases/<version>.manifest.json`
- supporting human-readable release notes derived from the pipeline summary and maintained transforms
## Release manifest requirements
- published version `0.y.z`
- upstream snapshot identity
- current and previous run IDs
- processed and unmodified artifact paths and hashes
- transforms included in the release
- compact upstream summary
- packaging outputs and publication targets
- compact pipeline summary metrics such as naming effectiveness, regularization stats, and replay outcomes
## Publishing model
- Git tag + Gitea Release are the canonical published version identity
- package registry publishes installable artifacts for the same version when applicable
- unmodified upstream-derived source may be attached to releases for traceability/debugging without needing to be part of package-registry payloads
- raw upstream bundles stay as release assets, not committed source files
- publication may be retried from existing built artifacts without rebuilding everything
## Verification
- confirm the release tree builds from regularized upstream-derived source plus transforms
- confirm processed and unmodified artifacts correspond to the same upstream snapshot identity
- confirm release manifest is emitted
- confirm manifest-referenced artifacts exist and hash-match
- confirm package contents and release metadata refer to the same version
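The "manifest-referenced artifacts exist and hash-match" check could look roughly like the sketch below. The manifest layout (an `artifacts` array of `path` and `sha256` entries, with paths relative to the manifest file) is a hypothetical schema; the phase doc lists what the release manifest must contain but not its field names.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import List


def sha256_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_manifest(manifest_path: Path) -> List[str]:
    """Return a list of problems; an empty list means every artifact checks out."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems = []
    for artifact in manifest["artifacts"]:  # hypothetical field name
        target = manifest_path.parent / artifact["path"]
        if not target.is_file():
            problems.append(f"missing: {artifact['path']}")
        elif sha256_file(target) != artifact["sha256"]:
            problems.append(f"hash mismatch: {artifact['path']}")
    return problems


# Usage against a throwaway release directory with one artifact missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "processed.tgz").write_bytes(b"processed")
    manifest = {
        "version": "0.1.0",
        "artifacts": [
            {"path": "processed.tgz", "sha256": hashlib.sha256(b"processed").hexdigest()},
            {"path": "unmodified.tgz", "sha256": "0" * 64},
        ],
    }
    manifest_file = root / "0.1.0.manifest.json"
    manifest_file.write_text(json.dumps(manifest), encoding="utf-8")
    problems = verify_manifest(manifest_file)
```

Returning a list of problems rather than raising on the first one lets a reviewer see every missing or mismatched artifact in a single pass.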
@@ -0,0 +1,32 @@
# Phase 2 — dependency identification and externalization
## Goal
Shrink the app-authored surface by identifying vendored third-party code, recording package-match decisions, and externalizing only high-confidence matches while preserving safe fallbacks.
## Scripts
- `scripts/identify-dependencies.js`
- `scripts/externalize-dependencies.js`
## Inputs
- `--manifest <runs/<id>/manifest.json>`
- `--segments <runs/<id>/segments.jsonl>`
- `--bundle <bundle.js>` optional
- optional npm metadata, tarballs, or CDN mirrors
- optional runtime traces
## Responsibilities
1. score vendored candidates using static evidence first
2. use runtime evidence only as a tie-breaker
3. record accepted, rejected, and unresolved decisions in machine-readable manifests
4. externalize accepted dependencies only
5. preserve bundled fallback code so execution remains safe
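The decision flow in the responsibilities above (static evidence first, runtime evidence only as a tie-breaker, externalize accepted matches only) can be sketched as follows. The threshold value, the status strings, and the function names `decide_package` and `externalizable` are assumptions made for illustration, not documented behavior of the scripts.

```python
from typing import Dict, List, Optional

ACCEPT_THRESHOLD = 0.9  # hypothetical confidence cut-off


def decide_package(
    static_scores: Dict[str, float],
    runtime_hits: Optional[Dict[str, int]] = None,
) -> Dict[str, object]:
    """Pick a package match for one vendored candidate segment."""
    strong = {pkg: s for pkg, s in static_scores.items() if s >= ACCEPT_THRESHOLD}
    if not strong:
        # Assumption: no candidates at all means rejected, weak ones unresolved.
        status = "unresolved" if static_scores else "rejected"
        return {"status": status, "package": None}
    best = max(strong.values())
    tied = sorted(pkg for pkg, s in strong.items() if s == best)
    if len(tied) == 1:
        return {"status": "accepted", "package": tied[0]}
    # Static evidence tied: runtime evidence may break the tie, nothing more.
    hits = runtime_hits or {}
    ranked = sorted(tied, key=lambda pkg: (-hits.get(pkg, 0), pkg))
    if hits.get(ranked[0], 0) > hits.get(ranked[1], 0):
        return {"status": "accepted", "package": ranked[0]}
    return {"status": "unresolved", "package": None}


def externalizable(decisions: Dict[str, Dict[str, object]]) -> List[str]:
    """Only accepted matches are externalized; everything else stays bundled."""
    return sorted(seg for seg, d in decisions.items() if d["status"] == "accepted")


decisions = {
    "seg-1": decide_package({"pkg-a@1.0.0": 0.97, "pkg-b@2.0.0": 0.40}),
    "seg-2": decide_package({"pkg-c@1.0.0": 0.95, "pkg-c@1.1.0": 0.95}),
    "seg-3": decide_package({"pkg-c@1.0.0": 0.95, "pkg-c@1.1.0": 0.95}, {"pkg-c@1.1.0": 3}),
}
to_externalize = externalizable(decisions)
```

In this sketch `seg-2` stays unresolved, and therefore on the bundled fallback path, because runtime evidence is absent and the static tie is never forced into a winner.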
## Release-oriented requirements
- dependency decisions should be reusable across nearby upstream snapshots when hashes and evidence remain compatible
- dependency reports should feed the compact upstream summary for published releases
- do not block later phases on perfect package-version recovery
## Verification
- verify accepted matches externalize cleanly
- verify unresolved matches stay in the bundled fallback path
- confirm dependency decisions are deterministic for the same ingest output
@@ -0,0 +1,31 @@
# Phase 3 — context extraction
## Goal
Build deterministic machine-readable evidence packets for naming, diff summaries, and later transform replay decisions without using an LLM.
## Script
`scripts/extract-context.js`
## Deliverables
- `runs/<run-id>/context/segments.jsonl`
- `runs/<run-id>/context/bindings.jsonl`
- `runs/<run-id>/reports/context-summary.md`
- detailed requirements: [docs/phase-3-requirements.md](file:///home/user/git/amp-decompiled/docs/phase-3-requirements.md)
## Packet contents per segment
- stable segment ID
- canonical code reference
- local binding graph
- visible import-like origins
- member access patterns
- string literals, export hints, error text, CLI flags, and neighboring context
- callsite, assignment, and return usage hints for local functions
## Release-oriented requirements
- context must be deterministic across reruns of the same snapshot
- context should support compact upstream summaries, not just naming prompts
- context packets must remain usable when deriving transform anchors for maintained changes
## Verification
- spot-check bindings and call patterns for representative segments
- confirm packets are deterministic across two runs of the same bundle
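One way to make the two-run determinism check meaningful is to serialize packets canonically before comparing. The sketch below assumes a hypothetical `segmentId` field and a canonical form of sorted keys, compact separators, and records ordered by segment ID; the real packet schema lives in the Phase 3 requirements and may differ.

```python
import json
from typing import Dict, Iterable


def to_canonical_jsonl(packets: Iterable[Dict]) -> str:
    """Serialize packets so that equal content always yields equal bytes."""
    ordered = sorted(packets, key=lambda p: p["segmentId"])  # hypothetical field
    lines = [
        json.dumps(p, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for p in ordered
    ]
    return "\n".join(lines) + "\n"


# Two runs that discovered the same facts in a different order.
run_a = to_canonical_jsonl([
    {"segmentId": "s2", "literals": ["--help"]},
    {"literals": [], "segmentId": "s1"},
])
run_b = to_canonical_jsonl([
    {"segmentId": "s1", "literals": []},
    {"literals": ["--help"], "segmentId": "s2"},
])
```

With a canonical form like this, the verification step reduces to a byte-for-byte comparison of the two emitted files.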
@@ -0,0 +1,89 @@
# Phase 4 — run-to-run diffing and upstream summary
## Goal
Compare the current upstream snapshot to the previous one so the pipeline can preserve durable cross-run lineage, reuse prior names where safe, identify only changed or new segments for LLM work, and emit compact machine-readable facts plus a human-readable upstream-change summary.
## Script
`scripts/diff-runs.js`
## Inputs
- `--prev <runs/<id>/manifest.json>`
- `--next <runs/<id>/manifest.json>`
- Phase 3 context artifacts for both runs
- top-level `lineage/` durable store
## Core model
- A segment is the ingest-level code unit emitted by Phase 1, not an individual variable or binding.
- Run-local segment IDs are not durable identity across runs.
- Phase 4 mints and carries durable lineage identity across runs.
- Every `next` segment gets a lineage record, including brand-new segments.
- Deletions are modeled as retained tombstones, not removed records.
- Split and merge relationships are explicit lineage edges, not implied labels.
## Matching strategy
1. exact `normalizedHash`
2. exact `shapeHash` plus bounded similarity checks on source length
3. deterministic fuzzy matching using segment kind, string literals, Phase 3 context packets, and export hints
4. optional cheap-LLM assistance for ambiguous candidate ranking only; it is advisory and must not become the source of truth for matching, splitting, or diffing
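The first two tiers of the strategy above can be sketched as follows. This is a minimal illustration, not the contents of `scripts/diff-runs.js`: the `id` and `sourceLength` fields, the helper names, and the default bounds (20% and 200 characters) are assumptions made for the example. Only `normalizedHash` and `shapeHash` come from the text.

```javascript
// Hypothetical helper: two source lengths count as similar only when they are
// close in BOTH relative and absolute terms. Default bounds are illustrative.
function lengthsAreSimilar(prevLen, nextLen, maxRatio = 0.2, maxAbs = 200) {
  const delta = Math.abs(prevLen - nextLen);
  const ratio = delta / Math.max(prevLen, nextLen, 1);
  return ratio <= maxRatio && delta <= maxAbs;
}

// Tiers 1 and 2 only. Returns every candidate that passes the tier and never
// picks a winner, so contested cases can still be reported as `ambiguous`.
function matchSegment(nextSeg, prevSegs) {
  const exact = prevSegs.filter(
    (p) => p.normalizedHash === nextSeg.normalizedHash
  );
  if (exact.length > 0) {
    return { tier: "normalizedHash", candidates: exact.map((p) => p.id) };
  }

  const shaped = prevSegs.filter(
    (p) =>
      p.shapeHash === nextSeg.shapeHash &&
      lengthsAreSimilar(p.sourceLength, nextSeg.sourceLength)
  );
  if (shaped.length > 0) {
    return { tier: "shapeHash", candidates: shaped.map((p) => p.id) };
  }

  // Tiers 3 and 4 (fuzzy matching, advisory LLM ranking) are out of scope here.
  return { tier: "none", candidates: [] };
}
```

Returning a candidate list rather than a single match keeps the decision about `ambiguous` versus `modified` in one later place.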
## Matching rules
- Reserve `split` and `merged` for high-confidence structural cases.
- When candidates remain contested, emit `ambiguous` instead of forcing a winner.
- Ambiguous segments still emit full evidence and can contribute passive downstream context.
- Ambiguous segments are excluded from automated lineage-dependent downstream actions until resolved.
- Similarity checks should use both percentage and absolute source-length bounds.
- Matching compares only the adjacent `prev` and `next` runs; transitive lineage across older runs can be added later if adjacent-run matching proves insufficient.
## Output classifications
- `unchanged`
- `modified`
- `new`
- `deleted`
- `split`
- `merged`
- `ambiguous`
## Deliverables
- `runs/<next-id>/reports/changed-segments.json`
- `runs/<next-id>/relabel-queue.jsonl`
- `runs/<next-id>/reports/upstream-summary.json`
- `runs/<next-id>/reports/upstream-summary.md`
- `runs/<next-id>/reports/ambiguous-matches.json`
- append-only lineage events written under top-level `lineage/`
## Artifact shape
- `changed-segments.json` should be canonical per-`next` segment, with explicit match evidence, confidence, lineage IDs, lineage family IDs where relevant, and split/merge links.
- Deleted segments should be emitted as retained tombstones carrying retired lineage IDs and last-known segment metadata.
- Ambiguous match artifacts should include ranked candidate matches and the exact evidence used to score them.
- Machine-readable outputs are the source of truth for later phases.
- Human-readable summary prose is derived from machine facts, not the reverse.
## Queueing rules
- Unchanged segments are excluded from the relabel queue.
- Deleted segments are excluded from the relabel queue.
- New and modified segments are eligible for the relabel queue.
- Ambiguous segments should go to match review, not directly into automated rename reuse.
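The queueing rules above amount to a small routing table. A sketch, assuming hypothetical destination labels (`relabel-queue`, `match-review`, `excluded`) that do not appear in the source:

```javascript
// Maps a Phase 4 classification to where the segment goes next.
// Destination names are illustrative labels, not artifact names.
function routeSegment(classification) {
  switch (classification) {
    case "new":
    case "modified":
      return "relabel-queue";
    case "ambiguous":
      return "match-review";
    case "unchanged":
    case "deleted":
      return "excluded";
    default:
      // The rules above do not say how `split` or `merged` are queued,
      // so this sketch refuses to guess.
      throw new Error(`no queueing rule for classification: ${classification}`);
  }
}
```

Throwing on unlisted classifications keeps a missing rule visible instead of silently sending segments to the LLM.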
## Lineage model
- Durable lineage is stored in a top-level `lineage/` directory.
- The lineage store is append-only.
- Corrections are expressed as superseding events, never mutation or deletion.
- Splits create a lineage family plus child lineage IDs.
- Merges create a new lineage node linked to multiple parents.
- Graph-oriented projections or adjacency indexes may be derived later, but raw lineage events remain canonical.
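One way to picture the append-only store with superseding corrections is the in-memory sketch below. The event fields (`seq`, `supersedes`, `type`, `lineageId`) are assumptions for illustration; the source only fixes the behavior: events are appended, corrections supersede, and nothing is mutated or deleted.

```javascript
// Append-only log: events are pushed and frozen, never edited or removed.
function appendEvent(log, event) {
  const stored = Object.freeze({ ...event, seq: log.length });
  log.push(stored);
  return stored;
}

// Derived projection: hide events that a later event supersedes.
// The raw log stays canonical and still contains the superseded events.
function currentEvents(log) {
  const superseded = new Set(
    log.filter((e) => e.supersedes !== undefined).map((e) => e.supersedes)
  );
  return log.filter((e) => !superseded.has(e.seq));
}
```

For example, a wrong match can be corrected by appending a new event whose `supersedes` points at the earlier `seq`; the projection then skips the old event while the log keeps it for audit.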
## Summary requirements
- The primary audience for the summary is a human reviewer.
- The summary should describe changed capabilities or areas when detectable.
- It should note prompt changes, endpoint changes, feature additions, and important constant or behavior shifts when detectable.
- Such claims should be grounded in explicit evidence from Phase 3 context or other recorded Phase 4 signals.
- Weaker claims should be phrased as detected signals, not asserted as certainty.
- Avoid storing giant line-by-line ledgers.
- Provide enough detail to understand what new material is being sent to the LLM.
## Verification
- Diff two ingests of the same bundle and expect almost all `unchanged`.
- Edit a fixture slightly and verify only nearby segments classify as changed.
- Confirm unchanged and deleted segments are excluded from the LLM queue.
- Confirm ambiguous segments produce a review artifact with ranked candidates and evidence.
- Confirm lineage events are appended under `lineage/` without deleting prior IDs or tombstones.
@@ -0,0 +1,35 @@
# Phase 5 — iterative relabel queue planning and batching
## Goal
Take the rename candidates filtered by Phase 4, rank them by estimated naming difficulty, and export compact iterative work packets that attack the easiest, most evidence-rich names first.
## Script
`scripts/export-relabel-queue.js`
## Queue model
- build the queue from only changed or new segments and only the bindings that still need names
- assign each candidate a naming difficulty score so obvious literals, property-backed aliases, and high-evidence bindings run before opaque arithmetic or pass-through values
- sort the queue by easiest-first priority, with deterministic tie-breaking so reruns stay stable
- track per-candidate `namingAttempts` and stop requeueing once `maxNamingAttempts` is reached
- process the queue in small configurable batches, with 10 candidates as the default starting point
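The ordering rules above can be sketched as a filter plus a comparator. The candidate fields `difficulty`, `segmentId`, and `bindingId`, and the default of 3 for `maxNamingAttempts`, are assumptions for the example; `namingAttempts`, `maxNamingAttempts`, and the batch size of 10 come from the text.

```javascript
// Code-unit string comparison. Avoids localeCompare so the order does not
// depend on the machine's locale.
function compareStrings(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Easiest first, with segment and binding IDs as deterministic tie-breakers.
// Candidates that exhausted their attempts are not requeued.
function buildQueue(candidates, maxNamingAttempts = 3) {
  return candidates
    .filter((c) => c.namingAttempts < maxNamingAttempts)
    .sort(
      (a, b) =>
        a.difficulty - b.difficulty ||
        compareStrings(a.segmentId, b.segmentId) ||
        compareStrings(a.bindingId, b.bindingId)
    );
}

// Takes the next batch from the front of an already sorted queue.
function nextBatch(queue, batchSize = 10) {
  return queue.slice(0, batchSize);
}
```

Because `filter` returns a new array, the input list is left untouched, which makes it easier to rerun the ranking and compare results.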
## Packet contents per candidate or micro-batch
- stable segment ID and candidate binding ID
- smallest useful code slice instead of whole-module context when possible
- canonical code and deterministic context packet references from Phase 3
- neighboring accepted names, previous accepted names, and relevant naming memory when matched
- current `namingAttempts` count and queue priority metadata
- exact requested JSON response schema
- configurable target model version
- required per-name confidence score output, including separate confidence for function names versus parameter or local names when needed
## Requirements
- unchanged segments must not be resent to the LLM
- queue format should support iterative passes where low-confidence items can be retried later after neighboring names are accepted
- packets should preserve enough evidence to avoid speculative renaming while still staying as small as practical
- queue metadata should record why a candidate was included, why it was ranked where it was, and whether it was deferred from an earlier pass
## Verification
- export prioritized queues for representative changed segments
- confirm prompts stay within token budget with the default batch size of 10
- confirm queue entries include prior naming memory, difficulty rank, attempt counters, and target model configuration
@@ -0,0 +1,40 @@
# Phase 6 — relabel API execution and wave scheduling
## Goal
Execute Phase 5 batch artifacts against the configured LLM provider, schedule them in parallel waves against a shared pre-wave queue snapshot, and persist execution outcomes for later evaluation without applying names yet.
## Script
`scripts/execute-relabel-batches.js`
## Shared terms
- a **batch** is one model request containing one or more work items
- a **wave** is a set of batches executed in parallel against the same queue snapshot and reconciled only after execution and retries for that wave finish
## Inputs
- `runs/<run-id>/queue.sqlite`
- Phase 5 batch-ready request artifacts
- provider and model configuration
- wave concurrency configuration
- transport retry and rate-limit configuration
## Execution rules
- Phase 6 owns outbound API execution; Phase 5 stops at persisted batch artifacts and queue state
- execute batches in waves so all batches in one wave see the same pre-wave queue snapshot
- restrict each wave to one model/config in MVP
- support multiple concurrent batch requests within a wave, subject to configurable concurrency and provider rate limits
- retries for transport or provider failures should occur before the wave is closed, within configured retry limits
- transport and provider failures must be tracked separately from semantic naming failures
## Outputs
- persisted wave execution metadata
- persisted batch transport outcomes and retry records
- `runs/<run-id>/batches/<batchId>/request.json`
- `runs/<run-id>/batches/<batchId>/response.json`
- updated artifact references in `runs/<run-id>/queue.sqlite`
## Verification
- confirm waves execute batches in parallel against one pre-wave snapshot
- confirm one wave uses one model/config in MVP
- confirm request and response artifacts are persisted for each executed batch
- confirm transport failures and semantic failures are recorded separately
- confirm retry handling completes before wave reconciliation closes
@@ -0,0 +1,55 @@
# Phase 7 — iterative relabel evaluation, application, and queue feedback
## Goal
Validate Phase 6 model responses, evaluate and apply accepted names safely after wave reconciliation, and feed accepted-name results back into queue state so Phases 5 through 7 can iterate together.
## Script
`scripts/apply-relabel-results.js`
## Inputs
- Phase 5 queue state from `runs/<run-id>/queue.sqlite`
- executed Phase 6 batch artifacts such as `runs/<run-id>/batches/<batchId>/request.json` and `runs/<run-id>/batches/<batchId>/response.json`
- fixed type-specific or pass-specific response schemas emitted by Phase 5
- configured confidence thresholds, including any per-pass-kind overrides
- current generated source files or source metadata
- reusable naming memory state
## Safety rules
- only apply renames for bindings proven local to the segment scope
- keep original identifiers in metadata when useful for traceability
- reject reserved words, invalid identifiers, unchanged names, and local collisions
- validate every model response against the fixed schema for that candidate type or pass kind before using it
- enforce deterministic machine-checkable naming rules derived from the naming-conventions prompt constraints
- reject insufficiently specific names before collision resolution when specificity checks fail
- do not invent fallback names or auto-adjust model proposals
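Several of the safety rules above are mechanical checks on a proposed name. A minimal sketch follows, assuming an ASCII-only identifier pattern and a non-exhaustive reserved-word list; the function name and reason strings are hypothetical. Schema validation, specificity checks, and scope proofs are not covered here.

```javascript
// Not exhaustive: a representative subset of ECMAScript reserved words.
const RESERVED_WORDS = new Set([
  "await", "break", "case", "catch", "class", "const", "continue", "debugger",
  "default", "delete", "do", "else", "enum", "export", "extends", "false",
  "finally", "for", "function", "if", "import", "in", "instanceof", "let",
  "new", "null", "return", "static", "super", "switch", "this", "throw",
  "true", "try", "typeof", "var", "void", "while", "with", "yield",
]);

// Simplification: real JavaScript identifiers also allow many Unicode letters.
const ASCII_IDENTIFIER = /^[A-Za-z_$][A-Za-z0-9_$]*$/;

// Returns an accept or reject verdict. It never rewrites the proposal, in
// line with the rule against fallback names and auto-adjusted proposals.
function checkProposedName(original, proposed, namesInScope) {
  if (!ASCII_IDENTIFIER.test(proposed)) {
    return { ok: false, reason: "invalid-identifier" };
  }
  if (RESERVED_WORDS.has(proposed)) {
    return { ok: false, reason: "reserved-word" };
  }
  if (proposed === original) {
    return { ok: false, reason: "unchanged" };
  }
  if (namesInScope.has(proposed)) {
    return { ok: false, reason: "local-collision" };
  }
  return { ok: true, name: proposed };
}
```

Returning a reason code for each rejection gives the queue something concrete to record when a candidate is requeued or terminally marked.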
## Iteration rules
- reconcile results at the wave boundary; batches in the same wave must not affect each other mid-wave
- accept names independently when the model is confident about one symbol but uncertain about related symbols in the same response
- support partial attempts inside one work item, because a batch attempt does not imply every candidate in that item was meaningfully attempted
- track candidate naming attempts separately from batch attempts and insufficient-evidence counters
- apply accepted names only after wave-level reconciliation has produced the final accepted set for that wave, then apply them without further delay
- write accepted-name feedback back into queue state so Phase 5 can recompute evidence, difficulty, and priority deterministically on the next iteration
- send unresolved candidates to the back of the queue instead of forcing a guess
- stop iterating when all candidates are named, all remaining candidates hit configured retry limits, or no new names are being accepted
## Outputs
- updated `runs/<run-id>/segments.jsonl`
- updated generated source files or source metadata
- updated `runs/<run-id>/queue.sqlite` with candidate results, statuses, defer reasons, attempt counts, insufficient-evidence counts, and accepted-name feedback
- preserved executed batch request and raw response artifacts for replay and audit
- updated `stable/naming-memory.json`
## Release-oriented requirements
- relabel output should improve browsing and editing, but not become a hidden semantic transform layer
- naming memory should remain small and reviewable
## Verification
- validate each response file against the fixed schema expected for its batch or work-item type
- confirm candidate naming attempts, batch attempts, and insufficient-evidence counters are recorded separately
- confirm wave-level reconciliation happens before any names are applied
- parse generated output again with Babel
- confirm no syntax breakage
- confirm accepted names are written into naming memory for future reuse
- confirm accepted-name feedback is persisted to queue state for the next Phase 5 reranking step
- confirm low-confidence, insufficient-evidence, insufficient-specificity, and collision-risk candidates are requeued or terminally marked with correct reasons
@@ -0,0 +1,28 @@
# Phase 8 — deterministic codebase regularization
## Goal
Deterministically convert the recovered post-relabel source into a conventional, more navigable editable tree that humans and LLMs can explore and modify without guessing at the original repository structure.
## Script
`scripts/regularize-codebase.js`
## Workflow
1. take the post-relabel recovered source from earlier phases
2. split coarse recovered units into smaller files or modules only where deterministic boundaries can be proven
3. assign deterministic file and folder placement
4. reconstruct deterministic import and export boundaries between split files
5. emit the canonical editable tree and regularization manifests
6. preserve stable placement for unchanged areas across runs when lineage and structure still match
7. use wrappers only as a last resort and mark them as recovery scaffolding
## Outputs
- canonical editable source tree at repo root
- regularization manifest and lineage mappings under `runs/<run-id>/`
- reusable stable placement metadata under `stable/` when helpful
## Verification
- confirm deterministic reruns produce the same regularized tree and manifest from identical inputs
- confirm the regularized tree parses after structural transformations
- confirm import/export graph consistency is preserved after splitting
- confirm unchanged modules preserve stable placement across runs when lineage matches
- confirm wrapper use remains exceptional and is surfaced in manifests or summary counts
@@ -0,0 +1,36 @@
# Phase 9 — replay maintained transforms
## Goal
Replay externally authored maintained transforms onto the regularized canonical editable tree so upgrades can carry your changes forward automatically where safe.
## Script
`scripts/replay-transforms.js`
## Workflow
1. load maintained transforms from long-lived stable metadata
2. target those transforms against the regularized upstream-derived tree produced by Phase 8
3. replay transforms in deterministic dependency-aware order
4. prefer `jscodeshift` codemods in MVP while allowing other deterministic transform forms when explicitly supported
5. emit explicit conflicts instead of forcing weak or unsafe replays
6. produce replay reports for applied, skipped, and conflicting transforms
## Transform metadata
- transform ID
- transform type
- stable file, module, segment, or lineage anchor
- codemod path or AST selector
- inputs
- declared dependency or ordering metadata
- replay status
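The "deterministic dependency-aware order" from the workflow can be sketched as a topological sort with a fixed tie-break. The `id` and `dependsOn` field names are assumptions standing in for the transform ID and the declared dependency metadata listed above.

```javascript
// Kahn-style topological ordering. Among transforms whose dependencies are
// all satisfied, the smallest ID (code-unit order) always goes first, so the
// same input set yields the same order on every run.
function orderTransforms(transforms) {
  const remaining = new Map(
    transforms.map((t) => [t.id, new Set(t.dependsOn || [])])
  );
  const ordered = [];

  while (remaining.size > 0) {
    const ready = [...remaining.keys()]
      .filter((id) => remaining.get(id).size === 0)
      .sort();
    if (ready.length === 0) {
      // A cycle or a dependency on an unknown transform: report it
      // explicitly instead of forcing an order.
      const stuck = [...remaining.keys()].sort().join(", ");
      throw new Error(`unresolvable transform dependencies: ${stuck}`);
    }
    const next = ready[0];
    ordered.push(next);
    remaining.delete(next);
    for (const deps of remaining.values()) deps.delete(next);
  }
  return ordered;
}
```

Input order does not affect the result, which matters if transform metadata is loaded from a directory listing whose order is not guaranteed.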
## Constraints
- transform authoring and capture are outside the numbered upstream-processing pipeline
- do not derive transforms from git diffs inside this phase
- do not auto-apply destructive file removals in MVP
- do not apply weak-match replays
## Verification
- replay transforms onto the same run successfully
- replay them onto a lightly changed regularized run successfully
- fail with an explicit conflict on an incompatible upstream change
- confirm the replayed tree parses, builds, and passes the maintained and basic tests
@@ -0,0 +1,38 @@
# Phase overview
Use this document as the top-level index for the current release-oriented recovery pipeline.
## Repository model
- repo root is the canonical editable deobfuscated tree
- `runs/` keeps local current and previous upstream snapshot artifacts
- `stable/` keeps long-lived metadata reused across snapshots
- `releases/` stores machine-readable manifests for published deobfuscated versions
- upstream bundles are stored as release assets, not committed into git
## Release model
- an upstream “release” is any snapshot you decide to ingest
- your published versions use your own versioning, e.g. `0.y.z`
- `y` changes when the upstream snapshot changes
- `z` changes when only maintained transforms or packaging change
- publish with both Git tags/Gitea Releases and the package registry
## Cross-phase invariants
- do not rewrite semantics during relabeling
- do not depend on LLM output for splitting or diffing
- only send changed or new segments to the LLM on upgrade
- keep maintained changes replayable as transforms, preferably `jscodeshift` codemods
- keep normal git commits as the human audit log for maintained changes
- emit compact upstream summary manifests instead of line-by-line historical ledgers
- surface low-confidence transform replays as conflicts instead of auto-applying them
## Phases
1. [Phase 1 — deterministic ingest](file:///home/user/git/amp-decompiled/docs/phases/phase-1-deterministic-ingest.md)
2. [Phase 2 — dependency identification and externalization](file:///home/user/git/amp-decompiled/docs/phases/phase-2-overview.md)
3. [Phase 3 — context extraction](file:///home/user/git/amp-decompiled/docs/phases/phase-3-context-extraction.md)
4. [Phase 4 — run-to-run diffing and upstream summary](file:///home/user/git/amp-decompiled/docs/phases/phase-4-run-to-run-diffing.md)
5. [Phase 5 — iterative relabel queue planning and batching](file:///home/user/git/amp-decompiled/docs/phases/phase-5-iterative-relabel-queue-export.md)
6. [Phase 6 — relabel API execution and wave scheduling](file:///home/user/git/amp-decompiled/docs/phases/phase-6-relabel-api-execution-and-wave-scheduling.md)
7. [Phase 7 — iterative relabel evaluation, application, and queue feedback](file:///home/user/git/amp-decompiled/docs/phases/phase-7-iterative-relabel-evaluation-application-and-queue-feedback.md)
8. [Phase 8 — deterministic codebase regularization](file:///home/user/git/amp-decompiled/docs/phases/phase-8-deterministic-codebase-regularization.md)
9. [Phase 9 — replay maintained transforms](file:///home/user/git/amp-decompiled/docs/phases/phase-9-patch-capture-and-replay.md)
10. [Phase 10 — build and publish release artifacts](file:///home/user/git/amp-decompiled/docs/phases/phase-10-build-recovered-source-tree.md)
@@ -0,0 +1,103 @@
# Phase 10 requirements — build and publish release artifacts
## Goal
Phase 10 builds the final release artifact set from the fully processed tree after Phases 1 through 9, includes the unmodified upstream-derived source for traceability, emits machine-readable release metadata, and optionally publishes the resulting artifacts.
Phase 10 is release-oriented only. It does not perform further semantic cleanup or structural regularization.
## Scope
Phase 10 is responsible for:
- building final processed release artifacts from the post-Phase-9 tree
- packaging installable outputs when applicable
- including the unmodified upstream-derived source as a release artifact for historical/debugging purposes
- generating release manifests and release notes from pipeline outputs
- attaching provenance and pipeline-summary metadata to emitted artifacts
- publishing artifacts when requested
- preserving built artifacts locally even when publication is skipped or partially fails
Phase 10 is not responsible for:
- additional code restructuring or cleanup
- transform replay
- upstream summary generation beyond consuming existing summaries
## Inputs
Phase 10 requires the following inputs:
- the final processed tree after Phase 9
- the unmodified upstream-derived source artifact or source snapshot
- Phase 4 upstream summary outputs
- Phase 8 regularization outputs and stats
- Phase 9 replay reports and replay stats
- naming and queue summary metrics from earlier phases
- packaging configuration
- publication target configuration
## Artifact requirements
1. Phase 10 must define explicit artifact categories.
2. MVP artifact categories must include at least:
- `processed-source`
- `unmodified-source`
- `package-output`
- `release-manifest`
- `release-notes`
3. The `processed-source` artifact must reflect the post-Phase-9 canonical editable tree.
4. The `unmodified-source` artifact must reflect the original upstream-derived source for the same upstream snapshot.
5. The unmodified-source artifact exists for traceability and debugging, not as the preferred editable artifact.
6. Phase 10 must retain release artifacts locally even when publishing is skipped or partially fails.
## Manifest and provenance requirements
1. Phase 10 must emit one machine-readable release manifest per version.
2. The manifest must include at least:
- published version `0.y.z`
- upstream snapshot identity
- current and previous run IDs
- artifact categories, paths, and hashes
- packaging outputs
- publication targets
- replay summary information
- compact upstream summary
- pipeline summary metrics
3. The manifest must attach provenance metadata linking artifacts back to run IDs, upstream snapshot identity, and transform replay results.
4. The manifest must record which artifacts were built and which were successfully published.
5. Phase 10 must verify, and the manifest must record, that the processed and unmodified artifacts correspond to the same upstream snapshot identity.
## Pipeline summary metric requirements
1. Phase 10 must include a compact pipeline summary section in release metadata.
2. Pipeline summary metrics must include at least:
- naming effectiveness metrics such as acceptance, stalled, and exhausted counts
- regularization metrics
- transform replay success and conflict counts
3. These metrics are informational in MVP and must not act as release-blocking thresholds by default.
4. Pipeline summary data should support downstream debugging and historical comparison.
## Release note requirements
1. Phase 10 must generate release notes from pipeline outputs rather than relying on upstream patch notes.
2. Release notes must be derived from at least:
- the Phase 4 upstream summary
- the Phase 9 replay summary
- artifact and pipeline metadata
3. Release notes must explicitly distinguish upstream-origin changes from maintained/custom changes.
4. Release notes should stay compact and mechanically derived where practical.
## Versioning requirements
1. Published versions must follow the `0.y.z` scheme.
2. `y` changes when the upstream snapshot changes.
3. `z` changes when only maintained transforms, packaging, or other non-upstream release details change.
4. Git tag, Gitea Release, and package registry publication must refer to the same version identity when published.
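The `0.y.z` rules can be sketched as a small function. Two details are assumptions not stated in the requirements: `y` and `z` advance by exactly one, and `z` resets to 0 when `y` changes. The function name and the `upstreamChanged` option are hypothetical.

```javascript
// Computes the next published version from the previous one.
// upstreamChanged: true when the release ingests a new upstream snapshot,
// false when only maintained transforms or packaging changed.
function nextVersion(previous, { upstreamChanged }) {
  const match = /^0\.(\d+)\.(\d+)$/.exec(previous);
  if (!match) {
    throw new Error(`version does not follow 0.y.z: ${previous}`);
  }
  const y = Number(match[1]);
  const z = Number(match[2]);
  // Assumption: z resets to 0 whenever y is bumped.
  return upstreamChanged ? `0.${y + 1}.0` : `0.${y}.${z + 1}`;
}
```

With these assumptions, `nextVersion("0.4.2", { upstreamChanged: true })` gives `"0.5.0"` and `nextVersion("0.4.2", { upstreamChanged: false })` gives `"0.4.3"`.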
## Publishing requirements
1. Publishing must remain optional at execution time.
2. Phase 10 must support publication to at least:
- Git tag
- Gitea Release
- package registry when applicable
3. Publication failure must be tracked separately from artifact build failure.
4. Publication should be restartable and retriable from existing built artifacts without rebuilding everything.
5. The unmodified-source artifact may be attached to release publications without being included in package-registry payloads when it does not belong there.
## Verification requirements
1. Phase 10 must verify packaging and release integrity for the final artifact set.
2. Phase 10 must confirm the processed artifact set and unmodified artifact set correspond to the same upstream snapshot.
3. Phase 10 must confirm all manifest-referenced artifacts exist and hash-match.
4. Phase 10 must confirm package contents and release metadata refer to the same version.
5. Phase 10 must preserve built artifacts even when publication fails.
6. Heavy parse/build/test checks may rely on earlier phases, but Phase 10 must still perform final artifact and packaging integrity verification.
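Requirements 2 and 3 above (same snapshot, artifacts exist and hash-match) can be sketched with Node's built-in `crypto` module. The manifest shape (`artifacts`, `path`, `sha256`, `category`, `upstreamSnapshot`) and the injected `readArtifact` function are assumptions for illustration; the source does not define the manifest schema or the hash algorithm.

```javascript
const { createHash } = require("node:crypto");

// Hex SHA-256 of a string or Buffer.
function sha256(bytes) {
  return createHash("sha256").update(bytes).digest("hex");
}

// readArtifact(path) returns the artifact bytes, or undefined when missing.
// Injecting it lets the same check run against disk or in-memory fixtures.
function verifyManifest(manifest, readArtifact) {
  const problems = [];
  for (const artifact of manifest.artifacts) {
    const bytes = readArtifact(artifact.path);
    if (bytes === undefined) {
      problems.push({ path: artifact.path, problem: "missing" });
    } else if (sha256(bytes) !== artifact.sha256) {
      problems.push({ path: artifact.path, problem: "hash-mismatch" });
    }
  }

  // Processed and unmodified source must come from one upstream snapshot.
  const snapshots = new Set(
    manifest.artifacts
      .filter(
        (a) =>
          a.category === "processed-source" || a.category === "unmodified-source"
      )
      .map((a) => a.upstreamSnapshot)
  );
  if (snapshots.size > 1) {
    problems.push({ problem: "snapshot-mismatch" });
  }
  return problems;
}
```

An empty result means the checks passed; a non-empty list can be written to the release metadata without discarding the built artifacts.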
@@ -0,0 +1,198 @@
# Phase 2 requirements: dependency identification and externalization
## Purpose
Phase 2 exists to reduce the amount of bundled, obfuscated code that must be decompiled or deobfuscated by identifying vendored third-party dependencies, carving them out of the recovered projection, and replacing them with npm imports in an unbundled repo layout.
The intent is not to recreate the original bundle. The intent is to shrink the app-authored surface area that later phases must understand.
## Problem statement
The input artifact is a single obfuscated JavaScript bundle with inline dependencies and no original `package.json`, lockfile, or source map guarantees. Package identification therefore cannot rely on exact string matching or preserved symbol names. Phase 2 must work even when dependencies are split across multiple obfuscated wrappers and exposed only through structural or behavioral clues.
## Goals
1. Identify likely third-party packages embedded in the obfuscated bundle.
2. Recover probable package boundaries even when a single package spans multiple segments.
3. Rank matches probabilistically and support configurable confidence thresholds.
4. Preserve auditable evidence for every decision.
5. Emit a manifest containing accepted, rejected, and unresolved matches.
6. Externalize accepted dependencies into an unbundled repo projection that imports packages from npm.
7. Preserve original bundled implementations as fallbacks for validation.
8. Prioritize buildable/runnable reconstructed output over human readability.
## Non-goals
1. Recover the exact original dependency versions.
2. Make the reconstructed code human-friendly in Phase 2.
3. Delete bundled vendored code after a dependency is externalized.
4. Depend on exact package-name string matches.
5. Fully automate conflict resolution for ambiguous low-confidence matches.
## Inputs
Phase 2 must support these inputs:
- `manifest.json` from Phase 1 ingest
- `segments.jsonl` from Phase 1 ingest
- optional original bundle path
- optional runtime instrumentation traces
- optional npm registry metadata and tarball contents
- optional package file mirrors from services such as unpkg or jsdelivr
- configurable confidence thresholds for acceptance and review
## Functional requirements
### FR1. Candidate package discovery
The system must scan segment and segment-group candidates for evidence of vendored third-party code.
Accepted evidence sources include:
- license banners
- preserved package names
- source map hints
- preserved `require(...)` strings
- characteristic literal sets and error messages
- helper signatures
- AST shape fingerprints
- export surface similarity
- dependency graph position
- byte or AST similarity against npm package contents
- runtime execution/export traces when available
### FR2. Module-boundary recovery
The system must recover candidate package boundaries even when package code is split across adjacent or related obfuscated wrappers.
This includes:
- grouping adjacent segments
- grouping structurally linked segments
- recording the rationale for the recovered boundary
- allowing one candidate package to map to multiple segments
### FR3. Confidence scoring
The system must assign a confidence score to each candidate package match.
The scoring system must:
- combine multiple evidence types
- tolerate obfuscation and renamed symbols
- rank multiple candidate matches for the same segment group
- support configurable thresholds such as `0.95` for auto-accept and `0.85` for review
### FR4. Decision states
Each candidate match must be recorded in one of these states:
- `accepted`
- `rejected`
- `unresolved`
The system must record why the candidate is in that state.
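FR3 and FR4 together describe a score-then-decide step. The sketch below is one possible reading, not the specified algorithm: it combines per-evidence scores with a noisy-OR rule (an assumption; the requirements do not name a combination method) and maps the review band to `unresolved` (also an assumption). Only the threshold values `0.95` and `0.85` and the three state names come from the text.

```javascript
// Noisy-OR combination: each evidence score in [0, 1] is treated as an
// independent chance that the match is right. An empty list yields 0.
function combineEvidence(scores) {
  return 1 - scores.reduce((miss, s) => miss * (1 - s), 1);
}

// Maps a confidence score to a decision state plus the reason for it.
// Thresholds are configurable, as FR3 requires.
function decide(confidence, thresholds = { accept: 0.95, review: 0.85 }) {
  if (confidence >= thresholds.accept) {
    return { state: "accepted", reason: "at or above auto-accept threshold" };
  }
  if (confidence >= thresholds.review) {
    return { state: "unresolved", reason: "in review band" };
  }
  return { state: "rejected", reason: "below review threshold" };
}
```

Keeping the thresholds as a parameter means they can change between runs without altering the shape of the recorded decision.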
### FR5. Manifest output
Phase 2 must produce a manifest rich enough to drive later reconstruction and reruns.
Each manifest candidate record must include:
- candidate package name
- decision state
- confidence score
- evidence summary
- raw evidence references
- matched segment IDs or group IDs
- recovered boundary notes
- replacement plan
- shim requirement if any
- fallback/original module reference
- evidence provenance such as registry, tarball, CDN, runtime, static, or mixed
- notes on ambiguity, collisions, and follow-up work
### FR6. Externalization
For every accepted dependency match, the system must externalize the dependency into the reconstructed repo projection.
Externalization must:
- use the latest npm package source by default
- replace vendored references in the recovered projection with package imports
- preserve original bundled implementations for fallback and verification
- support CommonJS/ESM interop shims when required
- keep vendored replacement logic separable from app-authored code
### FR7. Fallback preservation
The system must preserve original bundled code for accepted matches so validation can compare imported package behavior with original vendored behavior.
### FR8. Manual review support
The system should provide a narrow manual review path for low-confidence collisions or unresolved candidates, while assuming the default path is automated.
### FR9. Verification harness
Phase 2 must define and support verification steps suitable for partially recovered codebases.
Minimum verification requirements:
- install/resolve verification for accepted npm packages
- syntax/parse validation of generated imports and shims
- smoke-launch verification when feasible
- targeted command checks for important Amp entrypoints when feasible
- selected equivalence checks between npm-backed paths and preserved vendored fallbacks
- staged rollout by confidence threshold
## Output requirements
Phase 2 outputs must include:
- dependency candidate report JSON
- vendored dependency summary report
- dependency externalization report
- updated manifest with candidate decisions
- reconstructed repo projection with npm imports for accepted packages
- preserved fallback references to original vendored implementations
## Buildability requirements
Phase 2 should optimize for a reconstructed repo that can become buildable and runnable before it becomes easy to read.
This means:
- import correctness matters more than naming quality
- interop compatibility matters more than cosmetic cleanup
- preserving execution semantics matters more than source aesthetics
## Acceptance criteria
Phase 2 is acceptable when all of the following are true:
1. The system can identify at least some obvious vendored dependencies from an obfuscated bundle.
2. Accepted, rejected, and unresolved candidates are persisted in the manifest.
3. Accepted candidates can be externalized to npm imports in the recovered projection.
4. Original vendored implementations remain available for fallback or equivalence checking.
5. Confidence thresholds can be changed without redesigning the manifest format.
6. Verification can be run even before full deobfuscation or full Amp test coverage exists.
## Milestones
### Milestone 2A: evidence and scoring
- implement static evidence collection
- implement candidate grouping and package-boundary recovery
- implement ranked match scoring
- emit candidate manifest records
### Milestone 2B: runtime-assisted matching
- add optional runtime instrumentation
- collect export-shape and execution-trace evidence
- merge runtime evidence into scoring
### Milestone 2C: dependency externalization
- fetch latest npm package sources for accepted matches
- emit import replacements
- emit fallback preservation wiring
- add interop shims as needed
### Milestone 2D: verification harness
- implement resolve/install verification
- implement smoke-launch and targeted command checks where possible
- implement selected fallback equivalence checks
- document threshold-based rollout policy
### Milestone 2E: handoff to later phases
- emit reduced unbundled repo state
- emit final Phase 2 manifest as input to later readability and deobfuscation work
## Risks and open questions
- smoke-test coverage may remain weak until later deobfuscation phases exist
- some packages may only partially match npm sources
- some vendored code may be too entangled with app-authored code to externalize immediately
- multiple npm packages may collide on API shape or literal signatures
- latest-package replacement may occasionally drift from the bundled behavior and require fallback retention longer than expected
@@ -0,0 +1,327 @@
# Phase 3 requirements: context extraction
## Purpose
Phase 3 exists to extract deterministic, machine-readable context from the recovered bundle projection so later phases can make better decisions without depending on an LLM for truth.
The primary customer of Phase 3 is the pipeline, not the human reviewer. Human-readable summaries are still useful, but the core output of this phase is canonical evidence that later phases can consume for naming, diffing, and maintained-transform anchoring.
## Relationship to other phases
Phase 3 sits after deterministic ingest and dependency identification, and before run-to-run diffing, relabel queue export, safe relabel application, and transform replay.
This means Phase 3 is responsible for producing evidence, not decisions:
- Phase 2 identifies and externalizes vendored dependencies.
- Phase 3 extracts static context from the recovered projection.
- Phase 4 uses that context when matching and summarizing changed segments.
- Phases 5 and 6 use that context to support staged, incremental renaming.
- Phase 7 uses that context to support stable transform anchors and replay safety.
Phase 3 must therefore remain broad enough to support more than naming alone.
## Problem statement
The recovered source still contains many bindings, functions, object shapes, and cross-segment relationships whose intent is not obvious from local code alone. Purely local segment text is often insufficient to:
- infer whether a binding is likely boolean, string-like, config-like, logger-like, or error-like
- infer how function parameters are used across callsites
- infer what returned values represent from downstream consumption
- recognize recurring object shapes and property clusters
- preserve enough scope and alias information to support later rename safety checks
- provide stable evidence for run-to-run diffing and maintained-transform anchoring
Phase 3 must solve this with deterministic static analysis only. It must not drift into runtime instrumentation, naming decisions, or semantic rewrites.
## Goals
1. Produce one canonical static context packet per segment.
2. Include bounded cross-segment evidence, including directly linked segments and optional physical neighbors.
3. Emit deterministic evidence usable by multiple later phases, not only naming.
4. Record raw evidence and deterministic derived heuristics side-by-side.
5. Preserve ambiguity rather than forcing premature semantic conclusions.
6. Record scope ownership, aliasing, shadowing, and usage sites needed for later rename-safety work.
7. Capture bounded object-shape and call-pattern evidence strong enough to support staged incremental naming later.
8. Provide stable anchors for segment-local records and cross-segment link records across reruns of the same snapshot.
9. Keep artifacts append-only so later phases consume evidence without mutating Phase 3 truth.
## Non-goals
1. Execute the recovered program or require runtime instrumentation.
2. Perform renaming of bindings, properties, functions, files, or segments.
3. Decide package identity or dependency externalization behavior already handled by Phase 2.
4. Rewrite semantics, simplify logic, or refactor recovered code.
5. Walk arbitrary-depth program graphs in ways that make packets unstable or unbounded.
6. Collapse conflicting evidence into a single forced interpretation.
## Clarification on package identification
Package identification is explicitly out of scope for Phase 3 because Phase 2 already owns dependency identification and externalization. Phase 3 may still record static evidence that happens to be suggestive, such as preserved import-like strings or export surfaces, but it must not become a second package-matching system.
## Inputs
Phase 3 must support these inputs:
- `manifest.json` from earlier phases
- deterministic segment records produced by ingest
- canonical code references and stable segment IDs
- any static dependency/link information already produced by earlier phases
- the reconstructed projection after accepted Phase 2 dependency decisions
Phase 3 must not require runtime traces.
## Core output model
Phase 3 must produce a canonical context packet for every segment, plus derived summaries for downstream consumers.
The canonical packet is the source of truth. Any narrower naming-oriented, diff-oriented, or transform-oriented views must be derived from it rather than recomputed independently.
## Functional requirements
### FR1. Canonical per-segment packet
Phase 3 must emit one canonical machine-readable context packet for every segment.
Each packet must include at least:
- stable segment ID
- canonical code reference and compact source spans
- segment-local binding records
- cross-segment link records
- evidence items with provenance
- deterministic heuristics derived from evidence
- unknown or unresolved observations when applicable
Phase 3 should reference canonical code and spans instead of embedding large AST snippets.
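As an illustration only, one packet line in `segments.jsonl` might be shaped like the sketch below. The field names (`segment_id`, `code_ref`, `anchor_id`, and so on) are assumptions for this example; this document does not fix a schema.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceItem:
    anchor_id: str           # hypothetical stable anchor for this evidence item
    provenance: str          # e.g. "static-ast", "binding-graph"
    span: tuple[int, int]    # compact source span instead of an AST snippet
    detail: str


@dataclass
class ContextPacket:
    segment_id: str
    code_ref: str
    bindings: list[dict] = field(default_factory=list)
    links: list[dict] = field(default_factory=list)
    evidence: list[EvidenceItem] = field(default_factory=list)
    heuristics: list[dict] = field(default_factory=list)
    unresolved: list[str] = field(default_factory=list)


def to_jsonl_line(packet: ContextPacket) -> str:
    # Sorted keys and fixed separators keep the serialized form byte-stable.
    return json.dumps(asdict(packet), sort_keys=True, separators=(",", ":"))
```

Keeping spans and references rather than embedded code is what lets the packet stay compact.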
### FR2. Static-only analysis
Phase 3 must operate using deterministic static analysis only.
It must not require:
- executing recovered code
- dynamic tracing
- browser automation
- test harness instrumentation
Future runtime enrichment may be added in a later phase if static results prove insufficient, but it is not part of Phase 3 requirements.
### FR3. Bounded cross-segment evidence
Phase 3 must include bounded cross-segment evidence rather than analyzing each segment in total isolation.
Required cross-segment evidence includes:
- directly linked segments through static import/export-like relationships
- directly linked segments through shared binding or reference relationships when statically evident
- optional physical neighbor segments within a deterministic bounded window
Phase 3 must not depend on unbounded graph walks.
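The bounding rule above can be sketched as a depth-1 lookup plus a fixed neighbor window. The function name, the `window` parameter, and the input shapes are hypothetical; the real link source is whatever earlier phases emit.

```python
def bounded_context(segment_ids, direct_links, target, window=1):
    """Return (linked, neighbors) for `target` without any transitive walk.

    segment_ids  -- segment IDs in physical order (assumed input shape)
    direct_links -- mapping of segment ID to directly linked segment IDs
    window       -- hypothetical neighbor radius; the document only says "bounded"
    """
    linked = sorted(direct_links.get(target, ()))
    i = segment_ids.index(target)
    lo, hi = max(0, i - window), min(len(segment_ids), i + window + 1)
    neighbors = [s for s in segment_ids[lo:hi] if s != target]
    return linked, neighbors
```

Direct links and physical neighbors are returned separately so that adjacency is never mistaken for a semantic relationship.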
### FR4. Binding inventory and usage evidence
Phase 3 must summarize all bindings in a segment.
For every binding, the packet must record as applicable:
- declaration kind
- declaration span
- read sites
- write sites
- call sites
- assignment targets and sources
- return participation
- export participation
- shadowing information
- alias relationships
- scope ownership
The system may emit richer details for salient bindings, but it must still inventory all bindings.
### FR5. Function parameter, return, and callsite evidence
Phase 3 must collect function-level evidence sufficient for later staged naming passes.
This includes:
- parameter usage by position
- argument-shape patterns observed at callsites
- whether parameters are read, written, forwarded, or returned
- returned expression categories
- downstream usage patterns of call results when statically visible
- whether a function appears to act as a wrapper, passthrough, factory, predicate, formatter, logger, or a similar deterministic heuristic category, when evidence supports it
### FR6. Object-shape evidence
Phase 3 must collect bounded object-shape evidence.
This includes:
- observed property reads and writes
- property co-occurrence clusters
- destructuring patterns
- method clusters
- recurring shape fragments across direct links when statically visible
- frequency information for important property accesses
### FR7. Literal and signal extraction
Phase 3 must extract high-value literal signals.
This includes:
- string literals
- number literals where they appear semantically relevant
- error messages
- CLI flags
- environment variable keys
- import-like strings
- export-name hints
Where cheap and deterministic, Phase 3 should store both raw and normalized forms.
### FR8. Heuristics with preserved raw evidence
Phase 3 may emit deterministic heuristics, but only when raw evidence remains attached.
Examples include:
- likely boolean flag
- likely enum-like discriminator
- likely logger
- likely error object
- likely config object
- likely promise-like or callback-like value
Heuristics must be explicitly labeled as heuristics rather than treated as truth.
### FR9. Ambiguity preservation
When evidence conflicts, Phase 3 must preserve competing interpretations.
This includes:
- storing multiple candidate semantic categories
- storing evidence counts and confidence values
- storing unresolved or conflicting states explicitly
Phase 3 must not flatten contradictory evidence into a single forced interpretation.
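A minimal sketch of ambiguity preservation, assuming each observation has already been reduced to a candidate category string. The `state` labels and the share-of-observations "confidence" are assumptions for illustration, not values defined by this document.

```python
from collections import Counter


def summarize_interpretations(observations):
    """Keep every competing category with its count instead of picking a winner."""
    counts = Counter(observations)
    total = sum(counts.values())
    candidates = [
        {"category": c, "count": n, "confidence": round(n / total, 3)}
        # Sort by count (descending), then name, so output order is deterministic.
        for c, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ]
    if not candidates:
        state = "unresolved"
    elif len(candidates) > 1:
        state = "conflicting"
    else:
        state = "single"
    return {"state": state, "candidates": candidates}
```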
### FR10. Provenance and negative evidence
Every evidence item must carry provenance.
At minimum, provenance should distinguish categories such as:
- `static-ast`
- `token`
- `binding-graph`
- `cross-segment-link`
- `literal-extraction`
- `canonicalization`
Phase 3 should also record negative evidence for high-value categories when useful, such as:
- no writes observed
- never returned
- never exported
- only read as condition
### FR11. Stable anchors and identity
Phase 3 must produce stable identifiers for:
- segment-local evidence anchors
- binding records
- cross-segment link records
These identifiers must be stable across repeated runs of the same snapshot, assuming the same canonical segment input.
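One way to get rerun-stable identifiers is to derive them purely from canonical inputs. The sketch below assumes a hash over the record kind, segment ID, and a few discriminating parts such as the declaration span; the ID format is hypothetical.

```python
import hashlib


def stable_id(kind, segment_id, *parts):
    """Derive an ID from canonical content only: no counters, no timestamps."""
    # A unit separator avoids collisions between ("ab", "c") and ("a", "bc").
    payload = "\x1f".join([kind, segment_id, *map(str, parts)])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{kind}:{digest[:16]}"
```

Because nothing in the input depends on traversal order or wall-clock time, the same canonical segment input yields the same ID on every run.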
### FR12. Append-only evidence artifacts
Phase 3 outputs must be treated as append-only evidence artifacts.
Later phases may derive views, decisions, queues, or rename proposals from Phase 3 outputs, but they must not mutate the original Phase 3 evidence packets in place.
### FR13. Support for later rename-safety work
Phase 3 must record prerequisite evidence for later rename-safety checks without performing rename decisions itself.
This includes:
- scope boundaries
- ownership boundaries
- alias chains
- shadowed names
- reads versus writes
- local versus exported visibility
- direct cross-segment references relevant to rename impact
### FR14. Multi-consumer compatibility
Phase 3 outputs must explicitly support at least these downstream consumers:
- Phase 4 diffing and upstream summary
- Phase 5 relabel queue export
- Phase 6 safe relabel application
- Phase 7 transform anchoring and replay safety
A Phase 3 design that only serves naming is insufficient.
## Output requirements
Phase 3 outputs must include:
- `runs/<run-id>/context/segments.jsonl`
- `runs/<run-id>/context/bindings.jsonl`
- `runs/<run-id>/reports/context-summary.md`
The machine-readable outputs must preserve evidence richness even if the report is compact.
## Quality constraints
Phase 3 outputs must be:
- deterministic across repeated runs of the same snapshot
- bounded in size and graph depth
- traceable back to code spans and segment IDs
- explicit about confidence, ambiguity, and provenance
- usable without LLM interpretation as the source of truth
## Acceptance criteria
Phase 3 is acceptable when all of the following are true:
1. The system emits one canonical context packet for each segment.
2. The same snapshot processed twice yields equivalent context outputs and stable anchors.
3. Packets include both local binding evidence and bounded cross-segment evidence.
4. Packets preserve raw evidence and deterministic heuristics side-by-side.
5. Conflicting interpretations remain visible rather than being collapsed.
6. Scope, aliasing, shadowing, and usage evidence is rich enough to support later rename-safety checks.
7. Object-shape, callsite, parameter, and return evidence is present for representative segments.
8. Outputs are suitable for later diffing and transform anchoring, not just naming.
9. No runtime execution is required.
10. No renaming or dependency-identification decisions are performed in this phase.
## Representative edge cases
The requirements and verification examples must cover at least:
- wrapper functions that forward parameters
- reexports and passthrough segments
- shared helper segments used by multiple callers
- object-literal APIs with recurring property sets
- shadowed bindings in nested scopes
- condition-only flags
- values whose observed usage produces conflicting interpretations
- adjacent segments with no direct semantic relationship
## Suggested milestones
### Milestone 3A: canonical packet skeleton
- emit per-segment packets with stable IDs, code references, and binding inventory
- emit initial segment and binding JSONL outputs
### Milestone 3B: cross-segment and call-pattern extraction
- add direct-link records
- add callsite, parameter, and return usage evidence
- add bounded neighbor context
### Milestone 3C: object-shape and literal signal extraction
- add property-cluster and destructuring evidence
- add error text, CLI flag, env key, and import-like signal extraction
- add raw plus normalized literal storage where useful
### Milestone 3D: heuristics, ambiguity, and provenance
- add deterministic heuristics
- add competing interpretations with counts/confidence
- add evidence provenance and selected negative evidence
### Milestone 3E: determinism and downstream validation
- validate rerun stability
- validate representative edge cases
- prove outputs are sufficient for at least one downstream diffing and one downstream relabeling use case
## Verification
Minimum verification requirements:
- run Phase 3 twice on the same snapshot and confirm deterministic output equivalence
- spot-check representative segments for binding, callsite, return, and object-shape evidence
- confirm stable IDs for bindings and cross-segment links across reruns
- confirm physically adjacent but semantically unrelated segments are not over-linked
- confirm conflicting evidence remains represented as ambiguity rather than forced truth
- confirm reports are derivable from machine-readable packets rather than containing extra undiscoverable conclusions
@@ -0,0 +1,288 @@
# Phase 5 requirements — iterative relabel queue planning and batching
## Goal
Phase 5 converts the Phase 4 filtered rename set into a deterministic, resumable, inspectable queue of rename work. It prioritizes the easiest and most evidence-rich naming work first, emits small inseparable work items for model execution, and continuously re-evaluates deferred work as accepted names improve the surrounding evidence.
Phase 5 does not apply renames. It plans, ranks, materializes, and records naming work so Phase 6 can execute model calls and feed accepted names back into the queue.
## Scope
Phase 5 is responsible for:
- consuming changed/new rename candidates and dependency edges from Phase 4
- consuming deterministic context artifacts from Phase 3
- constructing candidate and work-item queue state
- computing evidence, difficulty, and priority scores deterministically
- enforcing pass ordering and dependency-aware scheduling
- selecting batches of work items for execution
- persisting queue state and executed batch artifacts
- re-evaluating deferred and pending items after accepted names from earlier iterations improve context
Phase 5 is not responsible for:
- validating model responses
- deciding whether a proposed name is accepted
- applying code renames
- mutating naming memory directly except through queue planning artifacts
## Core terms
- **rename candidate**: the atomic naming target tracked by the queue. Candidates are typed and may represent a `local`, `param`, `function`, `module`, `alias`, or `property-alias`.
- **work item**: the smallest inseparable unit sent for a naming attempt. A work item may contain one candidate or a deterministic cluster of candidates that must be named together.
- **batch**: a scheduled collection of work items sent together to one configured model.
- **dependency edge**: a directed candidate-to-candidate dependency used to express that one naming decision may become easier after another is accepted.
- **evidence score**: a deterministic score representing how much useful naming evidence currently exists for a candidate.
- **difficulty score**: a deterministic score representing the intrinsic difficulty of naming a candidate or work item.
- **priority score**: a deterministic scheduling score derived from evidence, difficulty, dependency penalties, pass ordering, and aging.
- **deferred**: visible queue state used for candidates that should not be selected yet because current evidence is too weak or unresolved naming dependencies make the attempt too vague.
## Inputs
Phase 5 requires the following inputs:
- Phase 4 filtered rename candidates for changed or new segments only
- Phase 4 dependency edges between candidates, including cross-segment edges when supported by deterministic evidence
- Phase 4 lineage and neighbor links used to constrain cross-segment dependency discovery
- Phase 3 context artifacts and context packet references
- reusable prior naming memory matches when structural matching is strong enough
- Phase 6 accepted-name feedback from earlier iterations in the same run
- configuration for pass ordering, scoring weights, batch size, model selection, token budgets, and threshold values
## Determinism requirements
1. Given identical Phase 3 inputs, Phase 4 inputs, accepted-name feedback, and Phase 5 configuration, Phase 5 must produce identical queue state, scores, work items, and batch selection order.
2. Phase 5 must not invent candidates or dependencies outside deterministic upstream evidence from earlier phases.
3. Phase 5 may derive work-item-level or segment-level rollups from candidate-level edges, but candidate-level edges are the canonical dependency representation.
4. Tie-breaking must be deterministic. At minimum, ordering must sort by `priorityScore`, then break ties by dependency depth, then by stable identifiers such as `segmentId` and `candidateId`.
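The ordering rule in item 4 can be expressed as a single composite sort key. This sketch assumes higher `priorityScore` sorts first and shallower dependency depth wins ties; `dependencyDepth` is a hypothetical field name.

```python
def queue_order(candidates):
    """Total, deterministic order: no two distinct candidates compare equal."""
    return sorted(
        candidates,
        key=lambda c: (
            -c["priorityScore"],      # higher priority first (assumption)
            c["dependencyDepth"],     # shallower dependencies first (assumption)
            c["segmentId"],
            c["candidateId"],         # stable ID is the final tiebreaker
        ),
    )
```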
## Candidate model requirements
Each rename candidate must have a stable ID independent of queue order and batch membership.
Each candidate record must include at least:
- `candidateId`
- `segmentId`
- `bindingType`
- `originalName`
- `scopeId`
- reference to the smallest available code slice artifact
- evidence summary fields
- dependency summary fields
- `evidenceScore`
- `difficultyScore`
- `priorityScore`
- `namingAttempts`
- current status
- defer reason when deferred
- primary pass kind
- target model selection fields or references
Candidates may gain additional usable evidence over time as earlier accepted names improve surrounding context. Minimum viable evidence must therefore be re-evaluated on every queue iteration.
## Candidate taxonomy
Phase 5 must support the following candidate taxonomy in MVP:
- `local`
- `param`
- `function`
- `module`
- `alias`
- `property-alias`
Easy cases such as a literal-backed local variable are still typed as `local`. Their ease is represented through evidence and ranking fields rather than by introducing a separate candidate type.
## Pass ordering requirements
Phase 5 must define and enforce explicit pass kinds in MVP:
- `locals`
- `params`
- `functions`
- `module`
Pass ordering is a scheduling preference, not an absolute ban. Strong direct evidence may justify selecting a later-pass candidate early.
A work item may mix candidate types or pass kinds only when the grouped candidates are truly inseparable. Mixed work items must still declare a single primary pass kind for scheduling and schema selection.
A candidate may appear in later iterations under a different pass kind if improved context changes the most appropriate naming frame.
## Dependency requirements
1. Dependency edges must be explicit and candidate-to-candidate.
2. Dependency edges may cross segment boundaries when they are supported by deterministic Phase 3 and Phase 4 evidence.
3. Cross-segment dependencies must be constrained to explicit linked candidates surfaced by earlier phases rather than arbitrary global inference.
4. Each dependency edge must include a `dependencyType`. MVP must support at least:
- `data-flow`
- `signature`
- `alias`
- `call-site`
5. Unresolved dependencies must influence evidence and priority scoring.
6. Unresolved dependencies do not always make a candidate ineligible. A candidate with strong direct evidence may still be batched despite dependency penalties.
7. Phase 5 must distinguish between:
- deferred because evidence is below minimum viable threshold
- deferred because unresolved naming dependencies make the attempt too vague right now
- attempted but deferred again because the model returned low confidence or insufficient evidence
## Minimum viable evidence requirements
1. Phase 5 must define a minimum viable evidence rule per candidate type.
2. A candidate that does not meet minimum viable evidence must remain visible in queue state as `deferred`; it must not be dropped.
3. Deferred candidates must automatically re-enter normal ranking when new accepted names or other deterministic evidence raise them above minimum viable evidence.
4. Accepted names from earlier iterations must be folded back into candidate evidence before reranking.
5. Evidence dimensions in MVP must include at least:
- literal semantics
- usage patterns
- dependency resolution state
- strong prior naming memory matches
- neighboring accepted names
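Rules 2 through 4 above amount to a re-check on every iteration. A minimal sketch, assuming hypothetical per-type thresholds and a hypothetical `below-minimum-evidence` defer reason:

```python
# Hypothetical thresholds; real values would come from Phase 5 configuration.
MIN_EVIDENCE = {
    "local": 0.3, "param": 0.4, "function": 0.5,
    "module": 0.6, "alias": 0.3, "property-alias": 0.4,
}


def reevaluate(candidate, evidence_score):
    """Return an updated copy; deferred candidates are kept, never dropped."""
    if candidate["status"] in ("accepted", "exhausted"):
        return dict(candidate)
    updated = dict(candidate, evidenceScore=evidence_score)
    if evidence_score < MIN_EVIDENCE[candidate["bindingType"]]:
        updated.update(status="deferred", deferReason="below-minimum-evidence")
    elif candidate["status"] == "deferred":
        # Evidence improved, so the candidate re-enters normal ranking.
        updated.update(status="pending", deferReason=None)
    return updated
```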
## Scoring requirements
Phase 5 must keep the following score families separate:
- `evidenceScore`: how much useful evidence currently exists
- `difficultyScore`: intrinsic naming difficulty
- `priorityScore`: scheduling priority after pass-order preferences, dependency penalties, and aging
### Difficulty scoring
Difficulty scoring must be deterministic and configurable. The requirements document must name weighted dimensions even if exact weights are configured outside the code path.
MVP difficulty dimensions must include at least:
- literal clarity
- dependency depth
- context richness
- prior-memory strength
- ambiguity or collision risk
### Priority scoring
Priority scoring must be deterministic and configurable and must include at least:
- difficulty ordering, favoring easier work first
- pass-kind ordering preference
- unresolved dependency penalties
- aging bump for long-deferred work
- direct evidence strength
Priority scoring must prefer easiest-first but should secondarily prefer co-batching cheap prerequisite neighbors when token budget allows.
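As a rough sketch of how the priority dimensions could combine, the function below uses a plain weighted sum. The weights, the linear form, and the pass ranking are all assumptions; this document only requires that the computation be deterministic and configurable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Hypothetical defaults; real weights live in configuration.
    evidence: float = 1.0
    difficulty: float = 1.0
    pass_order: float = 0.5
    unresolved_dep: float = 2.0
    aging: float = 0.25


PASS_RANK = {"locals": 0, "params": 1, "functions": 2, "module": 3}


def priority_score(evidence, difficulty, pass_kind, unresolved_deps,
                   deferred_iterations, w=Weights()):
    return (
        w.evidence * evidence
        - w.difficulty * difficulty              # easier work first
        - w.pass_order * PASS_RANK[pass_kind]    # preference, not a ban
        - w.unresolved_dep * unresolved_deps
        + w.aging * deferred_iterations          # counted in iterations, not time
    )
```

With these example weights, a literal-backed local with no unresolved dependencies outranks a dependency-heavy one even after the latter has aged for several iterations.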
## Aging and starvation requirements
1. Aging must be based on queue iterations or cycles, not wall-clock time.
2. Aging may increase priority for deferred work, but unresolved dependency penalties must still dominate when appropriate.
3. The queue must include fairness behavior so difficult work is not starved forever before `maxNamingAttempts` is reached.
## Work-item requirements
1. A work item is the smallest inseparable naming unit and must have a stable `workItemId` independent of queue order and batch membership.
2. Work items are built from candidate-level dependency and scope information.
3. Once a work item is constructed as the smallest inseparable unit, Phase 5 must not split it further.
4. A work item may appear in multiple batches over time as context improves.
5. One work item may contain multiple candidates, and one batch attempt may increment attempt state for multiple member candidates.
6. Work items must carry both raw score dimensions and computed scheduling fields.
7. Work items must record why they were selected now, with deterministic machine-readable reason fields such as:
- `easy-literal`
- `dependency-neighbor-resolved`
- `aging-bump`
- `strong-direct-evidence`
## Batch requirements
1. Phase 5 must select batches from queued work items using the current deterministic priority order.
2. Batches are collections of work items, not bare candidate lists.
3. The default starting batch size is 10 work items.
4. A batch may contain work items with one or many candidates.
5. Each work item must target exactly one configured model version for the attempt in which it is scheduled.
6. Batch construction must obey model-specific token budget estimation.
7. Phase 5 must prefer the smallest useful code slice for each work item, including a single assignment or line-level slice when that is sufficient.
8. Every work item must still include enough proving context to avoid speculative naming. At minimum this includes:
- enclosing scope signature or equivalent scope context
- nearest dependent or dependee candidate references when relevant
- relevant Phase 3 evidence references
9. Phase 5 may co-batch nearby dependencies when cheap, but must not inflate context so far that the batch ceases to be compact.
10. Phase 5 must persist artifacts for executed batches; by default it does not need to persist batches that were planned but never executed.
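Batch selection under a size limit and token budget can be sketched as a greedy pass over the already-ordered work items. The `tokenEstimate` field and the skip-and-continue policy for oversized items are assumptions, not behavior this document mandates.

```python
def select_batch(ordered_items, max_items=10, token_budget=8000):
    """Greedily take work items in priority order until a limit is reached."""
    batch, used = [], 0
    for item in ordered_items:
        if len(batch) == max_items:
            break
        if used + item["tokenEstimate"] > token_budget:
            continue  # assumption: skip it, a smaller later item may still fit
        batch.append(item["workItemId"])
        used += item["tokenEstimate"]
    return batch, used
```

Work items are taken whole or not at all, which keeps the smallest-inseparable-unit rule intact.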
## Response schema requirements
Phase 5 must export strict response schemas for model execution.
Response schemas must be fixed and deterministic by candidate type or pass-kind combination.
Each schema must require at least:
- `candidateId`
- proposed name or names
- confidence per proposed name as integer `0-100`
- explicit per-candidate `attempted` or `insufficientEvidence`-style state
- machine-readable low-confidence or partial-attempt note fields when required
Rationale output should be minimized to save tokens.
- High-confidence straightforward cases may omit rationale entirely.
- Low-confidence or partial-attempt cases must emit a compact structured note using fixed fields such as `reasonCode` and `note`.
MVP `reasonCode` values must include at least:
- `insufficient-local-context`
- `unresolved-dependency`
- `ambiguous-semantics`
- `collision-risk`
Phase 5 may request integer confidence `0-100` from the model and normalize it internally later. Acceptance threshold handling belongs to Phase 6.
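A fixed response schema could be exported as a JSON Schema document. The sketch below is one possible shape; the `results`, `proposals`, `name`, `confidence`, and `state` property names are assumptions, while the `reasonCode` values are the MVP values listed above.

```python
import json

REASON_CODES = [
    "insufficient-local-context",
    "unresolved-dependency",
    "ambiguous-semantics",
    "collision-risk",
]


def response_schema(pass_kind):
    """Build a strict schema; property names here are illustrative only."""
    proposal = {
        "type": "object", "additionalProperties": False,
        "required": ["name", "confidence"],
        "properties": {
            "name": {"type": "string"},
            "confidence": {"type": "integer", "minimum": 0, "maximum": 100},
        },
    }
    result = {
        "type": "object", "additionalProperties": False,
        "required": ["candidateId", "state", "proposals"],
        "properties": {
            "candidateId": {"type": "string"},
            "state": {"enum": ["attempted", "insufficientEvidence"]},
            "proposals": {"type": "array", "items": proposal},
            "reasonCode": {"enum": REASON_CODES},
            "note": {"type": "string"},
        },
    }
    return {
        "title": f"relabel-response-{pass_kind}",
        "type": "object", "additionalProperties": False,
        "required": ["results"],
        "properties": {"results": {"type": "array", "items": result}},
    }


def schema_text(pass_kind):
    # Sorted keys make the exported schema identical across runs.
    return json.dumps(response_schema(pass_kind), sort_keys=True)
```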
## Attempt-tracking requirements
1. Phase 5 must track candidate naming attempts separately from batch attempts.
2. A batch attempt does not imply that every candidate in every included work item was meaningfully attempted by the model.
3. Model responses must explicitly distinguish candidates that were attempted from candidates left unresolved due to insufficient evidence.
4. `maxNamingAttempts` applies per candidate, not merely per work item.
5. A work item is exhausted only when all member candidates are either accepted or exhausted.
6. Candidates that were included in a batch but explicitly not attempted must not consume a naming attempt unless the configured attempt policy says they do.
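Per-candidate attempt tracking might look like the sketch below. The `count_unattempted` flag stands in for the configured attempt policy and `max_attempts` for `maxNamingAttempts`; both parameter names and the result shape are hypothetical.

```python
def record_batch_result(attempts, results, count_unattempted=False, max_attempts=3):
    """Update per-candidate attempt counts from one batch response."""
    updated = dict(attempts)
    for r in results:
        # Only explicit attempts count, unless the policy says otherwise.
        if r["state"] == "attempted" or count_unattempted:
            cid = r["candidateId"]
            updated[cid] = updated.get(cid, 0) + 1
    exhausted = sorted(c for c, n in updated.items() if n >= max_attempts)
    return updated, exhausted
```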
## Status model requirements
Phase 5 must support at least the following statuses:
- `pending`
- `batched`
- `accepted`
- `deferred`
- `exhausted`
Deferred state must always be paired with a deterministic defer reason.
## Persistence and artifact requirements
Phase 5 must be resumable and inspectable.
### Authoritative state
- authoritative queue state must live in SQLite under the run directory
- SQLite must store queue metadata, scores, statuses, IDs, dependency edges, batch records, and artifact references
- SQLite must not be used to store large prompt blobs, raw code slices, or raw model response bodies when those can be stored as file artifacts
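A minimal sketch of what the queue tables could look like, using Python's standard `sqlite3` module. Table and column names are assumptions; the `CHECK` constraints illustrate how the status list and the "deferred requires a defer reason" rule could be enforced at the storage layer.

```python
import sqlite3

# Hypothetical schema: only metadata and references, no large blobs.
SCHEMA = """
CREATE TABLE candidates (
    candidate_id     TEXT PRIMARY KEY,
    segment_id       TEXT NOT NULL,
    binding_type     TEXT NOT NULL,
    status           TEXT NOT NULL CHECK (status IN
        ('pending', 'batched', 'accepted', 'deferred', 'exhausted')),
    defer_reason     TEXT,
    evidence_score   REAL NOT NULL,
    difficulty_score REAL NOT NULL,
    priority_score   REAL NOT NULL,
    naming_attempts  INTEGER NOT NULL DEFAULT 0,
    CHECK (status != 'deferred' OR defer_reason IS NOT NULL)
);
CREATE TABLE dependency_edges (
    from_candidate  TEXT NOT NULL REFERENCES candidates(candidate_id),
    to_candidate    TEXT NOT NULL REFERENCES candidates(candidate_id),
    dependency_type TEXT NOT NULL,
    PRIMARY KEY (from_candidate, to_candidate, dependency_type)
);
CREATE TABLE artifacts (
    content_hash TEXT PRIMARY KEY,
    path         TEXT NOT NULL
);
"""


def open_queue(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```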
### Large immutable artifacts
Large immutable artifacts must be stored as files under the run directory and referenced by path and content hash from SQLite.
These artifacts include at least:
- candidate code slices when not already represented by Phase 3 or Phase 4 artifacts
- model request payloads
- raw model responses
- evidence snapshots when they are large enough to justify external storage
Phase 5 must avoid duplicating large code and context blobs when they can be content-addressed and reused.
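Content addressing can be sketched in a few lines: hash the canonical bytes and use the hash as the file name. The `store_artifact` helper is hypothetical, and the path layout under `artifacts/` is an assumption for this example.

```python
import hashlib
import json
from pathlib import Path


def store_artifact(run_dir, payload):
    """Write payload once under its content hash; return (hash, path)."""
    data = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    path = Path(run_dir) / "artifacts" / f"{digest}.json"
    if not path.exists():
        # Identical content maps to the same file, so nothing is duplicated.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest, str(path)
```

The returned hash and path are the only values that would need to be recorded in SQLite.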
### Top-level persisted artifacts
MVP must define exact top-level persisted artifacts, including at least:
- `runs/<run-id>/queue.sqlite`
- `runs/<run-id>/artifacts/<hash>.json`
- `runs/<run-id>/batches/<batchId>/request.json`
- `runs/<run-id>/batches/<batchId>/response.json`
### Batch execution records
Batch execution records must be persisted and must include at least:
- `batchId`
- configured model
- creation timestamp
- included `workItemId` values
- included `candidateId` values or a derived reference
- artifact paths or artifact references
- token estimate
- result status
Executed batch artifacts must include both request and raw response files for replay and audit.
## Examples that the implementation must support
### Easy literal-backed local
A candidate like `L2 = "https://website.site"` should rank as comparatively easy because the literal semantics are strong and local.
### Dependency-heavy expression
A candidate like `r4 = r5 + n2` should rank as comparatively difficult unless surrounding names and evidence have already been filled in. If `r5` and `n2` are later accepted, `r4` must be re-evaluated with improved evidence on a later iteration.
### Function with partially knowable names
A function may have enough evidence to name the function itself while still lacking enough evidence to confidently name one or more parameters. The queue and response schema must support partial attempts and independent confidence per proposed name.
## Verification requirements
Phase 5 verification must confirm at least:
1. unchanged segments are not added to the queue
2. queue construction is deterministic across repeated runs with identical input
3. candidate-level dependency edges are persisted and used in scoring
4. minimum viable evidence is re-evaluated after accepted-name feedback
5. the same candidate can move from deferred to pending after neighboring names are accepted
6. batch selection honors configured batch size, token budget, and deterministic ordering
7. work items remain the smallest inseparable units and are not split later
8. response schemas are emitted in fixed type-specific formats
9. executed batch request and response artifacts are persisted and referenced from SQLite
10. fairness and aging prevent indefinite starvation before `maxNamingAttempts`
11. large prompt and response blobs are stored as file artifacts rather than embedded in SQLite
@@ -0,0 +1,154 @@
# Phase 6 requirements — relabel API execution and wave scheduling
## Goal
Phase 6 executes Phase 5 batch artifacts against OpenRouter, schedules those batch requests in parallel waves against a shared pre-wave queue snapshot, persists execution outcomes for audit and recovery, and hands completed wave results to Phase 7 for semantic evaluation.
Phase 6 does not evaluate proposed names semantically and does not apply renames. It is responsible for outbound execution, execution-state persistence, retry handling, and wave-level completion boundaries.
## Scope
Phase 6 is responsible for:
- consuming Phase 5 queue state and batch-ready request artifacts
- launching batch requests to OpenRouter
- grouping those requests into parallel execution waves
- enforcing concurrency, retry, timeout, and rate-limit behavior
- persisting request, response, and execution metadata
- determining when a wave is terminal and ready for Phase 7 reconciliation
Phase 6 is not responsible for:
- semantic validation of proposed names
- acceptance or rejection of proposed names
- collision resolution between names
- rename application
- naming-memory updates
## Shared terms
- **batch**: one model request containing one or more work items.
- **wave**: a set of batches executed in parallel against the same queue snapshot and reconciled only after execution and retries for that wave finish.
- **pre-wave queue snapshot**: the exact queue state and selected batch set captured before any batch in the wave is launched.
## Inputs
Phase 6 requires the following inputs:
- `runs/<run-id>/queue.sqlite`
- Phase 5 batch-ready request artifacts
- OpenRouter configuration
- requested model configuration for the wave
- global and per-wave concurrency configuration
- timeout configuration
- retry and backoff configuration
- rate-limit handling configuration
## OpenRouter execution requirements
1. Phase 6 must target OpenRouter explicitly in MVP.
2. Each batch request must be executed against the exact persisted request artifact produced by Phase 5.
3. Retries must reuse the same request bytes as the original attempt; Phase 6 must not regenerate or mutate the request payload during retry.
4. Phase 6 must persist the requested model and, when available, the actual routed model returned by OpenRouter.
5. OpenRouter-specific metadata may be recorded when available, but its absence must not by itself make an otherwise successful batch execution fail.
6. Sensitive credentials, auth headers, API keys, and secret-bearing configuration values must never be persisted in artifacts or logs.
## Wave requirements
1. Phase 6 must execute batches in waves.
2. Every batch in a wave must observe the same pre-wave queue snapshot.
3. No batch result may affect any other batch in the same wave before the wave closes.
4. A wave must be restricted to one model/config in MVP.
5. Phase 6 must persist a deterministic `waveId` and a wave manifest before launching requests.
6. The wave manifest must include at least:
- `waveId`
- queue snapshot identifier or hash
- selected model/config
- selected batch IDs
- start and end timestamps
- wave status
- aggregate counts by terminal batch status
7. A wave remains open until all of its batches are terminal for Phase 6.
8. Phase 6 terminal batch states must include at least:
- `succeeded`
- `schema-failed`
- `retry-exhausted`
9. A wave may complete with partial success.
10. `partial-complete` means at least one batch ended in a non-success terminal state, but wave reconciliation may still proceed.
11. `failed` must be reserved for wave-level infrastructure failure such as manifest or execution-state corruption, not ordinary batch-level execution problems.
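The wave-closing rules above can be sketched as a pure function over the expected batch IDs and the terminal statuses recorded so far. This is an illustration only: `summarizeWave`, `WaveSummary`, and the status names `open` and `complete` are assumptions, since the spec fixes only `partial-complete` and `failed`. The `failed` status is omitted because it is reserved for infrastructure failure rather than derived from batch outcomes.

```typescript
type TerminalBatchStatus = "succeeded" | "schema-failed" | "retry-exhausted"

// "open" and "complete" are assumed names; the spec only names "partial-complete" and "failed".
type WaveStatus = "open" | "complete" | "partial-complete"

interface WaveSummary {
  status: WaveStatus
  counts: Record<TerminalBatchStatus, number>
}

const summarizeWave = (
  expectedBatchIds: readonly string[],
  terminal: Readonly<Record<string, TerminalBatchStatus | undefined>>,
): WaveSummary => {
  const counts: Record<TerminalBatchStatus, number> = {
    "succeeded": 0,
    "schema-failed": 0,
    "retry-exhausted": 0,
  }
  let missing = 0
  for (const batchId of expectedBatchIds) {
    const status = terminal[batchId]
    if (status === undefined) missing += 1
    else counts[status] += 1
  }
  // The wave stays open until every expected batch is terminal.
  if (missing > 0) return { status: "open", counts }
  const allSucceeded = counts["succeeded"] === expectedBatchIds.length
  return { status: allSucceeded ? "complete" : "partial-complete", counts }
}
```

The `counts` field corresponds to the "aggregate counts by terminal batch status" entry of the wave manifest.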
## Parallelism and rate-limit requirements
1. Phase 6 must support parallel execution of multiple batch requests within a wave.
2. Phase 6 must support both global-run concurrency control and per-wave concurrency control.
3. Phase 6 must respond to OpenRouter rate-limit signals when available, including provider hints such as `Retry-After`.
4. Phase 6 must also enforce a local concurrency governor independent of provider signals.
5. When rate-limit pressure is detected, Phase 6 should throttle or stop launching additional batches within the wave rather than continuing at the same rate.
6. A run must not require strictly serial execution of all batches.
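One way to read requirements 2, 4, and 5 together is as a launch budget computed before each scheduling step. The sketch below is hypothetical: `launchBudget`, `ConcurrencyLimits`, and `InFlightCounts` are illustrative names, and it takes the "stop launching" option under rate-limit pressure, although the spec also allows throttling.

```typescript
interface ConcurrencyLimits {
  globalMax: number // run-wide cap on in-flight batches
  perWaveMax: number // cap on in-flight batches for one wave
}

interface InFlightCounts {
  global: number
  wave: number
}

// Number of additional batches that may be launched right now.
// While rate-limit pressure is active, nothing new is launched.
const launchBudget = (
  limits: ConcurrencyLimits,
  inFlight: InFlightCounts,
  rateLimited: boolean,
): number => {
  if (rateLimited) return 0
  const globalRoom = limits.globalMax - inFlight.global
  const waveRoom = limits.perWaveMax - inFlight.wave
  return Math.max(0, Math.min(globalRoom, waveRoom))
}
```

Because the budget depends only on local counters and a boolean, the governor keeps working when the provider sends no rate-limit hints at all.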
## Retry, timeout, and failure requirements
1. Phase 6 must support configurable retry limits.
2. Phase 6 must support configurable wave timeout limits so a stuck provider interaction cannot stall forever.
3. Retry behavior must use provider-hint-aware exponential backoff.
4. Retries are allowed for transport and provider failures.
5. Retries are not required for semantic or schema failures in MVP.
6. A transport/provider failure must be recorded separately from a semantic naming failure.
7. When a response body was successfully received but contains invalid or truncated JSON, the result must be treated as a schema or semantic failure, not as a transport failure.
8. If some batches fail after retry exhaustion, the wave must still be allowed to close with partial results.
9. Successful batches in that wave must still be available for later Phase 7 reconciliation.
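Requirement 3 ("provider-hint-aware exponential backoff") can be sketched as a pure delay calculation. The names `BackoffConfig` and `nextRetryDelayMs` are hypothetical. Treating the provider hint as a lower bound that may exceed `maxDelayMs` is one possible interpretation, and jitter is left out to keep the sketch deterministic.

```typescript
interface BackoffConfig {
  baseDelayMs: number
  maxDelayMs: number
  maxRetries: number
}

// Delay before retry number `retry` (1-based), or null once retries are exhausted.
// A provider hint such as Retry-After, when present, acts as a lower bound
// and is not capped by maxDelayMs in this sketch.
const nextRetryDelayMs = (
  config: BackoffConfig,
  retry: number,
  retryAfterMs?: number,
): number | null => {
  if (retry > config.maxRetries) return null
  const exponential = Math.min(
    config.maxDelayMs,
    config.baseDelayMs * Math.pow(2, retry - 1),
  )
  return retryAfterMs === undefined ? exponential : Math.max(exponential, retryAfterMs)
}
```

A `null` result is the point at which a batch would move to `retry-exhausted`.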
## Batch execution-state requirements
1. Each batch must have a stable `batchId` across retries.
2. Retry attempts must be recorded under the batch rather than creating new batch identities.
3. Batch execution statuses must be explicit. MVP must support statuses equivalent to at least:
- `queued`
- `in-flight`
- `succeeded`
- `transport-failed`
- `schema-failed`
- `retry-exhausted`
4. A batch must not move to `succeeded` until the raw response has been fully persisted.
5. If the process crashes after response receipt but before persistence is complete, the batch must not remain marked as `succeeded`.
6. On restart, interrupted `in-flight` batches must be returned to a retryable state with attempt history preserved.
7. Phase 6 must persist a deterministic list of expected batch IDs for the wave so restart recovery can detect missing completions.
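Restart recovery (requirements 6 and 7) can be illustrated with two small pure functions. This is a sketch under stated assumptions: `BatchRecord`, `recoverOnRestart`, and `missingCompletions` are hypothetical names, attempt history is reduced to a count, and `queued` is assumed to be the retryable state that interrupted batches return to.

```typescript
type BatchStatus =
  | "queued"
  | "in-flight"
  | "succeeded"
  | "transport-failed"
  | "schema-failed"
  | "retry-exhausted"

interface BatchRecord {
  batchId: string
  status: BatchStatus
  attempts: number // attempt history is summarized as a count in this sketch
}

const TERMINAL: readonly BatchStatus[] = ["succeeded", "schema-failed", "retry-exhausted"]

// Assumption: "queued" is the retryable state for interrupted batches.
// The attempt count is carried over unchanged.
const recoverOnRestart = (batch: BatchRecord): BatchRecord =>
  batch.status === "in-flight" ? { ...batch, status: "queued" } : batch

// Expected batch IDs that have no terminal record yet.
const missingCompletions = (
  expectedBatchIds: readonly string[],
  records: readonly BatchRecord[],
): string[] =>
  expectedBatchIds.filter(
    (id) => !records.some((r) => r.batchId === id && TERMINAL.indexOf(r.status) !== -1),
  )
```

Comparing against the persisted expected-ID list is what lets recovery notice a batch that never produced any record at all.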
## Persistence and artifact requirements
Phase 6 must be resumable and auditable.
### Authoritative state
- authoritative execution state must be persisted under the run directory
- `runs/<run-id>/queue.sqlite` must include references to wave records, batch records, retry attempts, artifact paths, and terminal execution states
- execution-state persistence must support crash recovery without ambiguity about which batches remain retryable
### Required artifact layout
MVP must define exact top-level artifacts including at least:
- `runs/<run-id>/waves/<waveId>/manifest.json`
- `runs/<run-id>/batches/<batchId>/request.json`
- `runs/<run-id>/batches/<batchId>/response.json`
### Attempt records
1. Each batch attempt must record at least:
- attempt number
- request timestamp
- response timestamp when present
- HTTP status when present
- error class when present
- retry-after or equivalent provider hint when present
- latency when measurable
2. If no response body was received, Phase 6 must still persist lightweight failed-attempt metadata.
3. Raw response artifacts should be persisted when a response body exists.
## Idempotency and reproducibility requirements
1. Phase 6 must preserve idempotent execution boundaries for the same persisted wave snapshot.
2. Rerunning execution for the same wave snapshot must not duplicate semantic outcomes; this holds because Phase 6 itself never applies names.
3. Request retries for the same batch must reuse the identical request artifact bytes.
4. Phase 6 must persist enough metadata to reproduce what was sent to OpenRouter and what came back.
## Handoff requirements to Phase 7
1. Phase 6 must hand off completed wave results to Phase 7 only after the wave reaches a terminal wave state.
2. Phase 6 must expose which batches succeeded, which batches ended in schema failure, and which batches exhausted retries.
3. Phase 7 must receive persisted request and raw response artifacts rather than reconstructed approximations.
4. Phase 6 must not perform semantic acceptance decisions before handoff.
## Verification requirements
Phase 6 verification must confirm at least:
1. batches are executed in parallel within a wave
2. all batches in a wave share one pre-wave queue snapshot
3. one wave uses exactly one model/config in MVP
4. request and response artifacts are persisted for each executed batch with a response
5. transport/provider failures and schema/semantic failures are recorded separately
6. retries use the same persisted request bytes
7. interrupted in-flight batches return to a retryable state on restart
8. a wave closes only after all batches are terminal for Phase 6
9. partial-complete waves still hand successful batch results to Phase 7
10. auth headers, API keys, and other secrets are absent from persisted artifacts and logs
@@ -0,0 +1,175 @@
# Phase 7 requirements — iterative relabel evaluation, application, and queue feedback
## Goal
Phase 7 evaluates completed Phase 6 wave results, performs deterministic semantic validation of proposed names, applies only accepted names safely, feeds accepted and structured rejection feedback back into queue state, and updates naming memory for future runs.
Phase 7 does not execute outbound API requests. It operates only after a wave has completed and uses persisted Phase 6 artifacts rather than reconstructed approximations.
## Scope
Phase 7 is responsible for:
- consuming completed wave results from Phase 6
- validating fixed response schemas at the semantic handoff boundary
- evaluating proposals through a deterministic acceptance pipeline
- performing AST-aware scope ownership and collision checks
- applying accepted renames to source and metadata
- updating candidate counters, statuses, rejection reasons, and feedback fields
- feeding accepted and rejection feedback to Phase 5 for reranking
- updating `stable/naming-memory.json` from accepted names only
- determining no-progress behavior for the iterative loop
Phase 7 is not responsible for:
- batch construction
- outbound API execution
- transport retry handling
- inventing replacement names or modifying model proposals
## Inputs
Phase 7 requires the following inputs:
- completed Phase 6 wave records
- `runs/<run-id>/queue.sqlite`
- persisted batch request and raw response artifacts from Phase 6
- fixed type-specific or pass-specific response schemas emitted by Phase 5
- generated source files or source metadata to be renamed
- deterministic naming-rule configuration derived from naming-conventions prompt constraints
- confidence thresholds and counter thresholds
- current `stable/naming-memory.json`
## Acceptance unit requirements
1. The atomic acceptance unit is the individual proposed name field within a candidate.
2. Phase 7 must support partial acceptance within one candidate.
3. A candidate with one accepted field and one rejected field still counts as progress and must feed accepted feedback back into queue state.
4. Phase 7 must support partial terminality, where one field becomes terminal while another field in the same candidate remains retryable.
## Wave-boundary requirements
1. Phase 7 must evaluate results only after a Phase 6 wave is complete.
2. Phase 7 must not evaluate or apply names incrementally per batch inside an open wave.
3. Collision resolution and final acceptance decisions must be computed across the reconciled wave result set, not one batch at a time.
4. Accepted names from one batch in the wave must not affect another batch mid-wave; they may only affect final wave reconciliation outcomes.
## Deterministic acceptance pipeline
Phase 7 must define and follow a fixed deterministic acceptance pipeline.
At a minimum, the pipeline order must be:
1. schema validity at the Phase 7 handoff boundary
2. candidate existence and result-to-candidate matching
3. AST-proven scope ownership of the target binding
4. identifier validity and reserved-word checks
5. deterministic machine-checkable naming-rule checks
6. deterministic specificity checks
7. confidence threshold checks
8. AST-aware collision resolution within the proven target scope
9. final acceptance and application
Additional rules:
- deterministic checks must outrank model confidence
- collision resolution must happen only after all earlier gates have passed
- Phase 7 must not invent fallback names, auto-suffix names, or otherwise modify model proposals
- remaining fields in a multi-field result must still be evaluated independently when one field fails
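The ordering rule can be sketched as a list of gates evaluated in sequence, where the first failing gate determines the rejection reason. The sketch covers only three of the nine stages (identifier validity, reserved words, confidence) and every name in it (`FieldProposal`, `Gate`, `firstRejection`, the sample reserved-word list, the ASCII-only identifier pattern) is an assumption made for illustration.

```typescript
type RejectionReason = "invalid-identifier" | "reserved-word" | "low-confidence"

interface FieldProposal {
  candidateId: string
  field: string
  proposedName: string
  confidence: number
}

// A gate returns a rejection reason, or null when the proposal passes.
type Gate = (proposal: FieldProposal) => RejectionReason | null

// Sample subset only; a real check would use the full reserved-word list.
const RESERVED: readonly string[] = ["class", "function", "return", "var", "let", "const"]

// Simplified ASCII-only identifier pattern.
const identifierGate: Gate = (p) =>
  /^[A-Za-z_$][A-Za-z0-9_$]*$/.test(p.proposedName) ? null : "invalid-identifier"

const reservedWordGate: Gate = (p) =>
  RESERVED.indexOf(p.proposedName) === -1 ? null : "reserved-word"

const confidenceGate = (threshold: number): Gate => (p) =>
  p.confidence >= threshold ? null : "low-confidence"

// Gates run in list order; deterministic gates are listed before confidence.
const firstRejection = (
  gates: readonly Gate[],
  proposal: FieldProposal,
): RejectionReason | null => {
  for (const gate of gates) {
    const reason = gate(proposal)
    if (reason !== null) return reason
  }
  return null
}
```

Because each field is a separate `FieldProposal`, a failure in one field does not stop evaluation of the other fields of the same candidate.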
## Rejection-reason taxonomy
Phase 7 must define and persist a fixed rejection-reason taxonomy.
MVP reasons must include at least:
- `schema-invalid`
- `candidate-missing`
- `unsafe-scope`
- `invalid-identifier`
- `reserved-word`
- `unchanged-name`
- `naming-rule-violation`
- `insufficient-specificity`
- `low-confidence`
- `collision-risk`
- `insufficient-evidence`
- `non-progress`
`non-progress` is a first-class umbrella reason. `unchanged-name` is one concrete form of non-progress.
Rejected and deferred outcomes must be persisted in history for audit and later prompt feedback.
## Naming-rule and specificity requirements
1. Phase 7 must enforce deterministic machine-checkable naming rules derived from the naming-conventions prompt constraints.
2. Phase 7 must enforce only the machine-checkable subset of those prompt constraints.
3. Naming-rule checks and specificity checks must be candidate-type-aware.
4. Candidate types include at least the Phase 5 taxonomy, such as `local`, `param`, `function`, `module`, `alias`, and `property-alias`.
5. `insufficient-specificity` must be fed back into Phase 5 so future prompts can explicitly request a more specific name.
6. The exact naming-rule configuration format may remain implementation-defined as long as enforcement is deterministic.
## Counter and threshold requirements
1. `maxNamingAttempts` must remain separate from specialized counters.
2. `insufficient-evidence` must have its own counter and max threshold.
3. `insufficient-specificity` must have its own counter and max threshold.
4. Phase 7 is responsible for incrementing these counters from validated wave results.
5. A candidate or field that reaches the specialized insufficient-evidence or insufficient-specificity threshold becomes terminally `stalled`.
6. A candidate or field that reaches the general naming-attempt threshold becomes terminally `exhausted`.
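The relationship between the three counters and the two terminal states can be sketched as follows. `FieldCounters`, `CounterThresholds`, `classifyByCounters`, and the `retryable` label are hypothetical. Checking the specialized thresholds before the general one is an assumption, because the spec does not say which wins when both are reached in the same wave.

```typescript
interface FieldCounters {
  namingAttempts: number
  insufficientEvidence: number
  insufficientSpecificity: number
}

interface CounterThresholds {
  maxNamingAttempts: number
  maxInsufficientEvidence: number
  maxInsufficientSpecificity: number
}

// "retryable" is an assumed label for "no threshold reached yet".
type CounterOutcome = "retryable" | "stalled" | "exhausted"

const classifyByCounters = (
  counters: FieldCounters,
  thresholds: CounterThresholds,
): CounterOutcome => {
  // Assumption: specialized thresholds take precedence over the general one.
  if (
    counters.insufficientEvidence >= thresholds.maxInsufficientEvidence ||
    counters.insufficientSpecificity >= thresholds.maxInsufficientSpecificity
  ) {
    return "stalled"
  }
  if (counters.namingAttempts >= thresholds.maxNamingAttempts) return "exhausted"
  return "retryable"
}
```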
## Collision and scope requirements
1. Phase 7 must use AST-aware scope ownership proof before rename application.
2. Phase 7 must use AST-aware collision detection within the proven target scope.
3. Collision checks must be limited to the relevant proven scope, not global project-wide names.
4. If a proposal would collide in one scope but not another due to shadowing, only the proven target scope matters.
5. If two colliding proposals in the same scope are both too generic or insufficiently specific, both must be rejected with `insufficient-specificity` before winner selection.
6. Otherwise, Phase 7 must deterministically choose a single winner by:
- higher confidence
- then higher Phase 5 priority
- then stable identifier tie-break
7. The losing proposal must be rejected with `collision-risk`.
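Winner selection in requirement 6 amounts to a three-key comparison. The sketch below assumes the contenders already share one proven scope and one proposed name, and that the specificity pre-check in requirement 5 has already run. `ScopedProposal`, `compareProposals`, and `resolveCollision` are hypothetical names, and ascending code-unit order on `candidateId` is an assumed form of the "stable identifier tie-break".

```typescript
interface ScopedProposal {
  candidateId: string
  proposedName: string
  confidence: number
  priority: number // Phase 5 priority; higher wins
}

// Negative result means `a` ranks ahead of `b`.
const compareProposals = (a: ScopedProposal, b: ScopedProposal): number => {
  if (a.confidence !== b.confidence) return b.confidence - a.confidence
  if (a.priority !== b.priority) return b.priority - a.priority
  if (a.candidateId === b.candidateId) return 0
  // Assumed tie-break: ascending candidateId, locale-independent.
  return a.candidateId < b.candidateId ? -1 : 1
}

// Returns the single winner and the losers, which would be rejected with `collision-risk`.
const resolveCollision = (
  contenders: readonly ScopedProposal[],
): { winner: ScopedProposal; losers: ScopedProposal[] } | null => {
  const ranked = contenders.slice().sort(compareProposals)
  const winner = ranked[0]
  if (winner === undefined) return null
  return { winner, losers: ranked.slice(1) }
}
```

Plain `<` comparison is used instead of `localeCompare` so the tie-break does not vary with the host locale.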
## Application requirements
1. Phase 7 must apply accepted names only after wave-level reconciliation has produced the final accepted set for that wave.
2. Accepted names may be applied in any deterministic order after outcomes are fixed.
3. Application order must be persisted for audit even though it must not affect acceptance outcomes.
4. Phase 7 must update generated source files or source metadata consistently with accepted names.
5. Phase 7 must not apply rejected, deferred, stalled, or exhausted names.
## Feedback requirements to Phase 5
1. Phase 7 must feed back both accepted names and structured rejection feedback to Phase 5.
2. Feedback must update at least:
- candidate status
- per-field outcome
- counters
- rejection reasons
- prompt-hint or defer-reason fields needed for later iterations
3. Feedback written by Phase 7 must be sufficient for Phase 5 reranking without rereading raw response bodies.
4. Accepted-name feedback must improve the evidence available to neighboring and dependent candidates in later iterations.
## Naming memory requirements
1. Phase 7 owns updating `stable/naming-memory.json`.
2. Naming memory must be updated only from accepted names.
3. Rejected, deferred, stalled, or exhausted proposals must never be written into naming memory.
## No-progress loop requirements
1. Phase 7 must track no-progress behavior at the wave level.
2. A wave with zero accepted names counts as a no-progress wave.
3. Repeated non-progress responses for a candidate count toward no-progress logic even though the candidate was technically attempted.
4. Loop termination for no-progress behavior must use a configurable consecutive no-progress wave threshold.
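Requirements 2 and 4 describe a consecutive counter that resets whenever a wave accepts at least one name. A minimal sketch, with `LoopState` and `afterWave` as hypothetical names:

```typescript
interface LoopState {
  consecutiveNoProgressWaves: number
}

// Updates the counter after one reconciled wave and reports whether the loop should stop.
const afterWave = (
  state: LoopState,
  acceptedNames: number,
  noProgressThreshold: number,
): { state: LoopState; stop: boolean } => {
  const consecutive = acceptedNames === 0 ? state.consecutiveNoProgressWaves + 1 : 0
  return {
    state: { consecutiveNoProgressWaves: consecutive },
    stop: consecutive >= noProgressThreshold,
  }
}
```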
## State model requirements
1. Phase 7 must support explicit terminal states including at least:
- `accepted`
- `stalled`
- `exhausted`
2. Nonterminal states such as `pending` and `deferred` continue to be carried from earlier phases.
3. Phase 7 must persist outcomes at both per-field and per-candidate levels.
4. A candidate may remain partially active when different fields are in different states.
## Persistence requirements
1. Phase 7 must persist per-field validation outcomes and final per-candidate outcomes.
2. Phase 7 must persist structured rejection history for audit.
3. Phase 7 must persist accepted-name feedback into `runs/<run-id>/queue.sqlite`.
4. Phase 7 must preserve references to the original Phase 6 request and response artifacts used for the decision.
## Verification requirements
Phase 7 verification must confirm at least:
1. Phase 7 evaluates only completed waves
2. the deterministic acceptance pipeline is followed in order
3. deterministic checks outrank model confidence
4. AST-aware scope ownership is proven before rename application
5. AST-aware collision detection is performed within the proven scope
6. partial acceptance and partial terminality are supported correctly
7. rejected and deferred outcomes are preserved in history with fixed reason codes
8. accepted names are applied only after wave reconciliation
9. modified code is re-parsed after each reconciled wave
10. only intended bindings changed during rename application
11. naming memory receives accepted names only
12. feedback written to Phase 5 is sufficient for reranking without rereading raw responses
@@ -0,0 +1,91 @@
# Phase 8 requirements — deterministic codebase regularization
## Goal
Phase 8 deterministically converts the recovered post-relabel source into a conventional, significantly more navigable editable tree that humans and LLMs can explore and modify more effectively.
The goal is not to recover Amp's original repository layout. The goal is to produce a stable, semantics-preserving, regularized codebase that is easier to inspect, edit, and maintain across upstream updates.
## Scope
Phase 8 is responsible for:
- transforming recovered post-relabel source into a conventional project tree
- splitting coarse recovered output into smaller files or modules when deterministic boundaries can be proven
- assigning deterministic file and folder placement
- reconstructing deterministic import and export boundaries between split files
- emitting a canonical editable tree for later maintained work
- emitting regularization manifests and lineage mappings
- preserving or reusing stable placement for unchanged areas across runs
Phase 8 is not responsible for:
- inferring original repository structure
- making semantic feature changes
- using LLM output to guess structure
- replacing maintained transform replay logic from later phases
## Core requirements
1. Phase 8 must be strictly deterministic and non-LLM.
2. Phase 8 must operate on the post-relabel recovered source produced by earlier phases.
3. Phase 8 must produce the canonical editable tree that later maintained edits and transforms target.
4. Structural regularization success and semantic understanding success are distinct. Phase 8 may improve editability without fully understanding every internal name or subsystem.
5. Some code may remain ugly or partially obfuscated if that is the deterministic cost of avoiding speculative structure.
## Structural regularization requirements
1. Phase 8 must split large or coarse recovered units into smaller files or modules when deterministic boundaries can be proven.
2. Split heuristics must be defined at a high level and may include proven exported-object clusters, helper groups, vendor versus app boundaries, and stable dependency subgraphs.
3. Cross-file extraction is allowed only when dependency edges and scope ownership are proven.
4. If a clean structural split cannot be proven deterministically, Phase 8 must keep the code coarse rather than guessing.
5. Regularization may include limited mechanical normalization of formatting, layout, import/export structure, and related editability-oriented conventions when the transform is deterministic and semantics-preserving.
## Folder and file layout requirements
1. Phase 8 must assign a stable conventional folder and package layout even when that layout is reconstructed rather than original.
2. The layout may be project-defined and arbitrary as long as it is deterministic, stable, and easier to navigate than the raw recovered output.
3. Phase 8 must define deterministic file naming rules for generated modules.
4. When semantic names are weak, file names must fall back to stable structural names rather than guessed human names.
5. Unchanged upstream areas should preserve stable file placement as much as possible across releases when lineage and structure still match.
6. Phase 8 may regroup code in later runs only when deterministic placement rules require it, while minimizing churn.
7. Phase 8 is responsible for minimizing layout churn across runs.
## Import and export reconstruction requirements
1. Phase 8 must reconstruct deterministic import and export boundaries between split files.
2. Import/export reconstruction must remain semantics-preserving.
3. Import/export reconstruction must be represented in machine-readable regularization metadata.
## Wrapper requirements
1. Wrapper or shell modules are allowed only as a last resort.
2. Phase 8 may introduce deterministic wrapper modules only when needed to create a stable editable boundary that cannot be achieved by pure splitting or moving.
3. Wrappers must be explicitly marked in metadata as recovery scaffolding rather than inferred original structure.
4. Wrappers must follow deterministic naming and placement rules.
5. Wrapper use should remain exceptional and visible in verification and manifest outputs.
## Mapping and metadata requirements
1. Phase 8 must preserve a mapping from every regularized file or module back to source segment IDs, lineage IDs, or equivalent stable anchors.
2. Phase 8 must emit a machine-readable regularization manifest describing at least:
- regularized files and folders
- split decisions
- wrapper modules
- import/export reconstruction
- lineage mappings
- stable placement reuse decisions
3. Phase 8 must persist this manifest under `runs/<run-id>/`.
4. Useful stable placement and mapping metadata must also be reusable under `stable/` when it helps preserve placement across runs.
5. Stable reuse must be based on structural hashes, lineage IDs, or equivalent deterministic identities.
## Editable-tree requirements
1. Phase 8 must output both a files-on-disk editable tree and machine-readable mapping metadata.
2. The editable tree should be significantly more navigable and editable than the direct recovered segment-oriented output.
3. Success means regular enough for exploration and maintenance, not perfectly deobfuscated or perfectly pleasant everywhere.
4. Phase 8 may leave some recovered code in oversized files when deterministic splitting would be too risky.
## Stability requirements
1. Given identical inputs, Phase 8 must produce the same regularized tree and manifest.
2. Stable placement reuse should keep unchanged modules in the same paths across runs when lineage and structure match strongly enough.
3. Phase 8 must avoid unnecessary file churn when relabeling or upstream changes do not force structural movement.
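Requirement 1 implies that manifest output must not depend on discovery order. One way to get there is to sort every collection before serializing. The sketch below is illustrative only: `RegularizedFile` and `serializeManifest` are hypothetical, and the real manifest carries more fields (split decisions, import/export reconstruction, placement reuse).

```typescript
interface RegularizedFile {
  path: string
  lineageIds: readonly string[]
  isWrapper: boolean
}

// Serializes files in a canonical order so identical inputs give identical bytes.
const serializeManifest = (files: readonly RegularizedFile[]): string => {
  const ordered = files
    .map((f) => ({
      // Key order is fixed here by construction.
      path: f.path,
      lineageIds: f.lineageIds.slice().sort(),
      isWrapper: f.isWrapper,
    }))
    .sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
  return JSON.stringify({ files: ordered }, null, 2)
}
```

Byte-identical output also makes the "deterministic reruns" verification a plain file comparison.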
## Verification requirements
Phase 8 verification must confirm at least:
1. deterministic reruns produce the same regularized tree and manifest from identical inputs
2. the regularized tree parses correctly after structural transformations
3. import and export graph consistency is preserved after splitting
4. unchanged modules preserve stable placement across runs when lineage matches
5. wrapper use remains exceptional and is surfaced in manifests or summary counts
6. mappings from regularized outputs back to source lineage are persisted
7. regularization preserves program functionality rather than introducing semantic feature changes
@@ -0,0 +1,94 @@
# Phase 9 requirements — replay maintained transforms
## Goal
Phase 9 replays externally authored maintained transforms onto the regularized editable tree produced by Phase 8 so maintained functionality can survive later upstream updates.
Phase 9 does not derive transforms from git diffs and does not own the transform-authoring workflow. It only loads stored transforms, applies them deterministically when safe, records replay outcomes, and emits explicit conflicts when replay is unsafe.
## Scope
Phase 9 is responsible for:
- loading maintained transforms from long-lived stable metadata
- replaying them onto the Phase 8 regularized tree
- ordering transforms deterministically with dependency awareness
- verifying replay safety before apply
- recording replay outcomes and replay reports
- preserving conflict information for later human or agent-assisted resolution
Phase 9 is not responsible for:
- deriving transforms from git diffs
- capturing maintained edits into transforms
- inventing fallback transformations
- forcing weak matches through
## Inputs
Phase 9 requires the following inputs:
- the regularized editable tree produced by Phase 8
- maintained transforms stored under `stable/`
- stable lineage, file, and module mapping metadata
- transform dependency and ordering metadata
- replay verification configuration, including parse, build, and test commands
## Core requirements
1. Phase 9 must operate only on the regularized tree produced by Phase 8.
2. Phase 9 must assume transforms already exist before replay begins.
3. Transform authoring and capture must remain outside the numbered upstream-processing pipeline.
4. Replay must be deterministic.
5. Weak-match replays must become explicit conflicts rather than partial applies.
6. Unsafe replay must never be forced through.
## Transform taxonomy requirements
1. Phase 9 must define a high-level transform taxonomy.
2. MVP transform types must include at least:
- `jscodeshift-codemod`
- `structured-edit`
- `file-addition`
3. `jscodeshift-codemod` must be the preferred/default transform type in MVP.
4. Phase 9 may support additional deterministic transform types later if explicitly defined.
## Transform targeting requirements
1. Each transform must declare enough targeting metadata to locate its application point deterministically in the regularized tree.
2. Targeting must prefer stable file/module lineage and AST selectors over plain path-only matching when possible.
3. If a target file moved during Phase 8 regularization but lineage mapping preserves identity, replay must still be able to find it.
4. Target metadata may include files, AST selectors, stable lineage anchors, module anchors, or equivalent deterministic references.
5. Transforms must not rely on raw line offsets as their primary long-term targeting mechanism.
## Ordering and dependency requirements
1. Phase 9 must replay transforms in deterministic persisted order.
2. Ordering must be explicit metadata, not inferred ad hoc from filesystem state or commit timing.
3. Transforms may declare dependencies on earlier transforms.
4. If a transform conflicts, later transforms should still be attempted when ordering and dependency rules show they are independent.
5. Dependents blocked by an earlier failed transform must be skipped explicitly rather than attempted blindly.
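Requirements 4 and 5 can be sketched as a single pass over the persisted order. `TransformSpec`, `replayInOrder`, and the `tryApply` callback are hypothetical; the callback stands in for the actual match-and-apply step, and `skipped-not-applicable` is left out for brevity.

```typescript
type ReplayStatus = "applied" | "conflict" | "skipped-blocked"

interface TransformSpec {
  id: string
  dependsOn: readonly string[] // IDs of earlier transforms
}

// Walks transforms in persisted order. A transform is attempted only when
// all of its dependencies were applied; otherwise it is recorded as blocked.
const replayInOrder = (
  ordered: readonly TransformSpec[],
  tryApply: (id: string) => "applied" | "conflict",
): Record<string, ReplayStatus> => {
  const outcomes: Record<string, ReplayStatus> = {}
  for (const transform of ordered) {
    const blocked = transform.dependsOn.some((dep) => outcomes[dep] !== "applied")
    outcomes[transform.id] = blocked ? "skipped-blocked" : tryApply(transform.id)
  }
  return outcomes
}
```

A dependency that has not been seen yet is treated as blocking, so an ordering mistake surfaces as `skipped-blocked` rather than as a blind apply. Blocking also propagates transitively, since a skipped transform never reaches `applied`.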
## Replay safety requirements
1. Phase 9 must apply transforms only on high-confidence deterministic matches.
2. Weak-match cases must become explicit conflicts.
3. Phase 9 must not invent fallback changes when a transform target cannot be matched safely.
4. Phase 9 must not auto-apply destructive file removals in MVP.
5. Phase 9 may add new files into the regularized tree when a transform explicitly represents a maintained feature addition.
6. File additions must respect deterministic placement and naming rules from Phase 8.
## Persistence and reporting requirements
1. Phase 9 must record replay outcomes per transform.
2. Replay outcomes must support statuses equivalent to at least:
- `applied`
- `conflict`
- `skipped-blocked`
- `skipped-not-applicable`
3. Phase 9 must emit a per-run replay report artifact.
4. Replay reports must include enough detail for a later skill or agent to help resolve conflicts.
5. Long-lived transform definitions should remain mostly immutable.
6. Replay history and replay outcomes should be stored per run rather than mutating the canonical transform definition into an execution log.
## Verification requirements
1. Phase 9 must verify the tree parses after each applied transform when practical.
2. Phase 9 must perform a final end-to-end parse check after all replay attempts complete.
3. Phase 9 must perform a final build verification after replay.
4. Phase 9 must run maintained tests and basic functionality tests after replay, once such tests exist.
5. Transforms that add files must be verified for deterministic placement and naming relative to Phase 8 regularization rules.
6. Replay verification failures must be surfaced explicitly rather than silently tolerated.
## Conflict handling requirements
1. Unsafe or incompatible replay must emit explicit conflicts.
2. Conflicts must preserve enough targeting and failure context for later human or agent-assisted resolution.
3. Conflict emission must not prevent replay of later independent transforms.
4. Blocked dependents must be recorded as blocked, not misreported as direct conflicts.
+25
@@ -46,6 +46,31 @@ Use this checklist when reviewing a design produced by a human or an LLM.
- Can service and adapter internals be trusted mostly through seam tests and type constraints rather than line-by-line domain review?
- Do service implementations avoid accumulating hidden business logic?
## General bounded-seam review questions
- What caller intent does this seam grant?
- What invariant becomes true after crossing it?
- What tainted input becomes trusted here?
- What can still fail, and how is failure represented?
- What decisions are pure policy versus orchestration versus I/O?
- What internal details are leaking through the public API?
- If this seam were renamed by user goal instead of implementation shape, what would it be called?
## Applying the questions by seam type
### Workflow seams
- Given command X, under what rules does this workflow emit event Y or failure Z?
- Is the workflow mostly assembling decisions and capabilities, or is it hiding business rules inside orchestration?
- Does the workflow expose the business outcome more clearly than the implementation steps?
### Policy seams
- What exact business or security decision is being made?
- What inputs are sufficient for that decision?
- Are the reasons for approval or rejection explicit enough to review?
- Is the policy pure and deterministic?
## Security review readiness
- Are trust boundaries visible enough for a reviewer to identify where untrusted data enters?
+6 -2
@@ -32,8 +32,12 @@ Feature-specific naming choices should also be recorded in the relevant design a
| Term | Meaning | Use this, not that | Notes |
| :--- | :--- | :--- | :--- |
| `<PreferredTerm>` | `<Short domain meaning>` | `<PreferredTerm>` not `<RejectedSynonym>` | `<Optional note>` |
| `<PreferredTerm>` | `<Short domain meaning>` | `<PreferredTerm>` not `<RejectedSynonym>` | `<Optional note>` |
| `Recovery Pipeline` | Release-oriented workflow that turns one upstream snapshot into a buildable, browsable recovered tree and release artifacts | `Recovery Pipeline` not `deobfuscation script chain` | Feature-level umbrella term used across contexts. |
| `Recovered Tree` | Canonical editable source tree emitted at repo root for review and modification | `Recovered Tree` not `original repo layout` | The tree is reconstructed for usability, not historical fidelity. |
| `Build-first` | Acceptance rule that preserves buildability even when readability improvements are still incomplete | `Build-first` not `runtime complete` | Current hard success invariant for the feature. |
| `Review-needed Artifact` | Machine-readable report plus concise human summary that surfaces uncertainty, failure, or conflict | `Review-needed Artifact` not `warning log` | Explicit inspection seam rather than hidden failure. |
| `Maintained Transform` | Durable replayable local change stored outside the numbered upstream-processing phases | `Maintained Transform` not `manual patch` | Reused by replay and release contexts. |
| `Naming Memory` | Small reviewable history of accepted recovered names reused in later relabel iterations | `Naming Memory` not `rename cache` | Shared iterative-naming term with reviewer-facing meaning. |
## Review questions
+46
@@ -0,0 +1,46 @@
export type {
  Error,
  Event,
  IngestUpstreamSnapshot,
  State,
  UpstreamSnapshotIngested,
  SnapshotIngestHardStopped,
  AwaitingTrustedBundle,
  TrustedSnapshotSelected,
  IngestableSnapshot,
} from "./models/types.js"
export type {
  RunManifest,
  SegmentRecord,
  SelectedSnapshot,
  SnapshotIdentity,
  SnapshotMetadata,
  TaintedBundleLocation,
  TrustedBundleLocation,
  RunIdentity,
} from "./models/shared.js"
export {
  makeSnapshotIdentity,
  makeTaintedBundleInput,
  makeTaintedBundleLocation,
  makeVerifiedPreviousRunManifest,
} from "./models/factories.js"
export {
  makeAstNodeKind,
  makeNormalizedHash,
  makeRawHash,
  makeRunIdentity,
  makeShapeHash,
  makeTrustedCanonicalProjectionPath,
  makeTrustedManifestPath,
  makeTrustedSegmentsPath,
  makeTrustedSummaryPath,
} from "./models/ops.js"
export { workflow } from "./workflows/ingestSnapshot.js"
export {
  apply,
  decide,
  emitSnapshotIngested,
  makeAwaitingTrustedBundle,
  validatePreviousRunManifest,
} from "./policies/decideSnapshotIngest.js"
@@ -0,0 +1,32 @@
import type {
  Error,
  IngestFailureReason,
  TaintedBundleInput,
  VerifiedPreviousRunManifest,
} from "./types.js"
import type {
  RunManifest,
  SnapshotIdentity,
  TaintedBundleLocation,
} from "./shared.js"

// Brand constructors: plain casts, no runtime validation happens here.
export const makeSnapshotIdentity = (value: string): SnapshotIdentity =>
  value as SnapshotIdentity
export const makeTaintedBundleLocation = (value: string): TaintedBundleLocation =>
  value as TaintedBundleLocation

// Tagged wrappers around a manifest and a tainted bundle location.
export const makeVerifiedPreviousRunManifest = (
  manifest: RunManifest,
): VerifiedPreviousRunManifest => ({ _tag: "VerifiedPreviousRunManifest", manifest })
export const makeTaintedBundleInput = (
  location: TaintedBundleLocation,
): TaintedBundleInput => ({ _tag: "TaintedBundleInput", location })

// Builds the hard-stop error for a snapshot and a failure reason.
export const foldFailure = (
  snapshotIdentity: SnapshotIdentity,
  reason: IngestFailureReason,
): Error => ({
  _tag: "SnapshotIngestHardStopped",
  payload: { SnapshotIdentity: snapshotIdentity, Reason: reason },
})
@@ -0,0 +1,199 @@
import { Either } from "effect"
import {
isNonEmptyString,
type AstNodeKind,
type NormalizedHash,
type RawHash,
type RunIdentity,
type SegmentRecord,
type SelectedSnapshot,
type ShapeHash,
type SnapshotIdentity,
type TrustedBundleLocation,
type TrustedCanonicalProjectionPath,
type TrustedManifestPath,
type TrustedSegmentsPath,
type TrustedSummaryPath,
} from "./shared.js"
import { foldFailure } from "./factories.js"
import type {
DerivedRunIdentity,
Error,
IngestableSnapshot,
RequiredArtifact,
TaintedBundleInput,
TrustedSnapshotSelected,
} from "./types.js"
export const makeTrustedBundleLocation = (value: string): TrustedBundleLocation =>
value as TrustedBundleLocation
export const makeRunIdentity = (value: string): RunIdentity => value as RunIdentity
export const makeAstNodeKind = (value: string): AstNodeKind => value as AstNodeKind
export const makeRawHash = (value: string): RawHash => value as RawHash
export const makeNormalizedHash = (value: string): NormalizedHash =>
value as NormalizedHash
export const makeShapeHash = (value: string): ShapeHash => value as ShapeHash
export const makeTrustedManifestPath = (value: string): TrustedManifestPath =>
value as TrustedManifestPath
export const makeTrustedSegmentsPath = (value: string): TrustedSegmentsPath =>
value as TrustedSegmentsPath
export const makeTrustedCanonicalProjectionPath = (
value: string,
): TrustedCanonicalProjectionPath => value as TrustedCanonicalProjectionPath
export const makeTrustedSummaryPath = (value: string): TrustedSummaryPath =>
value as TrustedSummaryPath
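// Accepts a location only if it is non-blank after trimming and contains a "/".
const parseBundleLocationText = (location: string): string | null => {
  const trimmedLocation = location.trim()
  return trimmedLocation.length === 0 || !trimmedLocation.includes("/")
    ? null
    : trimmedLocation
}
// Placeholder check: failures are triggered by marker substrings in the bundle
// location, not by measuring the bundle. The reported limits are hardcoded and
// mirror the defaults in makeAwaitingTrustedBundle; the MaxBundleBytes and
// ParseBudget carried on TrustedSnapshotSelected are not consulted here.
const decideSegmentRecordFailure = (
  selectedSnapshot: SelectedSnapshot,
  bundleLocation: string,
): Error | null => {
  if (bundleLocation.includes("too-large")) {
    return foldFailure(selectedSnapshot.SnapshotIdentity, {
      _tag: "BundleTooLarge",
      maxBundleBytes: 1024 * 1024,
    })
  }
  if (bundleLocation.includes("budget-exceeded")) {
    return foldFailure(selectedSnapshot.SnapshotIdentity, {
      _tag: "ParseBudgetExceeded",
      parseBudget: 50_000,
    })
  }
  return null
}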
export const parseBundleLocation = (
snapshotIdentity: SnapshotIdentity,
input: TaintedBundleInput,
): Either.Either<TrustedBundleLocation, Error> => {
const location = parseBundleLocationText(input.location as string)
return location === null
? Either.left(foldFailure(snapshotIdentity, "BundleNotParseable"))
: Either.right(makeTrustedBundleLocation(location))
}
const deriveRunIdentity = (
selectedSnapshot: SelectedSnapshot,
): Either.Either<DerivedRunIdentity, Error> => {
const snapshotIdentity = selectedSnapshot.SnapshotIdentity as string
return isNonEmptyString(snapshotIdentity)
? Either.right({ _tag: "DerivedRunIdentity", value: makeRunIdentity(`run:${snapshotIdentity}`) })
: Either.left(
foldFailure(
selectedSnapshot.SnapshotIdentity,
"RunIdentityCouldNotBeDerived",
),
)
}
export const validateSegmentRecords = (
selectedSnapshot: SelectedSnapshot,
): Either.Either<ReadonlyArray<SegmentRecord>, Error> => {
const snapshotIdentity = selectedSnapshot.SnapshotIdentity as string
const bundleLocation = selectedSnapshot.BundleLocation as string
const failure = decideSegmentRecordFailure(selectedSnapshot, bundleLocation)
if (failure) {
return Either.left(failure)
}
return Either.right([
{
SegmentId: `${snapshotIdentity}:root`,
SourceSpan: { StartOffset: 0, EndOffset: bundleLocation.length },
AstNodeKind: makeAstNodeKind("Program"),
CanonicalSource: `// canonical projection for ${snapshotIdentity}`,
Hashes: {
RawHash: makeRawHash(`raw:${snapshotIdentity}`),
NormalizedHash: makeNormalizedHash(`normalized:${snapshotIdentity}`),
ShapeHash: makeShapeHash(`shape:${snapshotIdentity}`),
},
},
])
}
export const validateBoundaryProofs = (
snapshotIdentity: SnapshotIdentity,
segmentRecords: ReadonlyArray<SegmentRecord>,
): Either.Either<ReadonlyArray<string>, Error> => {
const firstSegment = segmentRecords[0]
return firstSegment
? Either.right([`boundary:${firstSegment.SegmentId}`])
: Either.left(foldFailure(snapshotIdentity, "NoDeterministicBoundaryProven"))
}
export const validateRequiredArtifacts = (
snapshotIdentity: SnapshotIdentity,
requiredArtifacts: ReadonlyArray<RequiredArtifact>,
segmentRecords: ReadonlyArray<SegmentRecord>,
): Either.Either<ReadonlyArray<RequiredArtifact>, Error> => {
const missingArtifact = segmentRecords[0] ? null : requiredArtifacts[0]
return missingArtifact === null
? Either.right(requiredArtifacts)
: Either.left(
foldFailure(snapshotIdentity, {
_tag: "RequiredArtifactMissing",
artifact: missingArtifact ?? "RunManifestArtifact",
}),
)
}
// Validation chain: run identity -> segment records -> boundary proofs ->
// required artifacts. The first Left short-circuits the rest.
export const decideIngestableSnapshot = (
  trustedSnapshot: TrustedSnapshotSelected,
): Either.Either<IngestableSnapshot, Error> =>
  Either.flatMap(deriveRunIdentity(trustedSnapshot.SelectedSnapshot), (derivedRunIdentity) =>
    Either.flatMap(validateSegmentRecords(trustedSnapshot.SelectedSnapshot), (segmentRecords) =>
      Either.flatMap(
        validateBoundaryProofs(
          trustedSnapshot.SelectedSnapshot.SnapshotIdentity,
          segmentRecords,
        ),
        (boundaryProofs) =>
          Either.map(
            validateRequiredArtifacts(
              trustedSnapshot.SelectedSnapshot.SnapshotIdentity,
              trustedSnapshot.RequiredArtifacts,
              segmentRecords,
            ),
            () => ({
              _tag: "IngestableSnapshot" as const,
              RunIdentity: derivedRunIdentity.value,
              SelectedSnapshot: trustedSnapshot.SelectedSnapshot,
              PreviousRunManifest: trustedSnapshot.PreviousRunManifest,
              SegmentRecords: segmentRecords,
              BoundaryProofs: boundaryProofs,
              RequiredArtifacts: trustedSnapshot.RequiredArtifacts,
            }),
          ),
      ),
    ),
  )
export const deriveRequiredArtifactPaths = (
runIdentity: RunIdentity,
): {
readonly ManifestPath: TrustedManifestPath
readonly SegmentsPath: TrustedSegmentsPath
readonly CanonicalProjectionPath: TrustedCanonicalProjectionPath
readonly SummaryPath: TrustedSummaryPath
} => {
const basePath = `runs/${runIdentity as string}`
return {
ManifestPath: makeTrustedManifestPath(`${basePath}/manifest.json`),
SegmentsPath: makeTrustedSegmentsPath(`${basePath}/segments.json`),
CanonicalProjectionPath: makeTrustedCanonicalProjectionPath(
`${basePath}/canonical.ts`,
),
SummaryPath: makeTrustedSummaryPath(`${basePath}/summary.json`),
}
}
@@ -0,0 +1,102 @@
import { Schema } from "@effect/schema"
const NonEmptyString = Schema.String.pipe(
Schema.filter((value) => value.trim().length > 0),
)
export const SnapshotIdentity = Schema.String.pipe(Schema.brand("SnapshotIdentity"))
export type SnapshotIdentity = Schema.Schema.Type<typeof SnapshotIdentity>
export const TaintedBundleLocation = Schema.String.pipe(
Schema.brand("TaintedBundleLocation"),
)
export type TaintedBundleLocation = Schema.Schema.Type<typeof TaintedBundleLocation>
export const TrustedBundleLocation = Schema.String.pipe(
Schema.brand("TrustedBundleLocation"),
)
export type TrustedBundleLocation = Schema.Schema.Type<typeof TrustedBundleLocation>
export const RunIdentity = Schema.String.pipe(Schema.brand("RunIdentity"))
export type RunIdentity = Schema.Schema.Type<typeof RunIdentity>
export const AstNodeKind = Schema.String.pipe(Schema.brand("AstNodeKind"))
export type AstNodeKind = Schema.Schema.Type<typeof AstNodeKind>
export const RawHash = Schema.String.pipe(Schema.brand("RawHash"))
export type RawHash = Schema.Schema.Type<typeof RawHash>
export const NormalizedHash = Schema.String.pipe(Schema.brand("NormalizedHash"))
export type NormalizedHash = Schema.Schema.Type<typeof NormalizedHash>
export const ShapeHash = Schema.String.pipe(Schema.brand("ShapeHash"))
export type ShapeHash = Schema.Schema.Type<typeof ShapeHash>
export const TrustedManifestPath = Schema.String.pipe(
Schema.brand("TrustedManifestPath"),
)
export type TrustedManifestPath = Schema.Schema.Type<typeof TrustedManifestPath>
export const TrustedSegmentsPath = Schema.String.pipe(
Schema.brand("TrustedSegmentsPath"),
)
export type TrustedSegmentsPath = Schema.Schema.Type<typeof TrustedSegmentsPath>
export const TrustedCanonicalProjectionPath = Schema.String.pipe(
Schema.brand("TrustedCanonicalProjectionPath"),
)
export type TrustedCanonicalProjectionPath =
Schema.Schema.Type<typeof TrustedCanonicalProjectionPath>
export const TrustedSummaryPath = Schema.String.pipe(
Schema.brand("TrustedSummaryPath"),
)
export type TrustedSummaryPath = Schema.Schema.Type<typeof TrustedSummaryPath>
export const SourceSpan = Schema.Struct({
StartOffset: Schema.Number,
EndOffset: Schema.Number,
})
export type SourceSpan = Schema.Schema.Type<typeof SourceSpan>
export const SnapshotMetadata = Schema.Struct({
ReleaseNotesSource: Schema.NullOr(Schema.String),
CollectedAt: Schema.NullOr(Schema.String),
})
export type SnapshotMetadata = Schema.Schema.Type<typeof SnapshotMetadata>
export const SelectedSnapshot = Schema.Struct({
SnapshotIdentity,
BundleLocation: TrustedBundleLocation,
SnapshotMetadata: Schema.NullOr(SnapshotMetadata),
})
export type SelectedSnapshot = Schema.Schema.Type<typeof SelectedSnapshot>
export const SegmentHashes = Schema.Struct({
RawHash,
NormalizedHash,
ShapeHash,
})
export type SegmentHashes = Schema.Schema.Type<typeof SegmentHashes>
export const SegmentRecord = Schema.Struct({
SegmentId: Schema.String,
SourceSpan,
AstNodeKind,
CanonicalSource: Schema.String,
Hashes: SegmentHashes,
})
export type SegmentRecord = Schema.Schema.Type<typeof SegmentRecord>
export const RunManifest = Schema.Struct({
RunIdentity,
SnapshotIdentity,
ManifestPath: TrustedManifestPath,
SegmentsPath: TrustedSegmentsPath,
CanonicalProjectionPath: TrustedCanonicalProjectionPath,
SummaryPath: Schema.NullOr(TrustedSummaryPath),
})
export type RunManifest = Schema.Schema.Type<typeof RunManifest>
export const isNonEmptyString = (value: string): boolean =>
Schema.is(NonEmptyString)(value)
@@ -0,0 +1,103 @@
import type {
RunIdentity,
RunManifest,
SegmentRecord,
SelectedSnapshot,
SnapshotIdentity,
SnapshotMetadata,
TrustedCanonicalProjectionPath,
TrustedSummaryPath,
TaintedBundleLocation,
} from "./shared.js"
export type VerifiedPreviousRunManifest = {
readonly _tag: "VerifiedPreviousRunManifest"
readonly manifest: RunManifest
}
export type TaintedBundleInput = {
readonly _tag: "TaintedBundleInput"
readonly location: TaintedBundleLocation
}
export type DerivedRunIdentity = {
readonly _tag: "DerivedRunIdentity"
readonly value: RunIdentity
}
export type RequiredArtifact =
| "RunManifestArtifact"
| "SegmentRecordsArtifact"
| "CanonicalProjectionArtifact"
export type IngestFailureReason =
| "BundleNotParseable"
| "RunIdentityCouldNotBeDerived"
| "PreviousRunManifestNotVerified"
| { readonly _tag: "BundleTooLarge"; readonly maxBundleBytes: number }
| { readonly _tag: "ParseBudgetExceeded"; readonly parseBudget: number }
| "NoDeterministicBoundaryProven"
| { readonly _tag: "RequiredArtifactMissing"; readonly artifact: RequiredArtifact }
export type IngestUpstreamSnapshot = {
readonly SnapshotIdentity: SnapshotIdentity
readonly BundleInput: TaintedBundleInput
readonly SnapshotMetadata: SnapshotMetadata | null
readonly PreviousRunManifest: VerifiedPreviousRunManifest | null
}
export type UpstreamSnapshotIngested = {
readonly RunManifest: RunManifest
readonly SegmentRecords: ReadonlyArray<SegmentRecord>
readonly CanonicalProjectionPath: TrustedCanonicalProjectionPath
readonly SummaryPath: TrustedSummaryPath | null
}
export type SnapshotIngestHardStopped = {
readonly SnapshotIdentity: SnapshotIdentity
readonly Reason: IngestFailureReason
}
export type Event = {
readonly _tag: "UpstreamSnapshotIngested"
readonly payload: UpstreamSnapshotIngested
}
export type Error = {
readonly _tag: "SnapshotIngestHardStopped"
readonly payload: SnapshotIngestHardStopped
}
export type AwaitingTrustedBundle = {
readonly _tag: "AwaitingTrustedBundle"
readonly RunIdentityRulesDescription: string
readonly BoundaryRulesDescription: string
readonly RequiredArtifacts: ReadonlyArray<RequiredArtifact>
readonly MaxBundleBytes: number
readonly ParseBudget: number
}
export type TrustedSnapshotSelected = {
readonly _tag: "TrustedSnapshotSelected"
readonly SelectedSnapshot: SelectedSnapshot
readonly PreviousRunManifest: VerifiedPreviousRunManifest | null
readonly RequiredArtifacts: ReadonlyArray<RequiredArtifact>
readonly MaxBundleBytes: number
readonly ParseBudget: number
}
export type IngestableSnapshot = {
readonly _tag: "IngestableSnapshot"
readonly RunIdentity: RunIdentity
readonly SelectedSnapshot: SelectedSnapshot
readonly PreviousRunManifest: VerifiedPreviousRunManifest | null
readonly SegmentRecords: ReadonlyArray<SegmentRecord>
readonly BoundaryProofs: ReadonlyArray<string>
readonly RequiredArtifacts: ReadonlyArray<RequiredArtifact>
}
export type State =
| AwaitingTrustedBundle
| TrustedSnapshotSelected
| IngestableSnapshot
| ({ readonly _tag: "SnapshotIngested" } & UpstreamSnapshotIngested)
@@ -0,0 +1,112 @@
import { Either } from "effect"
import { foldFailure, makeVerifiedPreviousRunManifest } from "../models/factories.js"
import {
decideIngestableSnapshot,
deriveRequiredArtifactPaths,
parseBundleLocation,
} from "../models/ops.js"
import type {
AwaitingTrustedBundle,
Error,
Event,
IngestableSnapshot,
IngestUpstreamSnapshot,
State,
TrustedSnapshotSelected,
UpstreamSnapshotIngested,
} from "../models/types.js"
import type { RunManifest } from "../models/shared.js"
export const validatePreviousRunManifest = (
manifest: RunManifest,
): Either.Either<ReturnType<typeof makeVerifiedPreviousRunManifest>, Error> =>
manifest.ManifestPath && manifest.SegmentsPath && manifest.CanonicalProjectionPath
? Either.right(makeVerifiedPreviousRunManifest(manifest))
: Either.left(
foldFailure(manifest.SnapshotIdentity, "PreviousRunManifestNotVerified"),
)
const selectTrustedSnapshot = (
state: State,
command: IngestUpstreamSnapshot,
): Either.Either<TrustedSnapshotSelected, Error> => {
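  // IngestFailureReason has no dedicated case for a wrong starting state,
  // so this branch reuses "RunIdentityCouldNotBeDerived".
  if (state._tag !== "AwaitingTrustedBundle") {
    return Either.left(
      foldFailure(command.SnapshotIdentity, "RunIdentityCouldNotBeDerived"),
    )
  }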
return Either.map(
parseBundleLocation(command.SnapshotIdentity, command.BundleInput),
(bundleLocation) => ({
_tag: "TrustedSnapshotSelected" as const,
SelectedSnapshot: {
SnapshotIdentity: command.SnapshotIdentity,
BundleLocation: bundleLocation,
SnapshotMetadata: command.SnapshotMetadata,
},
PreviousRunManifest: command.PreviousRunManifest,
RequiredArtifacts: state.RequiredArtifacts,
MaxBundleBytes: state.MaxBundleBytes,
ParseBudget: state.ParseBudget,
}),
)
}
export const emitSnapshotIngested = (
ingestableSnapshot: IngestableSnapshot,
): Event => {
const artifactPaths = deriveRequiredArtifactPaths(ingestableSnapshot.RunIdentity)
const runManifest: RunManifest = {
RunIdentity: ingestableSnapshot.RunIdentity,
SnapshotIdentity: ingestableSnapshot.SelectedSnapshot.SnapshotIdentity,
ManifestPath: artifactPaths.ManifestPath,
SegmentsPath: artifactPaths.SegmentsPath,
CanonicalProjectionPath: artifactPaths.CanonicalProjectionPath,
SummaryPath: artifactPaths.SummaryPath,
}
const payload: UpstreamSnapshotIngested = {
RunManifest: runManifest,
SegmentRecords: ingestableSnapshot.SegmentRecords,
CanonicalProjectionPath: artifactPaths.CanonicalProjectionPath,
SummaryPath: artifactPaths.SummaryPath,
}
return { _tag: "UpstreamSnapshotIngested", payload }
}
export const decide = (
state: State,
command: IngestUpstreamSnapshot,
) =>
Either.flatMap(selectTrustedSnapshot(state, command), (trustedSnapshot) =>
Either.map(decideIngestableSnapshot(trustedSnapshot), emitSnapshotIngested),
)
export const apply = (_state: State, event: Event): State => {
switch (event._tag) {
case "UpstreamSnapshotIngested":
return { _tag: "SnapshotIngested", ...event.payload }
}
}
export const makeAwaitingTrustedBundle = (
overrides: Partial<AwaitingTrustedBundle> = {},
): AwaitingTrustedBundle => ({
_tag: "AwaitingTrustedBundle",
RunIdentityRulesDescription: "derive from snapshot identity",
BoundaryRulesDescription: "require at least one deterministic boundary proof",
RequiredArtifacts: [
"RunManifestArtifact",
"SegmentRecordsArtifact",
"CanonicalProjectionArtifact",
],
MaxBundleBytes: 1024 * 1024,
ParseBudget: 50_000,
...overrides,
})
export { decideIngestableSnapshot }
@@ -0,0 +1,24 @@
import { Effect, Either } from "effect"
import type {
Error,
Event,
IngestUpstreamSnapshot,
State,
} from "../models/types.js"
import {
decide,
makeAwaitingTrustedBundle,
} from "../policies/decideSnapshotIngest.js"
// Lifts the pure decide() result into an Effect: Left fails, Right succeeds.
export const workflow = (
  command: IngestUpstreamSnapshot,
  state: State = makeAwaitingTrustedBundle(),
): Effect.Effect<Event, Error> =>
  Effect.gen(function* () {
    const decision = decide(state, command)
    if (Either.isLeft(decision)) {
      return yield* Effect.fail(decision.left)
    }
    return decision.right
  })
@@ -0,0 +1,123 @@
import { Effect, Either } from "effect"
import { describe, expect, it } from "vitest"
import {
  type IngestUpstreamSnapshot,
  apply,
  decide,
  makeAwaitingTrustedBundle,
  makeRunIdentity,
  makeSnapshotIdentity,
  makeTaintedBundleInput,
  makeTaintedBundleLocation,
  makeTrustedCanonicalProjectionPath,
  makeTrustedManifestPath,
  makeTrustedSegmentsPath,
  makeTrustedSummaryPath,
  makeVerifiedPreviousRunManifest,
  validatePreviousRunManifest,
  workflow,
} from "../src/contexts/ingest-snapshot/index.js"
const makeCommand = (
overrides: Partial<IngestUpstreamSnapshot> = {},
): IngestUpstreamSnapshot => ({
SnapshotIdentity: makeSnapshotIdentity("snapshot-001"),
BundleInput: makeTaintedBundleInput(
makeTaintedBundleLocation("/tmp/bundle.js"),
),
SnapshotMetadata: null,
PreviousRunManifest: null,
...overrides,
})
describe("ingestSnapshot workflow", () => {
it("ingests a deterministic bundle snapshot", async () => {
const event = await Effect.runPromise(workflow(makeCommand()))
expect(event._tag).toBe("UpstreamSnapshotIngested")
expect(event.payload.RunManifest.RunIdentity).toBe("run:snapshot-001")
expect(event.payload.RunManifest.ManifestPath).toBe(
"runs/run:snapshot-001/manifest.json",
)
expect(event.payload.SegmentRecords).toHaveLength(1)
})
it("hard-stops when the bundle location is not parseable", () => {
const result = decide(
makeAwaitingTrustedBundle(),
makeCommand({
BundleInput: makeTaintedBundleInput(makeTaintedBundleLocation("not-a-path")),
}),
)
expect(Either.isLeft(result)).toBe(true)
if (Either.isLeft(result)) {
expect(result.left.payload.Reason).toBe("BundleNotParseable")
}
})
it("applies the ingested event into SnapshotIngested state", () => {
const result = decide(makeAwaitingTrustedBundle(), makeCommand())
expect(Either.isRight(result)).toBe(true)
if (Either.isRight(result)) {
const nextState = apply(makeAwaitingTrustedBundle(), result.right)
expect(nextState._tag).toBe("SnapshotIngested")
if (nextState._tag === "SnapshotIngested") {
expect(nextState.RunManifest.CanonicalProjectionPath).toBe(
"runs/run:snapshot-001/canonical.ts",
)
expect(nextState.SummaryPath).toBe(
"runs/run:snapshot-001/summary.json",
)
}
}
})
it("accepts a verified previous run manifest", () => {
const result = validatePreviousRunManifest({
RunIdentity: makeRunIdentity("run:previous"),
SnapshotIdentity: makeSnapshotIdentity("snapshot-000"),
ManifestPath: makeTrustedManifestPath("runs/run:previous/manifest.json"),
SegmentsPath: makeTrustedSegmentsPath("runs/run:previous/segments.json"),
CanonicalProjectionPath: makeTrustedCanonicalProjectionPath(
"runs/run:previous/canonical.ts",
),
SummaryPath: makeTrustedSummaryPath("runs/run:previous/summary.json"),
})
expect(Either.isRight(result)).toBe(true)
})
it("preserves segment evidence when a previous manifest is present", async () => {
const event = await Effect.runPromise(
workflow(
makeCommand({
PreviousRunManifest: makeVerifiedPreviousRunManifest({
RunIdentity: makeRunIdentity("run:previous"),
SnapshotIdentity: makeSnapshotIdentity("snapshot-000"),
ManifestPath: makeTrustedManifestPath("runs/run:previous/manifest.json"),
SegmentsPath: makeTrustedSegmentsPath("runs/run:previous/segments.json"),
CanonicalProjectionPath: makeTrustedCanonicalProjectionPath(
"runs/run:previous/canonical.ts",
),
SummaryPath: makeTrustedSummaryPath("runs/run:previous/summary.json"),
}),
}),
),
)
expect(event.payload.SegmentRecords[0]).toMatchObject({
SegmentId: "snapshot-001:root",
AstNodeKind: "Program",
Hashes: {
RawHash: "raw:snapshot-001",
NormalizedHash: "normalized:snapshot-001",
ShapeHash: "shape:snapshot-001",
},
})
expect(event.payload.SegmentRecords[0]?.Hashes.RawHash).toBe("raw:snapshot-001")
})
})
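
The tests above exercise the tainted-to-trusted boundary through the `effect` library. As a reading aid, the same idea can be sketched without any dependencies. This is a minimal illustration, not the repo's code: `Brand`, `Result`, `TaintedLocation`, `TrustedLocation`, `taint`, and `parseLocation` are hypothetical names, and the acceptance rule (non-blank, contains a `/`) simply mirrors `parseBundleLocationText`.

```typescript
// Hypothetical, dependency-free sketch of a tainted -> trusted boundary.
// None of these names come from the repository.
type Brand<T, B extends string> = T & { readonly __brand: B }

type TaintedLocation = Brand<string, "TaintedLocation">
type TrustedLocation = Brand<string, "TrustedLocation">

// Minimal stand-in for an Either: Right carries success, Left carries failure.
type Result<A, E> =
  | { readonly _tag: "Right"; readonly right: A }
  | { readonly _tag: "Left"; readonly left: E }

// Raw input is marked tainted at the edge of the system.
const taint = (value: string): TaintedLocation => value as TaintedLocation

// The parser is the only place a TrustedLocation is produced, so any function
// that demands a TrustedLocation can rely on the check having run.
const parseLocation = (
  input: TaintedLocation,
): Result<TrustedLocation, "BundleNotParseable"> => {
  const trimmed = (input as string).trim()
  return trimmed.length === 0 || !trimmed.includes("/")
    ? { _tag: "Left", left: "BundleNotParseable" }
    : { _tag: "Right", right: trimmed as TrustedLocation }
}

const ok = parseLocation(taint("  /tmp/bundle.js "))
const bad = parseLocation(taint("not-a-path"))
```

Under these assumptions `ok` is a `Right` holding the trimmed location and `bad` is a `Left` with `"BundleNotParseable"`, which matches the two outcomes the first two tests assert against `decide`.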