Initial commit

2026-05-25 05:47:28 +00:00
commit 4d6495ffda
97 changed files with 13403 additions and 0 deletions
@@ -0,0 +1,448 @@
+# Agent Pipeline Notes
+
+## 1. Reality Check
+
+- Fully automated code writing is close for bounded, low-risk, well-specified work.
+- Fully unsupervised ownership of large, evolving, high-stakes systems is not close.
+- Raw coding ability is improving faster than architectural consistency, uncertainty calibration, and trustworthy self-review.
+- In the near term, the goal is not to remove review entirely. The goal is to move review up a level.
+
+## 2. Current Bottleneck
+
+- The early design phases in this repository are already relatively strong.
+- The main bottleneck is assembly and the review/refactor thrash around assembly.
+- The biggest time sink is repeated loops around common implementation issues, potential refactors, and reviewing too much low-level detail.
+- Strong design artifacts reduce the need to reconstruct intent from code, but they do not yet fully remove the need for human judgment.
+
+## 3. What a Pipeline Adds Beyond Manual Skill Use
+
+Right now, the human is acting as the scheduler and state machine.
+A pipeline externalizes that work so it is explicit and enforceable.
+
+Useful additions that were not as necessary when doing the process manually:
+
+- machine-checkable approval state
+- explicit slice definitions
+- spec-to-code traceability rules
+- human-signoff criteria by phase
+- artifact diffs between stages
+- automatic verification bundles
+- replayable evaluation runs
+- thrash/change-war detection
+- audit trail for decisions and outcomes
+
+The main benefit is not “agents do more coding.”
+The main benefit is that the process becomes more stable, repeatable, and reviewable.
+
+## 4. What to Pipeline First
+
+Build the thinnest useful pipeline around the current process.
+Do not start with a large swarm system.
+
+Best initial targets:
+
+- phase state machine
+- artifact and gate manifests
+- assembly slice runner
+- verification bundle
+- thrash detection
+- small replay benchmark harness
+
+This should be built as a headless harness first, designed for Argo-style execution rather than an interactive coding CLI.
+Use existing coding agents and projects like opencode as reference implementations for inner-loop patterns, not as the architectural foundation.
+
+## 5. What Should Stay Human
+
+Humans should still own the judgment-heavy transitions:
+
+- requirements freeze
+- workflow or slice approval
+- architecture exceptions
+- security sign-off for risky changes
+- final sign-off for high-blast-radius changes
+- resolving ambiguity or contradictory requirements
+
+The pipeline should reduce line-by-line review, not eliminate human judgment.
+
+## 6. Review at a Higher Level
+
+The practical goal is to move review from low-level code inspection to higher-level conformance review.
+
+Instead of asking:
+- What does this code do?
+- Did the agent miss some implementation detail?
+
+Try to make review focus on:
+- Does this slice match the frozen artifact?
+- Does it violate any domain invariants or trust boundaries?
+- Did it introduce any risky seams or suspicious shortcuts?
+- Does it need redesign, human review, or is it safe to merge?
+
+This is realistic if artifacts become stricter and slices become smaller.
+
+## 7. Preventing Subtle Logic Errors
+
+The system will mostly catch errors early rather than prevent every error outright.
+That is still valuable because catching errors before merge is much cheaper than catching them later.
+
+Helpful layers:
+
+- frozen design artifacts
+- spec-to-code traceability
+- property and mutation tests where useful
+- risk-focused seam review
+- deterministic boundary and architecture checks
+- forcing agents to cite which invariant each change satisfies
+
+The point is not perfection.
+The point is making subtle logic flaws rarer and cheaper to catch.
+
+## 8. Slices and Assembly Scope
+
+Assembly often does too much when it translates a whole blueprint at once.
+A better pattern is:
+
+1. freeze the artifact
+2. choose one workflow slice
+3. implement one tracer-bullet path end to end
+4. run tests/checks/review
+5. expand behavior within that slice
+6. move to the next slice
+
+A slice is not necessarily one PR per policy or adapter.
+A slice is one coherent bounded vertical change that may touch a workflow, several policies, and several adapters if they belong to one contract.
+
+## 9. Cost Reality
+
+A pipeline can save money only if it reduces retries, review thrash, and change wars.
+If it creates uncontrolled loops, it can absolutely increase token costs.
+
+The key metric is not cost per token.
+The key metric is cost per accepted slice and time saved per accepted slice.
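
One way to make "cost per accepted slice" concrete is sketched below. The `SliceRun` shape and its field names are assumptions for illustration, not an existing schema in this repository. The idea is that rejected retries still count toward the spend, while only accepted slices count in the denominator.

```typescript
// Hypothetical record shape; field names are illustrative only.
interface SliceRun {
  sliceId: string;
  tokenCostUsd: number; // model spend for this attempt
  accepted: boolean;    // whether this attempt was ultimately accepted
}

// Total spend across all attempts (including rejected retries) divided by
// the number of distinct slices that were eventually accepted.
function costPerAcceptedSlice(runs: SliceRun[]): number | null {
  const acceptedSlices = new Set(
    runs.filter((r) => r.accepted).map((r) => r.sliceId)
  );
  if (acceptedSlices.size === 0) return null; // metric is undefined until something is accepted
  const totalCost = runs.reduce((sum, r) => sum + r.tokenCostUsd, 0);
  return totalCost / acceptedSlices.size;
}

// Slice "a" needed one retry; slice "b" was accepted on the first attempt.
const exampleRuns: SliceRun[] = [
  { sliceId: "a", tokenCostUsd: 2, accepted: false },
  { sliceId: "a", tokenCostUsd: 3, accepted: true },
  { sliceId: "b", tokenCostUsd: 5, accepted: true },
];
const exampleCost = costPerAcceptedSlice(exampleRuns); // (2 + 3 + 5) / 2
```

Because retries inflate the numerator without changing the denominator, thrash shows up directly in this number.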
+
+Important ideas:
+
+- use expensive reasoning at phase boundaries
+- use cheaper models only when artifacts are tight and the task is narrow
+- detect thrash early
+- benchmark replay on a small representative set, not everything
+- optimize for time saved, not just token minimization
+
+If spending more on tokens saves multiple days of work, that can still be a clear win.
+
+## 10. Minimum Viable Build
+
+For the clarified goal, LangGraph is worth adopting in the MVP because orchestration, durable execution, and resumable state are now part of the project’s core value.
+Start with a small headless TypeScript orchestrator built for batch execution inside Argo workflows.
+Add Langfuse from the beginning so traces, spans, prompts, runs, and outcomes are observable in a way that is useful both operationally and on a resume.
+
+Suggested MVP pieces:
+
+- LangGraph workflow graph / state machine for bounded slices
+- markdown artifacts with small machine-readable frontmatter or JSON manifests
+- simple TypeScript parsers for artifacts and statuses
+- Langfuse tracing for each run, stage, model call, and gate decision
+- verification bundle assembled into one review packet
+- cheap external thrash-detection heuristics
+- small replay benchmark harness
+- one headless executor path suitable for Argo
+
+This is enough to improve the real workflow and create a strong portfolio project while staying aligned with the desired end-state architecture.
+
+## 11. Portable Process Core, Disposable Runner
+
+The right design is a portable process core with a replaceable runner.
+
+- process rules, states, gates, policies, artifacts, and evaluation logic should be portable
+- the runtime executor should be disposable or replaceable at the boundary level
+
+Even if LangGraph is used in the MVP, the value should still live primarily in the process definition, policies, artifacts, event schema, and evaluation logic rather than in LangGraph-specific node wiring.
+The graph runtime should be treated as an adapter for execution, checkpointing, and resumability, not as the place where domain process knowledge gets trapped.
+
+## 12. Workflow / Service Boundary for the Pipeline Itself
+
+Follow the repository’s existing separation when building the pipeline.
+
+Pure/domain-like pieces:
+- pipeline state
+- gate rules
+- slice metadata
+- approval rules
+- conformance decisions
+
+Workflow/service pieces:
+- run coding agents
+- invoke models
+- read and write artifacts
+- launch review jobs
+- open PRs
+- store logs and run records
+
+Example:
+- “Can assembly start?” is a pure policy decision.
+- “Launch the assembly agent with model X” is a service call inside a workflow.
+
+That makes the process logic portable and keeps LangGraph as a replaceable adapter instead of a hard dependency.
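
The split between a pure policy decision and a service call might look roughly like this. All names here (`PipelineState`, `canAssemblyStart`, `AssemblyLauncher`) are hypothetical, and the specific gate conditions are assumptions chosen for illustration.

```typescript
type Phase = "requirements" | "design" | "assembly" | "review";

// Hypothetical pipeline state; the fields are assumptions.
interface PipelineState {
  phase: Phase;
  artifactFrozen: boolean;
  sliceApproved: boolean;
  openBlockers: string[];
}

type GateDecision =
  | { allowed: true }
  | { allowed: false; reasons: string[] };

// Pure policy: no I/O, no model calls, no runtime-specific types.
function canAssemblyStart(state: PipelineState): GateDecision {
  const reasons: string[] = [];
  if (state.phase !== "design") reasons.push(`phase is ${state.phase}, expected design`);
  if (!state.artifactFrozen) reasons.push("design artifact is not frozen");
  if (!state.sliceApproved) reasons.push("slice has no human approval");
  if (state.openBlockers.length > 0) reasons.push(`open blockers: ${state.openBlockers.join(", ")}`);
  return reasons.length === 0 ? { allowed: true } : { allowed: false, reasons };
}

// Service boundary: the side effect lives behind an interface.
interface AssemblyLauncher {
  launch(sliceId: string, model: string): Promise<void>;
}

// Workflow step: ask the policy first, then perform the side effect.
async function startAssembly(
  state: PipelineState,
  sliceId: string,
  model: string,
  launcher: AssemblyLauncher
): Promise<GateDecision> {
  const decision = canAssemblyStart(state);
  if (decision.allowed) await launcher.launch(sliceId, model);
  return decision;
}
```

In this sketch only `startAssembly` would need to change if the runner were swapped; `canAssemblyStart` can be unit-tested without any runtime.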
+
+## 13. LangGraph: When It Helps
+
+LangGraph is useful here because orchestration is now part of the main project rather than a future optimization.
+Its real benefits are:
+
+- durable execution
+- resumability
+- retries
+- branching workflows
+- checkpointed state
+- human-in-the-loop pauses
+- better operational management for complex graphs
+- easier alignment with headless batch execution in Argo-style systems
+
+For this direction, it is reasonable to use LangGraph in the MVP.
+The main caution is to avoid letting graph node wiring become the only place where business process rules live.
+Keep process semantics portable even if LangGraph is the first runtime.
+
+## 14. Training Data Value
+
+A pipeline can produce much better training data than raw chat logs.
+The valuable data is not just “the model wrote code.”
+The valuable data is:
+
+- given this phase and artifact state
+- with these checks failing or passing
+- what action/tool/model choice was correct next
+- what human correction was needed
+- what result was ultimately accepted
+
+This is useful for future model training, replay evaluation, and process improvement.
+
+## 15. What to Log for Training and Replay
+
+Do not assume the orchestration engine will automatically create a clean training corpus.
+Design explicit logging for the data you care about.
+
+Minimum useful fields:
+
+- run ID
+- stage ID
+- slice ID
+- artifact versions and hashes
+- prompt or template version
+- model choice
+- full input context given to the model for that step
+- tool choice and tool arguments
+- tool outputs or summaries
+- verification results
+- gate decision
+- human corrections or overrides
+- final disposition (accepted, rejected, redesign, escalated)
+
+If prompt design or routing rules materially affect cost or quality, log those too.
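
As a rough sketch, the minimum fields above could be captured in a single event type with a cheap completeness check. The type name, field names, and nested shapes are assumptions; the real schema is still an open decision.

```typescript
type Disposition = "accepted" | "rejected" | "redesign" | "escalated";

// Hypothetical event record mirroring the field list above.
interface PipelineEvent {
  runId: string;
  stageId: string;
  sliceId: string;
  artifactHashes: Record<string, string>; // artifact path -> content hash
  promptTemplateVersion: string;
  model: string;
  inputContext: string; // full context given to the model for this step
  toolCalls: { tool: string; args: unknown; outputSummary: string }[];
  verification: { check: string; passed: boolean }[];
  gateDecision: string;
  humanOverrides: string[];
  disposition: Disposition;
}

const REQUIRED_FIELDS: (keyof PipelineEvent)[] = [
  "runId", "stageId", "sliceId", "artifactHashes", "promptTemplateVersion",
  "model", "inputContext", "toolCalls", "verification", "gateDecision",
  "humanOverrides", "disposition",
];

// Returns the required fields that a candidate record has not set,
// so incomplete events can be rejected before they reach the log.
function missingFields(candidate: Partial<PipelineEvent>): (keyof PipelineEvent)[] {
  return REQUIRED_FIELDS.filter((field) => candidate[field] === undefined);
}
```

Rejecting incomplete records at write time is one way to keep the log usable for replay without cleaning it up afterwards.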
+
+## 16. LangGraph vs Training Logs
+
+LangGraph can persist execution state and checkpoints, and can be useful for debugging.
+But that is different from having a clean, normalized dataset for:
+
+- training
+- replay
+- evaluation
+- analytics
+- cost analysis
+
+So the orchestration system and the training/event log should be treated as separate concerns.
+Use orchestration state for execution and recovery.
+Use explicit event logs for learning and analysis.
+
+## 17. Hooking Into the Executor Layer
+
+Start with a headless executor boundary, not with deep integration into an interactive coding CLI.
+The first goal is not to instrument every internal token.
+The first goal is to supervise bounded work reliably inside batch execution.
+
+What the executor boundary should do first:
+
+- launch a bounded task with a clear artifact bundle
+- capture the returned summary, diff, and verification results
+- classify the outcome as continue / needs-human / blocked / failed
+- stop after one coherent slice or checkpoint
+- emit structured events and traces to Langfuse
+
+If an existing coding agent can operate in a non-chatty batch mode for bounded tasks, it can sit behind this executor boundary.
+If it cannot, keep it for human-guided stages and use a lower-level headless executor for automated assembly later.
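
The outcome classification could start as a small pure function over whatever the executor returns. The `ExecutorResult` fields and the precedence of the checks are assumptions for illustration, not a settled policy.

```typescript
type Outcome = "continue" | "needs-human" | "blocked" | "failed";

// Hypothetical summary of what came back from one bounded executor run.
interface ExecutorResult {
  exitedCleanly: boolean;      // executor finished without crashing or timing out
  askedForHelp: boolean;       // agent explicitly requested human input
  missingInputs: string[];     // artifacts the task needed but did not receive
  verificationPassed: boolean; // checks in the verification bundle passed
}

// Assumed precedence: hard failure first, then missing inputs,
// then anything that needs a human, otherwise continue.
function classifyOutcome(result: ExecutorResult): Outcome {
  if (!result.exitedCleanly) return "failed";
  if (result.missingInputs.length > 0) return "blocked";
  if (result.askedForHelp || !result.verificationPassed) return "needs-human";
  return "continue";
}
```

Keeping this outside the executor means the same classification applies no matter which coding agent sits behind the boundary.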
+
+## 18. What to Capture From the Executor
+
+If possible, capture at least:
+
+- exact task input sent
+- artifact bundle provided
+- model used
+- tool calls made
+- tool arguments
+- result summary
+- changed files
+- verification output
+- whether the run completed, asked for help, or thrashed
+- trace IDs, run IDs, span metadata, and checkpoint IDs
+
+For training-grade data, exact per-step context is better than only a final summary.
+LangGraph persistence and Langfuse traces are helpful, but they are still not the same thing as a clean replay/eval dataset.
+Explicit event capture is still required.
+
+## 19. Opencode-Specific Decision Point
+
+Opencode is now mainly a reference implementation for inner-loop agent behavior rather than the likely runtime foundation.
+
+Questions to answer before borrowing ideas from it or integrating with it:
+
+- Which parts are genuinely reusable in a headless Argo-oriented system?
+- Can its agent loop run cleanly in bounded, low-chatter automation?
+- Can you capture the effective prompt/context used?
+- Can you capture tool-use data well enough for replay and training?
+- Does borrowing from it reduce delivery time more than it increases architectural drag?
+
+If yes, copy patterns or adapt isolated pieces.
+If no, then the best path is to keep opencode as a reference and build a smaller dedicated executor for automated assembly and evaluation.
+
+## 20. Portfolio Value
+
+This can be a strong AgentOps portfolio project if it is real and measured.
+The impressive part is not size.
+The impressive part is demonstrating:
+
+- staged orchestration
+- model routing
+- evaluation and replay
+- cost and quality tradeoffs
+- review reduction
+- guardrails and risk controls
+- operational judgment about where humans stay in the loop
+
+A compact, sharp system with metrics is better than a giant “swarm” that is hard to explain.
+
+## 21. Practical Next Step
+
+Build a thin layer around the current repository process and use it on real work.
+Do not pause everything to build a big system.
+
+A good first direction:
+
+1. define machine-checkable phase and approval states
+2. define slice metadata and human sign-off criteria
+3. implement a small LangGraph state graph for one bounded slice
+4. wrap verification into one bundle
+5. add basic thrash detection and Langfuse tracing
+6. run it on one live workflow inside a headless batch path
+7. measure review time, retries, accepted-slice cost, and trace quality
+8. add explicit event logging for replay and evals
+9. decide whether any existing coding agent is good enough behind the executor boundary
+10. iterate based on real pain
+
+## 22. Six-Step Build Order for the First Effect-Template Use Case
+
+For the first use case, the target is developing software inside this repository structure rather than supporting a fully general coding environment from day one.
+The build order should stay tightly coupled to the minimum tool surface needed for one bounded workflow slice.
+
+### Step 1: Prove the End-to-End Slice Loop
+Goal:
+- get one bounded software-development slice running end to end in the effect-template style
+
+Tools to add:
+- read_file
+- list_files
+- write_file
+- edit_file
+- run_tests
+
+Why this first:
+- proves the basic harness can take a task, operate on repository artifacts, make code changes, run verification, and return a classified result
+- keeps the tool surface small enough to debug failures clearly
+
+### Step 2: Add First-Class Observability
+Goal:
+- make each run inspectable so failures and bottlenecks are visible immediately
+
+Tools to add:
+- trace_run or equivalent Langfuse instrumentation hooks
+
+Why now:
+- once the first loop works, observability becomes the fastest way to learn from real runs
+- tracing should come before adding much more autonomy so failures do not become opaque
+
+### Step 3: Add Replay and Evaluation Foundations
+Goal:
+- make runs reproducible and measurable rather than anecdotal
+
+Tools to add:
+- save_replay_record
+- replay_run
+
+Why now:
+- replayability is one of the core differentiators of the harness
+- this creates a basis for later evals, cost analysis, and regression checks
+
+### Step 4: Add Gates and Thrash Control
+Goal:
+- keep the system from looping uselessly or pushing low-quality output forward
+
+Tools to add:
+- gate_evaluator
+- thrash_detector
+
+Why now:
+- after replay exists, it becomes easier to define and tune failure heuristics
+- this is where the harness starts protecting time and token spend rather than only executing work
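
A cheap first `thrash_detector` heuristic might look like the sketch below. What counts as thrash is still an open decision, so the rule used here (the same set of checks failing across several consecutive attempts) and the default window of 3 are assumptions.

```typescript
// Hypothetical per-attempt record: the names of checks that failed.
interface Attempt {
  failingChecks: string[];
}

// Flags thrash when the last `window` attempts all fail the same
// non-empty set of checks, i.e. retries are not changing the result.
function isThrashing(attempts: Attempt[], window: number = 3): boolean {
  if (attempts.length < window) return false;
  const signatures = attempts
    .slice(-window)
    .map((a) => a.failingChecks.slice().sort().join("|"));
  const [first, ...rest] = signatures;
  if (first === undefined || first === "") return false; // passing runs are not thrash
  return rest.every((signature) => signature === first);
}
```

A replay set makes it possible to tune the window against recorded runs instead of guessing.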
+
+### Step 5: Improve Execution Safety
+Goal:
+- make automated runs safer and more diagnosable in headless environments
+
+Tools to add:
+- safer shell or sandbox executor
+- git_diff
+
+Why now:
+- once the harness can already complete bounded slices, safety and change inspection become more valuable than adding more raw capability
+- diff visibility is especially useful for higher-level review
+
+### Step 6: Add Richer Orchestration
+Goal:
+- move from one bounded slice to more capable autonomous workflow behavior
+
+Tools to add:
+- planning or decomposition tool
+- multi-agent delegation or subtask dispatch
+
+Why last:
+- richer orchestration compounds complexity quickly
+- it should be added only after the basic loop, observability, replay, and control mechanisms already work
+
+## 23. Open Decisions Still Worth Discussing
+
+Before implementing too much, it would be useful to decide:
+
+- what exact human approvals are required at each phase
+- how small slices should be in practice
+- what counts as thrash vs healthy iteration
+- what minimum event schema is worth storing now
+- whether to store full prompts or prompt templates plus resolved context
+- whether to store full tool outputs or normalized summaries plus raw attachments
+- how much of the coding CLI can be wrapped without losing its advantages
+- whether automated assembly should use the same executor as human-guided work
+- what metrics will prove the pipeline is saving time rather than adding ceremony
+- what data retention and privacy rules apply if you later train on this data
+
+## 24. Bottom Line
+
+The opportunity is real.
+The trust barrier is still the main constraint.
+
+So the right move is:
+- do not wait for models to become magically trustworthy
+- do not overbuild a giant pipeline first
+- formalize the process you already use
+- move review up a level
+- keep human judgment at the risky seams
+- log the process in a training- and replay-friendly way
+- measure whether the pipeline actually saves time
@@ -0,0 +1,83 @@
+# System Prompt: Evolutionary Architecture Pipeline Implementation
+
+**Goal:** Build a custom CI/CD script that combines spatial data (where code lives), temporal data (when code changes), and structural data (what the code does) to guide incremental refactoring in a TypeScript + Effect.ts codebase.
+
+**Tech Stack:**
+1.  **Dependency-Cruiser:** To map spatial boundaries (Seams).
+2.  **Hercules:** To map temporal evolutionary coupling. 
+3.  **Opengrep:** To identify data-flow anti-patterns (Tramp Data, Branching).
+4.  **TypeScript (Node.js):** The "glue" script (`refactor-bot.ts`) that synthesizes the data.
+
+---
+
+### Step 1: Establish Spatial Boundaries (Dependency-Cruiser)
+**Agent Instructions:**
+1. Install dependency-cruiser: `npm install -D dependency-cruiser`
+2. Initialize it with `npx depcruise --init` and configure it for a TypeScript environment.
+3. Modify the `.dependency-cruiser.js` configuration to define "Seams". Group the application by folders (e.g., Domain boundaries, Layers, or Effect modules).
+4. Create an npm script named `"map:boundaries"` that runs depcruise and outputs the architecture to a JSON file: `npx depcruise src --output-type json > ./.architecture/seams.json`.
+
+---
+
+### Step 2: Establish Temporal Coupling (Hercules / Git Mining)
+**Agent Instructions:**
+1. Set up a tool to extract **Logical Coupling** from the git history. 
+   *(Note for Agent: Use the `src-d/hercules` binary via Docker or Go, OR use the simpler Clojure-based `code-maat` alternative if Hercules is blocked by local environment constraints).*
+2. Execute a git log command to get history:
+   ```bash
+   git log --all --numstat --date=short --pretty=format:'--%h--%ad--%aN' --no-renames > ./.architecture/logfile.log
+   ```
+3. Run the coupling analysis tool against this log. The goal is to output a `.csv` or `.json` file (`coupling.json`) with three columns/fields: `[FileA, FileB, CoChangePercentage]`.
+4. Ensure this output is saved to `./.architecture/coupling.json`.
+
+---
+
+### Step 3: Implement Deep Static Analysis (Opengrep / Semgrep)
+**Agent Instructions:**
+1. Install the open-source CLI (e.g., Opengrep by following its project install instructions, or Semgrep with `pip install semgrep`).
+2. Create a folder `./.architecture/rules/`.
+3. Write custom YAML rules to detect code smells specific to **Effect.ts** and functional pipelines. 
+4. **Rule 1: Tramp Data in Pipelines.** Write a rule that looks for a variable passed into a `.pipe()` or `Effect.gen` that crosses a recognized boundary unmodified. Use the ellipsis operator (`...`).
+   *Example concept:*
+   ```yaml
+   rules:
+     - id: effect-tramp-data
+       pattern: pipe(..., $MOD_A.get($DATA), ..., $MOD_B.save($DATA), ...)
+       message: "Potential tramp data crossing seam boundaries."
+       languages: [typescript]
+       severity: WARNING
+   ```
+5. Ensure there is a command to run this and output to JSON: `opengrep --config ./.architecture/rules/ --json > ./.architecture/smells.json`
+
+---
+
+### Step 4: Write the Synthesis Script (`refactor-bot.ts`)
+**Agent Instructions:**
+Write a Node.js TypeScript file (`scripts/refactor-bot.ts`) that acts as the brain. This script must:
+
+**1. Load the Data:**
+*   Parse `seams.json` (Which files belong to which boundary).
+*   Parse `coupling.json` (How often files change together).
+*   Parse `smells.json` (Where the AST anti-patterns are).
+
+**2. Detect Cross-Seam Leaks (High External Coupling):**
+*   Look at pairs in `coupling.json` that co-change > 60% of the time, but live in *different* Seams according to `seams.json`.
+*   *Action:* Flag these as **Architectural Leaks**. Check if `smells.json` found tramp data in these specific files. 
+
+**3. Detect Splittable Contexts (Low Internal Cohesion):**
+*   Look at files that exist in the *same* Seam (e.g., both are in `src/payments/`), but have a co-change frequency of < 5%.
+*   *Action:* Flag this boundary as a candidate to be split into two smaller bounded contexts.
+
+**4. Generate the Output/Report:**
+*   Output a Markdown report summarizing:
+    *   **⚠️ High Priority Refactors:** Cross-layer tightly coupled files (includes Semgrep AST context).
+    *   **✂️ Suggested Seam Splits:** Modules with low internal cohesion.
+*   Exit with code `1` if the max coupling threshold is breached to fail the CI/CD pipeline, otherwise exit with `0`.
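
The core of the synthesis step can be sketched as a pure function over the parsed coupling data. This is a minimal sketch under several assumptions: the `Coupling` field names, the rule that a file's seam is its first directory under `src/`, and the choice to fail CI whenever any leak is found are all illustrative. File loading and the `smells.json` lookup are left out.

```typescript
// Hypothetical shape of one row in coupling.json.
interface Coupling {
  fileA: string;
  fileB: string;
  coChangePercentage: number;
}

// Assumption: the seam is the first directory under src/,
// e.g. "src/payments/charge.ts" -> "payments".
function seamOf(file: string): string {
  const parts = file.split("/");
  const srcIndex = parts.indexOf("src");
  return parts[srcIndex + 1] ?? "unknown";
}

function analyze(
  couplings: Coupling[],
  leakThreshold: number = 60,
  splitThreshold: number = 5
) {
  // Different seams but frequently changed together: architectural leak.
  const leaks = couplings.filter(
    (c) => seamOf(c.fileA) !== seamOf(c.fileB) && c.coChangePercentage > leakThreshold
  );
  // Same seam but rarely changed together: candidate for splitting.
  const splitCandidates = couplings.filter(
    (c) => seamOf(c.fileA) === seamOf(c.fileB) && c.coChangePercentage < splitThreshold
  );
  // Assumed CI rule: any leak breaches the threshold.
  return { leaks, splitCandidates, exitCode: leaks.length > 0 ? 1 : 0 };
}

const report = analyze([
  { fileA: "src/payments/charge.ts", fileB: "src/orders/order.ts", coChangePercentage: 72 },
  { fileA: "src/payments/charge.ts", fileB: "src/payments/refund.ts", coChangePercentage: 3 },
  { fileA: "src/orders/order.ts", fileB: "src/orders/cart.ts", coChangePercentage: 40 },
]);
```

Pairs that do not appear in `coupling.json` at all are not considered here, so files that never change together are only flagged if the coupling tool emits a row for them.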
+
+---
+
+### Definition of Done for Agent:
+1. `dependency-cruiser` is configured and outputs `seams.json`.
+2. Git history pipeline outputs `coupling.json`.
+3. Opengrep/Semgrep YAML rules are written for Effect.ts and output `smells.json`.
+4. `refactor-bot.ts` successfully reads all three files and prints an Evolutionary Architecture Report in the console.
@@ -0,0 +1,5 @@
+the "domain-modeling.md" file in the agent core rules is way too long. it should live in docs, not fill the agent's context
+
+add notes on how the agent should name commits. the agent only makes commits after I review the code
+
+code within a bounded context should be DRY, but different bounded contexts can implement the same function
@@ -0,0 +1,40 @@
+capability-based use, not passing in the entire db
+- see Scott Wlaschin's presentations on capability-based code
+
+label input as trusted or not, strings are no longer raw
+
+sanitization notes: add trusted vs untrusted string input? i.e. save the string's source?
+
+are the security boundaries comprehensive enough? with 12 concerns I worry that is too many for one agent, though I guess the security pass can just be run repeatedly. do we cover enough concerns?
+add sanitization to strings
+mention seams from legacy code book
+make sure tests are property tests
+tdd for domain objects as well
+
+sub-agent / orchestration documentation ideas:
+- add shared orchestrator policy for scope control, restart criteria, and limits on agent freedom
+- add a formal failure taxonomy for agent runs, including drift, tool misuse, silent failure, and recovery notes
+- add evaluation thresholds and acceptance-check guidance for agent-produced work
+- add model-routing and cost rules for choosing cheaper vs stronger models
+- add escalation and recovery policies for when an agent should stop, hand off, or request a fresh context
+- add guidance for refactors: preserve backwards compatibility only with an explicit reason, otherwise prefer the direct change
+- add implementation guidance for future sub-agent benchmarks using real TypeScript tests as verification
+- add lightweight evidence-capture guidance so agent runs can be reviewed later without heavy documentation overhead
+
+
+# bounded seam identification tool:
+it feels like a tool could automatically identify seams (seams are pretty obvious in this architecture, right? they are the layers in this template?) and then traverse the changes to show which parts frequently need coordinated edits across seams. it could run on every PR, and once a spot accumulates a high enough score, it could start a refactor flow. and if pieces within a seam always change independently, maybe they should be 2 bounded seams?
+And yes, your coordinated-change idea is strong. This repo’s explicit layers help, but seams are not just layers; they are places where behavior can vary independently. A tool could mine PRs/commits for:
+- files/functions frequently changed together
+- edits that cross module APIs
+- params/data passed through unchanged across many callers
+- repeated branching at the same boundary
+
+# need a find-bounded-contexts step for features
+need to find whether a feature touches new or existing bounded contexts. arrange the repo by bounded context with shared primitives
+
+# do we have a skill for decomposing a feature into several workflows/tasks?
+define the difference between workflows and tasks
+
+# need process for refactoring
+i.e. I don't need to wait for the template to be done to use it; I can just add the improvements later
@@ -0,0 +1,4 @@
+# ideas
+- use sub package.json to isolate bounded contexts
+- auto update and refactor: Renovate updates dependencies, a codemod CLI does the refactor
+- use sourcebot or cocoindex-code to look at semantic meaning and check whether there are similar functions that can be removed, i.e. "write everything twice" (WET) across projects, "don't repeat yourself" (DRY) within bounded contexts. would need to write our own pipeline