- Fully automated code writing is close for bounded, low-risk, well-specified work.
- Fully unsupervised ownership of large, evolving, high-stakes systems is not close.
- Raw coding ability is improving faster than architectural consistency, uncertainty calibration, and trustworthy self-review.
- In the near term, the goal is not to remove review entirely. The goal is to move review up a level.
## 2. Current Bottleneck
- The early design phases in this repository are already relatively strong.
- The main bottleneck is assembly and the review/refactor thrash around assembly.
- The biggest time sink is repeated loops around common implementation issues, potential refactors, and reviewing too much low-level detail.
- Strong design artifacts reduce the need to reconstruct intent from code, but they do not yet fully remove the need for human judgment.
## 3. What a Pipeline Adds Beyond Manual Skill Use
Right now, the human is acting as the scheduler and state machine.
A pipeline externalizes that work so it is explicit and enforceable.
Useful additions that were not as necessary when doing the process manually:
- machine-checkable approval state
- explicit slice definitions
- spec-to-code traceability rules
- human-signoff criteria by phase
- artifact diffs between stages
- automatic verification bundles
- replayable evaluation runs
- thrash/change-war detection
- audit trail for decisions and outcomes
The main benefit is not “agents do more coding.”
The main benefit is that the process becomes more stable, repeatable, and reviewable.
## 4. What to Pipeline First
Build the thinnest useful pipeline around the current process.
Do not start with a large swarm system.
Best initial targets:
- phase state machine
- artifact and gate manifests
- assembly slice runner
- verification bundle
- thrash detection
- small replay benchmark harness
This should be built as a headless harness first, designed for Argo-style execution rather than an interactive coding CLI.
Use existing coding agents and projects like opencode as reference implementations for inner-loop patterns, not as the architectural foundation.
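As a rough illustration of the "phase state machine" target, the sketch below keeps the allowed transitions as plain data and refuses forward movement without approval. The phase names and the `approved` flag are assumptions made for the example, not the repository's actual phases.

```typescript
// Hypothetical phase names; substitute the repository's real phases.
const phases = ["requirements", "design", "assembly", "verification", "done"] as const;
type Phase = (typeof phases)[number];

// Allowed transitions kept as data so they can be inspected, diffed, and tested.
const transitions: Record<Phase, readonly Phase[]> = {
  requirements: ["design"],
  design: ["assembly", "requirements"],
  assembly: ["verification", "design"],
  verification: ["done", "assembly"],
  done: [],
};

interface PipelineState {
  phase: Phase;
  approved: boolean; // has a human signed off on the current phase?
}

interface TransitionResult {
  ok: boolean;
  state: PipelineState; // unchanged when ok is false
  reason?: string;
}

function advance(state: PipelineState, next: Phase): TransitionResult {
  if (transitions[state.phase].indexOf(next) === -1) {
    return { ok: false, state, reason: `illegal transition: ${state.phase} -> ${next}` };
  }
  // Moving forward requires approval; moving back for rework does not.
  const movingForward = phases.indexOf(next) > phases.indexOf(state.phase);
  if (movingForward && !state.approved) {
    return { ok: false, state, reason: `phase ${state.phase} is not approved` };
  }
  // Each newly entered phase starts unapproved.
  return { ok: true, state: { phase: next, approved: false } };
}
```

Because the rules live in a table rather than in runner code, the same definition could later be driven by any executor.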
## 5. What Should Stay Human
Humans should still own the judgment-heavy transitions:
- requirements freeze
- workflow or slice approval
- architecture exceptions
- security sign-off for risky changes
- final sign-off for high-blast-radius changes
- resolving ambiguity or contradictory requirements
The pipeline should reduce line-by-line review, not eliminate human judgment.
## 6. Review at a Higher Level
The practical goal is to move review from low-level code inspection to higher-level conformance review.
Instead of asking:
- What does this code do?
- Did the agent miss some implementation detail?
Try to make review focus on:
- Does this slice match the frozen artifact?
- Does it violate any domain invariants or trust boundaries?
- Did it introduce any risky seams or suspicious shortcuts?
- Does it need redesign or human review, or is it safe to merge?
This is realistic if artifacts become stricter and slices become smaller.
## 7. Preventing Subtle Logic Errors
The system will mostly catch errors early rather than prevent every error outright.
That is still valuable because catching errors before merge is much cheaper than catching them later.
Helpful layers:
- frozen design artifacts
- spec-to-code traceability
- property and mutation tests where useful
- risk-focused seam review
- deterministic boundary and architecture checks
- forcing agents to cite which invariant each change satisfies
The point is not perfection.
The point is making subtle logic flaws rarer and cheaper to catch.
## 8. Slices and Assembly Scope
Assembly often does too much when it translates a whole blueprint at once.
A better pattern is:
1. freeze the artifact
2. choose one workflow slice
3. implement one tracer-bullet path end to end
4. run tests/checks/review
5. expand behavior within that slice
6. move to the next slice
A slice is not necessarily one PR per policy or adapter.
A slice is one coherent bounded vertical change that may touch a workflow, several policies, and several adapters if they belong to one contract.
## 9. Cost Reality
A pipeline can save money only if it reduces retries, review thrash, and change wars.
If it creates uncontrolled loops, it can absolutely increase token costs.
The key metric is not cost per token.
The key metric is cost per accepted slice and time saved per accepted slice.
Important ideas:
- use expensive reasoning at phase boundaries
- use cheaper models only when artifacts are tight and the task is narrow
- detect thrash early
- benchmark replay on a small representative set, not everything
- optimize for time saved, not just token minimization
If spending more on tokens saves multiple days of work, that can still be a clear win.
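The "cost per accepted slice" metric can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: the `SliceRun` fields and the idea of pricing review time at an hourly rate are illustrative choices, not something the notes above prescribe.

```typescript
// Hypothetical record of one attempt at a slice (retries are separate entries).
interface SliceRun {
  sliceId: string;
  tokenCostUsd: number;
  reviewMinutes: number;
  accepted: boolean;
}

// Total spend (tokens plus human review time) divided by the number of
// distinct slices that were eventually accepted. Failed retries still count
// toward the cost, which is what makes thrash visible in the metric.
function costPerAcceptedSlice(
  runs: readonly SliceRun[],
  reviewerRatePerHourUsd: number,
): number | null {
  const accepted = new Set(runs.filter((r) => r.accepted).map((r) => r.sliceId));
  if (accepted.size === 0) return null; // nothing accepted yet, metric undefined
  const total = runs.reduce(
    (sum, r) => sum + r.tokenCostUsd + (r.reviewMinutes / 60) * reviewerRatePerHourUsd,
    0,
  );
  return total / accepted.size;
}
```

Tracking this number per week, rather than raw token spend, keeps the focus on retries and review time.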
## 10. Minimum Viable Build
For the clarified goal, LangGraph is worth adopting in the MVP because orchestration, durable execution, and resumable state are now part of the project’s core value.
Start with a small headless TypeScript orchestrator built for batch execution inside Argo workflows.
Add Langfuse from the beginning so traces, spans, prompts, runs, and outcomes are observable in a way that is useful both operationally and on a resume.
Suggested MVP pieces:
- LangGraph workflow graph / state machine for bounded slices
- markdown artifacts with small machine-readable frontmatter or JSON manifests
- simple TypeScript parsers for artifacts and statuses
- Langfuse tracing for each run, stage, model call, and gate decision
- verification bundle assembled into one review packet
- cheap external thrash-detection heuristics
- small replay benchmark harness
- one headless executor path suitable for Argo
This is enough to improve the real workflow and create a strong portfolio project while staying aligned with the desired end-state architecture.
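For the "markdown artifacts with small machine-readable frontmatter" piece, a parser can stay very small. The sketch below assumes flat `key: value` frontmatter delimited by `---` lines; the keys used in the usage note (`status`, `slice`) are hypothetical, and a real YAML library would be needed for anything nested.

```typescript
interface Artifact {
  meta: Record<string, string>; // flat key/value pairs from the frontmatter
  body: string;                 // markdown content after the frontmatter
}

// Parses an optional leading frontmatter block of the form:
// ---
// key: value
// ---
function parseArtifact(text: string): Artifact {
  const lines = text.split(/\r?\n/);
  if (lines[0] !== "---") return { meta: {}, body: text };

  const end = lines.indexOf("---", 1);
  if (end === -1) throw new Error("unterminated frontmatter");

  const meta: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    if (line.trim() === "") continue;
    const colon = line.indexOf(":");
    if (colon === -1) throw new Error(`bad frontmatter line: ${line}`);
    meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: lines.slice(end + 1).join("\n") };
}
```

A gate could then check something like `parseArtifact(text).meta.status === "frozen"` before allowing assembly to start.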
## 11. Portable Process Core, Disposable Runner
The right design is a portable process core with a replaceable runner.
- process rules, states, gates, policies, artifacts, and evaluation logic should be portable
- the runtime executor should be disposable or replaceable at the boundary level
Even if LangGraph is used in the MVP, the value should still live primarily in the process definition, policies, artifacts, event schema, and evaluation logic rather than in LangGraph-specific node wiring.
The graph runtime should be treated as an adapter for execution, checkpointing, and resumability, not as the place where domain process knowledge gets trapped.
## 12. Workflow / Service Boundary for the Pipeline Itself
Follow the repository’s existing separation when building the pipeline.
Pure/domain-like pieces:
- pipeline state
- gate rules
- slice metadata
- approval rules
- conformance decisions
Workflow/service pieces:
- run coding agents
- invoke models
- read and write artifacts
- launch review jobs
- open PRs
- store logs and run records
Example:
- “Can assembly start?” is a pure policy decision.
- “Launch the assembly agent with model X” is a service call inside a workflow.
That makes the process logic portable and keeps LangGraph as a replaceable adapter instead of a hard dependency.
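The two examples above can be sketched as follows. All names here (`canAssemblyStart`, `AgentLauncher`, the gate input fields) are hypothetical; the point is only that the policy function does no I/O while the workflow function takes its side-effecting dependency as a parameter.

```typescript
// Pure policy: no I/O, so it is trivially testable and portable.
interface AssemblyGateInput {
  artifactStatus: "draft" | "frozen";
  sliceApproved: boolean;
  openBlockers: number;
}

interface GateDecision {
  allowed: boolean;
  reasons: string[]; // empty when allowed
}

function canAssemblyStart(input: AssemblyGateInput): GateDecision {
  const reasons: string[] = [];
  if (input.artifactStatus !== "frozen") reasons.push("artifact is not frozen");
  if (!input.sliceApproved) reasons.push("slice is not approved");
  if (input.openBlockers > 0) reasons.push(`${input.openBlockers} open blocker(s)`);
  return { allowed: reasons.length === 0, reasons };
}

// Service boundary: the workflow depends on this interface, not on a runtime.
interface AgentLauncher {
  launch(sliceId: string, model: string): Promise<string>; // resolves to a run ID
}

// Workflow: consult the pure policy first, then make the service call.
async function startAssembly(
  input: AssemblyGateInput,
  sliceId: string,
  model: string,
  launcher: AgentLauncher,
): Promise<{ started: boolean; reasons: string[]; runId?: string }> {
  const decision = canAssemblyStart(input);
  if (!decision.allowed) return { started: false, reasons: decision.reasons };
  const runId = await launcher.launch(sliceId, model);
  return { started: true, reasons: [], runId };
}
```

A graph runtime node would then be a thin wrapper that calls `startAssembly` with a concrete launcher.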
## 13. LangGraph: When It Helps
LangGraph is useful here because orchestration is now part of the main project rather than a future optimization.
Its real benefits are:
- durable execution
- resumability
- retries
- branching workflows
- checkpointed state
- human-in-the-loop pauses
- better operational management for complex graphs
- easier alignment with headless batch execution in Argo-style systems
For this direction, it is reasonable to use LangGraph in the MVP.
The main caution is to avoid letting graph node wiring become the only place where business process rules live.
Keep process semantics portable even if LangGraph is the first runtime.
## 14. Training Data Value
A pipeline can produce much better training data than raw chat logs.
The valuable data is not just “the model wrote code.”
The valuable data is:
- given this phase and artifact state
- with these checks failing or passing
- what action/tool/model choice was correct next
- what human correction was needed
- what result was ultimately accepted
This is useful for future model training, replay evaluation, and process improvement.
## 15. What to Log for Training and Replay
Do not assume the orchestration engine will automatically create a clean training corpus.
Design explicit logging for the data you care about.
Minimum useful fields:
- run ID
- stage ID
- slice ID
- artifact versions and hashes
- prompt or template version
- model choice
- full input context given to the model for that step
- tool choice and tool arguments
- tool outputs or summaries
- verification results
- gate decision
- human corrections or overrides
- final disposition (accepted, rejected, redesign, escalated)
If prompt design or routing rules materially affect cost or quality, log those too.
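One way to make the field list enforceable is to give it a type and reject incomplete events at write time. This is a sketch only: the property names are one possible mapping of the fields listed above, and JSONL is an assumed storage format.

```typescript
type Disposition = "accepted" | "rejected" | "redesign" | "escalated";

interface RunEvent {
  runId: string;
  stageId: string;
  sliceId: string;
  artifactHashes: Record<string, string>; // artifact path -> content hash
  promptVersion: string;
  model: string;
  inputContext: string; // full context given to the model for this step
  toolCalls: { name: string; args: unknown; outputSummary: string }[];
  verification: { check: string; passed: boolean }[];
  gateDecision: string;
  humanOverride?: string;
  disposition?: Disposition; // set once the slice reaches a final state
}

const requiredFields = [
  "runId", "stageId", "sliceId", "artifactHashes", "promptVersion",
  "model", "inputContext", "toolCalls", "verification", "gateDecision",
] as const;

// Names of required fields that are absent from a candidate event.
function missingFields(event: Partial<RunEvent>): string[] {
  return requiredFields.filter((field) => event[field] === undefined);
}

// One JSON object per line keeps the log append-only and easy to stream.
function toJsonl(events: readonly RunEvent[]): string {
  return events.map((e) => JSON.stringify(e) + "\n").join("");
}
```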
## 16. LangGraph vs Training Logs
LangGraph can persist execution state and checkpoints, and can be useful for debugging.
But that is different from having a clean, normalized dataset for:
- training
- replay
- evaluation
- analytics
- cost analysis
So the orchestration system and the training/event log should be treated as separate concerns.
Use orchestration state for execution and recovery.
Use explicit event logs for learning and analysis.
## 17. Hooking Into the Executor Layer
Start with a headless executor boundary, not with deep integration into an interactive coding CLI.
The first goal is not to instrument every internal token.
The first goal is to supervise bounded work reliably inside batch execution.
What the executor boundary should do first:
- launch a bounded task with a clear artifact bundle
- capture the returned summary, diff, and verification results
- classify the outcome as continue / needs-human / blocked / failed
- stop after one coherent slice or checkpoint
- emit structured events and traces to Langfuse
If an existing coding agent can operate in a non-chatty batch mode for bounded tasks, it can sit behind this executor boundary.
If it cannot, keep it for human-guided stages and use a lower-level headless executor for automated assembly later.
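The outcome classification at the executor boundary could look roughly like this. The `ExecutorResult` fields and the precedence of the rules are assumptions for illustration; the four outcome labels come from the list above.

```typescript
type Outcome = "continue" | "needs-human" | "blocked" | "failed";

// Hypothetical shape of what a bounded executor run returns.
interface ExecutorResult {
  exitedCleanly: boolean;      // process finished without crashing or timing out
  missingInputs: string[];     // artifacts the task needed but did not receive
  askedForHelp: boolean;       // the agent explicitly requested human input
  verificationPassed: boolean; // tests and checks in the verification bundle
  summary: string;
  diff: string;
}

// Assumed precedence: hard failures first, then missing inputs,
// then anything that needs judgment, and only then "continue".
function classifyOutcome(r: ExecutorResult): Outcome {
  if (!r.exitedCleanly) return "failed";
  if (r.missingInputs.length > 0) return "blocked";
  if (r.askedForHelp || !r.verificationPassed) return "needs-human";
  return "continue";
}
```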
## 18. What to Capture From the Executor
If possible, capture at least:
- exact task input sent
- artifact bundle provided
- model used
- tool calls made
- tool arguments
- result summary
- changed files
- verification output
- whether the run completed, asked for help, or thrashed
- trace IDs, run IDs, span metadata, and checkpoint IDs
For training-grade data, exact per-step context is better than only a final summary.
LangGraph persistence and Langfuse traces are helpful, but they are still not the same thing as a clean replay/eval dataset.
Explicit event capture is still required.
## 19. Opencode-Specific Decision Point
Opencode is now mainly a reference implementation for inner-loop agent behavior rather than the likely runtime foundation.
Questions to answer before borrowing ideas from it or integrating with it:
- Which parts are genuinely reusable in a headless Argo-oriented system?
- Can its agent loop run cleanly in bounded, low-chatter automation?
- Can you capture the effective prompt/context used?
- Can you capture tool-use data well enough for replay and training?
- Does borrowing from it reduce delivery time more than it increases architectural drag?
If the answers are mostly yes, copy patterns or adapt isolated pieces.
If not, the best path is to keep opencode as a reference and build a smaller dedicated executor for automated assembly and evaluation.
## 20. Portfolio Value
This can be a strong AgentOps portfolio project if it is real and measured.
The impressive part is not size.
The impressive part is demonstrating:
- staged orchestration
- model routing
- evaluation and replay
- cost and quality tradeoffs
- review reduction
- guardrails and risk controls
- operational judgment about where humans stay in the loop
A compact, sharp system with metrics is better than a giant “swarm” that is hard to explain.
## 21. Practical Next Step
Build a thin layer around the current repository process and use it on real work.
Do not pause everything to build a big system.
A good first direction:
1. define machine-checkable phase and approval states
2. define slice metadata and human sign-off criteria
3. implement a small LangGraph state graph for one bounded slice
4. wrap verification into one bundle
5. add basic thrash detection and Langfuse tracing
6. run it on one live workflow inside a headless batch path
7. measure review time, retries, accepted-slice cost, and trace quality
8. add explicit event logging for replay and evals
9. decide whether any existing coding agent is good enough behind the executor boundary
10. iterate based on real pain
## 22. Six-Step Build Order for the First Effect-Template Use Case
For the first use case, the target is developing software inside this repository structure rather than supporting a fully general coding environment from day one.
The build order should stay tightly coupled to the minimum tool surface needed for one bounded workflow slice.
### Step 1: Prove the End-to-End Slice Loop
Goal:
- get one bounded software-development slice running end to end in the effect-template style
Tools to add:
- read_file
- list_files
- write_file
- edit_file
- run_tests
Why this first:
- proves the basic harness can take a task, operate on repository artifacts, make code changes, run verification, and return a classified result
- keeps the tool surface small enough to debug failures clearly
### Step 2: Add First-Class Observability
Goal:
- make each run inspectable so failures and bottlenecks are visible immediately
Tools to add:
- trace_run or equivalent Langfuse instrumentation hooks
Why now:
- once the first loop works, observability becomes the fastest way to learn from real runs
- tracing should come before adding much more autonomy so failures do not become opaque
### Step 3: Add Replay and Evaluation Foundations
Goal:
- make runs reproducible and measurable rather than anecdotal
Tools to add:
- save_replay_record
- replay_run
Why now:
- replayability is one of the core differentiators of the harness
- this creates a basis for later evals, cost analysis, and regression checks
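A minimal in-memory sketch of what `save_replay_record` and `replay_run` might do is shown below. It assumes the replayed step is deterministic (for model calls that would mean replaying recorded outputs or pinning settings), and the `ReplayStore` class and record shape are hypothetical.

```typescript
interface ReplayRecord<I, O> {
  runId: string;
  input: I;  // exact input the step received
  output: O; // output that was recorded at the time
}

// In-memory stand-in for save_replay_record; a real store would persist records.
class ReplayStore<I, O> {
  private readonly records = new Map<string, ReplayRecord<I, O>>();

  save(record: ReplayRecord<I, O>): void {
    this.records.set(record.runId, record);
  }

  get(runId: string): ReplayRecord<I, O> | undefined {
    return this.records.get(runId);
  }
}

// Stand-in for replay_run: re-execute the step on the recorded input and
// compare with the recorded output. JSON comparison is key-order sensitive,
// which is acceptable for a sketch.
function replayRun<I, O>(
  record: ReplayRecord<I, O>,
  step: (input: I) => O,
): { matches: boolean; actual: O } {
  const actual = step(record.input);
  return { matches: JSON.stringify(actual) === JSON.stringify(record.output), actual };
}
```

A regression check is then a loop over stored records that reports every run whose replay no longer matches.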
### Step 4: Add Gates and Thrash Control
Goal:
- keep the system from looping uselessly or pushing low-quality output forward
Tools to add:
- gate_evaluator
- thrash_detector
Why now:
- after replay exists, it becomes easier to define and tune failure heuristics
- this is where the harness starts protecting time and token spend rather than only executing work
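As one possible `thrash_detector` heuristic, the sketch below flags a run when the same set of files is edited with the same failing checks more than a configurable number of times. The `Attempt` shape and the default threshold are assumptions; real heuristics would likely be tuned against replay data.

```typescript
// Hypothetical summary of one attempt within a slice.
interface Attempt {
  changedFiles: readonly string[];
  failingChecks: readonly string[];
}

// Order-independent fingerprint of "what was touched" and "what still fails".
function fingerprint(a: Attempt): string {
  return [...a.changedFiles].sort().join(",") + "|" + [...a.failingChecks].sort().join(",");
}

// Thrash means the same failing situation keeps recurring.
function isThrashing(attempts: readonly Attempt[], maxRepeats = 2): boolean {
  const counts = new Map<string, number>();
  for (const attempt of attempts) {
    if (attempt.failingChecks.length === 0) continue; // passing attempts are progress
    const key = fingerprint(attempt);
    const seen = (counts.get(key) ?? 0) + 1;
    if (seen > maxRepeats) return true;
    counts.set(key, seen);
  }
  return false;
}
```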
### Step 5: Improve Execution Safety
Goal:
- make automated runs safer and more diagnosable in headless environments
Tools to add:
- safer shell or sandbox executor
- git_diff
Why now:
- once the harness can already complete bounded slices, safety and change inspection become more valuable than adding more raw capability
- diff visibility is especially useful for higher-level review
### Step 6: Add Richer Orchestration
Goal:
- move from one bounded slice to more capable autonomous workflow behavior
# System Prompt: Evolutionary Architecture Pipeline Implementation
**Goal:** Build a custom CI/CD script that combines spatial data (where code lives), temporal data (when code changes), and structural data (what the code does) to guide incremental refactoring in a TypeScript + Effect.ts codebase.
**Tech Stack:**
1. **Dependency-Cruiser:** To map spatial boundaries (Seams).
2. **Hercules:** To map temporal evolutionary coupling.
3. **Opengrep:** To identify data-flow anti-patterns (Tramp Data, Branching).
4. **TypeScript (Node.js):** The "glue" script (`refactor-bot.ts`) that synthesizes the data.
2. Initialize it with `npx depcruise --init` and configure it for a TypeScript environment.
3. Modify the `.dependency-cruiser.js` configuration to define "Seams". Group the application by folders (e.g., Domain boundaries, Layers, or Effect modules).
4. Create an npm script named `"map:boundaries"` that runs depcruise and outputs the architecture to a JSON file: `npx depcruise src --output-type json > ./.architecture/seams.json`.
1. Set up a tool to extract **Logical Coupling** from the git history.
*(Note for Agent: Use the `src-d/hercules` binary via Docker or Go, or use the simpler Clojure-based `code-maat` alternative if Hercules is blocked by local environment constraints.)*
3. Run the coupling analysis tool against this log. The goal is to output a `.csv` or `.json` file (`coupling.json`) with three columns/fields: `[FileA, FileB, CoChangePercentage]`.
4. Ensure this output is saved to `./.architecture/coupling.json`.
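To make the target output shape concrete, here is a minimal sketch that derives `[FileA, FileB, CoChangePercentage]` rows from a list of commits, each given as the files it touched. The percentage definition used here (commits touching both files divided by commits touching either) is an assumption for illustration; Hercules and code-maat each apply their own definitions.

```typescript
interface Coupling {
  fileA: string;
  fileB: string;
  coChangePercentage: number; // 0 to 100, rounded
}

// Each commit is the list of file paths it changed.
function coChange(commits: readonly (readonly string[])[]): Coupling[] {
  const fileCount = new Map<string, number>();
  const pairCount = new Map<string, number>();

  for (const commit of commits) {
    const files = Array.from(new Set(commit)).sort();
    for (const f of files) fileCount.set(f, (fileCount.get(f) ?? 0) + 1);
    for (let i = 0; i < files.length; i++) {
      for (let j = i + 1; j < files.length; j++) {
        const key = JSON.stringify([files[i], files[j]]);
        pairCount.set(key, (pairCount.get(key) ?? 0) + 1);
      }
    }
  }

  const result: Coupling[] = [];
  pairCount.forEach((shared, key) => {
    const [fileA, fileB] = JSON.parse(key) as [string, string];
    // Commits touching either file = count(A) + count(B) - shared.
    const either = (fileCount.get(fileA) ?? 0) + (fileCount.get(fileB) ?? 0) - shared;
    result.push({ fileA, fileB, coChangePercentage: Math.round((shared / either) * 100) });
  });
  return result;
}
```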
---
### Step 3: Implement Deep Static Analysis (Opengrep / Semgrep)
**Agent Instructions:**
1. Install the open-source CLI (e.g., `pip install semgrep`, or Opengrep from its official release binaries or install script).
2. Create a folder `./.architecture/rules/`.
3. Write custom YAML rules to detect code smells specific to **Effect.ts** and functional pipelines.
4. **Rule 1: Tramp Data in Pipelines.** Write a rule that looks for a variable passed into a `.pipe()` or `Effect.gen` that traverses across a recognized boundary un-mutated. Use the ellipsis operator (`...`).
- capability-based use: pass only the needed capability, not the entire db
  - see Wlaschin's presentations on capability-based code
- label input as trusted or not, so strings are no longer raw
- sanitization notes: add trusted vs untrusted string input? i.e., save the string's source?
- are the security boundaries comprehensive enough? with 12 concerns I worry that is too many for one agent, though the security pass could simply be run repeatedly. do we cover enough concerns?
- add sanitization to strings
- mention seams from the legacy code book
- make sure tests are property tests
- TDD for domain objects as well
sub-agent / orchestration documentation ideas:
- add shared orchestrator policy for scope control, restart criteria, and limits on agent freedom
- add a formal failure taxonomy for agent runs, including drift, tool misuse, silent failure, and recovery notes
- add evaluation thresholds and acceptance-check guidance for agent-produced work
- add model-routing and cost rules for choosing cheaper vs stronger models
- add escalation and recovery policies for when an agent should stop, hand off, or request a fresh context
- add guidance for refactors: preserve backwards compatibility only with an explicit reason, otherwise prefer the direct change
- add implementation guidance for future sub-agent benchmarks using real TypeScript tests as verification
- add lightweight evidence-capture guidance so agent runs can be reviewed later without heavy documentation overhead
# bounded seam identification tool:
It feels like a tool could identify seams automatically (seams are fairly obvious in this architecture, right? they are the layers in this template?). The tool could then traverse the changes to show which parts frequently need coordinated edits across seams. It could run on every PR, and once a location accumulates a high enough score, a refactor flow could start. Conversely, if pieces within a seam always change independently, maybe they should be two bounded seams?
And yes, your coordinated-change idea is strong. This repo’s explicit layers help, but seams are not just layers; they are places where behavior can vary independently. A tool could mine PRs/commits for:
- files/functions frequently changed together
- edits that cross module APIs
- params/data passed through unchanged across many callers
- repeated branching at the same boundary
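The first signal in this list (files frequently changed together) can be combined with seam membership to surface cross-seam hotspots. In the sketch below, the rule that the first directory under `src/` names the seam is a hypothetical stand-in for the real seam definition, and the input rows are assumed to carry a co-change percentage per file pair.

```typescript
interface Coupling {
  fileA: string;
  fileB: string;
  coChangePercentage: number;
}

// Hypothetical seam rule: the first directory under src/ names the seam.
function seamOf(path: string): string {
  const parts = path.split("/");
  return parts[0] === "src" && parts.length > 2 ? (parts[1] ?? "(root)") : "(root)";
}

interface Hotspot {
  seams: string; // e.g. "adapters <-> domain"
  pairs: number; // strongly coupled file pairs that cross this seam pair
}

// Counts strongly coupled file pairs whose files live in different seams.
function crossSeamHotspots(couplings: readonly Coupling[], minPercentage: number): Hotspot[] {
  const counts = new Map<string, number>();
  for (const c of couplings) {
    const a = seamOf(c.fileA);
    const b = seamOf(c.fileB);
    if (a === b || c.coChangePercentage < minPercentage) continue;
    const key = [a, b].sort().join(" <-> ");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts, ([seams, pairs]) => ({ seams, pairs }))
    .sort((x, y) => y.pairs - x.pairs);
}
```

A seam pair whose count keeps growing across PRs would be a candidate for the refactor flow; the inverse analysis (files inside one seam that never change together) would need a separate pass.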
# need a find-bounded-contexts step for each feature
need to find if there are new or existing bounded contexts. arrange repo by bounded context with shared primitives
# do we have decompose feature into several workflows/tasks skill?
define the difference between workflows and tasks
# need process for refactoring
i.e., I don't need to wait for the template to be done to use it; I can just add the improvements later
- use sub package.json to isolate bounded contexts
- auto update and refactor, renovate updates dependencies, codemod cli does the refactor
- use sourcebot or cocoindex-code to look at the semantic meaning and check if there are similar functions that can be removed, i.e., "write everything twice" (WET) for projects, "don't repeat yourself" (DRY) in bounded contexts. would need to write our own pipeline