Initial commit
@@ -0,0 +1,448 @@
|
||||
# Agent Pipeline Notes
|
||||
|
||||
## 1. Reality Check
|
||||
|
||||
- Fully automated code writing is close for bounded, low-risk, well-specified work.
|
||||
- Fully unsupervised ownership of large, evolving, high-stakes systems is not close.
|
||||
- Raw coding ability is improving faster than architectural consistency, uncertainty calibration, and trustworthy self-review.
|
||||
- In the near term, the goal is not to remove review entirely. The goal is to move review up a level.
|
||||
|
||||
## 2. Current Bottleneck
|
||||
|
||||
- The early design phases in this repository are already relatively strong.
|
||||
- The main bottleneck is assembly and the review/refactor thrash around assembly.
|
||||
- The biggest time sinks are repeated loops around common implementation issues and potential refactors, and reviewing too much low-level detail.
|
||||
- Strong design artifacts reduce the need to reconstruct intent from code, but they do not yet fully remove the need for human judgment.
|
||||
|
||||
## 3. What a Pipeline Adds Beyond Manual Skill Use
|
||||
|
||||
Right now, the human is acting as the scheduler and state machine.
|
||||
A pipeline externalizes that work so it is explicit and enforceable.
|
||||
|
||||
Useful additions that were not as necessary when doing the process manually:
|
||||
|
||||
- machine-checkable approval state
|
||||
- explicit slice definitions
|
||||
- spec-to-code traceability rules
|
||||
- human-signoff criteria by phase
|
||||
- artifact diffs between stages
|
||||
- automatic verification bundles
|
||||
- replayable evaluation runs
|
||||
- thrash/change-war detection
|
||||
- audit trail for decisions and outcomes
|
||||
|
||||
The main benefit is not “agents do more coding.”
|
||||
The main benefit is that the process becomes more stable, repeatable, and reviewable.
|
||||
|
||||
## 4. What to Pipeline First
|
||||
|
||||
Build the thinnest useful pipeline around the current process.
|
||||
Do not start with a large swarm system.
|
||||
|
||||
Best initial targets:
|
||||
|
||||
- phase state machine
|
||||
- artifact and gate manifests
|
||||
- assembly slice runner
|
||||
- verification bundle
|
||||
- thrash detection
|
||||
- small replay benchmark harness
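
As an illustration of the first target, a phase state machine with machine-checkable approval could be sketched as follows. The phase names, the `PipelineState` shape, and the `advance` function are assumptions made for this example; the notes do not fix an exact phase list.

```typescript
// Hypothetical phase names; the notes do not define an exact list.
type Phase = "requirements" | "design" | "assembly" | "review" | "done";

interface PipelineState {
  phase: Phase;
  // Phases that a human has explicitly signed off on.
  approved: ReadonlySet<Phase>;
}

// Allowed forward transition for each phase (null means terminal).
const NEXT: Record<Phase, Phase | null> = {
  requirements: "design",
  design: "assembly",
  assembly: "review",
  review: "done",
  done: null,
};

type Transition =
  | { kind: "advanced"; state: PipelineState }
  | { kind: "refused"; reason: string };

// Move to the next phase only if the current phase has been approved.
function advance(state: PipelineState): Transition {
  const next = NEXT[state.phase];
  if (next === null) {
    return { kind: "refused", reason: "pipeline is already done" };
  }
  if (!state.approved.has(state.phase)) {
    return { kind: "refused", reason: `phase "${state.phase}" is not approved` };
  }
  return { kind: "advanced", state: { phase: next, approved: state.approved } };
}
```

Because `advance` is a pure function, the same rule can be checked in unit tests, in a replay harness, or inside whichever runtime ends up executing the graph.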
|
||||
|
||||
This should be built as a headless harness first, designed for Argo-style execution rather than an interactive coding CLI.
|
||||
Use existing coding agents and projects like opencode as reference implementations for inner-loop patterns, not as the architectural foundation.
|
||||
|
||||
## 5. What Should Stay Human
|
||||
|
||||
Humans should still own the judgment-heavy transitions:
|
||||
|
||||
- requirements freeze
|
||||
- workflow or slice approval
|
||||
- architecture exceptions
|
||||
- security sign-off for risky changes
|
||||
- final sign-off for high-blast-radius changes
|
||||
- resolving ambiguity or contradictory requirements
|
||||
|
||||
The pipeline should reduce line-by-line review, not eliminate human judgment.
|
||||
|
||||
## 6. Review at a Higher Level
|
||||
|
||||
The practical goal is to move review from low-level code inspection to higher-level conformance review.
|
||||
|
||||
Instead of asking:
|
||||
- What does this code do?
|
||||
- Did the agent miss some implementation detail?
|
||||
|
||||
Try to make review focus on:
|
||||
- Does this slice match the frozen artifact?
|
||||
- Does it violate any domain invariants or trust boundaries?
|
||||
- Did it introduce any risky seams or suspicious shortcuts?
|
||||
- Does it need redesign or human review, or is it safe to merge?
|
||||
|
||||
This is realistic if artifacts become stricter and slices become smaller.
|
||||
|
||||
## 7. Preventing Subtle Logic Errors
|
||||
|
||||
The system will mostly catch errors early rather than prevent every error outright.
|
||||
That is still valuable because catching errors before merge is much cheaper than catching them later.
|
||||
|
||||
Helpful layers:
|
||||
|
||||
- frozen design artifacts
|
||||
- spec-to-code traceability
|
||||
- property and mutation tests where useful
|
||||
- risk-focused seam review
|
||||
- deterministic boundary and architecture checks
|
||||
- forcing agents to cite which invariant each change satisfies
|
||||
|
||||
The point is not perfection.
|
||||
The point is making subtle logic flaws rarer and cheaper to catch.
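
One of the layers above, forcing agents to cite which invariant each change satisfies, lends itself to a deterministic check. The sketch below assumes a hypothetical `ChangeRecord` shape and an invariant registry keyed by ID; neither is defined in these notes.

```typescript
// Hypothetical shape: one record per changed file, as reported by the agent.
interface ChangeRecord {
  file: string;
  citedInvariants: string[]; // IDs the agent claims this change satisfies
}

// Return human-readable problems; an empty array means every change
// cites at least one invariant and every cited ID is known.
function checkCitations(
  changes: ChangeRecord[],
  knownInvariants: ReadonlySet<string>,
): string[] {
  const problems: string[] = [];
  for (const change of changes) {
    if (change.citedInvariants.length === 0) {
      problems.push(`${change.file}: no invariant cited`);
      continue;
    }
    for (const id of change.citedInvariants) {
      if (!knownInvariants.has(id)) {
        problems.push(`${change.file}: unknown invariant ${id}`);
      }
    }
  }
  return problems;
}
```

This only verifies that a citation exists and refers to a real invariant. Whether the change actually satisfies it still needs tests or review.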
|
||||
|
||||
## 8. Slices and Assembly Scope
|
||||
|
||||
Assembly often does too much when it translates a whole blueprint at once.
|
||||
A better pattern is:
|
||||
|
||||
1. freeze the artifact
|
||||
2. choose one workflow slice
|
||||
3. implement one tracer-bullet path end to end
|
||||
4. run tests/checks/review
|
||||
5. expand behavior within that slice
|
||||
6. move to the next slice
|
||||
|
||||
A slice is not necessarily one PR per policy or adapter.
|
||||
A slice is one coherent bounded vertical change that may touch a workflow, several policies, and several adapters if they belong to one contract.
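
To make "coherent and bounded" checkable, a slice could carry a small manifest. The field names and the `maxChangedFiles` bound below are illustrative assumptions, not something the repository defines.

```typescript
// Hypothetical slice manifest; field names are illustrative only.
interface SliceManifest {
  id: string;
  contract: string;        // the one contract this slice implements
  workflow: string;        // the workflow the slice belongs to
  policies: string[];      // may be several, as long as they serve the contract
  adapters: string[];      // likewise
  maxChangedFiles: number; // assumed bound that keeps the slice reviewable
}

// Return reasons the slice is not a coherent, bounded unit.
// An empty array means no problem was found.
function validateSlice(slice: SliceManifest, changedFiles: string[]): string[] {
  const problems: string[] = [];
  if (slice.contract.trim() === "") {
    problems.push("slice must name exactly one contract");
  }
  if (slice.workflow.trim() === "") {
    problems.push("slice must name a workflow");
  }
  if (changedFiles.length > slice.maxChangedFiles) {
    problems.push(
      `changed ${changedFiles.length} files, limit is ${slice.maxChangedFiles}`,
    );
  }
  return problems;
}
```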
|
||||
|
||||
## 9. Cost Reality
|
||||
|
||||
A pipeline can save money only if it reduces retries, review thrash, and change wars.
|
||||
If it creates uncontrolled loops, it can absolutely increase token costs.
|
||||
|
||||
The key metric is not cost per token.
|
||||
The key metrics are cost per accepted slice and time saved per accepted slice.
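
One plausible way to compute cost per accepted slice is to charge all spend, including rejected attempts and retries, against the slices that were eventually accepted. The `SliceRun` shape and the USD unit below are assumptions for this sketch.

```typescript
// Hypothetical run record; one entry per attempt at a slice.
interface SliceRun {
  sliceId: string;
  tokenCostUsd: number;
  accepted: boolean;
}

// Total spend across all attempts divided by the number of distinct
// slices that were accepted. Returns null when nothing was accepted,
// since the ratio is undefined in that case.
function costPerAcceptedSlice(runs: SliceRun[]): number | null {
  const total = runs.reduce((sum, run) => sum + run.tokenCostUsd, 0);
  const acceptedIds = new Set(
    runs.filter((run) => run.accepted).map((run) => run.sliceId),
  );
  return acceptedIds.size === 0 ? null : total / acceptedIds.size;
}
```

For example, a rejected attempt costing 2, an accepted retry of the same slice costing 3, and a second slice accepted at 5 give a total of 10 over two accepted slices, so 5 per accepted slice. The rejected attempt is not free; it raises the figure for the slices that did land.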
|
||||
|
||||
Important ideas:
|
||||
|
||||
- use expensive reasoning at phase boundaries
|
||||
- use cheaper models only when artifacts are tight and the task is narrow
|
||||
- detect thrash early
|
||||
- benchmark replay on a small representative set, not everything
|
||||
- optimize for time saved, not just token minimization
|
||||
|
||||
If spending more on tokens saves multiple days of work, that can still be a clear win.
|
||||
|
||||
## 10. Minimum Viable Build
|
||||
|
||||
For the clarified goal, LangGraph is worth adopting in the MVP because orchestration, durable execution, and resumable state are now part of the project’s core value.
|
||||
Start with a small headless TypeScript orchestrator built for batch execution inside Argo workflows.
|
||||
Add Langfuse from the beginning so traces, spans, prompts, runs, and outcomes are observable in a way that is useful both operationally and on a resume.
|
||||
|
||||
Suggested MVP pieces:
|
||||
|
||||
- LangGraph workflow graph / state machine for bounded slices
|
||||
- markdown artifacts with small machine-readable frontmatter or JSON manifests
|
||||
- simple TypeScript parsers for artifacts and statuses
|
||||
- Langfuse tracing for each run, stage, model call, and gate decision
|
||||
- verification bundle assembled into one review packet
|
||||
- cheap external thrash-detection heuristics
|
||||
- small replay benchmark harness
|
||||
- one headless executor path suitable for Argo
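
The "simple TypeScript parsers" item can stay very small if the frontmatter is restricted to flat `key: value` lines between `---` fences. That restriction is an assumption of this sketch; it is not a YAML parser, and the `Artifact` shape is hypothetical.

```typescript
// Hypothetical parsed artifact: flat string metadata plus the markdown body.
interface Artifact {
  meta: Record<string, string>;
  body: string;
}

// Parse a markdown artifact whose optional frontmatter is a block of
// "key: value" lines delimited by "---" lines at the top of the file.
function parseArtifact(text: string): Artifact {
  const lines = text.split(/\r?\n/);
  if (lines[0] !== "---") {
    return { meta: {}, body: text }; // no frontmatter present
  }
  const end = lines.indexOf("---", 1);
  if (end === -1) {
    throw new Error("unterminated frontmatter");
  }
  const meta: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    if (line.trim() === "") continue;
    const colon = line.indexOf(":");
    if (colon === -1) {
      throw new Error(`bad frontmatter line: ${line}`);
    }
    meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: lines.slice(end + 1).join("\n") };
}
```

Failing loudly on malformed frontmatter is deliberate: a gate that silently reads a missing `status` as "not frozen" or "frozen" would undermine machine-checkable approval.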
|
||||
|
||||
This is enough to improve the real workflow and create a strong portfolio project while staying aligned with the desired end-state architecture.
|
||||
|
||||
## 11. Portable Process Core, Disposable Runner
|
||||
|
||||
The right design is a portable process core with a replaceable runner.
|
||||
|
||||
- process rules, states, gates, policies, artifacts, and evaluation logic should be portable
|
||||
- the runtime executor should be disposable or replaceable at the boundary level
|
||||
|
||||
Even if LangGraph is used in the MVP, the value should still live primarily in the process definition, policies, artifacts, event schema, and evaluation logic rather than in LangGraph-specific node wiring.
|
||||
The graph runtime should be treated as an adapter for execution, checkpointing, and resumability, not as the place where domain process knowledge gets trapped.
|
||||
|
||||
## 12. Workflow / Service Boundary for the Pipeline Itself
|
||||
|
||||
Follow the repository’s existing separation when building the pipeline.
|
||||
|
||||
Pure/domain-like pieces:
|
||||
- pipeline state
|
||||
- gate rules
|
||||
- slice metadata
|
||||
- approval rules
|
||||
- conformance decisions
|
||||
|
||||
Workflow/service pieces:
|
||||
- run coding agents
|
||||
- invoke models
|
||||
- read and write artifacts
|
||||
- launch review jobs
|
||||
- open PRs
|
||||
- store logs and run records
|
||||
|
||||
Example:
|
||||
- “Can assembly start?” is a pure policy decision.
|
||||
- “Launch the assembly agent with model X” is a service call inside a workflow.
|
||||
|
||||
That makes the process logic portable and keeps LangGraph as a replaceable adapter instead of a hard dependency.
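
The example above can be sketched as a pure function plus a thin service wrapper. The `GateInput` fields and the `AgentLauncher` interface are hypothetical names introduced for illustration; the real gate rules would come from the repository's own artifacts.

```typescript
// Pure side: the decision depends only on its inputs and performs no I/O.
interface GateInput {
  artifactFrozen: boolean;
  sliceApproved: boolean;
  openBlockers: number;
}

type GateDecision =
  | { allowed: "yes" }
  | { allowed: "no"; reasons: string[] };

function canAssemblyStart(input: GateInput): GateDecision {
  const reasons: string[] = [];
  if (!input.artifactFrozen) reasons.push("artifact is not frozen");
  if (!input.sliceApproved) reasons.push("slice is not approved");
  if (input.openBlockers > 0) reasons.push(`${input.openBlockers} open blocker(s)`);
  return reasons.length === 0 ? { allowed: "yes" } : { allowed: "no", reasons };
}

// Service side: launching an agent sits behind an interface, so the
// runtime (LangGraph node, plain script, test fake) can be swapped.
interface AgentLauncher {
  launch(sliceId: string, model: string): Promise<string>; // resolves to a run ID
}

// Workflow step: ask the pure policy first, then call the service.
// Resolves to null when the gate refuses.
async function startAssembly(
  input: GateInput,
  sliceId: string,
  model: string,
  launcher: AgentLauncher,
): Promise<string | null> {
  const decision = canAssemblyStart(input);
  if (decision.allowed === "no") return null;
  return launcher.launch(sliceId, model);
}
```

Only `startAssembly` would need to change if the runtime were replaced; `canAssemblyStart` carries the process knowledge and has no runtime dependency.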
|
||||
|
||||
## 13. LangGraph: When It Helps
|
||||
|
||||
LangGraph is useful here because orchestration is now part of the main project rather than a future optimization.
|
||||
Its real benefits are:
|
||||
|
||||
- durable execution
|
||||
- resumability
|
||||
- retries
|
||||
- branching workflows
|
||||
- checkpointed state
|
||||
- human-in-the-loop pauses
|
||||
- better operational management for complex graphs
|
||||
- easier alignment with headless batch execution in Argo-style systems
|
||||
|
||||
For this direction, it is reasonable to use LangGraph in the MVP.
|
||||
The main caution is to avoid letting graph node wiring become the only place where business process rules live.
|
||||
Keep process semantics portable even if LangGraph is the first runtime.
|
||||
|
||||
## 14. Training Data Value
|
||||
|
||||
A pipeline can produce much better training data than raw chat logs.
|
||||
The valuable data is not just “the model wrote code.”
|
||||
The valuable data is:
|
||||
|
||||
- given this phase and artifact state
|
||||
- with these checks failing or passing
|
||||
- what action/tool/model choice was correct next
|
||||
- what human correction was needed
|
||||
- what result was ultimately accepted
|
||||
|
||||
This is useful for future model training, replay evaluation, and process improvement.
|
||||
|
||||
## 15. What to Log for Training and Replay
|
||||
|
||||
Do not assume the orchestration engine will automatically create a clean training corpus.
|
||||
Design explicit logging for the data you care about.
|
||||
|
||||
Minimum useful fields:
|
||||
|
||||
- run ID
|
||||
- stage ID
|
||||
- slice ID
|
||||
- artifact versions and hashes
|
||||
- prompt or template version
|
||||
- model choice
|
||||
- full input context given to the model for that step
|
||||
- tool choice and tool arguments
|
||||
- tool outputs or summaries
|
||||
- verification results
|
||||
- gate decision
|
||||
- human corrections or overrides
|
||||
- final disposition (accepted, rejected, redesign, escalated)
|
||||
|
||||
If prompt design or routing rules materially affect cost or quality, log those too.
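
The minimum fields above map naturally onto one typed record per step, written as one JSON object per line. The exact field names and the JSON Lines choice are assumptions of this sketch rather than a schema the notes prescribe.

```typescript
type Disposition = "accepted" | "rejected" | "redesign" | "escalated";

// Hypothetical per-step event record covering the minimum fields listed above.
interface StepEvent {
  runId: string;
  stageId: string;
  sliceId: string;
  artifactHashes: Record<string, string>; // artifact path -> content hash
  promptVersion: string;
  model: string;
  inputContext: string; // full context given to the model for this step
  toolCalls: { tool: string; args: unknown; outputSummary: string }[];
  verification: { check: string; passed: boolean }[];
  gateDecision: string;
  humanOverrides: string[];
  disposition: Disposition | null; // null until the slice is resolved
}

// Names of required string fields that are empty or whitespace.
function missingFields(event: StepEvent): string[] {
  const required: [string, string][] = [
    ["runId", event.runId],
    ["stageId", event.stageId],
    ["sliceId", event.sliceId],
    ["promptVersion", event.promptVersion],
    ["model", event.model],
  ];
  return required
    .filter(([, value]) => value.trim() === "")
    .map(([name]) => name);
}

// Serialize as a single line so events can be appended to a JSONL log.
function toJsonLine(event: StepEvent): string {
  return JSON.stringify(event);
}
```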
|
||||
|
||||
## 16. LangGraph vs Training Logs
|
||||
|
||||
LangGraph can persist execution state and checkpoints, and can be useful for debugging.
|
||||
But that is different from having a clean, normalized dataset for:
|
||||
|
||||
- training
|
||||
- replay
|
||||
- evaluation
|
||||
- analytics
|
||||
- cost analysis
|
||||
|
||||
So the orchestration system and the training/event log should be treated as separate concerns.
|
||||
Use orchestration state for execution and recovery.
|
||||
Use explicit event logs for learning and analysis.
|
||||
|
||||
## 17. Hooking Into the Executor Layer
|
||||
|
||||
Start with a headless executor boundary, not with deep integration into an interactive coding CLI.
|
||||
The first goal is not to instrument every internal token.
|
||||
The first goal is to supervise bounded work reliably inside batch execution.
|
||||
|
||||
What the executor boundary should do first:
|
||||
|
||||
- launch a bounded task with a clear artifact bundle
|
||||
- capture the returned summary, diff, and verification results
|
||||
- classify the outcome as continue / needs-human / blocked / failed
|
||||
- stop after one coherent slice or checkpoint
|
||||
- emit structured events and traces to Langfuse
|
||||
|
||||
If an existing coding agent can operate in a non-chatty batch mode for bounded tasks, it can sit behind this executor boundary.
|
||||
If it cannot, keep it for human-guided stages and use a lower-level headless executor for automated assembly later.
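
The classification step at the executor boundary could start as a handful of explicit rules. The `ExecutorResult` fields and the rule ordering below are assumptions chosen for illustration; real rules would be tuned against actual runs.

```typescript
type Outcome = "continue" | "needs-human" | "blocked" | "failed";

// Hypothetical shape of what a bounded executor run returns.
interface ExecutorResult {
  summary: string;
  diff: string;
  checksPassed: number;
  checksFailed: number;
  askedForHelp: boolean;
  missingInputs: string[]; // artifacts the task needed but did not receive
}

// Map a run result to one of the four outcomes. Order matters:
// missing inputs and explicit help requests take priority over check results.
function classify(result: ExecutorResult): Outcome {
  if (result.missingInputs.length > 0) return "blocked";
  if (result.askedForHelp) return "needs-human";
  if (result.checksFailed > 0) return "failed";
  if (result.diff.trim() === "") return "failed"; // nothing was changed
  return "continue";
}
```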
|
||||
|
||||
## 18. What to Capture From the Executor
|
||||
|
||||
If possible, capture at least:
|
||||
|
||||
- exact task input sent
|
||||
- artifact bundle provided
|
||||
- model used
|
||||
- tool calls made
|
||||
- tool arguments
|
||||
- result summary
|
||||
- changed files
|
||||
- verification output
|
||||
- whether the run completed, asked for help, or thrashed
|
||||
- trace IDs, run IDs, span metadata, and checkpoint IDs
|
||||
|
||||
For training-grade data, exact per-step context is better than only a final summary.
|
||||
LangGraph persistence and Langfuse traces are helpful, but they are still not the same thing as a clean replay/eval dataset.
|
||||
Explicit event capture is still required.
|
||||
|
||||
## 19. Opencode-Specific Decision Point
|
||||
|
||||
Opencode is now mainly a reference implementation for inner-loop agent behavior rather than the likely runtime foundation.
|
||||
|
||||
Questions to answer before borrowing ideas from it or integrating with it:
|
||||
|
||||
- Which parts are genuinely reusable in a headless Argo-oriented system?
|
||||
- Can its agent loop run cleanly in bounded, low-chatter automation?
|
||||
- Can you capture the effective prompt/context used?
|
||||
- Can you capture tool-use data well enough for replay and training?
|
||||
- Does borrowing from it reduce delivery time more than it increases architectural drag?
|
||||
|
||||
If most of the answers are yes, copy patterns or adapt isolated pieces.
|
||||
If not, keep opencode as a reference and build a smaller dedicated executor for automated assembly and evaluation.
|
||||
|
||||
## 20. Portfolio Value
|
||||
|
||||
This can be a strong AgentOps portfolio project if it is real and measured.
|
||||
The impressive part is not size.
|
||||
The impressive part is demonstrating:
|
||||
|
||||
- staged orchestration
|
||||
- model routing
|
||||
- evaluation and replay
|
||||
- cost and quality tradeoffs
|
||||
- review reduction
|
||||
- guardrails and risk controls
|
||||
- operational judgment about where humans stay in the loop
|
||||
|
||||
A compact, sharp system with metrics is better than a giant “swarm” that is hard to explain.
|
||||
|
||||
## 21. Practical Next Step
|
||||
|
||||
Build a thin layer around the current repository process and use it on real work.
|
||||
Do not pause everything to build a big system.
|
||||
|
||||
A good first direction:
|
||||
|
||||
1. define machine-checkable phase and approval states
|
||||
2. define slice metadata and human sign-off criteria
|
||||
3. implement a small LangGraph state graph for one bounded slice
|
||||
4. wrap verification into one bundle
|
||||
5. add basic thrash detection and Langfuse tracing
|
||||
6. run it on one live workflow inside a headless batch path
|
||||
7. measure review time, retries, accepted-slice cost, and trace quality
|
||||
8. add explicit event logging for replay and evals
|
||||
9. decide whether any existing coding agent is good enough behind the executor boundary
|
||||
10. iterate based on real pain
|
||||
|
||||
## 22. Six-Step Build Order for the First Effect-Template Use Case
|
||||
|
||||
For the first use case, the target is developing software inside this repository structure rather than supporting a fully general coding environment from day one.
|
||||
The build order should stay tightly coupled to the minimum tool surface needed for one bounded workflow slice.
|
||||
|
||||
### Step 1: Prove the End-to-End Slice Loop
|
||||
Goal:
|
||||
- get one bounded software-development slice running end to end in the effect-template style
|
||||
|
||||
Tools to add:
|
||||
- read_file
|
||||
- list_files
|
||||
- write_file
|
||||
- edit_file
|
||||
- run_tests
|
||||
|
||||
Why this first:
|
||||
- proves the basic harness can take a task, operate on repository artifacts, make code changes, run verification, and return a classified result
|
||||
- keeps the tool surface small enough to debug failures clearly
|
||||
|
||||
### Step 2: Add First-Class Observability
|
||||
Goal:
|
||||
- make each run inspectable so failures and bottlenecks are visible immediately
|
||||
|
||||
Tools to add:
|
||||
- trace_run or equivalent Langfuse instrumentation hooks
|
||||
|
||||
Why now:
|
||||
- once the first loop works, observability becomes the fastest way to learn from real runs
|
||||
- tracing should come before adding much more autonomy so failures do not become opaque
|
||||
|
||||
### Step 3: Add Replay and Evaluation Foundations
|
||||
Goal:
|
||||
- make runs reproducible and measurable rather than anecdotal
|
||||
|
||||
Tools to add:
|
||||
- save_replay_record
|
||||
- replay_run
|
||||
|
||||
Why now:
|
||||
- replayability is one of the core differentiators of the harness
|
||||
- this creates a basis for later evals, cost analysis, and regression checks
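
A minimal shape for `save_replay_record` and `replay_run` is sketched below. It assumes the replayed step is deterministic (for example, with the model call stubbed or served from recorded responses) and that inputs and outputs are JSON-serializable; the function signatures are hypothetical.

```typescript
// Hypothetical replay record: the exact input and the output that was produced.
interface ReplayRecord<I, O> {
  runId: string;
  input: I;
  output: O;
}

// Store a snapshot of the input and output. The JSON round trip makes a
// deep copy, so later mutation of the live objects cannot alter the record.
function saveReplayRecord<I, O>(
  store: ReplayRecord<I, O>[],
  runId: string,
  input: I,
  output: O,
): void {
  const snapshot = JSON.parse(JSON.stringify({ runId, input, output }));
  store.push(snapshot as ReplayRecord<I, O>);
}

// Re-run a step on the recorded input and report whether the output matches.
// Comparison is by JSON text, so it is sensitive to object key order.
function replayRun<I, O>(
  record: ReplayRecord<I, O>,
  step: (input: I) => O,
): { matches: boolean; actual: O } {
  const actual = step(record.input);
  const matches = JSON.stringify(actual) === JSON.stringify(record.output);
  return { matches, actual };
}
```

A mismatch on replay is then a regression signal: either the step logic changed or a source of nondeterminism was not captured in the record.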
|
||||
|
||||
### Step 4: Add Gates and Thrash Control
|
||||
Goal:
|
||||
- keep the system from looping uselessly or pushing low-quality output forward
|
||||
|
||||
Tools to add:
|
||||
- gate_evaluator
|
||||
- thrash_detector
|
||||
|
||||
Why now:
|
||||
- after replay exists, it becomes easier to define and tune failure heuristics
|
||||
- this is where the harness starts protecting time and token spend rather than only executing work
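
A cheap first heuristic for `thrash_detector` is to fingerprint the working-tree state after each attempt and flag a run when an earlier state comes back after a different one, or when the attempt count exceeds a budget. Both rules and the default budget of 5 are assumptions for this sketch, not tuned values.

```typescript
interface ThrashReport {
  thrashing: boolean;
  reason: string | null;
}

// fingerprints: one opaque string per attempt, for example a hash of the diff.
// maxAttempts: hypothetical budget; the default of 5 is arbitrary.
function detectThrash(fingerprints: string[], maxAttempts: number = 5): ThrashReport {
  if (fingerprints.length > maxAttempts) {
    return { thrashing: true, reason: `more than ${maxAttempts} attempts` };
  }
  const seen = new Set<string>();
  let previous: string | null = null;
  let attempt = 0;
  for (const fingerprint of fingerprints) {
    attempt += 1;
    // A state that returns after a different state (A, B, A) suggests a
    // change war. An immediate repeat (A, A) is not flagged by this rule.
    if (seen.has(fingerprint) && previous !== fingerprint) {
      return {
        thrashing: true,
        reason: `state ${fingerprint} revisited at attempt ${attempt}`,
      };
    }
    seen.add(fingerprint);
    previous = fingerprint;
  }
  return { thrashing: false, reason: null };
}
```

Heuristics like this are easier to tune once replay records exist, because candidate rules can be run against past attempts before they are allowed to stop live work.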
|
||||
|
||||
### Step 5: Improve Execution Safety
|
||||
Goal:
|
||||
- make automated runs safer and more diagnosable in headless environments
|
||||
|
||||
Tools to add:
|
||||
- safer shell or sandbox executor
|
||||
- git_diff
|
||||
|
||||
Why now:
|
||||
- once the harness can already complete bounded slices, safety and change inspection become more valuable than adding more raw capability
|
||||
- diff visibility is especially useful for higher-level review
|
||||
|
||||
### Step 6: Add Richer Orchestration
|
||||
Goal:
|
||||
- move from one bounded slice to more capable autonomous workflow behavior
|
||||
|
||||
Tools to add:
|
||||
- planning or decomposition tool
|
||||
- multi-agent delegation or subtask dispatch
|
||||
|
||||
Why last:
|
||||
- richer orchestration compounds complexity quickly
|
||||
- it should be added only after the basic loop, observability, replay, and control mechanisms already work
|
||||
|
||||
## 23. Open Decisions Still Worth Discussing
|
||||
|
||||
Before implementing too much, it would be useful to decide:
|
||||
|
||||
- what exact human approvals are required at each phase
|
||||
- how small slices should be in practice
|
||||
- what counts as thrash vs healthy iteration
|
||||
- what minimum event schema is worth storing now
|
||||
- whether to store full prompts or prompt templates plus resolved context
|
||||
- whether to store full tool outputs or normalized summaries plus raw attachments
|
||||
- how much of the coding CLI can be wrapped without losing its advantages
|
||||
- whether automated assembly should use the same executor as human-guided work
|
||||
- what metrics will prove the pipeline is saving time rather than adding ceremony
|
||||
- what data retention and privacy rules apply if you later train on this data
|
||||
|
||||
## 24. Bottom Line
|
||||
|
||||
The opportunity is real.
|
||||
The trust barrier is still the main constraint.
|
||||
|
||||
So the right move is:
|
||||
- do not wait for models to become magically trustworthy
|
||||
- do not overbuild a giant pipeline first
|
||||
- formalize the process you already use
|
||||
- move review up a level
|
||||
- keep human judgment at the risky seams
|
||||
- log the process in a training- and replay-friendly way
|
||||
- measure whether the pipeline actually saves time
|
||||