2026-05-25 05:47:28 +00:00

Agent Pipeline Notes

1. Reality Check

  • Fully automated code writing is close for bounded, low-risk, well-specified work.
  • Fully unsupervised ownership of large, evolving, high-stakes systems is not close.
  • Raw coding ability is improving faster than architectural consistency, uncertainty calibration, and trustworthy self-review.
  • In the near term, the goal is not to remove review entirely. The goal is to move review up a level.

2. Current Bottleneck

  • The early design phases in this repository are already relatively strong.
  • The main bottleneck is assembly and the review/refactor thrash around assembly.
  • The biggest time sink is repeated loops over common implementation issues and potential refactors, plus review of too much low-level detail.
  • Strong design artifacts reduce the need to reconstruct intent from code, but they do not yet fully remove the need for human judgment.

3. What a Pipeline Adds Beyond Manual Skill Use

Right now, the human is acting as the scheduler and state machine. A pipeline externalizes that work so it is explicit and enforceable.

Useful additions that were not as necessary when doing the process manually:

  • machine-checkable approval state
  • explicit slice definitions
  • spec-to-code traceability rules
  • human-signoff criteria by phase
  • artifact diffs between stages
  • automatic verification bundles
  • replayable evaluation runs
  • thrash/change-war detection
  • audit trail for decisions and outcomes

The main benefit is not “agents do more coding.” The main benefit is that the process becomes more stable, repeatable, and reviewable.

4. What to Pipeline First

Build the thinnest useful pipeline around the current process. Do not start with a large swarm system.

Best initial targets:

  • phase state machine
  • artifact and gate manifests
  • assembly slice runner
  • verification bundle
  • thrash detection
  • small replay benchmark harness

This should be built as a headless harness first, designed for Argo-style execution rather than an interactive coding CLI. Use existing coding agents and projects like opencode as reference implementations for inner-loop patterns, not as the architectural foundation.
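
The phase state machine in the list above can start as a pure transition table, independent of any runtime. The following is a minimal sketch under stated assumptions: the phase names, the `approved` list, and the `advance` function are hypothetical and stand in for whatever phases the repository process actually uses.

```typescript
// Hypothetical phase names; substitute the repository's real phases.
type Phase = "requirements" | "design" | "assembly" | "verification" | "done";

interface PipelineState {
  phase: Phase;
  // Machine-checkable approval state: phases with a recorded human sign-off.
  approved: readonly Phase[];
}

// Forward transitions kept as data so they can be diffed and audited.
const NEXT_PHASE: Record<Phase, Phase | null> = {
  requirements: "design",
  design: "assembly",
  assembly: "verification",
  verification: "done",
  done: null,
};

interface AdvanceResult {
  state: PipelineState; // unchanged when the transition is blocked
  blockedReason: string | null;
}

// Pure function: moves forward only when the current phase has an approval.
function advance(state: PipelineState): AdvanceResult {
  const next = NEXT_PHASE[state.phase];
  if (next === null) {
    return { state, blockedReason: "pipeline is already done" };
  }
  if (state.approved.indexOf(state.phase) === -1) {
    return { state, blockedReason: `phase "${state.phase}" has no recorded approval` };
  }
  return { state: { phase: next, approved: state.approved }, blockedReason: null };
}
```

Because `advance` performs no I/O, the same function could be called from a graph node, a CLI, or a test without modification.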

5. What Should Stay Human

Humans should still own the judgment-heavy transitions:

  • requirements freeze
  • workflow or slice approval
  • architecture exceptions
  • security sign-off for risky changes
  • final sign-off for high-blast-radius changes
  • resolving ambiguity or contradictory requirements

The pipeline should reduce line-by-line review, not eliminate human judgment.

6. Review at a Higher Level

The practical goal is to move review from low-level code inspection to higher-level conformance review.

Instead of asking:

  • What does this code do?
  • Did the agent miss some implementation detail?

Try to make review focus on:

  • Does this slice match the frozen artifact?
  • Does it violate any domain invariants or trust boundaries?
  • Did it introduce any risky seams or suspicious shortcuts?
  • Does it need redesign or human review, or is it safe to merge?

This is realistic if artifacts become stricter and slices become smaller.

7. Preventing Subtle Logic Errors

The system will mostly catch errors early rather than prevent every error outright. That is still valuable because catching errors before merge is much cheaper than catching them later.

Helpful layers:

  • frozen design artifacts
  • spec-to-code traceability
  • property and mutation tests where useful
  • risk-focused seam review
  • deterministic boundary and architecture checks
  • forcing agents to cite which invariant each change satisfies

The point is not perfection. The point is making subtle logic flaws rarer and cheaper to catch.
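
As one illustration of the last layer, a deterministic check can reject any change that cites no invariant or cites one that does not exist. This is a sketch only: `ChangeRecord`, `checkCitations`, and the invariant ID format are assumptions, not an existing schema in this repository.

```typescript
// Hypothetical shape of one change reported by an agent.
interface ChangeRecord {
  file: string;
  invariantIds: string[]; // invariants the agent claims this change satisfies
}

interface CitationProblem {
  file: string;
  problem: string;
}

// Returns one problem per missing or unknown citation; an empty array means
// every change cites at least one known invariant.
function checkCitations(
  changes: ChangeRecord[],
  knownInvariants: string[],
): CitationProblem[] {
  const problems: CitationProblem[] = [];
  for (const change of changes) {
    if (change.invariantIds.length === 0) {
      problems.push({ file: change.file, problem: "no invariant cited" });
      continue;
    }
    for (const id of change.invariantIds) {
      if (knownInvariants.indexOf(id) === -1) {
        problems.push({ file: change.file, problem: `unknown invariant "${id}"` });
      }
    }
  }
  return problems;
}
```

A check like this only proves that a citation exists and resolves. Whether the change really satisfies the cited invariant remains a question for tests and seam review.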

8. Slices and Assembly Scope

Assembly often does too much when it translates a whole blueprint at once. A better pattern is:

  1. freeze the artifact
  2. choose one workflow slice
  3. implement one tracer-bullet path end to end
  4. run tests/checks/review
  5. expand behavior within that slice
  6. move to the next slice

A slice is not necessarily one PR per policy or adapter. A slice is one coherent bounded vertical change that may touch a workflow, several policies, and several adapters if they belong to one contract.
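
The "one contract" rule can be made machine-checkable with a small slice definition. The sketch below assumes each component is tagged with the contract it belongs to; `SliceDefinition`, `Component`, and `sliceProblems` are hypothetical names used for illustration.

```typescript
type ComponentKind = "workflow" | "policy" | "adapter";

interface Component {
  name: string;
  kind: ComponentKind;
  contractId: string; // the contract this component belongs to
}

interface SliceDefinition {
  sliceId: string;
  contractId: string; // the single contract this slice implements
  components: Component[];
}

// A slice may touch many components, but all must belong to its one contract.
function sliceProblems(slice: SliceDefinition): string[] {
  const problems: string[] = [];
  if (slice.components.length === 0) {
    problems.push("slice touches no components");
  }
  for (const c of slice.components) {
    if (c.contractId !== slice.contractId) {
      problems.push(
        `${c.kind} "${c.name}" belongs to contract "${c.contractId}", not "${slice.contractId}"`,
      );
    }
  }
  return problems;
}
```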

9. Cost Reality

A pipeline can save money only if it reduces retries, review thrash, and change wars. If it creates uncontrolled loops, it can absolutely increase token costs.

The key metric is not cost per token. The key metric is cost per accepted slice and time saved per accepted slice.

Important ideas:

  • use expensive reasoning at phase boundaries
  • use cheaper models only when artifacts are tight and the task is narrow
  • detect thrash early
  • benchmark replay on a small representative set, not everything
  • optimize for time saved, not just token minimization

If spending more on tokens saves multiple days of work, that can still be a clear win.
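
The key metric can be computed directly from run records. In this sketch the spend and review time of every attempt, including rejected retries, is charged against the slices that were eventually accepted. The `SliceRun` fields and the use of USD are assumptions for illustration.

```typescript
interface SliceRun {
  sliceId: string;
  tokenCostUsd: number; // model spend for this attempt
  reviewMinutes: number; // human review time for this attempt
  accepted: boolean;
}

interface SliceCostSummary {
  acceptedSlices: number;
  costPerAcceptedSlice: number | null; // null when nothing was accepted
  reviewMinutesPerAcceptedSlice: number | null;
}

function summarizeRuns(runs: SliceRun[]): SliceCostSummary {
  const acceptedIds: string[] = [];
  let totalCost = 0;
  let totalReview = 0;
  for (const run of runs) {
    // Every attempt counts, including rejected retries: that is the cost of thrash.
    totalCost += run.tokenCostUsd;
    totalReview += run.reviewMinutes;
    if (run.accepted && acceptedIds.indexOf(run.sliceId) === -1) {
      acceptedIds.push(run.sliceId);
    }
  }
  const n = acceptedIds.length;
  return {
    acceptedSlices: n,
    costPerAcceptedSlice: n === 0 ? null : totalCost / n,
    reviewMinutesPerAcceptedSlice: n === 0 ? null : totalReview / n,
  };
}
```

For example, four attempts costing 2, 3, 1, and 4 USD that yield two accepted slices give a cost of 5 USD per accepted slice, even though the cheapest single attempt cost 1 USD.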

10. Minimum Viable Build

For the clarified goal, LangGraph is worth adopting in the MVP because orchestration, durable execution, and resumable state are now part of the project's core value. Start with a small headless TypeScript orchestrator built for batch execution inside Argo workflows. Add Langfuse from the beginning so traces, spans, prompts, runs, and outcomes are observable in a way that is useful both operationally and on a resume.

Suggested MVP pieces:

  • LangGraph workflow graph / state machine for bounded slices
  • markdown artifacts with small machine-readable frontmatter or JSON manifests
  • simple TypeScript parsers for artifacts and statuses
  • Langfuse tracing for each run, stage, model call, and gate decision
  • verification bundle assembled into one review packet
  • cheap external thrash-detection heuristics
  • small replay benchmark harness
  • one headless executor path suitable for Argo

This is enough to improve the real workflow and create a strong portfolio project while staying aligned with the desired end-state architecture.
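
A "simple TypeScript parser" for artifact frontmatter might look like the sketch below. It assumes a deliberately restricted format: a leading `---` line, flat `key: value` pairs, and a closing `---` line. It is not a YAML parser, and the `status` and `slice` keys in the usage note are hypothetical.

```typescript
interface ParsedArtifact {
  meta: Record<string, string>;
  body: string;
}

// Parses flat "key: value" frontmatter delimited by "---" lines.
// Nested values, lists, and quoting are intentionally unsupported.
function parseArtifact(source: string): ParsedArtifact {
  const lines = source.split(/\r?\n/);
  if (lines[0] !== "---") {
    return { meta: {}, body: source }; // no frontmatter present
  }
  const end = lines.indexOf("---", 1);
  if (end === -1) {
    throw new Error("frontmatter is not closed with ---");
  }
  const meta: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    if (line.trim() === "") continue;
    const colon = line.indexOf(":");
    if (colon === -1) {
      throw new Error(`invalid frontmatter line: ${line}`);
    }
    meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { meta, body: lines.slice(end + 1).join("\n") };
}
```

Given an artifact that begins with `status: frozen` and `slice: checkout-01` in its frontmatter, a gate could read `meta["status"]` without touching the prose body.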

11. Portable Process Core, Disposable Runner

The right design is a portable process core with a replaceable runner.

  • process rules, states, gates, policies, artifacts, and evaluation logic should be portable
  • the runtime executor should be disposable or replaceable at the boundary level

Even if LangGraph is used in the MVP, the value should still live primarily in the process definition, policies, artifacts, event schema, and evaluation logic rather than in LangGraph-specific node wiring. The graph runtime should be treated as an adapter for execution, checkpointing, and resumability, not as the place where domain process knowledge gets trapped.

12. Workflow / Service Boundary for the Pipeline Itself

Follow the repository's existing separation when building the pipeline.

Pure/domain-like pieces:

  • pipeline state
  • gate rules
  • slice metadata
  • approval rules
  • conformance decisions

Workflow/service pieces:

  • run coding agents
  • invoke models
  • read and write artifacts
  • launch review jobs
  • open PRs
  • store logs and run records

Example:

  • “Can assembly start?” is a pure policy decision.
  • “Launch the assembly agent with model X” is a service call inside a workflow.

That makes the process logic portable and keeps LangGraph as a replaceable adapter instead of a hard dependency.
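
The example above can be sketched as a pure policy function plus a service interface that the workflow receives from outside. The field names, `AgentLauncher`, and `startAssembly` are hypothetical, and the launcher is synchronous only to keep the sketch short; a real one would likely be asynchronous.

```typescript
// Pure/domain side: plain data in, decision out. No I/O, no runtime imports.
interface AssemblyGateInput {
  artifactFrozen: boolean;
  sliceApproved: boolean;
  openArchitectureExceptions: number;
}

interface GateDecision {
  allowed: boolean;
  reasons: string[]; // why the gate is closed; empty when allowed
}

function canAssemblyStart(input: AssemblyGateInput): GateDecision {
  const reasons: string[] = [];
  if (!input.artifactFrozen) reasons.push("artifact is not frozen");
  if (!input.sliceApproved) reasons.push("slice is not approved");
  if (input.openArchitectureExceptions > 0) {
    reasons.push("architecture exceptions are unresolved");
  }
  return { allowed: reasons.length === 0, reasons };
}

// Workflow/service side: the side effect sits behind an interface.
interface AgentLauncher {
  launchAssembly(sliceId: string, model: string): string; // returns a run ID
}

// The workflow asks the pure policy first and only then calls the service.
function startAssembly(
  input: AssemblyGateInput,
  sliceId: string,
  model: string,
  launcher: AgentLauncher,
): { runId: string | null; decision: GateDecision } {
  const decision = canAssemblyStart(input);
  if (!decision.allowed) return { runId: null, decision };
  return { runId: launcher.launchAssembly(sliceId, model), decision };
}
```

Swapping the runtime then means supplying a different `AgentLauncher`; `canAssemblyStart` does not change.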

13. LangGraph: When It Helps

LangGraph is useful here because orchestration is now part of the main project rather than a future optimization. Its real benefits are:

  • durable execution
  • resumability
  • retries
  • branching workflows
  • checkpointed state
  • human-in-the-loop pauses
  • better operational management for complex graphs
  • easier alignment with headless batch execution in Argo-style systems

For this direction, it is reasonable to use LangGraph in the MVP. The main caution is to avoid letting graph node wiring become the only place where business process rules live. Keep process semantics portable even if LangGraph is the first runtime.

14. Training Data Value

A pipeline can produce much better training data than raw chat logs. The valuable data is not just “the model wrote code.” The valuable data is:

  • given this phase and artifact state
  • with these checks failing or passing
  • what action/tool/model choice was correct next
  • what human correction was needed
  • what result was ultimately accepted

This is useful for future model training, replay evaluation, and process improvement.

15. What to Log for Training and Replay

Do not assume the orchestration engine will automatically create a clean training corpus. Design explicit logging for the data you care about.

Minimum useful fields:

  • run ID
  • stage ID
  • slice ID
  • artifact versions and hashes
  • prompt or template version
  • model choice
  • full input context given to the model for that step
  • tool choice and tool arguments
  • tool outputs or summaries
  • verification results
  • gate decision
  • human corrections or overrides
  • final disposition (accepted, rejected, redesign, escalated)

If prompt design or routing rules materially affect cost or quality, log those too.
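
The minimum fields above map naturally onto one typed event per step, appended to a JSON Lines log that is kept separate from orchestration state. This is a sketch: the exact field names and the choice of JSON Lines are assumptions.

```typescript
type Disposition = "accepted" | "rejected" | "redesign" | "escalated";

interface StepEvent {
  runId: string;
  stageId: string;
  sliceId: string;
  artifactHashes: Record<string, string>; // artifact path -> content hash
  promptVersion: string;
  model: string;
  inputContext: string; // full context given to the model for this step
  toolName: string | null;
  toolArgs: unknown;
  toolOutputSummary: string | null;
  verification: { check: string; passed: boolean }[];
  gateDecision: string;
  humanOverride: string | null;
  disposition: Disposition | null; // null until the run reaches a final outcome
}

// Appends one event as a single line; JSON.stringify escapes embedded newlines.
function appendEvent(log: string, event: StepEvent): string {
  return log + JSON.stringify(event) + "\n";
}

// Reads the log back into typed events, skipping blank lines.
function readEvents(log: string): StepEvent[] {
  return log
    .split("\n")
    .filter((line) => line !== "")
    .map((line) => JSON.parse(line) as StepEvent);
}
```

In practice the log would be written to durable storage rather than held in a string; the string form just keeps the sketch free of I/O.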

16. LangGraph vs Training Logs

LangGraph can persist execution state and checkpoints, and can be useful for debugging. But that is different from having a clean, normalized dataset for:

  • training
  • replay
  • evaluation
  • analytics
  • cost analysis

So the orchestration system and the training/event log should be treated as separate concerns. Use orchestration state for execution and recovery. Use explicit event logs for learning and analysis.

17. Hooking Into the Executor Layer

Start with a headless executor boundary, not with deep integration into an interactive coding CLI. The first goal is not to instrument every internal token. The first goal is to supervise bounded work reliably inside batch execution.

What the executor boundary should do first:

  • launch a bounded task with a clear artifact bundle
  • capture the returned summary, diff, and verification results
  • classify the outcome as continue / needs-human / blocked / failed
  • stop after one coherent slice or checkpoint
  • emit structured events and traces to Langfuse

If an existing coding agent can operate in a non-chatty batch mode for bounded tasks, it can sit behind this executor boundary. If it cannot, keep it for human-guided stages and use a lower-level headless executor for automated assembly later.
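
The outcome classification at the executor boundary can be a small pure function over what the executor returns. The rules below are illustrative assumptions rather than a settled policy, and `ExecutorReturn` is a hypothetical shape.

```typescript
type Outcome = "continue" | "needs-human" | "blocked" | "failed";

// Hypothetical shape of what a bounded executor run hands back.
interface ExecutorReturn {
  summary: string;
  changedFiles: string[];
  checks: { name: string; passed: boolean }[];
  crashed: boolean; // the executor itself errored or timed out
  askedForHelp: boolean; // the agent reported it could not proceed alone
  missingArtifacts: string[]; // inputs the task needed but did not receive
}

function classifyOutcome(r: ExecutorReturn): Outcome {
  if (r.crashed) return "failed";
  if (r.missingArtifacts.length > 0) return "blocked";
  if (r.askedForHelp) return "needs-human";
  // No checks at all is treated as unverified, not as success.
  const allPassed = r.checks.length > 0 && r.checks.every((c) => c.passed);
  return allPassed ? "continue" : "needs-human";
}
```

Routing unverified or failing runs to `needs-human` is a conservative default; a later version might send some check failures back for a bounded retry instead.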

18. What to Capture From the Executor

If possible, capture at least:

  • exact task input sent
  • artifact bundle provided
  • model used
  • tool calls made
  • tool arguments
  • result summary
  • changed files
  • verification output
  • whether the run completed, asked for help, or thrashed
  • trace IDs, run IDs, span metadata, and checkpoint IDs

For training-grade data, exact per-step context is better than only a final summary. LangGraph persistence and Langfuse traces are helpful, but they are still not the same thing as a clean replay/eval dataset. Explicit event capture is still required.

19. Opencode-Specific Decision Point

Opencode is now mainly a reference implementation for inner-loop agent behavior rather than the likely runtime foundation.

Questions to answer before borrowing ideas from it or integrating with it:

  • Which parts are genuinely reusable in a headless Argo-oriented system?
  • Can its agent loop run cleanly in bounded, low-chatter automation?
  • Can you capture the effective prompt/context used?
  • Can you capture tool-use data well enough for replay and training?
  • Does borrowing from it reduce delivery time more than it increases architectural drag?

If the answers are mostly yes, copy patterns or adapt isolated pieces. If not, keep opencode as a reference and build a smaller dedicated executor for automated assembly and evaluation.

20. Portfolio Value

This can be a strong AgentOps portfolio project if it is real and measured. The impressive part is not size. The impressive part is demonstrating:

  • staged orchestration
  • model routing
  • evaluation and replay
  • cost and quality tradeoffs
  • review reduction
  • guardrails and risk controls
  • operational judgment about where humans stay in the loop

A compact, sharp system with metrics is better than a giant “swarm” that is hard to explain.

21. Practical Next Step

Build a thin layer around the current repository process and use it on real work. Do not pause everything to build a big system.

A good first direction:

  1. define machine-checkable phase and approval states
  2. define slice metadata and human sign-off criteria
  3. implement a small LangGraph state graph for one bounded slice
  4. wrap verification into one bundle
  5. add basic thrash detection and Langfuse tracing
  6. run it on one live workflow inside a headless batch path
  7. measure review time, retries, accepted-slice cost, and trace quality
  8. add explicit event logging for replay and evals
  9. decide whether any existing coding agent is good enough behind the executor boundary
  10. iterate based on real pain

22. Six-Step Build Order for the First Effect-Template Use Case

For the first use case, the target is developing software inside this repository structure rather than supporting a fully general coding environment from day one. The build order should stay tightly coupled to the minimum tool surface needed for one bounded workflow slice.

Step 1: Prove the End-to-End Slice Loop

Goal:

  • get one bounded software-development slice running end to end in the effect-template style

Tools to add:

  • read_file
  • list_files
  • write_file
  • edit_file
  • run_tests

Why this first:

  • proves the basic harness can take a task, operate on repository artifacts, make code changes, run verification, and return a classified result
  • keeps the tool surface small enough to debug failures clearly

Step 2: Add First-Class Observability

Goal:

  • make each run inspectable so failures and bottlenecks are visible immediately

Tools to add:

  • trace_run or equivalent Langfuse instrumentation hooks

Why now:

  • once the first loop works, observability becomes the fastest way to learn from real runs
  • tracing should come before adding much more autonomy so failures do not become opaque

Step 3: Add Replay and Evaluation Foundations

Goal:

  • make runs reproducible and measurable rather than anecdotal

Tools to add:

  • save_replay_record
  • replay_run

Why now:

  • replayability is one of the core differentiators of the harness
  • this creates a basis for later evals, cost analysis, and regression checks
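
One way `save_replay_record` and `replay_run` could fit together is sketched below: a record stores the exact input and the output observed at run time, and a replay re-executes a step against the stored inputs and reports mismatches. `ReplayRecord` and `replayRun` are hypothetical names, and the sketch assumes the step under replay is deterministic.

```typescript
interface ReplayRecord<I, O> {
  recordId: string;
  input: I; // exact input captured at run time
  recordedOutput: O; // what the step produced then
}

interface ReplayResult {
  recordId: string;
  matched: boolean;
}

// Re-runs `step` on each recorded input and compares with the recorded output.
function replayRun<I, O>(
  records: ReplayRecord<I, O>[],
  step: (input: I) => O,
  sameOutput: (a: O, b: O) => boolean,
): ReplayResult[] {
  return records.map((record) => ({
    recordId: record.recordId,
    matched: sameOutput(step(record.input), record.recordedOutput),
  }));
}
```

This works as written for deterministic pieces such as gate rules and parsers. Steps that call a model would need either recorded model responses or a tolerant `sameOutput` comparison.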

Step 4: Add Gates and Thrash Control

Goal:

  • keep the system from looping uselessly or pushing low-quality output forward

Tools to add:

  • gate_evaluator
  • thrash_detector

Why now:

  • after replay exists, it becomes easier to define and tune failure heuristics
  • this is where the harness starts protecting time and token spend rather than only executing work
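
A cheap `thrash_detector` heuristic could look at two signals across attempts on one slice: a diff that repeats an earlier attempt, and the same set of failing checks several times in a row. This is a sketch under assumptions: the `Attempt` shape, the default threshold of 3, and the idea that a diff fingerprint is computed elsewhere are all hypothetical.

```typescript
interface Attempt {
  diffFingerprint: string; // e.g. a hash of the normalized diff, computed elsewhere
  failingChecks: string[];
}

interface ThrashVerdict {
  thrashing: boolean;
  reason: string | null;
}

function detectThrash(attempts: Attempt[], maxStuck: number = 3): ThrashVerdict {
  const seen: string[] = [];
  let stuck = 0; // consecutive attempts with the same non-empty failure set
  let previousFailures: string | null = null;
  for (const attempt of attempts) {
    // Signal 1: the agent reproduced an earlier diff (A -> B -> A).
    if (seen.indexOf(attempt.diffFingerprint) !== -1) {
      return { thrashing: true, reason: "diff repeats an earlier attempt" };
    }
    seen.push(attempt.diffFingerprint);
    // Signal 2: the same checks keep failing with no progress.
    const failures = attempt.failingChecks.slice().sort().join("|");
    if (failures === "") {
      stuck = 0;
    } else if (failures === previousFailures) {
      stuck += 1;
    } else {
      stuck = 1;
    }
    if (stuck >= maxStuck) {
      return { thrashing: true, reason: `same failing checks for ${stuck} attempts` };
    }
    previousFailures = failures;
  }
  return { thrashing: false, reason: null };
}
```

Thresholds like `maxStuck` are the kind of value that replay data can help tune, which is one reason this step follows replay in the build order.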

Step 5: Improve Execution Safety

Goal:

  • make automated runs safer and more diagnosable in headless environments

Tools to add:

  • safer shell or sandbox executor
  • git_diff

Why now:

  • once the harness can already complete bounded slices, safety and change inspection become more valuable than adding more raw capability
  • diff visibility is especially useful for higher-level review

Step 6: Add Richer Orchestration

Goal:

  • move from one bounded slice to more capable autonomous workflow behavior

Tools to add:

  • planning or decomposition tool
  • multi-agent delegation or subtask dispatch

Why last:

  • richer orchestration compounds complexity quickly
  • it should be added only after the basic loop, observability, replay, and control mechanisms already work

23. Open Decisions Still Worth Discussing

Before implementing too much, it would be useful to decide:

  • what exact human approvals are required at each phase
  • how small slices should be in practice
  • what counts as thrash vs healthy iteration
  • what minimum event schema is worth storing now
  • whether to store full prompts or prompt templates plus resolved context
  • whether to store full tool outputs or normalized summaries plus raw attachments
  • how much of the coding CLI can be wrapped without losing its advantages
  • whether automated assembly should use the same executor as human-guided work
  • what metrics will prove the pipeline is saving time rather than adding ceremony
  • what data retention and privacy rules apply if you later train on this data

24. Bottom Line

The opportunity is real. The trust barrier is still the main constraint.

So the right move is:

  • do not wait for models to become magically trustworthy
  • do not overbuild a giant pipeline first
  • formalize the process you already use
  • move review up a level
  • keep human judgment at the risky seams
  • log the process in a training- and replay-friendly way
  • measure whether the pipeline actually saves time