Initial commit

2026-05-25 05:47:28 +00:00
commit 4d6495ffda
97 changed files with 13403 additions and 0 deletions
@@ -0,0 +1,448 @@
+# Agent Pipeline Notes
+
+## 1. Reality Check
+
+- Fully automated code writing is close for bounded, low-risk, well-specified work.
+- Fully unsupervised ownership of large, evolving, high-stakes systems is not close.
+- Raw coding ability is improving faster than architectural consistency, uncertainty calibration, and trustworthy self-review.
+- In the near term, the goal is not to remove review entirely. The goal is to move review up a level.
+
+## 2. Current Bottleneck
+
+- The early design phases in this repository are already relatively strong.
+- The main bottleneck is assembly and the review/refactor thrash around assembly.
+- The biggest time sink is repeated loops around common implementation issues, potential refactors, and reviewing too much low-level detail.
+- Strong design artifacts reduce the need to reconstruct intent from code, but they do not yet fully remove the need for human judgment.
+
+## 3. What a Pipeline Adds Beyond Manual Skill Use
+
+Right now, the human is acting as the scheduler and state machine.
+A pipeline externalizes that work so it is explicit and enforceable.
+
+Useful additions that were not as necessary when doing the process manually:
+
+- machine-checkable approval state
+- explicit slice definitions
+- spec-to-code traceability rules
+- human-signoff criteria by phase
+- artifact diffs between stages
+- automatic verification bundles
+- replayable evaluation runs
+- thrash/change-war detection
+- audit trail for decisions and outcomes
+
+The main benefit is not “agents do more coding.”
+The main benefit is that the process becomes more stable, repeatable, and reviewable.
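As a rough illustration of "machine-checkable approval state", the sketch below models phases and sign-offs as plain data and refuses to advance without a recorded approval. The phase names, `PipelineState`, `approve`, and `advance` are all hypothetical; the notes do not fix an exact phase list or API.

```typescript
// Hypothetical phase names; the notes do not fix an exact list.
type Phase = "requirements" | "design" | "assembly" | "review" | "done";

const PHASE_ORDER: readonly Phase[] = ["requirements", "design", "assembly", "review", "done"];

interface PipelineState {
  phase: Phase;
  // Phases whose exit gate a human has explicitly signed off.
  approved: readonly Phase[];
}

type Transition =
  | { ok: true; state: PipelineState }
  | { ok: false; reason: string };

// Record a sign-off without mutating the previous state.
function approve(state: PipelineState, phase: Phase): PipelineState {
  if (state.approved.indexOf(phase) !== -1) return state;
  return { phase: state.phase, approved: state.approved.concat([phase]) };
}

// Move to the next phase only when the current one has a recorded sign-off.
function advance(state: PipelineState): Transition {
  const next = PHASE_ORDER[PHASE_ORDER.indexOf(state.phase) + 1];
  if (next === undefined) {
    return { ok: false, reason: "pipeline is already done" };
  }
  if (state.approved.indexOf(state.phase) === -1) {
    return { ok: false, reason: 'phase "' + state.phase + '" has no recorded approval' };
  }
  return { ok: true, state: { phase: next, approved: state.approved } };
}
```

Because the state is plain data, it can be stored in a manifest, diffed between stages, and checked by any runner.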
+
+## 4. What to Pipeline First
+
+Build the thinnest useful pipeline around the current process.
+Do not start with a large swarm system.
+
+Best initial targets:
+
+- phase state machine
+- artifact and gate manifests
+- assembly slice runner
+- verification bundle
+- thrash detection
+- small replay benchmark harness
+
+Build this as a headless harness first, designed for Argo-style batch execution rather than as an interactive coding CLI.
+Use existing coding agents and projects like opencode as reference implementations for inner-loop patterns, not as the architectural foundation.
+
+## 5. What Should Stay Human
+
+Humans should still own the judgment-heavy transitions:
+
+- requirements freeze
+- workflow or slice approval
+- architecture exceptions
+- security sign-off for risky changes
+- final sign-off for high-blast-radius changes
+- resolving ambiguity or contradictory requirements
+
+The pipeline should reduce line-by-line review, not eliminate human judgment.
+
+## 6. Review at a Higher Level
+
+The practical goal is to move review from low-level code inspection to higher-level conformance review.
+
+Instead of asking:
+- What does this code do?
+- Did the agent miss some implementation detail?
+
+Try to make review focus on:
+- Does this slice match the frozen artifact?
+- Does it violate any domain invariants or trust boundaries?
+- Did it introduce any risky seams or suspicious shortcuts?
+- Does it need redesign or human review, or is it safe to merge?
+
+This is realistic if artifacts become stricter and slices become smaller.
+
+## 7. Preventing Subtle Logic Errors
+
+The system will mostly catch errors early rather than prevent every error outright.
+That is still valuable because catching errors before merge is much cheaper than catching them later.
+
+Helpful layers:
+
+- frozen design artifacts
+- spec-to-code traceability
+- property and mutation tests where useful
+- risk-focused seam review
+- deterministic boundary and architecture checks
+- forcing agents to cite which invariant each change satisfies
+
+The point is not perfection.
+The point is making subtle logic flaws rarer and cheaper to catch.
+
+## 8. Slices and Assembly Scope
+
+Assembly often does too much when it translates a whole blueprint at once.
+A better pattern is:
+
+1. freeze the artifact
+2. choose one workflow slice
+3. implement one tracer-bullet path end to end
+4. run tests/checks/review
+5. expand behavior within that slice
+6. move to the next slice
+
+A slice is not necessarily one PR per policy or adapter.
+A slice is one coherent bounded vertical change that may touch a workflow, several policies, and several adapters if they belong to one contract.
+
+## 9. Cost Reality
+
+A pipeline can save money only if it reduces retries, review thrash, and change wars.
+If it creates uncontrolled loops, it can absolutely increase token costs.
+
+The key metric is not cost per token.
+The key metrics are cost per accepted slice and time saved per accepted slice.
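A minimal sketch of how cost and time saved per accepted slice might be computed. The `SliceAttempt` fields and the `baselineMinutesPerSlice` estimate are assumptions for illustration, not a schema from these notes. Rejected attempts are charged to the total on purpose, since retries are exactly the cost the pipeline is supposed to reduce.

```typescript
// One attempt at a slice. Field names are illustrative, not a fixed schema.
interface SliceAttempt {
  sliceId: string;
  tokenCostUsd: number; // model spend for this attempt
  humanMinutes: number; // review and correction time for this attempt
  accepted: boolean;
}

interface SliceEconomics {
  acceptedSlices: number;
  costPerAcceptedSlice: number;
  minutesSavedPerAcceptedSlice: number;
}

// baselineMinutesPerSlice is an assumed estimate of manual effort per slice.
function sliceEconomics(
  attempts: readonly SliceAttempt[],
  baselineMinutesPerSlice: number
): SliceEconomics | null {
  const acceptedIds: string[] = [];
  let totalCost = 0;
  let totalMinutes = 0;
  attempts.forEach((attempt) => {
    // Rejected attempts still count: retries are part of the real cost.
    totalCost += attempt.tokenCostUsd;
    totalMinutes += attempt.humanMinutes;
    if (attempt.accepted && acceptedIds.indexOf(attempt.sliceId) === -1) {
      acceptedIds.push(attempt.sliceId);
    }
  });
  if (acceptedIds.length === 0) return null; // nothing accepted, ratios are undefined
  return {
    acceptedSlices: acceptedIds.length,
    costPerAcceptedSlice: totalCost / acceptedIds.length,
    minutesSavedPerAcceptedSlice: baselineMinutesPerSlice - totalMinutes / acceptedIds.length,
  };
}
```

A negative `minutesSavedPerAcceptedSlice` would indicate the pipeline is adding ceremony rather than saving time.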
+
+Important ideas:
+
+- use expensive reasoning at phase boundaries
+- use cheaper models only when artifacts are tight and the task is narrow
+- detect thrash early
+- benchmark replay on a small representative set, not everything
+- optimize for time saved, not just token minimization
+
+If spending more on tokens saves multiple days of work, that can still be a clear win.
+
+## 10. Minimum Viable Build
+
+For the clarified goal, LangGraph is worth adopting in the MVP because orchestration, durable execution, and resumable state are now part of the project’s core value.
+Start with a small headless TypeScript orchestrator built for batch execution inside Argo workflows.
+Add Langfuse from the beginning so traces, spans, prompts, runs, and outcomes are observable in a way that is useful both operationally and as portfolio evidence.
+
+Suggested MVP pieces:
+
+- LangGraph workflow graph / state machine for bounded slices
+- markdown artifacts with small machine-readable frontmatter or JSON manifests
+- simple TypeScript parsers for artifacts and statuses
+- Langfuse tracing for each run, stage, model call, and gate decision
+- verification bundle assembled into one review packet
+- cheap external thrash-detection heuristics
+- small replay benchmark harness
+- one headless executor path suitable for Argo
+
+This is enough to improve the real workflow and create a strong portfolio project while staying aligned with the desired end-state architecture.
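The "simple TypeScript parsers for artifacts and statuses" item could start as small as the sketch below. It assumes a frontmatter convention of flat `key: value` lines between two `---` fences and a hypothetical `status: frozen` field; neither is specified in these notes, and a real build might prefer a proper YAML library or JSON manifests.

```typescript
interface ParsedArtifact {
  meta: { [key: string]: string };
  body: string;
}

// Parses only flat "key: value" lines between two "---" fences; no nested YAML.
function parseArtifact(source: string): ParsedArtifact {
  const lines = source.split(/\r?\n/);
  const end = lines.indexOf("---", 1);
  if (lines[0] !== "---" || end === -1) {
    return { meta: {}, body: source }; // no frontmatter present
  }
  const meta: { [key: string]: string } = {};
  lines.slice(1, end).forEach((line) => {
    const colon = line.indexOf(":");
    if (colon > 0) {
      meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
    }
  });
  return { meta: meta, body: lines.slice(end + 1).join("\n") };
}

// Hypothetical rule: assembly may only consume artifacts marked "frozen".
function isFrozen(artifact: ParsedArtifact): boolean {
  return artifact.meta["status"] === "frozen";
}
```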
+
+## 11. Portable Process Core, Disposable Runner
+
+The right design is a portable process core with a replaceable runner.
+
+- process rules, states, gates, policies, artifacts, and evaluation logic should be portable
+- the runtime executor should be disposable or replaceable at the boundary level
+
+Even if LangGraph is used in the MVP, the value should still live primarily in the process definition, policies, artifacts, event schema, and evaluation logic rather than in LangGraph-specific node wiring.
+The graph runtime should be treated as an adapter for execution, checkpointing, and resumability, not as the place where domain process knowledge gets trapped.
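One way to picture the split is a pure decision function that knows nothing about the runtime, plus a narrow `Runner` interface that any executor can implement. Everything named here (`SliceTask`, `decideNext`, `Runner`, `fakeRunner`) is hypothetical, and the runner is synchronous only to keep the sketch short; a real executor would almost certainly be asynchronous.

```typescript
// Portable core: plain data and pure decisions, no runtime-specific imports.
interface SliceTask {
  sliceId: string;
  artifactIds: readonly string[];
}

interface SliceResult {
  sliceId: string;
  checksPassed: boolean;
}

type NextStep = "review" | "retry" | "escalate";

function decideNext(result: SliceResult, attempts: number, maxAttempts: number): NextStep {
  if (result.checksPassed) return "review";
  return attempts < maxAttempts ? "retry" : "escalate";
}

// Replaceable boundary: a graph node, a batch step, or a test fake could implement this.
interface Runner {
  execute(task: SliceTask): SliceResult;
}

// Drives the runner until the core decides to stop retrying.
function runSlice(
  runner: Runner,
  task: SliceTask,
  maxAttempts: number
): { step: NextStep; attempts: number } {
  let attempts = 0;
  for (;;) {
    attempts += 1;
    const step = decideNext(runner.execute(task), attempts, maxAttempts);
    if (step !== "retry") return { step: step, attempts: attempts };
  }
}

// Test fake: fails a fixed number of times, then passes.
function fakeRunner(failuresBeforePass: number): Runner {
  let calls = 0;
  return {
    execute: (task) => {
      calls += 1;
      return { sliceId: task.sliceId, checksPassed: calls > failuresBeforePass };
    },
  };
}
```

Swapping the runtime then means writing a new `Runner`, while `decideNext` and its tests stay untouched.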
+
+## 12. Workflow / Service Boundary for the Pipeline Itself
+
+Follow the repository’s existing separation when building the pipeline.
+
+Pure/domain-like pieces:
+- pipeline state
+- gate rules
+- slice metadata
+- approval rules
+- conformance decisions
+
+Workflow/service pieces:
+- run coding agents
+- invoke models
+- read and write artifacts
+- launch review jobs
+- open PRs
+- store logs and run records
+
+Example:
+- “Can assembly start?” is a pure policy decision.
+- “Launch the assembly agent with model X” is a service call inside a workflow.
+
+That keeps the process logic portable and makes LangGraph a replaceable adapter rather than a hard dependency.
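The "Can assembly start?" example could look roughly like this: a pure policy function that returns a decision with reasons, and a thin service-side wrapper that performs the launch only when the policy allows it. The input fields and the `LaunchAgent` callback are assumptions made for the sketch.

```typescript
interface AssemblyGateInput {
  artifactStatus: "draft" | "frozen";
  sliceApproved: boolean;
  openQuestions: number;
}

interface GateResult {
  allowed: boolean;
  reasons: string[];
}

// Pure policy: no I/O, so it can be unit-tested and replayed deterministically.
function canStartAssembly(input: AssemblyGateInput): GateResult {
  const reasons: string[] = [];
  if (input.artifactStatus !== "frozen") reasons.push("design artifact is not frozen");
  if (!input.sliceApproved) reasons.push("slice has no human approval");
  if (input.openQuestions > 0) reasons.push("requirements still have open questions");
  return { allowed: reasons.length === 0, reasons: reasons };
}

// Service-side step: the side effect is injected, not hard-wired.
type LaunchAgent = (sliceId: string, model: string) => void;

function startAssembly(
  sliceId: string,
  model: string,
  input: AssemblyGateInput,
  launch: LaunchAgent
): GateResult {
  const gate = canStartAssembly(input);
  if (gate.allowed) launch(sliceId, model);
  return gate;
}
```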
+
+## 13. LangGraph: When It Helps
+
+LangGraph is useful here because orchestration is now part of the main project rather than a future optimization.
+Its real benefits are:
+
+- durable execution
+- resumability
+- retries
+- branching workflows
+- checkpointed state
+- human-in-the-loop pauses
+- better operational management for complex graphs
+- easier alignment with headless batch execution in Argo-style systems
+
+For this direction, it is reasonable to use LangGraph in the MVP.
+The main caution is to avoid letting graph node wiring become the only place where business process rules live.
+Keep process semantics portable even if LangGraph is the first runtime.
+
+## 14. Training Data Value
+
+A pipeline can produce much better training data than raw chat logs.
+The valuable data is not just “the model wrote code.”
+The valuable data is:
+
+- given this phase and artifact state
+- with these checks failing or passing
+- what action/tool/model choice was correct next
+- what human correction was needed
+- what result was ultimately accepted
+
+This is useful for future model training, replay evaluation, and process improvement.
+
+## 15. What to Log for Training and Replay
+
+Do not assume the orchestration engine will automatically create a clean training corpus.
+Design explicit logging for the data you care about.
+
+Minimum useful fields:
+
+- run ID
+- stage ID
+- slice ID
+- artifact versions and hashes
+- prompt or template version
+- model choice
+- full input context given to the model for that step
+- tool choice and tool arguments
+- tool outputs or summaries
+- verification results
+- gate decision
+- human corrections or overrides
+- final disposition (accepted, rejected, redesign, escalated)
+
+If prompt design or routing rules materially affect cost or quality, log those too.
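The field list above maps fairly directly onto a typed event record. The sketch below is one possible shape, written as one JSON object per line; the property names, `REQUIRED_FIELDS`, and the two helper functions are illustrative assumptions rather than an agreed schema.

```typescript
type Disposition = "accepted" | "rejected" | "redesign" | "escalated";

interface StepEvent {
  runId: string;
  stageId: string;
  sliceId: string;
  artifactHashes: { [artifactId: string]: string };
  promptVersion: string;
  model: string;
  inputContext: string; // full context given to the model for this step
  toolCalls: { tool: string; args: unknown; outputSummary: string }[];
  verification: { check: string; passed: boolean }[];
  gateDecision: string;
  humanOverride: string | null;
  disposition: Disposition | null; // null until the slice reaches a final state
}

const REQUIRED_FIELDS: readonly string[] = [
  "runId", "stageId", "sliceId", "artifactHashes", "promptVersion",
  "model", "inputContext", "toolCalls", "verification", "gateDecision",
];

// Returns the names of required fields that are absent from a raw record.
function missingFields(raw: { [key: string]: unknown }): string[] {
  return REQUIRED_FIELDS.filter((field) => raw[field] === undefined || raw[field] === null);
}

// Serializes one event as a single JSON line, suitable for an append-only log.
function toLogLine(event: StepEvent): string {
  return JSON.stringify(event);
}
```

Validating records at write time is cheaper than discovering gaps when assembling a replay or training set later.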
+
+## 16. LangGraph vs Training Logs
+
+LangGraph can persist execution state and checkpoints, and can be useful for debugging.
+But that is different from having a clean, normalized dataset for:
+
+- training
+- replay
+- evaluation
+- analytics
+- cost analysis
+
+So the orchestration system and the training/event log should be treated as separate concerns.
+Use orchestration state for execution and recovery.
+Use explicit event logs for learning and analysis.
+
+## 17. Hooking Into the Executor Layer
+
+Start with a headless executor boundary, not with deep integration into an interactive coding CLI.
+The first goal is not to instrument every internal token.
+The first goal is to supervise bounded work reliably inside batch execution.
+
+What the executor boundary should do first:
+
+- launch a bounded task with a clear artifact bundle
+- capture the returned summary, diff, and verification results
+- classify the outcome as continue / needs-human / blocked / failed
+- stop after one coherent slice or checkpoint
+- emit structured events and traces to Langfuse
+
+If an existing coding agent can operate in a non-chatty batch mode for bounded tasks, it can sit behind this executor boundary.
+If it cannot, keep it for human-guided stages and use a lower-level headless executor for automated assembly later.
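A minimal sketch of the continue / needs-human / blocked / failed classification, assuming the executor returns a small structured report. The `ExecutorReport` fields and the precedence of the rules are assumptions; the notes name the four outcomes but not how to derive them.

```typescript
type Outcome = "continue" | "needs-human" | "blocked" | "failed";

// Hypothetical report shape returned by a headless executor run.
interface ExecutorReport {
  exitedCleanly: boolean;
  missingInputs: readonly string[]; // artifacts the task needed but did not receive
  openQuestions: readonly string[]; // ambiguities the agent raised
  failedChecks: readonly string[];  // verification checks that did not pass
}

// Rules are evaluated in order; the first match wins.
function classifyOutcome(report: ExecutorReport): Outcome {
  if (!report.exitedCleanly) return "failed";
  if (report.missingInputs.length > 0) return "blocked";
  if (report.openQuestions.length > 0) return "needs-human";
  if (report.failedChecks.length > 0) return "failed";
  return "continue";
}
```

Keeping this as a pure function means the classification rules can be replayed against stored reports when the heuristics change.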
+
+## 18. What to Capture From the Executor
+
+If possible, capture at least:
+
+- exact task input sent
+- artifact bundle provided
+- model used
+- tool calls made
+- tool arguments
+- result summary
+- changed files
+- verification output
+- whether the run completed, asked for help, or thrashed
+- trace IDs, run IDs, span metadata, and checkpoint IDs
+
+For training-grade data, exact per-step context is better than only a final summary.
+LangGraph persistence and Langfuse traces are helpful, but they are still not the same thing as a clean replay/eval dataset.
+Explicit event capture is still required.
+
+## 19. Opencode-Specific Decision Point
+
+Opencode is now mainly a reference implementation for inner-loop agent behavior rather than the likely runtime foundation.
+
+Questions to answer before borrowing ideas from it or integrating with it:
+
+- Which parts are genuinely reusable in a headless Argo-oriented system?
+- Can its agent loop run cleanly in bounded, low-chatter automation?
+- Can you capture the effective prompt/context used?
+- Can you capture tool-use data well enough for replay and training?
+- Does borrowing from it reduce delivery time more than it increases architectural drag?
+
+If the answers are mostly yes, copy patterns or adapt isolated pieces.
+If not, keep opencode as a reference and build a smaller dedicated executor for automated assembly and evaluation.
+
+## 20. Portfolio Value
+
+This can be a strong AgentOps portfolio project if it is real and measured.
+The impressive part is not size.
+The impressive part is demonstrating:
+
+- staged orchestration
+- model routing
+- evaluation and replay
+- cost and quality tradeoffs
+- review reduction
+- guardrails and risk controls
+- operational judgment about where humans stay in the loop
+
+A compact, sharp system with metrics is better than a giant “swarm” that is hard to explain.
+
+## 21. Practical Next Step
+
+Build a thin layer around the current repository process and use it on real work.
+Do not pause everything to build a big system.
+
+A good first direction:
+
+1. define machine-checkable phase and approval states
+2. define slice metadata and human sign-off criteria
+3. implement a small LangGraph state graph for one bounded slice
+4. wrap verification into one bundle
+5. add basic thrash detection and Langfuse tracing
+6. run it on one live workflow inside a headless batch path
+7. measure review time, retries, accepted-slice cost, and trace quality
+8. add explicit event logging for replay and evals
+9. decide whether any existing coding agent is good enough behind the executor boundary
+10. iterate based on real pain
+
+## 22. Six-Step Build Order for the First Effect-Template Use Case
+
+For the first use case, the target is developing software inside this repository structure rather than supporting a fully general coding environment from day one.
+The build order should stay tightly coupled to the minimum tool surface needed for one bounded workflow slice.
+
+### Step 1: Prove the End-to-End Slice Loop
+Goal:
+- get one bounded software-development slice running end to end in the effect-template style
+
+Tools to add:
+- read_file
+- list_files
+- write_file
+- edit_file
+- run_tests
+
+Why this first:
+- proves the basic harness can take a task, operate on repository artifacts, make code changes, run verification, and return a classified result
+- keeps the tool surface small enough to debug failures clearly
+
+### Step 2: Add First-Class Observability
+Goal:
+- make each run inspectable so failures and bottlenecks are visible immediately
+
+Tools to add:
+- trace_run or equivalent Langfuse instrumentation hooks
+
+Why now:
+- once the first loop works, observability becomes the fastest way to learn from real runs
+- tracing should come before adding much more autonomy so failures do not become opaque
+
+### Step 3: Add Replay and Evaluation Foundations
+Goal:
+- make runs reproducible and measurable rather than anecdotal
+
+Tools to add:
+- save_replay_record
+- replay_run
+
+Why now:
+- replayability is one of the core differentiators of the harness
+- this creates a basis for later evals, cost analysis, and regression checks
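The two tools in this step might begin as something like the sketch below: store the exact input and output of a run, then re-execute it and compare. The record fields, the in-memory `replayStore`, and the synchronous `Execute` callback are assumptions. Exact string equality is also a simplification, since model output is generally not deterministic; a real harness might compare verification results or normalized diffs instead.

```typescript
interface ReplayRecord {
  runId: string;
  input: string;                      // exact task input plus resolved context
  artifactHashes: readonly string[];  // versions of the artifacts the run saw
  output: string;                     // e.g. the produced diff or a normalized summary
}

// Hypothetical executor signature; synchronous only to keep the sketch short.
type Execute = (input: string) => string;

interface ReplayResult {
  runId: string;
  matches: boolean;
  replayedOutput: string;
}

// In-memory store for illustration; a real harness would persist records.
const replayStore: ReplayRecord[] = [];

function saveReplayRecord(record: ReplayRecord): void {
  replayStore.push(record);
}

// Re-executes a stored run and reports whether the output is unchanged.
function replayRun(runId: string, execute: Execute): ReplayResult | null {
  const found = replayStore.filter((record) => record.runId === runId)[0];
  if (found === undefined) return null; // unknown run
  const replayedOutput = execute(found.input);
  return { runId: runId, matches: replayedOutput === found.output, replayedOutput: replayedOutput };
}
```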
+
+### Step 4: Add Gates and Thrash Control
+Goal:
+- keep the system from looping uselessly or pushing low-quality output forward
+
+Tools to add:
+- gate_evaluator
+- thrash_detector
+
+Why now:
+- after replay exists, it becomes easier to define and tune failure heuristics
+- this is where the harness starts protecting time and token spend rather than only executing work
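As a starting point for `thrash_detector`, the sketch below applies two cheap heuristics over the attempt history of one slice: a diff that reappears (a change war), and a trailing run of attempts that all end with the same failing checks. Both heuristics, the `Attempt` shape, and the `maxStaleAttempts` threshold are assumptions; what counts as thrash versus healthy iteration is still an open decision in these notes.

```typescript
interface Attempt {
  diffHash: string;                 // hash of the diff this attempt produced
  failingChecks: readonly string[]; // checks still failing after the attempt
}

interface ThrashReport {
  thrashing: boolean;
  reasons: string[];
}

// Order-independent fingerprint of the failing checks; "" means all checks passed.
function failureSignature(attempt: Attempt): string {
  return attempt.failingChecks.slice().sort().join("|");
}

function detectThrash(attempts: readonly Attempt[], maxStaleAttempts: number): ThrashReport {
  const reasons: string[] = [];

  // Heuristic 1: an attempt reproduces an earlier diff (oscillation or change war).
  attempts.forEach((attempt, index) => {
    const repeated = attempts
      .slice(0, index)
      .some((earlier) => earlier.diffHash === attempt.diffHash);
    if (repeated) reasons.push("attempt " + index + " repeats an earlier diff");
  });

  // Heuristic 2: the most recent attempts all end with the same failing checks.
  let staleRun = 0;
  let previous = "";
  attempts.forEach((attempt) => {
    const signature = failureSignature(attempt);
    if (signature === "") staleRun = 0;
    else if (signature === previous) staleRun += 1;
    else staleRun = 1;
    previous = signature;
  });
  if (maxStaleAttempts > 0 && staleRun >= maxStaleAttempts) {
    reasons.push("last " + staleRun + " attempts failed the same checks");
  }

  return { thrashing: reasons.length > 0, reasons: reasons };
}
```

Returning reasons alongside the boolean lets `gate_evaluator` log why a run was stopped, which helps when tuning the thresholds against replayed runs.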
+
+### Step 5: Improve Execution Safety
+Goal:
+- make automated runs safer and more diagnosable in headless environments
+
+Tools to add:
+- safer shell or sandbox executor
+- git_diff
+
+Why now:
+- once the harness can already complete bounded slices, safety and change inspection become more valuable than adding more raw capability
+- diff visibility is especially useful for higher-level review
+
+### Step 6: Add Richer Orchestration
+Goal:
+- move from one bounded slice to more capable autonomous workflow behavior
+
+Tools to add:
+- planning or decomposition tool
+- multi-agent delegation or subtask dispatch
+
+Why last:
+- richer orchestration compounds complexity quickly
+- it should be added only after the basic loop, observability, replay, and control mechanisms already work
+
+## 23. Open Decisions Still Worth Discussing
+
+Before implementing too much, it would be useful to decide:
+
+- what exact human approvals are required at each phase
+- how small slices should be in practice
+- what counts as thrash vs healthy iteration
+- what minimum event schema is worth storing now
+- whether to store full prompts or prompt templates plus resolved context
+- whether to store full tool outputs or normalized summaries plus raw attachments
+- how much of the coding CLI can be wrapped without losing its advantages
+- whether automated assembly should use the same executor as human-guided work
+- what metrics will prove the pipeline is saving time rather than adding ceremony
+- what data retention and privacy rules apply if you later train on this data
+
+## 24. Bottom Line
+
+The opportunity is real.
+The trust barrier is still the main constraint.
+
+So the right move is:
+- do not wait for models to become magically trustworthy
+- do not overbuild a giant pipeline first
+- formalize the process you already use
+- move review up a level
+- keep human judgment at the risky seams
+- log the process in a training- and replay-friendly way
+- measure whether the pipeline actually saves time