ada/typescript-decompiler

ada 4d6495ffda Initial commit

2026-05-25 05:47:28 +00:00

16 KiB

Agent Pipeline Notes

1. Reality Check

Fully automated code writing is close for bounded, low-risk, well-specified work.
Fully unsupervised ownership of large, evolving, high-stakes systems is not close.
Raw coding ability is improving faster than architectural consistency, uncertainty calibration, and trustworthy self-review.
In the near term, the goal is not to remove review entirely. The goal is to move review up a level.

2. Current Bottleneck

The early design phases in this repository are already relatively strong.
The main bottleneck is assembly and the review/refactor thrash around assembly.
The biggest time sink is repeated loops around common implementation issues, potential refactors, and reviewing too much low-level detail.
Strong design artifacts reduce the need to reconstruct intent from code, but they do not yet fully remove the need for human judgment.

3. What a Pipeline Adds Beyond Manual Skill Use

Right now, the human is acting as the scheduler and state machine. A pipeline externalizes that work so it is explicit and enforceable.

Useful additions that were not as necessary when doing the process manually:

machine-checkable approval state
explicit slice definitions
spec-to-code traceability rules
human-signoff criteria by phase
artifact diffs between stages
automatic verification bundles
replayable evaluation runs
thrash/change-war detection
audit trail for decisions and outcomes

The main benefit is not “agents do more coding.” The main benefit is that the process becomes more stable, repeatable, and reviewable.

4. What to Pipeline First

Build the thinnest useful pipeline around the current process. Do not start with a large swarm system.

Best initial targets:

phase state machine
artifact and gate manifests
assembly slice runner
verification bundle
thrash detection
small replay benchmark harness

This should be built as a headless harness first, designed for Argo-style execution rather than an interactive coding CLI. Use existing coding agents and projects like opencode as reference implementations for inner-loop patterns, not as the architectural foundation.
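
As a rough illustration of the first target, a phase state machine can be a transition table plus one pure function. This is a minimal sketch under stated assumptions: the phase names, the `transitions` table, and the `advance` function are hypothetical and are not taken from the repository.

```typescript
// Hypothetical phase names; the notes do not fix an exact list.
type Phase = "requirements" | "design" | "assembly" | "verification" | "review" | "done";

// Allowed transitions, including steps back for rework.
const transitions: Record<Phase, Phase[]> = {
  requirements: ["design"],
  design: ["assembly", "requirements"],
  assembly: ["verification", "design"],
  verification: ["review", "assembly"],
  review: ["done", "assembly", "design"],
  done: [],
};

interface PipelineState {
  phase: Phase;
  history: Phase[]; // phases already left, oldest first
}

type TransitionResult =
  | { kind: "moved"; state: PipelineState }
  | { kind: "rejected"; reason: string };

// Pure function: returns a new state or a rejection, never mutates its input.
function advance(state: PipelineState, next: Phase): TransitionResult {
  if (transitions[state.phase].indexOf(next) === -1) {
    return { kind: "rejected", reason: `illegal transition: ${state.phase} -> ${next}` };
  }
  return {
    kind: "moved",
    state: { phase: next, history: state.history.concat(state.phase) },
  };
}
```

Because the table is plain data, it can be stored next to the artifacts and checked by any runner, which fits the headless-first direction described above.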

5. What Should Stay Human

Humans should still own the judgment-heavy transitions:

requirements freeze
workflow or slice approval
architecture exceptions
security sign-off for risky changes
final sign-off for high-blast-radius changes
resolving ambiguity or contradictory requirements

The pipeline should reduce line-by-line review, not eliminate human judgment.

6. Review at a Higher Level

The practical goal is to move review from low-level code inspection to higher-level conformance review.

Instead of asking:

What does this code do?
Did the agent miss some implementation detail?

Try to make review focus on:

Does this slice match the frozen artifact?
Does it violate any domain invariants or trust boundaries?
Did it introduce any risky seams or suspicious shortcuts?
Does it need redesign or human review, or is it safe to merge?

This is realistic if artifacts become stricter and slices become smaller.

7. Preventing Subtle Logic Errors

The system will mostly catch errors early rather than prevent every error outright. That is still valuable because catching errors before merge is much cheaper than catching them later.

Helpful layers:

frozen design artifacts
spec-to-code traceability
property and mutation tests where useful
risk-focused seam review
deterministic boundary and architecture checks
forcing agents to cite which invariant each change satisfies

The point is not perfection. The point is making subtle logic flaws rarer and cheaper to catch.
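
One way to make the "cite which invariant each change satisfies" layer checkable is a small validator over the agent's claims. The sketch below is only an assumption about how that could look: `ChangeClaim`, the invariant ID format, and `findCitationProblems` are hypothetical names. It checks that citations exist and refer to known invariants; it does not prove the change actually satisfies them.

```typescript
// Hypothetical shape: one entry per changed file, listing the invariant IDs
// the agent claims the change satisfies.
interface ChangeClaim {
  file: string;
  invariantIds: string[];
}

// Returns human-readable problems. An empty array means every change cites
// at least one invariant that exists in the frozen artifact.
function findCitationProblems(claims: ChangeClaim[], knownInvariants: string[]): string[] {
  const problems: string[] = [];
  for (const claim of claims) {
    if (claim.invariantIds.length === 0) {
      problems.push(`${claim.file}: no invariant cited`);
      continue;
    }
    for (const id of claim.invariantIds) {
      if (knownInvariants.indexOf(id) === -1) {
        problems.push(`${claim.file}: unknown invariant ${id}`);
      }
    }
  }
  return problems;
}
```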

8. Slices and Assembly Scope

Assembly often does too much when it translates a whole blueprint at once. A better pattern is:

freeze the artifact
choose one workflow slice
implement one tracer-bullet path end to end
run tests/checks/review
expand behavior within that slice
move to the next slice

A slice is not necessarily one PR per policy or adapter. A slice is one coherent bounded vertical change that may touch a workflow, several policies, and several adapters if they belong to one contract.
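
A slice definition along these lines could be captured as a small record, with a check that flags changes outside what the slice declared. This is a sketch with hypothetical field and component names; the notes do not prescribe a schema.

```typescript
// Hypothetical slice record: one contract bounds the slice, which may still
// touch one workflow, several policies, and several adapters.
interface SliceDefinition {
  id: string;
  contract: string;
  workflow: string;
  policies: string[];
  adapters: string[];
}

// Returns the changed component names that the slice did not declare.
function undeclaredChanges(slice: SliceDefinition, changed: string[]): string[] {
  const declared = [slice.workflow].concat(slice.policies, slice.adapters);
  return changed.filter((name) => declared.indexOf(name) === -1);
}

// Illustrative slice; all names are made up for the example.
const refundSlice: SliceDefinition = {
  id: "refund-1",
  contract: "RefundContract",
  workflow: "refundWorkflow",
  policies: ["refundEligibility", "refundLimit"],
  adapters: ["paymentGateway"],
};
```

A non-empty result from `undeclaredChanges` is a cheap signal that assembly has drifted beyond the approved slice and may need a human look.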

9. Cost Reality

A pipeline can save money only if it reduces retries, review thrash, and change wars. If it creates uncontrolled loops, it can absolutely increase token costs.

The key metric is not cost per token. The key metrics are cost per accepted slice and time saved per accepted slice.

Important ideas:

use expensive reasoning at phase boundaries
use cheaper models only when artifacts are tight and the task is narrow
detect thrash early
benchmark replay on a small representative set, not everything
optimize for time saved, not just token minimization

If spending more on tokens saves multiple days of work, that can still be a clear win.
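
To make the accepted-slice metric concrete, the sketch below charges every run, including rejected retries, against the slices that were eventually accepted. The `SliceRun` shape and the USD unit are assumptions for illustration; time saved per accepted slice could be computed the same way from review-time records.

```typescript
// Hypothetical run record; a slice may have several runs before acceptance.
interface SliceRun {
  sliceId: string;
  tokenCostUsd: number;
  accepted: boolean;
}

// Total spend (including rejected retries) divided by the number of distinct
// accepted slices. Returns null when nothing has been accepted yet.
function costPerAcceptedSlice(runs: SliceRun[]): number | null {
  const acceptedIds: Record<string, true> = {};
  let total = 0;
  for (const run of runs) {
    total += run.tokenCostUsd;
    if (run.accepted) acceptedIds[run.sliceId] = true;
  }
  const acceptedCount = Object.keys(acceptedIds).length;
  return acceptedCount === 0 ? null : total / acceptedCount;
}

// Two accepted slices, one rejected retry: (2 + 3 + 5) / 2 = 5
const exampleCost = costPerAcceptedSlice([
  { sliceId: "s1", tokenCostUsd: 2, accepted: false },
  { sliceId: "s1", tokenCostUsd: 3, accepted: true },
  { sliceId: "s2", tokenCostUsd: 5, accepted: true },
]);
```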

10. Minimum Viable Build

For the clarified goal, LangGraph is worth adopting in the MVP because orchestration, durable execution, and resumable state are now part of the project’s core value. Start with a small headless TypeScript orchestrator built for batch execution inside Argo workflows. Add Langfuse from the beginning so traces, spans, prompts, runs, and outcomes are observable in a way that is useful both operationally and on a resume.

Suggested MVP pieces:

LangGraph workflow graph / state machine for bounded slices
markdown artifacts with small machine-readable frontmatter or JSON manifests
simple TypeScript parsers for artifacts and statuses
Langfuse tracing for each run, stage, model call, and gate decision
verification bundle assembled into one review packet
cheap external thrash-detection heuristics
small replay benchmark harness
one headless executor path suitable for Argo

This is enough to improve the real workflow and create a strong portfolio project while staying aligned with the desired end-state architecture.
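
For the "simple TypeScript parsers" item, a frontmatter reader can stay very small if the frontmatter is restricted to flat `key: value` lines. The sketch below assumes exactly that restriction and is not a YAML parser; the `status` and `slice` keys in the usage example are hypothetical.

```typescript
interface ParsedArtifact {
  meta: Record<string, string>;
  body: string;
}

// Reads a leading block delimited by "---" lines. Only flat "key: value"
// pairs are supported; anything else inside the block is ignored.
function parseArtifact(text: string): ParsedArtifact {
  const lines = text.split(/\r?\n/);
  if (lines[0].trim() !== "---") return { meta: {}, body: text };

  let end = -1;
  for (let i = 1; i < lines.length; i++) {
    if (lines[i].trim() === "---") {
      end = i;
      break;
    }
  }
  if (end === -1) return { meta: {}, body: text }; // unterminated block

  const meta: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    meta[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  return { meta, body: lines.slice(end + 1).join("\n") };
}

const parsedExample = parseArtifact("---\nstatus: frozen\nslice: checkout-1\n---\n# Design\nBody text");
// parsedExample.meta.status is "frozen"; parsedExample.body starts with "# Design"
```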

11. Portable Process Core, Disposable Runner

The right design is a portable process core with a replaceable runner.

process rules, states, gates, policies, artifacts, and evaluation logic should be portable
the runtime executor should be disposable or replaceable at the boundary level

Even if LangGraph is used in the MVP, the value should still live primarily in the process definition, policies, artifacts, event schema, and evaluation logic rather than in LangGraph-specific node wiring. The graph runtime should be treated as an adapter for execution, checkpointing, and resumability, not as the place where domain process knowledge gets trapped.
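
The split can be sketched as a core that only knows an interface, with the runtime plugged in behind it. Everything named here (`SliceTask`, `RunReport`, `Runner`, `decideNext`) is hypothetical, and the interface is kept synchronous for brevity even though a real runner would almost certainly be asynchronous.

```typescript
// Portable core: plain data and pure decisions, no runtime imports.
interface SliceTask {
  sliceId: string;
  artifactPaths: string[];
}

interface RunReport {
  sliceId: string;
  testsPassed: boolean;
}

// Replaceable boundary: a LangGraph graph, an Argo step, or a test fake
// could each sit behind this interface.
interface Runner {
  run(task: SliceTask): RunReport;
}

type NextStep = "merge-candidate" | "rework";

// Process knowledge lives here, not in runner-specific wiring.
function decideNext(report: RunReport): NextStep {
  return report.testsPassed ? "merge-candidate" : "rework";
}

function processSlice(task: SliceTask, runner: Runner): NextStep {
  return decideNext(runner.run(task));
}

// A fake runner is enough to exercise the core without any runtime.
const fakeRunner: Runner = {
  run: (task) => ({ sliceId: task.sliceId, testsPassed: true }),
};
```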

12. Workflow / Service Boundary for the Pipeline Itself

Follow the repository’s existing separation when building the pipeline.

Pure/domain-like pieces:

pipeline state
gate rules
slice metadata
approval rules
conformance decisions

Workflow/service pieces:

run coding agents
invoke models
read and write artifacts
launch review jobs
open PRs
store logs and run records

Example:

“Can assembly start?” is a pure policy decision.
“Launch the assembly agent with model X” is a service call inside a workflow.

That makes the process logic portable and keeps LangGraph as a replaceable adapter instead of a hard dependency.
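
The "Can assembly start?" example might look like the following pure function. The input fields and denial reasons are assumptions chosen for illustration, not the repository's actual gate rules.

```typescript
// Hypothetical inputs to the gate; all are plain data read from manifests.
interface AssemblyGateInput {
  artifactStatus: "draft" | "frozen";
  sliceApproved: boolean;
  openArchitectureExceptions: number;
}

type AssemblyGateDecision =
  | { kind: "allow" }
  | { kind: "deny"; reasons: string[] };

// Pure policy: no I/O, no model calls, same answer for the same input.
function canStartAssembly(input: AssemblyGateInput): AssemblyGateDecision {
  const reasons: string[] = [];
  if (input.artifactStatus !== "frozen") reasons.push("artifact is not frozen");
  if (!input.sliceApproved) reasons.push("slice has not been approved");
  if (input.openArchitectureExceptions > 0) reasons.push("architecture exceptions are unresolved");
  return reasons.length === 0 ? { kind: "allow" } : { kind: "deny", reasons };
}
```

Launching the agent would then be a separate service call that a workflow makes only after this function returns `allow`.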

13. LangGraph: When It Helps

LangGraph is useful here because orchestration is now part of the main project rather than a future optimization. Its real benefits are:

durable execution
resumability
retries
branching workflows
checkpointed state
human-in-the-loop pauses
better operational management for complex graphs
easier alignment with headless batch execution in Argo-style systems

For this direction, it is reasonable to use LangGraph in the MVP. The main caution is to avoid letting graph node wiring become the only place where business process rules live. Keep process semantics portable even if LangGraph is the first runtime.

14. Training Data Value

A pipeline can produce much better training data than raw chat logs. The valuable data is not just “the model wrote code.” The valuable data is:

given this phase and artifact state
with these checks failing or passing
what action/tool/model choice was correct next
what human correction was needed
what result was ultimately accepted

This is useful for future model training, replay evaluation, and process improvement.

15. What to Log for Training and Replay

Do not assume the orchestration engine will automatically create a clean training corpus. Design explicit logging for the data you care about.

Minimum useful fields:

run ID
stage ID
slice ID
artifact versions and hashes
prompt or template version
model choice
full input context given to the model for that step
tool choice and tool arguments
tool outputs or summaries
verification results
gate decision
human corrections or overrides
final disposition (accepted, rejected, redesign, escalated)

If prompt design or routing rules materially affect cost or quality, log those too.
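
The field list above maps naturally onto one typed event per step, written as one JSON object per line. The type and property names below are assumptions for illustration, and `fromJsonLine` deliberately does no validation, which a real log reader would need.

```typescript
type Disposition = "accepted" | "rejected" | "redesign" | "escalated";

interface ToolCallRecord {
  tool: string;
  args: Record<string, unknown>;
  outputSummary: string;
}

// Hypothetical event shape covering the minimum useful fields.
interface StepEvent {
  runId: string;
  stageId: string;
  sliceId: string;
  artifactHashes: Record<string, string>; // artifact path -> content hash
  promptVersion: string;
  model: string;
  inputContext: string; // full context given to the model for this step
  toolCalls: ToolCallRecord[];
  verification: { check: string; passed: boolean }[];
  gateDecision: string;
  humanCorrections: string[];
  disposition: Disposition;
}

// One JSON object per line keeps the log append-only and easy to stream.
function toJsonLine(event: StepEvent): string {
  return JSON.stringify(event);
}

// Sketch only: trusts the line to be a well-formed StepEvent.
function fromJsonLine(line: string): StepEvent {
  return JSON.parse(line) as StepEvent;
}
```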

16. LangGraph vs Training Logs

LangGraph can persist execution state and checkpoints, and can be useful for debugging. But that is different from having a clean, normalized dataset for:

training
replay
evaluation
analytics
cost analysis

So the orchestration system and the training/event log should be treated as separate concerns. Use orchestration state for execution and recovery. Use explicit event logs for learning and analysis.

17. Hooking Into the Executor Layer

Start with a headless executor boundary, not with deep integration into an interactive coding CLI. The first goal is not to instrument every internal token. The first goal is to supervise bounded work reliably inside batch execution.

What the executor boundary should do first:

launch a bounded task with a clear artifact bundle
capture the returned summary, diff, and verification results
classify the outcome as continue / needs-human / blocked / failed
stop after one coherent slice or checkpoint
emit structured events and traces to Langfuse

If an existing coding agent can operate in a non-chatty batch mode for bounded tasks, it can sit behind this executor boundary. If it cannot, keep it for human-guided stages and use a lower-level headless executor for automated assembly later.
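
The outcome classification step could start as a few ordered rules over what the executor returns. The `ExecutorResult` fields and the rule order below are assumptions for the sketch; the four outcome labels come from the list above.

```typescript
type Outcome = "continue" | "needs-human" | "blocked" | "failed";

// Hypothetical summary of one bounded executor run.
interface ExecutorResult {
  exitedCleanly: boolean;
  testsPassed: boolean;
  askedForHelp: boolean;
  missingInputs: string[]; // artifacts the task needed but did not receive
}

// Rules are checked from most to least severe; the first match wins.
function classifyOutcome(result: ExecutorResult): Outcome {
  if (!result.exitedCleanly) return "failed";
  if (result.missingInputs.length > 0) return "blocked";
  if (result.askedForHelp || !result.testsPassed) return "needs-human";
  return "continue";
}
```

Keeping this as a pure function means the same classification can be re-run later over stored results when the rules change.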

18. What to Capture From the Executor

If possible, capture at least:

exact task input sent
artifact bundle provided
model used
tool calls made
tool arguments
result summary
changed files
verification output
whether the run completed, asked for help, or thrashed
trace IDs, run IDs, span metadata, and checkpoint IDs

For training-grade data, exact per-step context is better than only a final summary. LangGraph persistence and Langfuse traces are helpful, but they are still not the same thing as a clean replay/eval dataset. Explicit event capture is still required.

19. Opencode-Specific Decision Point

Opencode is now mainly a reference implementation for inner-loop agent behavior rather than the likely runtime foundation.

Questions to answer before borrowing ideas from it or integrating with it:

Which parts are genuinely reusable in a headless Argo-oriented system?
Can its agent loop run cleanly in bounded, low-chatter automation?
Can you capture the effective prompt/context used?
Can you capture tool-use data well enough for replay and training?
Does borrowing from it reduce delivery time more than it increases architectural drag?

If the answers are mostly yes, copy patterns or adapt isolated pieces. If not, the best path is to keep opencode as a reference and build a smaller dedicated executor for automated assembly and evaluation.

20. Portfolio Value

This can be a strong AgentOps portfolio project if it is real and measured. The impressive part is not size. The impressive part is demonstrating:

staged orchestration
model routing
evaluation and replay
cost and quality tradeoffs
review reduction
guardrails and risk controls
operational judgment about where humans stay in the loop

A compact, sharp system with metrics is better than a giant “swarm” that is hard to explain.

21. Practical Next Step

Build a thin layer around the current repository process and use it on real work. Do not pause everything to build a big system.

A good first direction:

define machine-checkable phase and approval states
define slice metadata and human sign-off criteria
implement a small LangGraph state graph for one bounded slice
wrap verification into one bundle
add basic thrash detection and Langfuse tracing
run it on one live workflow inside a headless batch path
measure review time, retries, accepted-slice cost, and trace quality
add explicit event logging for replay and evals
decide whether any existing coding agent is good enough behind the executor boundary
iterate based on real pain

22. Six-Step Build Order for the First Effect-Template Use Case

For the first use case, the target is developing software inside this repository structure rather than supporting a fully general coding environment from day one. The build order should stay tightly coupled to the minimum tool surface needed for one bounded workflow slice.

Step 1: Prove the End-to-End Slice Loop

Goal:

get one bounded software-development slice running end to end in the effect-template style

Tools to add:

read_file
list_files
write_file
edit_file
run_tests

Why this first:

proves the basic harness can take a task, operate on repository artifacts, make code changes, run verification, and return a classified result
keeps the tool surface small enough to debug failures clearly
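
A small tool surface like this can be expressed as one interface, which also makes it easy to test the harness without touching a real repository. The sketch below is an in-memory stand-in under stated assumptions: the method signatures, `createInMemoryTools`, and the injected `testRunner` are hypothetical, and a real implementation would read the file system and shell out to the test command.

```typescript
interface TestResult {
  passed: boolean;
  output: string;
}

// The five Step 1 tools, with assumed signatures.
interface ToolSurface {
  read_file(path: string): string;
  list_files(): string[];
  write_file(path: string, content: string): void;
  edit_file(path: string, search: string, replacement: string): boolean;
  run_tests(): TestResult;
}

function createInMemoryTools(
  initial: Record<string, string>,
  testRunner: (files: Record<string, string>) => TestResult
): ToolSurface {
  const files: Record<string, string> = { ...initial };
  const has = (path: string): boolean => Object.prototype.hasOwnProperty.call(files, path);
  return {
    read_file(path) {
      if (!has(path)) throw new Error(`no such file: ${path}`);
      return files[path];
    },
    list_files() {
      return Object.keys(files).sort();
    },
    write_file(path, content) {
      files[path] = content;
    },
    // Replaces the first occurrence of `search`; returns false if nothing matched.
    edit_file(path, search, replacement) {
      if (!has(path)) return false;
      const at = files[path].indexOf(search);
      if (at === -1) return false;
      files[path] = files[path].slice(0, at) + replacement + files[path].slice(at + search.length);
      return true;
    },
    run_tests() {
      return testRunner({ ...files }); // pass a copy so the runner cannot mutate state
    },
  };
}
```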

Step 2: Add First-Class Observability

Goal:

make each run inspectable so failures and bottlenecks are visible immediately

Tools to add:

trace_run or equivalent Langfuse instrumentation hooks

Why now:

once the first loop works, observability becomes the fastest way to learn from real runs
tracing should come before adding much more autonomy so failures do not become opaque

Step 3: Add Replay and Evaluation Foundations

Goal:

make runs reproducible and measurable rather than anecdotal

Tools to add:

save_replay_record
replay_run

Why now:

replayability is one of the core differentiators of the harness
this creates a basis for later evals, cost analysis, and regression checks
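
In their simplest form, the two replay tools store a step's input with its accepted output and later re-execute the step to see whether the output still matches. The sketch below assumes the step is deterministic (for example, model responses are stubbed from the record), uses an in-memory array as the store, and invents the `ReplayRecord` shape.

```typescript
// Hypothetical record: enough to re-run one step and compare the result.
interface ReplayRecord {
  runId: string;
  input: string;
  expectedOutput: string;
}

type ReplayOutcome =
  | { kind: "match" }
  | { kind: "mismatch"; actual: string }
  | { kind: "not-found" };

// In-memory store for the sketch; a real harness would persist records.
const replayStore: ReplayRecord[] = [];

function save_replay_record(record: ReplayRecord): void {
  replayStore.push(record);
}

// Re-executes `step` on the stored input and compares with the stored output.
function replay_run(runId: string, step: (input: string) => string): ReplayOutcome {
  const found = replayStore.filter((r) => r.runId === runId)[0];
  if (found === undefined) return { kind: "not-found" };
  const actual = step(found.input);
  return actual === found.expectedOutput ? { kind: "match" } : { kind: "mismatch", actual };
}
```

Exact string comparison is the strictest possible check; later evals may want a looser comparison such as "tests still pass".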

Step 4: Add Gates and Thrash Control

Goal:

keep the system from looping uselessly or pushing low-quality output forward

Tools to add:

gate_evaluator
thrash_detector

Why now:

after replay exists, it becomes easier to define and tune failure heuristics
this is where the harness starts protecting time and token spend rather than only executing work
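
A first thrash detector can be a cheap heuristic over per-iteration summaries. The sketch below flags two patterns, chosen as assumptions for illustration: the working tree returning to a state it was already in, and the same set of checks failing for several iterations in a row. The `Iteration` shape, the fingerprint idea, and the default threshold of 3 are all hypothetical.

```typescript
// Hypothetical summary of one agent iteration within a slice.
interface Iteration {
  stateFingerprint: string; // e.g. a hash of the working tree after the iteration
  failingChecks: string[];
}

// Returns a reason string when thrash is suspected, otherwise null.
function detectThrash(iterations: Iteration[], maxSameFailure = 3): string | null {
  const seen: Record<string, number> = {};
  let streakKey = "";
  let streak = 0;

  for (let i = 0; i < iterations.length; i++) {
    const fp = iterations[i].stateFingerprint;
    // Pattern 1: the code returned to an earlier state (revert cycle or no progress).
    if (Object.prototype.hasOwnProperty.call(seen, fp)) {
      return `state from iteration ${seen[fp]} reappeared at iteration ${i}`;
    }
    seen[fp] = i;

    // Pattern 2: the same set of checks keeps failing.
    const key = iterations[i].failingChecks.slice().sort().join(",");
    if (key !== "" && key === streakKey) {
      streak++;
    } else {
      streakKey = key;
      streak = key === "" ? 0 : 1;
    }
    if (streak >= maxSameFailure) {
      return `same failing checks for ${streak} iterations: ${key}`;
    }
  }
  return null;
}
```

Thresholds like this are easier to tune once replay records exist, since the detector can be re-run over past runs to see where it would have fired.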

Step 5: Improve Execution Safety

Goal:

make automated runs safer and more diagnosable in headless environments

Tools to add:

safer shell or sandbox executor
git_diff

Why now:

once the harness can already complete bounded slices, safety and change inspection become more valuable than adding more raw capability
diff visibility is especially useful for higher-level review

Step 6: Add Richer Orchestration

Goal:

move from one bounded slice to more capable autonomous workflow behavior

Tools to add:

planning or decomposition tool
multi-agent delegation or subtask dispatch

Why last:

richer orchestration compounds complexity quickly
it should be added only after the basic loop, observability, replay, and control mechanisms already work

23. Open Decisions Still Worth Discussing

Before implementing too much, it would be useful to decide:

what exact human approvals are required at each phase
how small slices should be in practice
what counts as thrash vs healthy iteration
what minimum event schema is worth storing now
whether to store full prompts or prompt templates plus resolved context
whether to store full tool outputs or normalized summaries plus raw attachments
how much of the coding CLI can be wrapped without losing its advantages
whether automated assembly should use the same executor as human-guided work
what metrics will prove the pipeline is saving time rather than adding ceremony
what data retention and privacy rules apply if you later train on this data

24. Bottom Line

The opportunity is real. The trust barrier is still the main constraint.

So the right move is:

do not wait for models to become magically trustworthy
do not overbuild a giant pipeline first
formalize the process you already use
move review up a level
keep human judgment at the risky seams
log the process in a training- and replay-friendly way
measure whether the pipeline actually saves time
