5 Hard Lessons from Building an AI Coding Agent

If you think the core competency of an AI coding agent is "picking a good model," you'll hit a wall in your first month. What actually determines whether an agent works is context management, error recovery, tool design, and safety boundaries — all engineering problems.


Lesson 1: Context Is Oxygen, Not Fuel

Our early intuition was "bigger context window = better, stuff everything in." Practice taught us otherwise: more context isn't better — the excess is all noise.

In a 200K-token context, if 150K is redundant tool-call history, formatted logs, and irrelevant file content, the model's effective attention gets diluted. PieBox's current approach:

  • AGENTS.md instructions are pinned at the prompt's very front — never compressed away
  • Tool-call history gets incremental compression — keep cause and effect, strip raw log output
  • Only load genuinely relevant file snippets, never full-file imports
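
The three rules above can be sketched as a small context builder. This is a minimal illustration, not PieBox's actual implementation: `ToolCall`, `estimate_tokens`, `summarize_call`, and `build_context` are hypothetical names, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    tool: str      # e.g. "bash"
    command: str   # what the agent asked the tool to do
    output: str    # raw output, possibly very long
    ok: bool       # whether the call succeeded


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def summarize_call(call: ToolCall, max_chars: int = 80) -> str:
    # Keep cause (the command) and effect (status plus last output line);
    # drop the rest of the raw log.
    lines = call.output.strip().splitlines()
    tail = lines[-1][:max_chars] if lines else ""
    status = "ok" if call.ok else "FAILED"
    return f"[{call.tool}] {call.command} -> {status}: {tail}"


def build_context(instructions: str, history: List[ToolCall],
                  snippets: List[str], budget: int) -> str:
    # Pinned instructions always come first and are never dropped.
    parts = [instructions]
    used = estimate_tokens(instructions)

    # Compressed history: walk newest to oldest until the budget runs out,
    # then restore chronological order.
    kept: List[str] = []
    for call in reversed(history):
        summary = summarize_call(call)
        cost = estimate_tokens(summary)
        if used + cost > budget:
            break
        kept.append(summary)
        used += cost
    parts.extend(reversed(kept))

    # Snippets are assumed to arrive sorted by relevance; skip any that do not fit.
    for snippet in snippets:
        cost = estimate_tokens(snippet)
        if used + cost > budget:
            continue
        parts.append(snippet)
        used += cost

    return "\n\n".join(parts)
```

In this sketch, a 500-line test log collapses to a single line such as `[bash] pytest -> FAILED: 1 failed`, which preserves what was run and how it ended while freeing the budget for relevant code.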

Lesson: token budget is the agent's scarcest resource. Give the model only what it actually needs.


Lesson 2: Tool Design Sets the Agent's Ceiling

What an agent can do depends on what tools it has. But tool granularity is a trap: too coarse (one "write code" tool that does everything) forces the model to make too many decisions on its own, cratering accuracy; too fine (separate readFile, writeFile, appendFile tools) increases call count and latency.

PieBox's strategy is tool layering:

  • Coarse-grained tools (explore, deploy) for the planning layer — let the model make macro decisions
  • Fine-grained tools (bash, edit, read) for the execution layer — precise file system operations
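
One way to picture the layering is a registry that tags each tool with its layer and only exposes a layer's own tools to it. The sketch below is an assumption about how such a registry could look, not PieBox's code: `Layer`, `Tool`, `ToolRegistry`, and `build_default_registry` are hypothetical names, the handlers are stubs, and the rule that a layer cannot call the other layer's tools is an illustrative choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Layer(Enum):
    PLANNING = "planning"     # macro decisions
    EXECUTION = "execution"   # precise file system operations


@dataclass(frozen=True)
class Tool:
    name: str
    layer: Layer
    run: Callable[[str], str]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def visible_to(self, layer: Layer) -> List[str]:
        # Each layer is only offered its own tools, which keeps the choice set small.
        return sorted(n for n, t in self._tools.items() if t.layer is layer)

    def call(self, layer: Layer, name: str, arg: str) -> str:
        tool = self._tools.get(name)
        if tool is None or tool.layer is not layer:
            raise PermissionError(f"{name!r} is not available to the {layer.value} layer")
        return tool.run(arg)


def build_default_registry() -> ToolRegistry:
    # Stub handlers for illustration only; real tools would touch the file system.
    registry = ToolRegistry()
    registry.register(Tool("explore", Layer.PLANNING, lambda arg: f"explored {arg}"))
    registry.register(Tool("deploy", Layer.PLANNING, lambda arg: f"deployed {arg}"))
    registry.register(Tool("bash", Layer.EXECUTION, lambda arg: f"ran {arg}"))
    registry.register(Tool("edit", Layer.EXECUTION, lambda arg: f"edited {arg}"))
    registry.register(Tool("read", Layer.EXECUTION, lambda arg: f"read {arg}"))
    return registry
```

With this shape, the planning prompt lists only `deploy` and `explore`, and the execution prompt lists only `bash`, `edit`, and `read`.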

Lesson: a good tool isn't one with the most features — it's one that lets the model complete a task in two steps or fewer.


Lesson 3: Failure Recovery Matters More Than Success

Suppose an agent runs for 10 minutes. If any single step breaks (a file half-written, a failing test, a command without permissions) and nothing catches it, the entire run is wasted.

Our most painful experience: the agent refactored 17 files. On the very last one, a type error broke the interface signature — and then it "confidently" proceeded to the next step. By the time we caught it, the previous 16 files' changes were already inconsistent with the new signature.

PieBox's current approach:

  • Run type checking immediately after every file operation
  • On failure, attempt at most 2 rounds of auto-fix; beyond that, stop and request human intervention
  • Critical steps (modifying public interfaces, deleting files) get confirmation checkpoints
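
The check-then-escalate loop described above might look like the following. This is a hedged sketch rather than PieBox's implementation: `apply_with_checks`, `run_action`, `NeedsHuman`, and `CheckResult` are hypothetical names, and the type checker, fixer, and confirmation prompt are passed in as plain callables so that no particular toolchain is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_FIX_ROUNDS = 2
CRITICAL_ACTIONS = {"modify_public_interface", "delete_file"}


class NeedsHuman(Exception):
    """Raised when the agent should stop and ask instead of continuing."""


@dataclass
class CheckResult:
    ok: bool
    errors: List[str]


def apply_with_checks(apply_edit: Callable[[], None],
                      type_check: Callable[[], CheckResult],
                      try_fix: Callable[[List[str]], None],
                      max_fix_rounds: int = MAX_FIX_ROUNDS) -> int:
    """Apply one file edit, check it, and auto-fix a bounded number of times.

    Returns the number of fix rounds used; raises NeedsHuman if the check
    still fails after max_fix_rounds attempts.
    """
    apply_edit()
    result = type_check()   # check immediately after the file operation
    rounds = 0
    while not result.ok:
        if rounds >= max_fix_rounds:
            raise NeedsHuman(f"still failing after {rounds} fix rounds: {result.errors}")
        try_fix(result.errors)
        rounds += 1
        result = type_check()
    return rounds


def run_action(action: str, do: Callable[[], None],
               confirm: Callable[[str], bool]) -> bool:
    # Confirmation checkpoint: critical actions run only if confirm() approves.
    if action in CRITICAL_ACTIONS and not confirm(action):
        return False
    do()
    return True
```

Because the check runs after every edit, a broken signature surfaces on the file that introduced it instead of after later steps have built on it.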

Lesson: it's better for the agent to stop and ask you than to keep running with errors.


Lesson 4: Don't Worship a Single Model

Early on, we ran everything through GPT-4. Great results, unsustainable costs. Then we switched entirely to DeepSeek: costs dropped, but so did stability. The eventual sweet spot was a mixed setup, with different models for different stages.

PieBox's current default routing:

  • Planning → DeepSeek-R1 (strong reasoning)
  • Code generation → GPT-4o or Claude (precision)
  • File exploration → lightweight model (fast, cheap)
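
A routing table like the one above can be expressed as a stage-to-candidates map with a simple fallback. The sketch below is illustrative only: the stage keys and model identifier strings (including `small-fast-model`) are placeholders, not real API model names, and `pick_model` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Ordered preferences per stage. Identifiers are placeholders for illustration.
ROUTES: Dict[str, List[str]] = {
    "planning": ["deepseek-r1"],          # strong reasoning
    "codegen": ["gpt-4o", "claude"],      # precision
    "explore": ["small-fast-model"],      # fast, cheap
}


def pick_model(stage: str, available: Set[str]) -> str:
    """Return the first preferred model for a stage that is currently available."""
    if stage not in ROUTES:
        raise KeyError(f"unknown stage: {stage}")
    for model in ROUTES[stage]:
        if model in available:
            return model
    raise RuntimeError(f"no model available for stage {stage!r}")
```

Keeping the table as data rather than code means a stage can be re-pointed at a different model without touching the agent loop.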

Lesson: no single model is optimal at every stage. An agent's capability shouldn't be capped by a single model's ceiling.


Lesson 5: Users Think It's "Smart," and That's Not About Engineering Complexity

The most counterintuitive lesson: how users evaluate an agent has almost nothing to do with its engineering complexity. Users feel an agent is "smart" when it guesses what they meant without being told, gets things right before they have to interrupt, and clearly explains why it failed.

And all of these experiences sit on engineering details: fast failure feedback, clear error messages, recoverable operations, transparent state display that doesn't require users to guess. An agent's perceived "intelligence" isn't model capability — it's a byproduct of engineering UX.


Not one of these five lessons is about "how to train a better model." After more than a year of building, one thing is clear to us: an AI coding agent is fundamentally a distributed systems problem. Model reasoning is just one piece; what matters more is how the system collaborates safely and efficiently with users, file systems, and tool chains.

PieBox was built on exactly these lessons. If you're building an agent or looking for an engineering-driven AI coding tool, give PieBox a try.