AI Agent Architecture: From Harness to Self-Evolution

Let me start with a confession that should worry you slightly. The large language model at the heart of every “AI agent” you’ve read about is, on its own, utterly helpless. It cannot remember what it did five minutes ago, press a button, read a file, run a test, or check whether its own brilliant answer is actually correct. It is a disembodied oracle: a brain in a jar that wakes up, receives a prompt, hallucinates a plausible continuation, and then dies. Every single call is its first day on the job and its last.

And yet these same models now write pull requests, refactor codebases overnight, debug production systems, and, in the most ambitious research, rewrite themselves. How? The magic isn’t in the model. It’s in the elaborate exoskeleton we bolt around it, told here in four layers: the harness, the loop, the organization, and finally the evolution. By the end, I want you to see agents not as clever chatbots but as what they really are: a stochastic brain wearing an increasingly sophisticated mechanical suit, à la a very nerdy Iron Man.

The Ground Floor: Why a Brain in a Jar Can’t Get Anything Done

Let’s establish the problem from first principles.

A base language model is a function. You hand it a string of text (the “context window,” everything it can “see” at once), and it returns a probability distribution over the next token, samples one, appends it, and repeats until it stops. Three properties fall out of this, and all three are fatal for autonomy.

First, it is stateless. Nothing persists between calls. Open a fresh ChatGPT window and it has no memory of yesterday’s conversation; if you want it to “remember,” you must paste that chat back into today’s prompt. The model has no diary.

Second, it is disembodied. It has no hands. Ask it to clean up your project and it will happily write the words “run rm -rf build/,” but it has no mechanism to run that command.

Third, it cannot verify itself. Because generating text and executing text are different acts, it can produce a Python function that looks flawless and is silently off by one index, with no idea anything is wrong.

So the naive approach, “just ask the smart model to do the task,” fails not because the model is dumb, but because it’s sealed. The entire field of agentic AI is the engineering discipline of un-sealing it safely. Which brings us to the key equation:

Agent = Model + Harness

The model is the ghost. The harness is the body.

Layer One: The Harness, a Nervous System for a Ghost

The agent harness is all the deterministic (predictable, non-random) software that surrounds the model and connects it to the world. It gives the ghost hands, memory, senses, and, crucially, restraints. If you’ve used Claude Code or Cursor, you’ve used a harness: the model is one component inside a much larger program that reads your files, runs your commands, and manages your context. Here’s the counterintuitive punchline I keep coming back to: the harness often matters more than the model. A brilliant model in a leaky harness loses to a mediocre model in a beautifully engineered one.

What’s in this body? A few organs, each with a concrete form.

The system prompt and conventions are the injected instructions, the standing orders slipped invisibly into the context at the start of every run. In practice this is a file like CLAUDE.md or .cursorrules in your repo saying “always run the test suite before committing” or “this project uses pnpm, never npm.”

The tools are the hands. Through interfaces like the Model Context Protocol (MCP, an open standard for connecting models to external tools and data), the model can request actions: a GitHub MCP server lets an agent open a pull request; a Slack connector lets it post a message. The model doesn’t execute these itself. It emits a request, and the harness, the trustworthy deterministic part, runs it and hands back the result.
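
Here is a minimal sketch of that request-then-execute handoff, assuming the model emits its tool request as a JSON string. The tool names (add, shout) and the request shape are hypothetical illustrations, not MCP's actual wire format.

```python
import json

# Hypothetical tool registry: names and signatures are illustrative only.
def add(a: float, b: float) -> float:
    return a + b

def shout(text: str) -> str:
    return text.upper()

TOOLS = {"add": add, "shout": shout}

def run_tool_request(raw_request: str) -> dict:
    """Harness side: parse the model's request, execute it, return a result."""
    try:
        request = json.loads(raw_request)
        name, args = request["tool"], request.get("args", {})
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return {"ok": False, "error": f"malformed request: {exc}"}
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](**args)}
    except Exception as exc:  # tool errors go back to the model as text
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}

print(run_tool_request('{"tool": "add", "args": {"a": 2, "b": 3}}'))
print(run_tool_request('{"tool": "rm_everything"}'))
```

The model never touches TOOLS directly. Every failure, including a request for a tool that doesn't exist, comes back as data the model can read on its next turn.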

The durable infrastructure is the body’s persistent memory: a virtual filesystem, isolated Git worktrees, secure sandboxes. Since the model can’t remember, a coding agent might save its plan to a progress.md file so that even after its context is wiped, the next run picks up where it left off.

And the deterministic middleware is the immune system: hooks that fire at set moments. A pre-execution hook that blocks any command containing rm -rf /, or a linter that runs after every file edit, are middleware. When a check fails, the harness feeds the error back to the model as what engineers charmingly call backpressure: the polite word for “no, try again, here’s exactly what you broke.”
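
A small sketch of such a hook, assuming commands reach the harness as plain strings. The deny-list and function names are hypothetical, and a real policy would need to be far more careful than one regular expression.

```python
import re
from typing import Callable, Optional

# Hypothetical deny-list: matches "rm -rf /" aimed at the filesystem root only.
BLOCKED_PATTERNS = [re.compile(r"\brm\s+-rf\s+/(\s|$)")]

def guard_command(command: str) -> Optional[str]:
    """Return an error message if the command is blocked, else None."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(command):
            return f"BLOCKED: '{command}' matches a forbidden pattern. Choose a narrower path."
    return None

def run_with_hooks(command: str, execute: Callable[[str], str]) -> str:
    """Run the pre-execution hook; only call `execute` if it passes."""
    rejection = guard_command(command)
    if rejection is not None:
        return rejection  # backpressure: returned to the model instead of running
    return execute(command)

fake_shell = lambda cmd: f"ran: {cmd}"   # stand-in for a real executor
print(run_with_hooks("rm -rf build/", fake_shell))
print(run_with_hooks("rm -rf /", fake_shell))
```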

The Villain of This Story: Context Rot

The context window is finite, and (this surprises people) it’s not a clean bucket you can fill to the brim without consequence. As you stuff more text in (a long conversation, a giant tool output, a sprawling log), the model’s attention mechanism gets saturated. Picture an agent forty tool-calls deep into a debugging session that suddenly forgets the original bug and starts polishing an unrelated function. That’s context rot, and it is the single most important failure mode in practical agent engineering. Imagine holding a conversation while someone slowly wheels a hundred filing cabinets into the room and dumps their contents on the floor: at some point you can’t find the one document that matters.

The harness fights this with a few weapons. Compaction: when the context nears its limit, the harness summarizes older messages (“we established the bug is a null pointer in auth.py”) and drops the raw back-and-forth. Tool-call offloading: when npm install or a pytest run spits out four thousand lines, the harness keeps only the head and tail (where the error usually lives) and writes the full log to disk. Progressive disclosure: instead of loading every tool at startup (which rots the context before the agent even thinks), the harness reveals capabilities just in time. The “how to run a database migration” skill only gets injected when the agent is about to touch the database.
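
The first two weapons fit in a few lines each. This is a sketch under simple assumptions: messages are plain strings, and summarize is a hypothetical callable (in practice another model call) supplied by the caller.

```python
def truncate_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines of a long tool output."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])

def compact(messages: list[str], keep_recent: int, summarize) -> list[str]:
    """Replace all but the last `keep_recent` messages with one summary."""
    if len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    return [f"SUMMARY: {summarize(old)}"] + recent

log = "\n".join(f"line {i}" for i in range(4000))
print(len(truncate_output(log).splitlines()))   # head + marker + tail
history = ["msg 1", "msg 2", "msg 3", "msg 4"]
print(compact(history, keep_recent=2, summarize=lambda old: f"{len(old)} earlier messages"))
```

A real harness would also write the full log to disk before truncating, so nothing is lost, only moved out of the model's sight.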

The Sandbox and the Ratchet

Remember the ghost’s third curse, that it can’t check its own work? The harness cures this with a sandbox: a safe, disposable environment with real compilers, test frameworks, and browsers, often a Docker container spun up for the task. Now the loop closes on reality: the agent writes a function, the sandbox runs pytest, and the harness feeds back the true result (pass, fail, or a specific traceback). The model is no longer guessing; it’s being told, by the unforgiving physics of an actual computer.
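
To make that feedback path concrete, here is a sketch that runs a candidate function plus its assertions in a child Python process and reports what actually happened. A subprocess in a temp directory is an assumption of convenience: it stands in for the Docker container and is not a security boundary. The function name run_in_sandbox is hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_in_sandbox(candidate_code: str, test_code: str, timeout: float = 10.0) -> dict:
    """Run candidate + tests in a child process; return pass/fail and stderr."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "check.py"
        script.write_text(candidate_code + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "feedback": f"timed out after {timeout}s"}
    return {"passed": proc.returncode == 0, "feedback": proc.stderr.strip()}

# The off-by-one bug from earlier: looks plausible, fails on contact with reality.
buggy = "def last(xs):\n    return xs[len(xs)]"
tests = "assert last([1, 2, 3]) == 3"
result = run_in_sandbox(buggy, tests)
print(result["passed"])
print(result["feedback"].splitlines()[-1])   # the traceback's final line
```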

This enables what practitioners call the Ratchet Mindset. Say your agent keeps forgetting to run the linter before committing. The lazy interpretation is “the model is dumb, let’s wait for a smarter one.” The Ratchet Mindset says no: this is a skill issue in the scaffolding, not the model. So you fix the harness, here by adding a pre-commit hook that runs the linter automatically, so this exact failure can never happen again. Like a ratchet wrench, it only turns one way: toward more reliability. Every mistake becomes a permanent upgrade to the body.

There’s even a beautifully crude technique for very long tasks called the Ralph Loop, coined by Geoffrey Huntley in mid-2025. In its purest form it’s almost insultingly simple, a Bash loop like while :; do cat PROMPT.md | claude-code ; done. The agent reads its instructions from a file, does one chunk of work, saves progress to disk, and exits. Then the loop just restarts it with a completely fresh, un-rotted context. Progress lives on disk, so the amnesia becomes a feature.
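
The same idea in Python, as a toy: the worker below is a stand-in for one fresh agent run and knows only what it reads from disk. The file layout (a JSON todo/done list) and the max_runs cap are my assumptions for illustration; the original Bash loop has neither.

```python
import json
import tempfile
from pathlib import Path

def do_one_chunk(progress: dict) -> dict:
    """Stand-in for one fresh agent run: it sees only what is on disk."""
    if progress["todo"]:
        progress["done"].append(progress["todo"].pop(0))
    return progress

def ralph_loop(state_file: Path, max_runs: int = 100) -> int:
    """Restart a stateless worker until the on-disk todo list is empty."""
    runs = 0
    while runs < max_runs:
        progress = json.loads(state_file.read_text())    # fresh start: reload from disk
        if not progress["todo"]:
            break
        state_file.write_text(json.dumps(do_one_chunk(progress)))  # save, then "exit"
        runs += 1
    return runs

with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "progress.json"
    state.write_text(json.dumps({"todo": ["parse", "refactor", "test"], "done": []}))
    runs = ralph_loop(state)
    print(runs, json.loads(state.read_text())["done"])
```

Nothing carries over between iterations except the file, which is the whole point.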

Layer Two: Loop Engineering, Teaching the Suit to Move on Its Own

The harness is the body. But a body just standing there is a statue. Loop engineering is the discipline of designing the motion, the recursive cycle by which the agent acts on its own without a human turning the crank at each step.

Formally, a loop is this cycle: the agent acts on its environment, observes the feedback, reasons about whether that got it closer to the goal, and decides the next step, repeating until a termination condition is met. This is the famous ReAct pattern (Reason + Act). Concretely, a debugging agent thinks “the test probably fails because the config is missing” (reason), reads the config file (act), sees it’s actually present (observe), then updates its theory. The point is to move iteration off the human and onto the machine.

Here’s the simplest honest version of an agent loop in Python. Strip away every framework and this is the beating heart of all of them:

def agent_loop(model, goal, tools, max_steps=10):   # model: any object with a .decide() method
    memory = []                      # the scratchpad; the model has no memory of its own
    for step in range(max_steps):    # termination guard: never loop forever
        # 1. REASON + ACT: ask the model what to do next, given the goal + history
        thought, action = model.decide(goal, memory, tools)

        # 2. Check if the model believes it's finished
        if action.type == "DONE":
            return action.result

        # 3. ACT for real: the *harness* executes; the model only requested it
        observation = tools.execute(action)   # e.g. run a test, read a file

        # 4. OBSERVE: fold the result back into memory for the next round
        memory.append((thought, action, observation))

    return "Stopped: hit step limit"   # honest failure beats an infinite loop

Notice what each safety rail does. max_steps stops the agent spiralling forever and torching your budget. The DONE check is the termination logic; without a crisp, testable definition of “done,” an agent quits too early or runs until the heat death of the universe. And memory is the working state that survives across iterations; in this bare version it simply grows, so a production harness would compact it before it rots the context.
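
The loop leaves the model and the tools for you to supply. To watch it run end to end without a real LLM, here is a sketch with a scripted stand-in for each, passing the model in as an explicit argument. Action, Toolbox, and ScriptedModel are hypothetical names invented for this demo.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Action:
    type: str            # "TOOL" or "DONE"
    name: str = ""       # which tool to call
    arg: Any = None      # single argument, for simplicity
    result: Any = None   # final answer when type == "DONE"

class Toolbox:
    """Hypothetical tool set with a single `execute` entry point."""
    def __init__(self, files):
        self.files = files
    def execute(self, action):
        if action.name == "read_file":
            return self.files.get(action.arg, "ERROR: no such file")
        return f"ERROR: unknown tool {action.name}"

class ScriptedModel:
    """Stand-in for an LLM: reads the config once, then reports what it saw."""
    def decide(self, goal, memory, tools):
        if not memory:
            return "I should check the config first", Action("TOOL", "read_file", "config.toml")
        _, _, observation = memory[-1]
        return "I have what I need", Action("DONE", result=f"config says: {observation}")

def agent_loop(model, goal, tools, max_steps=10):
    memory = []
    for _ in range(max_steps):
        thought, action = model.decide(goal, memory, tools)
        if action.type == "DONE":
            return action.result
        memory.append((thought, action, tools.execute(action)))
    return "Stopped: hit step limit"

tools = Toolbox({"config.toml": "debug = true"})
print(agent_loop(ScriptedModel(), "find the debug setting", tools))
print(agent_loop(ScriptedModel(), "find the debug setting", tools, max_steps=1))
```

The second call shows the step limit doing its job: one step is enough to read the file but not to report on it, so the loop stops honestly instead of pretending.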

A well-engineered loop rests on five non-negotiable pillars. It needs a clear, testable goal you can check programmatically (“all tests pass” is testable; “make the code nicer” is not), a tool set, disciplined context management, hardcoded termination logic, and error handling that tells recoverable hiccups (a typo in generated code, so retry it) apart from hard blockers (a missing API key, so stop and ask a human rather than flail).
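
That last pillar is the one I see skipped most often, so here is a sketch of it. The two exception classes and the status strings are hypothetical. The shape is what matters: retry what the agent can fix, escalate what it can't.

```python
class RecoverableError(Exception):
    """Something the agent can plausibly fix by trying again."""

class HardBlocker(Exception):
    """Something only a human can fix, e.g. a missing credential."""

def run_with_retries(step, max_retries: int = 3) -> dict:
    """Retry recoverable failures; surface hard blockers immediately."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return {"status": "ok", "value": step(attempt)}
        except RecoverableError as exc:
            last_error = str(exc)          # would be fed back to the model
        except HardBlocker as exc:
            return {"status": "needs_human", "reason": str(exc)}
    return {"status": "gave_up", "reason": last_error}

def flaky_step(attempt):
    if attempt < 2:
        raise RecoverableError("SyntaxError in generated code")
    return "tests passed"

def blocked_step(attempt):
    raise HardBlocker("missing API key")

print(run_with_retries(flaky_step))
print(run_with_retries(blocked_step))
```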

To run across long horizons, engineers assemble six primitives, each with a concrete form. Automations are schedulers and webhooks that start loops without a human, like a nightly cron job that triages new GitHub issues. Worktrees are isolated copies of the code so parallel agents don’t overwrite each other, like two chefs each with their own kitchen station instead of fighting over one cutting board. Skills are rules written into persistent files (a SKILL.md describing your deployment process), since the model forgets everything between runs. Plugins and connectors reach out to external APIs, like a Linear connector that reads your ticket queue. Sub-agents delegate. And state persistence writes the loop’s status to disk so it survives a session teardown.

That fifth primitive deserves a spotlight, because it enables the Maker-Checker split: the same stochastic model that wrote the code should never be the sole judge of whether the code is good. So you spawn a separate sub-agent to review it, exactly as a coding agent might hand its diff to a distinct “reviewer” agent before merging. It’s the ancient wisdom that you don’t let students mark their own exams, ported into AI.
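
A sketch of the split as a control-flow pattern, assuming the maker and checker are separate callables (in practice, separate model sessions with separate prompts). The function name and signatures are hypothetical.

```python
from typing import Callable, Optional

def maker_checker(
    make: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Loop until the checker approves, or give up after max_rounds.

    make(feedback)   -> a candidate (feedback is None on the first round)
    check(candidate) -> None if approved, else a review comment
    """
    feedback = None
    for _ in range(max_rounds):
        candidate = make(feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate      # approved by the independent reviewer
    return None                   # never approved: escalate to a human

# Toy maker and reviewer: the reviewer insists on a leading comment.
maker = lambda fb: "x = 1" if fb is None else "# set x\nx = 1"
reviewer = lambda code: None if code.startswith("#") else "add a comment"
print(maker_checker(maker, reviewer))
```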

And loops aren’t one-size-fits-all. Different problems demand different shapes:

Retry loops for short tasks with a binary pass/fail: write a function, run the unit test, retry until it goes green.
Plan-Execute-Verify loops for complex, sequential work: refactoring a whole module by first writing a plan to disk, then executing it step by step, verifying after each.
Explore-Narrow loops for ambiguous problems: to debug a weird undocumented error, send parallel scouts down different routes, then have an orchestrator collapse onto the best one.
Hill-climbing and event-driven loops for background maintenance: one that quietly hunts for small performance wins overnight, or a monitor that lies dormant until a “server CPU high” webhook fires.
Human-in-the-loop for high-stakes work: a deploy-to-production agent runs autonomously until its uncertainty crosses a threshold, then stops and taps a human on the shoulder.
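
Of these shapes, Explore-Narrow is the least obvious in code, so here is a sketch of it. The scout and score functions are hypothetical stand-ins for sub-agents and an orchestrator's ranking rule, and the list of hypotheses must not be empty.

```python
from concurrent.futures import ThreadPoolExecutor

def explore_narrow(hypotheses, scout, score):
    """Run one scout per hypothesis in parallel, then keep the best finding."""
    with ThreadPoolExecutor(max_workers=max(1, len(hypotheses))) as pool:
        findings = list(pool.map(scout, hypotheses))   # explore
    return max(findings, key=score)                    # narrow

# Toy data: how much supporting evidence each route turns up.
evidence = {"config": 1, "network": 4, "cache": 2}
scout = lambda h: {"hypothesis": h, "evidence": evidence[h]}
best = explore_narrow(list(evidence), scout, score=lambda f: f["evidence"])
print(best["hypothesis"])
```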

Layer Three: Agentic Engineering, When One Iron Man Becomes an Army

Zoom out. What happens when an entire company runs hundreds of these at once? That is Agentic Engineering, a term Andrej Karpathy coined in early 2026 as the disciplined, grown-up cousin of freewheeling “vibe coding.” Picture a team where twenty coding agents each work a separate ticket while three human engineers supervise the swarm.

The founding recognition is blunt: these tools are powerful and unreliable. So the philosophy flips the human’s job. The agents write the code; the human oversees, validates, and (the load-bearing word) assumes liability. Human judgment and taste remain the final arbiters. The machines do the typing; you sign the check.

Doing this at enterprise scale requires real infrastructure. Worker agents are digital counterparts to individual engineers, each bounded by strict rules and assigned, say, one Jira ticket at a time. Shared long-term memory means a lesson one agent learns (“the staging database rejects connections over IPv6”) is instantly available to all of them. And above all, global observability and tracing, using tools like LangSmith or a custom trace store, capture every action of every agent so the whole system is auditable end to end.
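
A custom trace store can start very small. This sketch appends one JSON object per action to a JSON Lines file. The event schema (ts, agent, action, outcome) is a hypothetical one of my own, not LangSmith's format.

```python
import json
import tempfile
import time
from pathlib import Path

class TraceStore:
    """Append-only JSON Lines log of agent actions (hypothetical schema)."""
    def __init__(self, path: Path):
        self.path = path

    def record(self, agent_id: str, action: str, outcome: str) -> None:
        event = {"ts": time.time(), "agent": agent_id, "action": action, "outcome": outcome}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def events_for(self, agent_id: str) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            events = [json.loads(line) for line in f if line.strip()]
        return [e for e in events if e["agent"] == agent_id]

with tempfile.TemporaryDirectory() as tmp:
    store = TraceStore(Path(tmp) / "traces.jsonl")
    store.record("worker-7", "run pytest", "2 failed")
    store.record("worker-9", "open PR", "ok")
    print([e["action"] for e in store.events_for("worker-7")])
```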

But here’s the part to sit with, because it’s about you, not the machines. When agents ship mountains of code you never wrote or read, two quiet dangers creep in. The first is comprehension debt: the fifty-thousand-line service that works fine until it breaks at 2 a.m. and nobody understands how it fits together. The second, more insidious, is cognitive surrender: the slow abdication of judgment where you stop forming your own technical opinion and just accept the agent’s word for “done.” I treat this as a genuine hazard, not a footnote. Agentic engineering is, in a sense, a set of guardrails against your own tempting laziness. The agents can do the labor. They cannot do the responsibility.

Layer Four: Self-Evolving Agents, When the Suit Redesigns Itself

Everything so far has been static. The harness is fixed, the loops are fixed, and the model’s weights (the billions of numbers encoding what it knows) are frozen. Now that assumption dies. Self-evolving agents autonomously rewrite their own components based on experience. This is where it stops feeling like engineering and starts feeling faintly alive.

To think clearly, I find it helps to organize the field around a taxonomy: what evolves, when, and how. The “what” is the sharpest lens. There are four loci (places) where change can happen, ordered by how deep and irreversible the surgery is, each with a concrete example:

External memory: the agent writes down facts for later, like a note that says “this repo uses pnpm, not npm.” Cheap, easily undone, but it doesn’t generalize far. (Keeping a notebook.)
Tool inventory: the agent writes and saves a new tool for itself, say a small resize_image() helper it keeps reusing. A bit deeper, and it transfers across related tasks. (Forging your own wrench.)
Decision policy: the agent changes the rules by which it decides what to retrieve or discard, for instance learning “when debugging, always read the stack trace before the source.” Broadly transferable. (Changing how you think, not just what you know.)
Parametric weights: the agent absorbs lessons directly into its neural weights via a technique like LoRA. Expensive, nearly irreversible, and the closest thing to genuine generalized learning. (Rewiring your brain.)

This ordering sets up the field’s juiciest fight.

The Great Debate: Is Memory the Same as Learning?

The Memory-First camp says an agent evolves by growing its external memory (systems like ReasoningBank live here). The Learning-First camp scoffs that this is just “memory in a costume”: the brain is still frozen, you’ve merely handed it a fatter cheat sheet, and real evolution must change the policy or the weights.

How do you tell them apart? With a clean thought experiment: the Memory Wipe Test. Erase the agent’s accumulated memory and see what happens. If performance collapses back to the base model’s baseline, you had a cheat sheet, not a smarter student. If it holds up, something real was learned.
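
As a sketch, the test is just two evaluations and a comparison. The agent_factory interface and the tolerance value are assumptions of mine, and the toy agent below is a deliberately pure cheat sheet.

```python
def memory_wipe_test(agent_factory, tasks, tolerance: float = 0.05) -> dict:
    """Compare accuracy with the accumulated memory intact vs. erased."""
    def accuracy(agent):
        return sum(agent.solve(q) == a for q, a in tasks) / len(tasks)
    with_memory = accuracy(agent_factory(wipe=False))
    wiped = accuracy(agent_factory(wipe=True))
    return {"with_memory": with_memory, "wiped": wiped,
            "survives_wipe": wiped >= with_memory - tolerance}

class CheatSheetAgent:
    """Toy agent whose only 'knowledge' is a lookup table in external memory."""
    def __init__(self, memory):
        self.memory = memory
    def solve(self, question):
        return self.memory.get(question, "unknown")

tasks = [("2+2", "4"), ("capital of France", "Paris")]
notes = dict(tasks)
result = memory_wipe_test(lambda wipe: CheatSheetAgent({} if wipe else notes), tasks)
print(result)
```

The cheat-sheet agent scores perfectly with its notes and zero without them, which is exactly the collapse the test is designed to expose.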

One system I find genuinely illuminating here, MemRL, is honest about which side it’s on. MemRL keeps the model frozen but learns an optimal retrieval policy over its memory, treating “which memory should I pull up now?” as a value-based decision optimized with Bellman-style temporal-difference learning (the math behind reinforcement learning: estimate the long-term value of each choice, then nudge those estimates toward reality). On ALFWorld, a simulated household environment where an agent follows instructions like “put a clean mug in the coffee machine,” its authors reported a last-epoch accuracy near 0.507, roughly a 56% jump over a memory baseline and 82% over no memory. But because the model never changes, MemRL would fail the Memory Wipe Test: erase the memory bank and it falls back to baseline. It learned a great policy for using its notes, but the notes still do the heavy lifting. (Treat single-benchmark numbers like these as directional evidence from the source systems, not settled laws of nature.)
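
To give the parenthetical about temporal-difference learning some shape, here is a generic one-step TD update over a table of per-memory values with epsilon-greedy retrieval. This is a textbook-style sketch, not MemRL's published algorithm. The class name, hyperparameters, and reward rule are all hypothetical.

```python
import random

class MemoryValueTable:
    """Per-memory utility estimates, updated with a one-step TD rule."""
    def __init__(self, memory_ids, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = {m: 0.0 for m in memory_ids}
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self):
        """Epsilon-greedy: usually the highest-valued memory, sometimes a random one."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, memory_id, reward, next_value=0.0):
        """Q <- Q + alpha * (reward + gamma * next_value - Q)."""
        target = reward + self.gamma * next_value
        self.q[memory_id] += self.alpha * (target - self.q[memory_id])

good, bad = "read the stack trace first", "restart the server"
table = MemoryValueTable([bad, good])      # the unhelpful note is listed first
for _ in range(500):
    chosen = table.choose()
    table.update(chosen, reward=1.0 if chosen == good else 0.0)
print(max(table.q, key=table.q.get))
```

The model in this picture never changes. Only the table of values does, which is why wiping the memory bank takes the gains with it.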

Two Ways to Evolve Without Lying to Yourself

Two mechanisms stand out. The first solves a nasty trap. When an agent evolves by generating natural-language context to steer its own frozen brain, it tends to overfit: it invents hyper-specific instructions that ace the exact problems it’s seen and faceplant on anything new, like a student who memorizes last year’s exam and panics when the questions change. RSEA (Recursive Self-Evolving Agents) fixes this with a trick from good machine-learning hygiene: held-out selection. Every proposed self-improvement is tested against a separate validation set the agent never trained on, and if it doesn’t beat the plain baseline on unseen problems, it’s rejected. The ablation results RSEA’s authors published are a perfect little morality tale: strip out this guardrail and the agent hits a perfect 100% on its training problems while cratering by 33 points on the real test. That gap is the mathematical signature of a system fooling itself.
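
Held-out selection itself is simple enough to sketch in a dozen lines. This is the general technique, not RSEA's actual code. The policies here are toy functions standing in for candidate self-modifications, and all names are hypothetical.

```python
def evaluate(policy, dataset) -> float:
    """Fraction of (input, expected) pairs the policy gets right."""
    return sum(policy(x) == y for x, y in dataset) / len(dataset)

def held_out_select(baseline, candidates, evaluate, validation_set):
    """Keep a candidate only if it beats the incumbent on unseen problems."""
    best, best_score = baseline, evaluate(baseline, validation_set)
    for candidate in candidates:
        score = evaluate(candidate, validation_set)
        if score > best_score:          # strict: ties keep the incumbent
            best, best_score = candidate, score
    return best, best_score

# Toy task: double the input.
train = [(1, 2), (2, 4)]
validation = [(5, 10), (7, 14)]           # never shown during "self-improvement"

def baseline(x):   return x + 1           # right once on train, by luck
def memorizer(x):  return {1: 2, 2: 4}.get(x)   # aces train, knows nothing else
def general(x):    return 2 * x           # the rule that actually generalizes

print(evaluate(memorizer, train), evaluate(memorizer, validation))
winner, score = held_out_select(baseline, [memorizer, general], evaluate, validation)
print(winner.__name__, score)
```

The memorizer is the student with last year's exam: perfect on the problems it has seen, rejected the moment it is scored on ones it hasn't.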

The second, TMEM, goes all the way to the weights. Traditional memory-augmented agents store the past in prompt space, so anything that falls out of the context window is gone forever, like notes on a whiteboard erased at day’s end. TMEM blurs the line between “text I remember” and “who I am” by distilling lessons into small, fast LoRA updates. LoRA (Low-Rank Adaptation) nudges a giant model’s behavior by training a tiny number of extra parameters instead of all billions. You learn a small correction, written as Δ (delta), and add it on:

W_effective = W_frozen + Δ

In plain English: keep the huge original weight matrix W frozen, and learn a small, cheap adjustment Δ that steers behavior. TMEM computes these Δ updates online, within a single episode, so behavior genuinely changes mid-task. Because the lesson lives in the weights, not the prompt, TMEM sails through the Memory Wipe Test: wipe the notes and the learning remains, baked in. Related systems like SKILL0 and SDAR push the same idea, reporting gains even with zero external skills at inference time.
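
Why is Δ cheap? In standard LoRA it is stored as the product of two thin matrices, B and A, so the number of trainable values grows with the rank r rather than with the full matrix size. The sketch below shows only that structure, with toy sizes and plain Python lists. It omits LoRA's scaling factor and says nothing about how TMEM computes its updates online, which is not covered here.

```python
def matmul(A, B):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

d, r = 4, 1                                   # toy sizes: width 4, adapter rank 1
W_frozen = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # never trained
B = [[0.5], [0.0], [0.0], [0.0]]              # d x r, trainable
A = [[0.0, 2.0, 0.0, 0.0]]                    # r x d, trainable

delta = matmul(B, A)                          # d x d correction built from 2*d*r numbers
W_effective = add(W_frozen, delta)            # W_effective = W_frozen + delta

full_params = d * d                           # values in the full matrix
lora_params = d * r + r * d                   # values actually trained
print(W_effective[0], full_params, lora_params)
```

At these toy sizes the saving is only half. At real model widths, with d in the thousands and r in the single or double digits, it is orders of magnitude.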

The Gyms Where Agents Train

None of this happens in a vacuum. Agents need reinforcement-learning environments: gymnasiums where they attempt tasks, get rewards, and learn from outcomes. I think of them in three tiers. Pure task libraries (like RLVE or Reasoning Gym) are bare problems plus a verifier, a punching bag. Environment frameworks (like OpenEnv) are rich, stateful worlds with multi-turn tool use, a full sparring ring; WebArena, which drops an agent into realistic websites to complete tasks, is exactly this. Environment-plus-training bundles (like Verifiers or SkyRL Gym) package the environment, the trajectory recording, and the training loop together: the whole gym. These deploy either as networked servers (which scale beautifully) or in-process (no network latency, but at the risk of dependency conflicts). The trade-off is the eternal one: isolation versus speed.
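
The first tier, problems plus a verifier, is small enough to sketch whole. The reset/step interface below is loosely modeled on the familiar Gym convention, but it is my own simplification and not the API of any library named above.

```python
class ArithmeticEnv:
    """Toy task environment: one question per episode, checked by a verifier."""
    def __init__(self, problems):
        self.problems = problems          # list of (question, expected_answer)
        self.index = -1

    def reset(self) -> str:
        """Start the next episode and return its observation (the question)."""
        self.index = (self.index + 1) % len(self.problems)
        return self.problems[self.index][0]

    def step(self, answer: str):
        """Verify the answer; return (reward, done). Single-turn, so always done."""
        expected = self.problems[self.index][1]
        return (1.0 if answer.strip() == expected else 0.0), True

def rollout(env, policy, episodes: int):
    """Collect (question, answer, reward) records for later training."""
    trajectory = []
    for _ in range(episodes):
        question = env.reset()
        answer = policy(question)
        reward, _done = env.step(answer)
        trajectory.append((question, answer, reward))
    return trajectory

def adder(question: str) -> str:
    """Stand-in policy that actually solves 'a+b' questions."""
    a, b = question.split("+")
    return str(int(a) + int(b))

env = ArithmeticEnv([("2+3", "5"), ("10+1", "11")])
print(rollout(env, adder, episodes=2))
```

This version runs in-process. Putting the same reset and step calls behind a network server is what buys isolation at the cost of speed.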

The Whole Machine: How the Four Layers Feed Each Other

Here’s the synthesis. These four ideas aren’t a list; they’re a causal chain that closes into a self-improving spiral. The harness sets the ceiling of what the agent can perceive and do. Inside that ceiling, loops run and generate enormous volumes of execution traces, records of what was tried and what happened. Agentic engineering’s observability infrastructure captures those traces across the organization. And the beautiful part: those audited traces are exactly the training data that reinforcement-learning environments need. Fed into those gyms, self-evolving agents analyze their own failures, write new tools, sharpen their retrieval policies, and update their LoRA weights, so the next round of loops runs better than the last.

In other words, the agent slowly takes over the job of the human harness engineer and starts rewriting its own scaffolding. The Ratchet Mindset, where every failure becomes a permanent upgrade, stops being something we do to the agent and becomes something the agent does to itself. That’s the loop behind the loop. Whether that road leads anywhere as dramatic as the “superintelligence” some researchers speculate about, I’ll leave open. But the mechanism for continual self-improvement is no longer science fiction. It’s a Bash while loop with ambitions.

Test Yourself: Reflection Questions

If you actually understood this, not memorized it but understood it, you should be able to wrestle with these:

The harness-vs-model claim. From first principles, why would a great harness with a mediocre model beat a frontier model with a leaky one? What failure modes does the harness prevent that raw intelligence cannot?
The Memory Wipe Test, inverted. Design an agent that would pass the Memory Wipe Test but that you’d argue hasn’t truly “learned” in any meaningful sense. Does the test have a blind spot?
Context rot’s paradox. A bigger context window seems like pure upside. Explain why enlarging it doesn’t cure context rot, and might even worsen it.
Cognitive surrender. If the human never reads the code, in what real sense is their judgment still the “final arbiter”? Where does the philosophy quietly break?
Overfitting as self-deception. RSEA’s un-guarded agent scored 100% on training and cratered on the test. Connect this to the classic bias-variance trade-off: why is a self-evolving agent especially prone to it compared to ordinary supervised learning?

Where to Go Next

Three concrete moves. Build the tiny loop: wire that 15-line agent_loop to one real tool (a Python exec sandbox is enough) and give it a goal with a testable pass condition, like “write a function that passes these three assertions.” Watch it retry. Read the primary sources: skim the original ReAct and LoRA papers. Instrument a failure: break your loop on purpose and practice the Ratchet Mindset by fixing the harness so it can’t recur, not the prompt.

If you’d like to go deeper on the theory underneath all this, two books I keep recommending to my own teams are Chip Huyen’s Designing Machine Learning Systems, the clearest treatment of building reliable ML infrastructure I’ve found, and Jay Alammar and Maarten Grootendorst’s Hands-On Large Language Models, which grounds the model side of the “Model + Harness” equation in real, runnable intuition. Both are worth a permanent spot on the shelf next to your keyboard.

A quick note on me: I’m Dr. Amita Kapoor, an AI researcher focused on large language models and agentic AI, and an unrepentant enthusiast for making genuinely hard ideas feel obvious. I’m the founder of NePeur, where I build and research agentic AI systems, and co-founder of Retured, where we apply machine learning to real-world problems. If this one earned its keep, the highest compliment you can pay it is to go build the tiny loop.

Affiliate disclaimer: Some book recommendations above are Amazon affiliate links. If you purchase through them, I may earn a small commission at no extra cost to you, which helps keep these long-form explainers free. I only ever recommend resources I’d genuinely put in front of my own team.
