Geometric

By Jack Foxabbott, Founding Member of Technical Staff

AST Edits: The Code Editing Format Nobody Uses

Every AI coding tool makes a bet on how its model should express code edits. Claude Code uses search/replace. Codex CLI uses a patch format. Cursor uses a dedicated apply model to rewrite files. Aider picks a format per model. Can Bölük proposed hashlines. None of the large providers use AST-targeted edits.

We tested all of them. AST won.


The benchmark

We built 29 editing tasks, from one-line fixes in 100-line files up to multi-site rewrites in 4,200-line modules. Each task gives the model a Python file, an instruction, and a test suite. The model edits the file. We check if the tests pass.

The edits themselves aren’t exotic: add a parameter, fix a bug, wire up a dependency. What makes the harder tasks hard is the file size. At 4,200 lines, a one-character mistake in a context line tanks the whole edit.

We ran all 29 tasks with 7 edit formats on 4 models: Claude Haiku 4.5, OpenAI o4-mini, GPT-5.4, and Claude Opus 4.6.

The 7 formats

The first three are formats used by major coding tools today:

  1. Whole file: the model generates the entire updated file in a fenced code block. This is our baseline, and the simplest approach. Aider uses it for some models. Cursor does something related but more sophisticated: a primary LLM generates a sketch of the changes, then a fine-tuned 70B “apply model” rewrites the full file to incorporate them, using speculative decoding to make this fast. In our benchmark we skip the two-model setup and just have a single model rewrite the file directly:
class LRUCache:
    def __init__(self, max_size: int = 128, ttl_seconds: int = 3600) -> None:
        self._max_size = max_size
        self._ttl_seconds = ttl_seconds  # new
        self._cache: dict[str, tuple[Any, float]] = {}
        # ... all 770 remaining lines reproduced verbatim ...

It always applies cleanly, but on a 4,200-line file the model has to regenerate the entire thing to change 5 lines. Cursor makes this fast with speculative decoding, but the tokens still get generated and billed.

  2. Search/replace: the model generates one or more search-and-replace blocks, each containing the exact old text and the new text to replace it with. Used by Claude Code and Aider:
<<<SEARCH
    def __init__(self, max_size: int = 128) -> None:
        self._max_size = max_size
>>>REPLACE
    def __init__(self, max_size: int = 128, ttl_seconds: int = 3600) -> None:
        self._max_size = max_size
        self._ttl_seconds = ttl_seconds
<<<END

The model has to reproduce the old code character-perfectly. One whitespace mismatch and the search fails.
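The applier itself is trivial; all of the fragility lives in that exact-match requirement. A minimal sketch (ours, not Claude Code’s or Aider’s actual code) shows both failure modes the format has to police:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search/replace block, rejecting the two classic
    failure modes: no exact match, and an ambiguous match."""
    count = source.count(search)
    if count == 0:
        # One whitespace difference from the real file lands here.
        raise ValueError("search text not found in file")
    if count > 1:
        # The search string matched more than one location.
        raise ValueError(f"search text is ambiguous ({count} matches)")
    return source.replace(search, replace, 1)
```

A single stray space in the search string and the first branch fires; a short search string in a large file and the second one does.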

  3. Unified diff: the model generates a diff in a fenced block, with @@ hunk headers, +/- line prefixes, and context lines. Aider uses a simplified version of this. Codex CLI uses a related format called V4A which has the same +/-/context structure but wraps it in explicit file headers (*** Update File:) and uses contextual anchors in its @@ headers rather than line numbers:
@@ -45,7 +45,8 @@ class LRUCache:
-    def __init__(self, max_size: int = 128) -> None:
+    def __init__(self, max_size: int = 128, ttl_seconds: int = 3600) -> None:
         self._max_size = max_size
+        self._ttl_seconds = ttl_seconds
         self._cache: dict[str, tuple[Any, float]] = {}

Compact, but context lines have to match exactly and hunk headers have to be right.
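To see why those requirements bite, here is a sketch of applying one hunk (our illustration, not Codex CLI’s implementation). The applier walks the hunk’s lines against the file, and any context or deletion line that doesn’t match byte-for-byte rejects the whole hunk:

```python
def apply_hunk(lines: list[str], start: int, hunk: list[str]) -> list[str]:
    """Apply one @@ hunk to `lines` (no trailing newlines).
    `start` is the 1-based old-file line the header claims the hunk
    begins at; each hunk line is prefixed ' ' (context), '-' (delete),
    or '+' (insert)."""
    out = lines[: start - 1]
    i = start - 1
    for h in hunk:
        tag, text = (h[0], h[1:]) if h else (" ", "")
        if tag in (" ", "-"):
            # Context and deletion lines must match the file exactly,
            # or the whole hunk is rejected.
            if i >= len(lines) or lines[i] != text:
                raise ValueError(f"could not match context at line {i + 1}")
            if tag == " ":
                out.append(lines[i])
            i += 1
        elif tag == "+":
            out.append(text)
    return out + lines[i:]
```

A wrong start line, an off-by-one hunk header, or one mis-remembered space in a context line all end in the same could-not-match-context error.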

  4. AST edit: this one needs a bit of background. An AST (abstract syntax tree) is a tree representation of source code. When Python parses a file, it builds a tree where each node is a class, function, statement, etc. For a file like this:
class LRUCache:
    def __init__(self, max_size=128):
        self._max_size = max_size
        self._cache = {}

    def get(self, key):
        return self._cache.get(key)

The AST looks roughly like this:

Module
└── ClassDef: "LRUCache"
    ├── FunctionDef: "__init__"
    │   ├── args: (self, max_size=128)
    │   └── body:
    │       ├── self._max_size = max_size
    │       └── self._cache = {}
    └── FunctionDef: "get"
        ├── args: (self, key)
        └── body:
            └── return self._cache.get(key)

Every function and class has a name and a location in the file. So instead of identifying edit locations by line number or by matching text, we can just say “replace the body of LRUCache.get” and let the AST tell us where that is.

That’s what AST edit does. The model generates a JSON array of edit operations in a fenced block, where each operation targets a function or class by name. Proposed in aider#3206, implemented in tools like Codegen and AFT, but not used by any major coding assistant:

[
  {"operation": "replace_function_body",
   "target": "LRUCache.__init__",
   "content": "        self._max_size = max_size\n        self._ttl = ttl\n        ..."}
]

The model says what to change by name. The applier then parses the original file with Python’s ast module, walks the tree to find the node matching that name (using dotted names like LRUCache.__init__ to reach methods inside classes), reads its start and end line numbers from the AST, and splices in the new content. No context lines, no text matching, no line numbers.
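A simplified sketch of that applier (it replaces the whole body, docstring included, where the real operation preserves the docstring):

```python
import ast

def replace_function_body(source: str, target: str, new_body: str) -> str:
    """Resolve a dotted name like 'LRUCache.get' via the AST and
    splice `new_body` (already-indented lines ending in a newline)
    over the function's body."""
    scope = ast.parse(source)
    node = None
    for part in target.split("."):
        node = next(
            (n for n in scope.body
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef,
                               ast.ClassDef)) and n.name == part),
            None,
        )
        if node is None:
            raise ValueError(f"no definition named {target!r}")
        scope = node
    lines = source.splitlines(keepends=True)
    start = node.body[0].lineno - 1   # first body line, 0-based
    end = node.end_lineno             # last line of the definition
    return "".join(lines[:start]) + new_body + "".join(lines[end:])
```

No text has to match: the AST’s lineno/end_lineno fields say exactly which lines the body occupies, so a stale mental copy of the file can’t break the edit.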

We defined 9 operations:

| Operation | Target | What it does |
| --- | --- | --- |
| replace_function_body | Class.method or function | Replace just the body, keeping the signature and docstring |
| replace_function | Class.method or function | Replace the entire function/class including its signature |
| add_method | ClassName | Append a new method to the end of a class |
| add_before | function or Class | Insert code before a function or class |
| add_after | function or Class | Insert code after a function or class |
| delete | function or Class | Remove a function or class entirely |
| add_import | (none) | Add an import statement after existing imports |
| replace_imports | (none) | Replace the entire import block |
| replace_global | variable name | Replace a module-level variable assignment |
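Most of these take only a few lines of applier code each. As one concrete example, here is a sketch of add_import (ours; a production version would also handle edge cases like a module docstring before the first import):

```python
import ast

def add_import(source: str, import_stmt: str) -> str:
    """Insert `import_stmt` after the last top-level import,
    or at the very top if the file has none."""
    last = 0
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            last = node.end_lineno  # 1-based last line of that import
    lines = source.splitlines(keepends=True)
    return ("".join(lines[:last])
            + import_stmt.rstrip("\n") + "\n"
            + "".join(lines[last:]))
```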

The last three formats implement Can Bölük’s hashline approach. In all three, the model sees code with each line tagged by a short content hash:

45:d4|    def get(self, key: str) -> Any:
46:a3|        if key in self._cache:
47:f1|            self._hits += 1

Instead of reproducing old code, the model references those tags in its output. The three methods differ in how the model structures that output:

  5. Hashline JSON ops: the model generates a JSON array of edit operations, like AST edit, but referencing line tags instead of function names:
[{"op": "replace", "range": ["45:d4", "52:a3"],
  "content": "    def get(self, key: str, trace_id: str = \"\") -> Any:\n        ..."}]
  6. Hashline search/replace: the model generates search/replace blocks, but instead of reproducing the old code, it specifies a tag range identifying which lines to replace:
<<<SEARCH 45:d4..52:a3
>>>REPLACE
    def get(self, key: str, trace_id: str = "") -> Any:
        ...
<<<END
  7. Hashline unified diff: the model generates a diff in a fenced block, but with tag ranges in the hunk headers instead of line numbers:
@@ 45:d4..52:a3 @@
+    def get(self, key: str, trace_id: str = "") -> Any:
+        ...
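On the applier side, hashline tagging and validation are also only a few lines. This sketch uses 2 hex characters of MD5 per line, which is our guess at a workable scheme rather than Can Bölük’s exact format:

```python
import hashlib

def tag_lines(source: str) -> list[str]:
    """Render each line as 'lineno:hh|content', where hh is a short
    content hash of the line."""
    out = []
    for i, line in enumerate(source.splitlines(), start=1):
        h = hashlib.md5(line.encode()).hexdigest()[:2]
        out.append(f"{i}:{h}|{line}")
    return out

def resolve_tag(source: str, tag: str) -> int:
    """Map a 'lineno:hh' tag back to a 0-based line index, rejecting
    any tag whose hash doesn't match the line it points at."""
    lineno_s, want = tag.split(":")
    lineno = int(lineno_s)
    line = source.splitlines()[lineno - 1]
    got = hashlib.md5(line.encode()).hexdigest()[:2]
    if got != want:
        raise ValueError(f"hash mismatch at line {lineno}: {got} != {want}")
    return lineno - 1
```

The hash check is the whole point: a mis-transcribed tag gets rejected instead of silently landing the edit on the wrong line.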

The results

Correctness (% of tasks where the edited code passes the test suite) for each model and method:

| Method | Haiku 4.5 | o4-mini | GPT-5.4 | Opus 4.6 |
| --- | --- | --- | --- | --- |
| AST edit | 86.2% | 100.0% | 100.0% | 100.0% |
| Whole file | 96.6% | 82.8% | 96.6% | 100.0% |
| Hashline JSON ops | 82.8% | 79.3% | 100.0% | 89.7% |
| Search/replace | 62.1% | 75.9% | 96.6% | 100.0% |
| Hashline search/replace | 65.5% | 75.9% | 86.2% | 93.1% |
| Unified diff | 58.6% | 20.7% | 89.7% | 93.1% |
| Hashline unified diff | 37.9% | 69.0% | 93.1% | 79.3% |

A few things stand out.

AST edit is the only format that hits 100% on three different models. Only “whole file” gets close.

The variation is massive for the weaker models. o4-mini goes from 100% with AST edit down to 20.7% with unified diff. Haiku spans 37.9% to 96.6%. Picking the right format can matter more than picking the right model.

Unified diff is all over the place. It scores 93.1% on Opus but 20.7% on o4-mini. If you need something that works across models, this is the riskiest choice, even though it’s what Codex CLI’s V4A format is built on.

Why the other formats fail

We looked at all failures across the four models and split them into two buckets: “format failures” where the edit couldn’t even be applied (the model’s output was syntactically broken), and “logic failures” where the edit applied fine but produced the wrong answer.

| Method | Format failures | Logic failures | Total |
| --- | --- | --- | --- |
| Unified diff | 31 | 9 | 40 |
| Hashline unified diff | 9 | 26 | 35 |
| Hashline search/replace | 15 | 8 | 23 |
| Search/replace | 11 | 8 | 19 |
| Hashline JSON ops | 8 | 6 | 14 |
| Whole file | 0 | 7 | 7 |
| AST edit | 0 | 4 | 4 |

Format failures are the harness problem. Logic failures are the model getting the actual edit wrong.

Unified diff is mostly format failures (31 out of 40). The error is almost always the same: Could not find matching context for hunk at line N. The model reproduces a context line with slightly wrong whitespace, or gets a hunk header’s line count off by one, and the whole patch is rejected. In a 4,200-line file, the model has to reproduce context lines from memory across thousands of lines of code it saw once in the prompt. That’s fundamentally a transcription task, and LLMs just aren’t very good at transcription.

Search/replace has 11 format failures. Sometimes the model gets a variable name slightly wrong, or adds an extra space. Sometimes the search string matches two places in a large file and the applier doesn’t know which one to replace.

Hashline methods fail when the model gets a hash wrong. It sees 483:d4 in the input, writes 483:3a in the output. Every model does this, including Opus.

AST edit has zero format failures across all four models. The JSON always parses. The function names always resolve. The 4 failures it does have (all on Haiku) are logic errors where the model got the code change itself wrong, not the format.

Example: 3 edits across a 4,200-line file

One of the hardest tasks asks the model to add trace_id support to three classes in a 4,200-line file (NotificationService, CacheManager, and RequestHandler, each about 1,000 lines apart).

With AST edit, the model outputs:

[
  {"operation": "replace_function_body",
   "target": "NotificationService.send",
   "content": "        if recipient in self._suppressed ..."},
  {"operation": "replace_function_body",
   "target": "CacheManager.get",
   "content": "        if key in self._cache ..."},
  {"operation": "replace_function_body",
   "target": "RequestHandler.handle",
   "content": "        trace_id = request.get('trace_id', '') ..."}
]

The model names each function and provides the new body. The applier figures out where they are in the file and splices the content in.

With unified diff, the model has to produce three separate hunks with correct @@ headers, matching context lines, and precise +/- prefixes across 4,200 lines. Opus got it right. o4-mini didn’t: Could not find matching context for hunk at line 1530.

With search/replace, the model has to reproduce exact chunks of old code from a 4,200-line file it saw once. GPT-5.4 managed it. o4-mini produced a search string that matched two locations and got rejected.

With hashline search/replace, the model wrote 3703:83 instead of 3703:c6, getting two hex characters wrong, and the edit was rejected.

AST edit doesn’t have any of these problems. It doesn’t copy text, match context, or reproduce hashes. It just names functions.

The whole file tradeoff

Whole file also does well, scoring 82.8-100% across all models. It sidesteps the edit-expression problem entirely, since the model just rewrites everything.

But it’s obviously far more expensive and far slower:

| Method | Avg output tokens (Opus) | Avg latency |
| --- | --- | --- |
| Hashline search/replace | 311 | 6.8s |
| Hashline unified diff | 346 | 6.9s |
| Hashline JSON ops | 394 | 7.3s |
| Unified diff | 432 | 7.9s |
| Search/replace | 543 | 11.2s |
| AST edit | 621 | 9.1s |
| Whole file | 11,530 | 113.4s |

On a 4,200-line file, whole file uses 18x the output tokens of AST edit and takes 12x longer.

What about Can Bölük’s hashlines?

Can Bölük’s The Harness Problem argues that tagging lines with content hashes lets models reference code instead of reproducing it. He saw big improvements: Grok Code Fast went from 6.7% to 68.3%.

Our results were more mixed. We can compare each hashline method directly against its plain equivalent:

| Model | Search/replace | Hashline S/R | Delta | Unified diff | Hashline UD | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| Haiku | 62.1% | 65.5% | +3.4% | 58.6% | 37.9% | -20.7% |
| o4-mini | 75.9% | 75.9% | 0% | 20.7% | 69.0% | +48.3% |
| GPT-5.4 | 96.6% | 86.2% | -10.3% | 89.7% | 93.1% | +3.4% |
| Opus | 100.0% | 93.1% | -6.9% | 93.1% | 79.3% | -13.8% |

There’s no consistent improvement. Hashline search/replace helps slightly on Haiku, hurts on GPT-5.4 and Opus, and makes no difference on o4-mini. Hashline unified diff is even more unpredictable: it’s a massive 48-point improvement on o4-mini but a 21-point regression on Haiku.

Hashline JSON ops told a better story, scoring 79.3-100% across all models. But that method uses a standard JSON structure, so it’s hard to separate the benefit of the hashline references from the benefit of outputting JSON.

Can Bölük suggests that fine-tuning on the hashline format would help, and that might well close this gap.

What this means

AST-targeted output was the most reliable edit format in our tests. Three models scored 100%. Nothing else came close to that level of consistency. It works on small files and 4,200-line files. It’s cheap on tokens and fast. And no major tool ships it.

We’re not the first to suggest this. The idea has been proposed for Aider, and tools like Codegen and AFT already implement tree-sitter-based “edit by name” for agents. What’s been missing is evidence that it actually works better across models and file sizes.

The reason it hasn’t been adopted is probably practical: you need an AST parser per language. We used Python’s ast module. Supporting other languages means reaching for tree-sitter, which Codegen, AFT, and others are already doing. The core idea of targeting by name rather than by text position works in any language.

The bigger picture: the edit format is not a solved problem. The formats that major tools ship today (search/replace in Claude Code, patch diffs in Codex CLI, full-file rewriting in Cursor) work, but they leave real performance on the table, especially on larger files and especially across different models. A coding assistant that used AST-targeted output would, based on our numbers, produce more correct edits with fewer tokens and fewer retries.


All code and data are available at github.com/GeometricAGI/blog. The benchmark contains 29 tasks on files ranging from 100 to 4,200 lines, testing localised code edits.