Geometric

By Jack Foxabbott, Founding Member of Technical Staff

AST Edits: The Code Editing Format Nobody Uses

Every AI coding tool makes a bet on how its model should express code edits. Claude Code uses search/replace. Codex CLI uses a patch format. Cursor uses a dedicated apply model to rewrite files. Aider picks a format per model. Can Bölük proposed hashlines. None of the large providers use AST-targeted edits.

We tested all of them. AST won.


The benchmark

We built 29 editing tasks, from one-line fixes in 100-line files up to multi-site rewrites in 4,200-line modules. Each task gives the model a Python file, an instruction, and a test suite. The model edits the file. We check if the tests pass.

The edits themselves aren’t exotic: add a parameter, fix a bug, wire up a dependency. What makes the harder tasks hard is the file size. At 4,200 lines, a one-character mistake in a context line tanks the whole edit.

We ran all 29 tasks with 7 edit formats on 4 models: Claude Haiku 4.5, OpenAI o4-mini, GPT-5.4, and Claude Opus 4.6.

The 7 formats

The first three are formats used by major coding tools today:

  1. Whole file: the model generates the entire updated file in a fenced code block. This is our baseline, and the simplest approach. Aider uses it for some models. Cursor does something related but more sophisticated: a primary LLM generates a sketch of the changes, then a fine-tuned 70B “apply model” rewrites the full file to incorporate them, using speculative decoding to make this fast. In our benchmark we skip the two-model setup and just have a single model rewrite the file directly:
class LRUCache:
    def __init__(self, max_size: int = 128, ttl_seconds: int = 3600) -> None:
        self._max_size = max_size
        self._ttl_seconds = ttl_seconds  # new
        self._cache: dict[str, tuple[Any, float]] = {}
        # ... all 770 remaining lines reproduced verbatim ...

It always applies cleanly, but on a 4,200-line file the model has to regenerate the entire thing to change 5 lines. Cursor makes this fast with speculative decoding, but the tokens still get generated and billed.

  2. Search/replace: the model generates one or more search-and-replace blocks, each containing the exact old text and the new text to replace it with. Used by Claude Code and Aider:
<<<SEARCH
    def __init__(self, max_size: int = 128) -> None:
        self._max_size = max_size
>>>REPLACE
    def __init__(self, max_size: int = 128, ttl_seconds: int = 3600) -> None:
        self._max_size = max_size
        self._ttl_seconds = ttl_seconds
<<<END

The model has to reproduce the old code character-perfectly. One whitespace mismatch and the search fails.
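The applier itself is trivial; all of the fragility lives in that exact-match requirement. A minimal sketch (ours, not Claude Code’s or Aider’s actual code) shows both failure modes the format has to police:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search/replace block, rejecting the two classic
    failure modes: no exact match, and an ambiguous match."""
    count = source.count(search)
    if count == 0:
        # One whitespace difference from the real file lands here.
        raise ValueError("search text not found in file")
    if count > 1:
        # The search string matched more than one location.
        raise ValueError(f"search text is ambiguous ({count} matches)")
    return source.replace(search, replace, 1)
```

A single stray space in the search string and the first branch fires; a short search string in a large file and the second one does.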

  3. Unified diff: the model generates a diff in a fenced block, with @@ hunk headers, +/- line prefixes, and context lines. Aider uses a simplified version of this. Codex CLI uses a related format called V4A which has the same +/-/context structure but wraps it in explicit file headers (*** Update File:) and uses contextual anchors in its @@ headers rather than line numbers:
@@ -45,7 +45,8 @@ class LRUCache:
-    def __init__(self, max_size: int = 128) -> None:
+    def __init__(self, max_size: int = 128, ttl_seconds: int = 3600) -> None:
         self._max_size = max_size
+        self._ttl_seconds = ttl_seconds
         self._cache: dict[str, tuple[Any, float]] = {}

Compact, but context lines have to match exactly and hunk headers have to be right.
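To see why those requirements bite, here is a sketch of applying one hunk (our illustration, not Codex CLI’s implementation). The applier walks the hunk’s lines against the file, and any context or deletion line that doesn’t match byte-for-byte rejects the whole hunk:

```python
def apply_hunk(lines: list[str], start: int, hunk: list[str]) -> list[str]:
    """Apply one @@ hunk to `lines` (no trailing newlines).
    `start` is the 1-based old-file line the header claims the hunk
    begins at; each hunk line is prefixed ' ' (context), '-' (delete),
    or '+' (insert)."""
    out = lines[: start - 1]
    i = start - 1
    for h in hunk:
        tag, text = (h[0], h[1:]) if h else (" ", "")
        if tag in (" ", "-"):
            # Context and deletion lines must match the file exactly,
            # or the whole hunk is rejected.
            if i >= len(lines) or lines[i] != text:
                raise ValueError(f"could not match context at line {i + 1}")
            if tag == " ":
                out.append(lines[i])
            i += 1
        elif tag == "+":
            out.append(text)
    return out + lines[i:]
```

A wrong start line, an off-by-one hunk header, or one mis-remembered space in a context line all end in the same could-not-match-context error.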

  4. AST edit: this one needs a bit of background. An AST (abstract syntax tree) is a tree representation of source code. When Python parses a file, it builds a tree where each node is a class, function, statement, etc. For a file like this:
class LRUCache:
    def __init__(self, max_size=128):
        self._max_size = max_size
        self._cache = {}

    def get(self, key):
        return self._cache.get(key)

The AST looks roughly like this:

Module
└── ClassDef: "LRUCache"
    ├── FunctionDef: "__init__"
    │   ├── args: (self, max_size=128)
    │   └── body:
    │       ├── self._max_size = max_size
    │       └── self._cache = {}
    └── FunctionDef: "get"
        ├── args: (self, key)
        └── body:
            └── return self._cache.get(key)

Every function and class has a name and a location in the file. So instead of identifying edit locations by line number or by matching text, we can just say “replace the body of LRUCache.get” and let the AST tell us where that is.

That’s what AST edit does. The model generates a JSON array of edit operations in a fenced block, where each operation targets a function or class by name. Proposed in aider#3206, implemented in tools like Codegen and AFT, but not used by any major coding assistant:

[
  {"operation": "replace_function_body",
   "target": "LRUCache.__init__",
   "content": "        self._max_size = max_size\n        self._ttl = ttl\n        ..."}
]

The model says what to change by name. The applier then parses the original file with Python’s ast module, walks the tree to find the node matching that name (using dotted names like LRUCache.__init__ to reach methods inside classes), reads its start and end line numbers from the AST, and splices in the new content. No context lines, no text matching, no line numbers.
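A simplified sketch of that applier (it replaces the whole body, docstring included, where the real operation preserves the docstring):

```python
import ast

def replace_function_body(source: str, target: str, new_body: str) -> str:
    """Resolve a dotted name like 'LRUCache.get' via the AST and
    splice `new_body` (already-indented lines ending in a newline)
    over the function's body."""
    scope = ast.parse(source)
    node = None
    for part in target.split("."):
        node = next(
            (n for n in scope.body
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef,
                               ast.ClassDef)) and n.name == part),
            None,
        )
        if node is None:
            raise ValueError(f"no definition named {target!r}")
        scope = node
    lines = source.splitlines(keepends=True)
    start = node.body[0].lineno - 1   # first body line, 0-based
    end = node.end_lineno             # last line of the definition
    return "".join(lines[:start]) + new_body + "".join(lines[end:])
```

No text has to match: the AST’s lineno/end_lineno fields say exactly which lines the body occupies, so a stale mental copy of the file can’t break the edit.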

We defined 9 operations:

| Operation | Target | What it does |
| --- | --- | --- |
| replace_function_body | Class.method or function | Replace just the body, keeping the signature and docstring |
| replace_function | Class.method or function | Replace the entire function/class including its signature |
| add_method | ClassName | Append a new method to the end of a class |
| add_before | function or Class | Insert code before a function or class |
| add_after | function or Class | Insert code after a function or class |
| delete | function or Class | Remove a function or class entirely |
| add_import | (none) | Add an import statement after existing imports |
| replace_imports | (none) | Replace the entire import block |
| replace_global | variable name | Replace a module-level variable assignment |
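Most of these take only a few lines of applier code each. As one concrete example, here is a sketch of add_import (ours; a production version would also handle edge cases like a module docstring before the first import):

```python
import ast

def add_import(source: str, import_stmt: str) -> str:
    """Insert `import_stmt` after the last top-level import,
    or at the very top if the file has none."""
    last = 0
    for node in ast.parse(source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            last = node.end_lineno  # 1-based last line of that import
    lines = source.splitlines(keepends=True)
    return ("".join(lines[:last])
            + import_stmt.rstrip("\n") + "\n"
            + "".join(lines[last:]))
```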

The last three formats implement Can Bölük’s hashline approach. In all three, the model sees code with each line tagged by a short content hash:

45:d4|    def get(self, key: str) -> Any:
46:a3|        if key in self._cache:
47:f1|            self._hits += 1

Instead of reproducing old code, the model references those tags in its output. The three methods differ in how the model structures that output:

  5. Hashline JSON ops: the model generates a JSON array of edit operations, like AST edit, but referencing line tags instead of function names:
[{"op": "replace", "range": ["45:d4", "52:a3"],
  "content": "    def get(self, key: str, trace_id: str = \"\") -> Any:\n        ..."}]
  6. Hashline search/replace: the model generates search/replace blocks, but instead of reproducing the old code, it specifies a tag range identifying which lines to replace:
<<<SEARCH 45:d4..52:a3
>>>REPLACE
    def get(self, key: str, trace_id: str = "") -> Any:
        ...
<<<END
  7. Hashline unified diff: the model generates a diff in a fenced block, but with tag ranges in the hunk headers instead of line numbers:
@@ 45:d4..52:a3 @@
+    def get(self, key: str, trace_id: str = "") -> Any:
+        ...
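On the applier side, hashline tagging and validation are also only a few lines. This sketch uses 2 hex characters of MD5 per line, which is our guess at a workable scheme rather than Can Bölük’s exact format:

```python
import hashlib

def tag_lines(source: str) -> list[str]:
    """Render each line as 'lineno:hh|content', where hh is a short
    content hash of the line."""
    out = []
    for i, line in enumerate(source.splitlines(), start=1):
        h = hashlib.md5(line.encode()).hexdigest()[:2]
        out.append(f"{i}:{h}|{line}")
    return out

def resolve_tag(source: str, tag: str) -> int:
    """Map a 'lineno:hh' tag back to a 0-based line index, rejecting
    any tag whose hash doesn't match the line it points at."""
    lineno_s, want = tag.split(":")
    lineno = int(lineno_s)
    line = source.splitlines()[lineno - 1]
    got = hashlib.md5(line.encode()).hexdigest()[:2]
    if got != want:
        raise ValueError(f"hash mismatch at line {lineno}: {got} != {want}")
    return lineno - 1
```

The hash check is the whole point: a mis-transcribed tag gets rejected instead of silently landing the edit on the wrong line.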

The results

Correctness (% of tasks where the edited code passes the test suite) for each model and method:

| Method | Haiku 4.5 | o4-mini | GPT-5.4 | Opus 4.6 |
| --- | --- | --- | --- | --- |
| AST edit | 86.2% | 100.0% | 100.0% | 100.0% |
| Whole file | 96.6% | 82.8% | 96.6% | 100.0% |
| Hashline JSON ops | 82.8% | 79.3% | 100.0% | 89.7% |
| Search/replace | 62.1% | 75.9% | 96.6% | 100.0% |
| Hashline search/replace | 65.5% | 75.9% | 86.2% | 93.1% |
| Unified diff | 58.6% | 20.7% | 89.7% | 93.1% |
| Hashline unified diff | 37.9% | 69.0% | 93.1% | 79.3% |

A few things stand out.

AST edit is the only format that hits 100% on three different models. Only “whole file” gets close.

The variation is massive for the weaker models. o4-mini goes from 100% with AST edit down to 20.7% with unified diff. Haiku spans 37.9% to 96.6%. Picking the right format can matter more than picking the right model.

Unified diff is all over the place. It scores 93.1% on Opus but 20.7% on o4-mini. If you need something that works across models, this is the riskiest choice, even though it’s what Codex CLI’s V4A format is built on.

Why the other formats fail

We looked at all failures across the four models and split them into two buckets: “format failures” where the edit couldn’t even be applied (the model’s output was syntactically broken), and “logic failures” where the edit applied fine but produced the wrong answer.

| Method | Format failures | Logic failures | Total |
| --- | --- | --- | --- |
| Unified diff | 31 | 9 | 40 |
| Hashline unified diff | 9 | 26 | 35 |
| Hashline search/replace | 15 | 8 | 23 |
| Search/replace | 11 | 8 | 19 |
| Hashline JSON ops | 8 | 6 | 14 |
| Whole file | 0 | 7 | 7 |
| AST edit | 0 | 4 | 4 |

Format failures are the harness problem. Logic failures are the model getting the actual edit wrong.

Unified diff is mostly format failures (31 out of 40). The error is almost always the same: Could not find matching context for hunk at line N. The model reproduces a context line with slightly wrong whitespace, or gets a hunk header’s line count off by one, and the whole patch is rejected. In a 4,200-line file, the model has to reproduce context lines from memory across thousands of lines of code it saw once in the prompt. That’s fundamentally a transcription task, and LLMs just aren’t very good at transcription.

Search/replace has 11 format failures. Sometimes the model gets a variable name slightly wrong, or adds an extra space. Sometimes the search string matches two places in a large file and the applier doesn’t know which one to replace.

Hashline methods fail when the model gets a hash wrong. It sees 483:d4 in the input, writes 483:3a in the output. Every model does this, including Opus.

AST edit has zero format failures across all four models. The JSON always parses. The function names always resolve. The 4 failures it does have (all on Haiku) are logic errors where the model got the code change itself wrong, not the format.

Example: 3 edits across a 4,200-line file

One of the hardest tasks asks the model to add trace_id support to three classes in a 4,200-line file (NotificationService, CacheManager, and RequestHandler, each about 1,000 lines apart).

With AST edit, the model outputs:

[
  {"operation": "replace_function_body",
   "target": "NotificationService.send",
   "content": "        if recipient in self._suppressed ..."},
  {"operation": "replace_function_body",
   "target": "CacheManager.get",
   "content": "        if key in self._cache ..."},
  {"operation": "replace_function_body",
   "target": "RequestHandler.handle",
   "content": "        trace_id = request.get('trace_id', '') ..."}
]

The model names each function and provides the new body. The applier figures out where they are in the file and splices the content in.

With unified diff, the model has to produce three separate hunks with correct @@ headers, matching context lines, and precise +/- prefixes across 4,200 lines. Opus got it right. o4-mini didn’t: Could not find matching context for hunk at line 1530.

With search/replace, the model has to reproduce exact chunks of old code from a 4,200-line file it saw once. GPT-5.4 managed it. o4-mini produced a search string that matched two locations and got rejected.

With hashline search/replace, the model wrote 3703:83 instead of 3703:c6, getting two hex characters wrong, and the edit was rejected.

AST edit doesn’t have any of these problems. It doesn’t copy text, match context, or reproduce hashes. It just names functions.

The whole file tradeoff

Whole file also does well, scoring 82.8-100% across all models. It sidesteps the edit-expression problem entirely, since the model just rewrites everything.

But it’s obviously far more expensive and far slower:

| Method | Avg output tokens (Opus) | Avg latency |
| --- | --- | --- |
| Hashline search/replace | 311 | 6.8s |
| Hashline unified diff | 346 | 6.9s |
| Hashline JSON ops | 394 | 7.3s |
| Unified diff | 432 | 7.9s |
| Search/replace | 543 | 11.2s |
| AST edit | 621 | 9.1s |
| Whole file | 11,530 | 113.4s |

On a 4,200-line file, whole file uses 18x the output tokens of AST edit and takes 12x longer.

What about Can Bölük’s hashlines?

Can Bölük’s The Harness Problem argues that tagging lines with content hashes lets models reference code instead of reproducing it. He saw big improvements: Grok Code Fast went from 6.7% to 68.3%.

Our results were more mixed. We can compare each hashline method directly against its plain equivalent:

| Model | Search/replace | Hashline S/R | Delta | Unified diff | Hashline UD | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| Haiku | 62.1% | 65.5% | +3.4% | 58.6% | 37.9% | -20.7% |
| o4-mini | 75.9% | 75.9% | 0% | 20.7% | 69.0% | +48.3% |
| GPT-5.4 | 96.6% | 86.2% | -10.3% | 89.7% | 93.1% | +3.4% |
| Opus | 100.0% | 93.1% | -6.9% | 93.1% | 79.3% | -13.8% |

There’s no consistent improvement. Hashline search/replace helps slightly on Haiku, hurts on GPT-5.4 and Opus, and makes no difference on o4-mini. Hashline unified diff is even more unpredictable: it’s a massive 48-point improvement on o4-mini but a 21-point regression on Haiku.

Hashline JSON ops told a better story, scoring 79.3-100% across all models. But that method uses a standard JSON structure, so it’s hard to separate the benefit of the hashline references from the benefit of outputting JSON.

Can Bölük suggests that fine-tuning on the hashline format would help, and that might well close this gap.

What this means

AST-targeted output was the most reliable edit format in our tests. Three models scored 100%. Nothing else came close to that level of consistency. It works on small files and 4,200-line files. It’s cheap on tokens and fast. And no major tool ships it.

We’re not the first to suggest this. The idea has been proposed for Aider, and tools like Codegen and AFT already implement tree-sitter-based “edit by name” for agents. What’s been missing is evidence that it actually works better across models and file sizes.

The reason it hasn’t been adopted is probably practical: you need an AST parser per language. We used Python’s ast module. Supporting other languages means reaching for tree-sitter, which Codegen, AFT, and others are already doing. The core idea of targeting by name rather than by text position works in any language.

The bigger picture: the edit format is not a solved problem. The formats that major tools ship today (search/replace in Claude Code, patch diffs in Codex CLI, full-file rewriting in Cursor) work, but they leave real performance on the table, especially on larger files and especially across different models. A coding assistant that used AST-targeted output would, based on our numbers, produce more correct edits with fewer tokens and fewer retries.


All code and data are available at github.com/GeometricAGI/blog. The benchmark contains 29 tasks on files ranging from 100 to 4,200 lines, testing localised code edits.