Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

**Figure 1.** Task substrate and agentic runtime. *(a)* The same simple input and print task in Python, Brainfuck, and Befunge-98 shows how different esolang code looks from ordinary code. *(b)* Each model runs in a coding harness (Claude Code, Codex, or OpenCode) with file editing, shell access, benchmark commands, and a persistent workspace for local execution and hidden test submission.

TL;DR

We evaluate six contemporary coding agents on four esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Shakespeare) using a sequential setup with file editing, local execution, and hidden test grading. Agents that cluster within 6.6 pp on SWE-Bench Verified separate by 88 pp on this protocol. The strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. They write Python programs that generate target-language code and debug those generators locally before submission. Forbidding this metaprogramming strategy causes large drops. Transferring it as text guidance does not help, but transferring it as a small executable helper library does.

Abstract

LLM-based coding agents are usually evaluated in familiar software settings such as mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands.

We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them.

Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language's rules.

Video

A short walk-through of the setup and the metaprogramming behaviour we observed in frontier agents.

Experimental setup

We use the four-language task substrate from EsoLang-Bench. Each model and language pairing is one sequential session over 80 problems. Problems are fetched in fixed forward order. For each problem, the agent receives the statement, edits files in an isolated workspace, runs candidates locally, and may make up to three hidden submissions. A problem is finalised when one submission passes all six hidden tests or when the three submissions are exhausted. Finalised problems are not revisited.

Local interpreter calls expose ordinary execution feedback such as stdout, stderr, and runtime errors. Hidden submissions return only the aggregate number of private tests passed, not the private inputs or expected outputs. The primary protocol uses 80 problems per language, six hidden tests per problem, up to three hidden submissions per problem, unlimited local interpreter calls, a 32k-token output budget per assistant turn, and isolated workspaces.
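The submission rule described above can be sketched as a small state object. This is a minimal illustration, not the benchmark's actual harness: `ProblemState`, `submit`, and `hidden_grader` are hypothetical names, and the grader is assumed to return only the number of hidden tests passed.

```python
from dataclasses import dataclass
from typing import Callable

HIDDEN_TESTS = 6      # hidden tests per problem (from the protocol)
MAX_SUBMISSIONS = 3   # hidden submissions allowed per problem


@dataclass
class ProblemState:
    """Tracks one problem within a sequential session (hypothetical helper)."""
    submissions_used: int = 0
    solved: bool = False

    @property
    def finalised(self) -> bool:
        # Finalised once solved or once all submissions are spent.
        return self.solved or self.submissions_used >= MAX_SUBMISSIONS


def submit(state: ProblemState, program: str,
           hidden_grader: Callable[[str], int]) -> int:
    """Make one hidden submission and return the aggregate pass count."""
    if state.finalised:
        raise RuntimeError("problem is finalised and cannot be revisited")
    passed = hidden_grader(program)  # only the count is revealed, never the tests
    state.submissions_used += 1
    state.solved = passed == HIDDEN_TESTS
    return passed
```

Local interpreter runs are deliberately absent from this sketch: under the primary protocol they are unlimited and do not change the problem's state.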

We evaluate deployed coding agents rather than bare models. Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5 run under Claude Code. GPT-5.4 xhigh and GPT-5.4 mini run under Codex. Kimi K2.5 runs under OpenCode. Every agent receives the same benchmark-facing operations and the same per-language system prompt (no problem-specific guidance, no solved examples, and no hidden test material).

**Figure 2.** Per-problem state machine under the primary protocol. Each model and language run is a fixed forward session over 80 problems. For each problem, the agent fetches the specification, edits and executes candidate programs locally, and makes up to three hidden submissions. Hidden submissions return only aggregate hidden test feedback. Finalised problems are not revisited.

Results

Table 1 reports the Session-1 percentage solved on each of the four esoteric languages for the six agents we evaluate. Each cell is the percentage of that language's 80 problems solved, with Wilson 95% binomial intervals as subscripts. The Mean column averages the four esolangs. Two further sessions per cell are tabulated in the paper appendix; the qualitative ordering is unchanged.

Agent	Whitespace	Shakespeare	Befunge-98	Brainfuck	Mean
GPT-5.4 xhigh	100.0_−4.6	100.0_−4.6	100.0_−4.6	98.8_−5.5	99.7_−1.4
Opus 4.6	100.0_−4.6	87.5_±9.0	80.0_±10.0	80.0_±10.0	86.9_±4.1
Sonnet 4.6	100.0_−4.6	70.0_±10.8	80.0_±10.0	15.0_±9.4	66.3_±5.3
GPT-5.4 mini	88.8_±8.8	21.3_±10.2	13.8_±9.2	6.3_+7.6	32.5_±5.3
Haiku 4.5	81.3_±9.9	7.5_+7.9	5.0_+7.2	5.0_+7.2	24.7_±5.0
Kimi K2.5	31.3_±10.8	2.5_+6.2	6.3_+7.6	5.0_+7.2	11.3_±3.9

**Table 1.** Main EsoLang-Bench results under the primary protocol. Each cell is the Session-1 percentage of the 80 problems solved, with Wilson 95% binomial intervals as subscripts. Mean averages the four esolangs. Rows are sorted by mean (descending).
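For readers who want to check the interval subscripts, here is a minimal sketch of the standard Wilson score interval. The function name is ours. Reading each subscript as the distance from the point estimate to the farther interval bound is an assumption; it is consistent with the 100.0 (−4.6) and 98.8 (−5.5) cells.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(solved: int, n: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion solved / n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = solved / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Print the point estimate and both bounds, in percent, for two cell counts.
for solved in (80, 79):
    low, high = wilson_interval(solved, 80)
    print(f"{solved}/80: {100 * solved / 80:.1f}% "
          f"[{100 * low:.1f}, {100 * high:.1f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] at the extremes, which is why a cell at 100% can still carry a nonzero downward margin.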

Mainstream benchmarks hide a gap unfamiliar languages expose

Across the same six agents, the standard deviation (SD) of scores is 2.9 pp on SWE-Bench Verified and 17.2 pp on LiveCodeBench v6, but 36.0 pp on EsoLang-Bench. That is roughly 12 times the SWE-Bench Verified SD and 2 times the LiveCodeBench SD. Relative to SWE-Bench Verified, unfamiliar-language evaluation widens the spread by an order of magnitude.
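The EsoLang-Bench figure can be checked against the Mean column of Table 1. A minimal sketch, assuming the reported SD is the sample standard deviation (n − 1 denominator) of the six per-agent means; the source does not state the estimator, but this choice reproduces the quoted value.

```python
from statistics import stdev

# Mean column of Table 1 (percent solved, averaged over the four esolangs).
esolang_means = {
    "GPT-5.4 xhigh": 99.7,
    "Opus 4.6": 86.9,
    "Sonnet 4.6": 66.3,
    "GPT-5.4 mini": 32.5,
    "Haiku 4.5": 24.7,
    "Kimi K2.5": 11.3,
}

# Sample standard deviation across agents, in percentage points.
esolang_sd = stdev(esolang_means.values())
# Range between the best and worst agent, in percentage points.
esolang_range = max(esolang_means.values()) - min(esolang_means.values())

print(f"SD = {esolang_sd:.1f} pp, range = {esolang_range:.1f} pp")
```

The range computed here corresponds to the 88 pp separation quoted in the TL;DR.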

**Figure 3.** The same six agents on SWE-Bench Verified (x-axis) versus our EsoLang-Bench mean (y-axis). On SWE-Bench they bunch together within a few points, so the leaderboard makes them look interchangeable. On EsoLang-Bench they fan out across the full range, from near-perfect to near-zero. The gap a familiar benchmark hides is the gap an unfamiliar language exposes.

Forbidding metaprogramming reduces performance sharply

Inspecting the logs shows a consistent pattern on low-level languages, especially Brainfuck and Befunge-98. The strongest agents avoid writing target-language source directly. They write a Python generator that emits target-language source, run it locally, and iterate on the generator. A representative within-session switch occurs on Brainfuck E04. Opus 4.6 first submits a hand-written 1,884-byte Brainfuck program that fails the hidden tests. After the failure, it writes a Python generator whose output is 24,500 bytes and passes all six hidden tests.
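To make the pattern concrete, here is a toy version of the generate-then-verify loop. It is an illustrative sketch, not code taken from any agent's logs: `bf_print_string` and `run_bf` are hypothetical names, the interpreter assumes 8-bit wrapping cells and a 30,000-cell tape, and the generator only handles text whose code points fit in one cell.

```python
def bf_print_string(text: str) -> str:
    """Emit Brainfuck that prints `text` by nudging a single cell between characters."""
    parts = []
    current = 0  # value the cell will hold at this point in the emitted program
    for ch in text:
        delta = ord(ch) - current
        parts.append(("+" if delta > 0 else "-") * abs(delta))
        parts.append(".")
        current = ord(ch)
    return "".join(parts)


def run_bf(code: str, stdin: str = "", max_steps: int = 1_000_000) -> str:
    """Tiny Brainfuck interpreter used to check generated programs locally."""
    stack, match = [], {}
    for i, c in enumerate(code):  # pre-compute matching bracket positions
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            match[i], match[j] = j, i
    tape, ptr, pc, steps = [0] * 30000, 0, 0, 0
    inp, out = iter(stdin), []
    while pc < len(code):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        c = code[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(chr(tape[ptr]))
        elif c == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # end of input reads as 0
        elif c == "[" and tape[ptr] == 0:
            pc = match[pc]  # skip the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = match[pc]  # jump back to the loop start
        pc += 1
    return "".join(out)


# Generate, run locally, and compare before spending a hidden submission.
program = bf_print_string("Hi\n")
assert run_bf(program) == "Hi\n"
```

The point of the pattern is that bugs are fixed in the Python generator, where the agent is fluent, and the target source is regenerated rather than patched by hand.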

To test whether metaprogramming merely correlates with success or supports it, we run a no-metaprogramming variant for the two strongest agents. In this variant, agents must author the target esolang directly and may not use a host-language program to generate target source. All other aspects of the protocol are held fixed.

**Figure 4.** Forbidding metaprogramming and forcing direct authoring collapses performance on Brainfuck and Befunge-98, the two low-level languages where the generators were carrying the result. Solved problems out of 80 for Opus 4.6 and GPT-5.4 xhigh, metaprogramming allowed versus forbidden. Whitespace and Shakespeare, which the agents already solve without building a generator, are barely affected.

The benefit is host-language generation, not Python specifically

On Brainfuck, swapping the generator host language preserves most of the gain. Opus 4.6 solves 64/80 with Python, 63/80 with JavaScript, and 55/80 with Rust, while GPT-5.4 xhigh solves 79/80, 77/80, and 79/80, respectively. Direct authoring remains low for both agents (27/80 and 29/80). The critical ingredient is access to a familiar general-purpose host language for constructing target programs, not Python itself.

**Figure 5.** Cross-host-language Brainfuck results for Opus 4.6 and GPT-5.4 xhigh. The gain from metaprogramming is preserved across Python, JavaScript, and Rust generators, but collapses under direct authoring. The host language is substitutable. Access to *some* familiar host language is not.

Strategy transfer works through executable scaffolds, not text

We use Opus 4.6's traces to distill the metaprogramming strategy in two forms. In the +Text condition, we add a system-prompt preamble summarising the strategy: use a generator, build reusable primitives, verify locally, and regenerate components rather than hand-patching target code. In the +Lib condition, we additionally provide a small Python helper library distilled from those traces, containing generic primitives (a cell allocator, BCD-arithmetic helpers, decimal-print primitives, and a local Befunge-98 simulator). The library contains no per-problem generators, solved benchmark programs, hidden test inputs, expected outputs, or ground-truth answers.
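As a rough picture of what a generic primitive such as a cell allocator might look like, consider the sketch below. It is our own illustration, not the distilled library: the class `BFBuilder` and its methods are hypothetical, and the real helpers may be organised quite differently.

```python
class BFBuilder:
    """Hypothetical Brainfuck code builder with named tape cells."""

    def __init__(self) -> None:
        self.code: list[str] = []
        self.ptr = 0                       # where the emitted code leaves the pointer
        self.cells: dict[str, int] = {}    # cell name -> tape index

    def alloc(self, name: str) -> int:
        """Reserve the next free tape cell under a readable name."""
        self.cells[name] = len(self.cells)
        return self.cells[name]

    def goto(self, name: str) -> None:
        """Emit pointer moves from the tracked position to the named cell."""
        delta = self.cells[name] - self.ptr
        self.code.append((">" if delta > 0 else "<") * abs(delta))
        self.ptr = self.cells[name]

    def set(self, name: str, value: int) -> None:
        """Clear the cell, then increment it to `value`."""
        self.goto(name)
        self.code.append("[-]" + "+" * value)

    def add_into(self, src: str, dst: str) -> None:
        """Destructive add: dst += src, leaving src at zero."""
        self.goto(src)
        self.code.append("[")
        self.goto(dst)
        self.code.append("+")
        self.goto(src)   # the loop must end where it began
        self.code.append("-]")

    def emit(self, name: str) -> None:
        """Output the named cell as a character."""
        self.goto(name)
        self.code.append(".")

    def source(self) -> str:
        return "".join(self.code)


b = BFBuilder()
b.alloc("a")
b.alloc("b")
b.set("a", 2)
b.set("b", 3)
b.add_into("a", "b")
b.emit("b")
print(b.source())
```

Tracking the pointer position in the host language removes a whole class of off-by-one tape errors, which may be part of why executable helpers transfer where written advice does not.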

**Figure 6.** Strategy transfer across three weaker agents on Brainfuck and Befunge-98. Written advice alone (+Text) produces little improvement. The reference library (+Lib), in contrast, substantially improves Sonnet 4.6 and GPT-5.4 mini, while Haiku 4.5 remains near the floor.

The lift is sharp where it happens. Sonnet 4.6 moves from 12/80 to 64/80 on Brainfuck and from 64/80 to 78/80 on Befunge-98. GPT-5.4 mini moves from 5/80 to 53/80 on Brainfuck and from 11/80 to 64/80 on Befunge-98. Haiku 4.5 stays near the floor on both languages regardless of which scaffold is provided.

Additional inference-time resources help only when agents can use them

We cap local interpreter calls per problem at 3, 5, 15, 30, or unlimited, holding the task substrate, hidden submissions, and scoring rule fixed. Additional interpreter access helps agents that already convert local feedback into progress. Opus 4.6 improves on both languages, and Sonnet 4.6 improves on Befunge-98. Agents that are near the floor at the smallest budget remain near the floor even when given many more local runs. Tool access is not a uniform substitute for strategy construction.
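A per-problem call cap of this kind can be sketched as a thin wrapper around whatever function runs the interpreter. The wrapper below is an assumption about how such a cap could be enforced, not the harness used in the study; `BudgetedInterpreter` and `run` are hypothetical names.

```python
from typing import Callable, Optional


class BudgetedInterpreter:
    """Wraps a local interpreter and enforces a per-problem call budget."""

    def __init__(self, run: Callable[[str, str], str],
                 max_calls: Optional[int]) -> None:
        self._run = run
        self.max_calls = max_calls  # None means unlimited
        self.calls = 0

    def reset(self) -> None:
        """Call when the session moves on to the next problem."""
        self.calls = 0

    def __call__(self, program: str, stdin: str = "") -> str:
        if self.max_calls is not None and self.calls >= self.max_calls:
            raise RuntimeError("local interpreter budget exhausted for this problem")
        self.calls += 1
        return self._run(program, stdin)


# The five budget settings from the experiment, with None standing for unlimited.
BUDGETS = [3, 5, 15, 30, None]
```

Hidden submissions are not counted here, matching the description that they are held fixed across budget settings.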

**Figure 7.** More interpreter calls help only agents that can use feedback. Problems solved out of 80 on Brainfuck and Befunge-98 under local-interpreter-call budgets of 3, 5, 15, 30, and unlimited.

We also ask whether the gap is explained simply by stronger models spending more output tokens. For the first 20 Brainfuck and Befunge-98 problems, we log cumulative API output tokens for the three Claude agents, including extended-thinking tokens. Opus 4.6 solves more problems with fewer tokens than Sonnet 4.6 on Brainfuck and reaches 20/20 on Befunge-98 with roughly half Sonnet's token use. The difference is therefore not just that Opus spends more. It finds a reusable strategy earlier, after which additional problems become cheaper to solve.
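The comparison rests on a cumulative curve: total output tokens spent so far against problems solved so far. A minimal sketch of that bookkeeping follows, assuming a per-problem log of `(output_tokens, solved)` pairs in session order; the log format and the function name are our assumptions, and the numbers in the usage lines are placeholders, not measured values.

```python
from itertools import accumulate


def solve_curve(log: list[tuple[int, bool]]) -> list[tuple[int, int]]:
    """Return (cumulative output tokens, cumulative solves) after each problem."""
    cum_tokens = accumulate(tokens for tokens, _ in log)
    cum_solved = accumulate(int(solved) for _, solved in log)
    return list(zip(cum_tokens, cum_solved))


# Placeholder log for three problems (illustrative only, not from the paper).
example_log = [(1000, True), (500, False), (250, True)]
curve = solve_curve(example_log)
print(curve)
```

On such a curve, an agent that finds a reusable strategy early shows a steep first segment followed by cheaper solves, which is the shape described for Opus 4.6.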

**Figure 8.** Output-token use does not explain the gap. Cumulative solves versus cumulative API output tokens on the first 20 Brainfuck and Befunge-98 problems for Claude agents. Opus reaches 20/20 on both languages with fewer tokens than Sonnet. Haiku saturates early.

Contact us

Questions, disagreements, or related results? We would love to hear from you by email: aman.sharma@lossfunk.com

BibTeX

@article{sharma2026metaprogramming,
  title         = {Frontier Coding Agents Use Metaprogramming to
                   Adapt to Unfamiliar Programming Languages},
  author        = {Sharma, Aman and Thorat, Sushrut and Chopra, Paras},
  year          = {2026},
  eprint        = {2606.10933},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2606.10933}
}