LoopForge — a Loop Engineering harness

An independent, general-purpose take on the closed-loop LLM optimization approach introduced in ComPilot (Merouani et al., PACT 2025) — for everyday optimization work instead of polyhedral loop nests. The relationship is conceptual, not code-derived: ComPilot is a compiler-specific system built on Tiramisu's polyhedral solvers, whereas LoopForge is a general-purpose, in-process Python harness. LoopForge is not affiliated with the ComPilot project. ComPilot's real contribution isn't loop transformations — it's a discipline: an LLM agent proposes structured, labeled transformations, an environment validates and measures them, and typed empirical feedback grounds the next move. That discipline transfers to any task with a correctness oracle and a stopwatch.

What maps to what

ComPilot	LoopForge
LLM agent (off-the-shelf)	Claude, via `AGENT_PROMPT.md` (chat or Claude Code)
Tiramisu compiler	`harness.py` — the environment driver
Polyhedral legality check	Correctness oracle: differential testing vs a trusted reference + invariants
Speedup measurement	Warm-up + median-of-trials benchmark, relative to the reference
5 feedback types	`MALFORMED / INCORRECT / NOT_APPLICABLE / RUNTIME_ERROR / SUCCESS`
`<schedule>` tags	`<transform label="...">` block: reasoning + drop-in candidate
Dialogue = episodic memory	`Ledger` — every attempt + verdict, replayed each turn
comp_IDs / anonymization	Target spec (file:function); optional symbol anonymization
Multi-run K, iterations T	Configurable; the prompt bakes in diminishing-returns discipline
Push past premature quits	The prompt's "push once more" rule

The one honest caveat (ComPilot RQ7)

ComPilot gets formal correctness from polyhedral dependence analysis. General app code has no such checker, so LoopForge substitutes the strongest practical thing: differential testing against a reference across a battery of inputs plus invariants. That is the weaker-guarantee regime RQ7 warns about — output comparison can pass a transformation that is subtly wrong on inputs you didn't test. So the oracle is the most important file you write. Make gen_inputs() adversarial: edge cases, empty, degenerate, boundary values, large random. In the demo, iteration 1 looked provably correct and was off by one pair — only a fat input battery caught it.

Where a real transformation engine exists, use it instead of LLM-written code and you recover ComPilot's stronger guarantees: e.g. SQL (rewrite validated by EXPLAIN ANALYZE for cost + result-set equality for correctness) is a near-perfect fit.

Quickstart

Requirements: Python 3 (tested on 3.13). harness.py and the two-sum loop (run_loop.py) use only the standard library. The proximity demo's final iteration imports numpy — that is the only external dependency. There is no requirements.txt in this repo, so install numpy directly:

pip install numpy

(If numpy is absent the demo still runs, but its last iteration reports RUNTIME_ERROR instead of a speedup.)

Run the proximity demo — four pre-baked transformation attempts:

python run_demo.py        # on Windows you can also use:  py run_demo.py

It optimizes a brute-force O(N^2) "users within R meters" function. The oracle rejects two attempts (a subtly-wrong latitude prefilter that drops one pair, then a greedy grid that drops boundary-straddling pairs), then accepts a correct 3x3-grid win, then a larger numpy-vectorised win. The exact speedup multipliers are hardware-dependent — they are baseline_time / candidate_time measured on your own CPU.

Run the live two-sum loop (the worked example below):

python run_loop.py

LLM-driven loop (the intended pattern)

run_demo.py and run_loop.py replay pre-written candidates. The loop is meant to be driven by an LLM acting as the architect (proposer): give the model AGENT_PROMPT.md as its system prompt, then on each turn send it the target source, the baseline time, and the running ledger. The model replies with a <transform> block; you extract the candidate, run it through evaluate() (passing its source via src=... so the banned check fires), record the typed verdict in the Ledger, and feed that verdict back to the model. Repeat until the model emits no_further_transformations, a turn cap is hit, or the best-so-far stalls.

This repo does not ship a runnable driver for that loop — evaluate() and the Ledger are the building blocks; wire them to whichever LLM client you use.

Safety: a candidate returned by a model is arbitrary code. If you evaluate it in-process, do so only in a trusted environment, on a target you control.

Defining a target (the `Task` schema)

A target is a Task (defined in harness.py). You supply a trusted reference, an input battery, and an equality test; the harness does the rest. The complete field list, with defaults exactly as they appear in harness.py:

field	type	required	meaning
`name`	`str`	yes	label for the target
`reference`	`callable(input) -> output`	yes	the trusted implementation — used as BOTH the correctness oracle and the speed baseline
`gen_inputs`	`callable() -> list`	yes	the input battery; `inputs[0]` is the largest (used for benchmarking), the rest are correctness cases. Make it adversarial.
`equal`	`callable(ref_out, cand_out) -> bool`	yes	semantic equality of two outputs
`invariants`	`callable(input, output) -> None`	no — default no-op	extra checks; raise to reject a candidate
`banned`	`tuple[str, ...]`	no — default `()`	source substrings forbidden in a candidate (e.g. importing the oracle)
`bench_trials`	`int`	no — default `5`	median-of-N timed runs for a candidate
`baseline_trials`	`int`	no — default `2`	median-of-N timed runs for the reference

Note: the banned check only runs when a candidate's source is supplied — i.e. propose(label, fn, src=...). If src is omitted, no substring check is performed.

A candidate is a drop-in replacement for reference with the same signature. The harness checks it in three stages (see evaluate() in harness.py): a banned- substring / callable pre-filter → the correctness oracle over every input in the battery → a benchmark on inputs[0]. It returns one of the five verdicts above.

Worked example (task_twosum.py — count index pairs i < j with a[i] + a[j] == target):

from harness import Task

def reference(case):
    a, target = case
    n = len(a); count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] + a[j] == target:
                count += 1
    return count

def gen_inputs():
    # inputs[0] is the big benchmark case; the rest are adversarial correctness
    # cases (self-pairs, duplicates, empty, single element, asymmetric pairs).
    ...

def equal(ref_out, cand_out):
    return ref_out == cand_out

TASK = Task(
    name="twosum_pair_count",
    reference=reference,
    gen_inputs=gen_inputs,
    equal=equal,
    banned=("import task_twosum", "from task_twosum"),
    bench_trials=5,
    baseline_trials=2,
)

Then drive the loop (see run_loop.py): each agent turn becomes a propose(label, candidate_fn) call, which runs evaluate() and records the verdict + speedup in the Ledger. There is no CLI binary — you run the driver script directly (python run_loop.py) and read the printed verdicts.

Wiring into the architect/executor workflow

This is the loop you already run by hand, made explicit:

Architect (this chat / a planning Claude): holds AGENT_PROMPT.md, sees the target + ledger, emits the next <transform> candidate.
Executor (Claude Code): drops the candidate into the repo, runs harness.py against the task, captures the verdict.
The verdict + updated ledger go back to the architect. Repeat.

Keep the architect and executor as separate Claude instances exactly as you do now; LoopForge just standardizes the message format between them so feedback is typed and the best-so-far is never lost.

Good first targets

A Supabase/PostGIS proximity or "nearby" query (oracle = result-set equality, perf = EXPLAIN ANALYZE cost or wall time). Strongest fit — a real legality-ish checker.
A deterministic game/physics simulation step (oracle = identical state from a fixed seed; perf = frame time).
A transaction categorizer or record-matching pass (oracle = identical output on a fixed input set; perf = wall time over a batch).

References

Merouani, M., Kara Bernou, I., and Baghdadi, R. "Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization." PACT 2025. arXiv:2511.00592. https://arxiv.org/abs/2511.00592

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitignore		.gitignore
AGENT_PROMPT.md		AGENT_PROMPT.md
LICENSE		LICENSE
README.md		README.md
harness.py		harness.py
run_demo.py		run_demo.py
run_loop.py		run_loop.py
task_proximity.py		task_proximity.py
task_twosum.py		task_twosum.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LoopForge — a Loop Engineering harness

What maps to what

The one honest caveat (ComPilot RQ7)

Quickstart

LLM-driven loop (the intended pattern)

Defining a target (the `Task` schema)

Wiring into the architect/executor workflow

Good first targets

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LoopForge — a Loop Engineering harness

What maps to what

The one honest caveat (ComPilot RQ7)

Quickstart

LLM-driven loop (the intended pattern)

Defining a target (the Task schema)

Wiring into the architect/executor workflow

Good first targets

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Defining a target (the `Task` schema)

Packages