walkinglabs/awesome-harness-engineering

Awesome Harness Engineering

A curated list of articles, playbooks, benchmarks, specifications, and open-source projects for harness engineering: the practice of shaping the environment around AI agents so they can work reliably.

Harness engineering sits at the intersection of context engineering, evaluation, observability, orchestration, safe autonomy, and software architecture. This list focuses on resources that make agents more dependable in real workflows, especially long-running coding and research tasks.

Generic agent tooling is out of scope unless the page directly covers harness design, context management, evaluation, runtime control, or other reliability-critical harness primitives.

Contents

  • Courses & Learning Resources
  • Foundations
  • Context, Memory & Working State
  • Constraints, Guardrails & Safe Autonomy
  • Specs, Agent Files & Workflow Design
  • Evals & Observability
  • Benchmarks
  • Runtimes, Harnesses & Reference Implementations
  • Contributing
  • License

Courses & Learning Resources

  • walkinglabs/learn-harness-engineering - A project-based course repository on making Codex and Claude Code more reliable, centered on an Electron personal knowledge base app with lecture handouts, example artifacts, and practical harness projects.

Foundations

Context, Memory & Working State

Constraints, Guardrails & Safe Autonomy

Specs, Agent Files & Workflow Design

  • AGENTS.md - A lightweight open format for repo-local instructions that tell agents how to work inside a codebase.
  • agent.md - A related standardization effort for machine-readable agent instructions across projects and tools.
  • GitHub Spec Kit - GitHub's toolkit for spec-driven development, useful when you want agents to execute against explicit product and engineering specs.
  • Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl - Thoughtworks on why strong specs make AI-assisted software delivery more dependable.
  • 12 Factor Agents - HumanLayer's operating principles for production agents, including explicit prompts, state ownership, and clean pause-resume behavior.
  • 12-Factor AgentOps - An operations-oriented companion focused on context discipline, validation, and reproducible agent workflows.
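To make the AGENTS.md idea concrete, here is a minimal, hypothetical example of the kind of repo-local instruction file the format describes — the sections and rules shown are illustrative, not prescribed by the spec, which deliberately treats the file as free-form Markdown guidance:

```markdown
# AGENTS.md

## Setup
- Install dependencies with `npm install` before running anything.

## Testing
- Run `npm test` after every change; do not commit failing tests.

## Conventions
- TypeScript only; avoid `any` without a justifying comment.
- Keep changes scoped to the files the task names.
```

Because agents read this as plain guidance rather than a rigid schema, each repository can shape its own sections around its build, test, and review workflow.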

Evals & Observability

Benchmarks

These benchmarks are especially useful when you want to compare harness quality, not just model quality. They stress context handling, tool calling, environment control, verification logic, and the runtime scaffolding around the model.

  • Agent Arena - A leaderboard that ranks AI agents, models, tools, and frameworks using ELO-style ratings from head-to-head battles, providing a structured way to compare harness-level choices across categories.
  • AgentBench - A cross-environment benchmark spanning OS, databases, knowledge graphs, web browsing, and more, useful for seeing whether a harness generalizes beyond one narrow task loop.
  • AgentBoard - A benchmark for multi-turn LLM agents complemented by an analytical evaluation board for assessing model performance beyond final success rates, making partial-progress and trajectory quality visible.
  • AgentStudio - An integrated benchmark suite with realistic environments and comprehensive toolkits for evaluating virtual agents on real computer software, useful for measuring harness depth against a broad task surface.
  • AppWorld - A controllable world of apps and people for benchmarking interactive coding agents, with state-based and execution-based unit tests that surface harness quality around planning, code generation, and collateral-damage control.
  • AssistantBench - A benchmark that evaluates web agents on realistic, time-consuming research tasks requiring multi-step tool use and information synthesis, making it a good proxy for harness quality in long-horizon web scenarios.
  • BrowseComp - A benchmark that evaluates AI agents on locating hard-to-find information, stressing search strategy, context management, and retrieval harness design under difficult conditions.
  • BrowserGym Leaderboard - A gym environment and leaderboard for evaluating LLMs, VLMs, and agents on web navigation tasks, offering a reproducible framework for comparing harnesses across multiple web benchmarks in one place.
  • CharacterEval - A benchmark for evaluating role-playing conversational agents using multi-turn dialogues and character profiles, with metrics across four dimensions including character fidelity and conversational coherence.
  • ClawBench - A benchmark that evaluates AI agents across search, reasoning, coding, safety, and multi-turn conversation tasks, covering the breadth of harness demands in a single suite.
  • ClawWork - A real-world economic benchmark where AI agents complete professional tasks spanning 44 occupations, earning income while managing token costs and economic solvency, making it a direct test of harness efficiency under resource constraints.
  • Computer Agent Arena - An open evaluation platform where users compare LLM/VLM-based agents on real-world computer tasks ranging from general computer use to coding, data analysis, and video editing, surfacing harness differences across a wide task surface.
  • EvoClaw: Evaluating AI Agents on Continuous Software Evolution - A benchmark write-up on evaluating agents across dependent milestone sequences from real repository history, surfacing regression accumulation and long-horizon precision loss.
  • GAIA - A benchmark for general AI assistants that is often used to compare harness-level choices around tools, planning, verification, and long-horizon autonomy.
  • Galileo Agent Leaderboard - An open evaluation platform tracking LLM agents on task completion and tool calling across business domains, useful for comparing harness quality in enterprise-grade agentic scenarios.
  • GTA - A benchmark that evaluates the tool-use capability of LLM-based agents using human-written queries, real deployed tools, and authentic multimodal inputs, exposing harness gaps between isolated testing and real deployment.
  • HAL: Holistic Agent Leaderboard - A benchmark and leaderboard for agent systems with attention to reliability, cost, and broad task coverage, making it useful for comparing end-to-end harness behavior.
  • Introducing Terminal-Bench 2.0 and Harbor - The Terminal-Bench 2.0 announcement, useful for understanding the harder tasks and generalized evaluation harness behind Harbor.
  • LeetCode-Hard Gym - An RL environment interface to LeetCode's submission server for evaluating codegen agents, giving harnesses direct access to execution-based feedback on hard algorithmic problems.
  • LLM Colosseum Leaderboard - A platform that evaluates LLMs by having them fight in Street Fighter III, testing speed, adaptability, and real-time decision-making as proxies for harness responsiveness under tight latency constraints.
  • MAgIC - A benchmark measuring cognition, adaptability, rationality, and collaboration of LLMs in multi-agent systems, useful for evaluating how harnesses coordinate agent interactions and shared state.
  • MCP Bench - A benchmark for evaluating AI models on MCP server interactions, measuring tool accuracy, latency, and token use across server types, which directly reflects harness design choices around MCP integration.
  • MCP Universe - A leaderboard comparing AI model performance on MCP tasks, tracking how different models and harness configurations handle tool-augmented agent workflows.
  • MCPMark - A stress-testing benchmark for model and agent capabilities in real-world MCP tasks across tools like Notion, GitHub, and Postgres, making harness MCP integration quality directly measurable.
  • Olas Predict Benchmark - A benchmark for evaluating agents on historical prediction market data, testing harness design for research, retrieval, and forecasting in long-horizon reasoning tasks.
  • OSWorld - A real computer-use benchmark with 369 tasks across Ubuntu, Windows, and macOS, complete with initial-state setup and execution-based evaluators, making it excellent for testing desktop and multimodal harnesses.
  • OSWorld-MCP - An extension of OSWorld that evaluates AI agents on real-world computer tasks using the Model Context Protocol, making it useful for comparing MCP-enabled harnesses on a realistic desktop task suite.
  • SEC-bench - A benchmark for evaluating LLM agents on real-world software security tasks including vulnerability reproduction and patching, stressing harness design around code execution, containerized environments, and security-aware tooling.
  • SWE-bench Verified - A strong benchmark for software engineering agents working against real GitHub issues and tests, which makes harness choices around retrieval, patching, and validation highly visible.
  • τ-Bench - A benchmark that emulates dynamic conversations between a simulated user and a language agent equipped with domain-specific API tools and policy guidelines, making it useful for evaluating harnesses built around structured tool use and policy enforcement.
  • tau2-bench - A benchmark for realistic, multi-step agent tasks where success depends on tool use and execution quality rather than a single-shot answer.
  • Terminal-Bench - A benchmark suite for terminal-native agents operating in shells, filesystems, and verification-heavy environments, which is especially useful for comparing coding-agent harnesses.
  • TravelPlanner - A benchmark for evaluating LLM agents on tool use and complex planning within multiple constraints, revealing how harness design handles multi-constraint satisfaction and long-horizon planning.
  • VAB - VisualAgentBench evaluates large multimodal models as visual foundation agents across embodied, GUI, and visual design tasks, useful for comparing harnesses on visually grounded, multi-step agent workflows.
  • VisualWebArena - A benchmark for multimodal web agents on realistic visually grounded tasks, extending WebArena with image and screenshot inputs that stress harness support for visual context in browser environments.
  • WebArena - A standalone, self-hostable web environment for evaluating autonomous agents on realistic tasks, making it a reproducible baseline for comparing web-facing harness designs.
  • WebArena-Verified - A verified web-agent benchmark with curated tasks and deterministic evaluators over agent responses and captured network traces, making it a good fit for measuring web-facing harnesses.
  • WildClawBench - An in-the-wild benchmark running agents inside a live OpenClaw environment on 60 original tasks including multimodal, long-horizon, and safety-critical scenarios, making harness robustness under real-world conditions directly visible.
  • WorkArena - A benchmark for browser agents on common knowledge-work tasks, useful for comparing harnesses on realistic enterprise-style web workflows instead of toy browser tasks.
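Several of the leaderboards above (Agent Arena, Computer Agent Arena, LLM Colosseum) rank harnesses with Elo-style ratings accumulated from pairwise battles. As a rough sketch of how such a rating evolves, here is the standard Elo update; the K-factor of 32 is an assumption, and real platforms may use Bradley-Terry fits or other variants:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one head-to-head battle."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Two harnesses start at 1500; harness A wins one battle.
a, b = elo_update(1500.0, 1500.0, a_won=True)  # a -> 1516, b -> 1484
```

Note that the winner gains exactly what the loser drops, so total rating is conserved; leaderboards repeat this update over many sampled matchups until ratings stabilize.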

Runtimes, Harnesses & Reference Implementations

  • Agent Frameworks, Runtimes, and Harnesses, Oh My! - LangChain's decomposition of what belongs in a framework, a runtime, and a harness.
  • Building agents with the Claude Agent SDK - Anthropic's guide to a production-oriented agent SDK with sessions, tools, and orchestration support.
  • How we built our multi-agent research system - Anthropic's architecture write-up for a multi-agent system with separation of roles and structured coordination.
  • deepagents - LangChain's open-source project for building deeper, longer-running agents with middleware and harness patterns.
  • SWE-agent - A mature research coding agent that makes the harness, prompt, tools, and environment design directly inspectable.
  • SWE-ReX - Sandboxed code execution infrastructure for AI agents, useful when harness work starts to merge into execution runtime design.
  • AgentKit - Inngest's TypeScript toolkit for building durable, workflow-aware agents on top of event-driven infrastructure.
  • Harbor - A generalized harness for evaluating and improving agents at scale, released alongside Terminal-Bench 2.0.
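The projects above differ widely in scope, but most wrap the same core loop: the model proposes an action, the harness executes it in a controlled environment, and a step budget bounds autonomy. A bare-bones sketch with a stubbed model — every name here is illustrative and not any listed project's API:

```python
def stub_model(observation: str) -> dict:
    """Stand-in for an LLM call: maps an observation to a tool action."""
    if "TODO" in observation:
        return {"tool": "edit", "args": {"text": observation.replace("TODO", "done")}}
    return {"tool": "finish", "args": {}}

def run_harness(initial_obs: str, max_steps: int = 8) -> str:
    """Minimal agent loop: observe -> act -> repeat, under a step budget."""
    obs = initial_obs
    for _ in range(max_steps):
        action = stub_model(obs)
        if action["tool"] == "finish":
            return obs
        # Apply the tool's effect; real harnesses sandbox this step.
        obs = action["args"]["text"]
    raise RuntimeError("step budget exhausted")  # guardrail: bounded autonomy
```

The interesting harness-engineering decisions live in what this sketch elides: how observations are compacted into context, how tool effects are sandboxed and verified, and how state is checkpointed so the loop can pause and resume.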

Contributing

Contributions are welcome. Please prefer resources that are:

  • Specific about how agents are constrained, evaluated, resumed, observed, or orchestrated
  • Original implementations, primary-source articles, or high-signal technical write-ups
  • Useful to practitioners building real harnesses instead of generic AI commentary

If two links say the same thing, prefer the more primary, practical, and implementation-oriented one.

See CONTRIBUTING.md for contribution guidelines and the preferred entry format.

License

CC0 1.0
