walkinglabs/awesome-harness-engineering

Awesome Harness Engineering

A curated list of articles, playbooks, benchmarks, specifications, and open-source projects for harness engineering: the practice of shaping the environment around AI agents so they can work reliably.

Harness engineering sits at the intersection of context engineering, evaluation, observability, orchestration, safe autonomy, and software architecture. This list focuses on resources that make agents more dependable in real workflows, especially long-running coding and research tasks.

Generic agent tooling is out of scope unless the page directly covers harness design, context management, evaluation, runtime control, or other reliability-critical harness primitives.

Contents

  • Courses & Learning Resources
  • Foundations
  • Context, Memory & Working State
  • Constraints, Guardrails & Safe Autonomy
  • Specs, Agent Files & Workflow Design
  • Evals & Observability
  • Benchmarks
  • Runtimes, Harnesses & Reference Implementations
  • Contributing
  • License

Courses & Learning Resources

  • walkinglabs/learn-harness-engineering - A project-based course repository on making Codex and Claude Code more reliable, centered on an Electron personal knowledge base app with lecture handouts, example artifacts, and practical harness projects.

Foundations

Context, Memory & Working State

Constraints, Guardrails & Safe Autonomy

Specs, Agent Files & Workflow Design

  • AGENTS.md - A lightweight open format for repo-local instructions that tell agents how to work inside a codebase.
  • agent.md - A related standardization effort for machine-readable agent instructions across projects and tools.
  • GitHub Spec Kit - GitHub's toolkit for spec-driven development, useful when you want agents to execute against explicit product and engineering specs.
  • Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl - Thoughtworks on why strong specs make AI-assisted software delivery more dependable.
  • 12 Factor Agents - HumanLayer's operating principles for production agents, including explicit prompts, state ownership, and clean pause-resume behavior.
  • 12-Factor AgentOps - An operations-oriented companion focused on context discipline, validation, and reproducible agent workflows.
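To make the AGENTS.md idea concrete, here is a minimal, hypothetical example of the kind of repo-local instruction file the format describes — the sections and rules shown are illustrative, not prescribed by the spec, which deliberately treats the file as free-form Markdown guidance:

```markdown
# AGENTS.md

## Setup
- Install dependencies with `npm install` before running anything.

## Testing
- Run `npm test` after every change; do not commit failing tests.

## Conventions
- TypeScript only; avoid `any` without a justifying comment.
- Keep changes scoped to the files the task names.
```

Because agents read this as plain guidance rather than a rigid schema, each repository can shape its own sections around its build, test, and review workflow.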

Evals & Observability

Benchmarks

These benchmarks are especially useful when you want to compare harness quality, not just model quality. They stress context handling, tool calling, environment control, verification logic, and the runtime scaffolding around the model.

  • Agent Arena - A leaderboard that ranks AI agents, models, tools, and frameworks using ELO-style ratings from head-to-head battles, providing a structured way to compare harness-level choices across categories.
  • AgentBench - A cross-environment benchmark spanning OS, databases, knowledge graphs, web browsing, and more, useful for seeing whether a harness generalizes beyond one narrow task loop.
  • AgentBoard - A benchmark for multi-turn LLM agents complemented by an analytical evaluation board for assessing model performance beyond final success rates, making partial-progress and trajectory quality visible.
  • AgentStudio - An integrated benchmark suite with realistic environments and comprehensive toolkits for evaluating virtual agents on real computer software, useful for measuring harness depth against a broad task surface.
  • AppWorld - A controllable world of apps and people for benchmarking interactive coding agents, with state-based and execution-based unit tests that surface harness quality around planning, code generation, and collateral-damage control.
  • AssistantBench - A benchmark that evaluates web agents on realistic, time-consuming research tasks requiring multi-step tool use and information synthesis, making it a good proxy for harness quality in long-horizon web scenarios.
  • BrowseComp - A benchmark that evaluates AI agents on locating hard-to-find information, stressing search strategy, context management, and retrieval harness design under difficult conditions.
  • BrowserGym Leaderboard - A gym environment and leaderboard for evaluating LLMs, VLMs, and agents on web navigation tasks, offering a reproducible framework for comparing harnesses across multiple web benchmarks in one place.
  • CharacterEval - A benchmark for evaluating role-playing conversational agents using multi-turn dialogues and character profiles, with metrics across four dimensions including character fidelity and conversational coherence.
  • ClawBench - A benchmark that evaluates AI agents across search, reasoning, coding, safety, and multi-turn conversation tasks, covering the breadth of harness demands in a single suite.
  • ClawWork - A real-world economic benchmark where AI agents complete professional tasks spanning 44 occupations, earning income while managing token costs and economic solvency, making it a direct test of harness efficiency under resource constraints.
  • Computer Agent Arena - An open evaluation platform where users compare LLM/VLM-based agents on real-world computer tasks ranging from general computer use to coding, data analysis, and video editing, surfacing harness differences across a wide task surface.
  • EvoClaw: Evaluating AI Agents on Continuous Software Evolution - A benchmark write-up on evaluating agents across dependent milestone sequences from real repository history, surfacing regression accumulation and long-horizon precision loss.
  • GAIA - A benchmark for general AI assistants that is often used to compare harness-level choices around tools, planning, verification, and long-horizon autonomy.
  • Galileo Agent Leaderboard - An open evaluation platform tracking LLM agents on task completion and tool calling across business domains, useful for comparing harness quality in enterprise-grade agentic scenarios.
  • GTA - A benchmark that evaluates the tool-use capability of LLM-based agents using human-written queries, real deployed tools, and authentic multimodal inputs, exposing harness gaps between isolated testing and real deployment.
  • HAL: Holistic Agent Leaderboard - A benchmark and leaderboard for agent systems with attention to reliability, cost, and broad task coverage, making it useful for comparing end-to-end harness behavior.
  • Introducing Terminal-Bench 2.0 and Harbor - The Terminal-Bench 2.0 announcement, useful for understanding the harder tasks and generalized evaluation harness behind Harbor.
  • LeetCode-Hard Gym - An RL environment interface to LeetCode's submission server for evaluating codegen agents, giving harnesses direct access to execution-based feedback on hard algorithmic problems.
  • LLM Colosseum Leaderboard - A platform that evaluates LLMs by having them fight in Street Fighter III, testing speed, adaptability, and real-time decision-making as proxies for harness responsiveness under tight latency constraints.
  • MAgIC - A benchmark measuring cognition, adaptability, rationality, and collaboration of LLMs in multi-agent systems, useful for evaluating how harnesses coordinate agent interactions and shared state.
  • MCP Bench - A benchmark for evaluating AI models on MCP server interactions, measuring tool accuracy, latency, and token use across server types, which directly reflects harness design choices around MCP integration.
  • MCP Universe - A leaderboard comparing AI model performance on MCP tasks, tracking how different models and harness configurations handle tool-augmented agent workflows.
  • MCPMark - A stress-testing benchmark for model and agent capabilities in real-world MCP tasks across tools like Notion, GitHub, and Postgres, making harness MCP integration quality directly measurable.
  • Olas Predict Benchmark - A benchmark for evaluating agents on historical prediction market data, testing harness design for research, retrieval, and forecasting in long-horizon reasoning tasks.
  • OSWorld - A real computer-use benchmark with 369 tasks across Ubuntu, Windows, and macOS, complete with initial-state setup and execution-based evaluators, making it excellent for testing desktop and multimodal harnesses.
  • OSWorld-MCP - An extension of OSWorld that evaluates AI agents on real-world computer tasks using the Model Context Protocol, making it useful for comparing MCP-enabled harnesses on a realistic desktop task suite.
  • SEC-bench - A benchmark for evaluating LLM agents on real-world software security tasks including vulnerability reproduction and patching, stressing harness design around code execution, containerized environments, and security-aware tooling.
  • SWE-bench Verified - A strong benchmark for software engineering agents working against real GitHub issues and tests, which makes harness choices around retrieval, patching, and validation highly visible.
  • τ-Bench - A benchmark that emulates dynamic conversations between a simulated user and a language agent equipped with domain-specific API tools and policy guidelines, making it useful for evaluating harnesses built around structured tool use and policy enforcement.
  • tau2-bench - A benchmark for realistic, multi-step agent tasks where success depends on tool use and execution quality rather than a single-shot answer.
  • Terminal-Bench - A benchmark suite for terminal-native agents operating in shells, filesystems, and verification-heavy environments, which is especially useful for comparing coding-agent harnesses.
  • TravelPlanner - A benchmark for evaluating LLM agents on tool use and complex planning within multiple constraints, revealing how harness design handles multi-constraint satisfaction and long-horizon planning.
  • VAB - VisualAgentBench evaluates large multimodal models as visual foundation agents across embodied, GUI, and visual design tasks, useful for comparing harnesses on visually grounded, multi-step agent workflows.
  • VisualWebArena - A benchmark for multimodal web agents on realistic visually grounded tasks, extending WebArena with image and screenshot inputs that stress harness support for visual context in browser environments.
  • WebArena - A standalone, self-hostable web environment for evaluating autonomous agents on realistic tasks, making it a reproducible baseline for comparing web-facing harness designs.
  • WebArena-Verified - A verified web-agent benchmark with curated tasks and deterministic evaluators over agent responses and captured network traces, making it a good fit for measuring web-facing harnesses.
  • WildClawBench - An in-the-wild benchmark running agents inside a live OpenClaw environment on 60 original tasks including multimodal, long-horizon, and safety-critical scenarios, making harness robustness under real-world conditions directly visible.
  • WorkArena - A benchmark for browser agents on common knowledge-work tasks, useful for comparing harnesses on realistic enterprise-style web workflows instead of toy browser tasks.
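Several of the leaderboards above (Agent Arena, Computer Agent Arena, LLM Colosseum) rank harnesses with Elo-style ratings accumulated from pairwise battles. As a rough sketch of how such a rating evolves, here is the standard Elo update; the K-factor of 32 is an assumption, and real platforms may use Bradley-Terry fits or other variants:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one head-to-head battle."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Two harnesses start at 1500; harness A wins one battle.
a, b = elo_update(1500.0, 1500.0, a_won=True)  # a -> 1516, b -> 1484
```

Note that the winner gains exactly what the loser drops, so total rating is conserved; leaderboards repeat this update over many sampled matchups until ratings stabilize.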

Runtimes, Harnesses & Reference Implementations

  • Agent Frameworks, Runtimes, and Harnesses, Oh My! - LangChain's decomposition of what belongs in a framework, a runtime, and a harness.
  • Building agents with the Claude Agent SDK - Anthropic's guide to a production-oriented agent SDK with sessions, tools, and orchestration support.
  • How we built our multi-agent research system - Anthropic's architecture write-up for a multi-agent system with separation of roles and structured coordination.
  • deepagents - LangChain's open-source project for building deeper, longer-running agents with middleware and harness patterns.
  • SWE-agent - A mature research coding agent that makes the harness, prompt, tools, and environment design directly inspectable.
  • SWE-ReX - Sandboxed code execution infrastructure for AI agents, useful when harness work starts to merge into execution runtime design.
  • AgentKit - Inngest's TypeScript toolkit for building durable, workflow-aware agents on top of event-driven infrastructure.
  • Harbor - A generalized harness for evaluating and improving agents at scale, released alongside Terminal-Bench 2.0.
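The projects above differ widely in scope, but most wrap the same core loop: the model proposes an action, the harness executes it in a controlled environment, and a step budget bounds autonomy. A bare-bones sketch with a stubbed model — every name here is illustrative and not any listed project's API:

```python
def stub_model(observation: str) -> dict:
    """Stand-in for an LLM call: maps an observation to a tool action."""
    if "TODO" in observation:
        return {"tool": "edit", "args": {"text": observation.replace("TODO", "done")}}
    return {"tool": "finish", "args": {}}

def run_harness(initial_obs: str, max_steps: int = 8) -> str:
    """Minimal agent loop: observe -> act -> repeat, under a step budget."""
    obs = initial_obs
    for _ in range(max_steps):
        action = stub_model(obs)
        if action["tool"] == "finish":
            return obs
        # Apply the tool's effect; real harnesses sandbox this step.
        obs = action["args"]["text"]
    raise RuntimeError("step budget exhausted")  # guardrail: bounded autonomy
```

The interesting harness-engineering decisions live in what this sketch elides: how observations are compacted into context, how tool effects are sandboxed and verified, and how state is checkpointed so the loop can pause and resume.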

Contributing

Contributions are welcome. Please prefer resources that are:

  • Specific about how agents are constrained, evaluated, resumed, observed, or orchestrated
  • Original implementations, primary-source articles, or high-signal technical write-ups
  • Useful to practitioners building real harnesses instead of generic AI commentary

If two links say the same thing, prefer the more primary, practical, and implementation-oriented one.

See CONTRIBUTING.md for contribution guidelines and the preferred entry format.

License

CC0 1.0
