Name	Name	Last commit message	Last commit date
parent directory ..
benchmarks	benchmarks
hermes_swe_env	hermes_swe_env
terminal_test_env	terminal_test_env
tool_call_parsers	tool_call_parsers
README.md	README.md
__init__.py	__init__.py
agent_loop.py	agent_loop.py
hermes_base_env.py	hermes_base_env.py
patches.py	patches.py
tool_context.py	tool_context.py

Hermes-Agent Atropos Environments

This directory contains the integration layer between hermes-agent's tool-calling capabilities and the Atropos RL training framework. It provides everything needed to run agentic LLMs through multi-turn tool-calling loops, score their output with arbitrary reward functions, and feed results into Atropos for training or evaluation.

Architecture Overview

                        Atropos Framework
                    ┌───────────────────────┐
                    │       BaseEnv          │  (atroposlib)
                    │  - Server management   │
                    │  - Worker scheduling   │
                    │  - Wandb logging       │
                    │  - CLI (serve/process/ │
                    │    evaluate)           │
                    └───────────┬───────────┘
                                │ inherits
                    ┌───────────┴───────────┐
                    │  HermesAgentBaseEnv    │  hermes_base_env.py
                    │  - Terminal backend    │
                    │  - Tool resolution     │
                    │  - Agent loop          │
                    │  - ToolContext          │
                    │  - Async patches       │
                    └───────────┬───────────┘
                                │ inherits
              ┌─────────────────┼─────────────────┐
              │                 │                  │
     TerminalTestEnv     HermesSweEnv    TerminalBench2EvalEnv
     (stack testing)     (SWE training)   (TB2 benchmark eval)

Inheritance Chain

BaseEnv (from atroposlib) is the Atropos base class. It provides:

Server management (OpenAI-compatible API servers, VLLM, SGLang)
Worker scheduling for parallel rollouts
Wandb integration for metrics and rollout logging
CLI interface with three subcommands: serve, process, evaluate
evaluate_log() for saving eval results to JSON + samples.jsonl

HermesAgentBaseEnv (hermes_base_env.py) extends BaseEnv with hermes-agent specifics:

Sets os.environ["TERMINAL_ENV"] to configure the terminal backend (local, docker, modal, ssh, singularity)
Resolves hermes-agent toolsets via _resolve_tools_for_group() (calls get_tool_definitions() which queries tools/registry.py)
Implements collect_trajectory() which runs the full agent loop and computes rewards
Supports two-phase operation (Phase 1: OpenAI server, Phase 2: VLLM ManagedServer)
Applies monkey patches for async-safe tool operation at import time

Concrete environments inherit from HermesAgentBaseEnv and implement:

setup() -- Load dataset, initialize state
get_next_item() -- Return the next item for rollout
format_prompt() -- Convert a dataset item into the user message
compute_reward() -- Score the rollout using ToolContext
evaluate() -- Periodic evaluation logic

Core Components

Agent Loop (`agent_loop.py`)

HermesAgentLoop is the reusable multi-turn agent engine. It runs the same pattern as hermes-agent's run_agent.py:

Send messages + tools to the API via server.chat_completion()
If the response contains tool_calls, execute each one via handle_function_call() (which delegates to tools/registry.py's dispatch())
Append tool results to the conversation and go back to step 1
If the response has no tool_calls, the agent is done

Tool calls are executed in a thread pool (run_in_executor) so backends that use asyncio.run() internally (Modal, Docker) don't deadlock inside Atropos's event loop.

Returns an AgentResult containing the full conversation history, turn count, reasoning content per turn, tool errors, and optional ManagedServer state (for Phase 2).

Tool Context (`tool_context.py`)

ToolContext is a per-rollout handle that gives reward/verification functions direct access to all hermes-agent tools, scoped to the rollout's task_id. The same task_id means the terminal/browser session is the SAME one the model used during its rollout -- all state (files, processes, browser tabs) is preserved.

async def compute_reward(self, item, result, ctx: ToolContext):
    # Run tests in the model's terminal sandbox
    test = ctx.terminal("pytest -v")
    if test["exit_code"] == 0:
        return 1.0

    # Check if a file was created
    content = ctx.read_file("/workspace/solution.py")
    if content.get("content"):
        return 0.5

    # Download files locally for verification (binary-safe)
    ctx.download_file("/remote/output.bin", "/local/output.bin")

    return 0.0

Available methods:

Terminal: terminal(command, timeout) -- run shell commands
Files: read_file(path), write_file(path, content), search(query, path)
Transfers: upload_file(), upload_dir(), download_file(), download_dir() -- binary-safe file transfers between host and sandbox
Web: web_search(query), web_extract(urls)
Browser: browser_navigate(url), browser_snapshot()
Generic: call_tool(name, args) -- call any hermes-agent tool by name
Cleanup: cleanup() -- release all resources (called automatically after compute_reward)

Patches (`patches.py`)

Problem: Some hermes-agent tools use asyncio.run() internally (e.g., mini-swe-agent's Modal backend via SWE-ReX). This crashes when called from inside Atropos's event loop because asyncio.run() cannot be nested.

Solution: patches.py monkey-patches SwerexModalEnvironment to use a dedicated background thread (_AsyncWorker) with its own event loop. The calling code sees the same sync interface, but internally the async work happens on a separate thread that doesn't conflict with Atropos's loop.

What gets patched:

SwerexModalEnvironment.__init__ -- creates Modal deployment on a background thread
SwerexModalEnvironment.execute -- runs commands on the same background thread
SwerexModalEnvironment.stop -- stops deployment on the background thread

The patches are:

Idempotent -- calling apply_patches() multiple times is safe
Transparent -- same interface and behavior, only the internal async execution changes
Universal -- works identically in normal CLI use (no running event loop)

Applied automatically at import time by hermes_base_env.py.

Tool Call Parsers (`tool_call_parsers/`)

Client-side parsers that extract structured tool_calls from raw model output text. Used in Phase 2 (VLLM server type) where ManagedServer's /generate endpoint returns raw text without tool call parsing.

Each parser is a standalone reimplementation of the corresponding VLLM parser's extract_tool_calls() logic. No VLLM dependency -- only standard library (re, json, uuid) and openai types.

Available parsers:

hermes -- Hermes/ChatML <tool_call> XML format
mistral -- Mistral [TOOL_CALLS] format
llama3_json -- Llama 3 JSON tool calling
qwen -- Qwen tool calling format
qwen3_coder -- Qwen3 Coder format
deepseek_v3 -- DeepSeek V3 format
deepseek_v3_1 -- DeepSeek V3.1 format
kimi_k2 -- Kimi K2 format
longcat -- Longcat format
glm45 / glm47 -- GLM model formats

Usage:

from environments.tool_call_parsers import get_parser

parser = get_parser("hermes")
content, tool_calls = parser.parse(raw_model_output)

In Phase 1 (OpenAI server type), these parsers are not needed -- the server handles tool call parsing natively.

Two-Phase Operation

Phase 1: OpenAI Server (Evaluation / SFT Data Generation)

Uses server.chat_completion() with tools= parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns ChatCompletion objects with structured tool_calls.

Good for: evaluation, SFT data generation, testing
Run with: serve (with run-api), process, or evaluate subcommands
Placeholder tokens are created for the Atropos pipeline

Phase 2: VLLM ManagedServer (Full RL Training)

Uses ManagedServer for exact token IDs + logprobs via /generate. Client-side tool call parser (from tool_call_parsers/) reconstructs structured tool_calls from raw output.

Good for: full RL training with GRPO/PPO
Run with: serve subcommand
Real tokens, masks, and logprobs flow through the pipeline

Directory Structure

environments/
├── README.md                     # This file
├── __init__.py                   # Package exports
├── hermes_base_env.py            # Abstract base (HermesAgentBaseEnv)
├── agent_loop.py                 # Multi-turn agent engine (HermesAgentLoop)
├── tool_context.py               # Per-rollout tool access for reward functions
├── patches.py                    # Async-safety patches for Modal backend
│
├── tool_call_parsers/            # Phase 2 client-side parsers
│   ├── __init__.py               # Registry + base class
│   ├── hermes_parser.py
│   ├── mistral_parser.py
│   ├── llama_parser.py
│   ├── qwen_parser.py
│   ├── qwen3_coder_parser.py
│   ├── deepseek_v3_parser.py
│   ├── deepseek_v3_1_parser.py
│   ├── kimi_k2_parser.py
│   ├── longcat_parser.py
│   ├── glm45_parser.py
│   └── glm47_parser.py
│
├── terminal_test_env/            # Stack validation environment
│   └── terminal_test_env.py
│
├── hermes_swe_env/               # SWE-bench style training environment
│   └── hermes_swe_env.py
│
└── benchmarks/                   # Evaluation benchmarks
    └── terminalbench_2/
        └── terminalbench2_env.py

Concrete Environments

TerminalTestEnv (`terminal_test_env/`)

A self-contained environment with inline tasks (no external dataset needed) for validating the full stack end-to-end. Each task asks the model to create a file at a known path, and the verifier checks the content matches.

# Serve mode (needs run-api)
run-api
python environments/terminal_test_env/terminal_test_env.py serve

# Process mode (no run-api, saves to JSONL)
python environments/terminal_test_env/terminal_test_env.py process \
    --env.data_path_to_save_groups terminal_test_output.jsonl

HermesSweEnv (`hermes_swe_env/`)

SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.

python environments/hermes_swe_env/hermes_swe_env.py serve \
    --openai.model_name YourModel \
    --env.dataset_name bigcode/humanevalpack \
    --env.terminal_backend modal

TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)

Eval-only environment for the Terminal-Bench 2.0 benchmark (89 tasks). Each task gets a pre-built Docker Hub image, a natural language instruction, and a test suite. The agent uses terminal + file tools to solve the task, then the test suite verifies correctness.

Follows the standard Atropos eval pattern (like GPQA, MMLU, etc.):

Run via evaluate subcommand (no run-api needed)
setup() loads the dataset, evaluate() runs all tasks
rollout_and_score_eval() handles per-task agent loop + test verification
Downloads verifier output locally for reliable reward checking (Harbor pattern)

# Run full benchmark
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
    --openai.model_name anthropic/claude-opus-4.6

# Run subset of tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
    --openai.model_name anthropic/claude-opus-4.6 \
    --env.task_filter fix-git,git-multibranch

# Skip specific tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
    --openai.model_name anthropic/claude-opus-4.6 \
    --env.skip_tasks heavy-task,slow-task

Creating a New Environment

Training Environment

Create a new directory under environments/
Create your env file inheriting from HermesAgentBaseEnv
Implement the four abstract methods + evaluate()

from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig

class MyEnvConfig(HermesAgentEnvConfig):
    pass  # Add custom fields as needed

class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls):
        env_config = MyEnvConfig(
            enabled_toolsets=["terminal", "file"],
            terminal_backend="modal",
            # ... other config
        )
        server_configs = [APIServerConfig(...)]
        return env_config, server_configs

    async def setup(self):
        self.dataset = load_dataset(...)
        self.iter = 0

    async def get_next_item(self):
        item = self.dataset[self.iter % len(self.dataset)]
        self.iter += 1
        return item

    def format_prompt(self, item):
        return item["instruction"]

    async def compute_reward(self, item, result, ctx):
        # ctx gives you full tool access to the rollout's sandbox
        test = ctx.terminal("pytest -v")
        return 1.0 if test["exit_code"] == 0 else 0.0

    async def evaluate(self, *args, **kwargs):
        # Periodic evaluation logic
        ...

if __name__ == "__main__":
    MyEnv.cli()

Eval-Only Environment (Benchmark)

For eval benchmarks, follow the pattern in terminalbench2_env.py:

Create under environments/benchmarks/your-benchmark/
Inherit from HermesAgentBaseEnv
Set eval-only config: eval_handling=STOP_TRAIN, steps_per_eval=1, total_steps=1
Stub the training methods (collect_trajectories, score)
Implement rollout_and_score_eval() and evaluate()
Run with evaluate subcommand

Key Config Fields

Field	Description	Default
`enabled_toolsets`	Which hermes toolsets to enable	`None` (all)
`disabled_toolsets`	Toolsets to disable	`None`
`distribution`	Probabilistic toolset distribution name	`None`
`max_agent_turns`	Max LLM calls per rollout	`30`
`agent_temperature`	Sampling temperature	`1.0`
`terminal_backend`	`local`, `docker`, `modal`, `ssh`, `singularity`	`local`
`system_prompt`	System message for the agent	`None`
`tool_call_parser`	Parser name for Phase 2	`hermes`
`eval_handling`	`STOP_TRAIN`, `LIMIT_TRAIN`, `NONE`	`STOP_TRAIN`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Hermes-Agent Atropos Environments

Architecture Overview

Inheritance Chain

Core Components

Agent Loop (`agent_loop.py`)

Tool Context (`tool_context.py`)

Patches (`patches.py`)

Tool Call Parsers (`tool_call_parsers/`)

Two-Phase Operation

Phase 1: OpenAI Server (Evaluation / SFT Data Generation)

Phase 2: VLLM ManagedServer (Full RL Training)

Directory Structure

Concrete Environments

TerminalTestEnv (`terminal_test_env/`)

HermesSweEnv (`hermes_swe_env/`)

TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)

Creating a New Environment

Training Environment

Eval-Only Environment (Benchmark)

Key Config Fields

FilesExpand file tree

environments

Directory actions

More options

Directory actions

More options

Latest commit

History

environments

Folders and files

parent directory

README.md

Hermes-Agent Atropos Environments

Architecture Overview

Inheritance Chain

Core Components

Agent Loop (agent_loop.py)

Tool Context (tool_context.py)

Patches (patches.py)

Tool Call Parsers (tool_call_parsers/)

Two-Phase Operation

Phase 1: OpenAI Server (Evaluation / SFT Data Generation)

Phase 2: VLLM ManagedServer (Full RL Training)

Directory Structure

Concrete Environments

TerminalTestEnv (terminal_test_env/)

HermesSweEnv (hermes_swe_env/)

TerminalBench2EvalEnv (benchmarks/terminalbench_2/)

Creating a New Environment

Training Environment

Eval-Only Environment (Benchmark)

Key Config Fields

Agent Loop (`agent_loop.py`)

Tool Context (`tool_context.py`)

Patches (`patches.py`)

Tool Call Parsers (`tool_call_parsers/`)

TerminalTestEnv (`terminal_test_env/`)

HermesSweEnv (`hermes_swe_env/`)

TerminalBench2EvalEnv (`benchmarks/terminalbench_2/`)