MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Zhiheng Song¹, Jingshuai Zhang¹, Chuan Qin†, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu†
AMAP, Alibaba Group

¹Equal contribution. †Corresponding authors.

Note: This work is currently under review. The full dataset will be released progressively.

📖 Overview

MobilityBench is a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. It is built from large-scale, anonymized user queries collected from Amap, covering a wide range of route-planning intents across multiple cities worldwide.

To support reproducible end-to-end evaluation, MobilityBench includes a deterministic API-replay sandbox that removes environmental variance from live services. It also introduce a multi-dimensional evaluation protocol centered on outcome validity, complemented by evaluations of instruction understanding, planning, tool use, and efficiency.

Figure 1: Overview of MobilityBench, a systematic benchmark for evaluating route-planning agents.

📂 Dataset

MobilityBench is a scalable benchmark for evaluating route-planning agents in real-world mobility scenarios. It is built from large-scale, anonymized mobility queries from Amap, organized with a comprehensive task taxonomy, and provides structured ground truth (required tool calls + verifiable evidence). All tool calls are executed in a deterministic replay sandbox for reproducible, multi-dimensional evaluation.

Scale & Coverage: 100,000 episodes across 22 countries and 350+ cities (including metropolitan areas), with a long-tailed geographic distribution.

Scenario Distribution (11 intents)

36.6% Basic Information Retrieval
9.6% Route-Dependent Information Retrieval
42.5% Basic Route Planning
11.3% Preference-Constrained Route Planning

Download: HuggingFace - MobilityBench -> place files into data/

Data Format

Field	Description
`query`	User query text
`context`	Context information (JSON, e.g., current location, city)
`task_scenario`	Fine-grained task category
`intent_family`	Coarse-grained intent category for evaluation aggregation
`tool_list`	Expected tool calls (JSON array)
`route_ans`	Ground truth route answer (JSON)

Sample Data (5 Examples)

Query	Task Scenario	Intent Family
去大石桥不走高速 Go to Dashiqiao without taking the highway.	Option-Constrained Route Planning	Preference-Constrained Route Planning
现在成都大道会堵车吗？看一下地图，会不会堵 Is Chengdu Avenue congested now? Looking at the map, is it likely to be congested?	Traffic Info Query	Basic Route Planning
我在哪 Where am I?	Geolocation Query	Basic Information Retrieval
知道离滇池会展中心有多远 How far it is from Dianchi Convention and Exhibition Center?	Route Property Query	Route-Dependent Information Retrieval
到寨河收费站入口不走高速 To reach the Zhaihe toll station entrance without taking the highway.	Option-Constrained Route Planning	Preference-Constrained Route Planning

🚀 Getting Started

1. Install

Requirements: Python 3.12+, uv (recommended) or pip

git clone https://github.com/your-org/mobility-bench.git
cd MobilityBench-main

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .

# With evaluation dependencies
uv sync --extra eval

2. Configure Environment

Create a .env file with your LLM API credentials:

LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=your-api-key

3. Download Dataset

Note: Before running, download the dataset from HuggingFace and place the files into data/datasets/.

4. Run Benchmark

# Run benchmark with default settings (plan_and_execute framework)
mbench run --model gpt4.1 --dataset data/datasets/sample_10.csv

# Run with ReAct framework
mbench run --model gpt4.1 --framework react

# Run multiple models in parallel
mbench run --models gpt4.1,claude-opus-4-5 --parallel 4

# Enable sandbox mode (offline evaluation)
mbench run --model gpt4.1 --sandbox

# Resume an interrupted run
mbench run --model gpt4.1 --resume run_20260215_120000

5. Evaluate Results

# Evaluate a single run
mbench eval --run-id run_20260215_120000

# Evaluate with specific metrics
mbench eval --run-id run_20260215_120000 --metrics tool,answer,planning

6. Generate Reports

# Generate report (markdown / html / excel)
mbench report --run-id run_20260215_120000 --format excel

# Compare multiple runs
mbench report --run-id run_001 --compare run_002

💻 CLI Commands

Command	Description
`mbench run`	Run agent benchmark on dataset
`mbench eval`	Evaluate agent run results
`mbench report`	Generate evaluation reports
`mbench config`	Manage configuration files
`mbench version`	Show version information

Run Command Options

Option	Description	Default
`--model, -m`	Model name to use	-
`--models`	Multiple models (comma-separated)	-
`--dataset, -d`	Dataset name or path	`mobility_6262`
`--framework, -f`	Agent framework (`plan_and_execute` or `react`)	`plan_and_execute`
`--config, -c`	Configuration file path	-
`--output-dir, -o`	Output directory	-
`--parallel, -p`	Parallelism level	`1`
`--sandbox/--live`	Use sandbox or live tools	`--sandbox`
`--resume`	Resume from run ID	-
`--dry-run`	Validate config only	`false`

Examples

# Show all available options
mbench --help
mbench run --help

# Run with custom configuration
mbench run --config configs/models/default.yaml --model qwen3-235b

# Initialize configuration templates
mbench config init --template full

# Validate configuration
mbench config validate --config configs/models/default.yaml

⚙️ Configuration

Model Configuration

# configs/models/default.yaml
models:
  gpt4.1:
    provider: openai
    base_url: ${LLM_BASE_URL}
    api_key: ${LLM_API_KEY}
    temperature: 0.1
    max_tokens: 8192
    roles:
      planner: gpt-41-0414-global
      worker: gpt-41-0414-global
      reporter: gpt-41-0414-global

Environment Variables

Variable	Description
`LLM_BASE_URL`	Base URL for LLM API
`LLM_API_KEY`	API key for authentication

📁 Project Structure

mobility-bench/
├── configs/                        # YAML configuration files
│   ├── agent/                      # Agent behavior settings
│   ├── datasets/                   # Dataset configuration
│   ├── evaluation/                 # Evaluation settings
│   └── models/                     # Model provider configs
├── data/
│   ├── datasets/                   # Benchmark datasets
│   ├── sandbox/                    # Cached API responses for replay
│   └── results/                    # Run outputs & evaluation results
├── src/mobility_bench/
│   ├── cli/                        # CLI commands (run / eval / report / config)
│   ├── agent/                      # Agent implementation
│   │   ├── graph/                  # LangGraph state & builder
│   │   ├── roles/                  # LLM manager & agent factory
│   │   ├── frameworks/             # Plan-and-Execute / ReAct
│   │   └── prompts/                # Prompt templates (CN & EN)
│   ├── tools/                      # Tool registry & sandbox implementations
│   ├── evaluation/                 # Evaluation engine & 5 metric modules
│   ├── dataset/                    # Dataset schema & loader
│   ├── runner/                     # Batch execution (sequential & parallel)
│   ├── reporting/                  # Report generators (MD / HTML / Excel)
│   ├── config/                     # Configuration management
│   └── utils/                      # Shared utilities
├── pyproject.toml
└── uv.lock

🤖 Supported Models

Model	Provider
GPT-5.2	OpenAI
GPT-4.1	OpenAI
Claude-Opus-4.5	Anthropic
Claude-Sonnet-4.5	Anthropic
Gemini-3-Flash-Preview	Google
Gemini-3-Pro-Preview	Google
DeepSeek-V3.2-Exp	DeepSeek
DeepSeek-R1	DeepSeek
Qwen3-4B	Alibaba
Qwen3-30B-A3B	Alibaba
Qwen3-32B	Alibaba
Qwen3-235B-A22B	Alibaba

🏗️ Supported Architecture

MobilityBench supports two agent frameworks powered by LangGraph:

Plan-and-Execute Framework (Default)

A Planner-Worker-Reporter architecture for structured task execution:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Planner   │────▶│   Worker    │────▶│  Reporter   │
│  (Planning) │◀────│ (Execution) │     │  (Summary)  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                    ┌──────┴──────┐
                    │  Tool Call  │
                    │ (Map APIs)  │
                    └─────────────┘

Planner: Analyzes user requirements, creates structured plans, dynamically adjusts based on results
Worker: Executes tool calls based on the plan, supports parallel task execution
Reporter: Generates comprehensive natural language reports from execution results

ReAct Framework

A Reasoning-Action-Observation loop for iterative problem solving:

┌─────────────────────────────────────────────┐
│                                             │
│  ┌──────────┐   ┌───────────┐   ┌─────────┐ │
│  │ Reasoning│──▶│  Action   │──▶│Observat.│ │
│  │ (Think)  │   │(Tool Call)│   │(Result) │ │
│  └──────────┘   └───────────┘   └────┬────┘ │
│       ▲                              │      │
│       └──────────────────────────────┘      │
│                                             │
└─────────────────────────────────────────────┘

Reasoning: Analyzes current state and decides next action
Action: Executes tool call or finishes task
Observation: Processes tool results and feeds back to reasoning

Benchmark Results

Figure 2: Overall performance.

📚 Citation

If you find our paper and code helpful for your research, please consider starring our repository ⭐ and citing our work ✏️.

@article{song2026mobilitybench,
  title={MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios},
  author={Song, Zhiheng and Zhang, Jingshuai and Qin, Chuan and Wang, Chao and Chen, Chao and Xu, Longfei and Liu, Kaikui and Chu, Xiangxiang and Zhu, Hengshu},
  journal={arXiv preprint arXiv:2602.22638},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
configs		configs
data		data
embedding-model		embedding-model
figure		figure
src		src
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

📖 Overview

📂 Dataset

Scenario Distribution (11 intents)

Data Format

Sample Data (5 Examples)

🚀 Getting Started

1. Install

2. Configure Environment

3. Download Dataset

4. Run Benchmark

5. Evaluate Results

6. Generate Reports

💻 CLI Commands

Run Command Options

Examples

⚙️ Configuration

Model Configuration

Environment Variables

📁 Project Structure

🤖 Supported Models

🏗️ Supported Architecture

Plan-and-Execute Framework (Default)

ReAct Framework

Benchmark Results

📚 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

AMAP-ML/MobilityBench

Folders and files

Latest commit

History

Repository files navigation

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

📖 Overview

📂 Dataset

Scenario Distribution (11 intents)

Data Format

Sample Data (5 Examples)

🚀 Getting Started

1. Install

2. Configure Environment

3. Download Dataset

4. Run Benchmark

5. Evaluate Results

6. Generate Reports

💻 CLI Commands

Run Command Options

Examples

⚙️ Configuration

Model Configuration

Environment Variables

📁 Project Structure

🤖 Supported Models

🏗️ Supported Architecture

Plan-and-Execute Framework (Default)

ReAct Framework

Benchmark Results

📚 Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages