A benchmark to evaluate Computer-Use Agents on PowerPoint tasks.
We use Docker to sandbox GUI-based computer-use agents, so make sure Docker is installed before running the benchmark. CLI-based benchmarking does not need Docker.
Please follow the instructions in SETUP.md.
# With pip
pip install -e .
# Or with uv
uv sync

We recommend creating a .env file in the repo root to provide the needed environment variables.
CLIENT_ID=<See SETUP.md>
RUBRIC_DEFAULT_LLM="anthropic/claude-sonnet-4-20250514" # Model used for VLM calls in verifiers
ANTHROPIC_API_KEY="..."
ANTHROPIC_BASE_URL="..."
# Add any other endpoints/keys needed for benchmarking your specific model/agent

To populate your OneDrive account with the needed benchmark PPT files, use the following script.
hydrate_data.py is a single script that
- downloads source .pptx files from a URL list into a temp folder,
- uploads each one to OneDrive,
- opens it in PowerPoint Online (headless Playwright) and downloads both the mutated .pptx and the slide-image .zip into --output-dir,
- deletes the temp folder.
The default temp folder is data/files/PowerPoint, which already contains the
canonical files.txt:
Step 3 is important because opening the file in PowerPoint Online normalizes the metadata stored in the .pptx file.
This reduces spurious differences when benchmark verifiers compare agent-modified files against the original task files.
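To illustrate why normalized metadata matters for file comparison, here is a minimal sketch of a part-level diff between two .pptx files. It is not the benchmark's verifier. It assumes that document metadata lives in the docProps/ parts of the package, and the function names are hypothetical.

```python
import hashlib
import zipfile

# Assumption: metadata sits in the docProps/ parts of the OOXML package.
METADATA_PREFIXES = ("docProps/",)


def part_digests(pptx_path, ignore_prefixes=METADATA_PREFIXES):
    """Map each non-metadata part name in a .pptx (a ZIP archive) to its SHA-256 digest."""
    digests = {}
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            if name.startswith(ignore_prefixes):
                continue  # skip metadata parts so they cannot cause a mismatch
            digests[name] = hashlib.sha256(archive.read(name)).hexdigest()
    return digests


def changed_parts(original_path, modified_path):
    """Return the sorted names of parts that were added, removed, or altered."""
    before = part_digests(original_path)
    after = part_digests(modified_path)
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Without the ignore list, two files with identical slides but different saved properties would be reported as different, which is the kind of spurious difference that normalization avoids.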
Note that Step 3 requires Playwright to be installed:
python -m playwright install chromium
python hydrate_data.py \
--urls-file data/files/PowerPoint/files.txt \
--local-folder _tmp_pptx_downloads \
--onedrive-folder /PPTEval \
--output-dir data/files/PowerPoint \
--allow-data-dir \
--cleanup-local-folder

Override --output-dir to write somewhere else; the temp folder
(--local-folder) can be a path to any temp folder and is removed at the end when
--cleanup-local-folder is passed. CLIENT_ID must be set (in the environment
or .env) for the OneDrive upload step.
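Since a missing CLIENT_ID only surfaces at the upload step, a quick pre-flight check can save a failed run. The sketch below is a standalone helper, not part of hydrate_data.py. How the repo itself loads .env is not stated here, so the parser handles only simple KEY=VALUE lines, and the function names are hypothetical.

```python
import os
from pathlib import Path


def parse_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and full-line # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ("'", '"'):
            # Quoted value: keep the text between the quotes, drop any trailing comment.
            end = value.find(value[0], 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop an inline " # comment" if present.
            value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_variables(required, env_file=".env", environ=None):
    """Return the required names that are set neither in the environment nor in env_file."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_file)
    return [
        name
        for name in required
        if not (environ.get(name) or file_values.get(name))
    ]
```

For example, calling `missing_variables(["CLIENT_ID"])` from the repo root returns an empty list when CLIENT_ID is available from either source.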
# Run on Selected tasks
python -m ppteval.run_benchmark --agent-config ppteval/configs/cua.yaml --task-ids "3-002" --max-steps 30
# Run on Whole benchmark with 3 threads for concurrent task evaluation.
# We recommend --concurrent to be <= 3 to minimize infra failures/timeouts.
python -m ppteval.run_benchmark --agent-config ppteval/configs/claude-4-sonnet.yaml --concurrent 3 --max-steps 30

We have also added support for benchmarking models with the Claude Code CLI:
The claude-code-*.yaml presets (e.g. claude-code-opus-4-5.yaml,
claude-code-opus-4-7.yaml) drive the Claude Code CLI (claude) inside a
per-task workspace seeded from claude-workspace/.
Prerequisites:
- Install the Claude Code CLI and authenticate:
  claude --version   # confirm installed
  claude login       # OAuth; runs once, credentials persist in keychain
- Install the Anthropic pptx skill into claude-workspace/.claude/skills/pptx/. This skill is proprietary to Anthropic and is not redistributed with this repo. Obtain it from your Anthropic skills source and drop it in so the directory looks like:
  claude-workspace/
    CLAUDE.md
    .claude/
      skills/
        pptx/
          SKILL.md
          LICENSE.txt
          ...
  Anything under claude-workspace/.claude/ is git-ignored by *workspace/*.
- Run the benchmark with the desired CLI agent config:
  python -m ppteval.run_benchmark \
    --agent-config ppteval/configs/claude-code-opus-4-5.yaml \
    --concurrent 4
The CLI agent does NOT use a Docker sandbox or Office Online — it edits
.pptx files programmatically inside its workspace. No SSH tunnels or vLLM
endpoints are required.
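The prerequisites above can be checked before launching a run. The following is a rough sketch under the assumption that the claude executable is on PATH and that the skill ships a SKILL.md file at the path described above. The helper name is hypothetical, and it cannot verify that `claude login` has been completed.

```python
import shutil
from pathlib import Path


def claude_code_preflight(workspace="claude-workspace", cli_name="claude"):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    # Check 1: the CLI executable can be resolved.
    if shutil.which(cli_name) is None:
        problems.append(f"'{cli_name}' was not found on PATH")
    # Check 2: the pptx skill has been dropped into the workspace.
    skill_file = Path(workspace) / ".claude" / "skills" / "pptx" / "SKILL.md"
    if not skill_file.is_file():
        problems.append(f"missing skill file: {skill_file.as_posix()}")
    return problems


if __name__ == "__main__":
    for problem in claude_code_preflight():
        print("preflight:", problem)
```

Returning a list rather than raising on the first failure lets both problems be reported in one pass.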
@inproceedings{ppteval2026,
title = {PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks},
author = {Gandhi, Apurva and Suryanarayanan, Vishwas and Anwar, Raja Hasnain and Shaik, Firoz and Desai, Shubhang and Nguyen, Thong Q. and Raza, Muhammad Taqi and Chowdhary, Vishal and Neubig, Graham},
booktitle = {Forty-third International Conference on Machine Learning},
year = {2026},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR}
}

- Install development dependencies:
  pip install -e ".[dev]"
- Make your changes
- Run tests:
  pytest
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
