GitHub - microsoft/ppteval: Benchmark computer-use agents on PowerPoint tasks.

A benchmark to evaluate Computer-Use Agents on PowerPoint tasks.

Installation

Docker

We use Docker to sandbox GUI-based computer-use agents. Install Docker before running the benchmark. CLI-based benchmarking does not need Docker.

PPTOnline and OneDrive setup

Please follow the instructions in SETUP.md.

ppteval Python Package

# With pip
pip install -e .
# Or with uv
uv sync

Env Vars

We recommend creating a .env file in the repo root to provide the needed environment variables.

CLIENT_ID=<See SETUP.md>
RUBRIC_DEFAULT_LLM="anthropic/claude-sonnet-4-20250514" # Model used for VLM calls in verifiers
ANTHROPIC_API_KEY="..."
ANTHROPIC_BASE_URL="..."

# Add any other endpoints/keys needed for benchmarking your specific model/agent
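Before a run, it can help to confirm that the variables above are actually set. The following is a minimal sketch and not part of ppteval: `parse_env_text` and `missing_vars` are hypothetical helper names, the parser only understands simple `KEY=value` lines with optional quotes and trailing comments, and the set of names treated as required is an assumption based on the example above.

```python
import re

# Names taken from the example .env above; which ones are truly required
# depends on your model/agent, so treat this tuple as an assumption.
REQUIRED_VARS = ("CLIENT_ID", "RUBRIC_DEFAULT_LLM", "ANTHROPIC_API_KEY")

_LINE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$")


def parse_env_text(text):
    """Parse simple KEY=value lines; skips blanks and full-line comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep everything up to the matching closing quote.
            end = value.find(value[0], 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop a trailing " # comment" if present.
            value = value.split(" #", 1)[0].strip()
        values[key] = value
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]
```

For example, `missing_vars(parse_env_text(open(".env").read()))` returns an empty list when every name in `REQUIRED_VARS` has a non-empty value.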

Hydrating PowerPoint data

To populate your OneDrive account with the needed benchmark PPT files, use the following script.

hydrate_data.py is a single script that:

1. downloads source .pptx files from a URL list into a temp folder,
2. uploads each one to OneDrive,
3. opens it in PowerPoint Online (headless Playwright) and downloads both the mutated .pptx and the slide-image .zip into --output-dir,
4. deletes the temp folder.
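The steps above can be sketched as a small pipeline. This is an illustration of the control flow only, not the code in hydrate_data.py: the `hydrate` function and its three callables (`download`, `upload`, `normalize`) are hypothetical placeholders supplied by the caller, and the temp folder is always removed here, whereas the real script removes it only when asked to.

```python
import shutil
import tempfile
from pathlib import Path


def hydrate(urls, output_dir, download, upload, normalize):
    """Run the four steps in order and return the files written.

    download(url, folder)       -> Path of the fetched .pptx
    upload(path)                -> identifier of the uploaded OneDrive item
    normalize(item, output_dir) -> list of files written to output_dir
    All three callables are placeholders, not ppteval APIs.
    """
    temp_folder = Path(tempfile.mkdtemp(prefix="pptx_downloads_"))
    written = []
    try:
        # Step 1: fetch every source file into the temp folder.
        local_paths = [download(url, temp_folder) for url in urls]
        for local_path in local_paths:
            # Step 2: push the file to OneDrive.
            item = upload(local_path)
            # Step 3: open it online and save the normalized outputs.
            written.extend(normalize(item, output_dir))
    finally:
        # Step 4: remove the temp folder even if a step failed.
        shutil.rmtree(temp_folder, ignore_errors=True)
    return written
```

Putting the cleanup in a `finally` block is a design choice of this sketch: it keeps partial downloads from piling up when an upload or browser step fails.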

The default temp folder is data/files/PowerPoint, which already contains the canonical files.txt:

Step 3 is important because opening the file in PowerPoint Online normalizes the metadata stored in the .pptx file. This reduces spurious differences when benchmark verifiers compare agent-modified files against the original task files.
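To see why normalization matters, recall that a .pptx file is a ZIP archive of XML parts, with document properties such as author and timestamps stored under docProps/. The sketch below is not the benchmark's verifier, and it does not claim to know exactly what PowerPoint Online rewrites. It only shows how metadata-only differences can make two otherwise identical decks compare as different. `part_hashes` and `changed_parts` are hypothetical helpers, and treating the docProps/ prefix as "metadata" is an assumption.

```python
import hashlib
import zipfile


def part_hashes(pptx_file):
    """Map each archive member name to the SHA-256 of its contents."""
    with zipfile.ZipFile(pptx_file) as archive:
        return {
            name: hashlib.sha256(archive.read(name)).hexdigest()
            for name in archive.namelist()
        }


def changed_parts(original, modified, ignore_prefixes=("docProps/",)):
    """Return sorted names of parts that were added, removed, or altered."""
    before, after = part_hashes(original), part_hashes(modified)
    names = set(before) | set(after)
    return sorted(
        name
        for name in names
        if not name.startswith(tuple(ignore_prefixes))
        and before.get(name) != after.get(name)
    )
```

Calling `changed_parts(a, b, ignore_prefixes=())` reports every differing part, including metadata, while the default call reports only differences outside docProps/.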

Note that this step requires Playwright to be installed.

python -m playwright install chromium

python hydrate_data.py \
    --urls-file data/files/PowerPoint/files.txt \
    --local-folder _tmp_pptx_downloads \
    --onedrive-folder /PPTEval \
    --output-dir data/files/PowerPoint \
    --allow-data-dir \
    --cleanup-local-folder

Override --output-dir to write somewhere else. The temp folder (--local-folder) can be any path and is removed at the end when --cleanup-local-folder is passed. CLIENT_ID must be set (in the environment or in .env) for the OneDrive upload step.

Benchmarking a GUI-based Computer-Use Agent

# Run on selected tasks
python -m ppteval.run_benchmark --agent-config ppteval/configs/cua.yaml --task-ids "3-002" --max-steps 30
# Run on the whole benchmark with 3 threads for concurrent task evaluation.
# We recommend keeping --concurrent <= 3 to minimize infra failures/timeouts.
python -m ppteval.run_benchmark --agent-config ppteval/configs/claude-4-sonnet.yaml --concurrent 3 --max-steps 30
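When runs are launched from a script, the same commands can be assembled in Python. This is a sketch rather than a ppteval API: `build_benchmark_command` is a hypothetical helper, it uses only the flags shown above, and capping `--concurrent` at 3 simply encodes the recommendation in the comment above.

```python
import sys

# Upper bound suggested above for GUI-based runs.
RECOMMENDED_MAX_CONCURRENT = 3


def build_benchmark_command(agent_config, task_id=None, concurrent=None, max_steps=30):
    """Return the argument list for one ppteval.run_benchmark invocation."""
    command = [
        sys.executable, "-m", "ppteval.run_benchmark",
        "--agent-config", agent_config,
    ]
    if task_id is not None:
        # A single task ID, passed the same way as in the example above.
        command += ["--task-ids", task_id]
    if concurrent is not None:
        # Clamp to the range 1..RECOMMENDED_MAX_CONCURRENT.
        capped = max(1, min(concurrent, RECOMMENDED_MAX_CONCURRENT))
        command += ["--concurrent", str(capped)]
    command += ["--max-steps", str(max_steps)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repo root.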

Benchmarking CLI-Based Agents

We also support benchmarking models through the Claude Code CLI.

The claude-code-*.yaml presets (e.g. claude-code-opus-4-5.yaml, claude-code-opus-4-7.yaml) drive the Claude Code CLI (claude) inside a per-task workspace seeded from claude-workspace/.

Prerequisites:

Install the Claude Code CLI and authenticate:

claude --version          # confirm installed
claude login              # OAuth — runs once, credentials persist in keychain

Install the Anthropic pptx skill into claude-workspace/.claude/skills/pptx/. This skill is proprietary to Anthropic and is not redistributed with this repo. Obtain it from your Anthropic skills source and drop it in so the directory looks like:
```
claude-workspace/
  CLAUDE.md
  .claude/
    skills/
      pptx/
        SKILL.md
        LICENSE.txt
        ...
```
Anything under claude-workspace/.claude/ is git-ignored by *workspace/*.

Run the benchmark with the desired CLI agent config:

python -m ppteval.run_benchmark \
  --agent-config ppteval/configs/claude-code-opus-4-5.yaml \
  --concurrent 4

The CLI agent does NOT use a Docker sandbox or Office Online — it edits .pptx files programmatically inside its workspace. No SSH tunnels or vLLM endpoints are required.
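Because the pptx skill has to be installed by hand, a quick layout check before launching can save a failed run. The sketch below is an assumption-laden convenience, not something shipped in the repo: `pptx_skill_problems` is a hypothetical helper, and it checks only the entries named in the directory listing above (CLAUDE.md and SKILL.md).

```python
from pathlib import Path


def pptx_skill_problems(workspace="claude-workspace"):
    """Return human-readable problems; an empty list means the layout looks right."""
    root = Path(workspace)
    skill_dir = root / ".claude" / "skills" / "pptx"
    problems = []
    if not (root / "CLAUDE.md").is_file():
        problems.append(f"missing {root / 'CLAUDE.md'}")
    if not skill_dir.is_dir():
        problems.append(f"missing directory {skill_dir}")
    elif not (skill_dir / "SKILL.md").is_file():
        problems.append(f"missing {skill_dir / 'SKILL.md'}")
    return problems
```

Run it from the repo root and print the returned list; it says nothing about whether the skill's contents are valid, only that the expected files are in place.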

Citation

@inproceedings{ppteval2026,
  title     = {PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks},
  author    = {Gandhi, Apurva and Suryanarayanan, Vishwas and Anwar, Raja Hasnain and Shaik, Firoz and Desai, Shubhang and Nguyen, Thong Q. and Raza, Muhammad Taqi and Chowdhary, Vishal and Neubig, Graham},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}

Contributing

1. Install development dependencies: pip install -e ".[dev]"
2. Make your changes.
3. Run tests: pytest
4. Submit a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Trademark Notice

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.

