Post

GitHub
@github
We benchmarked the GitHub Copilot agentic harness against the harnesses that ship leading models natively. Holding the model and task fixed across SWE-bench Verified, SWE-bench Pro, SkillsBench, TerminalBench, and Win-Hill, the results were clear:

✅ Task resolution on par with model-vendor harnesses
✅ Fewer tokens across most configurations

💡 A key learning: With GitHub Copilot supporting more than 20 models, you're free to pick efficiency or peak quality per task.
Image: Scatter plot titled "Resolution rate vs. cost per task," plotting resolution rate (pass@1, y-axis, 62–78%) against mean cost per task in USD (x-axis, $0.40–$1.40). Each marker is one agent × model configuration (mean of 5 runs on Terminal-Bench 2), surrounded by a shaded ±1σ ellipse. A legend identifies three agents by color: Copilot CLI (purple), Codex CLI (gray), and Claude Code (teal). The subtitle notes that up and to the left is better: higher resolution at lower cost. A dashed vertical line splits the points into two clusters: lower-cost GPT-family models (GPT-5.4 and GPT-5.5) on the left, and higher-cost Claude-family models (Sonnet 4.6 and Opus 4.7) on the right.
22:14 · 7 Tir 1405 · 69.1K views
