mingrath/arena-workflow
CCG - Claude + Codex + Gemini Competitive Multi-Model System


简体中文 | English

This is a redesigned fork of fengshao1227/ccg-workflow. See What Changed below.

A competitive multi-model development system where every significant task is dispatched to ALL available models (Codex + Gemini) in parallel. Claude orchestrates, competes, compares outputs using benchmark-weighted evaluation criteria, and synthesizes the best solution.

Why competitive? Our benchmark analysis shows no single model dominates all tasks. Claude leads code quality (SWE-bench 80.8%), Codex leads autonomous execution (Terminal-Bench 77.3%), and Gemini leads visual design (WebDev Arena #1). Competitive dispatch ensures you always get the best output.


What Changed in This Fork

The original CCG uses static routing — frontend tasks always go to Gemini, backend tasks always go to Codex. This fork changes that completely.

Before (Original)

```
Frontend task → Gemini only    (assumed "frontend expert")
Backend task  → Codex only     (assumed "backend expert")
Claude        → Orchestrator   (doesn't compete)
```

The problem: these assumptions aren't backed by benchmarks. SWE-bench scores are nearly tied across all three models (~80%), and each model has different strengths that don't align with a simple frontend/backend split.

After (This Fork)

```
Any task → Send to ALL models in parallel → Compare outputs with weighted scores → Use the best one
```

Every task goes to Codex + Gemini + Claude at the same time. Claude then scores all three outputs using benchmark-weighted criteria and picks the best one (or combines the best parts from each).

What's different in practice

| Feature | Original | This Fork |
| --- | --- | --- |
| Routing | Fixed: Gemini=frontend, Codex=backend | All models compete on every task |
| How to pick the best output | Trust rules: "Codex is backend authority" | Weighted scoring based on actual benchmarks |
| Claude's role | Orchestrator only (doesn't generate solutions) | Orchestrator AND competitor (generates its own solution too) |
| Code review | 2 models (Codex + Gemini) | 3 models (Codex + Gemini + Claude), with consensus scoring (3/3, 2/3, 1/3) |
| Model strengths | Assumed (no evidence cited) | Documented with benchmark data and sources |
| Model weaknesses | Not mentioned | Each model's prompts list known limitations |
| Cost control | Always 2 models | 3 modes: Competitive (all 3), Focused (best-match), Quick (Claude only) |

Why the change

We fact-checked the original claims against published benchmarks:

| Original Claim | Reality |
| --- | --- |
| "Gemini is the frontend expert" | Gemini leads visual appeal (WebDev Arena #1) but Claude produces better responsive and accessible code (Index.dev test) |
| "Codex is the backend expert" | Codex leads terminal workflows (77.3%) but Claude leads architecture planning and code quality (80.8% SWE-bench) |
| "One model is always best for a domain" | No — each model wins on different dimensions, not domains |

Architecture

```
                    ┌─→ Codex CLI  ─→ Output A ─┐
                    │                           │
Task ─→ Claude ─────┼─→ Gemini CLI ─→ Output B ─┼─→ Weighted ─→ Best
        (also       │                           │   Compare     Solution
         generates  └─→ Claude Self ─→ Output C ┘
         Output C)
```

External models have no write access — they only return patches/analysis, which Claude evaluates and applies.
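In code, the fan-out step might look like the sketch below. This is illustrative only: `ModelOutput`, `runModel`, and `competitiveDispatch` are hypothetical names, not the project's actual API.

```typescript
// Hypothetical sketch of competitive dispatch. In the real system each
// branch would shell out to the corresponding CLI; here the calls are stubbed.
type ModelOutput = { model: string; patch: string };

// External models only return patches/analysis as text; they never write files.
async function runModel(model: string, task: string): Promise<ModelOutput> {
  // Stub: a real implementation would invoke the codex/gemini CLI here.
  return { model, patch: `[${model}] proposed patch for: ${task}` };
}

async function competitiveDispatch(task: string): Promise<ModelOutput[]> {
  // All models run in parallel; Claude later evaluates and applies the winner.
  return Promise.all(
    ["codex", "gemini", "claude-self"].map((m) => runModel(m, task)),
  );
}
```

The key property is that the dispatch itself is side-effect free: only Claude, after comparison, touches the working tree.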

How It Works

A full /ccg:workflow run goes through 6 phases. At every phase that involves model output, all models compete in parallel:

```mermaid
flowchart TD
    Start["🚀 User runs /ccg:workflow"] --> P1

    subgraph P1["Phase 1 — Research"]
        R1["Analyze requirements"]
        R2["Search codebase via MCP"]
        R3["Score completeness (≥7 to proceed)"]
        R1 --> R2 --> R3
    end

    P1 --> P2

    subgraph P2["Phase 2 — Ideation (Competitive)"]
        direction LR
        C2["Codex analyzes"] & G2["Gemini analyzes"] & CL2["Claude analyzes"]
    end

    P2 --> W2["⚖️ Weighted scoring → Best analysis"]
    W2 --> P3

    subgraph P3["Phase 3 — Planning (Competitive)"]
        direction LR
        C3["Codex plans"] & G3["Gemini plans"] & CL3["Claude plans"]
    end

    P3 --> W3["⚖️ Weighted scoring → Best plan"]
    W3 --> Approve{"👤 User approves plan?"}
    Approve -- No --> P3
    Approve -- Yes --> P4

    subgraph P4["Phase 4 — Execution"]
        E1["Claude implements code"]
        E2["(sole code writer — external models have zero write access)"]
        E1 --> E2
    end

    P4 --> P5

    subgraph P5["Phase 5 — Review (Competitive)"]
        direction LR
        C5["Codex reviews"] & G5["Gemini reviews"] & CL5["Claude reviews"]
    end

    P5 --> W5["⚖️ Weighted consensus → Issues ranked by agreement (3/3, 2/3, 1/3)"]
    W5 --> Fix{"Critical issues?"}
    Fix -- Yes --> FixCode["Claude fixes"] --> P5
    Fix -- No --> P6

    subgraph P6["Phase 6 — Final Review"]
        F1["Verify against plan"]
        F2["Run tests"]
        F3["Report to user"]
        F1 --> F2 --> F3
    end

    P6 --> Done["✅ Done"]

    style P2 fill:#e8f4f8,stroke:#2196F3
    style P3 fill:#e8f4f8,stroke:#2196F3
    style P5 fill:#e8f4f8,stroke:#2196F3
    style W2 fill:#fff3e0,stroke:#FF9800
    style W3 fill:#fff3e0,stroke:#FF9800
    style W5 fill:#fff3e0,stroke:#FF9800
```

Weighted Scoring Example (Phase 2 — Analysis)

Each model's output is scored on multiple dimensions. The weights depend on the task type:

```
                    Codex    Gemini    Claude
                    ─────    ──────    ──────
Tech depth          0.30     0.20      0.35  ← Claude weighted highest
Risk detection      0.35     0.15      0.30  ← Codex weighted highest
UX / visual         0.10     0.40      0.20  ← Gemini weighted highest
Actionability       0.25     0.25      0.25  ← Equal

Score = Σ (dimension rating × weight)
Winner = highest total score
Final output = winner + cherry-picked improvements from others
```

Different task types use different weight tables — see /ccg:routing-guide for all of them.
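As a sketch, the scoring formula above translates into code like this. The `WEIGHTS` table copies the analysis weights shown above; the function and type names (`score`, `pickWinner`, `Ratings`) are illustrative, not CCG's actual implementation.

```typescript
// dimension -> rating on a 0-10 scale (assumed scale, for illustration)
type Ratings = Record<string, number>;

// Per-model dimension weights for the "analysis" task type (table above).
const WEIGHTS: Record<string, Ratings> = {
  codex:  { techDepth: 0.30, risk: 0.35, ux: 0.10, actionability: 0.25 },
  gemini: { techDepth: 0.20, risk: 0.15, ux: 0.40, actionability: 0.25 },
  claude: { techDepth: 0.35, risk: 0.30, ux: 0.20, actionability: 0.25 },
};

// Score = Σ (dimension rating × weight)
function score(model: string, ratings: Ratings): number {
  return Object.entries(ratings).reduce(
    (sum, [dim, rating]) => sum + rating * (WEIGHTS[model][dim] ?? 0),
    0,
  );
}

// Winner = highest total score across all competing models.
function pickWinner(all: Record<string, Ratings>): string {
  return Object.entries(all)
    .map(([model, ratings]) => [model, score(model, ratings)] as const)
    .sort((a, b) => b[1] - a[1])[0][0];
}
```

Because the weights differ per model, a model can win on raw ratings in one dimension yet still lose overall to a model whose strengths are weighted more heavily for that task type.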

Installation

```
npx ccg-workflow
```

Requirements: Claude Code CLI, Node.js 20+

Important: This project depends on ora@9.x and string-width@8.x, which require Node.js >= 20.

Optional: Codex CLI (competitive dispatch), Gemini CLI (competitive dispatch)

Benchmark Evidence

Routing decisions are based on published benchmarks, not assumptions:

| Benchmark | Claude Opus 4.6 | GPT-5.3-Codex | Gemini 3.1 Pro | Source |
| --- | --- | --- | --- | --- |
| SWE-bench Verified | 80.8% | 80.0% | 80.6% | marc0.dev |
| Terminal-Bench 2.0 | 74.7% | 77.3% | 78.4% (hard) | SmartScope |
| WebDev Arena | ~1510 ELO | 1477 ELO | 1487 ELO | Google Dev Blog |
| Code Review Quality | #1 (tied) | Below Claude | Below Claude | Milvus Blog |
| Responsive Design | Leader | - | Below Claude | Index.dev |

Model Strength Summary

| Model | Best At | Evidence |
| --- | --- | --- |
| Claude | Code quality, architecture, responsive/a11y, code review | SWE-bench leader, Milvus review #1, Index.dev responsive test |
| Codex | Edge case detection, autonomous execution, Python, terminal workflows | Terminal-Bench leader, catches bugs others miss, cloud sandbox |
| Gemini | Visual design, rapid prototyping, large codebase analysis | WebDev Arena #1, fastest iteration, 1M token context |

Dispatch Modes

| Mode | Description | Cost |
| --- | --- | --- |
| Competitive (default) | All models compete, weighted comparison | ~3x |
| Focused | Best-match model + Claude verification | ~1-2x |
| Quick | Claude only | 1x |

See /ccg:routing-guide for full weighted evaluation criteria per task type.

Commands

| Command | Description |
| --- | --- |
| `/ccg:workflow` | Full 6-phase competitive workflow |
| `/ccg:plan` | Competitive multi-model planning |
| `/ccg:execute` | Competitive execution with output comparison |
| `/ccg:feat` | Smart feature development |
| `/ccg:frontend` | Frontend tasks (all models, visual weight) |
| `/ccg:backend` | Backend tasks (all models, logic weight) |
| `/ccg:analyze` | Technical analysis |
| `/ccg:debug` | Competitive problem diagnosis |
| `/ccg:optimize` | Competitive performance optimization |
| `/ccg:test` | Test generation |
| `/ccg:review` | Competitive code review (weighted consensus) |
| `/ccg:routing-guide` | View benchmark evidence and routing criteria |
| `/ccg:commit` | Git commit |
| `/ccg:rollback` | Git rollback |
| `/ccg:clean-branches` | Clean branches |
| `/ccg:worktree` | Worktree management |
| `/ccg:init` | Initialize CLAUDE.md |
| `/ccg:enhance` | Prompt enhancement |
| `/ccg:spec-init` | Initialize OPSX environment |
| `/ccg:spec-research` | Requirements → Constraints |
| `/ccg:spec-plan` | Constraints → Zero-decision plan |
| `/ccg:spec-impl` | Execute plan + archive |
| `/ccg:spec-review` | Dual-model cross-review |
| `/ccg:team-research` | Agent Teams requirements → constraints |
| `/ccg:team-plan` | Agent Teams constraints → parallel plan |
| `/ccg:team-exec` | Agent Teams parallel execution |
| `/ccg:team-review` | Agent Teams weighted consensus review |
| `/ccg:codex-exec` | Codex full execution (plan → code → review) |

Weighted Evaluation

Every competitive dispatch uses task-type-specific weights:

```
Analysis:       Tech depth (Claude .35) | Risk (Codex .35) | UX (Gemini .40) | Action (.25 each)
Planning:       Architecture (Claude .40) | Scale (Claude .30) | UI (Gemini .40) | Detail (Codex .30)
Implementation: Correctness (Claude .35) | Visual (Gemini .45) | Edge cases (Codex .35) | Maintain (Claude .35)
Review:         Security (Codex .35) | Visual (Gemini .40) | Logic (Claude .35) | Perf (Codex .30)
```
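The review phase additionally ranks issues by cross-model agreement (3/3, 2/3, 1/3). That consensus step can be sketched as a simple tally; the `Review` type and `consensusRank` function are hypothetical names for illustration:

```typescript
// One reviewer's output: the model name and the issues it flagged.
type Review = { model: string; issues: string[] };

// Count how many of the three reviewers flagged each issue, then
// sort so the strongest consensus (3/3) comes first.
function consensusRank(reviews: Review[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of reviews) {
    for (const issue of new Set(r.issues)) {
      counts.set(issue, (counts.get(issue) ?? 0) + 1);
    }
  }
  return new Map([...counts].sort((a, b) => b[1] - a[1]));
}
```

Issues flagged by all three models are treated as high-confidence findings; 1/3 findings are candidates for a single model's bias or a genuine catch the others missed, so they get human or Claude adjudication.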

OPSX Spec-Driven (v1.7.52+)

Integrates OPSX architecture to turn requirements into constraints:

```
/ccg:spec-init
/ccg:spec-research implement user authentication
/ccg:spec-plan
/ccg:spec-impl
/ccg:spec-review
```

Agent Teams Parallel Execution (v1.7.60+)

```
/ccg:team-research implement real-time collaboration kanban API
/ccg:team-plan kanban-api
/ccg:team-exec
/ccg:team-review
```

Prerequisite: CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 in settings.json
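A minimal sketch of the corresponding `settings.json` fragment, assuming your Claude Code version reads experimental flags from a top-level `env` map (the exact layout may differ across versions):

```json
{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}
```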

Configuration

Directory Structure

```
~/.claude/
├── commands/ccg/       # Slash commands
├── agents/ccg/         # Sub-agents
├── skills/             # Quality gates
├── bin/codeagent-wrapper
└── .ccg/
    ├── config.toml
    └── prompts/{codex,gemini}/
```

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `CODEAGENT_POST_MESSAGE_DELAY` | Wait time after Codex completion (seconds) | 5 |
| `CODEX_TIMEOUT` | codeagent-wrapper execution timeout (seconds) | 7200 |
| `BASH_DEFAULT_TIMEOUT_MS` | Claude Code Bash default timeout (ms) | 120000 |
| `BASH_MAX_TIMEOUT_MS` | Claude Code Bash max timeout (ms) | 600000 |

codeagent-wrapper Auto-Authorization

CCG automatically installs a Hook to auto-authorize codeagent-wrapper commands.

Requirement: jq must be installed.

MCP Configuration

Code retrieval MCP (choose one):

  • ace-tool (recommended)
  • ContextWeaver (alternative)

Optional: Context7, Playwright, DeepWiki, Exa

Update / Uninstall

```
npx ccg-workflow@latest          # Update
npx ccg-workflow                 # Select "Uninstall"
```

Credits

Star History


License

MIT


v1.8.0 (Competitive Multi-Model Redesign) | Issues
