feat: add experimental BinEval evaluation support #42100

Draft
Copilot wants to merge 4 commits into main from copilot/add-eval-support-gh-aw

Conversation

Copilot AI commented Jun 28, 2026

Contributor

Adds native BinEval-style evaluations to gh-aw — small, binary questions declared in workflow frontmatter, executed post-run via an LLM harness, with results aggregated and reported as CI artifacts.

Schema (evals frontmatter)

  • New optional evals array with id + question fields; validated for unique IDs and non-empty questions
  • Emits experimental warning at compile time
evals:
  - id: builds
    question: Does the generated code compile?
  - id: focused
    question: Is the implementation limited to the requested change?
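
For illustration, the two validation rules above (unique IDs, non-empty questions) could be sketched roughly as follows. This is a JavaScript sketch, not the compiler's Go code: `validateEvals`, its error strings, and the rule that `id` itself must be non-empty are assumptions made for the example.

```javascript
// Hypothetical helper: returns a list of problems, empty when the array is valid.
// The real validation lives in the Go compiler; names and messages here are assumed.
function validateEvals(evals) {
  const errors = [];
  if (evals === undefined) return errors; // `evals` is optional
  if (!Array.isArray(evals)) return ["evals must be an array"];

  const seen = new Set();
  evals.forEach((entry, index) => {
    const id = typeof entry?.id === "string" ? entry.id.trim() : "";
    const question = typeof entry?.question === "string" ? entry.question.trim() : "";

    if (id === "") {
      errors.push(`evals[${index}]: id is required`); // assumption: empty IDs are rejected
    } else if (seen.has(id)) {
      errors.push(`evals[${index}]: duplicate id "${id}"`);
    } else {
      seen.add(id);
    }

    if (question === "") {
      errors.push(`evals[${index}]: question must not be empty`);
    }
  });
  return errors;
}
```

Collecting every problem instead of stopping at the first one lets a single compile pass report all invalid entries.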

Evaluation model

  • EvalDefinition, EvalResult, EvalSummary types in frontmatter_types.go
  • WorkflowData.Evals []EvalDefinition for downstream consumers

Eval job

  • New eval job injected after agent + detection jobs in the compiled workflow
  • JS harness (eval_harness.cjs) calls GitHub Models API (gpt-4o-mini) per question independently — no MCPs, no checkout
  • Prompt generation produces per-question binary prompts with rationale; no holistic scoring
  • Results aggregated (total/passed/failed/pass-rate) and uploaded as an eval artifact with a markdown step summary
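
A rough sketch of the per-question prompt and the aggregation described above. The function names follow the harness utilities named later in this thread (`buildEvalPrompt`, `aggregateResults`, `renderMarkdownSummary`), but their signatures, the prompt wording, and the result fields (`id`, `passed`, `rationale`) are assumptions, not the PR's actual code.

```javascript
// Assumed shape: evalDef = { id, question }, context = text describing the agent run.
function buildEvalPrompt(evalDef, context) {
  return [
    "Answer the following question about the agent run with a strict yes or no.",
    `Question (${evalDef.id}): ${evalDef.question}`,
    "Context:",
    context,
    'Reply with JSON only: {"passed": true|false, "rationale": "<one sentence>"}',
  ].join("\n");
}

// Each question is judged on its own, so the summary is plain counting;
// there is no weighting or holistic score.
function aggregateResults(results) {
  const total = results.length;
  const passed = results.filter(r => r.passed === true).length;
  const failed = total - passed;
  const passRate = total === 0 ? 0 : passed / total;
  return { total, passed, failed, passRate };
}

// Renders a small markdown report suitable for a step summary.
function renderMarkdownSummary(results, summary) {
  const rows = results.map(r => `| ${r.id} | ${r.passed ? "pass" : "fail"} | ${r.rationale} |`);
  return [
    `Evals: ${summary.passed}/${summary.total} passed (${(summary.passRate * 100).toFixed(1)}%)`,
    "",
    "| Eval | Result | Rationale |",
    "| --- | --- | --- |",
    ...rows,
  ].join("\n");
}
```

Treating anything other than a literal `true` as a failure keeps a malformed answer from being counted as a pass.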

Not included

Phase 8 (persisting results to a git branch, à la experiments) is deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Comment thread actions/setup/js/eval_harness.cjs Outdated
@pelikhan

Collaborator

@copilot run inference in AWF and use /reflect to find a suitable inference endpoint.

@github-actions

Contributor

Hey @Copilot 👋 — great work on the BinEval evaluation support! The end-to-end shape is clean: the evals frontmatter schema, the Go compiler module, the JS harness, and the wiring into the orchestrator all land as a cohesive unit.

A few things that stand out positively:

  • Well-scoped: every changed file is in service of the same feature — nothing unrelated was mixed in.
  • Tests at every layer: compiler_evals_test.go (Go unit tests), eval_harness.test.cjs (JS unit tests), and evals_experimental_warning_test.go (integration guard) give solid coverage across the stack.
  • Clear description: the PR body explains the schema, the evaluation model, the harness design, and explicitly calls out what was deferred (Phase 8 persistence) — exactly the context reviewers need.
  • Experimental gating: emitting a compile-time warning via emitExperimentalFeatureWarnings is the right pattern for a feature that isn't production-ready yet.

This looks ready for review. 🚀

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • patchdiff.githubusercontent.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"

See Network Configuration for more information.

Generated by ✅ Contribution Check · 293.1 AIC · ⌖ 20.4 AIC · ⊞ 6K ·

- Remove direct GitHub Models API calls from eval_harness.cjs; keep only shared utility functions (readEvalSpec, buildEvalPrompt, aggregateResults, renderMarkdownSummary, sanitizeEvalError)
- Add actions/setup/md/eval.md: eval prompt template instructing the engine to output EVAL_RESULT:{...json...}
- Add actions/setup/js/setup_eval.cjs: prompt setup script (mirrors setup_threat_detection.cjs)
- Add actions/setup/js/parse_eval_results.cjs: result parser extracting EVAL_RESULT from engine log
- Update compiler_evals.go: eval job now follows the detection job pattern — pulls AWF containers, clears MCP config, installs the agentic engine, runs it inside AWF, parses results
- Add EvalLogPath, EvalDir, DefaultEvalMaxAICredits constants to pkg/constants
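
As a sketch of what extracting the EVAL_RESULT marker from a plain-text engine log might look like (the real logic is in parse_eval_results.cjs and may differ): `extractEvalResults` is a hypothetical name, and the code assumes each marker sits on a single line with the JSON payload running to the end of that line. Stream-json logs are not handled here.

```javascript
const EVAL_RESULT_PREFIX = "EVAL_RESULT:";

// Hypothetical helper: scans a log line by line and returns every JSON object
// that follows an EVAL_RESULT: marker. Malformed payloads are skipped, not fatal.
function extractEvalResults(logText) {
  const results = [];
  for (const line of logText.split(/\r?\n/)) {
    const start = line.indexOf(EVAL_RESULT_PREFIX);
    if (start === -1) continue;

    const payload = line.slice(start + EVAL_RESULT_PREFIX.length).trim();
    try {
      const parsed = JSON.parse(payload);
      if (parsed !== null && typeof parsed === "object") results.push(parsed);
    } catch {
      // Not valid JSON: ignore this line and keep scanning.
    }
  }
  return results;
}
```

Skipping unparseable lines rather than throwing means one garbled marker does not discard the results that did parse.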

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot stopped work on behalf of pelikhan due to an error June 28, 2026 17:49
@pelikhan

Collaborator

@copilot keep working

Add missing test files for the two BinEval JS modules introduced in the
AWF-engine refactor:

- setup_eval.test.cjs: 14 tests covering prompt template rendering,
  missing/empty context files, eval spec parsing, and step summary output
- parse_eval_results.test.cjs: 25 tests covering EVAL_RESULT extraction
  from plain and stream-json logs, main() error/success paths, and result
  normalisation

Mirrors the test coverage pattern of setup_threat_detection.test.cjs and
parse_threat_detection_results.test.cjs.
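
To make "result normalisation" concrete, here is one way it could work. This is a guess at the behaviour under test, not the code in parse_eval_results.cjs: `normalizeEvalResult`, the field names (`id`, `passed`, `rationale`), and the accepted string verdicts are all assumptions.

```javascript
// Hypothetical helper: coerces whatever the engine emitted into a fixed shape.
// Anything that is not clearly a pass is treated as a failure.
function normalizeEvalResult(raw) {
  const source = raw !== null && typeof raw === "object" ? raw : {};
  const verdict = source.passed;
  const passed =
    verdict === true ||
    (typeof verdict === "string" && ["true", "yes", "pass"].includes(verdict.trim().toLowerCase()));

  return {
    id: typeof source.id === "string" ? source.id : "",
    passed,
    rationale: typeof source.rationale === "string" ? source.rationale.trim() : "",
  };
}
```

For example, `normalizeEvalResult({ id: "builds", passed: "Yes", rationale: " ok " })` yields `{ id: "builds", passed: true, rationale: "ok" }`, while `normalizeEvalResult(null)` yields a failing result with empty fields.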

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
@github-actions

Contributor

🤖 PR Triage — §28332039983

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | high |
| Priority | medium |
| Score | 38 / 100 (impact 25 · urgency 5 · quality 8) |
| Action | defer |
| Batch | |

Rationale: Large experimental draft (+2181 lines). Adds a BinEval-style LLM evaluation harness declared in workflow frontmatter: a significant new capability, but high in scope and risk. No CI results yet (merge state: UNSTABLE). Defer until the draft is promoted to ready and CI passes.

Labels applied: pr-type:feature · pr-risk:high · pr-priority:medium · pr-action:defer · pr-agent:copilot-swe-agent

Generated by 🔧 PR Triage Agent · 65.9 AIC · ⌖ 11.5 AIC · ⊞ 5.4K ·

@github-actions

Contributor

🤖 PR Triage — §28342769269

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | high |
| Score | 38/100 |
| Priority | low |
| Action | defer |
| Status | Draft, no CI, no reviews |

Score breakdown: Impact 25 · Urgency 6 · Quality 7

Rationale: Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, still draft with no CI checks passing. ~10h old with no new activity. Defer until promoted to ready and CI established.

ℹ️ Carried over. Defer until draft promoted to ready.

Generated by 🔧 PR Triage Agent · 107.8 AIC · ⌖ 10.9 AIC · ⊞ 5.4K ·

@github-actions

Contributor

🤖 PR Triage — §28357644191

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | 🔴 High |
| Score | 30/100 (Impact 20 · Urgency 5 · Quality 5) |
| Action | defer |

Carried over — 17.3h old. Experimental BinEval LLM evaluation harness (+2181 lines). Large addition, no CI yet, draft. Defer until promoted to ready and CI validates.

Generated by 🔧 PR Triage Agent · 89.9 AIC · ⌖ 12 AIC · ⊞ 5.4K ·

@github-actions github-actions Bot mentioned this pull request Jun 29, 2026
@github-actions

Contributor

🤖 PR Triage — §28376613466

| Field | Value |
| --- | --- |
| Category | feature (experimental) |
| Risk | 🔴 High |
| Priority | 🟢 Low |
| Score | 30 / 100 |
| Action | ⏸️ defer |
| Age | 23h |

Score breakdown: Impact 20 + Urgency 5 + Quality 5

Rationale: Adds native BinEval-style evaluations, a large experimental addition (2181+/0−, 19 files), draft, no CI. High risk, no reviewer engagement yet. Defer until promoted from draft, CI passes, and the feature scope is reduced or approved.

ℹ️ pr-priority:medium label is stale: score is 30 (low boundary). Carried over from §28357644191.

Generated by 🔧 PR Triage Agent · 93.2 AIC · ⌖ 14.6 AIC · ⊞ 5.4K ·

@github-actions

Contributor

🔍 PR Triage — §28395315609

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | high |
| Score | 26 / 100 |
| Action | defer |
| Batch | |

Score breakdown: impact 15 + urgency 3 + quality 8

Carried over (28h). Experimental BinEval evaluation support. Large draft (19 files, +2181/-0), no CI, Phase 8 deferred by author. Priority corrected: pr-priority:medium → pr-priority:low (score 26). Defer until out of draft with CI.

Generated by 🔧 PR Triage Agent · 99.1 AIC · ⌖ 11.6 AIC · ⊞ 5.4K ·

@github-actions

Contributor

🤖 PR Triage — Run §28413530597

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | 🔴 high |
| Score | 30 / 100 |
| Breakdown | impact 20 + urgency 5 + quality 5 |
| Action | defer |

Notes: Draft (34.5h, carried over). Experimental BinEval evaluation support: 19 files, +2181/-0. No CI yet. ⚠️ Label conflict detected: both pr-priority:medium and pr-priority:low present — please remove pr-priority:medium. Defer until promoted from draft.

Generated by 🔧 PR Triage Agent · 61.6 AIC · ⌖ 7.7 AIC · ⊞ 1.6K ·

@github-actions

Contributor

🤖 PR Triage — Run §28445959231

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | 🔴 High |
| Priority | 🟡 Medium |
| Score | 38 / 100 |
| Action | 🕐 defer |
| CI | ⭕ NO_CI |
| Age | 46.2h · 🚧 DRAFT |
| Batch | experimental (#42426, #42314, #42100) |

Score breakdown: Impact 20/50 · Urgency 10/30 · Quality 8/20

Experimental BinEval evaluation support (2181+, 19 files) — large DRAFT going stale at 46h. Part of experimental batch. ⚠️ Label conflict: has both pr-priority:low (stale) and pr-priority:medium (correct) — pr-priority:low should be removed manually. Defer until out of draft and CI established. Labels partially applied (conflict needs manual fix).

Generated by 🔧 PR Triage Agent · 52.3 AIC · ⌖ 8.71 AIC · ⊞ 1.6K ·

@github-actions

Contributor

🤖 PR Triage

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | 🔴 High |
| Score | 36 / 100 |
| Action | defer |
| Batch | experimental |

Score breakdown: Impact 22/50 · Urgency 5/30 · Quality 9/20

Rationale: DRAFT. Experimental BinEval evaluation support (+2181/-0). Now ~52h stale with no CI. Carried over. ⚠️ Note: conflicting pr-priority:low label from prior run — correct priority is medium. Consider closing if not actively worked.

Generated by 🔧 PR Triage Agent · 83.9 AIC · ⌖ 17.1 AIC · ⊞ 1.6K ·

@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

🤖 PR Triage — §28486872548

| Field | Value |
| --- | --- |
| Category | feature |
| Risk | high |
| Score | 35 / 100 |
| Priority | low |
| Action | defer |

Score breakdown: Impact 22/50 · Urgency 5/30 · Quality 8/20

Batch: experimental (PRs #42100, #42314)

Rationale: Large experimental BinEval evaluation support (+2181/-0, 19 files). Draft, minimal reviews. High risk due to entirely new evaluation subsystem. Defer until design is settled and PR is ready.

Labels applied: pr-type:feature pr-risk:high pr-priority:low pr-action:defer pr-batch:experimental pr-agent:copilot-swe-agent

Generated by 🔧 PR Triage Agent · 77.2 AIC · ⌖ 9.82 AIC · ⊞ 1.6K ·
