Detect skill install failures and surface them as agent failure context; fix needex/skills fallback crash#42642
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #42642 does not have the 'implementation' label and has only 21 new lines of code in business logic directories (threshold: 100). |
|
✅ Test Quality Sentinel completed test quality analysis. No test files were added or modified in this PR. Test Quality Sentinel skipped. |
There was a problem hiding this comment.
Pull request overview
This pull request hardens frontmatter skill installation in gh-aw workflows by preventing activation-step crashes on gh skill install failures, capturing those failures, and surfacing them in the agent-failure issue/comment context. It also removes a broken fallback (needex/skills) that could resolve to a non-existent repo and cause unhandled rejections.
Changes:
- Wrap each
gh skill installinvocation to continue on failure while recording{skill, error}into a shared runner file and step summary. - Add an
always()collector step + activation job outputs so conclusion handling can pass skill-install failure context intohandle_agent_failure.cjs. - Extend failure category matching/schema to include
skill_install_failures, and remove theneedex/skillsfallback from the Impeccable workflow.
Show a summary per file
| File | Description |
|---|---|
| pkg/workflow/notify_comment_conclusion_helpers.go | Conditionally passes skill-install failure outputs into the agent-failure handler env when skills are configured. |
| pkg/workflow/compiler_activation_job_builder.go | Adds an always() collector step and exposes skill_install_failure_count / skill_install_errors as activation outputs. |
| pkg/parser/schemas/main_workflow_schema.json | Adds skill_install_failures to the allowed failure_categories pattern. |
| actions/setup/md/skill_install_failure.md | New failure-context template rendered into agent failure issues/comments. |
| actions/setup/js/install_frontmatter_skills.cjs | Records per-skill install failures instead of hard-failing the activation step. |
| actions/setup/js/collect_skill_install_failures.cjs | New script that reads the shared failures file and publishes step outputs. |
| actions/setup/js/handle_agent_failure.cjs | Adds failure category + renders skill_install_failure.md context when installs fail. |
| .github/workflows/impeccable-skills-reviewer.md | Removes ` |
| .github/workflows/impeccable-skills-reviewer.lock.yml | Regenerated workflow reflecting the removed fallback and new collector step/outputs. |
| .github/workflows/mattpocock-skills-reviewer.lock.yml | Regenerated workflow reflecting the new collector step/outputs. |
Review details
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 10/10 changed files
- Comments generated: 1
- Review effort level: Low
| function appendSkillInstallFailure(skillSpec, errorMessage) { | ||
| let failures = []; | ||
| try { | ||
| if (fs.existsSync(SKILL_FAILURES_FILE)) { | ||
| const raw = fs.readFileSync(SKILL_FAILURES_FILE, "utf8"); | ||
| failures = JSON.parse(raw); | ||
| } | ||
| } catch (parseErr) { | ||
| // If reading/parsing fails, start fresh and warn so the issue is visible in logs | ||
| core.warning(`Could not read skill install failures file, starting fresh: ${parseErr instanceof Error ? parseErr.message : String(parseErr)}`); | ||
| failures = []; | ||
| } | ||
| failures.push({ skill: skillSpec, error: errorMessage }); | ||
| fs.writeFileSync(SKILL_FAILURES_FILE, JSON.stringify(failures, null, 2), "utf8"); | ||
| } |
There was a problem hiding this comment.
Review: #42642 — Skill install failure detection + needex/skills fallback fix
Overall this is a well-structured change. The root cause fix (removing || 'needex/skills') is correct and the end-to-end failure-surfacing pipeline (write → collect → output → env var → conclusion job → issue/comment) is sound. A few non-blocking suggestions below.
What's good
- Early return at Go level (
len(skillRefs) == 0→return nil) ensures the collect step is never injected for workflows with no skills configured — avoids needless overhead. if: always()on the collect step correctly captures failures even if a prior install step hard-fails.- Template syntax (
{skills}) matchesrenderTemplate's/\{(\w+)\}/gregex — no mismatch. - Conditional env var injection in
notify_comment_conclusion_helpers.go(guarded bylen(SkillReferences) > 0 || len(Skills) > 0) prevents pollution of workflows without skills. skill_install_failuresadded tofailure_categoriesschema pattern — correctly allows users to opt-in to issue creation for this category.
Inline suggestions (3 flagged):
collect_skill_install_failures.cjsL24 — missing element-level type validation; malformed entries produceundefined:undefinedoutput.collect_skill_install_failures.cjsL39 andinstall_frontmatter_skills.cjsL153 — error messages with embedded newlines corrupt the\n-delimited serialization format used to pass errors between steps.handle_agent_failure.cjsL2136 —indexOf(":")parsing is fragile if skill names ever include a colon (e.g., URL-style refs); worth using a more robust delimiter long-term.
🧵 Reviewed using Impeccable skills by Impeccable Skills Reviewer · 93.3 AIC · ⌖ 7.02 AIC · ⊞ 4.9K
| if (!Array.isArray(parsed)) { | ||
| return []; | ||
| } | ||
| return parsed; |
There was a problem hiding this comment.
The return parsed here passes through without element-level validation. If the JSON file contains entries with missing or non-string skill/error fields (e.g., due to a partial write or crash mid-flush), the .map(f => ${f.skill}:${f.error}) on line 39 will silently produce undefined:undefined entries in the output.
Consider adding a guard before returning:
return parsed.filter(
(e) => e && typeof e.skill === 'string' && typeof e.error === 'string'
);@copilot please address this.
| core.info(`Skill install failures detected: ${failureCount}`); | ||
|
|
||
| core.setOutput("failure_count", String(failureCount)); | ||
| core.setOutput("errors", failures.map(f => `${f.skill}:${f.error}`).join("\n")); |
There was a problem hiding this comment.
If f.error contains embedded newlines (which can happen if exec.exec captures multi-line stderr into its Error message), the join("\\n") here will produce a multi-entry string where lines belonging to one failure spill into adjacent entries. The parser in handle_agent_failure.cjs splits on "\\n" to identify entries, so a newline inside an error message would corrupt the format.
Sanitize before serializing:
core.setOutput(
'errors',
failures
.map(f => `${f.skill}:${f.error.replace(/\n/g, ' ')}`)
.join('\n')
);@copilot please address this.
| try { | ||
| await exec.exec("gh", command.args); | ||
| } catch (err) { | ||
| const errorMessage = err instanceof Error ? err.message : String(err); |
There was a problem hiding this comment.
Same serialization concern as in collect_skill_install_failures.cjs: err.message from a failed exec.exec can include multi-line stderr output. Newlines in this value will corrupt the "\\n"-delimited format when the errors are later joined for the step output.
Consider normalizing at the source:
const errorMessage = (err instanceof Error ? err.message : String(err)).replace(/\n/g, ' ');@copilot please address this.
| for (const errorLine of errorLines) { | ||
| const colonIdx = errorLine.indexOf(":"); | ||
| if (colonIdx >= 0) { | ||
| const skill = errorLine.slice(0, colonIdx); |
There was a problem hiding this comment.
The indexOf(":") parser is subtly fragile when paired with the current serialization format. The skill name is taken as everything before the first colon, but the error portion (everything after) inherits all remaining text — including any colon-containing prefixes like Error: ....
This currently works because skill names (owner/repo) don't contain colons. However, if the upstream serialization ever changes to include a URL-style skill ref (e.g., https://github.com/org/repo), skill extraction would silently break.
A more defensive alternative would be to use a structured delimiter (e.g., a tab or a null byte) or serialize as JSON in the step output. Minor suggestion — no change required if the skill-name contract is stable.
@copilot please address this.
There was a problem hiding this comment.
REQUEST_CHANGES — One blocking bug kills the core feature; two more can re-introduce the crash this PR is meant to prevent.
Blocking issues
Missing template placeholder (blocks the entire feature)
skill_install_failure_context is computed and passed to renderTemplateFromFile, but neither agent_failure_comment.md nor agent_failure_issue.md contains the {skill_install_failure_context} placeholder. renderTemplate silently drops unknown keys. Users will never see skill install failure details in the failure comment/issue.
Colon delimiter fragility
collect_skill_install_failures.cjs serialises failures as skill:error joined by newlines; the consumer in handle_agent_failure.cjs splits on the first colon. gh CLI error messages always contain colons (exit status 1: HTTP 404: Not Found), so the skill name and error text will be split at the wrong position. Use a tab or JSON serialisation instead.
Unguarded writeFileSync with missing directory prerequisite
appendSkillInstallFailure calls fs.writeFileSync with no mkdirSync guard and no try/catch. If /tmp/gh-aw/ does not exist (fresh runner), or a write error occurs, the exception propagates out of the catch block that called it and crashes the install step — exactly the regression this PR aims to avoid.
Non-blocking concern
!hasSkillInstallFailures in the isSuccessful guard will suppress success notifications for agents that completed their task correctly but happened to have a skill install failure. Skill install failures should probably be surfaced as warnings without overriding the agent's own success/failure signal.
🔎 Code quality review by PR Code Quality Reviewer · 90.7 AIC · ⌖ 7.54 AIC · ⊞ 1.6K
Comment /review to run again
| credential_auth_error_context: credentialAuthErrorContext, | ||
| assignment_errors_context: assignmentErrorsContext, | ||
| assign_copilot_failure_context: assignCopilotFailureContext, | ||
| skill_install_failure_context: skillInstallFailureContext, |
There was a problem hiding this comment.
skill_install_failure_context is passed to the template renderer but neither agent_failure_comment.md nor agent_failure_issue.md contains the {skill_install_failure_context} placeholder — the failure context is silently discarded and users never see it.
💡 Fix
Both template files end with a tightly-packed chain of context placeholders on a single line. {skill_install_failure_context} must be inserted into that chain so the rendered output actually includes it.
In agent_failure_comment.md and agent_failure_issue.md, find the placeholder chain (the line that starts with {secret_verification_context}) and add {skill_install_failure_context} in the appropriate position — right after {assign_copilot_failure_context} to match the insertion order used throughout the rest of the file:
-...{assign_copilot_failure_context}{create_discussion_errors_context}...
+...{assign_copilot_failure_context}{skill_install_failure_context}{create_discussion_errors_context}...Without this change, buildSkillInstallFailureContext() runs, populates the context object, and is then silently dropped by renderTemplate because there is no matching placeholder. The entire feature introduced by this PR — surfacing skill install failures in the failure comment/issue — does not work end-to-end.
| core.info(`Skill install failures detected: ${failureCount}`); | ||
|
|
||
| core.setOutput("failure_count", String(failureCount)); | ||
| core.setOutput("errors", failures.map(f => `${f.skill}:${f.error}`).join("\n")); |
There was a problem hiding this comment.
Colon-delimited skill:error output is ambiguous — gh error messages routinely contain colons, so the parser in handle_agent_failure.cjs will split at the wrong position and produce a truncated skill name and garbled error text.
💡 Fix
The serialization here:
failures.map(f => `${f.skill}:${f.error}`).join("\n")collides with any colon inside an error message like "exit status 1: HTTP 404: Not Found". The consumer uses indexOf(":"), so it splits at the first colon regardless.
The simplest fix is to use a tab character as the delimiter (rare in skill specs and error messages):
core.setOutput("errors", failures.map(f => `${f.skill}\t${f.error}`).join("\n"));And in handle_agent_failure.cjs, replace indexOf(":") with indexOf("\t"):
const tabIdx = errorLine.indexOf("\t");
if (tabIdx >= 0) {
const skill = errorLine.slice(0, tabIdx);
const error = errorLine.slice(tabIdx + 1);
skillList += `- \`${skill}\`: ${error}\n`;
}Alternatively, serialise as JSON and parse it back — the data is already structured.
| failures = []; | ||
| } | ||
| failures.push({ skill: skillSpec, error: errorMessage }); | ||
| fs.writeFileSync(SKILL_FAILURES_FILE, JSON.stringify(failures, null, 2), "utf8"); |
There was a problem hiding this comment.
/tmp/gh-aw/ is never created before writeFileSync, so the very first call to appendSkillInstallFailure will throw ENOENT if the directory does not exist — which then propagates out of the catch block and crashes the install loop.
💡 Fix
Add a mkdirSync call with { recursive: true } before the write, or at the top of appendSkillInstallFailure:
function appendSkillInstallFailure(skillSpec, errorMessage) {
let failures = [];
try {
if (fs.existsSync(SKILL_FAILURES_FILE)) {
const raw = fs.readFileSync(SKILL_FAILURES_FILE, "utf8");
const parsed = JSON.parse(raw);
failures = Array.isArray(parsed) ? parsed : [];
}
} catch (parseErr) {
core.warning(`...`);
failures = [];
}
failures.push({ skill: skillSpec, error: errorMessage });
try {
fs.mkdirSync(require("path").dirname(SKILL_FAILURES_FILE), { recursive: true });
fs.writeFileSync(SKILL_FAILURES_FILE, JSON.stringify(failures, null, 2), "utf8");
} catch (writeErr) {
core.warning(`Could not write skill install failures file: ${writeErr instanceof Error ? writeErr.message : String(writeErr)}`);
}
}Note that the existing review comment already points out the missing Array.isArray guard on parsed and the unguarded write — this is the directory-existence gap that makes both failure paths worse: even if writeFileSync is later wrapped in try/catch, without the mkdirSync the first write in a fresh runner will always throw.
| !isTimedOut && | ||
| !hasAssignmentErrors && | ||
| !hasAssignCopilotFailures && | ||
| !hasSkillInstallFailures && |
There was a problem hiding this comment.
A skill install failure marks an otherwise-successful agent run as not successful, suppressing success notifications even when the agent completed its task correctly.
💡 Rationale
!hasSkillInstallFailures is ANDed into the isSuccessful condition. The issue: skill install failures are activation-time decorators — they affect what context the agent had available, not whether the agent succeeded at its assigned task. If a skill fails to install but the agent completes normally, the workflow should still be marked successful (with a warning surfaced via skill_install_failure_context).
Consider removing !hasSkillInstallFailures from the isSuccessful guard and surfacing skill install failures purely as warning context in the comment/issue body. If a skill failure should always be treated as a hard failure, that intent should be explicit in a comment here.
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /tdd and /codebase-design — requesting changes primarily on test coverage gaps and one latent encoding bug in the error serialization protocol.
📋 Key Themes & Highlights
Key Themes
- Missing tests for new failure paths: The three files that form the failure-detection pipeline (
install_frontmatter_skills.cjs,collect_skill_install_failures.cjs,handle_agent_failure.cjs) each introduce new behaviour with no corresponding test coverage. A regression in any of them would fail silently. - Fragile error serialization: The
skill:error+ newline-join protocol betweencollect_skill_install_failures.cjsandbuildSkillInstallFailureContextwill misformat errors that contain newlines (e.g. multi-line shell output or Node stack traces). - Minor condition asymmetry in Go compiler:
notify_comment_conclusion_helpers.goguards the env-var injection with the raw (unfiltered) skills list while the compiler filters blanks, making the two sides of the pipeline diverge on edge inputs.
Positive Highlights
- ✅ Root cause fix is clean and minimal — removing the
|| 'needex/skills'fallback is exactly the right call. - ✅
if: always()on the collect step is correct and necessary — good defensive thinking. - ✅
appendSkillInstallFailurecorrectly handles corrupt/missing file via try/catch and warns before resetting. - ✅ The failure template
skill_install_failure.mdcovers all the common resolution paths clearly. - ✅ Adding
skill_install_failuresto both the schema pattern andbuildFailureMatchCategoriescloses the loop end-to-end.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 109.9 AIC · ⌖ 8.93 AIC · ⊞ 6.6K
Comment /matt to run again
| core.warning(`Failed to install skill '${skillSpec}': ${errorMessage}`); | ||
| failures.push({ skill: skillSpec, error: errorMessage }); | ||
| appendSkillInstallFailure(skillSpec, errorMessage); | ||
| } |
There was a problem hiding this comment.
[/tdd] The new try/catch error-recording path in main() has no test coverage — an exec failure is never exercised in the test suite.
💡 Suggested test addition
Add a case to install_frontmatter_skills.test.cjs where exec.exec rejects for one skill and assert:
core.warningis called with the skill name- The shared failures file is written with the correct
{skill, error}entry writeSkillSummaryreceives the non-emptyfailuresarray
it('records a failure entry when a skill install throws', async () => {
process.env.GH_AW_SKILL_DIR = '.claude/skills';
process.env.GH_AW_FRONTMATTER_SKILLS = 'bad/repo@abc';
global.exec.exec = vi.fn().mockRejectedValue(new Error('exit code 1'));
global.core.warning = vi.fn();
await script.main();
expect(global.core.warning).toHaveBeenCalledWith(expect.stringContaining('bad/repo@abc'));
const written = JSON.parse(fs.readFileSync('/tmp/gh-aw/skill_install_failures.json', 'utf8'));
expect(written).toEqual([{ skill: 'bad/repo@abc', error: expect.stringContaining('exit code 1') }]);
});Without this, a regression in the catch branch (e.g. silently swallowing the error) would go undetected.
@copilot please address this.
| core.info(`Skill install failures detected: ${failureCount}`); | ||
|
|
||
| core.setOutput("failure_count", String(failureCount)); | ||
| core.setOutput("errors", failures.map(f => `${f.skill}:${f.error}`).join("\n")); |
There was a problem hiding this comment.
[/tdd] New file with zero test coverage — and the errors output encoding is also a latent parsing bug.
The format \${f.skill}:${f.error}`joined by\nwill be mishandled bybuildSkillInstallFailureContextinhandle_agent_failure.cjs` if an error message itself contains a newline (e.g. a multi-line shell error or Node stack trace). The split-on-newline in the consumer will fragment the error into multiple lines, some of which won't have a colon, and will be rendered as bare bullet points.
💡 Suggested fixes
Option A – JSON-encode the errors output (most robust):
core.setOutput('errors', JSON.stringify(failures));Then in handle_agent_failure.cjs:
const errorEntries = JSON.parse(skillInstallErrors || '[]');
for (const { skill, error } of errorEntries) {
skillList += `- \`${skill}\`: ${error}\n`;
}Option B – strip newlines from the error string before joining:
core.setOutput('errors', failures.map(f => `${f.skill}:${f.error.replace(/\n/g, ' ')}`).join('\n'));Also add a test file collect_skill_install_failures.test.cjs that covers:
- happy path: no failures file →
failure_count=0, emptyerrors - single failure recorded → correct outputs
- corrupt/unreadable failures file → graceful warning, no throw
@copilot please address this.
| * @param {string} skillInstallErrors - Newline-separated list of "skill:error" entries | ||
| * @returns {string} Formatted context string, or empty string if no failures | ||
| */ | ||
| function buildSkillInstallFailureContext(hasSkillInstallFailures, skillInstallErrors) { |
There was a problem hiding this comment.
[/tdd] buildSkillInstallFailureContext and the new skill_install_failures category in buildFailureMatchCategories are not covered in handle_agent_failure.test.cjs.
💡 Suggested test additions
Add scenarios to handle_agent_failure.test.cjs (following the existing category test pattern):
it('includes skill install failure context when GH_AW_SKILL_INSTALL_FAILURE_COUNT is set', async () => {
process.env.GH_AW_SKILL_INSTALL_FAILURE_COUNT = '1';
process.env.GH_AW_SKILL_INSTALL_ERRORS = 'bad/repo:Error: exit code 1';
// ... run handler and assert issue/comment body contains skill failure text
expect(createdIssueBody).toContain('Skill Installation Failed');
expect(createdIssueBody).toContain('bad/repo');
});
it('excludes skill install context when failure count is 0', async () => {
process.env.GH_AW_SKILL_INSTALL_FAILURE_COUNT = '0';
// ... run handler and assert issue body does NOT contain skill failure text
expect(createdIssueBody).not.toContain('Skill Installation Failed');
});Without these, the rendering path for the new category is untested and a template format change could silently break the failure report.
@copilot please address this.
| ctx.steps = append(ctx.steps, " script: |\n") | ||
| ctx.steps = append(ctx.steps, generateGitHubScriptWithRequire("collect_skill_install_failures.cjs")) | ||
|
|
||
| ctx.outputs["skill_install_failure_count"] = "${{ steps.collect-skill-install-failures.outputs.failure_count || '0' }}" |
There was a problem hiding this comment.
[/codebase-design] The collect-skill-install-failures step and its two outputs (skill_install_failure_count, skill_install_errors) are not asserted in activation_skills_step_test.go, leaving the wiring unverified by the existing test.
💡 Suggested assertion additions
In TestBuildActivationJob_AddsFrontmatterSkillsInstallSteps, add:
assert.Contains(t, steps, "collect-skill-install-failures", "expected collect failures step")
assert.Contains(t, steps, "collect_skill_install_failures.cjs", "expected collect failures script")
// Verify job outputs are wired
outputsStr := fmt.Sprintf("%v", job.Outputs)
assert.Contains(t, outputsStr, "skill_install_failure_count")
assert.Contains(t, outputsStr, "skill_install_errors")This ensures the output names stay in sync between the Go compiler and the downstream references in notify_comment_conclusion_helpers.go.
@copilot please address this.
| envVars = append(envVars, fmt.Sprintf(" GH_AW_STALE_LOCK_FILE_FAILED: ${{ needs.%s.outputs.stale_lock_file_failed }}\n", constants.ActivationJobName)) | ||
| // SkillReferences holds structured skill refs (with github-token/github-app); Skills holds raw | ||
| // skill specs from simple frontmatter. Both result in activation-job skill install steps. | ||
| if len(data.SkillReferences) > 0 || len(data.Skills) > 0 { |
There was a problem hiding this comment.
[/codebase-design] The condition here uses data.Skills (raw, unfiltered) while addActivationSkillInstallSteps in the compiler filters out empty skill strings before deciding whether to emit the collect step. If data.Skills = [""] (one blank entry), this block injects GH_AW_SKILL_INSTALL_FAILURE_COUNT/GH_AW_SKILL_INSTALL_ERRORS but the activation job never sets those outputs, so they evaluate to empty string in the conclusion job.
💡 Suggested fix
Apply the same blank-filter here, or extract a shared helper:
func hasNonEmptySkills(data *WorkflowData) bool {
if len(data.SkillReferences) > 0 {
return true
}
for _, s := range data.Skills {
if strings.TrimSpace(s) != "" {
return true
}
}
return false
}Then use hasNonEmptySkills(data) in both places. This is low-severity in practice since the || '0' / || '' defaults handle it, but the asymmetry makes the logic harder to reason about.
@copilot please address this.
|
@copilot run pr-finisher skill |
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Ran pr-finisher pass and pushed follow-up fixes in |
impeccable-skills-reviewercrashed in the activation step becausevars.IMPECCABLE_SKILLS_REF || 'needex/skills'resolved to the non-existentneedex/skillsrepo, andinstall_frontmatter_skills.cjshad no error handling — the unhandled rejection killed the step.Root cause fix
Remove the
|| 'needex/skills'fallback. An unset var yields empty string →parseSkillSpecs("")returns[]→ no install steps run (safe no-op).Skill install failure detection (end-to-end)
install_frontmatter_skills.cjs— wrap each skill install in try/catch; on failure warn and append{skill, error}to/tmp/gh-aw/skill_install_failures.json(shared file within the activation job runner), then continue.collect_skill_install_failures.cjs(new) — runs after all install steps (if: always()), reads the shared file, setsfailure_count/errorsstep outputs.Go compiler (
compiler_activation_job_builder.go) — appends the collect step and exposesskill_install_failure_count/skill_install_errorsas activation job outputs.Conclusion job (
notify_comment_conclusion_helpers.go) — passesGH_AW_SKILL_INSTALL_FAILURE_COUNT/GH_AW_SKILL_INSTALL_ERRORStohandle_agent_failure.cjswhen the workflow has skills configured.handle_agent_failure.cjs— reads the env vars, addsbuildSkillInstallFailureContext()renderingskill_install_failure.mdinto the failure issue/comment body, addsskill_install_failurestobuildFailureMatchCategories.main_workflow_schema.json— addsskill_install_failuresto thefailure_categoriespattern.