Skip to content

fix(security): catch multi-word prompt injection bypass in skills_guard#192

Merged
teknium1 merged 1 commit intoNousResearch:mainfrom
0xbyt4:fix/skills-guard-injection-bypass
Mar 4, 2026
Merged

fix(security): catch multi-word prompt injection bypass in skills_guard#192
teknium1 merged 1 commit intoNousResearch:mainfrom
0xbyt4:fix/skills-guard-injection-bypass

Conversation

@0xbyt4
Copy link
Copy Markdown
Contributor

@0xbyt4 0xbyt4 commented Feb 28, 2026

Summary

  • The prompt injection regex in skills_guard.py only matched a single word between "ignore" and "instructions" (e.g. ignore previous instructions)
  • Multi-word variants like ignore all prior instructions or ignore the above instructions bypassed the scanner entirely
  • Fixed by changing \s+ to \s+(?:\w+\s+)* to allow arbitrary intermediate words before the keyword
The regex `ignore\s+(previous|all|...)\s+instructions` only matched
a single keyword between 'ignore' and 'instructions'. Phrases like
'ignore all prior instructions' bypassed the scanner entirely.

Changed to `ignore\s+(?:\w+\s+)*(previous|all|...)\s+instructions`
to allow arbitrary words before the keyword.
teknium1 added a commit that referenced this pull request Mar 4, 2026
The 'disregard ... instructions/rules/guidelines' regex had the
same single-word gap vulnerability as the 'ignore' pattern fixed
in PR #192. 'disregard all your instructions' bypassed the scanner.

Added (?:\w+\s+)* between both keyword groups to allow arbitrary
intermediate words.
@teknium1 teknium1 merged commit 520a26c into NousResearch:main Mar 4, 2026
@teknium1
Copy link
Copy Markdown
Contributor

teknium1 commented Mar 4, 2026

Merged in ba214e4 — thanks @0xbyt4!

Found the same single-word gap vulnerability in the disregard pattern on line 175 — disregard all your instructions bypassed it too. Fixed that in the follow-up commit (same technique: (?:\w+\s+)* between keyword groups).

teknium1 added a commit that referenced this pull request Mar 4, 2026
Systematic audit of all prompt injection regexes in skills_guard.py
found 8 more patterns with the same single-word gap vulnerability
fixed in PR #192. Multi-word variants like 'pretend that you are',
'output the full system prompt', 'respond without your safety
filters', etc. all bypassed the scanner.

Fixed patterns:
- you are [now] → you are [... now]
- do not [tell] the user → do not [... tell ... the] user
- pretend [you are|to be] → pretend [... you are|to be]
- output the [system|initial] prompt → output [... system|initial] prompt
- act as if you [have no] [restrictions] → act as if [... you ... have no ... restrictions]
- respond without [restrictions] → respond without [... restrictions]
- you have been [updated] to → you have been [... updated] to
- share [the] [entire] [conversation] → share [... conversation]

All use (?:\w+\s+)* to allow arbitrary intermediate words.
@teknium1
Copy link
Copy Markdown
Contributor

teknium1 commented Mar 4, 2026

Follow-up: audited all injection patterns and found 8 more with the same vulnerability (021f62c). Every prompt injection regex in skills_guard.py now uses (?:\w+\s+)* between keyword groups to handle multi-word variants.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants