Regular Expression Backtracking Control Verb Arguments

Backtracking control verb arguments are only supported by Perl, PCRE 8.10 and later, and PCRE2. While Boost and PCRE 7.3 to 8.02 support backtracking control verbs, they do not allow these verbs to have arguments.

R, Delphi XE7, and PHP 5.3.4 and later support control verb arguments in regular expressions as they are based on PCRE or PCRE2. But they do not provide a way to do anything with those arguments after the match. They do not use PCRE2’s replacement string syntax.

MARK

(*MARK:ARG) is a verb that does nothing except set its argument. You can even omit the verb and just specify (*:ARG) if you only want to set the argument. On its own, setting an argument does not affect the matching process at all. The regex engine does keep track of the argument. You can use it after the match to determine which path the regex engine took through the regex. Perl and PCRE2 allow a search-and-replace to insert the argument of the last control verb that was encountered during the matching process.

In a Perl script, when a regex finds a match you can use $REGMARK to retrieve the argument of the last control verb that was involved in the match. When a regex fails, $REGERROR holds that same argument, unless the failure was triggered by a backtracking control verb that has its own argument. In that case $REGERROR holds the argument of the verb that triggered the failure. These are not magic variables like $1 and they are not local to a scope. They are volatile package variables. If a regex does not have any control verbs then it does not touch these variables.

PCRE and PCRE2 allow you to retrieve the argument of the last control verb to be involved in the match attempt using a single method for both successful and failed matches. But the code for PCRE and PCRE2 is quite different. For PCRE it looks like this:

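unsigned char *mark = NULL;
pcre_extra extra;
extra.flags = PCRE_EXTRA_MARK;  /* tells PCRE that only the mark field is in use */
extra.mark = &mark;             /* pcre_exec() stores the pointer in our variable */
pcre_exec(re, &extra, ...);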

After the call to pcre_exec(), our variable mark will be filled with a pointer to a null-terminated string that holds the argument of the last control verb. It will be NULL if there is no such argument.

PCRE2 provides a much simpler way. Simply call pcre2_get_mark() and pass it a pointer to the same pcre2_match_data structure that you passed to the preceding call to pcre2_match(). It returns a pointer to a null-terminated string that holds the argument of the last control verb. It returns NULL if there is no such argument.

With both PCRE and PCRE2, the actual string that you get a pointer to is part of the compiled pattern. The string is destroyed when the pattern is destroyed. If you need to manipulate this string in any way then you should make a copy of it.

None of the languages and applications discussed in this tutorial that have regex support based on PCRE or PCRE2 expose this functionality.

SKIP:ARG

(*SKIP:ARG) behaves quite differently from (*SKIP) without the argument. a+b+(*SKIP)d|.{2,3} matches cc in aabbcc because the regex engine skips ahead to the position of (*SKIP) when d fails.

But a+(*MARK:ARG)b+(*SKIP:ARG)d|.{2,3} matches bbc in the same string. The regex engine skips ahead to the position where the argument was marked when d fails.

Let’s see how this works. First, a+ matches aa. Then (*MARK:ARG) adds a backtracking position to the stack that has no effect other than to remember that :ARG was reached at the position between the second a and the first b in the string. b+ then matches bb. (*SKIP:ARG) is added to the backtracking stack, including its argument. d fails to match c. The engine backtracks. It finds (*SKIP:ARG) at the top of the stack. The argument tells the regex engine to look further down the stack to see where the argument was last marked. It finds that (*MARK:ARG) was encountered between the second a and the first b in the string. The regex engine then clears the entire backtracking stack and advances to the position where the argument was marked.

The matching process starts anew at this position. a fails to match b. The engine backtracks to the second alternative. .{2,3} matches bbc which is the overall match.

If the argument was never set then SKIP has no effect. a+b+(*SKIP:ARG)d|(*MARK:ARG).{2,3} and a+b+(*SKIP:ARG)d|.{2,3} both match aab in aabbcc. With both regexes, when d fails and (*SKIP:ARG) is popped off the backtracking stack, the regex engine searches the backtracking stack in vain to see where the argument was last marked. The first regex hasn’t marked it yet because the second alternative hasn’t been tried yet. The second regex never marks the argument at all. Either way, the regex engine decides that this (*SKIP:ARG) is a dud and backtracks normally. b+ and then a+ are forced to give up their matches. The engine then backtracks to the second alternative which matches aab at the start of the string.

Arguments on Other Verbs

An argument on any other backtracking control verb is set when that control verb is reached during the matching process. The control verb itself performs its usual function, regardless of the argument. So (*PRUNE:ARG) is almost the same as (*:ARG)(*PRUNE). The only difference is that (*SKIP:ARG) only looks for arguments explicitly set with (*:ARG) or (*MARK:ARG). (*SKIP:ARG) ignores arguments on other control verbs.

So arguments on verbs other than SKIP and MARK don’t have any purpose other than letting you determine afterwards which backtracking control verb caused the match attempt to fail or to be accepted. While the latest versions of Perl and PCRE2 allow arguments on all backtracking control verbs, this wasn’t the case originally.

Perl allows arguments on all verbs except (*FAIL:ARG), which is only allowed since Perl 5.24.

PCRE2 has always allowed (*PRUNE:ARG) and (*THEN:ARG). But (*COMMIT:ARG), (*ACCEPT:ARG), and (*FAIL:ARG) have only been allowed since PCRE2 10.32.

PCRE 8.21 and later allow (*PRUNE:ARG) and (*THEN:ARG). PCRE never allowed (*COMMIT:ARG), (*ACCEPT:ARG), or (*FAIL:ARG).