That particular \(a\(b\)*\)*\2 example was added to demonstrate the:
If the containing subexpression does not match, or if there is no match for the contained subexpression within the string matched by the containing subexpression, then back-reference expressions corresponding to the contained subexpression shall not match
part that you did not quote.
Here the containing subexpression is \(a\(b\)*\), the contained subexpression is \(b\).
On abab, \(a\(b\)*\)* can match:
abab but then following \2 won't match
aba, ab then a but then for that a, there is no match for the contained subexpression within the string matched by the containing subexpression, so \2 won't match.
- the first or second
ab, where \2 is b so won't match
- the first or second
a, but the containing subexpression does not match, so \2 won't watch.
- the empty string: same as above.
So your grep/expr implementations are not compliant.
That text was suggested in 2004 by Glenn Fowler from AT&T Labs Research¹ to explicitly clarify those points (Base definitions Enhancement Request Number 7, Aardvark Interpretation 36).
Maybe a more striking example where the GNU grep behaviour can be seen as wrong could be:
$ echo foobarbarbarfoo | grep -x '\(\(foo\)\{0,1\}bar\)*\1\2'
foobarbarbarfoo
where you have \1 matching bar and \2 matching foo even though \2 is meant to be within \1.
In ksh93, using the AT&T Research regex:
$ re='\(a\(b\)*\)*\2' string=abab ksh -c '[[ $string = ~(G)$re ]]; echo "$?"'
1
$ re='^\(\(foo\)\{0,1\}bar\)*\1\2$' string=foobarbarbarfoo ksh -c '[[ $string = ~(G)$re ]]; echo "$?"'
1
Now, in practice, it's unlikely for this kind of bug to be hit, so understandable that no one rushed to fix it.
For historical context, in the Austin Group mailing list archive, one can find:
@ page 172 line 6105-6109 section 9.3.6 comment [gwc BRE nested subpatterns]
Defect code : 3. Clarification required
Problem:
There seems to be a discrepancy between the description of BREs in
XBD6 and the description of regcomp() in XSH6 as regards the treatment
of nested subpatterns with a following * repeater.
The example that prompted this is whether:
echo aba | sed 's/\(a\(b\)*\)*/<\1|\2>/'
should output <a|b> or <a|>.
According to the description of BREs:
If the subexpression referenced by the back-reference matches
more than one string because of an asterisk ( * ) or an interval
expression (see item (5)), the back-reference shall match the last
(rightmost) of these strings.
which seems to imply that the correct output is <a|b> since the last
string that the second subpattern matches is the b that it got from
the first iteration of the first subpattern. (In the second iteration
of the first subpattern the second subpattern doesn't match anything.)
Several sed implementations do output <a|b>.
However, the description of regcomp() says:
- If subexpression
i is contained within another subexpression
j, and i is not contained within any other subexpression that
is contained within j, and a match of subexpression j is
reported in pmatch[j], then the match or non-match of
subexpression i reported in pmatch[i] shall be as described in 1.
and 2. above, but within the substring reported in pmatch[j]
rather than the whole string. The offsets in pmatch[i] are
still relative to the start of string."
Since the second subpattern in the example does not match anything
within the last string matched by the first subpattern, this implies
that regcomp() would report the second subpattern as a non-match.
Although there is no requirement for sed to be implemented using
regcomp(), presumably this is an unintentional difference between the
descriptions of regcomp() and BREs.
The text quoted above on BREs was an addition in POSIX.1-2001 intended
to clarify what happens for repeated subpatterns. Possibly the added
text is too simplistic and should have treated nested subpatterns in
the way that regcomp() does. On the other hand, since several current
sed implementations do not behave as the regcomp() description implies,
perhaps it is the regcomp() description that needs alteration.
(So far I am only aware of one, historical, sed implementation which
outputs <a|> for the above example, however I have not tested many.)
Action:
Issue an official interpretation of the current requirement.
Depending on the outcome of the interpretation, implement a
clarification or correction in the next revision.
By Geoff Clare of the OpenGroup which led to the interpretation mentioned above, though over 20 years later, it doesn't seem to have made much difference to how most sed implementations behave.
¹ and one of the authors of AT&T Software Technology's regex library (included in ast-open and ksh93).
sedand GNUsedon OpenBSD) seem to ignore "If the containing subexpression does not match, or if there is no match for the contained subexpression within the string matched by the containing subexpression, then back-reference expressions corresponding to the contained subexpression shall not match". Because with\1 = a(note: notab) and\2 = b, how can\(a\(b\)*\)*\2matchababwhen "there is no match for the contained subexpression within the string matched by the containing subexpression"?re='\(a\(b\)*\)*\2' string=abab ksh93 -c '[[ $string = ~(G)$re ]]';Gfor BRE)