8

According to chapter 9. Regular Expressions, subchapter 9.3.6 BREs Matching Multiple Characters, number "3.": "... the expression "\(a\(b\)*\)*\2" fails to match 'abab' ...".

But when I tried this using grep and expr:

echo "abab" | grep "\(a\(b\)*\)*\2"
expr "abab" : "\(a\(b\)*\)*\2"

they both matched matched the string "abab".

I guess according to the documentation, the first time the expression "\(a\(b\)*\)*" matches "ab" and the second time it matches "a" and according to 3. "...when a subexpression matches more than one string, a back-reference expression corresponding to the subexpression shall refer to the last matched string...", the "\2" back-references to the "b" from the first match. And therefore the expression "\(a\(b\)*\)*\2" actually matches the string 'abab'.

Could it be that the sample in the POSIX documentation is incorrect?

4
  • 1
    The implementations (tested sed and GNU sed on OpenBSD) seem to ignore "If the containing subexpression does not match, or if there is no match for the contained subexpression within the string matched by the containing subexpression, then back-reference expressions corresponding to the contained subexpression shall not match". Because with \1 = a (note: not ab) and \2 = b, how can \(a\(b\)*\)*\2 match abab when "there is no match for the contained subexpression within the string matched by the containing subexpression"? Commented 6 hours ago
  • 2
    That text was suggested in 2004 by Glenn Fowler from AT&T Labs Research, and you can see it doesn't match in ksh93 that still uses their regex implementation. Commented 4 hours ago
  • 1
    (see: re='\(a\(b\)*\)*\2' string=abab ksh93 -c '[[ $string = ~(G)$re ]]'; G for BRE) Commented 4 hours ago
  • BTW the links to the js regex library at becke.ch seem to be broken. Commented 2 hours ago

1 Answer 1

6

That particular \(a\(b\)*\)*\2 example was added to demonstrate the:

If the containing subexpression does not match, or if there is no match for the contained subexpression within the string matched by the containing subexpression, then back-reference expressions corresponding to the contained subexpression shall not match

part that you did not quote.

Here the containing subexpression is \(a\(b\)*\), the contained subexpression is \(b\).

On abab, \(a\(b\)*\)* can match:

  • abab but then following \2 won't match
  • aba, ab then a but then for that a, there is no match for the contained subexpression within the string matched by the containing subexpression, so \2 won't match.
  • the first or second ab, where \2 is b so won't match
  • the first or second a, but the containing subexpression does not match, so \2 won't watch.
  • the empty string: same as above.

So your grep/expr implementations are not compliant.

That text was proposed in 2004 by Glenn Fowler from AT&T Labs Research¹ to explicitly clarify those points (Base definitions Enhancement Request Number 7, Aardvark Interpretation 36).

Maybe a more striking example where the GNU (and many others) grep behaviour can be seen as wrong could be:

$ echo foobarbarbarfoo | grep -x '\(\(foo\)\{0,1\}bar\)*\1\2'
foobarbarbarfoo

where you have \1 matching bar and \2 matching foo even though \2 is meant to be within \1.

In ksh93, using the AT&T Research regex BREs via its ~(G) glob operator (like the grep -G of a few grep implementations including GNU's or ast's where it's the default there):

$ re='\(a\(b\)*\)*\2' string=abab ksh -c '[[ $string = ~(G)$re ]]; echo "$?"'
1
$ re='^\(\(foo\)\{0,1\}bar\)*\1\2$' string=foobarbarbarfoo ksh -c '[[ $string = ~(G)$re ]]; echo "$?"'
1

No match in both cases as required by POSIX.

Now, in practice, it's unlikely for this kind of bug to be hit, so understandable that no one rushed to fix it.

For historical context, in the Austin Group mailing list archive, one can find:

@ page 172 line 6105-6109 section 9.3.6 comment [gwc BRE nested subpatterns]

Defect code : 3. Clarification required

Problem:

There seems to be a discrepancy between the description of BREs in XBD6 and the description of regcomp() in XSH6 as regards the treatment of nested subpatterns with a following * repeater.

The example that prompted this is whether:

echo aba | sed 's/\(a\(b\)*\)*/<\1|\2>/'

should output <a|b> or <a|>.

According to the description of BREs:

If the subexpression referenced by the back-reference matches more than one string because of an asterisk ( * ) or an interval expression (see item (5)), the back-reference shall match the last (rightmost) of these strings.

which seems to imply that the correct output is <a|b> since the last string that the second subpattern matches is the b that it got from the first iteration of the first subpattern. (In the second iteration of the first subpattern the second subpattern doesn't match anything.) Several sed implementations do output <a|b>.

However, the description of regcomp() says:

  1. If subexpression i is contained within another subexpression j, and i is not contained within any other subexpression that is contained within j, and a match of subexpression j is reported in pmatch[j], then the match or non-match of subexpression i reported in pmatch[i] shall be as described in 1. and 2. above, but within the substring reported in pmatch[j] rather than the whole string. The offsets in pmatch[i] are still relative to the start of string."

Since the second subpattern in the example does not match anything within the last string matched by the first subpattern, this implies that regcomp() would report the second subpattern as a non-match.

Although there is no requirement for sed to be implemented using regcomp(), presumably this is an unintentional difference between the descriptions of regcomp() and BREs.

The text quoted above on BREs was an addition in POSIX.1-2001 intended to clarify what happens for repeated subpatterns. Possibly the added text is too simplistic and should have treated nested subpatterns in the way that regcomp() does. On the other hand, since several current sed implementations do not behave as the regcomp() description implies, perhaps it is the regcomp() description that needs alteration. (So far I am only aware of one, historical, sed implementation which outputs <a|> for the above example, however I have not tested many.)

Action:

Issue an official interpretation of the current requirement.

Depending on the outcome of the interpretation, implement a clarification or correction in the next revision.

By Geoff Clare of the OpenGroup which led to the interpretation mentioned above, though over 20 years later, it doesn't seem to have made much difference to how most sed implementations behave.


¹ and one of the authors of AT&T Software Technology's regex library (included in ast-open and ksh93).

2
  • Question regarding: aba, ab then a but then for that a, there is no match for the contained subexpression within the string matched by the containing subexpression, so \2 won't match. Is there no match for the contained subexpression or is it an empty match? <asterisk> it shall match what zero or more consecutive occurrences of the BRE would match Commented 3 hours ago
  • 1
    @becke-ch, there's no match. It's not \(b*\) where b* would match an empty string, its \(b\)*, where b doesn't match. Commented 3 hours ago

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.