Python unicode regex issue

Question

Why does this work:

>>> ss
u'\U0001f300'
>>> r = re.compile(u"[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this works
<_sre.SRE_Match object at 0x7f359acf03d8>

But this doesn't:

>>> r = re.compile("[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this doesn't

Based on Ignacio's answer below, this also works:

>>> r = re.compile(u"[\U0001F300-\U0001F5FF]+", re.UNICODE)
>>> r.search(ss)
<_sre.SRE_Match object at 0x7f359acf03d8>

Those u'..' inside the character classes are not doing anything except including u as a legal match - along with the apostrophe, twice. — Mark Reed, Commented Oct 22, 2015 at 0:50
@MarkReed I don't understand. Based on what you said, how did my very first match succeed (in my post above)? — Ankur Agarwal, Commented Oct 22, 2015 at 0:52
Your first match says: "match one or more of any of the codepoints u, single apostrophe, and any character from U+1F300 to U+1F5FF". ss contains the single codepoint U+1F300, which meets the requirements. — Mark Tolonen, Commented Oct 22, 2015 at 0:59
Character classes are "or"s. [ax-z] matches any of a, x, y or z. Your character class matches u or ' or U+1F300 or U+1F301 or ... or U+1F5FE or U+1F5FF. — Mark Reed, Commented Oct 22, 2015 at 1:03
re.UNICODE only affects the behavior of \d, \s, \w and has nothing to do with the Unicode/byte semantic of the regex engine. — nhahtdh, Commented Oct 22, 2015 at 4:41
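
The point the comments make about character classes can be checked directly. The following is a minimal sketch, assuming Python 3 (where `str` literals are already Unicode and the `u` prefix is optional); the names `noisy` and `clean` are illustrative and do not appear in the original post.

```python
import re

# Pattern from the question: the stray u and the apostrophes are
# ordinary members of the character class, next to the code point range.
noisy = re.compile(u"[u'\U0001F300-\U0001F5FF']+")

# Corrected pattern: only the range U+1F300..U+1F5FF.
clean = re.compile(u"[\U0001F300-\U0001F5FF]+")

ss = u"\U0001F300"

# Both patterns match the haystack from the question.
print(noisy.search(ss) is not None)
print(clean.search(ss) is not None)

# Only the noisy pattern also accepts the letter u and the apostrophe.
print(noisy.search(u"u'") is not None)
print(clean.search(u"u'") is not None)
```

The first match in the question succeeds because U+1F300 falls inside the range, not because of the `u'...'` wrapper; the wrapper only widens the set of accepted characters.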

Ignacio Vazquez-Abrams · Accepted Answer · 2015-10-22 00:47:24Z

3

Use a unicode pattern when performing a search on a unicode haystack.

Also, the u'...' should not be in the pattern. Inside a unicode pattern, \U0001F300 and \U0001F5FF are already Unicode characters without it; the u and the apostrophes only become extra members of the character class.
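
A likely reason the non-unicode pattern fails silently in Python 2 is that a plain byte-string literal does not process `\U` escapes, so the pattern never contains the intended code points. The sketch below assumes Python 3, where a raw literal is used to approximate what the Python 2 byte string holds, and where mixing a bytes pattern with a text haystack raises `TypeError` instead of quietly returning no match. The variable names are illustrative.

```python
import re

haystack = u"\U0001F300"  # text (unicode) haystack, as in the question

# In a text literal, \U0001F300 is a single code point.
escaped_len = len(u"\U0001F300")

# A raw literal keeps the ten characters backslash, U, 0, 0, 0, 1, F, 3, 0, 0,
# roughly what a Python 2 byte-string literal would contain.
unescaped_len = len(r"\U0001F300")
print(escaped_len, unescaped_len)

# Text pattern on a text haystack: the supported combination.
text_pattern = re.compile(u"[\U0001F300-\U0001F5FF]+")
print(text_pattern.search(haystack) is not None)

# Bytes pattern on a text haystack: Python 3 rejects the mix outright.
bytes_pattern = re.compile(b"[a-z]+")
try:
    bytes_pattern.search(haystack)
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

Either way the rule from the answer holds: keep the pattern and the haystack the same kind of string.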
