Why does this work:

>>> ss
u'\U0001f300'
>>> r = re.compile(u"[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this works
<_sre.SRE_Match object at 0x7f359acf03d8>

But this doesn't:

>>> r = re.compile("[u'\U0001F300-\U0001F5FF']+", re.UNICODE)
>>> r.search(ss) # this doesn't

Based on Ignacio's answer below, this also works:

>>> r = re.compile(u"[\U0001F300-\U0001F5FF]+", re.UNICODE)
>>> r.search(ss)
<_sre.SRE_Match object at 0x7f359acf03d8>
  • Those u'..' inside the character classes are not doing anything except including u as a legal match - along with the apostrophe, twice.
    – Mark Reed
    Commented Oct 22, 2015 at 0:50
  • @MarkReed I don't understand. Based on what you said, how did my very first match succeed (in my post above)? Commented Oct 22, 2015 at 0:52
  • Your first match says: "match one or more of any of the codepoints u, single apostrophe, and any character from U+1F300 to U+1F5FF". ss contains the single codepoint U+1F300, which meets the requirements. Commented Oct 22, 2015 at 0:59
  • Character classes are "or"s. [ax-z] matches any of a, x, y or z. Your character class matches u or ' or U+1F300 or U+1F301 or ... or U+1F5FE or U+1F5FF.
    – Mark Reed
    Commented Oct 22, 2015 at 1:03
  • re.UNICODE only affects the behavior of \d, \s, \w and has nothing to do with the Unicode/byte semantic of the regex engine.
    – nhahtdh
    Commented Oct 22, 2015 at 4:41
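
To make the comments above concrete, here is a minimal sketch of how a character class reads as an "or" over single characters. It assumes Python 3 (or a wide Python 2 build), and the names simple, noisy, members and outside are made up for illustration; U+1F600 is just an arbitrary code point above the range.

```python
import re

# [ax-z] is an "or" over single characters: a, x, y or z.
simple = re.compile(u"[ax-z]")
matched = [c for c in u"abwxyz" if simple.match(c)]

# The class from the question reads the same way:
# u, or ', or any code point from U+1F300 to U+1F5FF.
noisy = re.compile(u"[u'\U0001F300-\U0001F5FF']+")
members = [u"u", u"'", u"\U0001F300", u"\U0001F5FF"]
all_members_match = all(noisy.match(c) for c in members)

# A code point above the range (assumed example) is not in the class.
outside = noisy.match(u"\U0001F600")

print(matched, all_members_match, outside)
```

So the first pattern in the question matches ss only because U+1F300 happens to be one of the alternatives, not because the u'...' does anything useful.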

1 Answer

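Use a unicode pattern when searching a unicode haystack. In Python 2, \U escapes are only interpreted inside unicode literals, so the pattern needs the u prefix on the whole string.

Also, the u'...' should not be inside the pattern. The \U escapes already denote Unicode characters in a unicode literal, so the extra u and apostrophes only add those literal characters to the character class.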
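
A small sketch of the cleaned-up pattern, assuming Python 3 (where the u prefix is optional and every str is Unicode) or a wide Python 2 build. The names pattern, found and stray are illustrative only.

```python
import re

ss = u"\U0001F300"  # the haystack from the question: one code point

# Unicode pattern, with only the intended range inside the class.
pattern = re.compile(u"[\U0001F300-\U0001F5FF]+")

m = pattern.search(ss)
found = m.group(0) if m else None

# With the stray u'...' gone, a plain u or apostrophe no longer matches.
stray = pattern.search(u"u'")

print(repr(found), stray)
```

The re.UNICODE flag is left out here on purpose: as noted in the comments, it only changes \d, \s and \w, none of which appear in this pattern.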
