Using a regex to extract substrings in Python

Question

Suppose I have a string like this:

x = 'Wish she could have told me herself. @NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'

From it, I want to extract:

:heavy_black_heart:
:smiling_face:

To do that, I tried the following:

import re
result = re.search(':(.*?):', x)
result.group()

It only gives me ':heavy_black_heart:'. How can I make it return all of them? If possible, I want to store them in a dictionary once I have found them all.

Maybe set(re.findall(r':[^:]+:', x)) will do? I am not sure what can appear between the colons; maybe r':\w+:' will work better.
– Wiktor Stribiżew, Commented Sep 14, 2017 at 12:13
@WiktorStribiżew It works for the example, but I don't understand why you're not sure.
– zwlayer, Commented Sep 14, 2017 at 12:20
See my answer with some explanations. You have not provided all the requirements, just two examples; that is why I said I was not sure.
– Wiktor Stribiżew, Commented Sep 14, 2017 at 12:23
Do you really want to match ::? As I said, you did not post exact specs. If you need to match any chars inside :...: that are not whitespace, use :[^\s:]+: (see my updated answer).
– Wiktor Stribiżew, Commented Sep 14, 2017 at 12:48

Bhawan · Accepted Answer · 2017-09-14 12:49:34Z

3

print(re.findall(':.*?:', x)) does the job.

Output:
[':heavy_black_heart:', ':heavy_black_heart:', ':smiling_face:']

To remove the duplicates, collect the matches into a set:

res = re.findall(':.*?:', x)
unique = set(res)  # a set keeps each match once; its order is not guaranteed
print(list(unique))

Output (order may vary):
[':heavy_black_heart:', ':smiling_face:']
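
The question also mentions storing the results in a dictionary. One possible reading is a mapping from each code to how often it occurs. Here is a minimal sketch under that assumption; the names emoji_counts and unique_in_order are illustrative and do not come from the original posts:

```python
import re
from collections import Counter

x = ('Wish she could have told me herself. @NicoleScherzy #nicolescherzinger '
     '#OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: '
     'some string too :smiling_face:')

matches = re.findall(r':.*?:', x)

# Assumption: "dictionary" means code -> number of occurrences.
emoji_counts = dict(Counter(matches))

# dict.fromkeys keeps first-seen order (Python 3.7+), unlike a set.
unique_in_order = list(dict.fromkeys(matches))

print(emoji_counts)
print(unique_in_order)
```

If only the distinct codes matter and their order is irrelevant, the plain set shown above is probably enough.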

edited Sep 14, 2017 at 12:49

answered Sep 14, 2017 at 12:15


7 Comments

Wiktor Stribiżew Over a year ago

re.MULTILINE is not doing anything with the pattern since there are no ^ and $ to modify the behavior of. re.match only searches for a match at the beginning of the string.

Wiktor Stribiżew Over a year ago

Now, you do not have : in the matches.

Bhawan Over a year ago

Check now @WiktorStribiżew

Wiktor Stribiżew Over a year ago

You do not need any capturing group, remove ( and ). It will still match :: (not sure it is expected).

Bhawan Over a year ago

Thanks for pointing that out. The capturing parentheses are removed. No, it won't match ::


Wiktor Stribiżew · Accepted Answer · 2017-09-14 12:44:49Z

You seem to want to match smileys, that is, some symbols between two colons. The .*? can match zero symbols, so your regex can also match ::, which I think is not what you want. Besides, re.search only returns one match (the first); to get multiple matches, you usually use re.findall or re.finditer.

I think you need

set(re.findall(r':[^:]+:', x))

or, if you only need to match word chars between the two colons:

set(re.findall(r':\w+:', x))

or, if you want to match any non-whitespace chars between the two colons:

set(re.findall(r':[^\s:]+:', x))

re.findall finds all non-overlapping occurrences, and set removes the duplicates.

Each pattern matches :, then one or more chars other than : ([^:]+) (or, with \w+, one or more letters, digits and _; or, with [^\s:]+, one or more chars that are neither whitespace nor :), and then : again.

>>> import re
>>> x = 'Wish she could have told me herself. @NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'
>>> print(set(re.findall(r':[^:]+:', x)))
{':smiling_face:', ':heavy_black_heart:'}
>>>
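
The differences between the patterns discussed above can be checked with a small sketch. The two short test inputs ('::' and ':two words:') are made up for illustration and are not part of the original question:

```python
import re

x = ('Wish she could have told me herself. @NicoleScherzy #nicolescherzinger '
     '#OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: '
     'some string too :smiling_face:')

# re.search stops at the first match; re.finditer yields every match object.
first = re.search(r':\w+:', x).group()
all_matches = [m.group() for m in re.finditer(r':\w+:', x)]

# The lazy dot pattern accepts an empty body; [^:]+ needs at least one char.
empty_lazy = re.findall(r':.*?:', '::')
empty_strict = re.findall(r':[^:]+:', '::')

# Only [^:]+ accepts a body containing a space (hypothetical input).
spaced = ':two words:'
spaced_any = re.findall(r':[^:]+:', spaced)
spaced_word = re.findall(r':\w+:', spaced)
spaced_nonspace = re.findall(r':[^\s:]+:', spaced)

print(first, all_matches)
print(empty_lazy, empty_strict)
print(spaced_any, spaced_word, spaced_nonspace)
```

Which variant fits best depends on what may legitimately appear between the colons, which the question does not fully specify.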

Jean Michél Marca · Accepted Answer · 2017-09-14 12:15:41Z

0

Try this regex:

:([a-z0-9:A-Z_]+):

answered Sep 14, 2017 at 12:15


2 Comments

zwlayer Over a year ago

When I try it, it produces ':heavy_black_heart::heavy_black_heart:' which isn't what I want

Wiktor Stribiżew Over a year ago

@zwlayer It returns that match because : is inside the character class and + is a greedy quantifier, so all the chars defined in the character class are matched first, as many as possible occurrences, up to the last : that occurs after _, letters and digits.
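
The greedy behaviour described in this comment can be reproduced with a short sketch. The second pattern, with : removed from the character class, is an illustrative variation and not something the answer's author proposed:

```python
import re

x = ('Wish she could have told me herself. @NicoleScherzy #nicolescherzinger '
     '#OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: '
     'some string too :smiling_face:')

# With ':' inside the class, the greedy + runs across the '::' boundary
# and only backtracks far enough to leave one ':' for the final literal.
with_colon = re.search(r':([a-z0-9:A-Z_]+):', x).group()

# Without ':' in the class, each match has to end at the next colon.
without_colon = [m.group() for m in re.finditer(r':([a-zA-Z0-9_]+):', x)]

print(with_colon)
print(without_colon)
```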

Arun · Accepted Answer · 2017-09-14 12:19:57Z

0

import re
x = 'Wish she could have told me herself. @NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'
print(set(re.findall(':.*?:', x)))

output:

{':heavy_black_heart:', ':smiling_face:'}

answered Sep 14, 2017 at 12:19


Eric Duminil · Accepted Answer · 2017-09-14 12:56:21Z

0

Just for fun, here's a simple solution without regex. It splits around ':' and keeps the elements with odd index:

>>> text = 'Wish she could have told me herself. @NicoleScherzy #nicolescherzinger #OneLove #myfav #MyQueen :heavy_black_heart::heavy_black_heart: some string too :smiling_face:'
>>> text.split(':')[1::2]
['heavy_black_heart', 'heavy_black_heart', 'smiling_face']
>>> set(text.split(':')[1::2])
set(['heavy_black_heart', 'smiling_face'])
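
One caveat worth keeping in mind: the odd-index trick relies on every colon in the text being a delimiter. A small sketch with a made-up input that contains other colons shows how the split version and a regex version can diverge:

```python
import re

# Hypothetical input where some colons are not emoji delimiters.
text = 'Time: 12:30 :smiling_face:'

# Splitting pairs up colons blindly, so ' 12' is picked up as well.
by_split = text.split(':')[1::2]

# The regex only accepts word characters between two colons.
by_regex = re.findall(r':(\w+):', text)

print(by_split)
print(by_regex)
```

For input like the string in the question, where colons only ever wrap the codes, both approaches give the same names.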

answered Sep 14, 2017 at 12:56
