4

I'm currently studying regular expressions and have a question, which is the one in the title. Since \s represents whitespace, I thought re.split(" ", string) and re.split(r"\s+", string) would return the same values, as shown here:

>>> import re
>>> a = re.split(" ", "Why is this wrong")
>>> a
['Why', 'is', 'this', 'wrong']
>>> a = re.split(r"\s+", "Why is this wrong")
>>> a
['Why', 'is', 'this', 'wrong']

Both calls return the same result, so I assumed they were equivalent. However, it turns out that they are not. In what cases do they differ, and what am I missing?

3
  • 3
    "\s+" represents one or more of any whitespace, including " ", "\t", "\n" and a couple more. " " is just a single space character. Commented Dec 24, 2020 at 13:21
  • 2
    @schwobaseggl So \s can match more than just " ", for example Enter (which is \n), and \s+ can also match a run such as "  " with two space characters? Commented Dec 24, 2020 at 13:23
  • Does this answer your question? Reference - What does this regex mean? Commented Dec 24, 2020 at 13:24

3 Answers

13

These only look the same because of your example.

A split on ' ' (a single space) does exactly that - it splits on a single space. Consecutive spaces will lead to empty "matches" when you split.

A split on '\s+' splits on runs of one or more whitespace characters, and it covers whitespace other than plain spaces:

import re

# Runs of spaces and embedded tabs make the difference visible.
a = re.split(" ", "Why    is this  \t \t  wrong")
b = re.split(r"\s+", "Why    is this  \t \t  wrong")  # raw string keeps \s intact

print(a)
print(b)

Output:

# re.split(" ", data)
['Why', '', '', '', 'is', 'this', '', '\t', '\t', '', 'wrong']

# re.split(r"\s+", data)
['Why', 'is', 'this', 'wrong']

Documentation:

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]. (https://docs.python.org/3/howto/regex.html#matching-characters)
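
To see that class in action, here is a small sketch that checks each of the six listed characters against both patterns. The helper name is_single_match is made up for this illustration. The last two lines rely on the fact that, for str patterns, \s also matches Unicode whitespace unless re.ASCII is passed.

```python
import re

# The six characters in the documented class [ \t\n\r\f\v].
ASCII_WHITESPACE = [" ", "\t", "\n", "\r", "\f", "\v"]

def is_single_match(pattern, ch, flags=0):
    """Return True if `pattern` matches the one-character string `ch` entirely."""
    return re.fullmatch(pattern, ch, flags) is not None

for ch in ASCII_WHITESPACE:
    # Columns: the character, whether \s matches it, whether " " matches it.
    print(repr(ch), is_single_match(r"\s", ch), is_single_match(" ", ch))

# For str patterns, \s also covers Unicode whitespace such as the no-break
# space U+00A0; the re.ASCII flag restricts it to the six characters above.
print(is_single_match(r"\s", "\u00a0"))            # True
print(is_single_match(r"\s", "\u00a0", re.ASCII))  # False
```

Only the first row prints True in both columns, because " " as a pattern matches nothing but a literal space.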

4

The difference is in which characters count as separators. '\s' matches any whitespace character (space, \t, \n, \r, \f, \v), not only " ". '+' means one or more of them in a row, so a run such as " \n \r  \t \v" is treated as a single separator. In my opinion, if plain string operations are enough for the separation, you should use standard methods like my_string.split(). Otherwise you should use regex. The regex engine has a cost, and a developer should be able to anticipate it.
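
As a rough sketch of that trade-off (the sample string is invented for illustration), here is how the argument-free str.split() compares with re.split on r"\s+". The two are close but not identical: the regex version keeps empty strings at the edges when the input starts or ends with whitespace.

```python
import re

text = "  Why   is\tthis \n wrong  "

# str.split() with no argument splits on runs of whitespace and
# ignores leading and trailing whitespace.
plain = text.split()

# re.split(r"\s+", ...) also splits on runs of whitespace, but it keeps the
# empty strings before the first and after the last separator.
with_regex = re.split(r"\s+", text)

# Stripping first makes the two results agree.
with_regex_stripped = re.split(r"\s+", text.strip())

print(plain)                # ['Why', 'is', 'this', 'wrong']
print(with_regex)           # ['', 'Why', 'is', 'this', 'wrong', '']
print(with_regex_stripped)  # ['Why', 'is', 'this', 'wrong']
```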

2

In terms of the code you posted, there is not much difference between the two as far as the goal is concerned: both produce this output.

['Why', 'is', 'this', 'wrong']

The difference is just the way the string gets split. Both calls use the split() function from re, not the built-in .split() method of a str object; what differs is the pattern passed as the first argument.

re.split(" ", "Why is this wrong") splits the string on the pattern " " (the first argument), which matches exactly one space character.

re.split("\s+", "Why is this wrong") splits the string on the regular expression \s+.

Note that " " is not the same as \s+. As a pattern, " " matches only a literal space, while \s+ is a regex construct with a broader meaning. You can find out more about regex here.

\s+ -> Matches one or more consecutive whitespace characters

I should also say that if you want to split a string on a pattern rather than on one fixed string, then regex is for you.
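
As one possible illustration of splitting on a pattern, the sketch below separates fields on either a comma or a semicolon and swallows any whitespace around the delimiter. The function name split_fields and the sample input are hypothetical, not taken from the question.

```python
import re

def split_fields(line):
    """Split on a comma or semicolon, swallowing whitespace around it."""
    return re.split(r"\s*[,;]\s*", line)

print(split_fields("apples, pears;plums ;  figs"))
# ['apples', 'pears', 'plums', 'figs']
```

A single call to str.split() cannot express "comma or semicolon, with optional surrounding whitespace", which is the kind of case where re.split earns its cost.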
