Difference between re.split(" ", string) and re.split("\s+", string)?

Question

I'm currently studying regular expressions, and my question is the one in the title. Since \s represents whitespace, I thought re.split(" ", string) and re.split("\s+", string) would return the same values, as shown below:

>>> import re
>>> a = re.split(" ", "Why is this wrong")
>>> a
['Why', 'is', 'this', 'wrong']

>>> import re
>>> a = re.split("\s+", "Why is this wrong")
>>> a
['Why', 'is', 'this', 'wrong']

These two return the same result, so I thought they were the same thing. However, it turns out that they are different. In what case would the results differ? What am I missing here?

"\s+" represents one or more of any whitespace, including " ", "\t", "\n" and a couple more. " " is just a single space character.
– user2390182, Commented Dec 24, 2020 at 13:21
@schwobaseggl So \s can represent more than just " ", for example Enter (which is \n), or "  " with two space characters?
– Sihwan Lee, Commented Dec 24, 2020 at 13:23
Does this answer your question? Reference - What does this regex mean?
– Tim Biegeleisen, Commented Dec 24, 2020 at 13:24

Patrick Artner · Accepted Answer · 2020-12-24 13:25:35Z

These only look the same because of your example.

A split on ' ' (a single space) does exactly that - it splits on a single space. Consecutive spaces will lead to empty "matches" when you split.

A split on '\s+' treats a run of consecutive whitespace characters as one separator, and it covers other whitespace characters than "pure spaces":

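import re

data = "Why    is this  \t \t  wrong"

# Split on every single space character.
a = re.split(" ", data)
# Split on runs of any whitespace (the raw string keeps \s as a regex escape).
b = re.split(r"\s+", data)

print(a)
print(b)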

Output:

# re.split(" ",data)
['Why', '', '', '', 'is', 'this', '', '\t', '\t', '', 'wrong']

# re.split("\s+",data)
['Why', 'is', 'this', 'wrong']

Documentation:

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]. (https://docs.python.org/3/howto/regex.html#matching-characters)
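
As a quick check of the character class quoted above, here is a minimal sketch (the variable names are only illustrative) showing that each of the six listed characters matches \s on its own, while a plain " " pattern matches only the space:

```python
import re

# The six characters from the documented class [ \t\n\r\f\v].
whitespace_chars = [" ", "\t", "\n", "\r", "\f", "\v"]

# re.fullmatch returns a match object only if the whole string matches.
all_match = all(re.fullmatch(r"\s", ch) for ch in whitespace_chars)
print(all_match)  # True

# A plain " " pattern matches only the space character itself.
only_space = [ch for ch in whitespace_chars if re.fullmatch(" ", ch)]
print(only_space)  # [' ']
```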

Emre Demirkol · Accepted Answer · 2020-12-24 13:34:32Z

It is about whitespace characters. '\s' matches any single whitespace character (space, \t, \n, \r, \f, \v), and '+' means one or more of them in a row, for example " \n \r \t \v". In my opinion, if plain string operations are enough for the separation, you should use standard methods like my_string.split(). Otherwise, use regex. The regex engine has a cost, and the developer should be able to anticipate it.
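
To illustrate the comparison with the built-in method, here is a small sketch with a made-up sample string. str.split() with no arguments also splits on runs of whitespace, but unlike re.split it does not produce empty strings for leading and trailing whitespace:

```python
import re

# Hypothetical sample with leading, trailing, and mixed whitespace.
text = "  Why   is\tthis \n wrong  "

builtin_result = text.split()
regex_result = re.split(r"\s+", text)
stripped_result = re.split(r"\s+", text.strip())

print(builtin_result)   # ['Why', 'is', 'this', 'wrong']
print(regex_result)     # ['', 'Why', 'is', 'this', 'wrong', '']
print(stripped_result)  # ['Why', 'is', 'this', 'wrong']
```

Stripping the string first makes the regex version agree with str.split() for this input.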

Ice Bear · Accepted Answer · 2020-12-24 13:33:26Z

For the code you posted, there is not much of a difference between the two in terms of the goal; both output this:

["Why", "is", "this", "wrong"]

The difference is in how the string gets split. Both calls use the split() function from the re module; only the pattern passed as the first argument differs.

Now, re.split(" ", "Why is this wrong") just splits the string on the character " ", your first argument.

And re.split("\s+", "Why is this wrong") splits your string based on the regular expression \s+.

Take note that " " is not the same as \s+. The pattern \s+ carries a meaning of its own, while " " is basically just a literal str. You can find out more about regex here.

\s+ -> Matches one or more consecutive whitespace characters

I should also say that if you want to split a string on a pattern rather than on one fixed string, then regex is for you.
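
As a rough sketch of what splitting on a pattern can look like (the sample record and the delimiter pattern are made up for illustration), the following splits on commas or semicolons and swallows any whitespace around them:

```python
import re

# Hypothetical input with inconsistent delimiters and spacing.
record = "apple, banana;cherry ,  date"

# Optional whitespace, then a comma or semicolon, then optional whitespace.
fields = re.split(r"\s*[,;]\s*", record)
print(fields)  # ['apple', 'banana', 'cherry', 'date']
```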
