I have a dataframe like this:

name      link
apple    example1.com/dsa/es?id=2812168&width=1200/web/map&resize.html
banana.  example2.com/es?id=28132908&width=1220/web/map_resize.html
orange.  example3.com/es?id=3209908&width=1120/web&map_resize.html

Each name's ID is buried in the link, and the links may have different structures. However, I know that the pattern is always 'id=' + 'what I want' + '&'.

Is there a way to extract the ID from the link and put it back into the dataframe to get the following?

name      link
apple    2812168
banana.  28132908
orange.  3209908

I tried this:

df['name'] = df['name'].str.extract(r'id=\s*([^\.]*)\s*\\&', expand=False)

but it returns a column of all NaN.

Also, there may be more than one & in the link.

3 Answers

I think the IDs are always numbers, so this is somewhat cleaner:

df["link"] = df['link'].str.extract(r'id=(\d+)&', expand=False)
print(df)
#     name      link
#0   apple   2812168
#1  banana  28132908
#2  orange   3209908
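
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The dataframe is rebuilt by hand from the question's sample rows (an assumption about how the real data looks), and the variable names `attempt` and `ids` are just illustrative. It also shows one reason the pattern in the question returns NaN: in a raw string, `\\&` asks for a literal backslash followed by `&`, which never occurs in these links. The question's code also reads from the name column rather than the link column.

```python
import pandas as pd

# Sample data copied from the question (names without the trailing dots).
df = pd.DataFrame({
    "name": ["apple", "banana", "orange"],
    "link": [
        "example1.com/dsa/es?id=2812168&width=1200/web/map&resize.html",
        "example2.com/es?id=28132908&width=1220/web/map_resize.html",
        "example3.com/es?id=3209908&width=1120/web&map_resize.html",
    ],
})

# The question's pattern, applied to the link column: the trailing \\&
# requires a backslash before the &, so no row matches.
attempt = df["link"].str.extract(r"id=\s*([^\.]*)\s*\\&", expand=False)
print(attempt.isna().all())

# Capture only the digits between 'id=' and the next '&'.
ids = df["link"].str.extract(r"id=(\d+)&", expand=False)
print(ids.tolist())
```

Because `\d+` can only match digits, the match stops at the first `&` no matter how many `&` characters appear later in the link.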

2 Comments

Thank you! Yes ID is always a number. This works perfectly in this context!
Glad I could help :)

Let's try split:

df['link'].str.split('id=').str[1].str.split('&').str[0]
0     2812168
1    28132908
2     3209908
Name: link, dtype: object
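
A small sketch of how the split chain behaves, in case it helps. The two links below are made up for illustration (they are not from the question); the second one has no 'id=' at all, and as far as I can tell the chain then yields NaN for that row instead of raising.

```python
import pandas as pd

# Hypothetical links: one with an id, one without.
links = pd.Series(["a.com/es?id=12&w=1", "b.com/page"], name="link")

# Take what follows 'id=', then keep what precedes the next '&'.
result = links.str.split("id=").str[1].str.split("&").str[0]
print(result)
```

To write the result back, assign it to the column, e.g. `df['link'] = ...` with the same chain.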

1 Comment

@Tian yw :-) happy coding ~

We can make use of positive lookbehind and positive lookahead:

df['link'] = df['link'].str.extract(r'(?<=id\=)(.*?)(?=\&)')


      name      link
0    apple   2812168
1  banana.  28132908
2  orange.   3209908

Details:

  • (?<=id\=): positive lookbehind on id=
  • (.*?): everything, matched non-greedily (as few characters as possible)
  • (?=\&): positive lookahead on the next &
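
To see why the non-greedy `.*?` matters when a link contains more than one &, here is a quick sketch using the standard `re` module on the first link from the question. The variable names `greedy` and `lazy` are only for illustration.

```python
import re

link = "example1.com/dsa/es?id=2812168&width=1200/web/map&resize.html"

# Greedy: .* grabs as much as it can, so it runs up to the LAST '&'.
greedy = re.search(r"(?<=id=)(.*)(?=&)", link).group(1)

# Non-greedy: .*? stops at the FIRST '&' after 'id='.
lazy = re.search(r"(?<=id=)(.*?)(?=&)", link).group(1)

print(greedy)  # 2812168&width=1200/web/map
print(lazy)    # 2812168
```

The same patterns behave the same way inside `str.extract`, since pandas uses Python's `re` engine for it.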

3 Comments

Thank you! I should have mentioned that it is not always &width after the id. It doesn't seem to work if I use df['link'] = df['link'].str.extract('(?<=id\=)(.*)(?=\&)') Is there a way to get around this?
Yes, by using a so-called "non-greedy" operator: notice the .*? (see edit). Be aware that the accepted solution will not work if the ID ever contains non-digit characters.
Thank you!! This makes a lot of sense!
