Regular expression in Python not working

Question

I'm working on a Python script that can download images from Flickr, among other sites. I use the Flickr API to pull the various sizes of the image I'm trying to download and identify the URL for the original size. Well, that's what I'm trying to do. Here's my code so far...

URL = {a Flickr link}

flickr = re.match(r".*flickr\.com\/photos\/([^\/]+)\/([0-9^\/]+)\/", URL)
URL = "https://api.flickr.com/services/rest/?method=flickr.photos.getSizes&api_key=6002c84e96ff95c1a861eafafa4284ba&photo_id=" + flickr.group(2) + "&format=json&nojsoncallback=1"

request = requests.get(URL)
result = request.text

parsed = re.match(r".\"Original\".*\"source\"\: \"([^\"]+)", result)
URL = parsed.group(1)

Using print() statements throughout my code, I know that the first regular expression (to parse the original Flickr URL to identify the photo ID) works properly, and that the API request works properly, returning the following result (using the example URL https://www.flickr.com/photos/matbellphotography/33413612735/sizes/h/)...

{ "sizes": { "canblog": 0, "canprint": 0, "candownload": 1, 
"size": [
  { "label": "Square", "width": 75, "height": 75, "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_s.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/sq\/", "media": "photo" },
  { "label": "Large Square", "width": "150", "height": "150", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_q.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/q\/", "media": "photo" },
  { "label": "Thumbnail", "width": 100, "height": 67, "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_t.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/t\/", "media": "photo" },
  { "label": "Small", "width": "240", "height": "160", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_m.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/s\/", "media": "photo" },
  { "label": "Small 320", "width": "320", "height": "213", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_n.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/n\/", "media": "photo" },
  { "label": "Medium", "width": "500", "height": "333", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/m\/", "media": "photo" },
  { "label": "Medium 640", "width": "640", "height": "427", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_z.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/z\/", "media": "photo" },
  { "label": "Medium 800", "width": "800", "height": "534", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_c.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/c\/", "media": "photo" },
  { "label": "Large", "width": "1024", "height": "683", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_b.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/l\/", "media": "photo" },
  { "label": "Large 1600", "width": "1600", "height": "1067", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_4d92e2f70d_h.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/h\/", "media": "photo" },
  { "label": "Large 2048", "width": "2048", "height": "1365", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_81441ed1da_k.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/k\/", "media": "photo" },
  { "label": "Original", "width": "5760", "height": "3840", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_34cbc172c1_o.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/o\/", "media": "photo" }
] }, "stat": "ok" }

My code apparently breaks down after that. The second regular expression, intended to identify the download URL of the image at its original filesize, apparently doesn't find any matches. According to yet another print() statement...

parsed.group(1) = none

I setup the expression using RegExr, which identified exactly what I needed from the JSON result. What have I done wrong?

I suspect you'll have better luck with re.search instead of re.match. — Aran-Fey, Commented Mar 15, 2017 at 0:28
@Rawing At this point, I'll pursue almost any alternative! In this case, why would it work better? — Andrew, Commented Mar 15, 2017 at 0:30
Why don't you just use a json parser? Then you can just access the attributes that contain the data you want, and it will be automatically escaped etc... In fact you're already using the requests library. You can use .json instead of .text — Shadow, Commented Mar 15, 2017 at 0:30
Are you trying to parse a JSON repsonse using regular expressions? Sounds like you need to do json.loads(request.content) and go from there instead. — jDo, Commented Mar 15, 2017 at 0:31
@Andrew You would probably need to access "sizes" -> "size", pick the element in the list that has the desired width/height (if that's what you mean by "filesize") and assign its "source" to a variable — jDo, Commented Mar 15, 2017 at 0:47

jDo · Accepted Answer · 2017-03-15 00:48:08Z

Maybe your requests.Response object has a json attribute that you can access directly. If not, simply import json, parse your request.content and work with the returned dictionary. Example:

>>> import json
>>> json_response = """
... { "sizes": { "canblog": 0, "canprint": 0, "candownload": 1, 
... "size": [
...   { "label": "Square", "width": 75, "height": 75, "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_s.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/sq\/", "media": "photo" },
...   { "label": "Large Square", "width": "150", "height": "150", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_q.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/q\/", "media": "photo" },
...   { "label": "Thumbnail", "width": 100, "height": 67, "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_t.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/t\/", "media": "photo" },
...   { "label": "Small", "width": "240", "height": "160", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_m.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/s\/", "media": "photo" },
...   { "label": "Small 320", "width": "320", "height": "213", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_n.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/n\/", "media": "photo" },
...   { "label": "Medium", "width": "500", "height": "333", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/m\/", "media": "photo" },
...   { "label": "Medium 640", "width": "640", "height": "427", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_z.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/z\/", "media": "photo" },
...   { "label": "Medium 800", "width": "800", "height": "534", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_c.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/c\/", "media": "photo" },
...   { "label": "Large", "width": "1024", "height": "683", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_645397d6a5_b.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/l\/", "media": "photo" },
...   { "label": "Large 1600", "width": "1600", "height": "1067", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_4d92e2f70d_h.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/h\/", "media": "photo" },
...   { "label": "Large 2048", "width": "2048", "height": "1365", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_81441ed1da_k.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/k\/", "media": "photo" },
...   { "label": "Original", "width": "5760", "height": "3840", "source": "https:\/\/farm3.staticflickr.com\/2855\/33413612735_34cbc172c1_o.jpg", "url": "https:\/\/www.flickr.com\/photos\/matbellphotography\/33413612735\/sizes\/o\/", "media": "photo" }
... ] }, "stat": "ok" }"""
>>> 
>>> json_parsed = json.loads(json_response)
>>> for img in json_parsed["sizes"]["size"]:
...     print img.get("source")
... 
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_s.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_q.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_t.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_m.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_n.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_z.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_c.jpg
https://farm3.staticflickr.com/2855/33413612735_645397d6a5_b.jpg
https://farm3.staticflickr.com/2855/33413612735_4d92e2f70d_h.jpg
https://farm3.staticflickr.com/2855/33413612735_81441ed1da_k.jpg
https://farm3.staticflickr.com/2855/33413612735_34cbc172c1_o.jpg
>>>

It worked! I did have to make a small change—parentheses for the print statement — Andrew, Commented Mar 15, 2017 at 1:45
Is there any way to have it identify only the URL of the entry with a label of "Original"? — Andrew, Commented Mar 15, 2017 at 1:46
@Andrew: It's just a list of dicts. Use the dict with whatever['label'] == 'Original'. — Kevin J. Chase, Commented Mar 15, 2017 at 2:13
@Andrew You can simply add an if statement in the for loop and, as Kevin J. Chase suggests, look for list elements where "label" equals "Original"; e.g. if img.get("label") == "Original": source_url = img.get("souce") — jDo, Commented Mar 15, 2017 at 12:15

Collectives™ on Stack Overflow

Regular expression in Python not working

1 Answer 1

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Related