2

I have the following code snippet from page source:

var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); 

The string

'PDFObject('

is unique on the page. I want to retrieve the url content using a regex. In this case I need to get

http://www.site.com/doc55.pdf

Please help.

1
  • 1
    Regex should work pretty well for this.
    – Smandoli
    Commented Jul 4, 2013 at 20:32

7 Answers

3

Here is an alternative for solving your problem without using regex:

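url, in_object = None, False
with open('input') as f:  # 'input' is assumed to hold the saved page source
    for line in f:
        # Only start looking for url: once PDFObject( has been seen
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            url = line.split('"')[1]
            break
print(url)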
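
A variant sketch of the same line-by-line idea, wrapped in a function so that it returns at the first hit and yields None when nothing is found. The name find_pdf_url and the sample page (including the decoy url: line before PDFObject) are made up for illustration.

```python
def find_pdf_url(lines):
    """Return the first quoted value on a 'url:' line at or after 'PDFObject('."""
    in_object = False
    for line in lines:
        # Ignore every url: line until the unique marker has been seen
        in_object = in_object or 'PDFObject(' in line
        if in_object and 'url:' in line:
            parts = line.split('"')
            if len(parts) >= 3:  # the line really contains a quoted value
                return parts[1]
    return None


# Hypothetical page source with a decoy url: before the PDFObject call
page = '''var other = { url: "http://www.site.com/other.pdf" };
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(find_pdf_url(page.splitlines()))
```

Because the function accepts any iterable of lines, it should work equally with an open file object.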
3
  • Why on earth can't OPs be helped with the tools they want to use? There is always someone to tell them "Hey dude! That's not HOW to think about it..."
    – Stephan
    Commented Jul 4, 2013 at 21:06
  • you picked this out within all those regex answers?
    – perreal
    Commented Jul 4, 2013 at 21:07
  • 1
    Good answer, I agree that regex is not the best tool for this. But you should probably break the loop after finding the url (or just put the code into a function and return), otherwise you could have false positives if other lines contain "url:".
    – l4mpi
    Commented Jul 4, 2013 at 21:28
0

To find "something that happens in the line after something else", you need to match things "including the newline". For this you use the DOTALL modifier, a flag passed when the pattern is compiled.

Thus the following code works:

import re
r = re.compile(r'(?<=PDFObject).*?url:.*?(http.*?)"', re.DOTALL)
s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''

print(r.findall(s))

Explanation:

r = re.compile(         compile regular expression
    r'                  start of a raw string literal (backslashes are kept as-is)
    (?<=PDFObject)      the match I want happens right after PDFObject
    .*?                 then there may be some other characters...
    url:                followed by the string url:
    .*?                 then match whatever follows (`?`: non-greedy) until the first instance of
    (http.*?)"          the string http, captured up to (but not including) the first "
    ',                  end of regex string, but there's more...
    re.DOTALL)          set the DOTALL flag - this means the dot matches all characters
                        including newlines. This allows the match to continue from one line
                        to the next in the .*? right after the lookbehind
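
To make the effect of the flag concrete, here is a small sketch that runs the same pattern with and without DOTALL on a shortened copy of the snippet. The variable names are illustrative; the inline `(?s)` form is the standard in-pattern spelling of the same flag.

```python
import re

s = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

pattern = r'(?<=PDFObject).*?url:.*?(http.*?)"'

# Without DOTALL the dot stops at the newline after "PDFObject({"
without_flag = re.findall(pattern, s)

# With DOTALL the first .*? can run on into the next line
with_flag = re.findall(pattern, s, re.DOTALL)

# The same flag written inside the pattern itself
inline_flag = re.findall('(?s)' + pattern, s)

print(without_flag)  # []
print(with_flag)     # ['http://www.site.com/doc55.pdf']
print(inline_flag)   # ['http://www.site.com/doc55.pdf']
```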
2
  • Thanks a lot Floris. Your code is the shortest and it works just fine:)
    – Ash
    Commented Jul 4, 2013 at 21:52
  • Glad it worked for you. Was an opportunity for me to figure out the re.DOTALL thing... I knew it existed, had not used it, this was my chance to learn about it. So we both came out ahead.
    – Floris
    Commented Jul 4, 2013 at 21:56
0

Using a combination of look-behind and look-ahead assertions (s is the page source as a string):

import re
re.search(r'(?<=url:).*?(?=",)', s).group().strip('" ')
'http://www.site.com/doc55.pdf'
0

This works:

import re

src='''\
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
URL: "http://www.site.com/doc52.PDF",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder"); '''   

print([m.group(1).strip('"') for m in
        re.finditer(r'^url:\s*(.*)[\W]$',
        re.search(r'PDFObject\(\{(.*)',src,re.M | re.S | re.I).group(1),re.M|re.I)])

prints:

['http://www.site.com/doc55.pdf', 'http://www.site.com/doc52.PDF']
0

Regex

new\s+PDFObject\(\{\s*url:\s*"[^"]+"

Extract url only
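
One way to pull out only the URL is to wrap the quoted part of the pattern above in a capture group. A minimal Python sketch, assuming the page source is already in a string; the names URL_RE and page, and the decoy url: line, are illustrative only.

```python
import re

# Same pattern as above, with a capture group around the quoted value
URL_RE = re.compile(r'new\s+PDFObject\(\{\s*url:\s*"([^"]+)"')

# Hypothetical page source with another url: before the PDFObject call
page = '''var other = { url: "http://www.site.com/other.pdf" };
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

match = URL_RE.search(page)
url = match.group(1) if match else None
print(url)
```

Since the pattern requires url: to follow `PDFObject({` directly (allowing only whitespace in between), the earlier url: entry in this sample is not picked up.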

1
  • 2
    This doesn't address the "after PDFObject" part. There will be other instances of url: "http:.*" on the page - OP wants a specific one.
    – Floris
    Commented Jul 4, 2013 at 21:06
0

If 'PDFObject(' is the unique identifier in the page, you only have to match the first quoted string that follows it.

Using the DOTALL flag (re.DOTALL or re.S) and the non-greedy star (*?), you can write:

import re

snippet = '''                                    
var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer",
  width: "100%",
  height: "700px",
  pdfOpenParams: {
    navpanes: 0,
    statusbar: 1,
    toolbar: 1,
    view: "FitH"
  }
}).embed("pdf_placeholder");
'''

# First version using unnamed groups
RE_UNNAMED = re.compile(r'PDFObject\(.*?"(.*?)"', re.S)

# Second version using named groups
RE_NAMED = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)

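# The flag is already part of the compiled pattern; the second argument
# of a compiled pattern's search() is a start position, not a flag.
RE_UNNAMED.search(snippet).group(1)
RE_NAMED.search(snippet).group('url')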
# result for both: 'http://www.site.com/doc55.pdf'

If you don't want to compile your regex because it's used only once, simply use this syntax:

re.search(r'PDFObject\(.*?"(.*?)"', snippet, re.S).group(1)
re.search(r'PDFObject\(.*?"(?P<url>.*?)"', snippet, re.S).group('url')

Four choices, one of which should match your needs and taste!
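
All four forms call .group() directly on the result of search(), which raises AttributeError when the page has no match, because search() then returns None. A hedged sketch of a small wrapper that guards against that; the helper name extract_pdf_url and the sample strings are assumptions for illustration.

```python
import re

PDF_URL_RE = re.compile(r'PDFObject\(.*?"(?P<url>.*?)"', re.S)


def extract_pdf_url(page_source):
    """Return the first quoted string after 'PDFObject(' or None if absent."""
    match = PDF_URL_RE.search(page_source)
    return match.group('url') if match else None


print(extract_pdf_url('new PDFObject({\nurl: "http://www.site.com/doc55.pdf",\n});'))
print(extract_pdf_url('<html>no viewer on this page</html>'))
```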

0

Although the other answers may appear to work, most do not take into account that the only unique thing on the page is 'PDFObject('. A much better regular expression would be the following:

PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",

It takes into account that 'PDFObject(' is unique and contains some basic URL verification.

Below is an example of how this regex could be used in Python:

>>> import re
>>> strs = """var myPDF = new PDFObject({
... url: "http://www.site.com/doc55.pdf",
...   id: "pdfObjectContainer",
...   width: "100%",
...   height: "700px",
...   pdfOpenParams: {
...     navpanes: 0,
...     statusbar: 1,
...     toolbar: 1,
...     view: "FitH"
...   }
... }).embed("pdf_placeholder");"""
>>> re.search(r'PDFObject\({\surl: "(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",',strs).group(1)
'http://www.site.com/doc55.pdf'

A pure Python (no regex) alternative would be:

>>> unique = 'PDFObject({\nurl: "'
>>> start = strs.find(unique) + len(unique)
>>> end = start + strs[start:].find('"')
>>> strs[start:end]
'http://www.site.com/doc55.pdf'

No regex oneliner:

>>> (lambda u:(lambda s:(lambda e:strs[s:e])(s+strs[s:].find('"')))(strs.find(u)+len(u)))('PDFObject({\nurl: "')
'http://www.site.com/doc55.pdf'
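
Note that str.find returns -1 when the marker is missing, so the slicing above would then quietly return the wrong text. A more readable sketch of the same no-regex idea using str.partition, which makes the missing case explicit; the function name url_after_marker is hypothetical, and the marker still assumes the exact 'PDFObject({' + newline + 'url: "' layout of the snippet.

```python
def url_after_marker(text, marker='PDFObject({\nurl: "'):
    """Return the text between the marker and the next double quote, or None."""
    before, found, rest = text.partition(marker)
    if not found:  # marker is not in the text at all
        return None
    url, quote, _ = rest.partition('"')
    return url if quote else None  # None if the closing quote is missing


strs = '''var myPDF = new PDFObject({
url: "http://www.site.com/doc55.pdf",
  id: "pdfObjectContainer"
}).embed("pdf_placeholder");'''

print(url_after_marker(strs))
print(url_after_marker('no marker here'))
```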
4
  • Not sure the link validation is needed, but I appreciate that matching http: in my example was going one character too far, as it would skip any https: links - I have modified my answer, and thanks. Does your regex permit all legal links (even ones with URL encoded queries attached)? It's a bit hard to be sure...
    – Floris
    Commented Jul 4, 2013 at 21:21
  • @Floris yes this regex accepts all links, even ones with URL encoded queries, given their protocol is either http or https.
    – luke
    Commented Jul 5, 2013 at 1:16
  • That's cool - I will keep a copy, might come in handy. Of your own making, or did you find it somewhere?
    – Floris
    Commented Jul 5, 2013 at 2:02
  • I think I found it somewhere, can't remember where though. I used it in one of my projects a while back, just copied it out of there for this.
    – luke
    Commented Jul 5, 2013 at 2:23
