I want to open a PDF in my Python program. So far this works:

existing_pdf = PdfFileReader(file(path_to_pdf, "rb"))

Right now I open the PDF from my local disk, but I want to fetch it from the internet instead. Note that I don't want to save the downloaded file to disk first; once I have fetched it, I will manipulate it and then save the result.

I think I need BytesIO + urllib2, but I cannot figure it out. Can somebody help me?

So let's say I want to create the variable existing_pdf with the content of http://tug.ctan.org/tex-archive/macros/latex/contrib/logpap/example.pdf in it, without first downloading that file to disk and then opening it. I want to download it 'in memory' and create the variable existing_pdf, which I can later modify in my program.

EDIT:

  response = urllib2.urlopen("URL")
  pdf_file = BytesIO(response.read())

  existing_pdf = PdfFileReader(pdf_file)

It simply hangs and never finishes PdfFileReader(pdf_file):

  ....
  existing_pdf = PdfFileReader(pdf_file)
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 374, in __init__
  self.read(stream)
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 705, in read
  line = self.readNextEndLine(stream)
  File "C:\Python27\lib\site-packages\pyPdf\pdf.py", line 870, in readNextEndLine
  line = x + line

2 Answers

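Did you try the requests package?

import requests
from StringIO import StringIO
from pyPdf import PdfFileReader
URL = "http://tug.ctan.org/tex-archive/macros/latex/contrib/logpap/example.pdf"
r = requests.get(URL)
# r.content holds the raw bytes of the response body
pdf_file = StringIO(r.content)
existing_pdf = PdfFileReader(pdf_file)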

This worked for me:

import os
import urllib2
from io import BytesIO
URL = "http://tug.ctan.org/tex-archive/macros/latex/contrib/logpap/example.pdf"
response = urllib2.urlopen(URL)
p = BytesIO(response.read())
p.seek(0, os.SEEK_END)
print p.tell()
# 79577
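
Checking the size as above confirms that something was downloaded. A further check, sketched below in Python 3, is whether the bytes look like a PDF at all, which may help tell a bad download (for example an HTML error page) from a problem in the parser. The helper name looks_like_pdf and the tail_size parameter are made up for illustration; the check relies only on the fact that a PDF file starts with a "%PDF-" header and ends with a "%%EOF" marker. It is a rough heuristic, not a validator.

```python
def looks_like_pdf(data, tail_size=1024):
    """Rough heuristic: True if data has a PDF header and an EOF marker.

    looks_like_pdf and tail_size are hypothetical names used for this
    sketch; they are not part of pyPdf or any other library.
    """
    # A PDF file begins with a header such as b"%PDF-1.4".
    has_header = data.startswith(b"%PDF-")
    # The b"%%EOF" marker sits at the end of the file, so only the
    # last tail_size bytes are searched.
    has_trailer = b"%%EOF" in data[-tail_size:]
    return has_header and has_trailer
```

With the snippet above, this could be called as looks_like_pdf(p.getvalue()) before handing p to PdfFileReader.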
3 Comments

Yeah, just tried that and that worked!!! But I don't know why the urllib2 doesn't work.
Looks like it should have. I find requests to be less finicky.
For the second example: in Python 3, urllib2 no longer exists (it was split into urllib.request and urllib.error), so you want import urllib.request, and the call would be response = urllib.request.urlopen(URL).
import os
from urllib.request import urlopen
from io import BytesIO
URL = "http://tug.ctan.org/tex-archive/macros/latex/contrib/logpap/example.pdf"
response = urlopen(URL)
p = BytesIO(response.read())
p.seek(0, os.SEEK_END)
print(p.tell())

urllib2 didn't work for me in 2021, since it does not exist in Python 3. Use the example above.
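
The same steps can be wrapped in a small function so the download can be reused. This is only a sketch: the name fetch_pdf_to_memory and the 30-second default timeout are assumptions for illustration, not something from the question or from pyPdf.

```python
from io import BytesIO
from urllib.request import urlopen


def fetch_pdf_to_memory(url, timeout=30):
    """Download url and return its bytes in a seekable in-memory buffer.

    fetch_pdf_to_memory and the default timeout are hypothetical choices
    made for this sketch.
    """
    # The timeout (in seconds) makes a stalled connection raise an error
    # instead of blocking indefinitely.
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    # BytesIO gives the parser a file-like object it can seek in,
    # without anything being written to disk.
    return BytesIO(data)
```

The returned buffer can then be passed on as in the question, for example existing_pdf = PdfFileReader(fetch_pdf_to_memory(URL)).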
