All Questions

1 vote
1 answer
32 views

Issue downloading a PDF using the requests module in Python

import requests pdf_url = "https://www.alexandrina.sa.gov.au/__data/assets/pdf_file/0028/1619614/Council-Special-Meeting-Agenda-11-June-2024.pdf" pdf_path = 'Test.pdf' response = requests....
Krupesh Pandya
0 votes
0 answers
543 views

Cleaning Unstructured PDF data

Raw data: Given is PDF data containing the student placement details of a university. It is completely unstructured and needs to be cleaned up before processing. The expected CSV file ...
gurukishoreg78
0 votes
1 answer
55 views

Scraping data from a particular PDF hosted online

I am trying to scrape data from a series of PDFs hosted online. The code I am using is: import fitz import requests import io import re url_pdf = ["https://wcsecure.weblink.com.au/pdf/ASN/02528656....
Bitopan Gogoi
-1 votes
2 answers
2k views

Extract metadata from an online PDF using pdfminer in Python

I want to find out some metadata of an online PDF using pdfminer, such as the title, author, number of lines, etc. I am trying to use a related solution ...
Bitopan Gogoi
1 vote
2 answers
74 views

Scraping specific PDFs from different websites

First question here. I need to download a specific PDF from every URL. I need just the PDF of the European Commission proposal from each URL that I have, which is always in a specific part of the page ...
Cesare
  • 33
3 votes
5 answers
2k views

Is there a way to remove unwanted spaces from a string using Python or some NLP technique? (NOT trailing or extra spaces)

s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven ...
zackakshay
0 votes
1 answer
410 views

Scrapy script that was supposed to scrape PDF and DOC files is not working properly

I am trying to implement a similar script in my project, following this blog post: https://www.imagescape.com/blog/scraping-pdf-doc-and-docx-scrapy/ The code of the spider class from the source: ...
glitchy_itchy
0 votes
3 answers
560 views

How to parse the drop-down list and get all the links for the PDFs using Beautiful Soup in Python?

I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. The following is the code that I used, but it did not succeed: import requests ...
techwreck
1 vote
1 answer
219 views

How to web-scrape PDFs that are hidden under a selection option?

I am trying to download more than 100 PDFs from a website using Python. However, those PDFs are hidden under a selection option. For example: Option 1 Option 2 Option 3 ... Then, if I choose Option 1, I ...
Isaac A
  • 575
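For a question like the one above, a first step could be listing what the selection menu actually contains. The following is only a minimal sketch under loud assumptions: the excerpt is truncated, so the real page structure is unknown; here it is assumed that each `<option>` carries the PDF path in its `value` attribute, which may not hold on the actual site. The names `OptionCollector`, `pdf_links_from_options`, the sample HTML, and the `example.com` base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OptionCollector(HTMLParser):
    """Collect the non-empty `value` attribute of every <option> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            value = dict(attrs).get("value")
            if value:  # skip placeholder options with an empty value
                self.values.append(value)


def pdf_links_from_options(html, base_url):
    """Return absolute URLs for option values that end in .pdf.

    Assumption: the option values are PDF paths. If the site instead
    loads the PDF link after a selection, a further request is needed.
    """
    parser = OptionCollector()
    parser.feed(html)
    return [
        urljoin(base_url, value)
        for value in parser.values
        if value.lower().endswith(".pdf")
    ]


# Hypothetical page fragment, used only for illustration.
sample_html = """
<select id="reports">
  <option value="">Choose...</option>
  <option value="/files/option1.pdf">Option 1</option>
  <option value="/files/option2.pdf">Option 2</option>
  <option value="/about.html">About</option>
</select>
"""

links = pdf_links_from_options(sample_html, "https://example.com/reports/")
print(links)
```

The standard-library parser is used here only to keep the sketch dependency-free; the same idea carries over to Beautiful Soup if that is already in use.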
0 votes
1 answer
42 views

File handling and word scraping (trying to find all the words in a file that end with 'y')

ERROR: Traceback (most recent call last): File "c:\Users\Pranjal\Desktop\tstp\zen_scraper.py", line 5, in words = re.findall("$y",file) File "C:\Program Files\WindowsApps\...
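The traceback above is cut off, so the actual error is not known. Two things can be said about the visible call `re.findall("$y",file)`, though: `$` anchors the end of the string, so the pattern `"$y"` can never match a word ending in 'y', and `re.findall` expects a string (or bytes), not a file object. A minimal sketch of one way to find such words, assuming the text has already been read into a string (the function name `words_ending_in_y` and the sample sentence are made up for illustration):

```python
import re


def words_ending_in_y(text: str) -> list[str]:
    """Return every alphabetic word in `text` whose last letter is 'y' or 'Y'."""
    # \b is a word boundary; [A-Za-z]* allows any leading letters before the final y.
    return re.findall(r"\b[A-Za-z]*y\b", text, flags=re.IGNORECASE)


sample = "The lazy dog was very happy by the bay, yes."
print(words_ending_in_y(sample))
# → ['lazy', 'very', 'happy', 'by', 'bay']
```

To apply this to a file, the contents would first be read into a string, for example with `open(path, encoding="utf-8").read()`, where `path` is whatever file is being scanned.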
0 votes
1 answer
148 views

Pandas DataFrame combine multi row spanning column

I have a complex scraped dataframe that looks like this: For context, the original data from a PDF looks like so: DataFrame info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 26 entries, ...
user1757703
  • 3,015
6 votes
3 answers
58k views

How to scrape PDFs using Python; specific content only

I am trying to get data from PDFs available on the site https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en For example, if I look at the November 2019 report https://downloads....
Camilia
  • 81
1 vote
1 answer
2k views

Extracting data from a PDF table into a structured format

I want to scrape PDF table data into any structured format, such as HTML, XML, or JSON. I am using Python. I first convert the PDF to text using the pdftotext command-line tool, but I am not able ...
Shivam Singh
9 votes
1 answer
14k views

Is there a Google Image Search API? [closed]

I'm searching for an API or a program (preferably Python and open-source) which lets me download the first n pictures of a Google Image Search for, let's say, bicycles. It would also be helpful if it ...
technical_difficulty