Newest 'pdf-scraping+python+web-scraping' Questions

1 vote

1 answer

32 views

Issue in Pdf download using request module in python

import requests pdf_url = "https://www.alexandrina.sa.gov.au/__data/assets/pdf_file/0028/1619614/Council-Special-Meeting-Agenda-11-June-2024.pdf" pdf_path = 'Test.pdf' response = requests....

Krupesh Pandya

11

asked Jun 14, 2024 at 7:19

0 votes

0 answers

543 views

Cleaning Unstructured PDF data

Raw data: a PDF containing the student placement details of a university. It is completely unstructured and needs to be cleaned up before processing. The Expected CSV file ...

gurukishoreg78

11

asked May 17, 2023 at 14:41

0 votes

1 answer

55 views

Scraping data from a particular pdf hosted online

I am trying to scrape data from a series of PDFs hosted online. The code I am using is: import fitz import requests import io import re url_pdf = ["https://wcsecure.weblink.com.au/pdf/ASN/02528656....

Bitopan Gogoi

125

asked Mar 1, 2023 at 8:38

-1 votes

2 answers

2k views

Extract metadata info from online pdf using pdfminer in python

I want to extract some metadata from an online PDF using pdfminer, such as the title, author, number of lines, etc. I am trying to use a related solution ...

Bitopan Gogoi

125

asked Feb 28, 2023 at 11:24

1 vote

2 answers

74 views

Scraping specific pdfs from different websites

First question here. I need to download a specific PDF from every URL. I need just the PDF of the European Commission proposal from each URL that I have, which is always in a specific part of the page ...

Cesare

33

asked Dec 27, 2022 at 21:56

3 votes

5 answers

2k views

Is there a way to remove unwanted spaces from a string using Python or some NLP technique?? (NOT trailing or extra spaces)

s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven ...

zackakshay

41

asked Mar 22, 2022 at 7:55

0 votes

1 answer

410 views

Scrapy script that was supposed to scrape pdf, doc files is not working properly

I am trying to implement a similar script on my project following this blog post here: https://www.imagescape.com/blog/scraping-pdf-doc-and-docx-scrapy/ The code of the spider class from the source: ...

glitchy_itchy

39

asked Dec 12, 2021 at 16:48

0 votes

3 answers

560 views

How to parse the drop down list and get the all the links for the pdf using Beautiful Soup in Python?

I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code that I used, but it did not succeed: import requests ...

techwreck

53

asked Jul 5, 2021 at 9:17

1 vote

1 answer

219 views

How to webscrape PDFs that are hidden under the selection option?

I am trying to download more than 100 PDFs from a website using Python. However, those PDFs are hidden under a selection option. For example: Option 1 Option 2 Option 3 ... Then, if I choose Option 1, I ...

Isaac A

575

asked Jun 28, 2021 at 19:25

0 votes

1 answer

42 views

file handling + word scraping (trying to find all the words in a file that end with 'y')

ERROR: Traceback (most recent call last): File "c:\Users\Pranjal\Desktop\tstp\zen_scraper.py", line 5, in words = re.findall("$y",file) File "C:\Program Files\WindowsApps\...

user14143568

asked Mar 20, 2021 at 13:36
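
A note on the excerpt above: the pattern `"$y"` cannot match anything, because `$` anchors the end of the string, so no `y` can follow it. The traceback is truncated, so the actual error is not visible here; passing a file object rather than its text to `re.findall` is one possible cause, but that is an assumption. A minimal sketch of one way to find words ending in 'y', where the function name `words_ending_with_y` and the sample sentence are hypothetical and not taken from the question:

```python
import re


def words_ending_with_y(text: str) -> list[str]:
    """Return every word in `text` whose last character is 'y' or 'Y'."""
    # \b marks a word boundary, \w* matches the leading characters of the
    # word, and the final y\b requires 'y' to be the word's last character.
    return re.findall(r"\b\w*y\b", text, flags=re.IGNORECASE)


if __name__ == "__main__":
    sample = "A happy puppy may play daily, yes?"
    print(words_ending_with_y(sample))
```

To apply this to a file, read its contents into a string first (for example with `open(path, encoding="utf-8").read()`) and pass that string to the function.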

0 votes

1 answer

148 views

Pandas DataFrame combine multi row spanning column

I have a complex scraped dataframe that looks like this: For context, the original data from a PDF looks like so: DataFrame info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 26 entries, ...

user1757703

3,015

asked May 15, 2020 at 17:48

6 votes

3 answers

58k views

How to scrape PDFs using Python; specific content only

I am trying to get data from PDFs available on the site https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en For example, if I look at the November 2019 report https://downloads....

Camilia

81

asked Dec 1, 2019 at 22:43

1 vote

1 answer

2k views

Extracting data from a table of pdf to a structured format

I want to scrape PDF table data into a structured format such as HTML, XML, or JSON. I am using Python. I first convert the PDF to text using the pdftotext command-line tool, but I am not able ...

Shivam Singh

21

asked Apr 17, 2018 at 10:09

9 votes

1 answer

14k views

Is there a Google Image Search API? [closed]

I'm searching for an API or a program (preferably Python and open-source) which lets me download the first n pictures of a Google Image Search for, let's say, bicycles. It would also be helpful if it ...

technical_difficulty

477

asked Apr 7, 2016 at 12:03
