All Questions
14 questions
1
vote
1
answer
32
views
Issue in Pdf download using request module in python
import requests
pdf_url = "https://www.alexandrina.sa.gov.au/__data/assets/pdf_file/0028/1619614/Council-Special-Meeting-Agenda-11-June-2024.pdf"
pdf_path = 'Test.pdf'
response = requests....
0
votes
0
answers
543
views
Cleaning Unstructured PDF data
Raw Data:
Given is a PDF containing the student placement details of a university.
It is in a completely unstructured form and needs to be cleaned up before processing.
The Expected CSV file ...
0
votes
1
answer
55
views
Scraping data from a particular pdf hosted online
I am trying to scrape data from a series of PDFs hosted online.
The code I am using is:
import fitz
import requests
import io
import re
url_pdf = ["https://wcsecure.weblink.com.au/pdf/ASN/02528656....
-1
votes
2
answers
2k
views
Extract metadata info from online pdf using pdfminer in python
I want to find some metadata of an online PDF using pdfminer. I am interested in extracting info such as the title, author, number of lines, etc. from the PDF.
I am trying to use a related solution ...
1
vote
2
answers
74
views
Scraping specific pdfs from different websites
First question here. I need to download a specific PDF from every URL. I need just the PDF of the European Commission proposal from each URL that I have, which is always in a specific part of the page
...
3
votes
5
answers
2k
views
Is there a way to remove unwanted spaces from a string using Python or some NLP technique?? (NOT trailing or extra spaces)
s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have been Œ based on the well -proven ...
0
votes
1
answer
410
views
Scrapy script that was supposed to scrape pdf, doc files is not working properly
I am trying to implement a similar script on my project following this blog post here:
https://www.imagescape.com/blog/scraping-pdf-doc-and-docx-scrapy/
The code of the spider class from the source:
...
0
votes
3
answers
560
views
How to parse the drop down list and get the all the links for the pdf using Beautiful Soup in Python?
I'm trying to scrape the PDF links from the drop-down on this website. I want to scrape just the Guideline Values (CVC) drop-down. Following is the code that I used, but it did not succeed:
import requests
...
1
vote
1
answer
219
views
How to webscrape PDFs that are hidden under the selection option?
I am trying to download more than 100 PDFs from a website using Python. However, those PDFs are hidden under the selection option. For example:
Option 1
Option 2
Option 3
...
Then, if I choose Option 1, I ...
0
votes
1
answer
42
views
file handling + word scraping (trying to find all the words in a file that end with 'y')
ERROR: Traceback (most recent call last):
  File "c:\Users\Pranjal\Desktop\tstp\zen_scraper.py", line 5, in <module>
    words = re.findall("$y",file)
  File "C:\Program Files\WindowsApps\...
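A minimal sketch of one way to find words ending in 'y', not taken from the question or its answer. In the pattern `"$y"`, `$` anchors at the end of the string, so a `y` after it can never match. Also, `re.findall` expects a string, so if `file` in the excerpt is an open file object (an assumption, since the traceback is truncated), its contents would need to be read first. The function name `words_ending_in_y` and the sample sentence are hypothetical.

```python
import re
from typing import List


def words_ending_in_y(text: str) -> List[str]:
    """Return every word in `text` whose last letter is 'y' or 'Y'."""
    # \b marks a word boundary; \w*y matches a run of word characters
    # whose final character is "y".
    return re.findall(r"\b\w*y\b", text, flags=re.IGNORECASE)


# Hypothetical sample text; with a real file, read it into a string first,
# e.g. text = open(path, encoding="utf-8").read()
sample = "A sunny day may bring happy, lazy hours."
print(words_ending_in_y(sample))
# → ['sunny', 'day', 'may', 'happy', 'lazy']
```

The word boundaries keep the pattern from matching a `y` in the middle of a word, so "yes" and "style" are not returned.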
0
votes
1
answer
148
views
Pandas DataFrame combine multi row spanning column
I have a complex scraped dataframe that looks like this:
For context, the original data from a PDF looks like so:
DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, ...
6
votes
3
answers
58k
views
How to scrape PDFs using Python; specific content only
I am trying to get data from PDFs available on the site
https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en
For example, if I look at the November 2019 report
https://downloads....
1
vote
1
answer
2k
views
Extracting data from a table of pdf to a structured format
I want to scrape the PDF table data into any structured format like HTML, XML, or JSON.
I am using Python. I am first converting the PDF to text using the pdftotext command-line tool, but I am not able ...
9
votes
1
answer
14k
views
Is there a Google Image Search API? [closed]
I'm searching for an API or a program (preferably Python and open-source) which lets me download the first n pictures of a Google Image Search for let's say bicycles. It would also be helpful if it ...