0 votes
0 answers
19 views

How to extract text, tables, and images from PDFs while maintaining their structure

Using the open-source unstructured library, I tried a PDF that has background images, design patterns, and XObjects in it. The library treats those as images too and stores their paths. So how can ...
Umair Ashraf
1 vote
0 answers
78 views

PDF Scraping in Python

I am having trouble scraping certain data from PDF files in Python. There are no console errors, but when the CSV is produced, the columns Owner's First Name - Zip Code are either filled with the ...
user29394340
0 votes
2 answers
81 views

Web Scraping Task

I'm trying to scrape a webpage and download an image in either PDF, PNG, or JPG format. The webpage I'm working with is: https://asn.scientificposters.com/epsAbstractASN.cfm?id=6. On this page, there'...
Shishir Singh
0 votes
2 answers
101 views

Is there a way to automate reading PDFs across multiple webpages with rvest and pdftools?

I am working with all of the 2012 data from the following website: https://councildocs.dsm.city/resolutions/ The data are separated by meeting date and clicking on one date links to a different page ...
Shaq
1 vote
0 answers
24 views

PDF Scraping with Templated Document

I cannot scrape some details from a PDF file. Some documents are scraped completely, while others are not. This is the issue I am encountering. I am scraping a sample PDF file. CASE1: Definition and ...
Donna Esperas
1 vote
0 answers
640 views

Scrape PDF in Go

Hi, can anyone help me use https://pkg.go.dev/github.com/pdfcpu/[email protected] to extract human-readable strings by scraping a PDF? FYI, I am using AWS Lambda. Here is my code snippet: package ...
kishor purohit
1 vote
1 answer
32 views

Issue with PDF download using the requests module in Python

import requests pdf_url = "https://www.alexandrina.sa.gov.au/__data/assets/pdf_file/0028/1619614/Council-Special-Meeting-Agenda-11-June-2024.pdf" pdf_path = 'Test.pdf' response = requests....
Krupesh Pandya
3 votes
1 answer
525 views

pdfplumber not picking up column & issue with multiline data

I'm struggling with two things in a PDF extraction script I've written. The first is that the script isn't picking up the last column, 'Serial Number'. I've boxed the area I'm interested ...
Mark k
1 vote
1 answer
286 views

Encoding Issue When Attempting to Convert Hindi Script PDF to CSV in Python

I'm currently attempting to convert a PDF file containing Hindi Devanagari script to a CSV file using the fitz library in Python, but when I read in the text I encounter a strange encoding issue. Here ...
cedratcarlisle
0 votes
0 answers
719 views

ModuleNotFoundError: No module named 'langchain'

I tried to extract data from an unstructured PDF file in Python in VS Code. I searched for solutions on Google without any improvement, and I keep hitting an error when I try to import the LangChain library, ...
joj abd
0 votes
1 answer
294 views

Images extracted from PDF look rotated and inverted

Quick question: are there any big errors in my code, apart from it being dirty? Why do the images extracted from a PDF using PyMuPDF look inverted and upside down? I made some changes to the ...
user40208
1 vote
0 answers
80 views

PDF scraping, tabula py - columns do not correspond with "true" values of PDF file

I am stuck again with PDF scraping and observe that some of the values I obtain do not correspond to their columns. Basically, I want to obtain a CSV file, but first I want to ...
Michael Picazo
0 votes
1 answer
136 views

PDF Scraping - All Objects Passed were None

I am attempting to create a simple pdf scraper using pandas and pdfquery. I want to take the data I need from each page of the PDF by using the xml coordinates, put it into a dataframe and then save ...
Andrew Martin
1 vote
1 answer
115 views

Pdfminer randomly changes text size when converting pdf to html

An example of the type of pdf I'm trying to scrape. I'm trying to scrape a pdf document for the number of papers, where the names of papers are in a specific font and size (10px). Given that other ...
gamer220
0 votes
0 answers
1k views

Why is this code using PyMuPDF not extracting all the images in a PDF?

I'm trying to extract images from an invoice for an equipment order and each time I run the code I only get 4 of 8 or 9 total photos on each page. Are there some PDFs that are just not compatible with ...
Asia Vassos
