All Questions
Tagged with pdf-scraping text-extraction
9 questions
0 votes, 0 answers, 471 views
Extract only the body text of a PDF, excluding bullet points, headings, and subheadings, using the Python pdfplumber library
Code
import pdfplumber
ecdata = ""
with pdfplumber.open("XYZ Transcript.pdf") as pdf:
    for i in range(len(pdf.pages)):
        print("Page No.: ", i+1)
...
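One possible way to approach this question is to treat the most common font size on a page as the body size and drop everything else. The sketch below is only a heuristic under that assumption: the helper names and the sample data are hypothetical, and the character records merely mimic the `text` and `size` keys that pdfplumber exposes on the entries of `page.chars`. It does not remove bullet points, which usually share the body font size and would need a separate rule.

```python
from collections import Counter


def body_font_size(chars, ndigits=1):
    """Return the most common (rounded) font size among non-blank characters."""
    sizes = Counter(
        round(c["size"], ndigits) for c in chars if c["text"].strip()
    )
    return sizes.most_common(1)[0][0]


def keep_body_chars(chars, ndigits=1):
    """Keep only the characters whose rounded size equals the body size."""
    body = body_font_size(chars, ndigits)
    return [c for c in chars if round(c["size"], ndigits) == body]


# Hypothetical sample standing in for page.chars: a large heading
# followed by smaller body text.
sample_chars = (
    [{"text": ch, "size": 16.0} for ch in "Title"]
    + [{"text": ch, "size": 10.0} for ch in "Body text here."]
)

body_text = "".join(c["text"] for c in keep_body_chars(sample_chars))
print(body_text)
```

With a real document, the same predicate could be applied per page, for example by computing the body size from `page.chars` and passing a size check to `page.filter(...)` before calling `extract_text()`.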
0 votes, 1 answer, 3k views
pdfminer: extract only text according to font size
I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files.
The code below returns a list of the font size of each text block and its characters for one ...
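Sizes such as 9.800000000000068 look like floating-point noise around 9.8, so comparing with a tolerance is probably more robust than testing for exact equality. The following is a minimal sketch under that assumption. The names `size_matches` and `text_at_sizes`, the tolerance, and the file path argument are hypothetical; the pdfminer.six calls used (`extract_pages`, `LTTextContainer`, `LTTextLine`, `LTChar` and its `size` attribute) are imported lazily so the comparison helper can be tried without the library installed.

```python
import math

TARGET_SIZES = (9.8, 10.0)  # sizes named in the question, rounded


def size_matches(size, targets=TARGET_SIZES, tol=0.01):
    """True if size is within tol of any target size."""
    return any(math.isclose(size, t, abs_tol=tol) for t in targets)


def text_at_sizes(pdf_path, targets=TARGET_SIZES, tol=0.01):
    """Collect text lines whose characters all match one of the target sizes.

    Assumes pdfminer.six is installed; pdf_path is a placeholder.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    kept = []
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                chars = [c for c in line if isinstance(c, LTChar)]
                if chars and all(size_matches(c.size, targets, tol) for c in chars):
                    kept.append(line.get_text())
    return kept


print(size_matches(9.800000000000068))   # noisy 9.8
print(size_matches(12.0))                # a different size
```

Filtering whole lines rather than single characters keeps the spaces that pdfminer inserts between words; a per-character filter would be stricter but loses that spacing.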
2 votes, 2 answers, 2k views
Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?
Target:
I want to extract the info on the orientation of each word or sentence from a PDF like the attached one. The reason for this is that I want to keep the text only from the orientation with zero ...
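One way to reason about orientation is through the text transformation matrix `(a, b, c, d, e, f)`: the rotation angle can be read from its first two entries. The sketch below assumes such a matrix is available per character (to my knowledge pdfminer.six stores one on each `LTChar` as `matrix`, but treat that as an assumption to verify). The function name and the sample matrices are hypothetical.

```python
import math


def rotation_degrees(matrix):
    """Return the rotation encoded in a 6-element matrix, in whole degrees [0, 360)."""
    a, b, _c, _d, _e, _f = matrix
    # atan2 is unaffected by uniform positive scaling such as the font size.
    return round(math.degrees(math.atan2(b, a))) % 360


# Hypothetical matrices: upright text, text rotated by 90 degrees,
# and upright text scaled by a font size of 12.
upright = (1, 0, 0, 1, 72, 700)
rotated = (0, 1, -1, 0, 72, 500)
scaled = (12, 0, 0, 12, 72, 300)

for m in (upright, rotated, scaled):
    print(m, rotation_degrees(m))
```

Keeping only text with zero rotation then reduces to checking `rotation_degrees(matrix) == 0` for each character before collecting it.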
1 vote, 0 answers, 49 views
Trying to extract data from PDFs, make sense of it, and upload it to a database
I've got many PDFs which contain data like name, address, contact info, email IDs and many more details.
I am trying to write a program to convert this data into a text file and using different ...
6 votes, 2 answers, 778 views
pdftotext get font information (font-family, style, size)
I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML.
Here's a sample line from the output:
<word xMin="351.852025" yMin="42.548936" xMax="365.689478"
yMax="47.681498">foo</...
0 votes, 1 answer, 776 views
How to get chars/words/lines/blocks coordinates
I'm doing pdftotext -bbox file.pdf and that produces word-level output.
Is there a way to output coordinates on the character/phrase/line/block level?
I'm interested in knowing if either the poppler ...
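If only word-level boxes are available, line-level coordinates can be approximated by grouping words whose top edges are close together. The sketch below illustrates that idea and is not a Poppler feature: the function names, the `y_tol` tolerance, and the simplified sample markup are hypothetical (real `-bbox` output is a full XHTML document). Newer Poppler builds also document a `-bbox-layout` option that reports blocks and lines directly, which may be preferable where it is available.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample modeled on pdftotext -bbox word elements.
SAMPLE = """<doc>
<page width="595" height="842">
<word xMin="72.0" yMin="100.0" xMax="100.0" yMax="110.0">Hello</word>
<word xMin="104.0" yMin="100.2" xMax="140.0" yMax="110.2">world</word>
<word xMin="72.0" yMin="120.0" xMax="95.0" yMax="130.0">Next</word>
</page>
</doc>"""

KEYS = ("xMin", "yMin", "xMax", "yMax")


def parse_words(xml_text):
    """Return one dict per <word> element, ignoring any XML namespace."""
    words = []
    for el in ET.fromstring(xml_text).iter():
        if el.tag.rsplit("}", 1)[-1] != "word":
            continue
        box = {k: float(el.attrib[k]) for k in KEYS}
        box["text"] = el.text or ""
        words.append(box)
    return words


def group_into_lines(words, y_tol=2.0):
    """Group words into lines when consecutive top edges differ by at most y_tol."""
    groups = []
    for w in sorted(words, key=lambda w: (w["yMin"], w["xMin"])):
        if groups and abs(w["yMin"] - groups[-1][-1]["yMin"]) <= y_tol:
            groups[-1].append(w)
        else:
            groups.append([w])
    lines = []
    for group in groups:
        group.sort(key=lambda w: w["xMin"])  # left-to-right reading order
        lines.append({
            "text": " ".join(w["text"] for w in group),
            "xMin": min(w["xMin"] for w in group),
            "yMin": min(w["yMin"] for w in group),
            "xMax": max(w["xMax"] for w in group),
            "yMax": max(w["yMax"] for w in group),
        })
    return lines


lines = group_into_lines(parse_words(SAMPLE))
for line in lines:
    print(line)
```

The tolerance has to be tuned per document; multi-column layouts would additionally need a split on large horizontal gaps, which this sketch does not attempt.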
2 votes, 1 answer, 3k views
PDF scraping using textract module
I have a Node.js app that has to do some web scraping of online PDFs.
This is a piece of code:
var textract = require('textract');
const util = require('util');
var methods = {};
var urls = [
{...
0 votes, 4 answers, 459 views
optical character recognition of PDFs of parliamentary debates
For contract work, I need to digitize a lot of old, scanned-image-only plenary debate protocol PDFs from the Federal Parliament of Germany.
The problem is that most of these files have a two-...
420 votes, 13 answers, 467k views
Python module for converting PDF to text [closed]
Is there any Python module to convert PDF files into text? I tried a piece of code found on ActiveState that uses pypdf, but the generated text had no spaces between words and was of no use.