All Questions

0 votes
0 answers
471 views

Extract only the body text of the PDF, not the bulleted points, headings and subheadings using python pdfplumber library

Code import pdfplumber ecdata = "" with pdfplumber.open("XYZ Transcript.pdf") as pdf: for i in range(len(pdf.pages)): print("Page No.: ", i+1) ...
Kituva Ravindran Praveen
0 votes
1 answer
3k views

pdfminer: extract only text according to font size

I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files. The code below returns a list of the font size of each text block and its characters for one ...
id345678
  • 107
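
The two sizes quoted in the excerpt above look like floating-point noise around 9.8 and 10.0, so comparing with a tolerance is probably safer than exact equality. A minimal sketch, assuming the (text, font size) pairs have already been extracted by some library; the `chars` list, `TARGET_SIZES` and `has_target_size` are hypothetical names invented for illustration, and the sample data is made up:

```python
import math

# Hypothetical input: (text, font_size) pairs already pulled from a PDF by
# whatever extraction library is in use. The values here are made up.
chars = [
    ("Heading", 14.0),
    ("Body text", 9.800000000000068),
    ("More body", 10.000000000000057),
    ("Footnote", 8.0),
]

TARGET_SIZES = (9.8, 10.0)  # the two sizes named in the question


def has_target_size(size, targets=TARGET_SIZES, tol=1e-6):
    """Return True if size is within tol of any target size."""
    return any(math.isclose(size, t, abs_tol=tol) for t in targets)


# Keep only the text whose font size matches one of the targets.
kept = [text for text, size in chars if has_target_size(size)]
print(kept)
```

The tolerance of `1e-6` is an arbitrary choice; it only needs to be larger than the rounding noise and smaller than the gap between real font sizes in the document.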
2 votes
2 answers
2k views

Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?

Target: I want to extract the orientation of each word or sentence from a PDF like the attached one. The reason is that I want to keep the text only from the orientation with zero ...
Vagelis
  • 66
1 vote
0 answers
49 views

Trying to extract data from PDFs, make sense of it, and upload it to a database

I've got many PDFs which contain data like name, address, contact info, email IDs and many more details. I am trying to write a program to convert this data into a text file and using different ...
suyash joshi
6 votes
2 answers
778 views

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: <word xMin="351.852025" yMin="42.548936" xMax="365.689478" yMax="47.681498">foo</...
James Kroning
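
The sample line quoted in the excerpt above carries only coordinates, so one rough option is to read the bounding box and treat its height as a stand-in for font size. A minimal sketch using only the standard library; `parse_word` is a hypothetical helper, the closing `</word>` tag is an assumption (the excerpt is cut off before it), and box height is only a heuristic, not real font metadata:

```python
import xml.etree.ElementTree as ET

# Sample modeled on the line quoted in the question. The closing </word> tag
# is an assumption: the excerpt is cut off before it.
sample = (
    '<word xMin="351.852025" yMin="42.548936" '
    'xMax="365.689478" yMax="47.681498">foo</word>'
)


def parse_word(xml_fragment):
    """Return the word text and its bounding box as a dict of floats."""
    elem = ET.fromstring(xml_fragment)
    box = {k: float(elem.attrib[k]) for k in ("xMin", "yMin", "xMax", "yMax")}
    return elem.text, box


text, box = parse_word(sample)
height = box["yMax"] - box["yMin"]  # box height, a rough proxy for font size
print(text, box, round(height, 3))
```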
0 votes
1 answer
776 views

How to get chars/words/lines/blocks coordinates

I'm doing pdftotext -bbox file.pdf and that produces word-level output. Is there a way to output coordinates on the character/phrase/line/block level? I'm interested in knowing if either the poppler ...
James Kroning
2 votes
1 answer
3k views

PDF scraping using textract module

I have a Node.js app that has to do some web scraping of online PDFs. This is a piece of code: var textract = require('textract'); const util = require('util'); var methods = {}; var urls = [ {...
0 votes
4 answers
459 views

optical character recognition of PDFs of parliamentary debates

For contract work, I need to digitize a lot of old, scanned-graphic-only plenary debate protocol PDFs from the Federal Parliament of Germany. The problem is that most of these files have a two-...
Cetin Sert
  • 4,601
420 votes
13 answers
467k views

Python module for converting PDF to text [closed]

Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState which uses pypdf, but the generated text had no spaces between words and was of no use.
cnu
  • 37.3k