Newest 'pdf-scraping+text-extraction' Questions

0 votes

0 answers

471 views

Extract only the body text of the PDF, not the bulleted points, headings and subheadings using python pdfplumber library

Code import pdfplumber ecdata = "" with pdfplumber.open("XYZ Transcript.pdf") as pdf: for i in range(len(pdf.pages)): print("Page No.: ", i+1) ...

Kituva Ravindran Praveen

45

asked Aug 12, 2022 at 5:15

0 votes

1 answer

3k views

pdfminer: extract only text according to font size

I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files. The code below returns a list of the font size of each text block and its characters for one ...

id345678

107

asked Aug 22, 2021 at 15:32

2 votes

2 answers

2k views

Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?

Target: I want to extract the info on the orientation of each word or sentence from a PDF like the attached one. The reason for this is that i want to keep the text only from the orientation with zero ...

Vagelis

66

asked Sep 24, 2020 at 9:53

1 vote

0 answers

49 views

trying to extract data from pdf and make sense of it and upload it to a database

Ive got many PDF's which contain data like name , Address , Contact info , Email Id's and many more details. i am trying to write a program to convert this data into Text file and using different ...

suyash joshi

61

asked Nov 16, 2019 at 7:00

6 votes

2 answers

778 views

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: <word xMin="351.852025" yMin="42.548936" xMax="365.689478" yMax="47.681498">foo</...

James Kroning

71

asked May 6, 2018 at 11:23

0 votes

1 answer

776 views

How to get chars/words/lines/blocks coordinates

I'm doing pdftotext -bbox file.pdf and that produces word-level output. Is there a way to output coordinates on the character/phrase/line/block level? I'm interested in knowing if either the poppler ...

James Kroning

71

asked May 6, 2018 at 9:51

2 votes

1 answer

3k views

PDF scraping using textract module

I have a Node.js app that have to do some web scraping of online pdf. This is a piece of code: var textract = require('textract'); const util = require('util'); var methods = {}; var urls = [ {...

user6118527

asked Apr 24, 2018 at 13:16

0 votes

4 answers

459 views

optical character recognition of PDFs of parliamentary debates

For a contract work, I need to digitalize a lot of old, scanned-graphic-only plenary debate protocol PDFs from the Federal Parliament of Germany. The problem is that most of these files have a two-...

Cetin Sert

4,601

asked Jul 9, 2009 at 14:59

420 votes

13 answers

467k views

Python module for converting PDF to text [closed]

Is there any python module to convert PDF files into text? I tried one piece of code found in Activestate which uses pypdf but the text generated had no space between and was of no use.

cnu

37.3k

asked Aug 25, 2008 at 4:44

Collectives™ on Stack Overflow

All Questions

Extract only the body text of the PDF, not the bulleted points, headings and subheadings using python pdfplumber library

pdfminer: extract only text according to font size

Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?

trying to extract data from pdf and make sense of it and upload it to a database

pdftotext get font information (font-family, style, size)

How to get chars/words/lines/blocks coordinates

PDF scraping using textract module

optical character recognition of PDFs of parliamentary debates

Python module for converting PDF to text [closed]

Hot Network Questions

Collectives™ on Stack Overflow

All Questions

Related Tags