Timeline for Determine table structure in pdf using whitespace between coordinates

Current License: CC BY-SA 4.0

19 events

when	what		by	license	comment
Jul 16, 2020 at 23:54	comment	added	RootTwo		The provided code doesn't appear to work with the sample data. The 'page' column is missing. Fix that and you get "NameError: name 'df_coord' is not defined". Also, what is the expected output for the sample data?
Jul 16, 2020 at 21:06	comment	added	lawson		Hi @RootTwo - yes, need the code to figure that out
Jul 16, 2020 at 20:10	comment	added	RootTwo		Do you already know how many rows and columns there are, or does the code need to figure that out?
Jul 16, 2020 at 17:31	comment	added	spyr03		@lawson I've posted a review, but I'm not happy with it, in that I don't directly address the performance aspect of your question. Hopefully someone else can enlighten us with more knowledge in this area.
Jul 16, 2020 at 17:26	answer	added	spyr03		timeline score: 3
Jul 16, 2020 at 15:00	history	tweeted			twitter.com/StackCodeReview/status/1283778446435520514
Jul 16, 2020 at 14:57	history	edited	lawson	CC BY-SA 4.0	Clarified question - dataframe columns explained in description
Jul 16, 2020 at 14:48	comment	added	lawson		Hi @spyr03 - sorry, this isn't clear. I'll update the question
Jul 16, 2020 at 14:32	comment	added	spyr03		I'm confused by the dataframe. What does each column of the sample dataframe correspond to? You've mentioned x co-ordinates, but I don't know which is which? What is df.left_diff?
Jul 16, 2020 at 6:10	history	edited	lawson	CC BY-SA 4.0	Clarified question - text extraction has already been completed prior to the code shown
Jul 16, 2020 at 6:02	comment	added	lawson		The PyPDF2 package does a good job of extracting the text, and poppler-utils also has some good binaries. This question starts from the point at which the text has already been extracted, with the dataframe above representing the coordinates of each of the blocks of text on the page. The task then is to try to get some sense of order from the coordinates which indicate a table, and this is what I'm getting at with this question. Whilst this works, I'm sure it could be improved, which is what I'm looking for some input on.
Jul 15, 2020 at 23:57	history	edited	Ben A	CC BY-SA 4.0	added 2 characters in body
Jul 15, 2020 at 22:32	comment	added	Reinderien		If there is no copying and pasting being done, how do you get the text? OCR?
Jul 15, 2020 at 22:22	comment	added	lawson		The documents are highly variable, not well structured and require preprocessing so at no point is any copying and pasting done. This aims to identify tables where no gridlines are present and give some structure to the doc.
Jul 15, 2020 at 22:11	comment	added	Reinderien		There are better ways to extract tables from a PDF than copying and pasting text out of them if the PDF is well-understood and comes from the same source every time.
Jul 15, 2020 at 21:58	comment	added	lawson		It's for preserving tables and paragraph formatting from PDFs
Jul 15, 2020 at 21:44	comment	added	Reinderien		But why? If this is a "contrived" problem for you to practice on, then fine. But if you have to do this in real life, something is wrong.
Jul 15, 2020 at 21:37	review	First posts			Jul 15, 2020 at 23:06
Jul 15, 2020 at 21:36	history	asked	lawson	CC BY-SA 4.0
