Timeline for Determine table structure in pdf using whitespace between coordinates
Current License: CC BY-SA 4.0
19 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Jul 16, 2020 at 23:54 | comment | added | RootTwo | The provided code doesn't appear to work with the sample data. The 'page' column is missing. Fixing that then gives "NameError: name 'df_coord' is not defined". Also, what is the expected output for the sample data? | |
| Jul 16, 2020 at 21:06 | comment | added | lawson | Hi @RootTwo - yes, need the code to figure that out | |
| Jul 16, 2020 at 20:10 | comment | added | RootTwo | Do you already know how many rows and columns there are, or does the code need to figure that out? | |
| Jul 16, 2020 at 17:31 | comment | added | spyr03 | @lawson I've posted a review, but I'm not happy with it, in that I don't directly address the performance aspect of your question. Hopefully someone else can enlighten us with more knowledge in this area. | |
| Jul 16, 2020 at 17:26 | answer | added | spyr03 | timeline score: 3 | |
| Jul 16, 2020 at 15:00 | history | tweeted | twitter.com/StackCodeReview/status/1283778446435520514 | ||
| Jul 16, 2020 at 14:57 | history | edited | lawson | CC BY-SA 4.0 | Clarified question - dataframe columns explained in description |
| Jul 16, 2020 at 14:48 | comment | added | lawson | Hi @spyr03 - sorry, this isn't clear. I'll update the question | |
| Jul 16, 2020 at 14:32 | comment | added | spyr03 | I'm confused by the dataframe. What does each column of the sample dataframe correspond to? You've mentioned x co-ordinates, but I don't know which is which? What is df.left_diff? | |
| Jul 16, 2020 at 6:10 | history | edited | lawson | CC BY-SA 4.0 | Clarified question - text extraction has already been completed prior to the code shown |
| Jul 16, 2020 at 6:02 | comment | added | lawson | The PyPDF2 package does a good job of extracting the text, and poppler-utils has some good binaries. This question starts from the point at which the text has already been extracted, with the dataframe above representing the coordinates of each block of text on the page. The task then is to get some sense of order from the coordinates that indicate a table, and this is what I'm getting at with this question. While this works, I'm sure it could be improved, and that is what I'm looking for input on. | |
| Jul 15, 2020 at 23:57 | history | edited | Ben A | CC BY-SA 4.0 | added 2 characters in body |
| Jul 15, 2020 at 22:32 | comment | added | Reinderien | If there is no copying and pasting being done, how do you get the text? OCR? | |
| Jul 15, 2020 at 22:22 | comment | added | lawson | The documents are highly variable, not well structured and require preprocessing so at no point is any copying and pasting done. This aims to identify tables where no gridlines are present and give some structure to the doc. | |
| Jul 15, 2020 at 22:11 | comment | added | Reinderien | There are better ways to extract tables from a PDF than copying and pasting text out of them if the PDF is well-understood and comes from the same source every time. | |
| Jul 15, 2020 at 21:58 | comment | added | lawson | It's for preserving tables and paragraph formatting from PDFs | |
| Jul 15, 2020 at 21:44 | comment | added | Reinderien | But why? If this is a "contrived" problem for you to practice on, then fine. But if you have to do this in real life, something is wrong. | |
| Jul 15, 2020 at 21:37 | review | First posts | |||
| Jul 15, 2020 at 23:06 | |||||
| Jul 15, 2020 at 21:36 | history | asked | lawson | CC BY-SA 4.0 |
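
The comments above describe the task as recovering table structure from the coordinates of text blocks that have already been extracted. As a rough illustration of that idea only, and not the code from the question, the sketch below groups the left x-coordinates of text blocks into columns wherever the whitespace gap between neighbouring coordinates exceeds a threshold. The function names, the 20-point threshold, and the sample coordinates are all assumptions made up for this example; plain lists stand in for the dataframe.

```python
def split_on_gaps(values, min_gap):
    """Group 1-D coordinates, starting a new group whenever the gap to the
    previous (sorted) value is larger than min_gap."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= min_gap:
            groups[-1].append(v)   # close enough: same column
        else:
            groups.append([v])     # wide whitespace gap: new column
    return groups


def assign_columns(lefts, min_gap=20.0):
    """Return a column index for each left x-coordinate, in input order.

    min_gap is a hypothetical threshold; real documents would need tuning.
    """
    groups = split_on_gaps(lefts, min_gap)
    lookup = {x: i for i, group in enumerate(groups) for x in group}
    return [lookup[x] for x in lefts]


# Hypothetical left edges (in PDF points) of six text blocks.
lefts = [72.0, 73.5, 210.0, 208.0, 400.0, 72.5]
columns = assign_columns(lefts)
print(columns)
```

The same grouping could be applied to the top coordinates to infer rows; whether a fixed threshold is adequate depends on how variable the documents are.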