19 events
when what by license comment
Jul 16, 2020 at 23:54 comment added RootTwo The provided code doesn't appear to work with the sample data: the 'page' column is missing. Fix that and you get "NameError: name 'df_coord' is not defined". Also, what is the expected output for the sample data?
Jul 16, 2020 at 21:06 comment added lawson Hi @RootTwo - yes, the code needs to figure that out
Jul 16, 2020 at 20:10 comment added RootTwo Do you already know how many rows and columns there are, or does the code need to figure that out?
Jul 16, 2020 at 17:31 comment added spyr03 @lawson I've posted a review, but I'm not happy with it, in that I don't directly address the performance aspect of your question. Hopefully someone else can enlighten us with more knowledge in this area.
Jul 16, 2020 at 17:26 answer added spyr03 timeline score: 3
Jul 16, 2020 at 15:00 history tweeted twitter.com/StackCodeReview/status/1283778446435520514
Jul 16, 2020 at 14:57 history edited lawson CC BY-SA 4.0
Clarified question - dataframe columns explained in description
Jul 16, 2020 at 14:48 comment added lawson Hi @spyr03 - sorry, this isn't clear. I'll update the question
Jul 16, 2020 at 14:32 comment added spyr03 I'm confused by the dataframe. What does each column of the sample dataframe correspond to? You've mentioned x co-ordinates, but I don't know which is which. What is df.left_diff?
Jul 16, 2020 at 6:10 history edited lawson CC BY-SA 4.0
Clarified question - text extraction has already been completed prior to the code shown
Jul 16, 2020 at 6:02 comment added lawson The PyPDF2 package does a good job of extracting the text, and poppler-utils provides some good binaries. This question starts from the point at which the text has already been extracted; the dataframe above represents the coordinates of each block of text on the page. The task is then to recover some sense of order from the coordinates that indicate a table, which is what I'm getting at with this question. Whilst this works, I'm sure it could be improved, and that is what I'm looking for input on.
Jul 15, 2020 at 23:57 history edited Ben A CC BY-SA 4.0
added 2 characters in body
Jul 15, 2020 at 22:32 comment added Reinderien If there is no copying and pasting being done, how do you get the text? OCR?
Jul 15, 2020 at 22:22 comment added lawson The documents are highly variable, not well structured and require preprocessing so at no point is any copying and pasting done. This aims to identify tables where no gridlines are present and give some structure to the doc.
Jul 15, 2020 at 22:11 comment added Reinderien There are better ways to extract tables from a PDF than copying and pasting text out of them if the PDF is well-understood and comes from the same source every time.
Jul 15, 2020 at 21:58 comment added lawson It's for preserving tables and paragraph formatting from PDFs
Jul 15, 2020 at 21:44 comment added Reinderien But why? If this is a "contrived" problem for you to practice on, then fine. But if you have to do this in real life, something is wrong.
Jul 15, 2020 at 21:37 review First posts
Jul 15, 2020 at 23:06
Jul 15, 2020 at 21:36 history asked lawson CC BY-SA 4.0