I'm looking to see if there are any better / faster ways of identifying table structures on a page without gridlines.
The page is converted to an image with rectangular boxes for each strip of text. For the sake of this snippet, this has already been generated and has yielded the dataframe below. This is ordered top to bottom, left to right in reading order.
      top  top1  left  middle  left1
0    73.0   141    76   282.0  489.0
1    73.0    95   614   667.0  721.0
2    95.0   117   614   683.0  753.0
3   118.0   140   614   668.0  722.0
4   140.0   162   614   715.0  816.0
5   163.0   185   614   629.0  645.0
6   254.0   272    76   118.0  160.0
7   254.0   272   614   638.0  662.0
8   279.0   298   614   703.0  792.0
9   294.0   315    76    76.0   77.0
10  294.0   315    77   296.0  516.0
11  294.0   321   614   710.0  806.0
12  313.0   334    76   167.0  259.0
13  326.0   345   614   703.0  792.0
14  341.0   361    76   147.0  219.0
15  350.0   369   614   698.0  783.0
16  373.0   392   614   715.0  817.0
17  383.0   404    76    76.0   77.0
18  383.0   404    77   276.0  476.0
19  397.0   416   614   713.0  812.0
20  410.0   430    76   158.0  241.0
....... etc......
My method here is to group by an x coordinate (taking into account that the text could be left-justified, centred or right-justified) and search for any other points which are close (within a tolerance of 5 pixels in this snippet). This gives me my columns.
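To make the column step concrete, here is a minimal, dependency-free sketch. It is not the code further down: instead of `np.isclose` against candidate values it sorts the coordinates and chains neighbours that lie within the tolerance, and `cluster_positions`, `tol` and `min_count` are names made up for illustration. The input is the `left` column of rows 0 to 20 above.

```python
def cluster_positions(values, tol=5, min_count=3):
    """Group 1-D coordinates whose sorted neighbours lie within `tol` pixels."""
    clusters = []
    for v in sorted(values):
        # Extend the current cluster while the gap to the previous value is small.
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    # Keep only positions shared by enough boxes to plausibly be a column edge.
    return [c for c in clusters if len(c) >= min_count]


# 'left' values of rows 0-20 from the dataframe above.
lefts = [76, 614, 614, 614, 614, 614, 76, 614, 614, 76, 77, 614,
         76, 614, 76, 614, 614, 76, 77, 614, 76]

columns = cluster_positions(lefts)
# (min x, max x, number of boxes) for each candidate column edge.
edges = [(min(c), max(c), len(c)) for c in columns]
print(edges)  # [(76, 77, 9), (614, 614, 12)]
```

The same call could be repeated on `middle` and `left1` to cover centred and right-justified text. One caveat of chaining neighbours is that a long run of values each 5 pixels apart would merge into one cluster, so a tighter tolerance may be needed on dense pages.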
Then, for each column identified, I find the rows by looking for the points at which the vertical gap between consecutive lines is over a certain threshold. Here, we take the indexes of the points where the text should break and generate index pairs. By taking the max and min points within each pair, we can generate a bounding box around this cell.
Then, I look to see if there are other boxes which line up horizontally with each cell (their top, centre or bottom falls within the same tolerance) and store each matching pair in a table list.
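As a rough sketch of this alignment test, assuming each box is a `(top, left, top1, left1)` tuple: `aligned` and `tol` are hypothetical names, and a plain absolute difference stands in for `np.isclose`. The four boxes are built from rows 0, 1 to 5, 6 and 7 of the dataframe above.

```python
from itertools import combinations


def aligned(a, b, tol=5):
    """True if two (top, left, bottom, right) boxes share a top, centre or bottom line."""
    a_top, _, a_bottom, _ = a
    b_top, _, b_bottom, _ = b
    a_centre = (a_top + a_bottom) / 2
    b_centre = (b_top + b_bottom) / 2
    return (abs(a_top - b_top) <= tol
            or abs(a_centre - b_centre) <= tol
            or abs(a_bottom - b_bottom) <= tol)


boxes = [
    (73, 76, 141, 489),    # row 0
    (73, 614, 185, 816),   # rows 1-5 merged into one cell
    (254, 76, 272, 160),   # row 6
    (254, 614, 272, 662),  # row 7
]

# Every pair of boxes that sits on the same horizontal line.
pairs = [(a, b) for a, b in combinations(boxes, 2) if aligned(a, b)]
print(len(pairs))  # 2
```

Comparing every pair is quadratic in the number of boxes, so on large pages it may be worth sorting the boxes by `top` first and only comparing neighbours.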
Finally, form pairs from the tables and look at the index distance between each of the items in the table list. As the indexes should run sequentially, this should equal 1. If it doesn't, this indicates that the table doesn't continue.
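A small sketch of this continuity check, assuming each row is a list of cells and each cell is a list of dataframe indexes. `group_tables` is an illustrative name. The first two rows use the indexes from the dataframe above (0, 1 to 5, 6 and 7); the third row is made up to show a break.

```python
def group_tables(rows):
    """Chain rows into tables while the index sequence stays unbroken."""
    tables = []
    for row in rows:
        prev = tables[-1][-1] if tables else None
        # A row continues the table when its first cell ends exactly one
        # index after the last cell of the previous row.
        if prev is not None and row[0][-1] - prev[-1][-1] == 1:
            tables[-1].append(row)
        else:
            tables.append([row])
    return tables


rows = [
    [[0], [1, 2, 3, 4, 5]],  # indexes from the dataframe above
    [[6], [7]],              # 6 - 5 == 1, so the table continues
    [[20], [21]],            # hypothetical row: 20 - 7 != 1, new table
]

tables = group_tables(rows)
print([len(t) for t in tables])  # [2, 1]
```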
import itertools

import numpy as np


def pairwise(splits):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = itertools.tee(splits, 2)
    next(b, None)
    return list(zip(a, b))
def space_sort(df):
    # df needs 'page', 'table' and 'left_diff' columns as well as the coordinates.
    groups = df.loc[df.table == False].groupby('page')
    pages = {i: j[['top', 'top1', 'left', 'middle', 'left1']] for i, j in groups}
    cols = ['left', 'middle', 'left1']
    boxes = {}
    for page in pages:
        c_df = pages[page]
        min_x = min(c_df.left)
        gaps = c_df.loc[df.left_diff > 5]
        # value count on left, middle and left1 values so we can deal with text justification.
        counts = {'left': [], 'middle': [], 'left1': []}
        for col in cols:
            if (gaps[col].value_counts() > 2).any():
                counts[col].append(gaps[col].unique())
        if len(counts['left']) > 0:
            counts['left'][0] = np.insert(counts['left'][0], 0, int(min_x))
        # search c_df for other points close to these x values.
        for col in cols:
            if len(counts[col]) > 0:
                for x in counts[col][0]:
                    matches = c_df.loc[np.isclose(c_df[col], x, atol=5)]
                    left_groups = df.loc[matches.index.values].reset_index()
                    # find points where line diff > 5 indicating new row. Get indexes.
                    vert_gaps = left_groups.loc[(left_groups.top - left_groups.top1.shift()) > 5]
                    vert_indexes = vert_gaps.index.values
                    vert_indexes = np.insert(vert_indexes, 0, 0)
                    vert_indexes = np.append(vert_indexes, len(left_groups))
                    # form groups between rows.
                    pairs = pairwise(vert_indexes)
                    for start, end in pairs:
                        # .loc slicing is inclusive, hence end - 1.
                        box = left_groups.loc[start:end - 1]
                        coords = (page, min(box.top), min(box.left), max(box.top1), max(box.left1))
                        # map the bounding box to the original dataframe indexes it covers.
                        boxes[coords] = list(box['index'])
    # Find boxes on the same horizontal line: top, centre or bottom within 5 pixels.
    table = []
    for a, b in itertools.combinations(boxes, 2):
        a_pg, a_top, a_left, a_top1, a_left1 = a
        b_pg, b_top, b_left, b_top1, b_left1 = b
        a_centre = (a_top + a_top1) // 2
        b_centre = (b_top + b_top1) // 2
        if (np.isclose(a_top, b_top, atol=5)
                or np.isclose(a_centre, b_centre, atol=5)
                or np.isclose(a_top1, b_top1, atol=5)):
            table.append([boxes[a], boxes[b]])
    # Each entry in table holds two lists of row indexes which are close together.
    # As ordered, the indexes should be sequential.
    # If the difference between one pair and the next is 1, they are sequential.
    # If not, reset the row counter.
    t = pairwise(table)
    row = 0
    for i in t:
        if (i[1][0][-1] - i[0][1][-1]) == 1:
            for r in i:
                row += 1
                num = 1
                for col in r:
                    print('indexes', col, 'row', row, 'col', num)
                    num += 1
        else:
            row = 0