edited Jul 15, 2020 at 23:57

Ben A


Determine table structure in image using whitespace between coordinates

I'm looking to see if there are any better / faster ways of identifying table structures on a page without gridlines.

The page is converted to an image with rectangular boxes for each strip of text. For the sake of this snippet, this has already been generated and has yielded the dataframe below. This is ordered top to bottom, left to right in reading order.

    top     top1    left    middle  left1
0   73.0    141     76      282.0   489.0
1   73.0    95      614     667.0   721.0
2   95.0    117     614     683.0   753.0
3   118.0   140     614     668.0   722.0
4   140.0   162     614     715.0   816.0
5   163.0   185     614     629.0   645.0
6   254.0   272     76      118.0   160.0
7   254.0   272     614     638.0   662.0
8   279.0   298     614     703.0   792.0
9   294.0   315     76      76.0    77.0
10  294.0   315     77      296.0   516.0
11  294.0   321     614     710.0   806.0
12  313.0   334     76      167.0   259.0
13  326.0   345     614     703.0   792.0
14  341.0   361     76      147.0   219.0
15  350.0   369     614     698.0   783.0
16  373.0   392     614     715.0   817.0
17  383.0   404     76      76.0    77.0
18  383.0   404     77      276.0   476.0
19  397.0   416     614     713.0   812.0
20  410.0   430     76      158.0   241.0

....... etc......

My method is to group by an x coordinate (taking into account that the text could be left-justified, centred or right-justified) and search for any other points which are close (within a tolerance of 5 pixels in this snippet). This gives me my columns.
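As a minimal sketch of the tolerance match on its own, assuming a handful of rows copied from the sample frame (re-indexed from 0) and a hypothetical helper name `rows_near`:

```python
import numpy as np
import pandas as pd

# A few rows taken from the sample frame above (re-indexed from 0).
df = pd.DataFrame({
    "left":   [76, 614, 614, 614, 76, 77],
    "middle": [282.0, 667.0, 683.0, 668.0, 76.0, 296.0],
    "left1":  [489.0, 721.0, 753.0, 722.0, 77.0, 516.0],
})

def rows_near(frame, col, x, tol=5):
    """Return the rows whose `col` value lies within `tol` pixels of `x`.

    rtol=0 makes the comparison purely absolute; `rows_near` is a
    hypothetical helper, not part of the code under review.
    """
    return frame.loc[np.isclose(frame[col], x, rtol=0, atol=tol)]

right_column = rows_near(df, "left", 614)
left_column = rows_near(df, "left", 76)
```

Running the same lookup on `middle` and `left1` would cover centred and right-justified text.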

Then, for each column identified, I find the rows by looking for the points at which the gap between lines exceeds a certain threshold. I take the indexes of the points where the text should break and generate index pairs. By taking the min and max points of each group, I can generate a bounding box around that cell.
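To illustrate this step in isolation, here is a rough sketch that uses the right-hand column rows (1 to 5, 7 and 8) from the sample frame. The helper name `split_cells` and the 5 pixel default are assumptions for the example:

```python
import numpy as np
import pandas as pd

# Rows 1-5, 7 and 8 of the sample frame (all have left == 614).
col = pd.DataFrame({
    "top":   [73.0, 95.0, 118.0, 140.0, 163.0, 254.0, 279.0],
    "top1":  [95, 117, 140, 162, 185, 272, 298],
    "left":  [614] * 7,
    "left1": [721.0, 753.0, 722.0, 816.0, 645.0, 662.0, 792.0],
})

def split_cells(frame, gap=5):
    """Split one column into cells wherever the vertical gap exceeds `gap`.

    Returns a list of (top, left, bottom, right) bounding boxes.
    """
    # Distance from the bottom of the previous line to the top of this one.
    # The first row has no predecessor (NaN), so it never counts as a break.
    gaps = (frame["top"] - frame["top1"].shift()).to_numpy()
    breaks = np.flatnonzero(gaps > gap)
    edges = [0, *breaks, len(frame)]

    cells = []
    for start, end in zip(edges[:-1], edges[1:]):
        box = frame.iloc[start:end]
        cells.append((box["top"].min(), box["left"].min(),
                      box["top1"].max(), box["left1"].max()))
    return cells

cells = split_cells(col)
```

Here the first five lines sit within a pixel or so of each other and collapse into one cell, while the lines starting at 254 and 279 each become their own cell.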

Then, I look to see if there are other boxes aligned on the same line (matching top, centre or bottom edge) and store these in a table list.
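The comparison between two boxes could be sketched like this, assuming boxes are `(top, left, bottom, right)` tuples and using a hypothetical helper `same_line` that checks the top, centre and bottom edges as the code below does:

```python
def same_line(a, b, tol=5):
    """True if two (top, left, bottom, right) boxes line up within `tol` pixels.

    The boxes match if their top edges, vertical centres or bottom edges
    are close. `same_line` is a hypothetical name used for illustration.
    """
    a_top, _, a_bottom, _ = a
    b_top, _, b_bottom, _ = b
    a_centre = (a_top + a_bottom) / 2
    b_centre = (b_top + b_bottom) / 2
    edges = [(a_top, b_top), (a_centre, b_centre), (a_bottom, b_bottom)]
    return any(abs(p - q) <= tol for p, q in edges)

# Rows 6 and 7 of the sample frame share top=254 and top1=272.
left_box = (254.0, 76, 272, 160.0)
right_box = (254.0, 614, 272, 662.0)
# A taller box higher up the page (rows 1-5 merged).
header_box = (73.0, 614, 185, 816.0)
```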

Finally, I form pairs from the table list and look at the index distance between consecutive items. As the indexes should run sequentially, this should equal 1. If it doesn't, this indicates that the table doesn't continue.
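A small sketch of that continuity test, mirroring the `i[1][0][-1] - i[0][1][-1]` comparison in the code below. The index lists are made up for the example and `is_continuation` is a hypothetical name:

```python
def is_continuation(prev_pair, next_pair):
    """True if `next_pair` starts on the index right after `prev_pair` ends.

    Each pair is [left_box_indexes, right_box_indexes]; we compare the last
    index of the next pair's first box with the last index of the previous
    pair's second box.
    """
    return next_pair[0][-1] - prev_pair[1][-1] == 1

# Made-up index lists: rows 6 and 7 form one table row, 8 and 9 the next.
row_a = [[6], [7]]
row_b = [[8], [9]]
row_c = [[12], [13]]
```

With these inputs, `row_b` continues `row_a`, while `row_c` does not, because the indexes jump from 7 to 12.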

import itertools

import numpy as np

def pairwise(splits):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(splits, 2)
    next(b, None)
    return list(zip(a, b))

def space_sort(df):
    # Use the frame passed in rather than the global df_coord.
    groups = df.loc[df.table == False].groupby('page')
    pages = {i:j[['top','top1','left','middle','left1']] for i,j in groups}
    cols = ['left','middle','left1']
    boxes = {}
    for page in pages:
        c_df = pages[page]
        min_x = min(c_df.left)
        # left_diff is a column of the full frame, so align it to this page's rows.
        gaps = c_df.loc[df.loc[c_df.index, 'left_diff'] > 5]

        # Collect the unique left, middle and left1 values so we can deal with text justification.
        counts = {'left': [], 'middle': [], 'left1': []}
        for col in cols:
            if (gaps[col].value_counts() > 2).any():
                counts[col].append(gaps[col].unique())
        
        if len(counts['left'])>0:
            counts['left'][0] = np.insert(counts['left'][0], 0, int(min_x))

        #  search c_df for other points close to these x values.
        for col in cols:
            if len(counts[col])>0:
                for x in counts[col][0]:
                    matches = c_df.loc[np.isclose(c_df[col], x, atol=5)]
                    left_groups = df.loc[matches.index.values].reset_index()

                    # Find points where the line gap is > 5, indicating a new row. Get indexes.
                    vert_gaps = left_groups.loc[(left_groups.top - left_groups.top1.shift()) > 5]
                    vert_indexes = vert_gaps.index.values
                    vert_indexes = np.insert(vert_indexes, 0, 0)
                    vert_indexes = np.append(vert_indexes, len(left_groups))

                    # Form groups between rows and store each cell's bounding box.
                    pairs = pairwise(vert_indexes)
                    for start, end in pairs:
                        box = left_groups.loc[start:end-1]
                        coords = (page, min(box.top), min(box.left), max(box.top1), max(box.left1))
                        boxes[coords] = list(box['index'])

    # Find boxes on the same line by checking which align vertically (top, centre or bottom).
    
    table = []
    for a, b in itertools.combinations(boxes, 2):

        a_pg, a_top, a_left, a_top1, a_left1 = a
        b_pg, b_top, b_left, b_top1, b_left1 = b
        a_centre = (a_top+a_top1)//2
        b_centre = (b_top+b_top1)//2
        if (np.isclose(a_top, b_top, atol=5)) | (np.isclose(a_centre, b_centre, atol=5)) | (np.isclose(a_top1, b_top1, atol=5)):
            table.append([boxes[a],boxes[b]])
    
#  Table list contains two lists of indexes of rows which are close together. 
#  As ordered, the indexes should be sequential.
# If difference between one pair and next is 1, sequential. If not, reset rows to 1
    t = (pairwise(table))
    row = 0
    for i in t:
        if (i[1][0][-1] - i[0][1][-1]) == 1:
            for r in i:
                row+=1
                num = 1
                for col in r:
                    print('indexes', col, 'row',row, 'col',num)
                    num+=1
        else:
            row = 0

asked Jul 15, 2020 at 21:36

lawson