I’m trying to extract tabular data from a scanned engineering document.
The table contains:
- merged header cells
- irregular row heights
- irregular column widths
- faint and broken borders
- text inside every cell
- vertical strokes in text that look like borders
- engineering symbols and logos
My goal is to extract the table in the exact same structure as the image:
- correct rows
- correct columns
- correct merged cells
- correct OCR text inside each cell
❗ What I need
A solution that can:
- Detect the true horizontal and vertical table lines
- Reconstruct the table grid accurately
- Identify and handle merged cells
- Run OCR cell by cell in the correct reading order
- Produce structured output (e.g., a Pandas DataFrame) matching the original layout
❗ What I’ve tried (with code) — but still not working
I have tried multiple OpenCV-based approaches, but none can robustly reconstruct the table.
Below is a summary of each method and why it fails.
1. Contour-based cell detection
cnts, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
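For illustration, a minimal version of this approach looks like the following (not my exact code; thresh is the inverted binary image, the size thresholds are arbitrary, and boxes are bucketed top-to-bottom and then sorted left-to-right to approximate reading order):

cnts, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

boxes = []
for c in cnts:
    x, y, w, h = cv2.boundingRect(c)
    # crude size filter intended to drop character-sized contours
    if w > 30 and h > 15:
        boxes.append((x, y, w, h))

# approximate reading order: bucket by row band, then sort left-to-right
boxes.sort(key=lambda b: (b[1] // 10, b[0]))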
❌ Problems:
- text inside the cells produces additional contours
- grid lines that touch text merge into large irregular polygons
- hard to get reading order
- merged cells break the hierarchy
- header row splits into many sub-contours
2. Hough Line Transform
lines = cv2.HoughLinesP(binary, 1, np.pi/180, threshold=50)
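For illustration, a slightly fuller sketch of this step (binary is the inverted binary image; the length and gap parameters are arbitrary guesses) is:

lines = cv2.HoughLinesP(binary, 1, np.pi / 180, threshold=50,
                        minLineLength=60, maxLineGap=10)

h_segs, v_segs = [], []
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        # classify each segment as horizontal or vertical by its slope
        if abs(y2 - y1) <= 2:
            h_segs.append((x1, y1, x2, y2))
        elif abs(x2 - x1) <= 2:
            v_segs.append((x1, y1, x2, y2))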
❌ Problems:
- partial faint lines detected as many segments
- cannot distinguish broken borders from noise
- short text strokes (“I”, “|”, “1”) detected as vertical lines
- merging line segments accurately becomes very difficult
3. Morphological Line Detection
Horizontal lines:
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (80, 1))
h_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel)
Vertical lines:
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 80))
v_lines = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel)
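If these two masks were clean, the plan would be to rebuild the grid from their union and intersections, along these lines (illustrative sketch; the dilation size and minimum cell area are arbitrary):

# union of the masks approximates the grid; intersections mark candidate cell corners
grid = cv2.bitwise_or(h_lines, v_lines)
joints = cv2.bitwise_and(h_lines, v_lines)   # corner candidates, useful for merged-cell reasoning

# dilate so slightly broken borders reconnect before cell extraction
grid = cv2.dilate(grid, cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)), iterations=2)

# each enclosed white region of the inverted grid mask should be one cell
# (the largest component is the page background and must be discarded)
n, labels, stats, _ = cv2.connectedComponentsWithStats(cv2.bitwise_not(grid))
cell_boxes = [tuple(stats[i][:4]) for i in range(1, n) if stats[i][4] > 500]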
❌ Problems:
- vertical strokes from letters like “M”, “I”, “H”, “T”, “1” are detected as column lines
- small horizontal strokes under text get detected as row lines
- header row gets split into 10–20 false columns
- faint broken borders produce multiple lines
Even after line merging and connected-components filtering, text strokes are still detected as lines.
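For reference, the kind of merging I mean is collapsing nearby line positions into single row/column coordinates, roughly like this (illustrative; candidate_xs is a hypothetical sorted list of x-positions of detected vertical lines, and the tolerance is arbitrary):

tol = 8                              # merge tolerance in pixels (arbitrary)
groups = []
for x in sorted(candidate_xs):       # candidate_xs: hypothetical x-positions of vertical lines
    # start a new group when the gap to the previous position exceeds the tolerance
    if not groups or x - groups[-1][-1] > tol:
        groups.append([x])
    else:
        groups[-1].append(x)
column_xs = [sum(g) // len(g) for g in groups]   # one representative x per column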
4. Connected Component Filtering
I attempted strong filtering:
if h >= 0.7 * image_height and w <= 10:
    keep_vertical_line  # pseudocode: keep this component as a vertical border
I apply a similar rule for horizontal lines.
❌ Problems:
- real table borders are sometimes broken → rejected
- text strokes occasionally exceed the height threshold → accepted
- threshold tuning is image-dependent
❗ Why this is hard
This engineering table includes:
- many merged cells
- multiple header bands
- broken borders due to scanning
- text overlapping borders
- vertical elements in the logo
- dense text with vertical-like strokes
- faint interior separators
Because of these issues, pure OpenCV geometric reconstruction becomes unreliable.
❗ What I am asking for
What is the recommended method (OpenCV-only, hybrid, or ML-based) to:
- reliably detect the table grid,
- preserve the table structure,
- handle merged cells,
- and extract OCR text cell by cell in the correct order
from a complex scanned engineering document?
I’m looking for:
- a robust OpenCV pipeline, OR
- a deep learning approach (TableNet, CascadeTabNet, YOLO table models), OR
- hybrid OpenCV + ML guidance, OR
- built-in table extraction tools that handle structural reconstruction (a rough, unverified sketch of this option follows below)
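As an example of the last option: if I understand its documentation correctly, a library such as img2table is used roughly as follows. I have not verified that this API or these parameters fit my document, so treat it as an unconfirmed sketch:

from img2table.document import Image
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(lang="eng")
doc = Image("table.png")

# extract_tables should return table objects that expose a pandas DataFrame
tables = doc.extract_tables(ocr=ocr, borderless_tables=False, min_confidence=50)
df = tables[0].df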
📌 Minimal Reproducible Example
Here is the code I am currently using:
import cv2
import numpy as np
# load the scanned page and convert to grayscale
img = cv2.imread("table.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
bw = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C,
    cv2.THRESH_BINARY_INV,
    15, 8
)
# Morphological line detection
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (80, 1))
raw_h = cv2.morphologyEx(bw, cv2.MORPH_OPEN, h_kernel)
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 80))
raw_v = cv2.morphologyEx(bw, cv2.MORPH_OPEN, v_kernel)
# Filtering (still fails)
def filter_vertical(binary, min_height):
    # keep only tall, thin connected components as candidate vertical borders
    num, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    out = np.zeros_like(binary)
    for i in range(1, num):
        x, y, w, h, area = stats[i]
        if h >= min_height and w < 10:
            out[y:y+h, x:x+w] = 255
    return out
v_lines = filter_vertical(raw_v, int(0.7 * img.shape[0]))
cv2.imwrite("vlines_debug.png", v_lines)
Even with strong filtering, vertical text strokes still appear as table lines, breaking the grid.
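Once correct cell boxes exist, the per-cell extraction step I have in mind is a plain crop-and-OCR loop, roughly like this (sketch only; cell_boxes is a hypothetical list of (x, y, w, h) boxes, pytesseract is one OCR option, and the naive row bucketing below does not handle merged cells, which is exactly the part I am missing):

import pandas as pd
import pytesseract

rows = {}
for (x, y, w, h) in cell_boxes:              # hypothetical boxes, already in reading order
    cell = gray[y:y + h, x:x + w]
    text = pytesseract.image_to_string(cell, config="--psm 6").strip()
    rows.setdefault(y // 20, []).append(text)  # coarse row bucketing by top coordinate

df = pd.DataFrame([rows[k] for k in sorted(rows)])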
📌 Desired Output
A Pandas DataFrame where:
- cell positions map to the real table structure
- merged cells are respected
- data is extracted in the correct reading order
- the table layout matches the scanned image (a tiny illustrative example is shown below)
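For example, for a header cell spanning two columns, a representation like the following (placeholder values, purely illustrative) would be acceptable, with the merged value repeated across the columns it covers:

import pandas as pd

# placeholder values only: "Header B" is a merged cell spanning two columns
df = pd.DataFrame([
    ["Header A", "Header B", "Header B"],
    ["a1",       "b1",       "c1"],
    ["a2",       "b2",       "c2"],
])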
🙏 Any guidance would be greatly appreciated.
I am open to:
- OpenCV-only solutions
- ML-based table structure models
- hybrid approaches
- or suggestions for more reliable tooling