While building an OCR pipeline for automatic data extraction from a fairly messy scanned document, I got stuck at the image preprocessing stage, where I need to crop every cell out of a table before running Tesseract on the crops. One major complication is that rows and columns have to be labeled somehow, since the extracted data needs structure to hold any meaning at all. Additionally, the ability to extract only specific rows and/or columns would make my pipeline much faster when working with several documents of similar composition, which is its expected use.
Here is the code I currently use, taken almost directly from an open-source box detection algorithm by Kanan Vyas that I found earlier on GitHub:

    import cv2
    import numpy as np
    import os
    import glob


    def sort_contours(cnts, method="left-to-right"):
        # Sort contours by the x (i = 0) or y (i = 1) coordinate of their bounding boxes.
        reverse = False
        i = 0
        if method == "right-to-left" or method == "bottom-to-top":
            reverse = True
        if method == "top-to-bottom" or method == "bottom-to-top":
            i = 1
        boundingBoxes = [cv2.boundingRect(c) for c in cnts]
        (cnts, boundingBoxes) = zip(*sorted(zip(cnts, boundingBoxes),
                                            key=lambda b: b[1][i], reverse=reverse))
        return (cnts, boundingBoxes)


    def box_extraction(img_for_box_extraction_path, cropped_dir_path):
        img = cv2.imread(img_for_box_extraction_path, 0)  # read as grayscale
        # Blank out a 5-pixel strip along the left edge (a scalar, since img has one channel).
        img[:, :5] = 255
        (thresh, img_bin) = cv2.threshold(img, 128, 255,
                                          cv2.THRESH_BINARY | cv2.THRESH_OTSU)
        img_bin = 255 - img_bin  # invert: lines and text become white
        cv2.imwrite("Image_bin.jpg", img_bin)

        # Long thin kernels keep only vertical / horizontal strokes.
        kernel_length = np.array(img).shape[1] // 40
        vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length))
        hori_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length, 1))
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))

        img_temp1 = cv2.erode(img_bin, vertical_kernel, iterations=3)
        vertical_lines_img = cv2.dilate(img_temp1, vertical_kernel, iterations=3)
        cv2.imwrite("vertical_lines.jpg", vertical_lines_img)

        img_temp2 = cv2.erode(img_bin, hori_kernel, iterations=3)
        horizontal_lines_img = cv2.dilate(img_temp2, hori_kernel, iterations=3)
        cv2.imwrite("horizontal_lines.jpg", horizontal_lines_img)

        # Combine both line images into a single grid image.
        alpha = 0.5
        beta = 1.0 - alpha
        img_final_bin = cv2.addWeighted(vertical_lines_img, alpha, horizontal_lines_img, beta, 0.0)
        img_final_bin = cv2.erode(~img_final_bin, kernel, iterations=2)
        (thresh, img_final_bin) = cv2.threshold(img_final_bin, 128, 255,
                                                cv2.THRESH_BINARY | cv2.THRESH_OTSU)
        cv2.imwrite("img_final_bin.png", img_final_bin)

        # [-2:] works with both OpenCV 3 (three return values) and OpenCV 4 (two).
        contours, hierarchy = cv2.findContours(
            img_final_bin, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]
        (contours, boundingBoxes) = sort_contours(contours, method="top-to-bottom")

        idx = 0
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            # Keep only boxes of plausible cell size.
            if (w > 30 and h > 20) and w > h and h < 150:
                idx += 1
                new_img = img[y:y+h, x:x+w]
                cv2.imwrite(cropped_dir_path + str(idx) + '.png', new_img)

        cv2.drawContours(img, contours, -1, (0, 0, 255), 3)
        cv2.imwrite("./Temp/img_contour.jpg", img)


    # Clear the output directory before each run.
    output_dir = "./Cropped/"
    files = glob.glob(output_dir + "*")
    for f in files:
        os.remove(f)
    box_extraction("./prototype/page_cropped.png", output_dir)

The problem with this approach is the sorting: the simple "directional" methods used here fail to preserve the table structure, especially if the table is somewhat skewed in the source image, which it remains because of the trade-off between dewarping table borders and preserving table contents for OCR. In case it matters, I do not currently expect to deal with table cells that span more than one row or column.
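To make the goal concrete, here is a minimal sketch of the kind of ordering I am after: group the bounding boxes into rows by their vertical centres, then sort each row by x. This is not part of the original algorithm; `sort_boxes_row_major` and its `row_tolerance` parameter are hypothetical names, and the default tolerance (half the median cell height) is an assumption that only holds while the skew across one row stays well below the row height.

```python
def sort_boxes_row_major(boxes, row_tolerance=None):
    """Return a list of rows (top to bottom), each a list of
    (x, y, w, h) boxes ordered left to right.

    row_tolerance (hypothetical parameter): maximum vertical distance
    between a box centre and the mean centre of the row it joins.
    """
    if not boxes:
        return []
    if row_tolerance is None:
        # Assumption: half the median box height separates adjacent rows.
        heights = sorted(h for (_, _, _, h) in boxes)
        row_tolerance = heights[len(heights) // 2] / 2.0

    rows = []  # each entry: [sum of centres, list of boxes]
    for box in sorted(boxes, key=lambda b: b[1] + b[3] / 2.0):
        centre = box[1] + box[3] / 2.0
        if rows and abs(centre - rows[-1][0] / len(rows[-1][1])) <= row_tolerance:
            # Close enough to the current row's mean centre: same row.
            rows[-1][0] += centre
            rows[-1][1].append(box)
        else:
            rows.append([centre, [box]])
    # Within a row only the x coordinate decides the order.
    return [sorted(row_boxes, key=lambda b: b[0]) for _, row_boxes in rows]


# Made-up boxes for a 2 x 4 table whose right side sits slightly higher.
demo_boxes = [
    (310, 10, 90, 40), (10, 22, 90, 40), (210, 14, 90, 40), (110, 18, 90, 40),
    (110, 60, 90, 40), (10, 64, 90, 40), (310, 52, 90, 40), (210, 56, 90, 40),
]
rows = sort_boxes_row_major(demo_boxes)
for r, row in enumerate(rows):
    for c, box in enumerate(row):
        print((r, c), box)
```

With boxes ordered like this, each crop could be named after its (row, column) pair instead of a running counter.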
My actual image contains sensitive business data, so I've made this dummy picture with GIMP. It should be enough for demonstration purposes; you only need to point the box_extraction() function at it.
Given that the incomplete cells to the right should be ignored by the box extraction algorithm, I expected to get 9 x 4 = 36 images named "1.png" (with cell 0,0), "2.png" (0,1) etc., with each consecutive set of 4 corresponding to the cells of a single row (it should then be possible to get all cells of a single column by selecting its header cell from the first row and every fourth image thereafter).
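The numbering I have in mind is plain row-major indexing. The helper names below are hypothetical and only spell out the arithmetic for the 9 x 4 dummy table; a column is its header cell plus a step of 4 in the file numbers.

```python
N_ROWS = 9  # rows in the dummy table
N_COLS = 4  # complete columns in the dummy table


def cell_to_index(row, col, n_cols=N_COLS):
    """Zero-based (row, col) -> one-based number used in '<n>.png'."""
    return row * n_cols + col + 1


def index_to_cell(idx, n_cols=N_COLS):
    """One-based file number -> zero-based (row, col)."""
    return divmod(idx - 1, n_cols)


def column_indices(col, n_rows=N_ROWS, n_cols=N_COLS):
    """File numbers of every cell in one column, header first."""
    return [cell_to_index(r, col, n_cols) for r in range(n_rows)]


print(cell_to_index(0, 1))   # 2, i.e. "2.png" holds cell (0,1)
print(column_indices(0))     # [1, 5, 9, 13, 17, 21, 25, 29, 33]
```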
However, the output images currently end up in a very weird order: "1.png" to "4.png" hold the cells of the first row in reverse order, "5.png" holds the first cell of the second row, "6.png" holds the last cell of the second row, and after that the pattern collapses entirely.
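A possible explanation, which I have not verified against OpenCV's internals: sort_contours() sorts on the y coordinate alone, and Python's sorted() is stable, so boxes with exactly equal y keep whatever order findContours() returned them in, while a pixel or two of skew is enough to shuffle the cells within a row. The snippet below reproduces the pattern with made-up boxes; the detection order (bottom-to-top, right-to-left) and the y values are assumptions chosen for illustration.

```python
# Hypothetical (x, y, w, h) boxes in an assumed detection order.
detected = [
    # second table row, with a few pixels of jitter in y
    (310, 50, 90, 40), (210, 52, 90, 40), (110, 51, 90, 40), (10, 49, 90, 40),
    # first table row, perfectly level
    (310, 10, 90, 40), (210, 10, 90, 40), (110, 10, 90, 40), (10, 10, 90, 40),
]

# Same key that sort_contours(..., method="top-to-bottom") uses: y only.
by_y = sorted(detected, key=lambda b: b[1])

xs = [b[0] for b in by_y]
print(xs)  # [310, 210, 110, 10, 10, 310, 110, 210]
```

The first four boxes come out right-to-left because their y values tie, and the second row is ordered purely by its jitter.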
