Extract data from image containing table grid using Python

Question

I have images such as the one attached below. I need to extract the data within the grid along with the tabular structure and transform it into a dataframe/csv.

I am using OCR to extract the text along with its coordinates, but to recover the table structure I would like to extract the horizontal and vertical grid lines.

Is there a method in OpenCV for this that would generalize well?

So far the approaches I've come across are:

1. Hough lines
2. Extracting rectangular contours
3. Drawing vertical and horizontal contours

Chrys Bltr · Accepted Answer · 2020-05-27 23:56:10Z

2

You can define a grid structure and extract the information from each separate area with OpenCV. Check this article: A Box detection algorithm for any image containing boxes

Everything is explained there step by step.


3 Comments

stepovr Over a year ago

This works quite well but has some inconsistencies in the case of scanned documents. Any idea how to make the extraction more robust?

Chrys Bltr Over a year ago

Glad it was useful! Could you be more specific about which inconsistencies you are talking about?

stepovr Over a year ago

Please check this out : stackoverflow.com/questions/62092264/…

Knight Forked · Accepted Answer · 2020-05-28 05:20:17Z

With all due respect to @Chrys Bltr, the solution in the link is a little overkill. Here's what I think is a simpler solution:

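import cv2
import matplotlib.pyplot as plt

img_rgb = cv2.imread('your/image')  # OpenCV loads images in BGR channel order
if img_rgb is None:
    raise FileNotFoundError('Could not read the image')
img = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)

# Binarize: grid lines and text turn black, cell interiors turn white
th = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 3, 3)

# Find contours on the thresholded image, not on the grayscale one.
# [-2] picks the contour list in both OpenCV 3.x (3 return values) and 4.x (2).
# RETR_EXTERNAL keeps only outermost contours; if the table sits inside a
# white margin, the cells are nested and cv2.RETR_LIST may be needed instead.
ctrs = cv2.findContours(th, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]
im_h, im_w = img.shape
im_area = im_w * im_h
for ctr in ctrs:
    x, y, w, h = cv2.boundingRect(ctr)
    # Filter contours based on size: drop text (too small) and the page (too big)
    if 0.01 * im_area < w * h < 0.1 * im_area:
        cv2.rectangle(img_rgb, (x, y), (x + w, y + h), (0, 255, 0), 2)

# Convert BGR to RGB so matplotlib shows the colors correctly
plt.imshow(cv2.cvtColor(img_rgb, cv2.COLOR_BGR2RGB))
plt.show()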

You can store the rectangle information in the filtering process above and then do the OCR within each individual rectangular area.
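As a rough illustration of that last step, the sketch below groups stored `(x, y, w, h)` rectangles into table rows and writes one CSV line per row. Everything here is an assumption made for the example: the function names, the `row_tol` tolerance, and `read_cell`, a hypothetical callable that returns the OCR text for one rectangle. It also assumes the cells of a row have roughly the same top edge, which may not hold for skewed scans or merged cells.

```python
import csv
import io

def boxes_to_rows(boxes, row_tol=10):
    """Group (x, y, w, h) boxes into rows, each sorted left to right.

    row_tol is a hypothetical tolerance in pixels: a box joins the current
    row if its top edge is within row_tol of that row's first box.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):   # top to bottom
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]

def table_to_csv(rows, read_cell):
    """Build CSV text, calling read_cell(box) to get each cell's content."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([read_cell(box) for box in row])
    return buf.getvalue()
```

With an OCR library such as pytesseract installed, `read_cell` could crop the grayscale image to the box and pass the crop to `pytesseract.image_to_string`. The resulting CSV text can then be loaded into a dataframe if needed.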
