Extract text from an image using OCR in Python

Question

I want to extract text from specific areas of an image, namely the name and ID number on an identity card. The card is a Chinese ID card, so the text is in Chinese. I have tried the code below, but it only extracts the address and date of birth, which I don't need. I only need the name and ID number.

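import os

import cv2
import pytesseract
from PIL import Image

# Load the card, convert it to grayscale, and binarize it with Otsu's threshold
image = cv2.imread("E:/face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Write the binarized image to a temporary file named after the process ID
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# OCR the temporary file with the Simplified Chinese model, then delete it
text = pytesseract.image_to_string(Image.open(filename), lang='chi_sim')
print(text)
os.remove(filename)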

I have also attached the image from which I am trying to extract text. I have tried as far as my knowledge goes but have not succeeded. Any help and guidance would be appreciated.

Show us the error instead. Showing the error would help people here give a solution. If you don't have any idea how to proceed with the problem, look for other tutorials.
– Bal Krishna Jha, Commented Jul 11, 2018 at 5:01
@DevashishPrasad Yes, I am getting this output from my code: (出生 1991年7月14日住址上濂市宝山区渭`鳙七村鹏号5o3雹)
– Tehseen, Commented Jul 11, 2018 at 5:14
@krishna I am asking for help. My existing code doesn't give me the results I want, so I am asking for help here.
– Tehseen, Commented Jul 11, 2018 at 5:16
@Tehseen Can you attach the binary image as well? If information is lost in the binary image itself, the characters won't be recognized.
– ZdaR, Commented Jul 11, 2018 at 5:16

Jeru Luke · Accepted Answer · 2018-07-11 07:47:48Z

7

I can suggest a pre-processing step to run before extracting the text. The code is simple to follow.

Code:

import cv2
import numpy as np

image = cv2.imread(r'C:\Users\Jackson\Desktop\face.jpg')

#--- dilation on the green channel to suppress the dark text ---
dilated_img = cv2.dilate(image[:,:,1], np.ones((7, 7), np.uint8))
#--- median blur of the dilated image gives an estimate of the background ---
bg_img = cv2.medianBlur(dilated_img, 21)

#--- absolute difference from the background, inverted so text is dark on white ---
diff_img = 255 - cv2.absdiff(image[:,:,1], bg_img)

#--- normalizing between 0 and 255 ---
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)
cv2.imshow('norm_img', cv2.resize(norm_img, (0, 0), fx = 0.5, fy = 0.5))

#--- Otsu threshold ---
th = cv2.threshold(norm_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imshow('th', cv2.resize(th, (0, 0), fx = 0.5, fy = 0.5))

#--- keep the windows open until a key is pressed ---
cv2.waitKey(0)
cv2.destroyAllWindows()

Use it and let me know if you are able to find the relevant textual information!
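To target only the name and ID number, one option is to crop those two areas out of the pre-processed image and run Tesseract on each crop separately. The following is a minimal sketch rather than a tested recipe for this card. It assumes the photo has already been cropped and deskewed so that the card fills the frame. The fractional boxes in REGIONS and the helper names to_pixel_box, crop_region and ocr_fields are hypothetical, chosen for illustration, and the boxes would need tuning on real scans.

```python
# Hypothetical field boxes as fractions of the card: (left, top, right, bottom).
# These numbers are guesses for illustration and must be tuned on real scans.
REGIONS = {
    "name": (0.15, 0.08, 0.60, 0.24),
    "id_number": (0.30, 0.78, 0.98, 0.95),
}


def to_pixel_box(frac_box, width, height):
    """Convert a fractional (left, top, right, bottom) box to integer pixels."""
    left, top, right, bottom = frac_box
    return (int(left * width), int(top * height),
            int(right * width), int(bottom * height))


def crop_region(image, frac_box):
    """Return the sub-image for frac_box; image is a NumPy array as used by cv2."""
    height, width = image.shape[:2]
    x0, y0, x1, y1 = to_pixel_box(frac_box, width, height)
    # NumPy images are indexed as [row, column], i.e. [y, x].
    return image[y0:y1, x0:x1]


def ocr_fields(image):
    """Run Tesseract separately on each cropped field and return a dict of strings."""
    # Imported here so the cropping helpers above work without Tesseract installed.
    import pytesseract

    results = {}
    for field, frac_box in REGIONS.items():
        crop = crop_region(image, frac_box)
        # --psm 7 asks Tesseract to treat the crop as a single line of text.
        text = pytesseract.image_to_string(crop, lang="chi_sim", config="--psm 7")
        results[field] = text.strip()
    return results
```

Calling ocr_fields(th) on the thresholded image from the code above would return one string per field. Because each crop contains a single line, the address and date of birth never reach the OCR step.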


4 Comments

Tehseen Over a year ago

I have used your code and I am able to extract the name, which is on the first line of the image, but it still doesn't extract the ID number, which is on the last line of the card. The number is very clear in the image, so I don't know why it isn't extracted. This is the output I am getting from this code: "姓名` 费家杰…翼叠沣瓢男二黾族汉 _ …′^出…生〉翼叠g肝勇7 月斓亘住址上诲市宝山区泗塘七村93 '号503室"′ ′′二"

Tehseen Over a year ago

I converted the original image to grayscale, applied the dilation to that gray image, and then took the absolute difference; the results are now a bit better. I am now getting the ID number, but it's not satisfactory. This is the output: "性别男〈 “ =) 黾族汉… ` _ _′ .…′′z′′ 「出生′ 「叠g′丐菩荠二]7′_眉菩卒垂′暮′日 「` 住址上诲市宝山区泗塘七村腋号503菖] ′…】 … _ ′ ′ 毛 ′ 公民身份号码 '′′"31b『D9i991o蓁141011"

Jeru Luke Over a year ago

@Tehseen I think you have to tweak the dilation parameters a bit more, such as the type and size of the kernel. Also try a median blur to remove the unwanted smaller spots (be careful when choosing that kernel size as well).

Tehseen Over a year ago

I have updated the dilation code like this: "dilated_img = cv2.dilate(gray, np.ones((5, 5), np.uint8))" and "bg_img = cv2.medianBlur(dilated_img, 23)". Now it's better, but there is still something off on the first line, and I only want to extract the name, which is the first line, and the ID number, which is the last line. This is the output I am getting now: 姓名费家加 __ 「`′' 性名u ′男… ' 民族汉 __ 出生 199壕年~7月童4日住址上海市宝山区泗塘七村93 乙工乙道 ′ 公民身份号码 310109199107141011.. Can you guide me on how to target a specific area to extract only the name and ID number?

SRK · Accepted Answer · 2019-06-24 04:20:13Z

0

In pytesseract, lang = 'chi_sim' also tries to interpret the digits as Chinese characters. Use lang = 'eng' to get the numbers recognized properly.
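Following that suggestion, the number line can be OCR'ed on its own with lang='eng' and the result checked before it is trusted. The sketch below is one way to do that, under stated assumptions: read_id_number, find_id_number and has_valid_check_digit are hypothetical helper names, number_strip is assumed to be an image already cropped to the ID number line, and the tessedit_char_whitelist option is ignored by Tesseract 4.0 in LSTM mode. The check digit follows the weighting scheme for 18-character resident ID numbers (GB 11643-1999), which lets the code reject candidates in which OCR misread a digit.

```python
import re

# Check-digit scheme for 18-character resident ID numbers (GB 11643-1999).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = "10X98765432"

# 17 digits plus a final digit or X, not embedded in a longer run of digits.
ID_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def has_valid_check_digit(id_number):
    """Return True if the 18th character matches the weighted sum of the first 17."""
    total = sum(int(digit) * weight for digit, weight in zip(id_number[:17], WEIGHTS))
    return CHECK_CODES[total % 11] == id_number[17].upper()


def find_id_number(ocr_text):
    """Return the first 18-character candidate with a valid check digit, or None."""
    for match in ID_PATTERN.finditer(ocr_text):
        candidate = match.group()
        if has_valid_check_digit(candidate):
            return candidate
    return None


def read_id_number(number_strip):
    """OCR an image cropped to the number line (assumption) and validate the result."""
    # Imported here so the parsing helpers above work without Tesseract installed.
    import pytesseract

    text = pytesseract.image_to_string(
        number_strip,
        lang="eng",
        # --psm 7: single text line; the whitelist restricts output to ID characters.
        config="--psm 7 -c tessedit_char_whitelist=0123456789X",
    )
    return find_id_number(text)
```

Applied to the OCR output quoted in the comments on the first answer, find_id_number picks out the 18 digits that follow 公民身份号码 in the latest attempt, while the garbled reading from the earlier attempt yields None.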

