7

I want to extract text from specific areas of an image, namely the name and ID number on an identity card. It is a Chinese ID card, so the text is in Chinese. I have tried the code below, but it only extracts the address and date of birth, which I don't need. I need just the name and ID number.

import os

import cv2
import pytesseract
from PIL import Image

# Load the card, convert to grayscale, and binarize with Otsu's threshold
image = cv2.imread("E:/face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# Write the binary image to a temporary file named after the process ID
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# OCR the whole image with the Simplified Chinese model, then clean up
text = pytesseract.image_to_string(Image.open(filename), lang='chi_sim')
print(text)
os.remove(filename)

I have also attached the image from which I am trying to extract text. I have tried everything I know without success. Any help and guidance would be appreciated.

This is the binary image

  • Are you getting ? as output from tesseract.... Commented Jul 11, 2018 at 4:58
  • Show us the error instead. Showing the error would help people here give a solution. If you have no idea how to proceed with the problem, look for other tutorials. Commented Jul 11, 2018 at 5:01
  • @DevashishPrasad Yes, I am getting this output from my code: (出生 1991年7月14日 住 址 上濂市宝山区渭`鳙七村鹏 号5o3雹) Commented Jul 11, 2018 at 5:14
  • @krishna I am asking for help. My existing code doesn't give me the results I want, so I am asking here. Commented Jul 11, 2018 at 5:16
  • @Tehseen Can you attach the binary image as well? If information is lost in the binary image itself, the characters won't be recognized. Commented Jul 11, 2018 at 5:16

2 Answers

7

I can suggest a pre-processing step prior to finding textual information. The code is simple to comprehend.

Code:

import cv2
import numpy as np

image = cv2.imread(r'C:\Users\Jackson\Desktop\face.jpg')

#--- dilation on the green channel ---
dilated_img = cv2.dilate(image[:,:,1], np.ones((7, 7), np.uint8))
bg_img = cv2.medianBlur(dilated_img, 21)

#--- finding absolute difference to preserve edges ---
diff_img = 255 - cv2.absdiff(image[:,:,1], bg_img)

#--- normalizing between 0 to 255 ---
norm_img = cv2.normalize(diff_img, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8UC1)
cv2.imshow('norm_img', cv2.resize(norm_img, (0, 0), fx = 0.5, fy = 0.5))

(image: the normalized result, norm_img)

#--- Otsu threshold ---
th = cv2.threshold(norm_img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
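cv2.imshow('th', cv2.resize(th, (0, 0), fx = 0.5, fy = 0.5))

#--- keep the preview windows open until a key is pressed ---
cv2.waitKey(0)
cv2.destroyAllWindows()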

(image: the thresholded result, th)

Use it and let me know if you are able to find the relevant textual information!
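Since the question asks for only the name and the ID number, one option is to crop those two regions out of the thresholded image and run Tesseract on each crop separately. The following is a minimal sketch, not a tested solution for this card: the fractional boxes in FIELD_BOXES and the helper names to_pixels and ocr_fields are hypothetical and would have to be tuned by hand, and it assumes the card fills the frame in roughly the same position every time.

```python
# Hypothetical fractional boxes (left, top, right, bottom), relative to the
# image size. These values are guesses and must be tuned for the real card.
FIELD_BOXES = {
    "name": (0.17, 0.08, 0.60, 0.24),
    "id_number": (0.30, 0.78, 0.95, 0.95),
}


def to_pixels(box, width, height):
    """Convert a fractional box to integer pixel coordinates (x0, y0, x1, y1)."""
    left, top, right, bottom = box
    return (int(left * width), int(top * height),
            int(right * width), int(bottom * height))


def ocr_fields(binary_img, boxes=FIELD_BOXES, lang="chi_sim"):
    """Run Tesseract on each cropped field of a preprocessed grayscale image."""
    # Imported here so to_pixels() can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    height, width = binary_img.shape[:2]
    results = {}
    for field, box in boxes.items():
        x0, y0, x1, y1 = to_pixels(box, width, height)
        crop = binary_img[y0:y1, x0:x1]  # NumPy slicing: rows first, then columns
        # "--psm 7" asks Tesseract to treat the crop as a single line of text.
        text = pytesseract.image_to_string(
            Image.fromarray(crop), lang=lang, config="--psm 7")
        results[field] = text.strip()
    return results
```

With the thresholded image th from the code above, ocr_fields(th) would return a dict with one string per field. Fixed boxes only work if the card is always aligned the same way; if it is rotated or photographed at an angle, the card would need to be located and straightened first.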


4 Comments

I have used your code and I am able to extract the name, which is on the first line of the image, but it still doesn't extract the ID number on the last line of the card. The number is very clear in the image, so I don't know why it isn't extracted. This is the output I am getting from this code: "姓名` 费家杰…翼 叠沣瓢 男二 黾族汉 _ …′^出…生〉 翼叠g肝勇7 月斓亘 住址 上诲市宝山区泗塘七村93 '号503室"′ ′′二"
I converted the original image to grayscale, applied dilation to that gray image, and then took the absolute difference, and the results are a bit better. I am now getting the ID number, but it is not satisfactory. This is the output: "性别 男〈 “ =) 黾族汉… ` _ _′ .…′′z′′ 「出 生′ 「叠g′丐菩荠二]7′_眉菩卒垂′暮′日 「` 住 址 上诲市宝山区泗塘七村腋 号503菖] ′…】 … _ ′ ′ 毛 ′ 公民身份号码 '′′"31b『D9i991o蓁141011"
@Tehseen I think you have to tweak the dilation parameters a bit more, such as the type and size of the kernel. Also try a median blur to remove the unwanted smaller spots (be careful when choosing that kernel size as well).
I have updated the dilation code to "dilated_img = cv2.dilate(gray, np.ones((5, 5), np.uint8))" and "bg_img = cv2.medianBlur(dilated_img, 23)". It is better now, but there is still some noise on the first line, and I only want to extract the name, which is the first line, and the ID number, which is the last line. This is the output I am getting now: 姓 名 费家加 __ 「`′' 性名u ′男… ' 民族汉 __ 出生 199壕年~7月童4日 住 址 上海市宝山区泗塘七村93 乙工乙道 ′ 公民身份号码 310109199107141011. Can you guide me on how to target a specific area so that only the name and ID number are extracted?
0

In pytesseract, lang='chi_sim' also tries to interpret the digits as Chinese characters. Use lang='eng' for the ID number so that the digits are OCR'ed properly.
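As a rough sketch of that idea, the ID number line can be OCR'ed on its own with the English model, restricted to the characters an ID number can contain, and then sanity-checked. The helper names here are hypothetical, and id_crop is assumed to be an image that contains only the number line. tessedit_char_whitelist is a Tesseract configuration variable; as far as I know it is ignored by the LSTM engine in Tesseract 4.0, so the regular-expression check is applied regardless. The check assumes the usual 18-character format (17 digits followed by a digit or X), which matches the number seen in the comments above.

```python
import re


def digit_config(allowed="0123456789X"):
    """Tesseract options: single text line, restricted to ID-number characters."""
    return "--psm 7 -c tessedit_char_whitelist=" + allowed


def clean_id_number(raw):
    """Strip everything except digits and X; return None if the result does
    not look like an 18-character ID number (17 digits plus a digit or X)."""
    candidate = re.sub(r"[^0-9Xx]", "", raw).upper()
    return candidate if re.fullmatch(r"[0-9]{17}[0-9X]", candidate) else None


def ocr_id_number(id_crop):
    """OCR a crop that contains only the ID number line (assumed input)."""
    # Imported here so the helpers above can be used without Tesseract installed.
    import pytesseract

    raw = pytesseract.image_to_string(id_crop, lang="eng", config=digit_config())
    return clean_id_number(raw)
```

Returning None for an implausible result makes it easy to detect a bad read and retry with different pre-processing instead of silently accepting a wrong number.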
