Is it possible to use OCR to extract only the text with a specific color? IronOCR

Question

I have some PNG files with multiple sentences in two different colors: black (Davy Gray) and light brown (Mushroom).

Like this:

I'm only interested in the black text, so I tried changing the light brown text to the background color using Input.ReplaceColor. However, there are many shades of that color, and the small residues left behind always produce some weird characters in the OCR result.

Here's my current code:

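// i is defined by the surrounding code (not shown) and selects the capture to read.
var Ocr = new IronTesseract();
using (var Input = new OcrInput())
{
    // Restrict OCR to the region of the screenshot that contains the text.
    var ContentArea = new Rectangle() { X = 872, Y = 130, Height = 900, Width = 725 };
    Input.AddImage(@"C:\OCR\Capture (" + i + ").PNG", ContentArea);
    // Replace the light brown text color with the background color (tolerance 25).
    Input.ReplaceColor(Color.FromArgb(185, 163, 143), Color.FromArgb(235, 226, 216), 25);
    Input.Sharpen();
    Input.ToGrayScale();
    var Result = Ocr.Read(Input);
    // Append the recognized text and scroll to the end of the text box.
    richTextBox1.AppendText(Result.Text + Environment.NewLine);
    richTextBox1.SelectionStart = richTextBox1.Text.Length;
    richTextBox1.ScrollToCaret();
}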

Edit: The answer is "No" for now; hopefully they release this feature in the future.

The only option for now is to play with colors until you find the best parameters.

If you know of a better alternative to IronOCR that is free (even if only for development), I'll gladly take it.

Try replacing the background with pure white, play with the tolerance parameter on OcrInput.ReplaceColor() and maybe use the same method to make the grey text black. Sharpen() may actually be working against you by darkening faint blemishes. Bottom line: there is probably no definitive general answer here, just trial-and-error fine-tuning for your image.
– AlanK, Commented Aug 7, 2021 at 20:33
Thank you, the black-and-white technique gave me better results, but removing Sharpen() gives me worse results (from one weird character to 20 per line). I tried playing with the tolerance a lot, but a higher value ends up damaging the black text too.
– Yox, Commented Aug 9, 2021 at 22:26

Amin Dodin · Accepted Answer · 2021-08-26 19:06:50Z

The answer below was edited in response to a comment.

Since the color you wish to eliminate is not a single shade, you could search for all pixels in a color range and replace them all with the background color.
I haven’t used IronTesseract before, so I don’t know if it has this feature, but you can use Windows Bitmap functions to do it as follows:

using System.Drawing;
using System.Drawing.Imaging;

// Dispose the bitmap when done so the file handle and GDI+ resources are released.
using (var image = new Bitmap("BsRyL.png"))
{
    Color c1 = Color.FromArgb(180, 157, 136); // lower bound of the range to erase
    Color c2 = Color.FromArgb(238, 228, 219); // upper bound of the range to erase
    Color bkColor = Color.FromArgb(235, 226, 216); // background color
    for (int x = 0; x < image.Width; x++)
        for (int y = 0; y < image.Height; y++)
        {
            Color c = image.GetPixel(x, y);
            // Replace the pixel only if all three channels fall inside the range.
            if (c.R >= c1.R && c.R <= c2.R && c.G >= c1.G && c.G <= c2.G && c.B >= c1.B && c.B <= c2.B)
                image.SetPixel(x, y, bkColor);
        }
    image.Save("FilledWithBackgroundNL.png", ImageFormat.Png);
}

The image, with that color range filled with the background color, looks like this:

This pixel-by-pixel manipulation is suitable if your images are all small like the sample you provided or you don’t care about performance. If you’re dealing with larger images (in the megapixel range), working with individual pixels can be slow.
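If GetPixel/SetPixel turns out to be the bottleneck, one common way to reduce the overhead without a toolkit is to run the same range test over a raw pixel buffer. The following is only a minimal sketch under stated assumptions: `PixelRangeFilter` and `ReplaceRange` are hypothetical names, and the buffer is assumed to hold tightly packed 24-bit pixels in blue, green, red byte order with no row padding. I have not benchmarked it against the other two versions.

```csharp
using System;

public static class PixelRangeFilter
{
    // Replaces every pixel whose R, G and B values all fall inside
    // [lower, upper] with the fill color, and returns how many pixels changed.
    // Assumption: bgr holds tightly packed 24bpp pixels in B, G, R byte order;
    // row padding (stride) is not handled in this sketch.
    public static int ReplaceRange(
        byte[] bgr,
        (byte R, byte G, byte B) lower,
        (byte R, byte G, byte B) upper,
        (byte R, byte G, byte B) fill)
    {
        if (bgr == null)
            throw new ArgumentNullException(nameof(bgr));
        if (bgr.Length % 3 != 0)
            throw new ArgumentException("Buffer length must be a multiple of 3.", nameof(bgr));

        int replaced = 0;
        for (int i = 0; i < bgr.Length; i += 3)
        {
            byte b = bgr[i], g = bgr[i + 1], r = bgr[i + 2];

            // Same per-channel range test as the GetPixel version above.
            if (r >= lower.R && r <= upper.R &&
                g >= lower.G && g <= upper.G &&
                b >= lower.B && b <= upper.B)
            {
                bgr[i] = fill.B;
                bgr[i + 1] = fill.G;
                bgr[i + 2] = fill.R;
                replaced++;
            }
        }
        return replaced;
    }
}
```

With System.Drawing, such a buffer would typically come from Bitmap.LockBits with PixelFormat.Format24bppRgb, copied out and back with Marshal.Copy. In that case each row may be padded to the stride reported by BitmapData, so the loop would need to step row by row rather than over one flat array.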

Another way to do this is to use an imaging toolkit such as LEADTOOLS (Disclaimer: I’m a LEADTOOLS employee). The code looks like this:

Leadtools.Codecs.RasterCodecs codecs = new Leadtools.Codecs.RasterCodecs();
Leadtools.RasterImage image = codecs.Load("BsRyL.png");
var c1 = new Leadtools.RasterColor(180, 157, 136); //lower color
var c2 = new Leadtools.RasterColor(238, 228, 219); //upper color
image.AddColorRgbRangeToRegion(c1, c2, Leadtools.RasterRegionCombineMode.Set);
var backgroundColor = new Leadtools.RasterColor(235, 226, 216);
Leadtools.ImageProcessing.FillCommand fill = new Leadtools.ImageProcessing.FillCommand(backgroundColor);
fill.Run(image);
codecs.Save(image, "FilledWithBackground.png", Leadtools.RasterImageFormat.Png, 24);

This could be useful if the images are large and higher performance is needed.

edited Aug 26, 2021 at 19:06

answered Aug 15, 2021 at 18:22

Amin Dodin

2 Comments

Yox Over a year ago

That sounds more like an ad than an answer. I'm just trying to make a small tool that I'll share for free on GitHub; I can't pay $3995 for that. Thanks anyway.

Amin Dodin Over a year ago

I have added code that doesn’t use my company’s product. For small images, the new code in the answer is quite sufficient, and up to a certain image size, can be faster than using a professional imaging toolkit. However, when I tested it on an 8 megapixel image (typical scanned Letter-size page at 300 DPI), the LEADTOOLS code was more than 9 times faster. Regarding the price, the color-replace code above uses the LEADTOOLS Imaging Pro toolkit, which is only $795. The more expensive toolkits are for advanced features such as medical communication or document imaging.
