Scene Text Detection, Recognition & Translation


1. Overview/ Problem Statement

  • The need for computer vision in text detection and recognition from images or video is growing rapidly, because text is a reliable and effective way to spread or acquire information across time and space. In this sense, text constitutes the cornerstone of human civilization.
  • This approach can be used for handwriting recognition, natural scene text detection and recognition, vehicle number detection and recognition, and many more. But multiple challenges may still be encountered when detecting and recognizing text in the scene.
  • A few of the challenges are as follows:
    1] Text in images exhibits much higher diversity and variability, especially for natural scene images. For example, instances of scene text can be in different languages, colors, fonts, sizes, orientations, and shapes.
    2] The backgrounds of natural scenes are virtually unpredictable. There might be patterns extremely similar to text (e.g., tree leaves, traffic signs, bricks, windows, and stockades), or occlusions caused by foreign objects, which may potentially lead to confusion and mistakes.
    3] In some circumstances, the quality of text images and videos cannot be guaranteed. In poor imaging conditions, text instances may have low resolution and severe distortion due to an inappropriate shooting distance or angle, blur caused by defocus or camera shake, noise due to low light, or corruption by highlights or shadows.

2. Real-world/Business objectives and constraints

  • The main objective of this case study is to build a system that can detect and recognize text from a natural scene image and then translate it into another language that an end-user can understand.
  • Captured natural-scene images may be blurred, noisy, of low quality, or multi-oriented (rotated/curved), so we have to deal with these problems as well.
  • To overcome this, several preprocessing steps are needed to deblur and de-noise the image before detecting text in those regions.

3. Data

  • The scope of this project is limited to only one language for detecting text and then converting it to another language after recognition.
  • For this project, I’ve chosen the ICDAR 2015 dataset which contains images for English word-level text.
    a) The ICDAR15 dataset contains 1,500 images: 1,000 for training and 500 for testing. Specifically, it contains 2,077 cropped text instances, including more than 200 irregular text samples.
    b) As the text images were taken with Google Glass without ensuring image quality, most of the text is very small, blurred, and multi-oriented.
    c) No lexicon is provided.
  • The whole dataset can be downloaded from here.

Importing packages

Data Overview

  • The ICDAR 2015 images are blurred, noisy, multi-oriented (rotated or curved), and of low quality.
  • A training set of 1,000 images containing about 4,500 readable words is provided through the downloads section.
  • Different ground truth data is provided for each task.
  • All images are provided as JPG files and the text files are UTF-8 files with CR/LF newline endings.
  • The ground truth is given as separate text files (one per image) where each line specifies the coordinates of one word’s bounding box and its transcription in a comma-separated format.
  • There is a total of 1000 train images and 500 test images.
  • The total size of the ICDAR15 dataset for both train and test images is 129 MB.
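The ground-truth format described above can be parsed with a few lines of plain Python. This is a sketch assuming the documented layout: eight comma-separated coordinates, then the transcription, with "###" marking "Do Not Care" regions:

```python
def parse_gt_line(line):
    """Parse one ICDAR15 ground-truth line:
    x1,y1,x2,y2,x3,y3,x4,y4,transcription
    Anything after the eighth comma is part of the transcription."""
    line = line.strip().lstrip("\ufeff")   # the UTF-8 files may carry a BOM
    parts = line.split(",", 8)             # split on the first 8 commas only
    coords = list(map(int, parts[:8]))
    box = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    text = parts[8]
    return box, text, text == "###"        # flag "Do Not Care" regions

box, text, ignore = parse_gt_line("377,117,463,117,465,130,378,130,Genaxis Theatre")
```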

4. Exploratory Data Analysis

4.1. Displaying a few sample datapoints

Displaying a few random images

Sample file (gt_img_1.txt) which has ground truth for each word of img_1.jpg with their co-ordinates

Note: Anything that follows the eighth comma is part of the transcription, and no escape characters are used. “Do Not Care” regions are indicated in the ground truth with a transcription of “###”.

Plotting ground truth of one sample file (gt_img_1.txt) which has ground truth for each word of img_1.jpg with their co-ordinates

4.2. Creating a data frame with each image location and its corresponding ground truth
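A sketch of this step, assuming the images and ground-truth files live in `train_images/` and `train_gt/` (hypothetical folder names) with the img_N.jpg / gt_img_N.txt naming convention shown above:

```python
import os
import pandas as pd

def build_dataframe(img_dir, gt_dir, n_images):
    """Pair each image with its ground-truth file in one DataFrame."""
    rows = []
    for i in range(1, n_images + 1):
        rows.append({
            "image": os.path.join(img_dir, f"img_{i}.jpg"),
            "ground_truth": os.path.join(gt_dir, f"gt_img_{i}.txt"),
        })
    return pd.DataFrame(rows)

df = build_dataframe("train_images", "train_gt", 1000)
```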

4.3. Total text instances and unique text instances in the dataset

4.4. Getting the number of images with different dimension, channels & extension


  • As we can observe, the randomly displayed images above do not show small text regions clearly, which means the images are blurred.
  • The whole dataset is 129 MB and contains a total of 1,500 images: 1,000 for train and 500 for test.
  • So on average, each image is about 88 KB, which is very small.
  • All images in the dataset have the same dimensions (720 × 1280).
  • All images have 3 channels, which means they are all colored images with RGB channels.
  • All images have the jpg extension.

5. Baseline or traditional methods for text detection & recognition

5.1. Text Detection using MSER


  • As we can observe, MSER has limited performance on blurred or noisy images and on textured images.
  • Both cases are actually related to image scale, since blur (which can distort the shapes of extracted MSERs) is equivalent to image down-scaling.
  • MSER is not suitable for rotated or curved, word-level text detection because it can detect one word as multiple characters.
  • Also, as we can see, MSER detects some unwanted, non-textual regions in our case.

5.2. Text Recognition using Pytesseract


  • Pytesseract is not 100% accurate and has its own limitations.
  • As we can observe, blurred, rotated, or small text instances are either recognized incorrectly or not recognized at all.
  • So, the Pytesseract OCR engine recognizes text accurately mostly for horizontal text instances, but for rotated or curved text instances it may not work well.

6. EAST (Efficient accurate scene text detector) text detection with pytesseract text recognition
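EAST outputs a score map and a 5-channel geometry map (four distances to the box edges plus a rotation angle) on a 4×-downsampled grid. The decoding step that turns these into candidate boxes can be sketched in plain NumPy; the network itself is usually loaded with `cv2.dnn.readNet` from a frozen EAST graph, which is omitted here, and this simplified decode skips rotating the offsets by the angle for readability:

```python
import numpy as np

def decode_east(scores, geometry, score_thresh=0.5):
    """Turn EAST score/geometry maps into (cx, cy, w, h, angle, score)
    candidates. scores: (H, W); geometry: (5, H, W), holding the
    distances top, right, bottom, left and the rotation angle per cell.
    Simplified: the angle rotation of the offsets is omitted."""
    boxes = []
    ys, xs = np.where(scores > score_thresh)
    for y, x in zip(ys, xs):
        top, right, bottom, left, angle = geometry[:, y, x]
        cx, cy = x * 4.0, y * 4.0          # each cell covers a 4x4 patch
        w, h = left + right, top + bottom
        boxes.append((cx, cy, w, h, float(angle), float(scores[y, x])))
    return boxes  # in practice these then go through non-maximum suppression

# Tiny synthetic maps: one confident cell at (y=2, x=3)
scores = np.zeros((4, 6), dtype=np.float32)
scores[2, 3] = 0.9
geometry = np.zeros((5, 4, 6), dtype=np.float32)
geometry[:4, 2, 3] = [8, 20, 8, 20]        # top, right, bottom, left
cands = decode_east(scores, geometry)
```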


  • The EAST text detection model with pytesseract text recognition works well compared to our baseline models.
  • The EAST model finds bounding boxes of text instances very well for clearly visible instances, but for small, rotated, or noisy instances it doesn't work very well.
  • The pytesseract OCR engine recognizes text accurately mostly for horizontal text instances, but for rotated or curved text instances it does not work as per our expectations.
  • So, we can conclude that EAST with pytesseract detection and recognition works well, but only for horizontal text in good-quality images.
  • In our case, however, its performance is inconsistent: it works well in only a few cases.

7. EasyOCR text detection & recognition
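A sketch of EasyOCR usage: `readtext` returns a list of (bounding box, text, confidence) triples. The real call is kept inside a function (EasyOCR downloads its models on first use), and the post-processing is demonstrated on a hand-made result in the same format:

```python
def run_easyocr(image_path):
    """Real usage: downloads detection/recognition models on first call."""
    import easyocr                       # assumed installed via pip
    reader = easyocr.Reader(["en"])      # English-only reader
    return reader.readtext(image_path)

def summarize(results, min_conf=0.3):
    """Keep confident predictions as (text, confidence) pairs."""
    return [(text, round(conf, 2))
            for box, text, conf in results if conf >= min_conf]

# Hand-made result in easyocr's output format
fake = [([[377, 117], [463, 117], [465, 130], [378, 130]], "Genaxis Theatre", 0.91),
        ([[493, 115], [519, 115], [519, 131], [493, 131]], "[06]", 0.12)]
kept = summarize(fake)
```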

  • As we can see, EasyOCR gives better results than EAST text detection with pytesseract text recognition.
  • It detects text regions quite accurately, even rotated and blurred ones, but there are still some mistakes while recognizing the text.
  • So, to build an accurate text detection and recognition model, we can combine EasyOCR's text detection results with a different, much more accurate text recognition model.

8. Text Recognition

  • We have seen that EasyOCR does a pretty good job of detecting text regions in an image, and its recognition is also reasonably good.
  • But as we know, our ICDAR15 dataset has lots of rotated, blurred, low-resolution text instances, due to which text recognition fails in some cases.
  • So, to get adequate text recognition results, we'll propose a different recognition model, and then combine EasyOCR's text detection results with it to get more accurate predictions.
  • Most of the text regions are blurry or rotated (clockwise as well as anti-clockwise), so to recognize them more accurately we need to apply some kind of transformation to the images.
  • In recent years, the spatial transformer network has become very popular for image transformation, as it allows a neural network to learn how to perform spatial transformations on the input image in order to enhance the geometric invariance of the model.
  • For example, it can crop a region of interest, scale it, correct the orientation of an image, and so on.
  • After getting N detected text regions, we process those regions independently of each other.
  • The processing of the N regions is handled by a CNN; we use a variant of the ResNet architecture for feature extraction, which achieves good results for our recognition network.
  • We also integrate a BiLSTM (bi-directional Long Short-Term Memory) sequence model to improve the results by learning the sequence not only from beginning to end but also from end to beginning.
  • Our text recognition model gives much better results than pytesseract and EasyOCR text recognition, but it can still be improved further.
  • To do so, we have applied our custom spell-correction model on top of the text recognition predictions.
  • After applying the spell-correction model, we get an 86% accurate result.
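The custom spell-correction model itself is not listed; a minimal edit-distance-based sketch of the idea (replace an OCR prediction with the nearest vocabulary word, if one is close enough) might look like this, with a tiny hypothetical vocabulary:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, vocab, max_dist=2):
    """Replace an OCR prediction with the nearest vocabulary word
    if one is close enough; otherwise keep the prediction."""
    best = min(vocab, key=lambda v: edit_distance(word.lower(), v))
    return best if edit_distance(word.lower(), best) <= max_dist else word

vocab = ["theatre", "exit", "coffee", "tickets"]
fixed = correct("Theatr3", vocab)
```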

9. Text Translator (English to Hindi)

  • For translating the predicted English text to Hindi, we'll use the englisttohindi Python package.
  • This package provides a class named EngtoHindi() which translates English text (single words as well as sentences) to Hindi very accurately.
'कॉफी का स्वाद बहुत अच्छा है।'
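The translation call can be sketched as below; to my understanding of the englisttohindi package, `EngtoHindi(text).convert` returns the translated string, and it performs the translation online, so the call is kept inside a function here (assuming the package is installed and network access is available):

```python
def translate_to_hindi(text):
    """Translate English text (a word or a sentence) to Hindi
    using the englisttohindi package; requires network access."""
    from englisttohindi.englisttohindi import EngtoHindi
    return EngtoHindi(text).convert      # the translated string

# Example usage (network call, so not executed here):
# translate_to_hindi("Good morning")     # returns the Hindi translation
```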

10. Final End-to-End Scene Text Detection, Recognition & Translation

Building the final hybrid text detection, recognition & translation model by combining EasyOCR text detection predictions, our own pre-trained text recognition model, and the pre-trained English-to-Hindi translation model.
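The shape of such a hybrid pipeline can be sketched as a simple composition of the three components; the stub functions below are placeholders standing in for EasyOCR detection, our recognition model, and the EngtoHindi translator:

```python
def end_to_end(image, detect, recognize, translate):
    """Hybrid pipeline: detect text regions, recognize each one,
    then translate the recognized text to Hindi."""
    results = []
    for box in detect(image):
        text = recognize(image, box)
        results.append({"box": box, "text": text, "hindi": translate(text)})
    return results

# Demo with stub components (placeholders, not the real models)
detect = lambda img: [[(0, 0), (10, 0), (10, 5), (0, 5)]]
recognize = lambda img, box: "coffee"
translate = lambda text: "<hindi:" + text + ">"
out = end_to_end(None, detect, recognize, translate)
```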

10.1 Utility function

10.2 Final model

10.3 Accuracy of the final model

Character level accuracy: 0.8461538461538461
Word level accuracy: 0.6
Character level accuracy: 1.0
Word level accuracy: 1.0
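The exact metric code is not shown in the article; character- and word-level accuracies along these lines can be computed with a sketch like this (character accuracy as 1 minus normalized edit distance, word accuracy as the exact-match rate):

```python
def _edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def char_accuracy(truth, pred):
    """1 - normalized edit distance over the space-joined words."""
    t, p = " ".join(truth), " ".join(pred)
    return 1.0 - _edit_distance(t, p) / max(len(t), len(p), 1)

def word_accuracy(truth, pred):
    """Fraction of words predicted exactly right, position-wise."""
    hits = sum(t == p for t, p in zip(truth, pred))
    return hits / max(len(truth), 1)

truth = ["Genaxis", "Theatre"]
pred = ["Genaxis", "Theatr3"]
```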


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
