Scene Text Detection And Recognition Using EAST And Tesseract


Detecting and recognizing text in a natural scene image using EAST and Tesseract.

Image By Paritosh Mahto
This article includes:
1. Introduction
2. Real World Problem
2.2 Problem Statement
2.3 Business Objectives and Constraints
3. Datasets available for Text Detection and Recognition
4. Exploratory Data Analysis (EDA)
5. Methods for text detection before the Deep Learning Era
6. EAST (Efficient Accurate Scene Text Detector)
7. Implementation
8. Model Analysis & Model Quantization
9. Deployment
10. Future Work
11. References

1. Introduction

In this era of digitization, the need to extract textual information from different sources has grown considerably. Fortunately, recent advances in Computer Vision have eased the burden of text detection and other document analysis and understanding tasks. In Computer Vision, the method of converting the text present in images or scanned documents into a machine-readable format that can later be edited, searched, and used for further processing is known as Optical Character Recognition (OCR).

Applications of OCR

a. Information Retrieval and Automatic Data Entry: OCR plays a very important role for many companies and institutions that have thousands of documents to process, analyze, and transform to carry out day-to-day operations.

For example, in banking, information such as account details and the amount on a cheque can easily be extracted using OCR. Similarly, at airports, information can be extracted from passports during passport checks. Other examples include information retrieval from receipts, invoices, forms, statements, contracts, etc.

b. Vehicle Number Plate Recognition: OCR can also be used to recognise vehicle registration plates, which can then be used for vehicle tracking, toll collection, etc.

c. Self-Driving Cars: You might be wondering how and where OCR is used in self-driving cars. The answer is in recognizing traffic signs. An autonomous car uses OCR to recognize traffic signs and act accordingly. Without this, a self-driving car would pose a risk to both pedestrians and other vehicles on the road.

In this article, we will discuss and implement deep learning algorithms used in OCR.

Image By Paritosh Mahto

Digitization: the conversion of text, pictures, or sound into a digital form that can be processed by a computer.

2. Real World Problem

We are now familiar with various applications of text detection and recognition. This article discusses the detection and recognition of text in natural scene images.

In our case, we use any natural image or scene (not specifically documents, licences, or vehicle number plates). For a given image, we want to localize the characters, words, or sentences with bounding boxes, and then recognize the localized text, which can be in any language. The general workflow diagram is shown below:

Image By Paritosh Mahto

The image used above only illustrates the overall task. For this case study, we will use a random natural scene as the input image.

2.2 Problem Statement

For a given natural scene image, the objective is to detect the text regions by drawing bounding boxes around them and then to recognize the detected text.

2.3 Business Objectives and Constraints

  • Text in natural scene images can appear in different languages, colours, fonts, sizes, orientations, and shapes. We have to handle this higher diversity and variability.
  • Natural scenes may have backgrounds with patterns or objects whose shapes closely resemble text, which creates problems while detecting text.
  • Images may be degraded (low quality, low resolution, or multi-oriented text).
  • Low latency is required to detect, recognize and translate the text in the images in real-time.

3. Datasets available for Text Detection and Recognition

Many publicly available datasets can be used for this task. The different datasets, with their year of release, number of images, text orientation, language and important features, are listed below.

Image Source — Paper(Scene Text Detection and Recognition)

Not all datasets work well for all deep learning models because of unstructured text, different orientations, etc. For this task I am selecting the ICDAR 2015 data, as it is easily available with a sufficient number of images for non-commercial usage. The text in these images is in English, which suits me: as a beginner, I want to focus on understanding how the algorithm solves this task. Also, the images in this dataset are small, with multi-oriented and blurred text, which lets me do more experiments with the detection part.

3.1 Dataset Overview & Description

Data Source- Downloads — Incidental Scene Text — Robust Reading Competition (

ICDAR 2015 is provided by the International Conference on Document Analysis and Recognition (ICDAR).

The dataset was provided for the Incidental Scene Text 2015 challenge, one of the challenges of the Robust Reading Competition.


  • The dataset is available as train and test sets with ground truth information for each set. It contains a total of 1,500 images, of which 1,000 are for training and 500 for testing. It also contains 2,077 cropped text instances, including more than 200 irregular text samples.
  • The images are obtained from wearable cameras.

4. Exploratory Data Analysis(EDA)

  • After downloading the data, all files are structured in the following way:
Data (main directory)
|----- ICDAR2015
       |----- train (containing all image files)
       |----- train_gt (containing texts and coordinates)
       |----- test (containing all image files)
       |----- test_gt (containing texts and coordinates)
  • Using the following code, other information such as image dimensions and the number of channels was inspected.

For Train Images

For Test Images

  • We can also conclude from the bar plots that the heights and widths of all the images are the same, i.e. 720 and 1280.

For train images

Image By Paritosh Mahto

For test images

Image By Paritosh Mahto
  • Plotting Original Image and Image with Bounding Boxes with the help of ground truth Information
  • For Train Images
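To draw the boxes, each ground-truth line first has to be parsed. The ICDAR 2015 ground-truth files store one text instance per line as eight comma-separated corner coordinates followed by the transcription. A minimal sketch of such a parser, assuming that layout (the function name `parse_gt_line` is hypothetical and not taken from the author's code):

```python
def parse_gt_line(line):
    """Parse one ground-truth line into (quad, text, ignore).

    Assumed layout: x1,y1,x2,y2,x3,y3,x4,y4,transcription
    """
    # Some files start with a UTF-8 byte-order mark; remove it if present.
    parts = line.strip().lstrip("\ufeff").split(",")
    coords = [int(v) for v in parts[:8]]
    # Group the flat list into four (x, y) corner points.
    quad = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    # The transcription itself may contain commas, so re-join the rest.
    text = ",".join(parts[8:])
    # '###' marks a region whose transcription is not available.
    ignore = text == "###"
    return quad, text, ignore
```

The returned `quad` can be passed to any polygon-drawing routine, and instances with `ignore` set can be skipped or drawn in a different colour.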

Conclusions Drawn From EDA

  • In the ICDAR 2015 dataset, all the images have the same size (720×1280) and extension (.jpg).
  • The train set has 1,000 images, whereas the test set has 500 images.
  • The height and width of all images are the same, so we need not compute the mean height and mean width.
  • In most of the images, the text occupies small regions and the images are blurred.
  • All text is in English. Some transcriptions are not available and are replaced with ‘###’.
  • Most text instances are single words rather than characters or sentences, and the words are multi-oriented. We have to build a model that can also predict this blurred text.

5. Methods for text detection before the Deep Learning Era

As mentioned in the problem statement, we first have to localize the text in the images, i.e. detect it, and then recognize the detected text. For detection, we will first try a few methods that were used before the deep learning era.

a. MSER(Maximally Stable Extremal Regions)

b.SWT(Stroke Width Transform)

The outputs from both methods are not very clear. With the first method, boxes are drawn around regions of the image that contain no text. With the second method, the text is not properly detected.

There are many deep learning algorithms for text detection and recognition. In this article, we will talk about the EAST detector and try to implement it with the help of the research paper on the EAST algorithm. For recognition, we will try the pre-trained Tesseract model.

6. EAST (Efficient Accurate Scene Text Detector)

It is a fast and accurate scene text detection method and consists of two stages:

1. It uses a fully convolutional network (FCN) model to directly generate pixel-level word or text-line predictions.

2. The generated text predictions (rotated rectangles or quadrangles) are sent to non-maximum suppression to produce the final result.

The pipeline is shown below:

Network Architecture-(with PVANet)

PVANet: a lightweight feature extraction network architecture for object detection, which achieves real-time object detection performance without losing accuracy.

The model can be divided into three parts: the feature extractor stem, the feature merging branch and the output layer.

i. Feature Extractor (PVANet)

This part can be any convolutional neural network with interleaving convolutional and pooling layers, pre-trained on ImageNet data, for example PVANet, VGG16 or ResNet50. From this network, four levels of feature maps, f1, f2, f3 and f4, can be obtained. Because it extracts features, it is called a feature extractor.

ii. Feature Merging Branch

In this part, the feature map from the previous merging stage is first fed to an unpooling layer to double its size, and then concatenated with the current feature map from the feature extractor. Next, a 1×1 convolution bottleneck reduces the number of channels and the amount of computation, followed by a 3×3 convolution that fuses the information to produce the final output of each merging stage, as shown in the figure.

The calculation of gi and hi for each merging stage is shown in the figure below.


gi is an intermediate state and the basis of merging

hi is the merged feature map
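For reference, the merging rule as given in the EAST paper can be written out as follows. Here f_i are the feature maps from the extractor, [· ; ·] denotes concatenation along the channel axis, and the operator names are notation only:

```latex
g_i =
\begin{cases}
\mathrm{unpool}(h_i) & \text{if } i \le 3 \\
\mathrm{conv}_{3\times3}(h_i) & \text{if } i = 4
\end{cases}
\qquad
h_i =
\begin{cases}
f_i & \text{if } i = 1 \\
\mathrm{conv}_{3\times3}\big(\mathrm{conv}_{1\times1}([g_{i-1};\, f_i])\big) & \text{otherwise}
\end{cases}
```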

iii. Output Layer

The final output from the merging branch is passed through a 1×1 conv layer with 1 channel, which gives a score map with values in the range [0, 1]. The final output is also passed through the RBOX or QUAD geometry branch (described in the figure below), which gives a multi-channel geometry map.

The details of the score map and geometry map will be discussed during the implementation.

7. Implementation

For implementation, we will follow the above-shown pipeline-

Step-1- Data Preparation & Data Generation( DATA PIPELINE )

Image By Paritosh Mahto

In this step we have to prepare the data and build a generator function that yields an image array (the input of the model) together with the score map (output), the geo-map (output) and the training mask. As you can observe in the above figure, these correspond to the outputs of the multi-channel FCN.

Score Map:

It represents the confidence score for the predicted geometry map at that location. It lies in the range [0, 1]. Let’s understand it with an example:

Let’s say the score map value of a pixel is 0.80. This simply means that we are 80% confident that this pixel has the predicted geometry, or in other words that there is an 80% chance that the pixel is part of the predicted text region.

Geo Map:

Along with the score map, we also obtain a multi-channel geometry map as an output. The geometry output can be RBOX or QUAD. The number of channels and the description for AABB, RBOX and QUAD are shown in the table below.


Image By Paritosh Mahto

From the above image we can observe that, for RBOX, the geometry uses a four-channel axis-aligned bounding box (AABB) R and a one-channel rotation angle θ. The formulation of R is the same as for AABB: the four channels hold the distances from the pixel position to the four boundaries of the rectangle, and the remaining channel holds the rotation angle, as shown below.
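To make the four distance channels concrete, here is a minimal sketch for the simplest case: an unrotated box (θ = 0). The function name `aabb_distances` and the (x_min, y_min, x_max, y_max) box layout are assumptions for illustration. The channel order top, right, bottom, left follows the EAST paper.

```python
def aabb_distances(px, py, box):
    """Return (top, right, bottom, left) distances from a pixel to the box edges.

    box is (x_min, y_min, x_max, y_max) in image coordinates (y grows downward).
    """
    x_min, y_min, x_max, y_max = box
    if not (x_min <= px <= x_max and y_min <= py <= y_max):
        raise ValueError("pixel lies outside the box")
    top = py - y_min
    right = x_max - px
    bottom = y_max - py
    left = px - x_min
    return top, right, bottom, left
```

For example, the pixel (30, 20) inside the box (10, 10, 50, 40) gets the distances (10, 20, 20, 20). For a rotated box, the same distances would be measured in the rotated frame of the rectangle.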

Image Source-Paper(Scene Text Detection and Recognition)


Image By Paritosh Mahto

For QUAD, we use 8 numbers to represent the coordinate offsets from the four vertices of the quadrangle to each pixel location. Each offset consists of two numbers (Δxi, Δyi), so the geometry output contains 8 channels. An example is shown below.
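The QUAD encoding for one pixel can be sketched as follows. The function name `quad_offsets` is hypothetical, and the sign convention (vertex minus pixel) is an assumption; an implementation may use the opposite sign as long as encoding and decoding agree.

```python
def quad_offsets(px, py, quad):
    """Return the 8 QUAD channel values for one pixel.

    quad is a list of four (x, y) vertices; the result is
    [dx1, dy1, dx2, dy2, dx3, dy3, dx4, dy4], each vertex minus the pixel.
    """
    offsets = []
    for vx, vy in quad:
        offsets.extend([vx - px, vy - py])
    return offsets
```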

Image By Paritosh Mahto

In this implementation, we will only use RBOX.

For the generator function, we have to follow a few steps:

Image By Paritosh Mahto
Image By Paritosh Mahto

All the codes are available here-

The output original image with score map, geometry map and training mask from the generator function is shown here-

Step -2 Model Building & Loss Function

Image By Paritosh Mahto

In this step, we will build the detector architecture with VGG16 and ResNet50 models pre-trained on ImageNet data as feature extractors.

Model-1(VGG16 as Feature Extractor)

Source Code-

Model Architecture-

Model-2(ResNet50 as Feature Extractor)

Model Architecture-

Loss Function

As we are working with image data, IOU score is one of the frequently used losses. But here we have two main outputs, the score map and the geometry map, so we have to calculate the loss for both.

The total loss is expressed as :

Ls and Lg represent the losses for the score map and the geometry, respectively, and λg weighs the importance between the two losses. In our experiment, we set λg to 1.
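Written out in the notation of the EAST paper, the total loss described here is:

```latex
L = L_s + \lambda_g L_g
```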

For Score Map Loss

In the paper, the score map loss is a class-balanced binary cross-entropy that weights the positive and negative classes, as shown in the figure.
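For reference, the class-balanced cross-entropy from the EAST paper can be written as follows, where Ŷ is the predicted score map, Y* is the ground truth and β is the balancing factor between positive and negative samples:

```latex
L_s = \mathrm{balanced\text{-}xent}(\hat{Y}, Y^*)
    = -\beta\, Y^* \log \hat{Y} - (1 - \beta)(1 - Y^*) \log(1 - \hat{Y}),
\qquad
\beta = 1 - \frac{\sum_{y^* \in Y^*} y^*}{|Y^*|}
```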

However, in the implementation, Dice loss is used instead.
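A minimal NumPy sketch of a Dice loss on score maps is given below. It is an illustration, not the author's TensorFlow code: the function name `dice_loss`, the optional training `mask` argument and the smoothing constant `eps` are all assumptions.

```python
import numpy as np

def dice_loss(y_true, y_pred, mask=None, eps=1e-5):
    """Dice loss between a ground-truth and a predicted score map."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    if mask is None:
        # Without a training mask, every pixel contributes to the loss.
        mask = np.ones_like(y_true)
    intersection = np.sum(y_true * y_pred * mask)
    # eps keeps the division defined when both maps are empty.
    union = np.sum(y_true * mask) + np.sum(y_pred * mask) + eps
    return 1.0 - 2.0 * intersection / union
```

The loss is close to 0 when the predicted map matches the ground truth and reaches 1 when the two do not overlap at all. Because it is driven by overlap rather than per-pixel counts, it is less sensitive to the imbalance between text and background pixels.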

Geometry Map Loss

For RBOX, the loss is defined as

The first loss is the box loss, for which IOU loss is used because it is invariant to objects of different scales.

For rotation angle, the loss is given by-
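For reference, the RBOX losses described here can be written in the notation of the EAST paper as follows. R̂ and R* are the predicted and ground-truth AABB geometries, θ̂ and θ* are the predicted and ground-truth angles, and λ_θ weights the angle term:

```latex
L_g = L_{\mathrm{AABB}} + \lambda_\theta L_\theta
\\[4pt]
L_{\mathrm{AABB}} = -\log \mathrm{IoU}(\hat{R}, R^*)
                  = -\log \frac{|\hat{R} \cap R^*|}{|\hat{R} \cup R^*|}
\\[4pt]
L_\theta(\hat{\theta}, \theta^*) = 1 - \cos(\hat{\theta} - \theta^*)
```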

Implemented codes are shown below:
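As a rough per-pixel illustration of the RBOX loss, here is a plain-Python sketch rather than the author's TensorFlow code. The function name `rbox_loss` is hypothetical, and the default `lambda_theta` and `eps` values are assumptions. Because the predicted and ground-truth boxes are both measured from the same pixel, their intersection can be computed directly from the four distances.

```python
import math

def rbox_loss(pred, target, lambda_theta=10.0, eps=1e-9):
    """Per-pixel RBOX loss: -log(IoU) plus a weighted angle term.

    pred and target are (top, right, bottom, left, theta) tuples.
    """
    t_p, r_p, b_p, l_p, theta_p = pred
    t_g, r_g, b_g, l_g, theta_g = target
    area_pred = (t_p + b_p) * (r_p + l_p)
    area_gt = (t_g + b_g) * (r_g + l_g)
    # Both boxes share the same pixel, so the overlap on each side is
    # the smaller of the two distances.
    w_inter = min(r_p, r_g) + min(l_p, l_g)
    h_inter = min(t_p, t_g) + min(b_p, b_g)
    area_inter = w_inter * h_inter
    area_union = area_pred + area_gt - area_inter
    loss_aabb = -math.log((area_inter + eps) / (area_union + eps))
    loss_theta = 1.0 - math.cos(theta_p - theta_g)
    return loss_aabb + lambda_theta * loss_theta
```

In training, this quantity would be computed for every pixel and averaged over the pixels that belong to text regions.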

Step-3 Model Training

Both models are trained for 30 epochs with the Adam optimizer; the other parameters are shown below:


model_vgg.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, amsgrad=True), loss=total_Loss())

Epoch Vs Loss Plot:



Epoch Vs Loss Plot:

Step-4 Inference Pipeline

After training, the geometry map is first converted back to bounding boxes. Then we apply thresholding based on the score map to remove low-confidence boxes. The remaining boxes are merged using Non-Maximum Suppression.
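The thresholding and box restoration can be sketched as below under simplifying assumptions: the rotation angle is ignored, and the maps are taken to be at image resolution, so no output-stride scaling is applied. The function name `restore_boxes` and the default threshold are hypothetical.

```python
def restore_boxes(score_map, geo_map, score_thresh=0.8):
    """Turn per-pixel predictions into (x_min, y_min, x_max, y_max, score) boxes.

    score_map[y][x] is a confidence in [0, 1]; geo_map[y][x] holds the
    (top, right, bottom, left) distances predicted at that pixel.
    """
    boxes = []
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score < score_thresh:
                continue  # drop low-confidence pixels
            top, right, bottom, left = geo_map[y][x]
            # Move out from the pixel by each predicted distance.
            boxes.append((x - left, y - top, x + right, y + bottom, score))
    return boxes
```

Every confident pixel yields its own box, so a single word usually produces many heavily overlapping boxes. That is why a suppression step is needed afterwards.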

Non-Maximum Suppression (NMS) is a technique used in many computer vision algorithms. "It is a class of algorithms to select one entity (e.g. bounding boxes) out of many overlapping entities." (Source: Reflections on Non Maximum Suppression (NMS), by Subrata Goswami, Medium)

Here we will use Locality-Aware NMS. It adds a weighted merge to standard NMS. The weighted merge combines two output boxes whose IOU is higher than a certain threshold, weighting their coordinates by score. The steps followed in the implementation are discussed below:

  1. First, sort the geometries and start from the topmost.
  2. Take the next box in the row and find its IOU with the previous box.
  3. If IOU > threshold, merge the two boxes by taking the score-weighted average; otherwise keep the box as it is.
  4. Repeat steps 2 to 3 until all boxes are iterated.
  5. Finally, apply standard NMS on the remaining boxes.
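The steps above can be sketched in plain Python as follows. This is a simplified illustration that uses axis-aligned boxes of the form (x1, y1, x2, y2, score), whereas a full implementation would compute IOU on rotated quadrangles. All function names and the default threshold are assumptions. As in the EAST paper, the merged box takes the sum of the two scores.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2, score)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def weighted_merge(a, b):
    """Average coordinates weighted by score; the merged score is the sum."""
    total = a[4] + b[4]
    coords = [(a[i] * a[4] + b[i] * b[4]) / total for i in range(4)]
    return (*coords, total)

def standard_nms(boxes, thresh):
    """Keep the highest-scoring box and drop others that overlap it too much."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= thresh for k in kept):
            kept.append(box)
    return kept

def locality_aware_nms(boxes, thresh=0.3):
    """Merge neighbouring boxes in row order, then apply standard NMS."""
    merged = []
    last = None
    # Step 1: sort top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        # Steps 2-4: compare each box with the running merged box.
        if last is not None and iou(last, box) > thresh:
            last = weighted_merge(last, box)
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    # Step 5: standard NMS on what remains.
    return standard_nms(merged, thresh)
```

Merging neighbours first reduces the number of boxes that the quadratic standard NMS has to compare, which matters because a dense detector emits thousands of candidates per image.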

Implemented Codes-

Non-Maximum Suppression

Inference Pipeline for Detection Model

The outputs from each Model:



Comparing the losses of both models, we can conclude that model 2 (resnet_east) performs better. Let us analyze the performance of model 2.

8. Model Analysis & Model Quantization

As we have seen, model 2 performs better than model 1, so here we analyze the outputs of model 2. First, the loss for each image in the train and test directories is calculated. Then, based on the distribution seen in the box plots of the per-image loss for the train and test images, we select two loss thresholds. Finally, we divide the data into three categories: Best, Average and Worst.
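The categorization step can be sketched as follows. The function name `categorize_by_loss` and the threshold values used in any call are hypothetical; the actual thresholds come from inspecting the box plots.

```python
def categorize_by_loss(losses, low_thresh, high_thresh):
    """Split image names into Best / Average / Worst by their loss.

    losses maps an image name to its loss; the two thresholds are chosen
    by inspecting the loss distribution (for example from a box plot).
    """
    categories = {"Best": [], "Average": [], "Worst": []}
    for name, loss in losses.items():
        if loss <= low_thresh:
            categories["Best"].append(name)
        elif loss <= high_thresh:
            categories["Average"].append(name)
        else:
            categories["Worst"].append(name)
    return categories
```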

The number of images from each category has been shown below:

For Train Images:

For Test Images:

Model Quantization

Quantization for deep learning is the process of approximating a neural network that uses floating-point numbers by a neural network of low bit-width numbers. This dramatically reduces both the memory requirement and computational cost of using neural networks. After quantization
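To illustrate the memory effect of lower bit-width numbers, the sketch below casts float32 weights to float16 with NumPy. It demonstrates only the idea on a plain array and is not the conversion procedure used for the deployed model. The function name `quantize_to_float16` is hypothetical.

```python
import numpy as np

def quantize_to_float16(weights):
    """Cast float32 weights to float16 and report the size and error."""
    weights = np.asarray(weights, dtype=np.float32)
    quantized = weights.astype(np.float16)
    # Largest absolute difference introduced by the lower precision.
    max_error = float(np.max(np.abs(weights - quantized.astype(np.float32))))
    return quantized, weights.nbytes, quantized.nbytes, max_error
```

Storing each weight in 16 bits instead of 32 halves the memory footprint, at the cost of a small rounding error per weight.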

9. Deployment

After model quantization, the float16 quantized model was selected and deployed using Streamlit and GitHub. Using Streamlit's file uploader, I created a .jpg file input section where you can provide a raw image, and the model returns the image with the detected text marked on it.

Webpage Link-

Deployment video

10. Future Work

11. References

