TableNet Implementation Using Resnet encoder for extraction of information from document images


In the last decade, deep neural networks have achieved tremendous success in pattern recognition problems such as computer vision and natural language processing. Computer vision is the field in which computers extract high-level information from digital images and videos. It has been applied across industries, from the development of self-driving cars to the detection of cancer cells. One common theme across most computer vision applications is image segmentation.

Image segmentation is the process of partitioning a digital image into multiple segments based on their respective categories. In this article, I will explain a research paper called TableNet, which uses a deep learning image segmentation model to detect a table and its structure in scanned images. While some progress has been made in table detection, extracting table contents is still a challenge, since it involves more fine-grained table structure recognition. The original model has been improved by using a ResNet encoder in place of VGG19.

TableNet Architecture

The model is based on the FCN (Fully Convolutional Network) architecture. It has no dense layers; it consists of convolutional, pooling and upsampling layers. The original model uses VGG-19 as the base network. The fully connected layers of VGG-19 (the layers after pool5) are replaced with two (1×1) convolution layers. Each of these convolution layers (conv6) uses the ReLU activation followed by a dropout layer with probability 0.8 (conv6 + dropout, as shown in the figure). I have used a ResNet encoder in place of VGG19; during my experiments it gave better results.

The ‘conv2_block6_concat’, ‘conv4_block9_0_relu’ and ‘conv5_block1_0_relu’ layers are used. Following this, two decoder branches are appended. The output of the (conv5_block1_0_relu + dropout) layer is fed to both decoder branches. In each branch, additional layers are appended to filter out the respective active regions. In the table branch of the decoder network, an additional (1×1) convolution layer, conv7_table, is used before a series of fractionally strided convolution layers that upscale the image. The output of the conv7_table layer is also upscaled using fractionally strided convolutions, and is combined with the pool4 pooling layer of the same dimension.

Similarly, the combined feature map is again upscaled and the pool3 pooling layer is appended to it. Finally, the feature map is upscaled to the original image dimensions. In the other branch, for detecting columns, there is an additional convolution layer (conv7_column) with a ReLU activation function and a dropout layer with the same dropout probability. The feature maps are upsampled using fractionally strided convolutions after a (1×1) convolution layer (conv8_column). The upsampled feature maps are combined with the pool4 pooling layer, and the combined feature map is upsampled again and combined with the pool3 pooling layer of the same dimension. After this, the feature map is upscaled to the original image size. In both branches, multiple (1×1) convolution layers are used before the transposed convolution layers.

One decoder branch performs segmentation of the table region, and the other is responsible for segmentation of the column regions. After the table and column regions are detected, the tabular data is extracted using Tesseract OCR.


The model is trained on the Marmot dataset, which consists of scanned document images and corresponding XML files that specify the table locations. The annotations for the table columns were done by the authors of the research paper and are available at the following link.


  • Given a scanned document image, the table and its columns need to be segmented.
  • Once the regions are identified, the information has to be extracted from the table.

Machine learning Problem

For a given document image, we have to perform semantic segmentation by classifying each pixel as table or non-table. This is a deep learning semantic segmentation problem.

Performance Metric

As this is a classification problem, the F1 score will be used to measure the performance of the model.
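Pixel-wise F1 over the predicted and ground-truth masks can be sketched as below. This is a minimal NumPy sketch; `pixel_f1` is an illustrative name, not taken from the original code.

```python
import numpy as np

def pixel_f1(y_true, y_pred):
    """Pixel-wise F1 score between a ground-truth and a predicted binary mask."""
    y_true = y_true.astype(bool).ravel()
    y_pred = y_pred.astype(bool).ravel()
    tp = np.sum(y_true & y_pred)    # pixels correctly marked as table
    fp = np.sum(~y_true & y_pred)   # background pixels marked as table
    fn = np.sum(y_true & ~y_pred)   # table pixels missed by the model
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The same function can be applied to the table mask and the column mask separately and the two scores averaged, if a single number is needed.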


The dataset consists of images in BMP format and XML files. We first have to create table and column masks for all the images using the information given in the XML files.

Scanned image of document
Sample xml file

Each XML file contains <bndbox> tags that specify the coordinates of the table and its respective columns. These coordinates will be used to create the masks from the images. The code for creating the masks is given below.

Image masks are essentially versions of the original image in which only the table and its columns are white and the rest of the image is black.

Column mask
Table mask

Next, normalization is applied to the original image and both masks. For training, the image and masks are grouped together for feeding to the model.
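A minimal sketch of this preprocessing step, assuming pixel values are scaled to [0, 1] and the two masks are grouped as a pair of targets (`preprocess` is an illustrative name):

```python
import numpy as np

def preprocess(image, table_mask, column_mask):
    """Normalize the image and group it with its masks for training."""
    image = image.astype(np.float32) / 255.0            # scale pixels to [0, 1]
    # Binarize the masks and add a channel axis so they match the
    # model's per-pixel output shape.
    table_mask = (table_mask > 0).astype(np.float32)[..., np.newaxis]
    column_mask = (column_mask > 0).astype(np.float32)[..., np.newaxis]
    # One input, two targets: the model has a table branch and a column branch.
    return image, (table_mask, column_mask)
```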


After data preparation is done, the TableNet model is created as specified in the architecture section. It consists of the two parts below.

  1. Encoder section
  2. Decoder section

Encoder section:

Here, the encoder is used with ImageNet weights. Images are resized to (1024, 1024, 3). Three intermediate layers of the encoder are passed to the decoder section. The encoder section downsamples the images.
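The encoder can be sketched as below. The three skip-connection layer names quoted earlier (‘conv2_block6_concat’, ‘conv4_block9_0_relu’, ‘conv5_block1_0_relu’) follow Keras's DenseNet121 layer naming, so this sketch builds the backbone from `tf.keras.applications.DenseNet121`; if you use a different backbone, substitute layer names at comparable depths.

```python
import tensorflow as tf

def build_encoder(input_shape=(1024, 1024, 3), weights='imagenet'):
    """Backbone with three intermediate outputs for the decoder's skip connections."""
    base = tf.keras.applications.DenseNet121(
        include_top=False, weights=weights, input_shape=input_shape)
    skip_names = [
        'conv2_block6_concat',   # high-resolution features (like pool3)
        'conv4_block9_0_relu',   # mid-level features (like pool4)
        'conv5_block1_0_relu',   # deepest features, fed to conv6 + dropout
    ]
    outputs = [base.get_layer(n).output for n in skip_names]
    return tf.keras.Model(inputs=base.input, outputs=outputs)
```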

Decoder Section:

There are two decoders present in this model: one detects the table location and the other detects the columns of the table. The downsampled feature maps, after passing through two Conv2D layers, are processed through one more (1×1) Conv2D layer. Then, using the skip-pooling technique, the low-resolution feature maps of the decoder network are combined with the high-resolution features of the encoder network. After upsampling, we get an output table mask of shape (1024, 1024, 2).

Model loss
Model f1 score


Once training is completed, prediction is performed on the original image. To extract the information, the table mask is superimposed on the original image; this identifies the table location based on the prediction. After this step, the information is extracted using Tesseract OCR and saved into a CSV file.
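This extraction step can be sketched as below, assuming the `pytesseract` wrapper (which requires the Tesseract binary to be installed); the function names are illustrative.

```python
import numpy as np

def mask_table_region(image, table_mask):
    """Blank out everything outside the predicted table region."""
    keep = table_mask[..., np.newaxis] > 0        # broadcast over channels
    return np.where(keep, image, 255).astype(np.uint8)  # white background

def extract_table_text(image, table_mask):
    """Run OCR on the table-only region of the image."""
    import pytesseract              # deferred import; needs Tesseract installed
    from PIL import Image
    masked = mask_table_region(image, table_mask)
    return pytesseract.image_to_string(Image.fromarray(masked))
```

The returned text can then be split line by line and written out as rows of a CSV file.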

Further improvements:

More data is needed to increase the performance of the model. More computing power is also required for training on high-resolution images.


You can connect with me on LinkedIn.

