Face Recognition with FaceNet


A facial recognition system is a technology that matches a human face from a digital image or a video frame against a database of faces. It is typically used to authenticate users through ID verification services, and it works by locating and measuring facial features in a given image.

Have you noticed that Facebook has developed an uncanny ability to recognize your friends in your photographs? In the old days, Facebook made you tag your friends in photos by clicking on them and typing in their names. Now, as soon as you upload a photo, Facebook tags everyone for you like magic.

This tech is called face recognition. Facebook’s algorithms can recognize your friends’ faces with 98% accuracy, which is pretty much as good as humans can do!

Google Photos works in a similar way, intelligently grouping your photos by the faces in them.


Let’s see exactly how it works.

Face recognition can be divided into 3 steps, which together form a face recognition pipeline:

  1. Face detection — Detecting faces in an image.
  2. Feature extraction — Extracting the most important features from an image of the face.
  3. Face classification — Classifying the face based on extracted features.

There are various ways to implement each of the steps in a face recognition pipeline. In this blog we’ll focus on popular deep learning approaches where we perform face detection using MTCNN, feature extraction using FaceNet and classification using Softmax.


MTCNN (Multi-Task Cascaded Convolutional Neural Networks) is a Python library, installable with pip and written by GitHub user ipazc, which implements the paper: Zhang, Kaipeng, et al. “Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks.” IEEE Signal Processing Letters 23.10 (2016): 1499–1503.
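A minimal sketch of what can be done with the detector's output. It assumes detections shaped like the list of dictionaries returned by the mtcnn package's `detect_faces` (each with a `box` of `[x, y, width, height]` and a `confidence`). The helper name `face_boxes`, the sample values, and the 0.9 threshold are illustrative assumptions, not part of the library.

```python
def face_boxes(detections, image_width, image_height, min_confidence=0.9):
    """Turn raw detections into clamped (left, top, right, bottom) crop boxes."""
    boxes = []
    for det in detections:
        if det["confidence"] < min_confidence:
            continue  # skip weak detections
        x, y, w, h = det["box"]
        # Boxes can extend past the image edge, so clamp them to the image.
        left, top = max(0, x), max(0, y)
        right, bottom = min(image_width, x + w), min(image_height, y + h)
        if right > left and bottom > top:
            boxes.append((left, top, right, bottom))
    return boxes


# With the real library the detections would come from something like:
#   from mtcnn import MTCNN
#   detections = MTCNN().detect_faces(rgb_image)  # rgb_image: H x W x 3 array
# The values below are made up for illustration.
detections = [
    {"box": [277, 90, 48, 63], "confidence": 0.998},
    {"box": [-5, 10, 40, 40], "confidence": 0.95},
    {"box": [100, 100, 30, 30], "confidence": 0.42},
]
print(face_boxes(detections, image_width=640, image_height=480))
```

Each returned box can then be used to crop a face out of the image before it is passed on to feature extraction.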


FaceNet was proposed by Google researchers in 2015 in the paper titled “FaceNet: A Unified Embedding for Face Recognition and Clustering”. It achieved state-of-the-art results on benchmark face recognition datasets such as Labeled Faces in the Wild (LFW) and the YouTube Faces Database.
The authors proposed an approach that learns a high-quality mapping from face images to embeddings using deep learning architectures such as ZF-Net and Inception. The network is trained with a loss function called triplet loss.

So how does it work?

FaceNet takes an image of a person’s face as input and compresses it into a vector of 128 numbers that captures the most important features of the face. This vector is known as an embedding. Ideally, embeddings of similar faces are also similar.

Embeddings are vectors, and we can interpret vectors as points in a Cartesian coordinate system. That means we can plot an image of a face in the coordinate system using its embedding.

One possible way of recognising a person in an unseen image is to calculate its embedding and then calculate the distances to the embeddings of images of known people. If the face embedding is close enough to the embeddings of person A, we say that this image contains the face of person A.

For example, if the embedding is close enough to the embeddings of a known person (in this case Elon), we say that this image contains the face of Elon.
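The distance check described above can be sketched as follows. This is only an illustration: the toy 3-dimensional vectors stand in for real 128-dimensional embeddings, and the names `identify` and `known` and the threshold of 1.0 are assumptions chosen for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(embedding, known, threshold=1.0):
    """Return the closest known name, or None if nobody is within threshold."""
    best_name, best_dist = None, float("inf")
    for name, references in known.items():
        for ref in references:
            dist = euclidean(embedding, ref)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


# Made-up reference embeddings for two known people.
known = {
    "elon": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "person_a": [[0.0, 0.9, 0.3]],
}
print(identify([0.85, 0.15, 0.05], known))  # close to the "elon" references
print(identify([5.0, 5.0, 5.0], known))     # far from every known face
```

In practice the threshold would have to be tuned on real embeddings, since a value that is too loose accepts strangers and one that is too strict rejects known people.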

In short: feed the image through FaceNet, get the embedding, and check whether it is close enough to any of the known faces. But how does FaceNet know what to extract from the image of a face, and what do the numbers in an embedding vector even mean?

Let me try to explain how FaceNet learns to generate face embeddings.

To train FaceNet we need a bunch of images of faces. Let’s say we have only a few images of 2 people; the same approach applies if we have thousands of images of different people. At the beginning of training, FaceNet generates random vectors for every image, which means the images are scattered randomly when plotted.

Then it randomly picks an image as the anchor, randomly picks another image of the same person as the positive image, and picks an image of a different person as the negative image.
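A small sketch of this random selection, assuming the images are grouped by person in a dictionary. The function name `sample_triplet` and the file names are hypothetical. The FaceNet paper discusses more careful triplet selection than purely random choice, which is not shown here.

```python
import random


def sample_triplet(images_by_person, rng=random):
    """Pick (anchor, positive, negative) image names at random."""
    # The anchor's identity needs at least two images to supply a positive.
    candidates = [p for p, imgs in images_by_person.items() if len(imgs) >= 2]
    anchor_person = rng.choice(candidates)
    anchor, positive = rng.sample(images_by_person[anchor_person], 2)
    # The negative comes from any other identity.
    other_people = [p for p in images_by_person if p != anchor_person]
    negative_person = rng.choice(other_people)
    negative = rng.choice(images_by_person[negative_person])
    return anchor, positive, negative


images_by_person = {
    "yash": ["yash/0.jpg", "yash/1.jpg", "yash/2.jpg"],
    "none": ["none/images.jpg", "none/photo.jpg"],
}
print(sample_triplet(images_by_person, random.Random(0)))
```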

It adjusts the FaceNet network parameters so that the positive example is closer to the anchor than the negative example.

The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
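For a single triplet, the loss can be written as max(d(a, p)² − d(a, n)² + margin, 0), where d is the Euclidean distance between embeddings. The sketch below computes it for toy 2-dimensional vectors. The margin value of 0.2 and the example vectors are illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p)^2 - d(a, n)^2 + margin, 0) for one triplet of embeddings."""
    return max(
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
        0.0,
    )


anchor, positive = [0.0, 1.0], [0.1, 0.9]
# Negative far from the anchor: the triplet is already satisfied, loss is zero.
print(triplet_loss(anchor, positive, negative=[1.0, 0.0]))
# Negative as close as the positive: the loss is positive and pushes it away.
print(triplet_loss(anchor, positive, negative=[0.1, 1.1]))
```

A loss of zero means the negative is already farther from the anchor than the positive by at least the margin, so that triplet contributes nothing to the parameter update.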

We don’t directly tell FaceNet what the numbers in the vector should represent during training. We only require that the embedding vectors of similar faces are also similar (i.e. close to each other). It’s up to FaceNet to figure out how to represent faces with vectors so that the vectors of the same person are similar and the vectors of different people are not. For this to be true, FaceNet needs to identify key features of a person’s face which separate it from other faces. FaceNet tries out many different combinations of these features during training until it finds the ones that work best. FaceNet (like neural networks in general) doesn’t represent features in an image the same way we do (distance, size, etc.). That’s why it’s hard to interpret these vectors, but it is likely that something like the distance between the eyes is hidden behind the numbers in an embedding vector.

Final layer — Softmax

The classification step could be done by calculating the embedding distances between a new face and every known face. This approach is called k-NN, and it becomes inefficient as the number of known faces grows, because every prediction has to compare against all stored embeddings. Instead, we decided to use a Softmax classifier, which learns the boundaries between people and is much more efficient at prediction time.

A Softmax classifier is used as a final step to classify a person based on a face embedding. Softmax was a logical choice for us since the entire stack is neural-network based, but you can use any classifier you wish, such as SVM, Random Forest, etc. If the face embeddings themselves are good, all classifiers should perform well at this step.
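To make the final step concrete, here is a rough sketch of a linear layer followed by softmax applied to an embedding. The weights, biases, and labels are hand-picked hypothetical values for a toy 3-dimensional case; in a real system they would be learned from the training embeddings.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights, biases, labels):
    """Linear layer followed by softmax; returns (label, probability)."""
    scores = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical parameters: one weight row and one bias per class.
weights = [[2.0, -1.0, 0.0], [-1.0, 2.0, 0.5]]
biases = [0.0, 0.0]
print(classify([0.9, 0.1, 0.0], weights, biases, labels=["yash", "none"]))
```

Unlike the distance-based approach, prediction here costs one pass over the class weights rather than a comparison with every stored embedding.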

I achieved good results with LogisticRegression.

Then I built an API around it for saving a new face dataset, training, and predicting.

An application programming interface (API) is a connection between computers or between computer programs. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how to build or use such a connection or interface is called an API specification.

For this I used FastAPI. It can help us expose our ML models to front-end apps, and it is also a great way to train and test ML models.

  1. Create API

With the help of the create API we can build the dataset and save the images in a directory, then use the same unique user id to train the model on the selected images. Here is the file structure showing how the dataset is saved.

├── 1
│   ├── training
│   │   ├── none
│   │   │   ├── _118667880_ka_05_friendsreunion.jpg
│   │   │   ├── 2996CDBD-76E8-4F26-AC68-BCC2E4CC3631.JPG
│   │   │   ├── 43d6afc7-07bd-4bed-b6b6-e383b63b837b.JPG
│   │   │   ├── 6f944f6a-af5d-4ff7-9e30-9163d08ef6a4.JPG
│   │   │   ├── images.jpg
│   │   │   ├── photo-1603383928972-2116518cd3f3.jpg
│   │   │   ├── WhatsApp Image 2021-03-26 at 3.20.15 PM(1).jpeg
│   │   │   ├── WhatsApp Image 2021-03-26 at 3.20.15 PM(2).jpeg
│   │   │   └── WhatsApp Image 2021-03-26 at 3.20.15 PM.jpeg
│   │   └── yash
│   │       ├── 0.jpg
│   │       ├── 1.jpg
│   │       └── 2.jpg
│   ├── embeddings
│   │   ├── class_to_idx.pkl
│   │   ├── embeddings.txt
│   │   └── labels.txt
│   ├── face_recogniser.pkl
│   └── output
│       ├── 2022-06-06-134001_tagged.jpg
│       ├── 2022-06-06-161547_tagged.jpg
│       └── yash_tagged.jpg
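As a rough sketch of how the create step could lay files out in this structure, the helper below writes uploaded image bytes under a per-user, per-label training directory. The function names, the `root` argument, and the byte payload are assumptions for illustration; this is not the actual API code.

```python
import tempfile
from pathlib import Path


def training_dir(root, user_id, label):
    """Directory holding one person's training images for one user id."""
    return Path(root) / str(user_id) / "training" / label


def save_training_image(root, user_id, label, filename, data):
    """Write raw image bytes under <root>/<user_id>/training/<label>/."""
    target = training_dir(root, user_id, label)
    target.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a crafted name cannot escape target.
    path = target / Path(filename).name
    path.write_bytes(data)
    return path


# Demonstration in a throwaway directory with placeholder bytes.
with tempfile.TemporaryDirectory() as root:
    saved = save_training_image(root, 1, "yash", "0.jpg", b"fake image bytes")
    relative = saved.relative_to(root).as_posix()
print(relative)
```

A function like this could be called from a route handler that receives the user id, the person's label, and the uploaded file.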

  2. Train API

Here we pass the same user id, and all the uploaded images are loaded into the training pipeline. It generates embeddings for all those images in the background, then trains the model and saves it as a pkl file in the same directory.
Once the model is saved in the directory, we can use the predict API to test the model's performance.
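One part of such a training step is turning the training folders into numeric labels. The sketch below walks a training directory, builds a label-to-index mapping, and pickles it, loosely mirroring the `class_to_idx.pkl` file in the tree above. The function name `index_training_set` and the exact contents of that pickle are assumptions, since the original code is not shown.

```python
import pickle
import tempfile
from pathlib import Path


def index_training_set(training_root):
    """Return (class_to_idx, samples), where samples are (path, class index)."""
    root = Path(training_root)
    labels = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {label: i for i, label in enumerate(labels)}
    samples = [
        (image, class_to_idx[label])
        for label in labels
        for image in sorted((root / label).iterdir())
        if image.is_file()
    ]
    return class_to_idx, samples


# Build a tiny throwaway dataset with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    training = Path(tmp) / "1" / "training"
    for label, names in {"yash": ["0.jpg", "1.jpg"], "none": ["images.jpg"]}.items():
        (training / label).mkdir(parents=True)
        for name in names:
            (training / label / name).write_bytes(b"")
    class_to_idx, samples = index_training_set(training)
    # Persist the mapping so prediction can translate indices back to names.
    out = Path(tmp) / "1" / "embeddings"
    out.mkdir()
    (out / "class_to_idx.pkl").write_bytes(pickle.dumps(class_to_idx))
    sample_labels = [idx for _, idx in samples]
print(class_to_idx, sample_labels)
```

The embeddings for each listed image and the classifier fit would follow from here, using the detection and feature extraction steps described earlier.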

  3. Predict API

Here we also pass the same user id and upload the images we want predictions for. The API follows the same procedure (first it extracts the face from the image, then generates an embedding for that face and passes it to the same model) and returns the label.
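The prediction flow can be sketched as three stages chained together. The stages are passed in as plain functions so the example runs without any model files; `fake_detect`, `fake_embed`, and `fake_classify` are stand-ins invented for this illustration, where the real stages would be the face detector, FaceNet, and the trained classifier.

```python
def predict_faces(image, detect, embed, classify):
    """Run detect -> embed -> classify and return one label per face."""
    labels = []
    for face in detect(image):              # 1. crop each face out of the image
        embedding = embed(face)             # 2. turn the crop into an embedding
        labels.append(classify(embedding))  # 3. map the embedding to a label
    return labels


# Stand-in stages so the flow can be exercised end to end.
def fake_detect(image):
    return image["faces"]


def fake_embed(face):
    return [float(len(face)), 0.0]


def fake_classify(embedding):
    return "yash" if embedding[0] == 4.0 else "none"


image = {"faces": ["yash", "someone"]}
print(predict_faces(image, fake_detect, fake_embed, fake_classify))
```

Keeping the stages separate like this also makes it easy to swap the classifier without touching detection or feature extraction.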
An API can be really useful in AI/ML work. For instance, if you have an iterative training process, you can pass hyperparameter values as query parameters in the API and kick-start training with just a click. This way you can train different model variants and see which one performs well.

That’s all. Thanks for reading! If you have any doubts, put them in the comment section.



