ASMNet: a Lightweight Deep Neural Network for Face Alignment and Pose Estimation

ASM Assisted Loss Function

We first review the Active Shape Model (ASM) algorithm, and then we explain our customized ASM-based loss function, which improves the accuracy of the network.

Active Shape Model Review

The Active Shape Model is a statistical model of the shape of objects. Each shape is represented by n points, which form the set S defined in Eq. 1:

To simplify the problem and learn shape components, Principal Component Analysis (PCA) is applied to the covariance matrix calculated from a set of K training shape samples. Once the model is built, an approximation of any training sample (S) is calculated using Eq. 2:
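As a concrete illustration of this step, the following sketch builds a PCA shape model from K flattened training shapes and approximates a sample as in Eq. 2. The function names (`build_shape_model`, `approximate`) and the array layout are assumptions for illustration, not the authors' code:

```python
import numpy as np

def build_shape_model(shapes, num_components):
    """Build a PCA shape model from K training shapes.

    shapes: array of shape (K, 2n), each row a flattened landmark set.
    Returns the mean shape, the top eigenvectors, and their eigenvalues.
    """
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    # PCA via the covariance matrix of the training shapes
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return mean_shape, eigvecs[:, :num_components], eigvals[:num_components]

def approximate(shape, mean_shape, components):
    """Eq. 2 (sketch): S ~ mean + P b, with b = P^T (S - mean)."""
    b = components.T @ (shape - mean_shape)
    return mean_shape + components @ b, b
```

With all components retained, a training sample is reconstructed exactly; truncating `num_components` yields the smoothed approximation the text describes.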

Consequently, the parameters of a deformable model are defined by the vector b, so that varying the elements of the vector changes the shape of the model. Consider that the statistical variance (i.e., eigenvalue) of the ith parameter of b is λi. To make sure the shape generated after applying ASM is relatively similar to the ground truth, the parameter bi of vector b is usually limited to ±3√λi [7]. This constraint ensures that the generated shape is similar to those in the original training set. Hence, we create a new shape S_New after applying this constraint, according to Eq. 3:

where b̃ is the constrained b. We also define the ASM operator according to Eq. 4:

ASM transforms each input point (Px_i, Py_i) to a new point (Ax_i, Ay_i) using Eqs. 1, 2, and 3.
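The ASM operator described above can be sketched as follows: project a shape onto the PCA model, clamp each coefficient to ±3√λi to get b̃, and reconstruct. The function name and argument layout are illustrative assumptions:

```python
import numpy as np

def asm_operator(shape, mean_shape, components, eigvals, limit=3.0):
    """Eqs. 2-4 (sketch): project a flattened shape onto the PCA model,
    clamp each coefficient b_i to +/- limit * sqrt(lambda_i), and
    reconstruct the constrained shape S_New."""
    b = components.T @ (shape - mean_shape)
    bound = limit * np.sqrt(eigvals)
    b_tilde = np.clip(b, -bound, bound)   # the constrained b (b~)
    return mean_shape + components @ b_tilde
```

Shapes already well explained by the model pass through nearly unchanged; outlying coefficients are pulled back toward the training distribution.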

Fig 2: ASMLoss

ASM Assisted Loss

We describe the loss functions for two different tasks: facial landmark point detection and pose estimation.

Facial landmark points detection task: The common loss function for facial landmark point detection is Mean Square Error (MSE). We propose a new loss function, called ASM-LOSS, that combines MSE as the main loss with an assistant loss which utilizes ASM to improve the accuracy of the network.

The proposed ASM-LOSS guides the network to first learn the smoothed distribution of the facial landmark points. In other words, during training, the loss function compares the predicted facial landmark points with both their corresponding ground truth and the smoothed version of the ground truth, which is generated using ASM. Accordingly, in the early stages of training, we assign a larger weight to ASM-LOSS than to the main loss (MSE), since the variation of the smoothed facial landmark points is much lower than that of the original landmark points and, as a rule of thumb, easier for a CNN to learn. Then, by gradually decreasing the weight of ASM-LOSS, we lead the network to focus more on the original landmark points. In practice, we found that this method, which can also be regarded as a form of transfer learning, works well and results in more accurate models.

We also discovered that although face pose estimation relies heavily on face alignment, it can achieve good accuracy with the assistance of smoothed facial landmark points as well. In other words, if the performance of the facial landmark point detection task is acceptable, meaning the network predicts the landmarks such that the overall shape of the face is correct, pose estimation can achieve good accuracy. Accordingly, using smoothed landmark points and training the network with ASM-LOSS results in higher accuracy on the pose estimation task.

Consider that for each image in the training set, there exist n landmark points in a set called G, such that (Gx_i, Gy_i) is the coordinate of the ith landmark point. Similarly, the predicted set P contains n points, such that (Px_i, Py_i) is the predicted coordinate of the ith landmark point.

We apply PCA on the training set and calculate the eigenvectors and eigenvalues. Then we calculate the set A, which contains n points, each being the transformation of the corresponding point in G, by applying the ASM operator according to Eq. 4:

We define the main facial landmark point loss, Eq. 7, as the Mean Square Error between the ground truth (G) and the predicted landmark points (P):

where N is the total number of images in the training set and G_ij = (Gx_i, Gy_i) denotes the ith landmark of the jth sample in the training set. We calculate ASM-LOSS as the error between the ASM points (A_set) and the predicted landmark points (P_set) using Eq. 8:
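The two landmark losses, Eqs. 7 and 8, can be sketched together; both are mean squared errors over all points and samples, differing only in the target set (G versus A). The function name and the `(N, n, 2)` array layout are assumptions for illustration:

```python
import numpy as np

def landmark_losses(pred, gt, asm_gt):
    """Eq. 7 (sketch): MSE between predictions P and ground truth G.
    Eq. 8 (sketch): ASM-LOSS between predictions P and ASM points A.

    pred, gt, asm_gt: arrays of shape (N, n, 2) -- N samples, n points.
    """
    # squared Euclidean error per point, averaged over points and samples
    l_mse = np.mean(np.sum((gt - pred) ** 2, axis=-1))
    l_asm = np.mean(np.sum((asm_gt - pred) ** 2, axis=-1))
    return l_mse, l_asm
```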

Finally, we calculate the total loss for the facial landmark task according to Eq. 9:

The accuracy of the ASM points (A_set) relies heavily on the accuracy of the PCA: the more accurate the PCA, the smaller the discrepancy between the ground truth (G) and the ASM points (A_set). In more detail, by reducing the accuracy of the PCA, the generated ASM points (A_set) become more similar to the average point set, the average of all the ground-truth face shapes in the training set. Consequently, predicting the points in A_set is easier than predicting the points in G_set, since the variation of the former is lower than that of the latter. We use this property to design our loss function: we first guide the network toward learning the distribution of the smoothed landmark points, which is easier to learn, and then gradually harden the problem by decreasing the weight of ASM-LOSS.
We define α as the ASM-LOSS weight using Eq. 10:

where i is the epoch number and l is the total number of training epochs. As Eq. 10 shows, at the beginning of training the value of α is higher, which means we put more emphasis on ASM-LOSS. Hence, the network focuses on the simpler task first and converges faster. After one-third of the total epochs, we reduce α to 1 and put equal emphasis on the main MSE loss and ASM-LOSS. Finally, after two-thirds of the total epochs, by reducing α to 0.5, we direct the network toward predicting the main ground truths, while treating the smoothed points generated using ASM as an assistant.
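The schedule described above can be sketched as a simple step function combined with the weighted sum of Eq. 9. The initial value of 2 in the first phase is our reading of "a higher value", and the function names are illustrative; the paper's Eq. 10 is the authoritative definition:

```python
def asm_loss_weight(epoch, total_epochs):
    """Eq. 10 (sketch): alpha steps down as training progresses.
    The first-phase value 2.0 is an assumed reading of 'higher'."""
    if epoch <= total_epochs / 3:
        return 2.0                 # emphasize the easier ASM targets
    if epoch <= 2 * total_epochs / 3:
        return 1.0                 # equal emphasis on MSE and ASM-LOSS
    return 0.5                     # focus on the original ground truth

def total_landmark_loss(l_mse, l_asm, epoch, total_epochs):
    """Eq. 9 (sketch): main MSE loss plus the weighted ASM-LOSS."""
    return l_mse + asm_loss_weight(epoch, total_epochs) * l_asm
```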

Pose estimation task: We use Mean Square Error to calculate the loss for the head pose estimation task. Eq. 11 defines the loss function L_pose, where yaw (y_p), pitch (p_p), and roll (r_p) are the predicted poses and y_t, p_t, and r_t are the corresponding ground truths.
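A minimal sketch of this pose loss, assuming predictions and targets are stored as `(N, 3)` arrays of [yaw, pitch, roll] (the storage layout and function name are our assumptions):

```python
import numpy as np

def pose_loss(pred, gt):
    """Eq. 11 (sketch): MSE over the predicted yaw/pitch/roll
    (y_p, p_p, r_p) and the ground truth (y_t, p_t, r_t).

    pred, gt: arrays of shape (N, 3) holding [yaw, pitch, roll].
    """
    return np.mean((gt - pred) ** 2)
```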

