On the edge — deploying deep applications on constrained devices

On the edge — deploying deep applications on mobile

Techniques on striking the efficiency-accuracy trade-off for deep neural networks on constrained devices

Image by the Author.

Many AI advancements make headlines: “AI is beating humans in Go!”; “Deep weather forecasting”; “Talking Mona Lisa painting”… And yet I do not feel too excited. Despite their appeal, these results are achieved with models that are sound proofs of concept but are still far from real-world applications. The reason for that is simple: their size.

Bigger models with bigger datasets get better results. But these are neither sustainable in terms of the physical resources they consume, such as memory and power, nor in inference times, which are very far from the real-time performance required for many applications.

Real-life problems require smaller models that can run on constrained devices. With growing security and privacy concerns, there is also an increasingly strong case for models that fit on the device itself, eliminating any data transfer to servers.

Below I go over techniques that make models feasible for constrained devices, such as mobile phones. To make that possible, we reduce the model’s spatial complexity and inference time and organise the data flow so that computations are saved. At the end of the article, I also cover practical considerations such as the types of mobile processors and the frameworks that facilitate preparing models for mobile.

While there is a large area of general computational speed-up of matrix operations, this article will focus on techniques that can be applied directly to deep learning applications.

Reducing model spatial complexity

Deep learning models require memory and computational resources, which are often scarce on mobile devices. A straightforward approach to the problem is to reduce the spatial complexity (number of parameters) of the model so that it takes less space and needs fewer computations, ideally while keeping the same accuracy.

Spatial complexity reduction can be split into five approaches:

  • Reduction in the number of model parameters (e.g. pruning and sharing);
  • Reducing model size through quantisation;
  • Knowledge distillation;
  • Direct design of smaller models;
  • Input data transformation.


Pruning

The basic idea of pruning is to select and delete trivial parameters that have little influence on the model’s accuracy and then re-train the model to recover its performance. We can prune individual weights, layers, or blocks of layers:

  • Nonstructural pruning removes low-saliency weights (or neurons) wherever they occur. It is relatively easy to prune aggressively this way, removing most of the network’s parameters with minimal impact on the model’s generalisation performance. Nevertheless, the number of pruned parameters does not directly convert into memory and computational savings: this approach leads to sparse matrix operations, which are known to be hard to accelerate.
  • Structural pruning exploits the structural sparsity of the model at different scales, including filter sparsity, kernel sparsity, and feature-map sparsity. A group of parameters (e.g., entire convolutional filters) is removed, permitting dense matrix operations. However, achieving higher levels of structural pruning without accuracy loss is challenging.

Pruning is iterative. In each iteration, the approach prunes relatively unimportant filters and re-trains the pruned model to compensate for the loss of accuracy. The iteration ends when the pruned model fails to reach the required minimum accuracy.
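To make the two flavours and the iterative loop concrete, here is a minimal sketch in plain Python. It is an illustration under my own assumptions rather than a reference implementation: saliency is approximated by weight magnitude (and by the L1 norm for whole filters), and `retrain` and `evaluate` are hypothetical callables that you would supply.

```python
def magnitude_prune(weights, sparsity):
    """Nonstructural pruning: zero out the smallest-magnitude weights.

    `weights` is a list of rows; `sparsity` is the fraction to remove.
    Ties at the threshold are all pruned, so slightly more may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_filters(filters, n_keep):
    """Structural pruning: keep the `n_keep` filters with the largest L1 norm."""
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [filters[i][:] for i in kept], kept


def iterative_prune(filters, retrain, evaluate, min_accuracy):
    """Drop one filter per round until accuracy would fall below the minimum."""
    current = [f[:] for f in filters]
    while len(current) > 1:
        candidate, _ = prune_filters(current, len(current) - 1)
        candidate = retrain(candidate)  # hypothetical fine-tuning step
        if evaluate(candidate) < min_accuracy:
            break  # keep the last model that still met the requirement
        current = candidate
    return current


weights = [[0.9, -0.05, 0.4], [0.01, -0.7, 0.02], [0.03, 0.04, -0.01]]
sparse = magnitude_prune(weights, 0.5)   # same shape, smallest weights zeroed
dense, kept = prune_filters(weights, 2)  # smaller dense matrix, rows 0 and 1
```

Note how `sparse` keeps its original shape (so no memory is saved without a sparse storage format), while `dense` is genuinely smaller.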

For more details check out this article.

Parameter sharing

Instead of discarding parts of the model, we could instead combine them. When the edge weights are substantially similar, we could share them across several edges.

For example, two fully-connected layers with N nodes each require storing N² weights. However, if many of the weights are substantially similar, we can cluster them and assign the same weight to all edges in a cluster. We then need to store only the cluster centroids and, for each edge, the index of its cluster.
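As a rough sketch of the clustering idea, the snippet below runs a simple one-dimensional k-means over a flat list of weights. The function name `share_weights`, the even initialisation of the centroids and the number of iterations are my own choices for illustration.

```python
def share_weights(weights, n_clusters, n_iter=20):
    """Cluster scalar weights with 1-D k-means.

    Returns the centroids and, for each weight, the index of its centroid.
    """
    lo, hi = min(weights), max(weights)
    # Spread the initial centroids evenly over the weight range.
    centroids = [lo + (hi - lo) * k / max(n_clusters - 1, 1)
                 for k in range(n_clusters)]

    def assign():
        return [min(range(n_clusters), key=lambda k: abs(w - centroids[k]))
                for w in weights]

    for _ in range(n_iter):
        assignments = assign()
        for k in range(n_clusters):
            members = [w for w, a in zip(weights, assignments) if a == k]
            if members:  # leave empty clusters where they are
                centroids[k] = sum(members) / len(members)
    return centroids, assign()


weights = [0.11, 0.09, 0.10, -0.52, -0.48, 0.91, 0.89]
centroids, assignments = share_weights(weights, n_clusters=3)
# Every edge now uses the shared value of its cluster.
shared = [centroids[a] for a in assignments]
```

Instead of seven floating-point values we store three centroids plus seven small integer indices; the saving grows with the number of weights per cluster.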

Network quantization

Examples of symmetric/asymmetric/uniform/non-uniform quantisation mapping. Image by A. Gholami.

The default type used in a neural network is a 32-bit floating point number. Such a high resolution allows for accurate gradient propagation in the training stage. However, it is often not necessary during inference.

The key idea of network quantisation is to reduce the number of bits used for each weight parameter, for example going from 32-bit floating point to 16-bit floating point, 16-bit fixed point, 8-bit fixed point, and so on.

Much of the research in quantisation is focused on rounding techniques for mapping from a larger range of numbers to a much smaller one — uniform/non-uniform, symmetric/asymmetric quantisation.
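To illustrate one of these mappings, here is a minimal sketch of uniform asymmetric quantisation to unsigned 8-bit integers. The helper names `quantise` and `dequantise` are hypothetical, and real toolkits differ in details such as how the range is calibrated.

```python
def quantise(values, n_bits=8):
    """Uniform asymmetric quantisation of floats to unsigned integers."""
    qmin, qmax = 0, 2 ** n_bits - 1
    # The representable range must include zero so that 0.0 maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero input
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantise(q, scale, zero_point):
    """Map the integers back to approximate floating-point values."""
    return [scale * (x - zero_point) for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantise(weights)
restored = dequantise(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now takes one byte instead of four, at the cost of a rounding error of at most about one quantisation step (`scale`). A symmetric scheme would fix the zero point at the centre of the integer range and use a single scale for both signs.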

When it comes to training there are two major approaches to implementing quantisation:

  • Post-Training Quantisation is perhaps the most straightforward way to apply quantisation: model weights are mapped to a lower precision without additional fine-tuning afterwards. However, this method usually costs some accuracy.
  • Quantisation-Aware Training requires re-training the model with quantisation applied in order to match the accuracy of the original model. The quantised network is typically re-trained on the same dataset as the original model. To facilitate gradient propagation, the gradient is not quantised.
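The mechanics of quantisation-aware training can be sketched on a toy one-weight model y = w·x. This is only an illustration under my own assumptions: the forward pass uses a "fake-quantised" weight, while the update is applied to a full-precision copy as if rounding had a gradient of one (often called a straight-through estimator). The function names, the quantisation step and the data are made up, and the target weight is chosen to lie on the quantisation grid so that training settles; in general the quantised weight may oscillate between neighbouring levels.

```python
def fake_quantise(w, scale):
    """Simulate low precision by rounding w to the nearest multiple of scale."""
    return round(w / scale) * scale


def train_qat(data, scale=0.25, lr=0.05, epochs=50):
    """Fit y = w * x with the forward pass run through fake quantisation."""
    w = 0.0  # full-precision weight that receives the updates
    for _ in range(epochs):
        for x, y in data:
            w_q = fake_quantise(w, scale)  # forward pass sees the quantised weight
            error = w_q * x - y
            grad = 2 * error * x           # gradient is not quantised
            w -= lr * grad                 # update the full-precision copy
    return w, fake_quantise(w, scale)


data = [(1.0, 0.75), (2.0, 1.5)]  # generated by y = 0.75 * x
w_float, w_quantised = train_qat(data)
```

After training, only `w_quantised` would be shipped to the device; `w_float` exists purely to accumulate small gradient steps that rounding would otherwise erase.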

Applying quantisation out of the box is not straightforward, as different network parts might require different precision. Hence quantisation/de-quantisation blocks are often inserted between such parts to convert from one precision to another.

For more details check out this recent survey; I also liked this article on network quantisation.

Knowledge distillation

Image by J. Gou.

Operating under the assumption of a significant redundancy in the learned weights of a deep model, we can distil the knowledge of a large model (teacher) by training a smaller model (student) to mimic the distribution of the teacher’s outputs.

The key idea of model distillation is to leverage the “soft” probabilities across all classes produced by the teacher when training the student model, as these probabilities can carry more information about the input than “hard” class labels.
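A common way to write this down is sketched below: the student is penalised both for deviating from the teacher’s temperature-softened distribution and for missing the hard label. The temperature, the weighting `alpha` and the T² scaling follow a widely used formulation, but the exact values here are arbitrary assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(t * math.log(t / s)
             for t, s in zip(soft_teacher, soft_student) if t > 0)
    # Standard cross-entropy of the student against the hard label.
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [5.0, 2.0, 0.5]
good_student = [4.0, 1.5, 0.0]
poor_student = [0.5, 2.0, 5.0]
loss_good = distillation_loss(good_student, teacher, hard_label=0)
loss_poor = distillation_loss(poor_student, teacher, hard_label=0)
```

The softened teacher output tells the student not just that class 0 is correct, but also how much more plausible class 1 is than class 2, which is exactly the extra information hard labels lack.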

With network quantisation and pruning, it is possible to maintain the accuracy with compression reaching 4x. Attaining similar compression rates with knowledge distillation without accuracy degradation is challenging; however, all methods can be combined.

For more details check out this article.

Direct design of small models

Image by A. Howard.

Much of the work in the early boom of deep learning was focused on building bigger models that achieve state-of-the-art accuracy. This trend was later overtaken by a stream of papers that looked into the efficiency-accuracy trade-off, directly designing smaller models.

The key papers in the area are: MobileNetV1, MobileNetV2, MnasNet, MobileNetV3.

There are some notable examples of architectural changes that are now part of all major deep learning libraries. These are often based on low-rank factorisation, for example depth-wise separable convolutions, for which there is a great article explaining the ins and outs.
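To see where the savings of a depth-wise separable convolution come from, it helps to count parameters. The sketch below ignores biases, and the layer sizes are arbitrary examples of my own.

```python
def standard_conv_params(kernel, c_in, c_out):
    """A standard convolution learns one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Depth-wise step (one spatial filter per input channel) plus 1x1 point-wise step."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_params(3, 64, 128)         # 73728
separable = depthwise_separable_params(3, 64, 128)  # 8768
reduction = standard / separable                    # roughly 8.4x fewer parameters
```

The ratio of the two counts is 1/c_out + 1/kernel², so for 3×3 kernels and wide layers the separable version needs close to nine times fewer parameters.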

As the search space for designing a small and accurate model is huge, the more recent trend focuses less on hand-crafted models and more on neural architecture search employing reinforcement learning. This strategy was applied, for example, in MobileNetV3 and MnasNet.

For more details on neural architecture search check out this article.

Data transformation

Instead of speeding up computations by looking at the model’s structure, we could reduce input data dimensionality. An example is image decomposition into two low-resolution sub-images, one of which carries high-frequency information and another containing low-frequency information. Combined, these would carry the same information as the original image but have lower dimensionality — meaning a smaller model to process the input.

For more details check out this article.

Reuse of intermediate results

It is not uncommon for several parts of a machine learning pipeline to rely on the same backbone model. Features computed once for an input, or for a similar input, can then be re-used to avoid redundant computations.

Data reuse among multiple tasks

It is not uncommon to have several models running in parallel for different but related tasks with the same input. The idea is to re-use the features from shallow layers across multiple models while having trained deeper layers on specific tasks.
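Structurally, this boils down to running the shared layers once and feeding their output to several task-specific heads. In the toy sketch below the "backbone" and the two heads are trivial stand-in functions of my own invention; in practice they would be the shallow and deep layers of real networks.

```python
calls = {"backbone": 0}  # counts how often the shared part is executed


def backbone(image):
    """Stand-in for shared shallow layers: per-row mean and value range."""
    calls["backbone"] += 1
    return [(sum(row) / len(row), max(row) - min(row)) for row in image]


def brightness_head(features):
    """Task-specific head 1: average brightness."""
    return sum(mean for mean, _ in features) / len(features)


def contrast_head(features):
    """Task-specific head 2: largest local contrast."""
    return max(spread for _, spread in features)


def run_tasks(image, heads):
    """Compute the shared features once and re-use them for every head."""
    features = backbone(image)
    return {name: head(features) for name, head in heads.items()}


image = [[0, 2], [4, 8]]
results = run_tasks(image, {"brightness": brightness_head,
                            "contrast": contrast_head})
```

Both tasks are served by a single backbone pass, so adding another head costs only the head itself.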

Data reuse among image frames

While input data might not be precisely the same, it can be similar enough to the previous input for earlier results to be partially re-used (e.g. consecutive frames in continuous vision models).
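One simple way to exploit this is sketched below: cache the features of the last processed frame and re-use them whenever a new frame is close enough. The class name `FrameCache`, the mean absolute difference as the similarity measure and the threshold are my own assumptions; published approaches are usually more fine-grained and re-use results per image region.

```python
class FrameCache:
    """Re-use features when a new frame barely differs from the cached one."""

    def __init__(self, extract, threshold):
        self.extract = extract      # the expensive feature extractor
        self.threshold = threshold  # maximum mean absolute pixel difference
        self.hits = 0
        self._frame = None
        self._features = None

    def features(self, frame):
        if self._frame is not None:
            diff = sum(abs(a - b) for a, b in zip(frame, self._frame)) / len(frame)
            if diff < self.threshold:
                self.hits += 1
                return self._features  # skip the computation entirely
        # Compare against the last *computed* frame so small changes cannot
        # accumulate unnoticed over many frames.
        self._frame = list(frame)
        self._features = self.extract(frame)
        return self._features


cache = FrameCache(extract=sum, threshold=1.0)
frames = [[10, 10, 10, 10], [10, 11, 10, 10], [50, 50, 50, 50]]
outputs = [cache.features(f) for f in frames]
```

Here the second frame differs only slightly from the first, so its features are served from the cache, while the third frame triggers a fresh computation.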

For more details check out this article.


Having distilled, pruned and compressed the model, we are finally ready to deploy on mobile! However, there is a caveat: most likely, the out-of-the-box solution will either be very slow or will not work at all. This typically happens because some operations are either not optimised or not supported on mobile processors.

It is worth bearing in mind that current mobile devices have several processors. A deep learning application would likely run either on a GPU or an NPU (Neural Processing Unit, optimised specifically for deep learning applications). Each has its own pros and cons when it comes to deploying deep learning applications.

Despite their dedicated purpose, the efficiency gains of current NPUs can be offset by the data-transfer bottleneck to and from the processor, which might be problematic for real-time applications.

Deep learning frameworks on mobile devices

Traditional deep learning libraries such as PyTorch and TensorFlow are not particularly suitable for mobile applications. They are heavy and rely on third-party dependencies, which makes them cumbersome. Both frameworks are oriented towards efficient training on powerful GPUs, while a model deployed on mobile would benefit from a highly mobile-optimised toolkit for inference, which both frameworks lack.

Fortunately, there are frameworks designed specifically for deep learning on mobile: TensorFlow Lite and PyTorch Mobile.

One of the challenges of developing deep learning applications for mobile is that standards vary between manufacturers: some run their models in TensorFlow, others in PyTorch, and some have their own frameworks. To facilitate the transition, we can use the Open Neural Network Exchange (ONNX) format, which helps to convert models from one library to another.

For a final touch you could use OpenVINO, which helps optimise deep learning applications for inference both in the cloud and on edge devices by focusing on the deployment hardware.

For more details on developing (including allowed operations) for each of the commonly used phones check out their API documentation: Huawei, Apple, Samsung. These also contain specific tricks that would make models more efficient on the specific devices.


Deep models require computational and memory resources that are often unavailable on constrained devices. To address this limitation, several research branches have concentrated on reducing model size and speeding up its computations.

A typical model, before being deployed on mobile, would be designed to consume as few resources as possible or would be compressed through distillation; it would then undergo quantisation before finally being deployed on a device. For further reading check out this survey on deploying deep learning on mobile.

If you liked this article share it with a friend! To read more on machine learning and image processing topics press subscribe!

Have I missed anything? Do not hesitate to leave a note, comment or message me directly!

