MLP Mixer in a Nutshell

A Resource-Saving and Performance-Competitive Alternative to Vision Transformers

This post intends to provide a brief overview of the MLP Mixer introduced in the paper “MLP-Mixer: An all-MLP Architecture for Vision” by I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer et al. [1]. I’d also like to add some context based on my own experience to help you quickly grasp the key characteristics of the MLP Mixer.


  1. Introduction
  2. Motivation of the MLP Mixer and its Contribution
  3. MLP Mixer Architecture and its Comparison to Previous Models
  4. Experiments
  5. Conclusion


Introduction

The introduction of the transformer architecture by Vaswani et al. in the paper “Attention Is All You Need” [2] has revolutionized the field of machine learning. While it was initially applied to NLP (natural language processing) tasks such as translating sentences from English to German, the architecture was quickly adopted and adapted by other disciplines such as computer vision, resulting in models like the ViT (Vision Transformer) [3]. Among others, the major strengths of the transformer are (1) its ability to capture global context through the attention mechanism, (2) its high degree of parallelism during training and (3) that it is a more general operation with fewer inductive biases than, e.g., a convolutional neural network, which is loosely modeled on how humans perceive images.

Before we dive into the MLP Mixer, let’s first try to understand the motivation for a new architecture by reviewing the weaknesses of the transformer. We then summarize the contributions of the MLP Mixer paper and finally shift gears to review the MLP Mixer itself.

Motivation of the MLP Mixer and its Contribution

Regardless of the strengths of the transformer architecture and its variations, there is one major issue: the requirements for compute and memory scale quadratically with the input sequence length. In other words, the more tokens a sentence in an NLP problem has or the higher the resolution of an image (and hence the number of patches) in computer vision, the more resources are required to train and deploy the model. This constraint is the result of the attention mechanism, in which each element of a set (or sequence) attends to every element of a second set, where the second set can be the same as the first (self-attention).
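
To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. It assumes square images, a patch size of 16 pixels and a single attention map; the helper names are purely illustrative and not taken from any paper or library.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_map_entries(seq_len: int) -> int:
    """Entries in one self-attention map: every token attends to every token."""
    return seq_len * seq_len


# Doubling the image side length quadruples the number of patches
# and multiplies the size of the attention map by 16.
for image_size in (224, 448, 896):
    n = num_patches(image_size, patch_size=16)
    print(image_size, n, attention_map_entries(n))
# 224 196 38416
# 448 784 614656
# 896 3136 9834496
```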

The MLP Mixer tackles this issue by replacing the attention mechanism. According to [1], the main contributions of the MLP Mixer are:

  • The introduction of a simple architecture that is competitive with (but not better than) the ViT and uses neither convolutions nor self-attention, only multi-layer perceptrons (MLPs).
  • The reliance on basic matrix multiplications, reshaping and transpositions, and scalar nonlinearities only.
  • The linear scaling of computational complexity with the number of input patches (unlike the ViT, which scales quadratically).
  • The removal of positional embeddings.
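
The difference between linear and quadratic scaling can be illustrated with a rough count of multiply-accumulate operations (MACs). This is only a sketch under simplifying assumptions: the hidden width of the token-mixing MLP is treated as a fixed hyperparameter that does not grow with the number of patches, the query/key/value projections of attention are ignored, and the values 768 and 384 are example settings chosen for illustration.

```python
def token_mixing_macs(num_patches: int, channels: int, hidden_tokens: int) -> int:
    """MACs of a two-layer MLP (patches -> hidden -> patches) applied to every channel."""
    return 2 * num_patches * hidden_tokens * channels


def attention_macs(num_patches: int, channels: int) -> int:
    """MACs for the attention scores (Q K^T) plus the weighted sum over the values."""
    return 2 * num_patches * num_patches * channels


channels, hidden_tokens = 768, 384  # illustrative values only

# Doubling the number of patches doubles the token-mixing cost ...
print(token_mixing_macs(392, channels, hidden_tokens)
      // token_mixing_macs(196, channels, hidden_tokens))  # 2
# ... but quadruples the cost of the attention map.
print(attention_macs(392, channels) // attention_macs(196, channels))  # 4
```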

MLP Mixer Architecture and its Comparison to Previous Models

In this section we will discuss the model architecture of the MLP Mixer and compare it to previous models. We start with the high-level architecture and will gradually reveal the details.

As the title of the paper suggests, the MLP Mixer is an architecture for vision models. From a high-level perspective, it is very similar to the ViT model, as indicated in Fig. 1.

Fig. 1: Comparison of the high-level architecture of (left) Vision Transformer [3] and (right) MLP Mixer [1].

Both the ViT and the MLP Mixer are classification networks: they take an image as input and output class probabilities. On a high level, both models linearly project image patches into an embedding space. While the ViT performs this step with strided convolutions, the MLP Mixer uses a fully-connected layer, since one of its objectives is to show that neither convolutions nor attention layers are required (a convolution whose kernel size and stride both equal the patch size is mathematically equivalent to a fully-connected layer applied to each flattened patch). The embeddings, also called tokens, are in both models fed into their respective main building block for computation. In the case of the ViT, this block is based on the transformer encoder layer, while the MLP Mixer introduces a new architecture, as we will detail in a second. It’s important to note that the MLP Mixer does not require additional position embeddings, since, unlike the attention layer, it is sensitive to the order of the input embeddings. After several main computation layers, the signal is fed into the classification head, where the model predicts a class for the given input image.
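
The patch embedding step with a fully-connected layer can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the image size, patch size, number of channels and the random weights are assumptions, and the function names are made up for this example.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(image, weight, bias, patch_size):
    """Project every flattened patch with the same fully-connected layer."""
    return patchify(image, patch_size) @ weight + bias


rng = np.random.default_rng(0)
patch_size, channels = 8, 64                 # illustrative hyperparameters
image = rng.standard_normal((32, 32, 3))     # stand-in for a real image
weight = rng.standard_normal((patch_size * patch_size * 3, channels)) * 0.02
bias = np.zeros(channels)

tokens = embed_patches(image, weight, bias, patch_size)
print(tokens.shape)  # (16, 64): 16 patches, each embedded into 64 channels
```

The resulting matrix of shape [patches, channels] is what the subsequent layers operate on.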

Now, let’s take a closer look at the main building blocks depicted in Fig. 2.

Fig. 2: Comparison of (left) transformer encoder block of the ViT [3] and (right) MLP Mixer layer. Illustration by author, inspired by [2].

The left side depicts the transformer encoder as used in the ViT and the right side illustrates the MLP Mixer layer as proposed in [1]. Both layers are repeated several times (L or N times, respectively) and follow an isotropic design, meaning that their input and output have the same shape. On this level of abstraction, the only difference lies in the attention mechanism. The ViT relies on a multi-head self-attention mechanism, which requires three linearly projected embeddings: the query, the key and the value. This layer computes, for each query, an importance weight for every key, resulting in an attention map. The attention map captures global dependencies between embeddings, unlike a convolution, which only considers a local neighborhood (in CNNs, the global context is usually captured by stacking several convolution layers, which typically decrease the spatial width and increase the number of channels). The attention map is then multiplied with the value embeddings to emphasize important values while suppressing unimportant ones. The MLP Mixer, on the other hand, replaces the self-attention mechanism with an MLP block enclosed between two matrix transposition operations to capture the global context. To understand how this works, we dive further into the detailed architecture of the Mixer layer depicted in Fig. 3.
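
For reference, the attention computation described above can be sketched as follows. This is a simplified, single-head version with scaled dot-product scores and no output projection; the dimensions and random weights are assumptions made for illustration only.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over a (tokens, dim) matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (tokens, tokens)
    attention_map = softmax(scores, axis=-1)     # each row sums to 1
    return attention_map @ v, attention_map


rng = np.random.default_rng(0)
num_tokens, dim = 16, 64                          # illustrative sizes
tokens = rng.standard_normal((num_tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))

output, attention_map = self_attention(tokens, w_q, w_k, w_v)
print(attention_map.shape)  # (16, 16): grows quadratically with the tokens
print(output.shape)         # (16, 64): same shape as the input
```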

Fig. 3: Mixer layer of the MLP Mixer [1].

This architecture is built on a simple observation by the authors of [1]: modern vision architectures mix their features (1) at a given spatial location across channels and (2) between different spatial locations. A CNN implements (1) within a layer but usually achieves (2) through consecutive convolutional layers, which decrease the spatial width and increase the number of channels by applying more and more filters. Attention-based architectures allow (1) and (2) within each layer. The intention of the MLP Mixer is to clearly separate (1) and (2), which the authors refer to as channel-mixing and token-mixing, respectively.

First, we consider token-mixing, performed by the first MLP, i.e., MLP1. After the first normalization layer, the data is represented as a matrix of shape [patches, channels]. Channels (the hidden dimension of the embeddings) is a hyperparameter that can be varied. Patches refers to the number of patches the input image was divided into. To mix data across the tokenized patches, the input is transposed to [channels, patches] before MLP1 is applied to each row, with the same weights shared across all rows. The output of MLP1 is transposed again to restore the initial shape. Because each row fed into the MLP contains information from every patch, global context can be perceived.
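
The transpose, MLP, transpose pattern of token-mixing can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the sizes and random weights are made up, normalization and skip connections are left out, and GELU is computed with its common tanh approximation.

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp(x, w1, b1, w2, b2):
    """Two fully-connected layers with a GELU in between, applied to each row."""
    return gelu(x @ w1 + b1) @ w2 + b2


def token_mixing(x, w1, b1, w2, b2):
    """x has shape (patches, channels); information is mixed across patches."""
    x_t = x.T                        # (channels, patches)
    y_t = mlp(x_t, w1, b1, w2, b2)   # same weights for every row (channel)
    return y_t.T                     # back to (patches, channels)


rng = np.random.default_rng(0)
patches, channels, hidden = 16, 64, 32   # illustrative sizes
x = rng.standard_normal((patches, channels))
w1 = rng.standard_normal((patches, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, patches)) * 0.1
b2 = np.zeros(patches)

y = token_mixing(x, w1, b1, w2, b2)
print(y.shape)  # (16, 64)
```

Note that the weight matrices only depend on the number of patches and the hidden width, not on the number of channels, which reflects the weight sharing across rows.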

The second MLP, i.e., MLP2, performs channel-mixing. MLP2 has different weights than MLP1, but also uses weight sharing: the same MLP is applied independently to every patch. MLP2 thus receives data from all channels of a single patch, allowing information from different channels to interact.

Each MLP block consists of a fully-connected layer, followed by a GELU activation, followed by another fully-connected layer.
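
Putting the pieces together, a complete Mixer layer can be sketched as below. The layer normalization and skip connections follow the layer as described in [1]; everything else is an assumption for illustration: the sizes, the random initialization, the tanh approximation of GELU, a layer norm without learned scale and offset, and all function names.

```python
import numpy as np


def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def mlp_block(x, w1, b1, w2, b2):
    """Fully-connected layer, GELU, fully-connected layer."""
    return gelu(x @ w1 + b1) @ w2 + b2


def layer_norm(x, eps=1e-6):
    """Normalize over the channel axis (learned scale/offset omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_mlp(rng, dim, hidden):
    """Hypothetical helper: small random weights, zero biases."""
    return (rng.standard_normal((dim, hidden)) * 0.02, np.zeros(hidden),
            rng.standard_normal((hidden, dim)) * 0.02, np.zeros(dim))


def mixer_layer(x, token_params, channel_params):
    """x has shape (patches, channels); the output has the same shape."""
    # Token-mixing (MLP1): transpose so that each row holds all patches.
    y = mlp_block(layer_norm(x).T, *token_params).T
    x = x + y                                   # skip connection
    # Channel-mixing (MLP2): each row holds all channels of one patch.
    y = mlp_block(layer_norm(x), *channel_params)
    return x + y                                # skip connection


rng = np.random.default_rng(0)
patches, channels = 16, 64                      # illustrative sizes
x = rng.standard_normal((patches, channels))
token_params = init_mlp(rng, patches, hidden=32)
channel_params = init_mlp(rng, channels, hidden=128)

out = mixer_layer(x, token_params, channel_params)
print(out.shape)  # (16, 64)
```

Because the input and output shapes match, the layer can be stacked several times without any reshaping in between.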


Experiments

One important question has not yet been answered: how well does the MLP Mixer actually perform? To answer this question, the authors of [1] conducted several experiments with models of different scales and on different datasets. For more details, I recommend reading the paper; I’ll only cover the main results.

To conduct the experiments, several models were first pre-trained on different datasets and then fine-tuned on different downstream tasks. Three quantities were analyzed:

  1. Accuracy on the downstream task
  2. Total computational cost of pre-training
  3. Test-time throughput

Table 1 shows the main results with the following explanation of the columns:

Column 1: Tested model
Column 2: Top-1 accuracy on the ImageNet downstream task with the original labels
Column 3: Top-1 accuracy on the ImageNet downstream task with the cleaned-up ReaL labels
Column 4: Average top-1 accuracy across all five downstream tasks (ImageNet, CIFAR-10, CIFAR-100, Pets, Flowers)
Column 5: Top-1 accuracy on the Visual Task Adaptation Benchmark (VTAB)
Column 6: Throughput in images/sec/core on TPU-v3
Column 7: Total pre-training time on TPU-v3 accelerators

Table 1: Transfer performance, inference throughput, and training cost of the MLP Mixer compared to state-of-the-art models from literature [1].

The Mixer model is competitive in terms of accuracy with all other tested models across all downstream tasks. In terms of test-time throughput, Mixer outperforms ViT and BiT.

It can be observed that, when pre-trained on the large JFT-300M dataset (300M images, 18k classes), the MLP Mixer requires clearly less pre-training time than its competitors, while on the smaller ImageNet-21k dataset (14M images, 21k classes) it is actually slower to train than its competitor models.

It seems that, especially for large datasets, the MLP Mixer is a competitive alternative to other state-of-the-art (SOTA) models, since it almost achieves SOTA performance while being more efficient during training and at test time. One must decide which metric is more important for one’s respective application.


Conclusion

The MLP Mixer tackles the quadratic scaling of the computational resources required by attention layers by introducing a simplified architecture consisting of MLPs and transpositions. It achieves near-SOTA performance while decreasing the need for computational resources. This is a trade-off that one must weigh for a given application.

[1] Tolstikhin, Ilya, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, et al. “MLP-Mixer: An All-MLP Architecture for Vision.” ArXiv:2105.01601 [Cs], June 11, 2021.

[2] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” ArXiv:1706.03762 [Cs], December 5, 2017.

[3] Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. “An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale.” ArXiv:2010.11929 [Cs], June 3, 2021.

