Harder, Better, Faster, Lighter Deep Learning with Direct Feedback Alignment
A gentle guide to improving deep learning efficiency through randomness and lasers
In this article, we are going to talk about one of the most interesting subjects I have faced since I discovered deep learning: Direct Feedback Alignment (DFA). This method is a mind-bending alternative to backpropagation, the standard training method in deep learning, and it promises more efficient training. By the end of this article, you’ll understand how mixing neural networks, lasers and randomness can deliver much more power!
This article is not meant as an introduction to neural networks. Even though we will do a brief recap on the maths behind neural nets, it is made to share notations and to introduce the differences between backpropagation and DFA. If you want to know more about neural networks and deep learning before diving in, you can read this article.
Neural networks 101 recap
Neural networks are widely used because, given enough units, they can approximate a very broad class of functions to arbitrary accuracy, and because they scale well to high-dimensional data. Their main use is supervised learning, where they infer the relation between inputs and outputs from examples by reducing the error between predictions and targets. However, they go beyond that and can be used to minimize or maximize any given numerical goal.
Neural networks are composed of neuron layers connected to one another. This structure allows the input signal to be processed into an output signal using a series of non-linear transformations, which in the end creates a global complex transformation. The transformation performed by a given neural network is defined by its parameters (a.k.a. weights & biases) and activation functions. For the sake of clarity, we will only consider fully connected networks in this article.
Let’s now dive into the maths behind it.
First, how is an input processed by the neural network:
The input is fed to the first layer, where a linear transformation is applied using the weights and biases. The output of the linear transformation is then passed through the activation function, which introduces non-linearity (using functions like sigmoid, tanh, swish, or ReLU and its variants; ReLU is piecewise linear, but it is still a non-linear function). The process is then repeated through the other layers, using the result of the activation function as the input to the next layer.
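As a rough illustration of the forward pass described above, here is a minimal NumPy sketch. The layer sizes, the choice of tanh for hidden layers and sigmoid for the output, and the function names are all assumptions made for the example, not something prescribed by the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations of every layer for a single input vector x.

    Assumption for this sketch: tanh hidden layers and a sigmoid output.
    """
    activations = [x]
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        a = W @ h + b                               # linear transformation
        is_last = i == len(weights) - 1
        h = sigmoid(a) if is_last else np.tanh(a)   # non-linear activation
        activations.append(h)
    return activations

# Hypothetical fully connected network: 4 inputs, two hidden layers of 8, 1 output.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]
weights = [rng.normal(0.0, 0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

y_hat = forward(rng.normal(size=4), weights, biases)[-1]
```

Because the weights are drawn at random, `y_hat` carries no meaning yet; it only becomes a useful prediction once the weights have been trained.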
At the beginning of the process, the resulting outputs of the forward pass (and thus the network’s predictions) are essentially random due to the fact that the weights have been randomly initialized (which is the most common type of initialization in neural networks).
The fundamental question in neural networks is:
How do I change the weights so that the neural net does what I want it to do?
Knowing how to change the weights depends on what we mean by “What do I want the network to do”. The specification of this goal is done through an objective function, usually referred to as the Loss. The loss (denoted L or J) is a function of the neural network’s parameters (the weights and biases, denoted theta) which describes how to compare the output of our neural network to the expected output. It is essentially a quantification of how wrong the network was and is thus made to be minimized. For example, in binary classification, a common loss is the binary cross-entropy:
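For reference, the binary cross-entropy over N examples is usually written L = -(1/N) Σ [y log(ŷ) + (1 - y) log(1 - ŷ)], where y is the target (0 or 1) and ŷ the predicted probability. A small sketch in plain Python, where the clipping constant `eps` is an assumption added to avoid taking log(0):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between targets (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep the logarithms finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and right gives a small loss; confident and wrong gives a large one.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```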
Hence in this scenario, the goal is to minimize this loss function to make the network do what we want. We can now update our question to:
How do I change the weights so that the neural net minimizes the loss?
The loss tells us how far we are from our objective. We now want a way to propagate this information through the network to change the weights (and biases) to values that would result in a lower loss. This means uncovering the contribution of each individual weight to the output (and thus to the loss) so that we can adapt them properly. Let’s begin by adapting the weights of the last layer. To do this we first want to know how our loss (or error) varies with the output of the last layer:
Now that we know how the loss varies with the output layer, we want to use this knowledge to make the activation vary in a way that will reduce the loss. We thus want to know how to change the weights in such a way that will induce the desired variation on the activation:
Now we apply the same scheme to the other layers:
And … voilà! We’ve updated our weights (a similar process applies to biases) to reduce our loss. We can now repeat the process on new examples to incrementally lower the loss until convergence.
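The backpropagation steps above can be sketched for a small three-layer network. The notation here is an assumption chosen to match the names used in the text (`e` for the last layer’s error, `delta2` and `delta1` for the signals sent to the hidden layers); biases are left out for brevity, hidden layers use tanh, and the output uses a sigmoid with the binary cross-entropy loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2, W3):
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    y_hat = sigmoid(W3 @ h2)
    return h1, h2, y_hat

def bce(y, y_hat):
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum())

def backprop_grads(x, y, W1, W2, W3):
    """Gradients of the loss w.r.t. W1, W2, W3 for one example."""
    h1, h2, y_hat = forward(x, W1, W2, W3)
    e = y_hat - y                              # error at the output (sigmoid + BCE)
    delta2 = (W3.T @ e) * (1 - h2 ** 2)        # needs e and the transpose of W3
    delta1 = (W2.T @ delta2) * (1 - h1 ** 2)   # needs delta2 and the transpose of W2
    return np.outer(delta1, x), np.outer(delta2, h1), np.outer(e, h2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(4, 5))
W3 = rng.normal(size=(1, 4))

dW1, dW2, dW3 = backprop_grads(x, y, W1, W2, W3)

# A gradient-descent step would then be: W <- W - learning_rate * dW
learning_rate = 0.1
W1_new = W1 - learning_rate * dW1
```

Note how `delta1` cannot be computed before `delta2`: each layer waits for the signal from the layer after it.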
So what’s the problem with backpropagation?
Although this process works, it doesn’t come without flaws:
- Backpropagation is a sequential process: we need to compute the gradient of the last layer before computing the gradient of the layer before it, which is in turn needed for the layer before that, and so on. For deeper networks, this limits how much of the backward pass can run in parallel and makes the process slow.
- The gradient computation requires the transpose of each layer’s weight matrix: if the transpose is computed explicitly, this costs O(n^2) operations and extra memory for an n × n matrix, and in any case the backward pass needs access to the forward weights. For larger networks, this can quickly become a problem.
Another issue, less problematic but still relevant according to the authors of the original paper, is that backpropagation is not “biologically plausible”. The argument is that real synapses transmit information in one direction only, so there is no known mechanism for an error signal to travel backward through the same weights.
« The backprop learning algorithm is powerful, but requires biologically implausible transport of individual synaptic weight information. For backprop, neurons must know each other’s synaptic weights. […] On a computer it is simple to use the synaptic weights in both forward and backward computations, but synapses in the brain communicate information unidirectionally »
Source : Random feedback weights support learning in deep neural networks, T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, 2014
We’re not going to discuss the veracity of this claim in this article, but we will still note that this belief led to the birth of methods aiming to be more biologically plausible. The train of thought is that if we mimic the brain more closely, we will have better methods since the brain is more efficient than the existing methods. One of those more biologically plausible methods is the one we’re interested in today.
So, can we solve backpropagation’s flaws?
Direct Feedback Alignment
Let’s start from backpropagation and construct our dream update without sequential computation and matrix transposition. We don’t care if it makes sense or not… yet.
To remove the dependency on the previous layer, and thus remove sequentiality to allow for parallel computation, we make all layers use the same feedback: Instead of using the gradient of the loss w.r.t the previous layer’s activation (delta 2 in the equation above), we will use the last layer’s error (denoted e in the equations above) across all layers.
Then, to remove the matrix transpose, we replace the transposed weights matrix with a fixed random matrix of the appropriate size. No transposition computation to run each update anymore, just generate one random matrix per layer and use it for all computations.
Voilà, we have an expression whose backward pass doesn’t require any sequential computation, thanks to the absence of dependencies between layers: once we have computed the last layer’s error, we can compute all the other layers’ updates simultaneously (the forward pass itself is unchanged). We also removed the matrix transposition, which should accelerate the computation even further! This update is called Direct Feedback Alignment (DFA):
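To make the DFA update concrete, here is a hedged sketch for a small three-layer network. As assumptions for the example: biases are omitted, hidden layers use tanh, the output uses a sigmoid with binary cross-entropy, and the feedback matrices `B1` and `B2` (names chosen here) are drawn once from a standard normal distribution and never trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dfa_updates(x, y, W1, W2, W3, B1, B2):
    """DFA update directions for W1, W2, W3 on one example."""
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    y_hat = sigmoid(W3 @ h2)
    e = y_hat - y                          # last layer's error, shared by all layers
    delta1 = (B1 @ e) * (1 - h1 ** 2)      # fixed random B1 instead of a backward chain
    delta2 = (B2 @ e) * (1 - h2 ** 2)      # does not depend on delta1 or on W3.T
    # The last layer is updated exactly as in backpropagation.
    return np.outer(delta1, x), np.outer(delta2, h1), np.outer(e, h2)

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 3, 5, 4, 1
W1 = rng.normal(size=(n_h1, n_in))
W2 = rng.normal(size=(n_h2, n_h1))
W3 = rng.normal(size=(n_out, n_h2))
B1 = rng.normal(size=(n_h1, n_out))    # one fixed random matrix per hidden layer
B2 = rng.normal(size=(n_h2, n_out))

x, y = rng.normal(size=n_in), np.array([1.0])
dW1, dW2, dW3 = dfa_updates(x, y, W1, W2, W3, B1, B2)
```

Since `delta1` and `delta2` only depend on `e`, nothing prevents computing them at the same time, which is the parallelism the text refers to.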
DFA is part of a family of similar techniques:
- Feedback Alignment is the first step of the process, only removing the matrix transposition but keeping the backward signal.
- Direct Feedback Alignment, which we have already discussed.
- Indirect Feedback Alignment, which removes transposition and propagates the last layer’s error directly to the first layer and then proceeds to propagate it forward. This allows using a single random matrix instead of one matrix per layer, thus saving memory. In practice this version is less interesting than DFA unless you’re really tight on memory, so we won’t focus on it.
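The structural difference between Feedback Alignment and Direct Feedback Alignment can be sketched side by side. The sketch assumes tanh hidden layers and uses stand-in values for the hidden activations and the output error; the matrix names are hypothetical.

```python
import numpy as np

def fa_deltas(e, h1, h2, B2, B1):
    """Feedback Alignment: the backward chain is kept, W.T is replaced by a fixed random B."""
    delta2 = (B2 @ e) * (1 - h2 ** 2)          # B2 stands in for the last layer's W.T
    delta1 = (B1 @ delta2) * (1 - h1 ** 2)     # still waits for delta2: sequential
    return delta1, delta2

def dfa_deltas(e, h1, h2, B2, B1_direct):
    """Direct Feedback Alignment: every hidden layer receives the output error e."""
    delta2 = (B2 @ e) * (1 - h2 ** 2)
    delta1 = (B1_direct @ e) * (1 - h1 ** 2)   # no dependence on delta2
    return delta1, delta2

rng = np.random.default_rng(0)
n_h1, n_h2, n_out = 5, 4, 3
h1 = np.tanh(rng.normal(size=n_h1))            # stand-in hidden activations
h2 = np.tanh(rng.normal(size=n_h2))
e = rng.normal(size=n_out)                     # stand-in output error

B2 = rng.normal(size=(n_h2, n_out))
B1 = rng.normal(size=(n_h1, n_h2))             # FA: maps layer 2's signal to layer 1
B1_direct = rng.normal(size=(n_h1, n_out))     # DFA: maps the output error to layer 1

fa_d1, fa_d2 = fa_deltas(e, h1, h2, B2, B1)
dfa_d1, dfa_d2 = dfa_deltas(e, h1, h2, B2, B1_direct)
```

The only change is the shape and the input of the matrix feeding the first hidden layer: in FA it consumes `delta2`, in DFA it consumes `e` directly.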
But wait a minute …. with this “better update” we’ve basically destroyed backpropagation in the process haven’t we? How can our new update allow for training neural networks? How can a random matrix be of use in a meaningful update?
That’s a very good question! First, let’s see how it compares to backpropagation:
Astonishing isn’t it?
We achieve similar results to BP (better would have been really shocking, as we have only made the signal worse) while allowing for simpler computations. We can get the same results faster!
But … how?
That’s great but … how? Let’s try to understand why our seemingly broken update still carries a meaningful signal for our network.
The authors of the original paper explain that the network “learns how to learn”. Mhhh, we’ve seen clearer. They state that since the feedback is now a fixed, randomly transformed version of the error, the feedforward weights have to adapt on their own to make use of it. The feedback is no longer explicit, it’s implicit.
« The network learns how to learn - it gradually discovers how to use B (the random matrix), which then allows effective modification of the hidden units. At first, the updates to the hidden layer are not helpful, but they quickly improve by an implicit feedback process that alters W so that e^T W B e > 0. To reveal this, we plot the angle between the hidden-unit updates prescribed by feedback alignment and backprop, ∆h_FA ∠ ∆h_BP. Initially the angles average about 90°. But they soon shrink, as the algorithm begins to take steps that are closer to those of backprop. This alignment of the ∆h’s implies that B has begun to act like W^T. And because B is fixed, the alignment is driven by changes in the forward weights W. In this way, random feedback weights come to transmit useful teaching signals to neurons deep in the network. »
Source: Random feedback weights support learning in deep neural networks, T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, 2014
Let’s try and interpret that, shall we? Basically, the signal is much poorer in terms of meaningful information, however, the update direction is approximately the right one, allowing for learning nonetheless. The update direction is approximately the same as with backpropagation, hence the term “alignment”.
The updates of DFA “oscillate” around the true BP updates.
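One way to look at this alignment is to measure the angle between the signal backpropagation would send to a hidden layer and the one a random feedback matrix sends. The sketch below uses stand-in values (a random forward matrix `W_out`, a random feedback matrix `B` and a random error `e`, all hypothetical) and ignores the activation derivative for simplicity.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle in degrees between two vectors (or flattened matrices)."""
    u, v = np.ravel(u), np.ravel(v)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

rng = np.random.default_rng(0)
n_hidden, n_out = 64, 3
W_out = rng.normal(size=(n_out, n_hidden))   # forward weights of the last layer
B = rng.normal(size=(n_hidden, n_out))       # fixed random feedback matrix
e = rng.normal(size=n_out)                   # stand-in for an output error

bp_signal = W_out.T @ e      # what backpropagation would send to the hidden layer
random_signal = B @ e        # what the random feedback sends instead
angle = angle_degrees(bp_signal, random_signal)
```

An angle below 90° means the two signals have a positive inner product, so following the random feedback still moves, to first order, in a direction that reduces the loss. Tracking this angle during training is how the alignment described in the quote can be observed.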
Let’s try to build some intuition about why that happens (Most of this intuition building comes from Yannic Kilcher’s video about the article with a wrapper of my own understanding).
Even though the backpropagated signal is modulated by a random matrix, it does not remove all the information from the error. This is because multiplying a vector (the error e in our case) by a suitably scaled random matrix acts as a projection that approximately conserves angles and distances (why that happens has to do with the Johnson-Lindenstrauss lemma, but I’m not 100% sure about the details, so we will just assume that it’s true in this article). In other words, multiplying our signal by a random matrix behaves roughly like applying a fixed random rotation, with little other distortion.
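This property can be checked numerically. The sketch below assumes a Gaussian random matrix scaled by 1/sqrt(k), which is one common construction for such projections, and hypothetical sizes; it compares norms and cosines before and after the multiplication.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, k = 10, 2000                              # error dimension and (wide) hidden dimension
B = rng.normal(size=(k, d)) / np.sqrt(k)     # scaling keeps norms unchanged in expectation

u, v = rng.normal(size=d), rng.normal(size=d)

norm_ratio = float(np.linalg.norm(B @ u) / np.linalg.norm(u))   # close to 1
cos_before = cosine(u, v)
cos_after = cosine(B @ u, B @ v)                                # close to cos_before
```

The preservation is only approximate and gets tighter as `k` grows, so "random rotation" should be read as an intuition rather than an exact statement.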
Let’s try and visualize that by starting with the last layer. For the sake of the example, we will assume that there are 3 possible classes in a classification problem: c1, c2, and c3. Our network makes a prediction for each example (y hat), and the update tells us in which direction and how much we should change the weights to correctly attribute the example to its class. In other words, we want to align the prediction (y hat, a vector of probabilities over the different classes, e.g. [0.7,0.1,0.2]) with the correct class (e.g. c1:[1,0,0]). In that case, the modification to make would be [0.3,-0.1,-0.2].
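The numbers in this example can be reproduced in a few lines of plain Python. The variable names are chosen for illustration only.

```python
y_hat = [0.7, 0.1, 0.2]     # predicted probabilities for c1, c2, c3
target = [1.0, 0.0, 0.0]    # the example belongs to c1

# Desired change of the prediction: target minus prediction.
modification = [t - p for t, p in zip(target, y_hat)]

# The error used in the update equations is usually written with the opposite sign.
error = [p - t for p, t in zip(y_hat, target)]
```

Up to floating-point rounding, `modification` is [0.3, -0.1, -0.2]: raise the probability of c1 and lower the other two.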
Now for the previous layers, we have to deal with the random multiplication. This means that our coordinate system has been randomly rotated while conserving angles and distances. In this system, the error does not communicate how we should modify the prediction to make it closer to the target, but how to make it closer to some random rotation of the target.
From the point of view of an individual example, this is bad, as the updates seem to be going in a randomly wrong direction. However, if we consider multiple examples, since the rotation stays the same, examples belonging to the same class will still be brought closer together, as they will be projected onto the same rotated class. As a consequence, the propagated signal cannot specify the absolute direction in which the weights should be updated. However, we can still do this in a relative manner: instead of moving the weights in the right absolute direction at every update, the network learns to bring together examples that should be close together. In some sense, this comes down to clustering the examples and then classifying the clusters instead of classifying the examples individually.
In other words, we do propagate the error, but we don’t propagate how it should solve the final task. We only propagate how examples should be grouped together. This leaves the task for the final layer to actually classify the examples greatly simplified. To confirm that, let’s see how the internal representation of examples fed to the network compares between BP and DFA through the network’s layers.
Looking at those representations confirms that this “clustering” approach leads to similar results to the representations constructed by BP. In other words, we achieve the same result in a different and simpler way as it seems that the hidden layers do not need to explicitly map examples to a specific part of the space, they only need to cluster them. The hidden layers construct transformations that allow for easy separation for the last layer in both methods. It’s the hidden features that matter, not the way we get them. DFA is a weaker version of the BP update that conserves the clustering property and therefore, the learning property. DFA is “good enough” to work.
Now that we have a clearer intuition of why it works, let’s move back to its performance.
Why should I use DFA?
As we have seen, DFA converges about as fast as backpropagation in terms of the number of training examples required.
However, an update of a network using DFA should be much quicker to compute than with backprop, thanks to the lack of transpose and the ability to compute all the layers’ updates in parallel. Although we will see exact figures later, it is hard to compare DFA to BP in terms of speed because of the next point.
If DFA is faster while preserving performance, why aren’t we using it then?
Backpropagation has been around and heavily studied since 1986. Basically, all of the existing deep learning revolves around it. The authors claim that the lack of optimized implementation of DFA and the habits of researchers and practitioners prevent the adoption of DFA. However, another trick should help DFA win …
Lasers, yes, Lasers.
Until now we have compared DFA and BP on standard hardware (CPU or GPU). However, because DFA’s update relies on multiplication by a fixed random matrix, we can leverage a very interesting physical phenomenon: multiple light scattering.
« A monochromatic laser at 532nm is expanded by a telescope, then illuminates a digital micromirror device (DMD), able to spatially encode digital information on the light beam by amplitude modulation, as described in section III B. The light beam carrying the signal is then focused on a random medium by means of a lens. Here, the medium is a thick (several tens of microns) layer of TiO2 (Titanium Dioxide) nanoparticles (white paint pigments deposited on a microscope glass slide). The transmitted light is collected on the far side by a second lens, passes through a polarizer, and is measured by a standard monochrome CCD camera. »
Source : Random Projections through multiple optical scattering: Approximating kernels at the speed of light, A. Saade, F. Caltagirone, I. Carron, L. Daudet, A. Drémeau, S. Gigan, F. Krzakala, 2015
In other words :
We’re physically computing the random matrix multiplication at the literal speed of light.
You may say that this technology is limited to labs and research papers… but no, it already exists! The French start-up LightOn already designs and sells Optical Processing Units (OPUs) that do precisely that (random matrix multiplication is useful in other applications as well).
« LightOn, founded in 2016, has worked with in partnership with Paris-based cloud computing service provider OVH Group and claims performance improvements for certain machine learning tasks.
The OPU uses laser light that is shone on to a digital micromirror device (DMD) to encode for 1s and 0s with the light then redirected through a lens and random scattering medium assembly before being polarize and read by a conventional camera. This allows very large matrices to be manipulated in parallel. One of the operations that can be done is kernel classification. Typically, the DMD can handle matrices of the order of 1k by 1k.
For a task called transfer learning the OPU showed six-fold speed up at five times greater energy efficiency than a GPU-based solution. This translates to 30x less power consumed. Another benchmark on time-series analysis with a recurrent neural network demonstrated a 200x speedup over conventional CPUs with large RAMs. »
To summarize, this technology makes it possible to run random matrix multiplications faster, with much better scaling, and on much bigger inputs.
Through this article we have discovered an alternative to backpropagation that, through smart mathematical tricks, approximates the same updates with an expression that is much simpler to compute. Despite being less explicit in the signal it transmits, this expression still propagates the significant information and thus allows for correct updates in an implicit manner. In addition, this way of computing the updates opens the gate to greatly accelerated computation through dedicated hardware that leverages optical phenomena to compute at the literal speed of light.
In summary, once we have access to efficient implementations in common libraries, as well as affordable and easy-to-use dedicated hardware, Direct Feedback Alignment has the potential to bring huge improvements in computation speed, memory requirements and energy consumption for deep learning and all its applications.