First, let’s clarify what we mean by “learning simple algorithms”. The copy-paste algorithm is one such algorithm: given a series of arbitrary-size input vectors, we want the outputs to be the same as the inputs. Now you might ask: isn’t that the identity function? Surely even the simplest neural network can learn that! That is true, but only for a fixed input size. A standard neural network cannot learn an arbitrary identity function that works for any input size. You can try this yourself: train a neural network to learn the identity function for a vector of length 10, then test the accuracy on vectors of length 20. The performance will be poor. To sum up, our goal is to learn the general form of an algorithm with a single neural network, not just one specific case of the algorithm.
Computers are good at the general form of algorithms. For example, when you copy (Ctrl-C) and paste (Ctrl-V) in a text editor, it doesn’t matter how long the selected text is; the computer will execute the copy-paste perfectly for any length. Part of the reason computers are so good at these kinds of algorithms is their memory system, which lets the computer store inputs, process them, and retrieve them later when needed. In contrast, the only “storage” a neural network has is the value of the network weights. The capacity of the weights is far less than the gigabytes of memory on a computer. That’s why we want to add a memory system to a neural network: to provide the storage space that many algorithms need to execute successfully.
So how did the researchers implement the memory system? The idea is pretty simple. The “memory” is a large 2-D matrix of dimension N x M. Each of the N rows represents one unit of memory, stored as a vector of length M. N and M are hyperparameters set before the network runs. The next step is to define how to read from and write to this memory. This is trickier. Depending on the algorithm being learned, we might need to read from only one row in the matrix, or from all the rows in the matrix. It’s also possible that some rows will matter more than others. To accommodate all these cases, we introduce a weight vector w_read of length N, with one weight per row. A “read” from the memory is then the weighted sum of all the matrix rows: read = Σ_i (w_read_i * row_i), which is itself a vector of length M. (For the rest of this article, w_read, w_write, e, a, and row_i are vectors, while indexed weights such as w_read_i and w_write_i are scalars.)
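The read described above can be sketched in a few lines. This is a minimal illustration that uses plain Python lists so it runs without dependencies; the function name `read_memory` and the small 3 x 2 memory are assumptions made for the example, not something from the original work.

```python
def read_memory(memory, w_read):
    """Return the weighted sum of memory rows: sum_i w_read[i] * memory[i]."""
    n_rows = len(memory)
    n_cols = len(memory[0])
    if len(w_read) != n_rows:
        raise ValueError("w_read needs exactly one weight per memory row")
    result = [0.0] * n_cols
    for weight, row in zip(w_read, memory):
        for j, value in enumerate(row):
            result[j] += weight * value
    return result


# Hypothetical memory with N = 3 rows and M = 2 columns.
memory = [[1.0, 2.0],
          [3.0, 4.0],
          [5.0, 6.0]]

one_row = read_memory(memory, [0.0, 1.0, 0.0])  # all weight on row 1 -> [3.0, 4.0]
blend = read_memory(memory, [0.5, 0.5, 0.0])    # average of rows 0 and 1 -> [2.0, 3.0]
```

Putting all the weight on one row recovers an ordinary lookup, while spreading the weight blends several rows, which is how one weight vector covers both cases mentioned above.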
Writing to the memory is more complicated. First, we need to decide what to do with the existing rows in memory. Therefore we need another parameter e. This parameter is a vector of length M (the same as the rows), and has individual components between 0 and 1. The elementwise product between e and row_i determines how much of row_i remains. For instance, if e has all 0 components, that means we don’t want to keep any of row_i. If e has all 1 components, that means we want to keep all of row_i. And if e is somewhere in the middle, we want to keep some components of row_i more than others. Notice that e is the same for all rows.
Once we decide what to keep, we next have to add any new information. Therefore we have another parameter a, a vector of length M that we add to each row. Finally, we have a weight vector w_write of length N that determines the size (or “importance”) of the write to each row in the matrix. The total update per row is therefore row_i_new = e * row_i_old + w_write_i * a, where e * row_i_old is an elementwise product and w_write_i * a scales the vector a by the scalar w_write_i.
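Here is a minimal sketch of that update rule, again with plain Python lists. The function name `write_memory` and the example values are hypothetical; the code follows the simplified rule exactly as this article states it, with e acting as a "keep" vector shared by every row.

```python
def write_memory(memory, w_write, e, a):
    """Apply row_i_new = e * row_i_old + w_write_i * a to every row."""
    new_memory = []
    for weight, row in zip(w_write, memory):
        new_row = [e_j * old + weight * a_j
                   for e_j, old, a_j in zip(e, row, a)]
        new_memory.append(new_row)
    return new_memory


# Hypothetical memory with N = 2 rows and M = 2 columns.
memory = [[1.0, 2.0],
          [3.0, 4.0]]
e = [1.0, 0.0]        # keep column 0 entirely, erase column 1 entirely
a = [10.0, 20.0]      # new information to add
w_write = [1.0, 0.0]  # write only to row 0

updated = write_memory(memory, w_write, e, a)
# updated == [[11.0, 20.0], [3.0, 0.0]]
```

Note that row 1 loses its second component even though its write weight is 0. That follows from e being the same for all rows in this formulation.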
We’ve now described how individual rows are read from and written to the memory matrix. We did this in terms of some parameters: w_read, w_write, e, and a. The natural question is: where do these parameters come from? Recall that the overall goal is to combine a neural network with a memory system. Therefore we make these parameters the output of the neural network part of our system. More precisely, we’ll set up a neural network with two regions of output: the read head and the write head. The read head region of the network is responsible for outputting the parameter needed for reading: w_read. Similarly, the write head region of the network outputs the writing parameters: w_write, e, and a. Finally, the vector read from memory using the read head’s w_read is returned as the output of the entire system.
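One way to picture the two output regions is to slice a single output vector into the four parameters. The sketch below is only an illustration under several assumptions that are not in the article: the network is a single random linear layer, w_read and w_write are normalized with a softmax, e is squashed into (0, 1) with a sigmoid, and the sizes N, M, and INPUT_SIZE are arbitrary.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by the input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def controller_step(weights, x, n_rows, n_cols):
    """Split one output vector into read-head and write-head parameters."""
    raw = linear(weights, x)
    # Read head region: one weight per memory row (softmax is an assumption).
    w_read = softmax(raw[:n_rows])
    # Write head region: w_write, then e, then a.
    w_write = softmax(raw[n_rows:2 * n_rows])
    e = [sigmoid(v) for v in raw[2 * n_rows:2 * n_rows + n_cols]]
    a = raw[2 * n_rows + n_cols:]
    return w_read, w_write, e, a


random.seed(0)
N, M, INPUT_SIZE = 4, 3, 5  # hypothetical sizes
n_outputs = 2 * N + 2 * M   # N + N + M + M values for w_read, w_write, e, a
weights = [[random.uniform(-1, 1) for _ in range(INPUT_SIZE)]
           for _ in range(n_outputs)]
x = [random.uniform(-1, 1) for _ in range(INPUT_SIZE)]

w_read, w_write, e, a = controller_step(weights, x, N, M)
```

The sigmoid guarantees that every component of e stays between 0 and 1, as the write rule requires. The softmax is simply one convenient way to keep the row weights positive and summing to 1.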
One major advantage of this architecture is that the entire thing is differentiable. Notice that the read and write operations are built only from sums and products, so each one is linear in any single parameter when the others are held fixed, and therefore differentiable. This makes the network conceptually simple to train: compute the gradient of the loss on the output with respect to each parameter and do gradient descent.
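To make the differentiability claim concrete, the sketch below compares a hand-derived gradient of the read operation against a finite-difference estimate, then takes one gradient-descent step. The squared-error loss, the target vector, and the learning rate are assumptions chosen for the example. For simplicity the step updates w_read directly, whereas in the full system the gradient would flow back into the network weights that produce w_read.

```python
def read_memory(memory, w_read):
    n_cols = len(memory[0])
    return [sum(w * row[j] for w, row in zip(w_read, memory))
            for j in range(n_cols)]


def loss(memory, w_read, target):
    """Squared error between the read vector and a target (an assumed loss)."""
    out = read_memory(memory, w_read)
    return sum((o - t) ** 2 for o, t in zip(out, target))


def analytic_grad(memory, w_read, target):
    # d loss / d w_read[i] = sum_j 2 * (out[j] - target[j]) * memory[i][j]
    out = read_memory(memory, w_read)
    return [sum(2.0 * (o - t) * m for o, t, m in zip(out, target, row))
            for row in memory]


def numeric_grad(memory, w_read, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(w_read)):
        plus = list(w_read)
        minus = list(w_read)
        plus[i] += eps
        minus[i] -= eps
        diff = loss(memory, plus, target) - loss(memory, minus, target)
        grads.append(diff / (2 * eps))
    return grads


memory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w_read = [0.2, 0.3, 0.5]
target = [3.0, 4.0]

g_analytic = analytic_grad(memory, w_read, target)
g_numeric = numeric_grad(memory, w_read, target)

# One gradient-descent step with an arbitrary learning rate.
lr = 0.01
w_new = [w - lr * g for w, g in zip(w_read, g_analytic)]
loss_before = loss(memory, w_read, target)
loss_after = loss(memory, w_new, target)
```

The two gradients agree to within numerical error, and the loss drops after the step, which is all that gradient-based training needs from the read operation.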