Deep lifelong learning — drawing inspiration from the human brain
Replicating learning mechanisms from the human brain to prevent catastrophic forgetting in deep neural networks
Living beings continually acquire and improve knowledge and skills, adapting to new environments, circumstances and tasks. This ability is arguably the reason we, as a species, are still alive. As Darwin's theory of evolution is often summarised, the best chances of survival belong not to the strongest but to the fittest.
Unlike animals, which can adapt throughout their lives, many machine learning algorithms, including neural networks, are fixed during their lifetime, which limits their use cases. The model is trained once and its parameters stay frozen during inference.
However, in the real world models would benefit from the ability to efficiently process streams of information, learn multiple tasks and deal with uncertainty in the input.
Endeavours to expand the knowledge of a trained neural network often result in catastrophic forgetting, where the information acquired in earlier training stages is lost. The problem is particularly notable when concept drift is present in the new data, i.e. when the distribution of the training data changes considerably over time.
The field of lifelong learning (also called continual, sequential or incremental learning) is concerned with developing techniques and architectures that enable models to learn sequentially without the need to re-train from scratch.
Interestingly, catastrophic forgetting does not happen in humans as we learn new things: taking driving lessons does not result in forgetting how to ride a bike. Given the brilliance of the animal brain, it is perhaps not surprising, but nevertheless fascinating, that many successful lifelong learning methods take their inspiration from nature and replicate biological processes happening in the animal brain.
This article is a short summary of, and introduction to, the concepts and problems associated with lifelong learning. I will first talk about example use cases, then follow up with the biological factors affecting learning processes in the human brain and how they relate to lifelong learning methods.
Possible use cases and motivation
To better understand the scope of the problems that lifelong learning addresses, and its potential pitfalls, let's consider a few practical examples where lifelong learning might come in handy.
Chatbot: Imagine building an online assistant chatbot for a bank. The bot is trained fairly well to converse on the topics it has already been exposed to. However, the bank decides to expand its services with a new credit card line. When a customer enquires about the new service, the chatbot flags the topic as one it has not seen before and transfers the customer to an operator, who then has the conversation with the customer.
A model with the ability to learn while deployed would pick up the new topic from this conversation between the customer and the operator, without the need to be re-trained from scratch on the whole conversation base, saving time and money.
Production line: A candy factory producing red and blue candies decides to extend its production line with a new green candy. Most of the time the candies are mixed together, but they need to be sorted for packaging. This part of the chain relies on a computer vision classification algorithm. The sorting algorithm now needs to be extended to classify a new colour without re-training from scratch. This case is more challenging than the chatbot, as we would also need to extend the model with a new output and thus change the network's structure.
Can we just do a backward pass while the model is deployed?
The short answer is yes, we can. But we would also risk messing up the network's weights and losing the knowledge learned in the main training stage. Without additional mechanisms, neural networks are prone to forgetting previously acquired knowledge. But what exactly does this mean, and why does it happen?
Neural networks have limited resources (weights), which can be tuned rather effectively to a specific task if the network is exposed to this task over and over again. But since these resources are finite, new knowledge either squeezes in alongside the old knowledge and enriches it, or pushes it out.
Consider, for example, a network that learns to predict the age of dogs (this can be very useful for animal shelters). The network would encode features, perhaps related to the relative position and proportion of the dog's facial elements (say we trained only on Huskies and German Shepherds). Gradient descent on the loss would move the weights towards a local minimum for the task of identifying the age of these two breeds. If we were to extend the network with more breeds (e.g. a Yorkie), we could potentially get away with learning features that are common across all dogs. However, if we were now to extend the network to identifying the age of domesticated parrots, the loss would push the weights towards the features important for the age of parrots. These would intuitively be very different from the features required for dogs. And if the network is no longer exposed to images of Huskies, Yorkies and Shepherds, we run into catastrophic forgetting: losing all learned knowledge about the age of dogs.
To summarise, continual acquisition of incrementally available information from non-stationary data distributions leads to catastrophic forgetting, or interference, in neural networks. New information overwrites previous knowledge in the shared representations. In offline learning this loss is recovered because the examples are shuffled and repeated across epochs, so the model keeps seeing every task.
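The effect can be reproduced on a toy problem. The sketch below is only an illustration under assumptions of my own: a two-weight linear model stands in for the network, and two made-up regression targets (`task_a` for "dogs", `task_b` for "parrots") stand in for the real tasks. The model is trained on the first task, then only on the second, and the error on the first task is measured before and after.

```python
import random

def sgd(w, data, lr=0.1, epochs=50):
    """Plain SGD on squared error for the model y_hat = w[0]*x0 + w[1]*x1."""
    w = list(w)
    for _ in range(epochs):
        for (x0, x1), y in data:
            err = w[0] * x0 + w[1] * x1 - y
            w[0] -= lr * err * x0
            w[1] -= lr * err * x1
    return w

def mse(w, data):
    """Mean squared error of the linear model on a dataset."""
    return sum((w[0] * x0 + w[1] * x1 - y) ** 2 for (x0, x1), y in data) / len(data)

rng = random.Random(0)
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)]

# Hypothetical tasks: same inputs, conflicting targets on the shared weights.
task_a = [((x0, x1), 2.0 * x0 + 1.0 * x1) for x0, x1 in inputs]   # "dogs"
task_b = [((x0, x1), -1.0 * x0 + 3.0 * x1) for x0, x1 in inputs]  # "parrots"

w = sgd([0.0, 0.0], task_a)
loss_a_before = mse(w, task_a)   # close to zero: task A is learned

w = sgd(w, task_b)               # continue training on task B only
loss_a_after = mse(w, task_a)    # much larger: task A has been overwritten
loss_b_after = mse(w, task_b)    # close to zero: task B is learned

print(f"task A loss before: {loss_a_before:.6f}, after: {loss_a_after:.3f}")
print(f"task B loss after:  {loss_b_after:.6f}")
```

Nothing in the second training run tells the model that the old solution mattered, so the shared weights simply move to wherever task B wants them.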
Biological concepts behind learning
As the name "neural networks" suggests, some of the mechanisms in these models have already been inspired by the animal brain. Perhaps our brain also contains hints that could be the key to preventing catastrophic forgetting in deep models?
Our brain constantly learns and memorises. Through evolution these functions have been refined to suit our daily life, without sudden breakdowns after learning a new word or a skill. Two major concepts from neuroscience that are relevant to the way neural networks learn are the stability-plasticity dilemma and the complementary learning systems theory. Both can give us hints on how to prevent catastrophic forgetting and develop efficient algorithms for lifelong learning.
Stability refers to the ability of the brain to retain previously acquired knowledge, and plasticity refers to the ability to acquire new knowledge.
The brain is particularly plastic during critical periods of early development. You might have noticed that kids learn very fast, but they also forget almost as quickly. Retaining knowledge at an early age is difficult, as the system constantly changes in search of the most important signals. That is why repetition of new material is key for children's learning. How many times have you heard kids being asked: "What does the cat say?".
Plasticity becomes less prominent as the biological system stabilises: we become more dependent on learnt knowledge and operate from experience. Nevertheless, the brain preserves a certain degree of plasticity for adaptation and reorganisation at smaller scales.
Complementary learning systems (CLS) theory
The theory generalises the neural mechanisms of memory formation. It is based on differentiating the complementary roles of the hippocampus and the neocortex in the learning and memory formation processes.
Within memory formation, the hippocampal system is responsible for short-term adaptation and rapid learning of new information. The neocortical system, on the other hand, is tailored for long-term storage that is very hard to overwrite: it has a slow learning rate and is designed to learn generalities.
The hippocampal system rapidly encodes episodic-like events, which are then repeatedly played back over time to the neocortical system for long-term retention. It is believed that the consolidation of recent memories into long-term storage occurs during rapid eye movement (REM) sleep.
Lifelong learning in neural networks
Mechanisms developed to deal with catastrophic forgetting do indeed take inspiration from nature, from both the complementary learning systems theory and the stability-plasticity dilemma. Not all of the common approaches described below work well; however, trying them out was also important for finding the right track forward.
I provide the links to the literature used in the summary section below.
Replay methods are based on repeatedly exposing the model to both the new data and the data on which it has already been trained, when there is access to the latter. New data is interleaved with already seen examples in a batch and fed to the model's training step. In the simplest case, old data is randomly sampled for every new batch.
Despite its simplicity, this method has several caveats:
- What is the best strategy for choosing which examples to rehearse on? Some already seen examples are more informative than others. The most common strategies look for the more informative samples, or for those closest to the mean of the features originally learnt by the model.
- Seeing only a fraction of the training data also raises the question of how to ensure that the model does not overfit. The proposed methods rely on regularisation.
- Finding the right ratio of new to old samples in the batch is also not trivial.
Overall, replay methods provide marginal performance gains at the cost of longer training time and larger memory requirements.
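The simplest variant described above, random sampling of old data into every new batch, can be sketched as follows. This is a minimal illustration, not a reference implementation: the `ReplayBuffer` class, the `mixed_batch` helper and the `old_fraction` parameter are hypothetical names, and reservoir sampling is just one possible way of keeping a fixed-size, uniformly sampled memory of a stream.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        """Keep each example seen so far with equal probability."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k distinct stored examples uniformly at random."""
        return self.rng.sample(self.items, min(k, len(self.items)))

def mixed_batch(new_examples, buffer, old_fraction=0.5):
    """Interleave new examples with replayed old ones.

    old_fraction is the share of old examples in the final batch (must be < 1).
    """
    if not 0.0 <= old_fraction < 1.0:
        raise ValueError("old_fraction must be in [0, 1)")
    n_old = int(round(len(new_examples) * old_fraction / (1.0 - old_fraction)))
    batch = list(new_examples) + buffer.sample(n_old)
    buffer.rng.shuffle(batch)
    return batch

# Usage: remember a stream of old examples, then build a batch for new data.
buffer = ReplayBuffer(capacity=10)
for i in range(100):
    buffer.add(("old", i))
batch = mixed_batch([("new", i) for i in range(8)], buffer, old_fraction=0.5)
print(len(batch), sum(1 for tag, _ in batch if tag == "new"))
```

The `old_fraction` argument is exactly the ratio mentioned in the last caveat; the sketch does not say how to choose it, and the buffer treats all examples as equally worth rehearsing.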
Unlike rehearsal methods, regularisation methods avoid the higher memory requirements. They focus on developing a loss term that consolidates all previously learnt knowledge.
Prior-based models estimate a distribution over the model parameters and do not let the model deviate too much from it when exposed to new data.
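One way to picture such a loss term is a quadratic penalty that pulls each weight back towards its value after the old task, scaled by how important that weight was. The sketch below makes several assumptions of its own: the model is a two-weight linear regressor, the two tasks are made up, and the importance of a weight is taken to be the mean squared input it multiplies (the diagonal Fisher information of a linear model with unit Gaussian noise). The names `train`, `anchor`, `importance` and `lam` are illustrative only.

```python
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def train(w, data, anchor=None, importance=None, lam=0.0, lr=0.05, epochs=200):
    """SGD on squared error, optionally plus the penalty
    lam/2 * sum_i importance[i] * (w[i] - anchor[i])**2."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            for i in range(len(w)):
                grad = err * x[i]
                if anchor is not None:
                    grad += lam * importance[i] * (w[i] - anchor[i])
                w[i] -= lr * grad
    return w

def make_task(rng, true_w, scale, n=50):
    """Noise-free linear task; scale[i] sets the input range of feature i."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-s, s) for s in scale]
        data.append((x, predict(true_w, x)))
    return data

rng = random.Random(0)
# Hypothetical tasks: A relies mostly on feature 0, B mostly on feature 1.
task_a = make_task(rng, true_w=[2.0, 0.0], scale=[1.0, 0.1])
task_b = make_task(rng, true_w=[-1.0, 3.0], scale=[0.1, 1.0])

w_a = train([0.0, 0.0], task_a)
# Importance of each weight for task A: mean squared input it multiplies.
importance = [sum(x[i] ** 2 for x, _ in task_a) / len(task_a) for i in range(2)]

w_plain = train(w_a, task_b)                                   # no protection
w_reg = train(w_a, task_b, anchor=w_a, importance=importance, lam=1.0)

loss_a_plain = mse(w_plain, task_a)
loss_a_reg = mse(w_reg, task_a)
loss_b_reg = mse(w_reg, task_b)
print(f"task A loss, plain: {loss_a_plain:.3f}, penalised: {loss_a_reg:.3f}")
print(f"task B loss, penalised: {loss_b_reg:.3f}")
```

No old examples are stored while training on the second task; only the old weights and one importance value per weight are kept, which is where the memory saving over replay comes from.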
Instead of exposing the model to already seen data, or consolidating the knowledge in the loss term, we could make efficient use of the available network resources to prevent critical areas from being overwritten, or expand the network to allow for more compute and storage space.
No additional resources: If we cannot expand the network by adding new weights, i.e. resources are fixed, the model could use non-overlapping representations for distinct concepts. Methods of this type are also called parameter isolation methods.
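In its crudest form, isolation can be pictured as a mask over the weights: parameters claimed by an earlier task are excluded from later updates. The sketch below assumes a per-weight importance score is already available; `freeze_mask`, `masked_sgd_step` and the threshold are hypothetical names and choices, not taken from any particular method.

```python
def freeze_mask(importance, threshold):
    """Mark as frozen every parameter whose importance exceeds the threshold."""
    return [score > threshold for score in importance]

def masked_sgd_step(weights, grads, frozen, lr=0.1):
    """One SGD step that leaves frozen parameters untouched."""
    return [
        w if is_frozen else w - lr * g
        for w, g, is_frozen in zip(weights, grads, frozen)
    ]

# Usage: the first weight mattered for the old task, the second did not.
weights = [2.0, 0.0]
frozen = freeze_mask(importance=[0.9, 0.01], threshold=0.5)
weights = masked_sgd_step(weights, grads=[1.0, -1.0], frozen=frozen)
print(frozen, weights)
```

The old task's representation is preserved exactly, at the price that the frozen capacity is no longer available to new tasks.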
Parameters that are important for a given task or class are isolated and frozen. Akin to the stability-plasticity dilemma, Elastic Weight Consolidation (EWC) slows down learning in the parts of the model that are vital for previously learnt tasks, while Incremental Moment Matching (IMM) matches the moments of the parameter distributions of two networks trained on separate tasks within a single architecture, capturing the knowledge learnt by both and preventing catastrophic forgetting.
With additional resources: If we have the capability to expand the network, we can also allocate additional neural resources to new knowledge. In this configuration it is possible to train directly on new examples without the need to interleave them with the original dataset. However, this approach obviously requires more memory: in the simplest case we would be adding a new neural network for every portion of new knowledge.
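The moment matching idea can be sketched for the simplest case of merging two weight vectors. The code is an illustration under assumptions: each network's parameter distribution is approximated by a Gaussian with a diagonal precision (here called `fishers`), and the functions `mean_imm` and `mode_imm`, the mixing weights `alphas` and the small `eps` term are names and choices made for this example.

```python
def mean_imm(thetas, alphas):
    """Match first moments: a weighted average of the parameter vectors."""
    n = len(thetas[0])
    return [sum(a * t[i] for a, t in zip(alphas, thetas)) for i in range(n)]

def mode_imm(thetas, fishers, alphas, eps=1e-8):
    """Precision-weighted average: each parameter follows the network
    that is most certain about it (largest diagonal precision)."""
    merged = []
    for i in range(len(thetas[0])):
        num = sum(a * f[i] * t[i] for a, f, t in zip(alphas, fishers, thetas))
        den = sum(a * f[i] for a, f in zip(alphas, fishers)) + eps
        merged.append(num / den)
    return merged

# Usage: network A is certain about weight 0, network B about weight 1.
theta_a, theta_b = [2.0, 0.0], [-1.0, 3.0]
fisher_a, fisher_b = [1.0, 0.0], [0.0, 1.0]
print(mean_imm([theta_a, theta_b], [0.5, 0.5]))
print(mode_imm([theta_a, theta_b], [fisher_a, fisher_b], [0.5, 0.5]))
```

Plain averaging ignores which network actually needs which weight, while the precision-weighted merge keeps weight 0 close to network A's value and weight 1 close to network B's.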
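A much lighter form of expansion than adding a whole network is growing only the output layer, as in the candy example where a green class joins red and blue. The sketch below is a hypothetical `LinearHead` class written for this article; it assumes the feature extractor in front of it stays unchanged and shows only that adding a class leaves the existing outputs untouched.

```python
import random

class LinearHead:
    """A linear classification layer that can grow new output units."""

    def __init__(self, n_features, n_classes, seed=0):
        self.rng = random.Random(seed)
        self.n_features = n_features
        self.weights = [self._new_row() for _ in range(n_classes)]
        self.biases = [0.0] * n_classes

    def _new_row(self):
        # Small random initial weights for one output unit.
        return [self.rng.gauss(0.0, 0.01) for _ in range(self.n_features)]

    def add_class(self):
        """Allocate a fresh output unit; existing units are left as they are."""
        self.weights.append(self._new_row())
        self.biases.append(0.0)
        return len(self.weights) - 1       # index of the new class

    def logits(self, x):
        return [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(self.weights, self.biases)
        ]

# Usage: a red/blue sorter gains a green class.
head = LinearHead(n_features=3, n_classes=2)
x = [0.2, -0.5, 1.0]
before = head.logits(x)
green = head.add_class()
after = head.logits(x)
print(green, after[:2] == before, len(after))
```

The new unit still has to be trained, and doing that without disturbing the shared features brings back the same forgetting problem that the replay and regularisation methods try to address.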
Exploring lifelong learning algorithms is of vast importance if we are to make another leap in AI, especially with the advancement of AutoML and AutoAI, which seemed futuristic not long ago. That is why we see more and more publications, especially from top research institutes like DeepMind.
Lifelong learning is still a fairly new topic. Although several survey papers present a rather well-rounded overview of what has been achieved to date, constantly changing requirements make them outdated rather quickly. The lack of standardised protocols and datasets also makes it challenging to evaluate different strategies. However, as we know from other areas, the field will soon mature.