Machine unlearning: The duty of forgetting



Image by Roman Mager

What is the right to be forgotten? What is machine unlearning? New attention to privacy is leading to new regulations. Machine learning models should be able to erase information, but this is a challenging task. Why? And how could we do it?

The memory of the electronic elephant

Image generated by the author with OpenAI DALL-E

In 2020, the amount of data on the internet was estimated at 64 zettabytes (a zettabyte is a trillion gigabytes). Moreover, there are more than 40 billion images on Instagram, 340 million tweets a day, countless posts on Facebook, and so on. We share a lot of data, but we also leave many traces just by browsing the internet.

Interest in privacy and data protection has grown globally in recent years. Users have become aware of how much data they share by using a myriad of apps or visiting countless websites. At the same time, users have realized that this data is collected, used, and sold. Scandals like Cambridge Analytica have sharpened the perception of the value of the data we share online.

The effectiveness of this profiling is demonstrated by how well-targeted the ads shown on social networks and in our Google searches are. The fact that algorithms can profile us so well leads us to wonder to whom this data is being sold. Indeed, harnessing the information in this data allows companies to earn billions.

However, we change, our thoughts change, and the world changes, but the data stays on the internet forever.

The right to be forgotten

Image from Tingey Injury Law Firm

The right to be forgotten is defined as “the right to have private information about a person be removed from Internet searches and other directories under some circumstances” (Wikipedia). However, there is no agreement on this definition, or on whether it should be added to the list of human rights. On the other hand, several institutions and governments are moving to discuss and propose regulations (Argentina, the European Union, the Philippines).

This concept of the right to be forgotten is based on the fundamental need of an individual to determine the development of his life in an autonomous way, without being perpetually or periodically stigmatized as a consequence of a specific action performed in the past, especially when these events occurred many years ago and do not have any relationship with the contemporary context — EU proposal

In fact, information and events from the past can still cause stigma and consequences many years later. As a simple example, James Gunn was fired from “Guardians of the Galaxy 3” by Disney after his offensive tweets resurfaced. He was fired in 2018 for tweets written between 2008 and 2011.

“My words of nearly a decade ago were, at the time, totally failed and unfortunate efforts to be provocative. I have regretted them for many years since — not just because they were stupid, not at all funny, wildly insensitive, and certainly not provocative like I had hoped, but also because they don’t reflect the person I am today or have been for some time.” — James Gunn statement

Surely, you can delete what you have tweeted or posted on Facebook and Instagram. However, it is not so easy to delete what has been shared online. For example, Facebook launched a tool called “Off-Facebook Activity” that enables users to delete the data that third-party apps and websites have shared with Facebook. However, it turned out that Facebook was merely de-linking the data from the user.

In 2014, the Spanish court ruled in favor of a man who asked that certain information be removed from Google searches. In 1998, the man had had to sell a property to repay a debt with social security. Google had refused, but both the Spanish court and the EU court ruled that Google needed to remove the search result. The court found that, since the debt had been paid long before, the search results “appear to be inadequate, irrelevant or no longer relevant or excessive in the light of the time that had elapsed.”

A video by Google about the right to be forgotten, the regulation, and how to make a request to delist content

The right to be forgotten is seen as a necessity in many cases, preventing one from being indexed in search engines for revenge porn, petty crimes committed in the past, unpaid debts, and so on. Those who criticize this right, however, see the legislation as an attack on the right to criticism and freedom of expression. The European Union has tried to strike a balance between the right to privacy and the freedom of criticism and expression.

Machine learning is perceived to be able to exacerbate the problem by collecting and analyzing all this data (from emails to medical data) by holding the information forever. Furthermore, using this information in insurance, medical, and loan application models can lead to obvious harm and amplify bias.

How to teach the machine to forget

Image from Robert Linder

Machine unlearning is a nascent field of artificial intelligence whose goal is to remove from a model all traces of a selected data point (a selective amnesia) without affecting the model's performance. Machine unlearning has several applications: from granting the right to be forgotten to preventing AI models from leaking sensitive information. Moreover, machine unlearning would also be helpful against data poisoning and adversarial attacks.

Companies spend millions of dollars to train and deploy large AI models, and they would rather not retrain or remove them. However, EU and US regulators are warning that models trained on sensitive data could be forced to be removed. The UK government, in a report focused on AI frameworks, explained that machine learning models could be subject to data deletion under the GDPR. For example, Paravision improperly collected millions of face photos and was forced by the US Federal Trade Commission to delete both the data and the trained models.

However, machine unlearning is not an easy task as highlighted by a seminal paper in the field:

  • We have limited knowledge of how a data point affects the model. This is especially difficult with large neural networks, where there are many layers and a single data point can affect many parameters.
  • Stochasticity in training. During neural network training we use small batches of data that are randomly selected, and their order changes from epoch to epoch, so it is difficult to reconstruct the flow of data during training.
  • Training is incremental. If the model is updated in the presence of a training data point, all subsequent model updates depend in some implicit way on that data point.
  • Stochasticity in learning. It is challenging to correlate a data point with the hypothesis learned from it.

The simplest approach would be to delete the data point from the training data and retrain the model. However, this is clearly expensive: OpenAI is estimated to have spent between 10 and 20 million dollars to train GPT-3. Thus, we need better and cheaper alternatives.
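As a toy sketch of this naive "exact unlearning" baseline, consider a hypothetical model that just memorizes the mean of its training data (not a real neural network, so retraining here is instant; for a large model this same delete-and-retrain step is exactly what costs millions):

```python
# Toy illustration of exact unlearning by retraining. The "model" is
# just the mean of the training data; deleting a point means dropping
# it and retraining from scratch on the remaining data.

def train(data):
    """Fit the toy model: memorize the mean of the training set."""
    return sum(data) / len(data)

def unlearn_by_retraining(data, point):
    """Exact unlearning: drop the point, then retrain from scratch."""
    remaining = [x for x in data if x != point]
    return train(remaining), remaining

model = train([2.0, 4.0, 6.0, 8.0])   # mean of the full set: 5.0
model, data = unlearn_by_retraining([2.0, 4.0, 6.0, 8.0], 8.0)
# the retrained model carries no trace of the deleted point
```

The guarantee is perfect (the new model provably never saw the deleted point), which is why retraining is the gold standard that cheaper unlearning methods are measured against.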

One of the best-known methods, the SISA approach, was proposed in 2019 by researchers at the University of Toronto. The Sharded, Isolated, Sliced, and Aggregated (SISA) approach processes the data in multiple pieces. The idea is that if a data point has to be deleted, only a fraction of the input data needs to be reprocessed. In simple words, the dataset is divided into different shards that are presented to the model incrementally during training. The parameters are saved before each new piece is added, allowing retraining to resume from just before the point to be unlearned was used.

SISA approach: instead of retraining the whole model, you just need to reprocess part of the input. Image from the original paper (here)
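A minimal sketch of the shard-level idea can make this concrete. The shard "models" below are toy means, and the slicing/checkpointing within each shard from the real SISA method is omitted; the point is only that a deletion retrains one shard while the others are untouched:

```python
# Toy shard-level sketch of SISA: partition the data into disjoint
# shards, train an independent sub-model per shard, and aggregate
# their outputs. Unlearning a point retrains only its shard.

def train_shard(shard):
    """Toy sub-model: the mean of the shard's data."""
    return sum(shard) / len(shard)

class SISAEnsemble:
    def __init__(self, data, num_shards):
        # Partition the dataset into disjoint shards (round-robin here).
        self.shards = [list(data[i::num_shards]) for i in range(num_shards)]
        self.models = [train_shard(s) for s in self.shards]

    def predict(self):
        # Aggregate: average the constituent shard models.
        return sum(self.models) / len(self.models)

    def unlearn(self, point):
        # Find the single shard containing the point and retrain only
        # that shard; all other shard models keep their parameters.
        for i, shard in enumerate(self.shards):
            if point in shard:
                shard.remove(point)
                self.models[i] = train_shard(shard)
                return
```

With S shards, a deletion costs roughly 1/S of a full retraining, which is the trade SISA makes: cheaper unlearning in exchange for an ensemble of weaker sub-models.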

However, this approach is not free of flaws: it can efficiently forget only a limited number of data points, and only if the deletion requests do not arrive in an unfavorable sequence. Thus, in 2021 an article was published that aimed to solve these issues; its authors claimed their approach allows the removal of many more data points.

Another promising approach is differential privacy, where companies collect and share only noisy, aggregate information about user habits, preserving the privacy of individuals. Microsoft, Google, and Apple are investing in this technology, but it is still not widely used.
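As a hedged sketch of the classic building block of differential privacy, the Laplace mechanism adds noise calibrated to a query's sensitivity and a privacy budget epsilon, so an aggregate count can be released without exposing any individual record (a toy illustration, not any company's actual implementation):

```python
import random

# Toy Laplace mechanism: add Laplace(0, sensitivity / epsilon) noise
# to an aggregate query so that the presence or absence of any single
# record cannot be confidently inferred from the released answer.

def laplace_noise(scale):
    # Laplace(0, b) is the difference of two independent Exp(1/b) draws.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon=1.0):
    """Counting query (sensitivity 1) answered with epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(scale=1.0 / epsilon)

# Count even user IDs without revealing any individual's membership.
noisy = private_count(range(100), lambda r: r % 2 == 0, epsilon=1.0)
```

Smaller epsilon means more noise and stronger privacy; the released count is approximately right in aggregate while each individual retains plausible deniability.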

Although the topic is relatively new, several articles have already been published, and their number will grow in the future.

Parting thoughts

The right to be forgotten is the right to have an imperfect past — Suzanne Moore

In 2022, the right to be forgotten was reaffirmed by several rulings (Italy, Argentina, India). In general, the GDPR stipulates that companies must delete user data if requested. At present, case law requires each request to be analyzed on a case-by-case basis. However, in recent years there has been an increased focus by institutions on both privacy and artificial intelligence.

Most likely more new regulations will be passed in the coming years. California enacted a Right to be Forgotten law in 2020 (California Consumer Privacy Act), North Carolina is moving in the same direction, and there are also discussions at the federal level. In addition, the EU is discussing regulating other aspects of artificial intelligence, and as seen above AI models may be affected by the right to be forgotten.

On the other hand, we need to balance both privacy and the right to expression, preventing the right to be forgotten from being used as a form of censorship. In addition, new technologies such as blockchain open up new challenges to be solved.

Moreover, people are much more sensitive to the topic today. Companies have picked up on this sentiment, and many of them are moving to increase user privacy. Recently, for example, Google announced an expanded policy for citizens of the United States for removing personal data (email addresses and physical addresses, handwritten signatures, non-consensual explicit or intimate personal images, and so on) from search results.

As mentioned, we need to find a way to ensure that when data are deleted, the AI models trained on them are also cleansed of the information extracted from those data points. Machine unlearning is not an easy challenge, but some approaches have already been tested and others are in development. In conclusion, although machine unlearning is a relatively new field, it is a growing one and will be instrumental as regulations increase.

If you work with machine unlearning or are interested in AI ethics, I am interested in your opinion.

If you have found it interesting:

You can look for my other articles, you can also subscribe to get notified when I publish articles, and you can also connect or reach me on LinkedIn. Thanks for your support!

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.

Or feel free to check out some of my other articles on Medium:

Additional resources

  • Additional information about the right to be forgotten: here, here, here
  • A GitHub repository presenting a large collection of articles about machine unlearning: here
  • A seminar about machine unlearning: here
  • About differential privacy: here


