Adversarial Attacks: Introduction and Example




Attacking Machine Learning Models. Turning Cats into Lemons.

Concept

Adversarial examples are specially crafted inputs designed to fool a machine learning (ML) model into a high-confidence misclassification. The interesting part is that the modifications made to an image are subtle, yet significant enough to trick an ML model. In this article, I would like to demonstrate how such slight changes can lead to catastrophic effects. The image below summarizes the process of an adversarial attack:

Image by the author.

Consider the image of the cat above. We add a small perturbation that has been calculated specifically to make the image be recognized as a lemon with high confidence. To be more specific, we take the image and compute the loss with respect to the desired label, which is ‘lemon’ in this example. We then take the sign of the gradient of that loss with respect to the input image, multiply it by some small constant epsilon, and nudge the image in the direction that reduces the loss. After multiple such iterations, we obtain an image of the cat that our ML model classifies as a lemon with high confidence.
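
Written out as a formula, one step of this targeted update can be sketched as the standard iterative sign-gradient (FGSM-style) rule; the notation here is mine, not a line from my notebook:

$$x_{t+1} = \operatorname{clip}_{[0,1]}\Big( x_t - \epsilon \cdot \operatorname{sign}\big( \nabla_{x} L\big(f(x_t),\, y_{\text{target}}\big) \big) \Big)$$

where $f$ is the model, $L$ is the classification loss, $y_{\text{target}}$ is the ‘lemon’ label, and $\epsilon$ is a small constant.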

This approach is simple and straightforward to understand, yet quite effective. However, adversarial examples can be very dangerous. For example, attackers might make my AI lemonade-making robot squeeze my cat and make another lemonade. It would be very sad 😦

Example

As an example, I am going to take a ResNet50 pre-trained on ImageNet. There are 1,000 classes in the list overall; I am using a Siamese cat as the initial input, and my desired label is a lemon.
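
If you want to follow along, here is a minimal sketch of this setup using PyTorch and torchvision. The file name cat.jpg and the exact preprocessing choices are placeholders, not the exact code I used:

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50
from PIL import Image

# ResNet50 pre-trained on ImageNet, switched to evaluation mode.
# (Newer torchvision versions prefer resnet50(weights="IMAGENET1K_V1").)
model = resnet50(pretrained=True)
model.eval()

# Standard ImageNet normalization statistics, shaped for broadcasting over (N, C, H, W).
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

# Load the cat image; the file name is a placeholder.
img = Image.open("cat.jpg").convert("RGB")
x = transforms.ToTensor()(img).unsqueeze(0)  # (1, 3, H, W), values in [0, 1]
```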

Image by the author.

As seen, the model correctly classifies my image as ‘Siamese cat, Siamese’. Note that the confidence is low because the image is larger than the images the model was trained on. Now we will try to fool our model into classifying it as a lemon.

This is my helper function for making a prediction. It takes my PIL image of a cat as input and prints out the predicted class as well as its probability.
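
A minimal version of such a helper might look like this; the label file imagenet_classes.txt is a placeholder for whichever ImageNet label list you have at hand, and it reuses the model, mean and std from the setup sketch above:

```python
import torch.nn.functional as F

# ImageNet class names, one per line in index order; the file name is a placeholder.
with open("imagenet_classes.txt") as f:
    classes = [line.strip() for line in f]

def predict(x):
    """Print the top-1 ImageNet class and its probability for a [0, 1] image tensor x."""
    with torch.no_grad():
        probs = F.softmax(model((x - mean) / std), dim=1)
        prob, idx = probs.max(dim=1)
    print(f"{classes[idx.item()]}: {prob.item():.4f}")

predict(x)  # should report 'Siamese cat, Siamese' on the unmodified image
```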

As I have already described, the attack procedure is summarized in the ‘attack’ method. I run this function 10 times, and that is enough to make our ResNet50 misclassify the image as a lemon. Note that we only take the signs of the gradients, which are either 1 or -1, and multiply them by epsilon, which is 1e-6.
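
Here is a minimal sketch of such an ‘attack’ step, following the targeted sign-gradient procedure from the Concept section. The epsilon of 1e-6 and the 10 iterations match the description above; the exact function body and the way the ‘lemon’ index is looked up are illustrative rather than my exact code:

```python
def attack(x, target_idx, epsilon=1e-6):
    """One targeted step: nudge x so the loss towards target_idx decreases."""
    x_adv = x.clone().detach().requires_grad_(True)
    logits = model((x_adv - mean) / std)
    loss = F.cross_entropy(logits, torch.tensor([target_idx]))
    loss.backward()
    # Take only the sign of the gradient (+1 / -1), scale it by epsilon, and step
    # against it to reduce the loss on the target label; clamp back to [0, 1].
    return (x_adv - epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

# 'lemon' is one of the 1,000 ImageNet classes; this assumes the label list
# spells it exactly "lemon".
target_idx = classes.index("lemon")

# Run the attack for 10 iterations, as described above.
for _ in range(10):
    x = attack(x, target_idx)

predict(x)  # the model should now lean towards 'lemon'
```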

Image by the author.

Voilà! We achieved our goal. The model now classifies our cat as a lemon with a really high probability, yet we can clearly see that the image is still visually a cat.

Photo by Paul Hanaoka via Unsplash

Some Last Words

As you can see, adversarial attacks are pretty simple and fun. However, they can be potentially dangerous, and they call the reliability of AI into question, which has made them a major research area in recent years. Reinforcement learning agents can also be manipulated by adversarial examples; I will leave a link to a useful blog post by OpenAI on the topic. If I have managed to get you interested, I encourage you to study it on your own. Thank you for the time you spent reading my article. Stay confident and don’t let small changes affect you 🙂
