Evaluating adversarial examples with similarity metrics in Python



For a classification neural network, an adversarial example is an input image that has been deliberately perturbed (strategically modified) so that the model classifies it incorrectly. Various algorithms exploit the inner workings of a given classification model, such as its gradients and feature maps, and modify the input image so that it is either simply misclassified (untargeted attack) or misclassified into a specific, chosen class (targeted attack).
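To make the untargeted/targeted distinction concrete, here is a minimal NumPy sketch of a single signed-gradient (FGSM-style) step. It assumes a toy linear softmax classifier with random weights rather than a real image model. The names `W`, `b`, `fgsm`, and the 16-pixel "image" are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(W, b, x, label):
    """Loss of the toy linear classifier (logits = W @ x + b) for one label."""
    return -np.log(softmax(W @ x + b)[label])


def loss_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss with respect to the input x."""
    grad_logits = softmax(W @ x + b)
    grad_logits[label] -= 1.0  # d(loss)/d(logits) = probabilities - one_hot(label)
    return W.T @ grad_logits


def fgsm(W, b, x, label, eps, targeted=False):
    """One signed-gradient step of size eps, clipped back to the [0, 1] pixel range."""
    step = eps * np.sign(loss_gradient(W, b, x, label))
    # Untargeted: move to increase the loss on the true label.
    # Targeted: move to decrease the loss on the chosen target label.
    x_adv = x - step if targeted else x + step
    return np.clip(x_adv, 0.0, 1.0)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))   # hypothetical 3-class model on a 16-pixel "image"
b = np.zeros(3)
x = rng.uniform(0.2, 0.8, size=16)
eps = 0.01

logits = W @ x + b
true_label = int(np.argmax(logits))    # treat the model's prediction as the true label
target_label = int(np.argmin(logits))  # use the least likely class as the attack target

x_untargeted = fgsm(W, b, x, true_label, eps)
x_targeted = fgsm(W, b, x, target_label, eps, targeted=True)

print("loss on true label:  ", cross_entropy(W, b, x, true_label),
      "->", cross_entropy(W, b, x_untargeted, true_label))
print("loss on target label:", cross_entropy(W, b, x, target_label),
      "->", cross_entropy(W, b, x_targeted, target_label))
```

The only difference between the two modes is the sign of the step and which label the gradient is taken for. Iterative attacks repeat steps of this kind with additional constraints.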

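In this article, we will look at a few white-box attacks (algorithms that generate adversarial examples with full knowledge of the model's inner workings) and use similarity metrics to compare how closely their adversarial images resemble the originals, which is one reason some of these attacks are considered more robust than others.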

The attacks:

Generating the attacks

A few Python libraries implement these (and other) attack algorithms and provide ready-to-use modules for generating them. CleverHans, Foolbox, and ART are three popular, regularly maintained open-source libraries for adversarial example (AE) generation. For now, we use ART.

Install the Adversarial Robustness Toolbox as below:

pip install adversarial-robustness-toolbox

Or refer to the official repository for further guidance.

Next, we generate the attacks. We use the Inception V3 model as the base model against which the algorithms generate their attacks. Because we use real-world images from the ImageNet dataset, we use the PyTorch version of Inception V3 pretrained on ImageNet.

The output of the attack.generate() method is a NumPy array containing the perturbed images in the same format as the input: (channel, height, width) for each image, with pixel values in the range [0, 1].

Sample image for each attack. Image by author

Visually, in the above images, we can see that the FGSM attack makes perturbations that are visible to the human eye. FGSM is generally considered a weaker attack than the rest. CW, by contrast, shows the ideal case: no perturbations are visible, and the attack is reported to be more robust against classifiers than the rest.

Next, we can measure the similarity of each of these attacked images to the original image.

For the human eye, it is easy to tell how similar in quality two given images are. To quantify this difference, however, we need mathematical expressions. From cosine similarity to ERGAS, several such metrics are available for measuring the “quality” of an image relative to its original version.

Typically, when new images are generated from existing images (after denoising, deblurring, or a similar operation), these metrics are used to quantify how much the regenerated image differs from the original.

The same idea applies to our use case.

We know from existing literature that DeepFool and CW are robust attacks with a higher success rate at fooling the classifier. They are also difficult to detect because of the minimal perturbation (or noise) that they add to the target image. These two points have been evaluated using the reduction in the model’s classification accuracy and the visual appearance of the image, respectively.

But let’s try to quantify the latter using these quality indices.

Read more about implementing these image similarity metrics in Python.

Similarity Metrics in Python

We’ll use the sewar library in Python to implement a few of the available metrics.

Start with pip install sewar and import the required metric functions.

Of the imported metrics, we’ll only be using PSNR, ERGAS, SSIM, and SAM. I chose these four because, for robust attacks like CW and DeepFool, they are the only ones among the imported metrics that capture and amplify the differences noticeably.
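To give a feel for what two of these scores measure, here is a from-scratch NumPy sketch of the textbook definitions of PSNR and the spectral angle (SAM). It is an illustration under assumptions, not sewar's code, and sewar's own implementations may differ in detail, for example in how SAM is averaged. The function names `psnr_reference` and `sam_reference` are made up for this sketch. Images are assumed to be (height, width, channel) arrays with values in [0, 1].

```python
import numpy as np


def psnr_reference(original, perturbed, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    diff = original.astype(np.float64) - perturbed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def sam_reference(original, perturbed, eps=1e-12):
    """Mean angle (radians) between the channel vectors of corresponding pixels."""
    a = original.reshape(-1, original.shape[-1]).astype(np.float64)
    b = perturbed.reshape(-1, perturbed.shape[-1]).astype(np.float64)
    cosine = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    )
    return float(np.mean(np.arccos(np.clip(cosine, -1.0, 1.0))))


rng = np.random.default_rng(0)
image = rng.uniform(0.1, 0.9, size=(32, 32, 3))
noise = rng.normal(size=image.shape)

# The same noise pattern at two strengths, standing in for a subtle and a visible attack.
subtle = np.clip(image + 0.01 * noise, 0.0, 1.0)
visible = np.clip(image + 0.10 * noise, 0.0, 1.0)

for name, candidate in [("subtle", subtle), ("visible", visible)]:
    print(name, "PSNR:", psnr_reference(image, candidate),
          "SAM:", sam_reference(image, candidate))
```

Note the direction of each score: a higher PSNR (and an SSIM closer to 1) means the images are more similar, while for SAM and ERGAS a lower value means they are more similar.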

The imported sewar functions can be called directly, for example:

ergas_score = ergas(original, adversarial)

Below you can see the results for various attacks and various scores. Clearly ERGAS and SAM amplify the differences across attacks more than the rest.

Similarity scores between original and adversarial image for four effective metrics. Image by author.

In line with our hypothesis, the scores show that CW-attacked images are more similar to their originals than FGSM/PGD-attacked images. In other words, adversarial images produced by CW/DeepFool stay closer to their original images than those produced by less sophisticated attacks.

Feel free to try this out on other types of attacks yourself!

Thank you for reading all the way through! You can reach out to me on LinkedIn for any messages, thoughts, or suggestions.

