Phishing Domain Detection using Neural Networks



Source: https://blog.idrive.com/2018/10/10/dont-be-a-victim-some-tips-to-avoid-phishing-scams/

Applying neural networks to domain name analysis to detect phishing

Domain Name Analysis

StreamingPhish is one implementation of domain name analysis. The idea behind domain name analysis is to train a classifier on curated data of known benign and phishing domains. In StreamingPhish, a feature vector of 348 dimensions is derived from the domain name and a LogisticRegression model is trained on it. Some of the features are:

  1. Whether a brand name exists in the domain name
  2. Whether certain defined phishing words exist in the domain name; the word list was cherry-picked by the StreamingPhish author from https://github.com/SwiftOnSecurity/PhishingRegex/blob/master/PhishingRegex.txt
  3. Whether a word similar to a phishing word exists, where similarity is measured by Levenshtein distance (sketched below)
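
To make feature 3 concrete, here is a minimal sketch of a Levenshtein-based check. The word list, threshold, and tokenization are hypothetical illustrations, not StreamingPhish's actual implementation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Hypothetical word list and threshold, for illustration only.
PHISHING_WORDS = ["login", "verify", "account", "secure"]

def has_near_phishing_word(domain, max_distance=1):
    """True if any token in the domain is within max_distance edits
    of a known phishing word."""
    tokens = domain.replace(".", "-").split("-")
    return any(levenshtein(t, w) <= max_distance
               for t in tokens for w in PHISHING_WORDS)
```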

These are great features, and the author shows pretty good results (98.9% accuracy on the test set). In this post we will first see whether we can use the features as-is in a neural network and what kind of results we get, and then whether we can eventually get away from hand-crafted features altogether.

Simple Neural Network

First, we will create a network similar to the one for MNIST in the TensorFlow examples. This network has an input layer matching the shape of the features. In MNIST the input is two-dimensional, but in our case it is one-dimensional of size 348. Then we have a dense layer of 128 nodes and finally two output nodes for the benign vs. phishing classes.
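
The original post embeds this model as a gist that did not survive the scrape; a minimal Keras sketch matching the description might look like the following (the ReLU activation, optimizer, and loss are assumptions):

```python
import tensorflow as tf

# 348-dimensional feature vector in, one hidden layer of 128, two-class output.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(348,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```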

The above code implements what we just discussed. We also get a test accuracy of 98.9%, which is comparable to the existing method. Though these are interesting results, what do we gain here? We haven't improved on accuracy (note that we aren't doing model selection or cross-validation, so we should take both the original results and ours with a grain of salt) or removed the need for features. So let us remove the features and see how we fare.

Embedding-Based Convolution Neural Network

Embedding with one-dimensional convolution is a very simple way to do text classification. In our scenario the text is a domain name, and our goal is to classify it as either benign or phishing. Here is a great StackOverflow thread that explains how it works: How does Keras 1d convolution layer work with word embeddings — text classification problem? (Filters, kernel size, and all hyperparameter)

Our simple model looks similar to the one in the StackOverflow thread:
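
The original gist is not reproduced here; a minimal sketch of an embedding-plus-Conv1D model, where the embedding dimension, filter count, and kernel size are illustrative choices:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Vocabulary of 128 because the inputs are ASCII codes (see below).
    tf.keras.layers.Embedding(input_dim=128, output_dim=32),
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```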

In our scenario, we convert each character of the domain name to its ASCII value using the following code, which makes the embedding vocabulary size 128 (the first parameter of the Embedding layer).


The convert_to_ascii method takes the string and, based on a predefined maximum length (either the longest string in the data or a manually chosen value), converts the string's characters to their ASCII values.
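
A minimal sketch consistent with that description, where zero-padding and the default maximum length are assumptions:

```python
def convert_to_ascii(text, max_length=100):
    """Map each character to its ASCII code, truncating or zero-padding
    so every sample has exactly max_length values (assumes ASCII domains)."""
    codes = [ord(ch) for ch in text[:max_length]]
    return codes + [0] * (max_length - len(codes))
```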

For this model we get a test accuracy of 96.12%, which is lower than with curated features and either Logistic Regression or the simple neural network. At the same time, this model doesn't need pre-defined features, so it is more general.

LSTM-Based Neural Network

LSTMs are widely used for text classification, so we construct a simple network with one:
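
Again, the embedded gist is not reproduced; a minimal sketch, with the embedding size and LSTM width as assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=128, output_dim=32),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```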

This model fares much worse, giving a test accuracy of 93.19%, though we have not optimized it either.

Feature-Less Input: Neural Network Performance

Even though, in this simple setup, feature-less input with a neural network fares worse than curated features, we argue there is value, as such models could generalize and handle different scenarios. Our next steps to see if we can improve the performance are the following:

  1. In general, neural networks perform better when there is more data. In this setup we have around 10k+ samples, which is not a lot, so including more data would be one way to try to improve performance
  2. Hugging Face has great tokenizers and models for text classification, so it would be worthwhile to look into those
  3. Optimizing the hyperparameters of the current simple models would be another way to improve performance

Google Colab Notebook

To make the StreamingPhish code work directly in a Google Colab notebook, we made a few changes:

  1. We downloaded StreamingPhish from GitHub, unzipped it, and installed some dependencies:
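
A sketch of what that Colab cell might look like; the archive URL assumes the public wesleyraptor/streamingphish GitHub repository, and the dependency list is a guess:

```python
# Colab cell: shell commands are prefixed with "!"
!wget https://github.com/wesleyraptor/streamingphish/archive/master.zip
!unzip -q master.zip
!pip install -q tldextract   # assumed dependency, for illustration
```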

  2. After that we made some path changes. Since the unzip happens in Colab's current working directory, we provide paths relative to that location:
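
For example, a hypothetical path adjustment; the variable name and directory layout are illustrative, not StreamingPhish's actual configuration:

```python
# Point the training-data path at the unzipped repo, relative to the
# Colab working directory (hypothetical variable name).
TRAINING_DATA_PATH = "streamingphish-master/training_data/"
```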

  3. The rest of the script remains the same and is executed as-is. Here is the final notebook; save a copy in Drive and have fun:

Medium is a great platform to learn about the latest and greatest in technology. If you aren’t a member please consider becoming one using my referral link: https://salilkjain.medium.com/membership and I would receive a portion of your membership fee. Connect with me on LinkedIn or Twitter.
