Routing network traffic based on firewall logs using Machine Learning




Image by Tumisu from Pixabay

The Internet is such an essential part of our lives nowadays that it can be considered one of the necessities for a better life. With that said, it is also necessary that it is not used for illegal activities, especially in private networks such as university campuses or corporate offices, where confidential data must be protected and secured. That's where the firewall comes into the picture. A firewall is a network security system that monitors and controls incoming and outgoing network traffic based on predefined security rules. It can therefore be thought of as a safeguard that establishes a barrier between a trusted network (i.e. the campus or office) and an untrusted network (i.e. the internet).

Business Problem:

The most important part of a firewall system is obviously the set of security rules governing the network traffic. Proper routing of the traffic ensures that the network is secured in compliance with the security policy and that users don't experience any unwanted hassle in accessing allowed web pages. Setting up a firewall system is a complex and error-prone task: it is carried out by the administrators of the network and is hence subject to human error, and moreover, with time the security rules may need to be modified to remain compliant with the security policy.

For instance, if a person wants to access a prohibited website on the network under consideration, the firewall should be able to block that particular website on the network. Corporate and university networks might have strict rules about fraudulent websites that should not be accessed by their employees or students. Given the enormity of the internet, maintaining these rules is a cumbersome task for a team of firewall administrators, as any mistake may incur a significant penalty or even expose confidential data to the outside network. To eliminate these errors and to provide stable internet traffic, proper maintenance of the firewall becomes very important for these networks. As historical network traffic data can easily be obtained from firewall log reports, machine learning models can be of great help in solving this problem, which is what this case study is all about. Sounds fun? Then strap in.

Business Constraints:

The machine learning model should be able to predict the action to be taken quickly, so as to avoid adding latency that would be counterproductive to network performance. In short, irrespective of the training time complexity, the test time complexity must be low. Also, network administrators might want some form of explainability for the predicted action classes, so the model should also be able to report the importance of features in its predictions.

Earlier Works:

Implementing a firewall using machine learning is by no means a nascent application in the ever-growing list of machine learning wonders. There have been a number of research works on the topic, but the ground is yet to be broken.

1. Classification of Firewall Log Files using Multiclass SVMs:

The research paper mentioned above was published as a conference paper in March 2018 by Fatih Ertam et al., who is also the author of the dataset we will use in this case study. The team used support vector machines with linear, polynomial and RBF kernels to classify the actions on the same dataset as this case study. The evaluation was done using metrics such as precision, recall and F1 score. The best result was achieved with the RBF-kernel classifier, with an F1 score of 76.4%. It seems the research team did not use any preprocessing or feature engineering techniques to train the models. Hence, this work has scope for improvement, and that is what this case study will be focusing on.

2. Machine Learning driven Web Application Firewall:

This repository was published by Faizan Ahmad on GitHub. The dataset used by the author is a collection of queries from web applications, containing about 1,000,000 positive queries and 50,000 negative queries. The author did mention that the dataset was labelled by him using automated scripts. The author preprocessed the queries using a TF-IDF vectorizer with n-grams ranging up to trigrams. The model was trained using logistic regression, reportedly achieving 99% accuracy; however, the author did not explicitly report a final F1 score or log loss.

3. A Machine Learning Approach for Network Traffic Analysis using Random Forest Regression:

The paper was published by Shilpa Balan et al. to build a network intrusion detection system (NIDS) by training a random forest on a dataset generated by the Canadian Institute for Cybersecurity (CIC) and referred to as CICIDS2017, which contains diverse network traffic captured in real time along with network attacks such as brute-force Secure Shell (SSH) and File Transfer Protocol (FTP) attacks. The reported F1 score of 0.965 indicates that network traffic can indeed be controlled by trained machine learning algorithms.

Dataset:

The dataset used in this case study was compiled by Fatih Ertam at Firat University, Turkey and can be downloaded from here. The dataset contains 65532 instances with 12 features collected from the logs of the university firewall system. The summary of the features is given below-

Image by author

The Action feature, which is our response variable, is categorized into four classes, namely:

Image by author

Exploratory Data Analysis (EDA):

At first, we can try univariate analysis of some important features. Let us look at the distribution of the Action classes, which is our target variable.

Image by author

As we can see, this is an imbalanced dataset: the 'reset-both' class occurs very few times compared to the other three classes.
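As a quick sketch of how such an imbalance shows up in code (the counts below are hypothetical stand-ins, not the dataset's actual ones):

```python
import pandas as pd

# Hypothetical log records standing in for the firewall CSV;
# the real dataset has an 'Action' column with four classes.
df = pd.DataFrame({
    "Action": ["allow"] * 70 + ["deny"] * 20 + ["drop"] * 9 + ["reset-both"] * 1
})

# Relative frequency of each Action class reveals the imbalance.
class_dist = df["Action"].value_counts(normalize=True)
print(class_dist)
```

On the real data, the same `value_counts` call would show 'allow' dominating and 'reset-both' as a tiny minority.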

Now, let's look at the Bytes feature. We can plot a distribution plot and a box plot with respect to the Action classes for better insight.

Image by author

As we can see, the Bytes feature is extremely right-skewed. This may be due to the presence of outliers. The box plots also reveal that the higher the number of Bytes involved in the traffic, the more likely it is that the traffic will be allowed by the firewall. Similar insights can be drawn from the plots of the Packets and Elapsed Time (sec) features-

Image by author
Image by author

From the univariate analysis, it can be said that the 'allow' action class can be separated from the other three classes based on its feature values. However, nothing can be said about the other three action classes. For that, let us try bivariate analysis between the same set of features.

Image by author

The plot shows that there is a strong linear relationship between the Bytes and Packets features. The 'allow' Action dominates the plot due to the random distribution of the features. Still, it can be inferred that higher values of Bytes, Packets and Elapsed Time (sec) generally correspond to allowed traffic. However, nothing can be said about lower values of these features, as the classes overlap.

We can also visualize the features in a two-dimensional embedded space using t-SNE. This will give us a better idea about the separability of the Action classes.
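A minimal sketch of producing such an embedding with scikit-learn's t-SNE, on random stand-in features (the real study would embed the actual numeric firewall features):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy numeric features standing in for Bytes, Packets, Elapsed Time (sec).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Project the features into a 2-D embedded space for visualization.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (200, 2)
```

The resulting 2-D points can then be scatter-plotted, coloured by Action class, to judge separability.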

Image by author

The visualization backs up what we derived from the earlier analysis: the 'allow' Action class is separable from the other three Action classes, but the numerical features alone are not enough to distinguish the Action classes. We should employ feature engineering techniques on the port features for better classification.

Feature Engineering:

As the exploratory data analysis showed, numerical features on their own are not enough to classify all the Action classes. However, we can use information about the port numbers involved in the traffic to engineer some features which may help us classify the Action classes better. First, let us understand the port numbers in the dataset. Port numbers in network terminology are numbers assigned to specific services, which help identify for which service each packet is intended. There are 65535 ports available to carry out network communication. Some examples are the File Transfer Protocol (FTP), which uses port 20 to transfer files, and email delivery services such as the Post Office Protocol (POP3), which uses port 110, and the Internet Message Access Protocol (IMAP), which uses port 143.

While the port number denotes the type of service requested, it says nothing about the host device. The host device is identified by its IP (Internet Protocol) address, which is simply the address of that device on the internet. There are two types of IP addresses: a private IP address, which is assigned to the device by the router, and a public IP address, which is provided by the Internet Service Provider (ISP) to enable the connection to the internet. IP addresses are defined by Internet Protocol version 4 (IPv4), a global convention using a unique 32-bit number. Hence, there is a limitation on the number of unique addresses available; more specifically, there are 2³² unique IP addresses.

Photo by Stephen Phillips — Hostreviews.co.uk on Unsplash

But an important caveat is that devices not connected to the internet, such as factory machines that communicate only with each other via TCP/IP, need not have globally unique IP addresses. These types of private networks are widely used, of which our university network is a good example. There are three non-overlapping ranges, totalling nearly 18 million IP addresses, reserved for such private networks. This enables the devices connected to a shared network environment to connect to the internet using the same public IP address. For this purpose, the router or networking hub implements a technique known as Network Address Translation (NAT). Basically, NAT allows a single device, such as a router, to act as an agent between the internet (or public network) and a local network (or private network), which means that only a single unique IP address is required to represent an entire group of computers to anything outside their network (i.e. the internet). While NAT allows for better use of the IP address space, it is not always called into action for each communication at the router level. That means port numbers can sometimes be used over the internet without translation as well.

The dataset we are using records the port numbers on the private devices as well as the port numbers translated by NAT. Hence we can create two features, for source and destination, based on whether port translation (NAT) was required while passing the traffic. The features Source Port Translation and Destination Port Translation are encoded as 1 if the port numbers on the device and NAT sides differ, indicating that translation was required; otherwise they are encoded as 0, indicating no translation was required.
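As a minimal sketch of this encoding — assuming the dataset's port columns are named 'Source Port', 'NAT Source Port', 'Destination Port' and 'NAT Destination Port' (the rows below are hypothetical):

```python
import pandas as pd

# Hypothetical traffic rows mirroring the dataset's port fields.
df = pd.DataFrame({
    "Source Port":          [443, 52000, 80],
    "NAT Source Port":      [443, 61234, 80],
    "Destination Port":     [53,  443,   8080],
    "NAT Destination Port": [53,  443,   443],
})

# 1 if NAT translated the port (device port != NAT port), else 0.
df["Source Port Translation"] = (df["Source Port"] != df["NAT Source Port"]).astype(int)
df["Destination Port Translation"] = (df["Destination Port"] != df["NAT Destination Port"]).astype(int)
print(df[["Source Port Translation", "Destination Port Translation"]])
```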

We can also extract more features by building a network of source and destination ports with the help of the networkx library. A particular port may carry implicit importance in determining the Action of the traffic, and by building a network we can capture such features. In the dataset, we have source and destination ports on the sides of both the host devices and NAT; hence, we can build two networks from these port numbers.

Some of these features can be explained as follows-

  1. Common neighbours: The number of ports that are neighbours of both the source and destination ports.
  2. Jaccard Index: The number of common neighbours divided by the size of the union of all ports connected to the source and destination ports.
  3. Salton Index: A quantified measure of the cosine similarity between the source and destination ports.
  4. Sorensen Index: Similar to the Jaccard index, except the denominator is the sum of the two degrees rather than the size of the union.
  5. Adamic/Adar Index: Originally designed for comparing two web pages, it is similar to common neighbours but down-weights highly connected common neighbours by taking the inverse logarithm of their degree; rare shared ports, which in our case may be fraudulent ones, thus count for more.
  6. PageRank: Computes a ranking of the ports in the network graph based on the structure of the incoming links. It was originally designed as an algorithm to rank web pages in the Google search engine. We can calculate PageRanks for both source and destination ports from the host and NAT networks.
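A rough sketch of how some of these graph features could be computed with networkx, using a toy port network (the edge list here is hypothetical):

```python
import networkx as nx

# Toy undirected graph of (source port, destination port) traffic pairs.
edges = [(443, 53), (443, 8080), (53, 8080), (52000, 443), (80, 53)]
G = nx.Graph(edges)

# Jaccard index and Adamic/Adar index for one source-destination pair.
((u, v, jaccard),) = nx.jaccard_coefficient(G, [(80, 8080)])
((_, _, adamic_adar),) = nx.adamic_adar_index(G, [(80, 8080)])

# PageRank of every port in the network.
pr = nx.pagerank(G)
print(jaccard, adamic_adar, pr[443])
```

In the case study, these scores would be computed per log record from both the host-side and NAT-side port networks and appended as new feature columns.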

Choosing the performance metric:

To evaluate the performance of any machine learning model, choosing the right metric is very important. The model is only as good as the metric shows it to be, but the metric itself does not guarantee that the model will work as well in real time. Hence, for any underlying task, the metric must be suitable for its real-time application.

The task here is to classify the Action class of a firewall log record. This is a multiclass classification task, as the Action belongs to one of four distinct classes. The distribution among those classes is also unequal, which makes this an imbalanced dataset. Hence, simple accuracy or AUC cannot be used, as they do not properly account for class imbalance, and predicting every class correctly is equally important. Moreover, the model should be able to predict the probability of each class for interpretability, rather than making hard class predictions directly.

That leaves us with the multiclass logarithmic loss as the metric. Even though it is typically interpreted as a measure of the model's error, we can use it as a metric to evaluate the model, in the sense that a model with lower log-loss is considered better; this is because it penalizes predictions that are not confident. It is defined as-
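For N samples and K Action classes, the multiclass logarithmic loss takes the standard form:

```latex
\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log p_{ik}
```

where \(y_{ik}\) is 1 if sample \(i\) belongs to class \(k\) (and 0 otherwise), and \(p_{ik}\) is the model's predicted probability that sample \(i\) belongs to class \(k\).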

Modelling:

  1. KNeighborsClassifier:

We can start with KNeighborsClassifier, which is one of the simplest machine learning models around. It generates output probabilities by taking into account the labels of the k nearest neighbours, as measured by distance. The optimal value of k can be found using the RandomizedSearchCV hyperparameter tuning technique.
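A minimal sketch of this tuning step on synthetic stand-in data (the real pipeline would use the engineered firewall features, and the search range for k is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the preprocessed firewall features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

# Tune k with randomized search, scoring by (negative) multiclass log-loss.
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": list(range(3, 30, 2))},
    n_iter=10, scoring="neg_log_loss", cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Scoring by `neg_log_loss` ties the tuning directly to the metric chosen above, rather than to accuracy.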

Image by author

As we can see, the model has a test log-loss of 0.028, which is very low compared to the 1.45 of the baseline model. The model does very well in classifying all the Action classes except 'reset-both'; it may be struggling there because there are only 10 instances of that class in the test data.

2. LogisticRegression:

LogisticRegression is a classification model which explicitly tries to minimize the log-loss. The parameter 'C' controls the amount of regularization, with smaller values meaning more regularization; the 'penalty' parameter controls the type of regularization.
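An illustrative sketch on synthetic data, showing how 'C' and 'penalty' are passed and how the resulting log-loss can be computed (the parameter values here are defaults, not the tuned ones from the study):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic 4-class data standing in for the firewall features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

# Smaller C means stronger regularization; penalty selects its type (L2 here).
clf = LogisticRegression(C=1.0, penalty="l2", max_iter=1000)
clf.fit(X, y)

# Multiclass log-loss computed from the predicted probabilities.
loss = log_loss(y, clf.predict_proba(X))
print(loss)
```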

Image by author

As we can see, the model has a test log-loss of 0.38, which is low compared to the 1.45 of the baseline model but higher than the 0.028 of the K-neighbours model. Similar to K-neighbours, the model does very well in classifying all the Action classes except 'reset-both'; it may be struggling there because there are only 10 instances of that class in the test data.

3. Support Vector Classifier:

The SGDClassifier is a classification model which explicitly tries to minimize the hinge loss; hence it is essentially a support vector classifier with a linear kernel. We are not using a support vector classifier with the RBF kernel because of the high number of training samples. The parameter 'alpha' controls the amount of regularization, with smaller values meaning less regularization; the 'penalty' parameter controls the type of regularization.
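One caveat: hinge loss yields no class probabilities, so to obtain a log-loss one would typically wrap the SGDClassifier in a probability calibrator — this is an assumption here, as the write-up does not say how probabilities were obtained. A sketch on synthetic data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic 4-class data standing in for the firewall features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

# Hinge loss gives a linear SVM; alpha is the regularization strength.
svm = SGDClassifier(loss="hinge", alpha=1e-4, penalty="l2", random_state=0)

# Hinge loss has no predict_proba, so calibrate to obtain probabilities.
clf = CalibratedClassifierCV(svm, cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X)
print(proba.shape)  # (500, 4)
```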

Image by author

As we can see, the model has a test log-loss of 0.09, which is low compared to the 1.45 of the baseline model but higher than the 0.028 of the K-neighbours model. Similar to K-neighbours, the model does very well in classifying all the Action classes except 'reset-both'; it may be struggling there because there are only 10 instances of that class in the test data.

4. RandomForestClassifier:

The RandomForestClassifier is an ensemble model which trains a large ensemble of decision trees. The parameter 'n_estimators' controls the number of decision trees to train, 'criterion' controls the information gain metric on which to split, and 'max_depth' is the maximum depth allowed in the construction of each decision tree.
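A minimal sketch of these parameters, plus the impurity-based feature importances that the explainability constraint calls for, on synthetic stand-in data (parameter values are illustrative, not the tuned ones):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class data standing in for the firewall features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

# n_estimators trees, entropy splits, each tree at most max_depth levels deep.
clf = RandomForestClassifier(n_estimators=100, criterion="entropy",
                             max_depth=10, random_state=0)
clf.fit(X, y)

# Impurity-based feature importances; one value per feature, summing to 1.
print(clf.feature_importances_)
```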

Image by author

As we can see, the model has a test log-loss of 0.009, which is very low compared to the 1.45 of the baseline model and also to the 0.028 of the K-neighbours model. In contrast to K-neighbours, the model does very well in classifying all the Action classes, including 'reset-both'. This seems to be the best model yet.

5. LGBMClassifier:

The LGBMClassifier is also an ensemble model which trains a large ensemble of decision trees, using gradient boosting. The parameter 'n_estimators' controls the number of decision trees to train, 'objective' specifies the learning task, which is 'multiclass' in our case, and 'max_depth' is the maximum depth allowed in the construction of each decision tree.

Image by author

As we can see, the model has a test log-loss of 0.008, which is very low compared to the 1.45 of the baseline model and also lower than the 0.009 of the Random forest model. The model also does very well in classifying all the Action classes, including 'reset-both'. As this seems to be the best model, let us check the importance of the features in predicting the Action class.

Image by author

Notably, a lot of the engineered features are very important in classifying the log; the PageRank and port translation features in particular.

6. StackingClassifier:

The StackingClassifier is an ensemble learning technique that combines multiple classification models via a meta-classifier. We can use well-performing models like KNeighbors, RandomForest, LightGBM and AdaBoost as base models and LogisticRegression as the meta-classifier to minimize the log-loss. The hyperparameters of the StackingClassifier can be tuned using the RandomizedSearchCV technique.
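A sketch of such a stack on synthetic data, using only a subset of the base models named above (AdaBoost and LightGBM are omitted for brevity, and the hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the firewall features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

# Base models feed cross-validated predictions to a logistic-regression
# meta-classifier, which learns how to combine them.
stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X, y)
print(stack.predict_proba(X[:3]).shape)  # (3, 4)
```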


Image by author

As we can see, the model has a test log-loss of 0.0107, which is very low compared to the 1.45 of the baseline model but not better than the 0.008 of the LightGBM model. The model also does very well in classifying all the Action classes, including 'reset-both'. Hence, the better model can be chosen by comparing run-time complexity.

Summary of the model performance:

We can now compare the performance of all the models on attributes such as train loss, test loss and run time in predicting the Action class.

Image by author

From the results, it is clear that the LightGBM classifier is the best model, with a test log-loss of 0.0083 and a prediction run time of only 0.0009 seconds. Hence, the model satisfies the low-latency requirement while being capable of producing reliable predictions on firewall traffic logs.

Future Work:

The model is quite capable of classifying real-time firewall traffic logs without affecting network latency. The only small caveat stems from the 0.5 recall on the 'reset-both' Action class, which could be improved by using a larger sample of data with a better distribution of that class. This type of router-based firewall system can be easily deployed by providing firmware updates on routers, and other firewall frameworks can likewise be improved using similar machine learning techniques.

References:

Dataset hosted on UCI machine learning repository

Network Administration: Packet-Filtering Firewall

Review on Learning and Extracting Graph Features for Link Prediction

Predicting Good Probabilities With Supervised Learning

The case study is hosted on github. Refer here. Find me @ LinkedIn.
