# Confusion Matrix in CyberSecurity

Let’s start by evaluating the metrics calculated from a confusion matrix.

Accuracy — How often the value predicted by the classifier is correct, calculated by the formula: Accuracy = (TN + TP) / (TN + FP + FN + TP)

Misclassification Rate — How often the value predicted by the classifier is incorrect, calculated by the formula: (FN + FP) / (TN + FP + FN + TP).

Misclassification rate = (FN + FP) / (TN + FP + FN + TP) = (10 + 5) / (50 + 10 + 100 + 5) = 15/165 ≈ 9%

Precision — When the classifier predicts yes, what percentage of those predictions is correct?

Precision = TP / predicted yes = 3/5 = 0.6

Recall — How many true positives are correctly found out of all the instances that are actually positive.

Recall = TruePositives / (TruePositives + FalseNegatives) = 3/4 = 0.75

Prevalence — How often does the yes condition actually occur in our sample?

Prevalence = actual yes / total = 3/10 = 0.3

F1-Score — It is difficult to compare two models when one has low precision and high recall, or vice versa. To make them comparable, we use the F-score, which measures recall and precision at the same time. It uses the harmonic mean in place of the arithmetic mean, punishing extreme values more heavily.

F1 = (2 × 0.75 × 0.6) / (0.75 + 0.6) = 0.9 / 1.35 ≈ 0.667

Sensitivity — The ratio of positive instances correctly detected. This metric measures how well the model recognizes the positive class; for a binary classifier it is the same quantity as recall.

CyberSecurity refers to the process of protecting computers and other devices from theft of information, damage to software or hardware, and damage to other intellectual property, where a breach may cause serious problems for the person or organization that owns or is responsible for that information.

AI and ML technologies have become critical in the information security world because they can quickly analyze millions of events and identify many different types of threats. Let’s start by understanding what these threats are. Our focus is mainly on cyber-attacks, which can be defined as any attempt to gain unauthorized access to a computer, computing system, or computer network with the intent to cause damage. They may cause:

1. Identity theft and extortion of information, which might result in blackmailing
2. Malware induction into systems, affecting multiple systems by injecting viruses
3. Spoofing, phishing, and spamming
4. Denial of service, which may further lead to multiple attacks
5. Sabotaging of vital information
6. Vandalism through various websites
7. Exploitation of privacy through web browsers
8. Account hacks and money scams
9. Ransomware
10. Theft of intellectual property

An attack may succeed because of a type-2 error (False Negative, a missed detection), while a type-1 error (False Positive) raises a false alarm.

## Confusion Matrix Applications in Machine Learning

An IDS (Intrusion Detection System) monitors traffic for suspicious activity and issues alerts when such activity is discovered. It is a software application that scans a network or a system for harmful activity or policy breaches. Any malicious venture or violation is normally reported either to an administrator or collected centrally using a security information and event management (SIEM) system. A SIEM system integrates outputs from multiple sources and uses alarm-filtering techniques to distinguish malicious activity from false alarms.

Although intrusion detection systems monitor networks for potentially malicious activity, they are also prone to false alarms. Hence, organizations need to fine-tune their IDS products when they first install them: setting up the system to recognize what normal traffic on the network looks like as compared to malicious activity. The IDS also monitors inbound network packets for malicious activity and immediately sends warning notifications.

In the case of a binary-classifier IDS, four outcomes are possible: attacks correctly predicted as attacks (TP) or incorrectly predicted as normal (FN), and normal traffic correctly predicted as normal (TN) or incorrectly predicted as an attack (FP). The False Positives and False Negatives are the errors, and the tradeoff between these two factors can be analyzed intuitively with the help of the Receiver Operating Characteristic (ROC) curve. In the case of multi-class classifiers, however, an attack of one class incorrectly predicted as another class of attack does not fit any of the existing four instances. Here, a new approach is proposed to evaluate anomaly-based IDS: a proposed metric called F-score per Cost (FPC), one value calculated for each attack predictor.
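The FP/FN tradeoff behind the ROC curve can be sketched by sweeping a detection threshold over anomaly scores: each threshold yields one (false-positive rate, true-positive rate) point. The scores and labels below are made up for illustration, not taken from any real IDS.

```python
def roc_points(scores, labels):
    """Sweep the detection threshold over the observed anomaly scores and
    return (false-positive rate, true-positive rate) pairs, one per threshold."""
    pos = sum(labels)            # actual attacks (label 1)
    neg = len(labels) - pos      # actual normal traffic (label 0)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical anomaly scores; label 1 = attack, 0 = normal
print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
# → [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

Lowering the threshold catches more attacks (TPR rises) but also flags more normal traffic (FPR rises) — exactly the FN/FP tradeoff the ROC curve visualizes.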

Here there are two classes of connection points, “Normal” and “Attack”; the attack type is not identified. A data set containing labeled or unlabeled data points is used to evaluate the IDS. For example, the KDD CUP ’99 competition presented a data set of five classes: a normal class and four classes of different attacks. The KDD’99 data set is considered a benchmark data set for evaluating IDS, and most previous studies have used it for training, testing, and validating their proposed IDS.

The task for the KDD ’99 Cup was to build a classifier capable of distinguishing between legitimate and illegitimate connections in a computer network. This data set is now considered the de facto data set for intrusion detection. The connections in the data set are either normal connections or intrusions, of which there are four main categories: Probing (surveillance, port scanning, etc.), DoS (Denial of Service), U2R (unauthorized access to local superuser privileges), and R2L (unauthorized access from a remote machine). The evaluation metrics applied were derived and calculated from the four instances TP, TN, FP, and FN, which are the outcome of comparing the two actual classes with the two predicted classes. However, attacks of one class that are wrongly predicted as a different class of attack cannot be related to any of these four instances.
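That last point can be made concrete with a per-class (one-vs-rest) breakdown: a DoS record predicted as Probe counts as a false negative for the DoS class and simultaneously a false positive for the Probe class, so the single binary TP/TN/FP/FN view cannot capture it. The tiny label lists below are hypothetical, not drawn from KDD’99.

```python
CLASSES = ["normal", "probe", "dos", "u2r", "r2l"]  # KDD'99 categories

def per_class_counts(actual, predicted):
    """One-vs-rest TP/FP/FN counts for each class."""
    counts = {}
    for c in CLASSES:
        tp = sum(1 for a, p in zip(actual, predicted) if a == c and p == c)
        fp = sum(1 for a, p in zip(actual, predicted) if a != c and p == c)
        fn = sum(1 for a, p in zip(actual, predicted) if a == c and p != c)
        counts[c] = {"tp": tp, "fp": fp, "fn": fn}
    return counts

# One DoS record misclassified as Probe hurts both classes at once
counts = per_class_counts(
    actual=["normal", "dos", "dos", "probe"],
    predicted=["normal", "dos", "probe", "probe"],
)
print(counts["dos"])    # {'tp': 1, 'fp': 0, 'fn': 1}
print(counts["probe"])  # {'tp': 1, 'fp': 1, 'fn': 0}
```

The misclassified DoS record appears as an FN under "dos" and an FP under "probe", which is why multi-class evaluation needs per-class metrics rather than a single binary confusion matrix.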

Here PD stands for Probability of detection and FAR stands for False Alarm Rate.

KDD’99 has been the most widely used data set for the evaluation of anomaly detection methods. It was prepared by Stolfo et al. and is built from the data captured in the DARPA’98 IDS evaluation program. DARPA’98 comprises about 4 gigabytes of compressed raw (binary) tcpdump data covering 7 weeks of network traffic, which can be processed into about 5 million connection records, each of about 100 bytes. For each TCP/IP connection, 41 quantitative (continuous data type) and qualitative (discrete data type) features were extracted: 34 numeric and 7 symbolic.

The accuracy (AC) is the proportion of the total number of correct predictions to the actual data set size. It is determined using the equation: AC = (TP + TN) / (TP + TN + FP + FN)