Network Monitoring Meets Deep Learning

Using Neural Networks to Analyze Behavior at the Network-Level

Cover image by Jackie Clark.

By Jude Canady, William Rannefeld

People are creatures of habit. Your morning routine, what you order at Starbucks, the route you take to work, and your late night snacking are all driven by habits. These habits also appear in the digital world: how you browse social media, what sites you visit, and, in today's environment, how you conduct work. When habits change in the physical world, it can be a sign of something good.

But when your digital work habits change, it almost always means something bad.

Changes in digital habits are a strong indicator of compromise: an insider threat, stolen login credentials, or a compromised machine. Identifying and then remediating these changing digital behaviors in real time improves the security and integrity of enterprise or personal networks.

Communication between networked devices is enabled by the transmission of network packets routed between devices using IP addresses. We can collect statistics about these packet transmissions and treat this data as the behavior of traffic producers over time, where a producer is a user or machine generating the traffic.

Throughout this article, we examine the validity of using NetFlow V9 as a behavioral data source. This data can be collected at the network router for all machines transmitting or receiving data on a given network, and there are a variety of open source tools that support its collection. This approach is non-intrusive and enables collection for both managed endpoint and bring-your-own-device environments.

With machine learning (ML), we can begin to understand and make predictions about this data. Particularly, we can create behavioral signatures, which summarize the activity of a given producer for a window of time. These signatures can enable a variety of applications. In this article, we explore one system which uses them to perform periodic producer authentication, i.e., we show how to authenticate a producer’s behavior for the full duration of their network connection.

It is important to note that NetFlow only captures high-level statistics for network data, as opposed to full packet capture, where the message content of the packets is recorded. Our results suggest that resource-intensive data collection methods, like full packet capture, are unnecessary for generating comprehensive behavioral signatures and performing traffic attribution. In the following sections, we provide performance metrics and prediction visualizations to support this claim.

Collecting Data

NetFlow as a Behavioral Data Source

A network flow is defined as the transmission of information between two devices communicating on a network. NetFlow is the meta-data associated with this communication. As previously stated, NetFlow data does not include the contents of the packets being transmitted. For our work, we use NetFlow V9 [3], which is a template-based implementation configured at the router. Table 1 lists a subset of the default features that we use as input data to the ML models.

Table 1. NetFlow V9 features considered by the ML models.

With the router configured to export NetFlow records, we capture and process these records with Nfdump [1], a popular package for interacting with NetFlow. Nfdump consists of a daemon that converts the NetFlow packets into byte files, which we then convert to formats more suitable for subsequent ML processing. We use the Comma-Separated Value (CSV) export format to simplify the inclusion of this data source into our ML pipeline. Prior to any processing in the pipeline, the CSV files undergo a scaling and encoding procedure.
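To make the scaling and encoding step concrete, here is a minimal sketch that parses a toy nfdump-style CSV export, min-max scales the numeric fields, and one-hot encodes the protocol. The column names (`sa`, `dp`, `ipkt`, `ibyt`, etc.) follow nfdump's short field identifiers, but the sample values, the scaling bounds, and the protocol list are illustrative assumptions rather than our production configuration.

```python
import csv
import io

# Toy nfdump-style CSV export. Column names mimic nfdump's short field
# identifiers (sa = source address, dp = destination port, ipkt = input
# packets, ibyt = input bytes); the rows themselves are made up.
SAMPLE = """ts,sa,da,dp,pr,ipkt,ibyt
2021-06-01 12:00:01,10.0.0.5,93.184.216.34,443,TCP,12,4096
2021-06-01 12:00:03,10.0.0.5,93.184.216.34,443,TCP,3,310
2021-06-01 12:00:04,10.0.0.7,10.0.0.1,53,UDP,1,74
"""

PROTOCOLS = ["TCP", "UDP", "ICMP"]  # assumed category set for one-hot encoding

def encode_flow(row, pkt_max=10_000.0, byte_max=1_000_000.0):
    """Scale numeric fields into [0, 1] and one-hot encode the protocol."""
    one_hot = [1.0 if row["pr"] == p else 0.0 for p in PROTOCOLS]
    return [
        min(int(row["ipkt"]) / pkt_max, 1.0),   # packet count
        min(int(row["ibyt"]) / byte_max, 1.0),  # byte count
        int(row["dp"]) / 65535.0,               # destination port
    ] + one_hot

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
features = [encode_flow(r) for r in rows]
```

A real pipeline would fit the scaling bounds on training data instead of hard-coding them; the fixed caps here simply keep the sketch self-contained.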

The setup described results in a multi-variate time series for each IP address observed on the network. The current implementation uses a uni-directional, aggregate approach, i.e., for a given source IP address, we consider all flows from that source to any destination IP address. In essence, this enables the ML models to track how an IP address’ behavior evolves over time as it interacts with other machines on the network. The models can observe and condition on the number of packets the IP address typically transmits, to what ports, and the average size of those packets. In addition, if we include temporal information, such as the transmission date and times, we can establish the “pace” of the producer generating the traffic. All of this information is combined to create behavioral signatures, or latent representations, for the selected IP addresses. These signatures underlie the ML pipeline, and the quality of these signatures directly impacts the performance of the authentication system.
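The per-IP time-series construction can be sketched as a simple bucketing step: group each flow by its source IP address and five-minute interval. The record layout and timestamps below are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical flow records: (timestamp, source IP, encoded feature vector).
flows = [
    ("2021-06-01 12:00:30", "10.0.0.5", [0.1, 0.2]),
    ("2021-06-01 12:03:10", "10.0.0.5", [0.3, 0.1]),
    ("2021-06-01 12:06:45", "10.0.0.5", [0.2, 0.2]),
    ("2021-06-01 12:01:00", "10.0.0.7", [0.9, 0.4]),
]

WINDOW_SECONDS = 5 * 60  # Nfdump rotates its export files every five minutes

def window_key(ts):
    """Bucket a timestamp into its five-minute interval index."""
    epoch = datetime.fromisoformat(ts).timestamp()
    return int(epoch // WINDOW_SECONDS)

# sequences[(source_ip, window)] -> ordered list of flow feature vectors,
# i.e., one multi-variate sequence per IP address per interval.
sequences = defaultdict(list)
for ts, src, feats in flows:
    sequences[(src, window_key(ts))].append(feats)
```

Each resulting sequence is what the signature generation model consumes for one producer and one interval.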

The Machine Learning Pipeline

Generating Behavioral Signatures

A behavioral network signature represents a producer's activity across a network over time. Due to the application constraints of Nfdump, we receive NetFlow samples for IP addresses in five-minute increments. Therefore, a behavioral signature with our current configuration represents five minutes' worth of producer behavior. So, how do we actually generate them?

Prior to training a model for this task, the NetFlow data needs to be labeled so that it is suitable for the classification setting. These labels can come from any device that maps an IP address to a specific machine. It is worth noting that these labels need to be consistent across the duration of the flow. If a producer’s IP address changes, then we need to ensure that the labeling mechanism tracks the change, i.e., it must note that the activity is coming from the same producer even though the IP address has changed. We must also be wary of changing MAC addresses if they are involved in the labeling scheme or serve as identity labels directly, e.g., in the case of virtual machines where the MAC address changes every time the machine restarts unless it is configured to prevent this behavior.
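A minimal sketch of such a labeling mechanism is shown below, assuming hypothetical lease records that map an IP address and a time span to a producer. Note how `workstation-a` keeps its label even after moving to a new IP address; the lease table and producer names are invented for illustration.

```python
# Hypothetical machine-IP lease records from the labeling appliance:
# (ip, start, end, producer). The producer label must stay consistent
# even when the producer's IP address changes.
LEASES = [
    ("10.0.0.5", 0, 1000, "workstation-a"),
    ("10.0.0.5", 1000, 2000, "workstation-b"),  # IP reassigned to another machine
    ("10.0.0.9", 0, 2000, "workstation-a"),     # same producer on a second IP
]

def label_flow(ip, ts):
    """Return the producer that held `ip` at time `ts`, or None if unknown."""
    for lease_ip, start, end, producer in LEASES:
        if lease_ip == ip and start <= ts < end:
            return producer
    return None
```

Flows that cannot be matched to a lease are best discarded from training rather than guessed at, since mislabeled samples poison the metric learning objective.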

Once we have labeled data, the next step is to select a model that is capable of emitting the signatures we desire. Behavioral signatures from the same producer should be neighbors in the latent space, relative to the distance metric used in the training procedure. Due to the sequential nature of NetFlow, we selected an efficient transformer [5] for the underlying signature generation model. Transformers are currently the state-of-the-art architectures for multiple problems across machine learning, particularly in the sub-discipline of Natural Language Processing (NLP). The “efficient” version is optimized to support longer sequence lengths than traditional transformers, while maintaining comparable performance.

We frame signature generation as a metric learning problem, and Figure 1 is a visual representation of the high-level metric learning procedure. This sub-discipline of deep learning attempts to learn a latent representation directly (step C), as opposed to using an intermediate layer of a softmax classifier. In the supervised metric learning setting, the distance between samples of the same category is minimized with respect to a metric (step B). Depending on the selected loss function, the model may also need to learn to maximize the distance between samples from different categories. This objective mirrors the property we desire for the behavioral signatures: NetFlow samples from the same producer should have similar signatures, and NetFlow samples from different producers should have dissimilar signatures (step D). This is arguably the most important stage of the machine learning pipeline. If the model is incapable of optimizing its parameters with respect to the metric, all subsequent predictions will suffer.

Figure 1. Overview of the metric learning procedure.
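One common loss for this kind of supervised metric learning is the triplet margin loss; the source does not state which loss the system actually uses, so the sketch below should be read as a representative example. The 4-dimensional signatures are toy values.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: pull same-producer signatures together and push
    different-producer signatures at least `margin` further apart."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to same producer
    d_neg = np.linalg.norm(anchor - negative)  # distance to other producer
    return max(d_pos - d_neg + margin, 0.0)

# Toy 4-d signatures: anchor/positive from one producer, negative from another.
anchor = np.array([0.1, 0.9, 0.0, 0.2])
positive = np.array([0.2, 0.8, 0.1, 0.2])
negative = np.array([0.9, 0.1, 0.7, 0.6])

loss = triplet_loss(anchor, positive, negative)
```

When the loss reaches zero, the anchor is already closer to its positive than to its negative by the full margin, which is exactly the clustering behavior the signatures require.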

In our dataset, the NetFlow samples from a single producer can vary from tens to tens of thousands of items per sequence within the five-minute interval. The efficient transformer is tailored to cope with this large, variable sequence length. To generate signatures at inference time, the model takes in one of these samples and produces a signature, or vector of fixed length. These signatures are plotted using a dimensionality reduction technique called Uniform Manifold Approximation and Projection (UMAP) [2] to verify that the model has successfully learned to separate the signatures from each producer. Figure 2 confirms that this is indeed the case. It shows the signatures for all NetFlow samples in the validation set, which contains data withheld from the model during training. Each point in the plot represents a five-minute interval for a producer. Though some overlap exists between machines, it is clear that clusters are forming in this space, where each member of a cluster belongs to a single producer. We can further reduce cluster overlap by increasing the number of labeled samples and categories in the training procedure. We explore this idea further in the Next Steps section at the end of this article.

Figure 2. Plot of anonymized UMAP-reduced behavioral signatures for NetFlow samples from the validation set. Each label in the legend represents a unique producer.

Periodic Producer Authentication

In the second stage of the ML pipeline, we use the signatures generated by the previous model to perform periodic producer authentications. The model for this task requires two inputs: a historical signature for the selected producer, and a signature for the current time step from the producer claiming to have the same identity as the historical signature.

The historical signature is a running average over all signatures deemed authentic by this module, although each machine must first undergo a finite period of enrollment (averaging the signatures without authenticating them). We include the enrollment phase to obtain a better representation of the producer’s behavior over time. Referring back to Figure 2, a historical average for a producer would be an average of all of the points belonging to a single producer throughout the plot. This average is updated throughout the lifetime of the producer in order to adapt to gradual changes in behavior.
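The enrollment-then-update logic can be sketched as a running (incremental) mean over accepted signatures. The enrollment length of twelve intervals below is an assumed placeholder, not the system's actual setting.

```python
import numpy as np

ENROLLMENT_STEPS = 12  # assumed: e.g., one hour of five-minute intervals

class HistoricalSignature:
    """Running average of a producer's behavioral signatures.

    During enrollment every signature is folded in unconditionally; after
    enrollment only signatures the authentication model accepts are added,
    so the average adapts to gradual (authentic) behavior change."""

    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.count = 0

    @property
    def enrolled(self):
        return self.count >= ENROLLMENT_STEPS

    def update(self, signature, authentic=True):
        if self.enrolled and not authentic:
            return  # do not let inauthentic intervals drift the average
        self.count += 1
        # Incremental mean: m_k = m_{k-1} + (x_k - m_{k-1}) / k
        self.mean += (signature - self.mean) / self.count
```

The incremental form avoids storing every past signature, which matters when a producer stays connected for weeks.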

We selected a Multi-Layer Perceptron (MLP) to produce authentication predictions. To train the model, we create pairs of historical signatures and current signatures, where the signatures can belong to the same producer or come from different producers. This results in a binary classification problem across our dataset of previously generated signatures, meaning we can use the standard cross entropy loss function to optimize the model. We also implemented Dropout [4] within this network, which randomly zeros activations generated by the latent layers throughout training. This led to significant performance gains with respect to the F1 Score metric, which has a maximum value of 1.0. Table 2 lists the performance for a subset of model configurations considered during training. Surprisingly, increasing the number of parameters did not lead to significant performance gains. We attribute this to the overlap between producer clusters produced by the signature generation model. In these cases, the authentication model cannot distinguish behavior from different producers because their signatures, and therefore their underlying behavior, are too similar. In order to further improve performance on the authentication task, we need to generate signatures that do not suffer from this defect. The best approach to achieve this is to include a larger number of unique producers and NetFlow samples in the signature generation model's training phase. We expand upon this idea in the Next Steps section at the end of this article.

Table 2. Performance of model configurations for signature pairs from the validation set on the authentication task.
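For illustration, here is a minimal NumPy forward pass of such an MLP with inverted dropout, operating on a concatenated historical/current signature pair. The signature length, layer sizes, and weight initialization are placeholder assumptions; the real model is trained with cross entropy rather than using random weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights, biases, drop_p=0.5, training=False):
    """Forward pass of a small MLP with inverted dropout on hidden layers."""
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < len(weights) - 1:        # hidden layers only
            h = np.maximum(h, 0.0)      # ReLU activation
            if training:                # Dropout is active only while training
                mask = rng.random(h.shape) >= drop_p
                h = h * mask / (1.0 - drop_p)
    return 1.0 / (1.0 + np.exp(-h))     # sigmoid -> score in (0, 1)

SIG_DIM = 8  # placeholder signature length
# Input: a historical signature concatenated with the current signature.
weights = [rng.normal(0.0, 0.1, (2 * SIG_DIM, 16)),
           rng.normal(0.0, 0.1, (16, 1))]
biases = [np.zeros(16), np.zeros(1)]

pair = rng.random(2 * SIG_DIM)
score = mlp_forward(pair, weights, biases, training=False)
```

The inverted-dropout scaling (dividing by `1 - drop_p` during training) keeps activation magnitudes consistent between training and inference, so no rescaling is needed at prediction time.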

When performing authentication, we first extract all historical signatures for producers that appear in the current NetFlow export file from a relational database. Next, we iterate over the identified producers and calculate authentication scores. These scores provide a measure of how well current behavior aligns with historical behavior for each producer. If the likelihood value is low, this suggests that something fundamental in the producer’s behavior has changed at the current time step. This anomalous interval could have a variety of causes, such as compromised user credentials, spikes or lulls in activity, or a change of purpose for the producer. We do not attempt to determine the root cause of the authentication failures in this work. However, this information, along with other explainability measures, would assist system administrators and other end users in determining the correct course of action to remediate these issues. Building and including this capability is an interesting problem for future research.
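The retrieval-and-scoring loop might look like the sketch below, which uses an in-memory SQLite table for historical signatures and, in place of the trained MLP, a simple `1 - cosine similarity` stand-in score. The table layout, producer names, and signature values are all hypothetical.

```python
import sqlite3

import numpy as np

# Toy relational store of historical signatures, keyed by producer.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE history (producer TEXT PRIMARY KEY, signature BLOB)")
for name, sig in [("ws-a", [0.9, 0.1]), ("ws-b", [0.1, 0.9])]:
    db.execute("INSERT INTO history VALUES (?, ?)",
               (name, np.array(sig).tobytes()))

def inauthenticity(current, historical):
    """Stand-in score (1 - cosine similarity); the real system uses the
    trained authentication MLP here instead."""
    cos = current @ historical / (
        np.linalg.norm(current) * np.linalg.norm(historical))
    return 1.0 - cos

# Producers observed in the current NetFlow export file (toy data):
current_signatures = {"ws-a": np.array([0.85, 0.15])}

scores = {}
for producer, current in current_signatures.items():
    row = db.execute("SELECT signature FROM history WHERE producer = ?",
                     (producer,)).fetchone()
    historical = np.frombuffer(row[0])  # stored as float64 bytes
    scores[producer] = inauthenticity(current, historical)
```

Because `ws-a`'s current signature is close to its history, its inauthenticity score stays near zero; a stolen-credential scenario would pair the history with a very different current signature and push the score up.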

Analyzing the Results

Figure 3 illustrates the authentication model's predictions for two use cases: authentic and inauthentic connections. The data used to generate this figure comes from the validation set of signatures, so the authentication model was not exposed to it during training. We label the scores based on inauthenticity, i.e., higher values indicate a producer is behaving in a less authentic manner relative to its historical behavior. We believe this visualization decision better aligns with a system administrator's intuition, given that this system is similar in nature to an anomaly detection system, where higher values signify increasing anomaly severity. We also include a decision threshold, which was determined using a Receiver Operating Characteristic (ROC) curve on samples from the validation set. Authentication scores for intervals exceeding this threshold are classified as inauthentic, and the corresponding activity during those time periods should be flagged as such.

Figure 3. An authentic and inauthentic connection plot using validation set signatures to demonstrate the nature of authentication scores. The date range is not identical for the sub-plots because the NetFlow data represents real machines with different levels of activity.
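Threshold selection from a ROC curve can be sketched by maximizing Youden's J statistic (TPR minus FPR) over candidate thresholds. The scores and labels below are toy validation data, and the specific operating-point criterion is an assumption; the source only states that the threshold was derived from a validation-set ROC curve.

```python
import numpy as np

# Toy validation scores: higher = more inauthentic. y = 1 marks intervals
# that are truly inauthentic (impostor pairs).
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.70, 0.80, 0.90, 0.95])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def best_threshold(scores, y):
    """Pick the threshold maximizing Youden's J = TPR - FPR on the ROC curve."""
    best_t, best_j = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t                          # flag as inauthentic
        tpr = (pred & (y == 1)).sum() / (y == 1).sum()
        fpr = (pred & (y == 0)).sum() / (y == 0).sum()
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t

threshold = best_threshold(scores, y)
```

In a deployment, the operating point would instead be chosen to match the end user's tolerance for false positives versus missed detections.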

Sub-plot 3A of Figure 3 features authentication scores where the historical signature and current signature come from the same producer. We expect the majority of scores to fall below the decision threshold, and we can see that this is the case. There is a brief interval of inauthentic behavior detected at the end of the sample. This is most likely caused by an insufficient enrollment period for the machine, which would result in an inadequate historical signature for that machine. However, two false positives across a five-day period may be considered an acceptable margin of error depending on the end user's deployment requirements.

Sub-plot 3B of Figure 3 illustrates the opposite case. In this scenario, the historical signature comes from a different producer than the signature for the current time step. We find a parallel of this situation in the real-world when, for example, a user’s credentials are stolen by a malicious party. In this case, we expect the majority of the predictions to be inauthentic. The figure shows that this expectation is correct. The scores hover around a value of 1.0, indicating the authentication model is certain that the producer claiming this historical signature is an impostor. We also see two false negatives in this plot, although their appearance may be insignificant, depending on one’s tolerance for this type of error.

Proof of Concept Environment


Similar to many of our previous efforts, we chose to deploy a minimal version of the system with Docker on our local network. Two containers are required to run the system. The first houses all of the Nfdump functionality for deployment to a compute resource configured to receive NetFlow packets from a router on the network. The second container comprises the ML models, a relational database for signature and flow meta-data storage, and a REST API that orchestrates machine-IP (flow) labeling.

The REST API within the second container is particularly important for successful system execution. End users are required to provide the machine-IP address labels, which we use throughout the prediction procedure. The API is agnostic to the method or device that produces these labels, which supports system integration across a variety of different infrastructures. In our case, we rely on a next generation firewall (NGFW) appliance to provide them. To collect this data, a Python script parses the logs from the NGFW and transmits the data to the Docker container via the API. All requests for updating machine-IP address time span records must match our JSON schema. The system reads this data into the database and queries it to label flows within the NetFlow exports produced by Nfdump. Next, the labeled flow records are fed to the signature generation model, then to the authentication model at each time step, along with the corresponding historical signature. Both of these steps require correctly assigned labels. If the labeling appliance is improperly configured, or if these labels become corrupted in any manner, all stages of the ML pipeline will produce nonsensical predictions.
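A machine-IP time-span update and its schema check might look like the following sketch. The field names are illustrative placeholders, not the system's actual JSON schema, and a production API would also validate types and timestamp formats.

```python
import json

# Hypothetical machine-IP time-span record posted to the labeling API.
payload = json.dumps({
    "producer": "workstation-a",
    "ip": "10.0.0.5",
    "start": "2021-06-01T12:00:00Z",
    "end": "2021-06-01T13:00:00Z",
})

REQUIRED_FIELDS = {"producer", "ip", "start", "end"}

def validate(raw):
    """Reject label updates that do not match the expected schema."""
    record = json.loads(raw)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

record = validate(payload)
```

Rejecting malformed records at the API boundary is what protects the downstream pipeline from the corrupted-label failure mode described above.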

Prediction Visualization

Also included in the system is a lightweight dashboard for displaying predictions from the models. The dashboard uses Streamlit, a Python package that enables developers to render data in the browser with common visualization libraries like Plotly or Seaborn. Figure 4 is an image of the dashboard for a real producer. Scores are plotted over time to demonstrate how the predictions vary. Future versions of the dashboard may provide the ability to view running NetFlow metrics alongside these scores to enable detailed analysis of the inauthentic intervals.

Figure 4. Streamlit dashboard rendering authentication predictions and meta-data for a real producer over time.

Extensions: Zero-Trust Dynamic Policy Enforcement

In the zero-trust setting, a policy constitutes the privileges that a producer (machine or user) has on the network. Policies are largely rule-based and do not vary with system usage or behavior. For example, an administrator can create a policy that only allows HTTP traffic with a destination port of 80 for a server dedicated to handling web traffic. Any request that does not match this requirement, e.g., an arbitrary user attempting to establish a shell session to the machine on port 22, will be rejected by the policy enforcement device.

These hard-coded rules can be effective, but they are also static and cannot adapt to changing circumstances. The ML system outlined above can augment this setup to provide data-driven updates to these policies. By harvesting NetFlow data for producers on the network, we can implement dynamic policies to periodically verify a producer’s authenticity for the duration of that producer’s connection to the network. For example, if a given producer is labeled as inauthentic by the authentication model for a number of successive time steps, we can revoke that producer’s access to resources on the network. As the inauthentic behavior subsides, we can re-enable this producer’s permission to other machines, in line with the static policies previously defined by the administrator. We are currently exploring implementation strategies for this use case on our network, and we expect this exploration to reveal various areas for improvement and optimization in the ML solution.
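The successive-interval revocation rule can be sketched as a simple streak counter. The threshold of three intervals is an assumed value; the source only says "a number of successive time steps".

```python
REVOKE_AFTER = 3  # assumed: successive inauthentic intervals before revoking

def enforce(flags, revoke_after=REVOKE_AFTER):
    """Return per-interval access decisions: revoke after `revoke_after`
    successive inauthentic intervals, restore once behavior is authentic."""
    decisions, streak = [], 0
    for inauthentic in flags:
        streak = streak + 1 if inauthentic else 0
        decisions.append("revoked" if streak >= revoke_after else "allowed")
    return decisions

# One inauthenticity flag per five-minute interval (toy sequence).
flags = [False, True, True, True, False]
decisions = enforce(flags)
```

Requiring several consecutive flagged intervals before revoking access trades a few minutes of detection latency for robustness against the isolated false positives seen in Figure 3.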

Next Steps

Milliseconds are significant when it comes to malicious activities like data exfiltration. At the moment, our solution is constrained by the periodic update limit of Nfdump. Specifically, we must wait until the NetFlow export files rotate to predict new behavioral signatures and their corresponding authentications. We are considering modifying Nfdump, or developing a similar solution, to capture and store NetFlow packets as they arrive from the router. This would enable continuous producer authentication, which would allow administrators to detect and remediate anomalies as they occur, instead of minutes after they occur.

As with the majority of supervised ML models, the performance of the neural networks in our solution will increase with the amount of labeled data they are exposed to during training. Unfortunately, when it comes to NetFlow, we are subject to real-time collection constraints and are bound by the number of producers on the training network. If we want to train on a month’s worth of data, then we need to collect data for a month. If we want to verify how the signature generation model scales with the number of producers, then our network must contain a large number of producers. Both of these aspects introduce significant delays in solution development and prototyping. To reduce this cost, we intend to explore methods that rely on generative models to produce realistic, though simulated, NetFlow and behavioral signatures. In addition, we will examine the benefits of self-supervised pre-training on the models, prior to adapting them for specific tasks, like signature generation. We will include these extensions in the training procedures for both models and analyze their effects on overall system performance.

To discuss how the techniques above could be applied to your applications and data sources, or to learn more, please contact us at


References

  1. Haag, Peter. Nfdump [Source Code], 2021.
  2. McInnes, Leland, John Healy, and James Melville. "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv preprint arXiv:1802.03426 (2018).
  3. "NetFlow Version 9 Flow-Record Format [IP Application Services]." Cisco, 16 June 2011.
  4. Srivastava, Nitish, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research 15.1 (2014): 1929–1958.
  5. Tay, Yi, et al. "Efficient Transformers: A Survey." arXiv preprint arXiv:2009.06732 (2020).

