# How Data Science Can Save You From a Heuristics Headache

Author: Brenna Gibbons

Brenna Gibbons is a Data Scientist at ActZero. She first learned how powerful the combination of domain expertise and machine learning can be while earning her PhD in Materials Science and Engineering at Stanford University. She has since become fascinated by the potential for similar gains in cybersecurity. She especially likes that her work, and the algorithms she trains, protect people from cybersecurity threats.

Given how overhyped AI is, it’s tough to blame people for thinking that they might be able to achieve similar outcomes using rules or basic statistics (the folks you should really blame are the marketing people!). That being said, in this blog post I’m going to explain why these simple heuristics pale in comparison to what a proper machine learning algorithm, fueled by data science, can do.

## What do we mean by a statistical/analysis approach?

Many cybersecurity detections begin as simple queries or rules that originate from either known incidents or expert knowledge. These simple rules or heuristics can be great! They are usually informed by evidence and often tailored to a specific company’s environment. These detections can range from string matches for known indicators of attack (IOAs) to statistical approaches that analyze whether a given event or set of events falls into a typical distribution for the endpoint or environment.
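To make the two ends of that range concrete, here is a minimal Python sketch of a string-match rule and a basic statistical check. It is an illustration only: the indicator strings, the daily launch counts, and the three-standard-deviation threshold are all hypothetical values chosen for the example, not real indicators or a recommended configuration.

```python
import statistics
from typing import Sequence

# Hypothetical indicator strings, for illustration only.
KNOWN_IOA_STRINGS = ["bad.exe", "evil.ps1"]


def matches_known_ioa(command_line: str) -> bool:
    """Rule-style detection: case-insensitive substring match."""
    lowered = command_line.lower()
    return any(ioa in lowered for ioa in KNOWN_IOA_STRINGS)


def is_statistical_outlier(
    value: float, history: Sequence[float], threshold: float = 3.0
) -> bool:
    """Flag a value more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # Every historical value is identical, so any change is unusual.
        return value != mean
    return abs(value - mean) / stdev > threshold


# Hypothetical daily counts of PowerShell launches on one endpoint.
daily_launches = [12, 15, 11, 14, 13, 12, 16]

print(matches_known_ioa("powershell -w 1 download(BAD.EXE)"))
print(is_statistical_outlier(90, daily_launches))
print(is_statistical_outlier(14, daily_launches))
```

Both checks are easy to write and easy to explain, which is a large part of their appeal.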

## Think statistics/analysis is sufficient? Think again

Despite their advantages, these basic analytical approaches have significant drawbacks. Static rules and heuristics risk becoming stale. When a query is written against published IOAs, for example, an attacker need only change the file name or hash slightly to evade detection. Distributions of events and commands may drift over time, with the number of alerts slowly creeping up as a result. And many IT admins can likely relate to the ever-increasing length of the allowlists that some detections need in order to function as a company grows and diversifies.
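The hash-evasion point is easy to demonstrate. The sketch below assumes a SHA-256 blocklist (the choice of hash and the placeholder payload bytes are assumptions made for the example) and shows that appending a single byte to a file is enough to miss an exact-match rule.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Placeholder bytes standing in for a known-bad file (not real malware).
original_payload = b"placeholder bytes standing in for a known-bad file"
# The same content with one extra byte appended.
modified_payload = original_payload + b"\x00"

# Hypothetical blocklist built from a published indicator.
blocklist = {sha256_hex(original_payload)}


def is_blocklisted(data: bytes) -> bool:
    """Exact-match rule: flag only files whose hash is on the blocklist."""
    return sha256_hex(data) in blocklist


print(is_blocklisted(original_payload))
print(is_blocklisted(modified_payload))
```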

Even user and entity behavior analytics (UEBA) is necessarily limited by the size of the entity pool from which the analysis is drawn. Behaviors that are new to a given user or environment may be perfectly benign, but get flagged due to their novelty alone. (While some UEBA products are based on data science methods like anomaly detection, many are built on basic statistics, and the latter are susceptible to false positives each time a user acquires new software or learns a new skill.)

## A better approach: anomaly detection

Anomaly detection can complement a statistical or rules-based approach to mitigate certain drawbacks. In anomaly detection, the machine learning (ML) algorithm looks for outliers: in other words, anything that looks “weird.” Humans (and cybersecurity professionals especially) are natural anomaly detectors. Think of a time you’ve sifted through a bunch of data looking for that “needle in a haystack,” without thinking about the exact parameters of what you are looking for. But we can’t expect humans to sift through the quantity of alerts generated in modern environments. Thankfully, anomaly detection algorithms work in much the same way, learning what is normal for an environment without ever being given strict boundaries.
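As a toy illustration of "learning what is normal," the sketch below builds a token-frequency model from a handful of baseline commands and scores new commands by how surprising their tokens are. This is a deliberately simple rarity score written for this example, not the models described in this post; the class name `TokenRarityModel` and the baseline commands are hypothetical.

```python
import math
from collections import Counter


class TokenRarityModel:
    """Scores a command by the average surprisal of its tokens."""

    def __init__(self, baseline_commands):
        # Count how often each whitespace-separated token appears in the baseline.
        self.counts = Counter(
            token for cmd in baseline_commands for token in cmd.lower().split()
        )
        self.total = sum(self.counts.values())
        # One extra vocabulary slot so unseen tokens get a non-zero probability.
        self.vocab_size = len(self.counts) + 1

    def score(self, command):
        """Higher scores mean the command looks less like the baseline."""
        tokens = command.lower().split()
        if not tokens:
            return 0.0
        surprisal = [
            -math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        ]
        return sum(surprisal) / len(surprisal)


# Hypothetical "normal" activity for one environment.
baseline = [
    "Get-ChildItem C:\\Users",
    "Get-Process",
    "Get-ChildItem C:\\Temp",
    "Get-Service",
    "Get-Process",
]
model = TokenRarityModel(baseline)

print(model.score("Get-Process"))
print(model.score("-w 1 dow`nlo`ad(bad.exe)"))
```

No rule in this sketch names a forbidden string. The second command scores higher only because its tokens never appeared in the baseline.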

Our anomaly detection algorithms use characteristics similar to those a cybersecurity expert would look at, or even those a statistical approach might use. The difference is that they combine those characteristics with a complexity that would be challenging to write into a statistical heuristic, and they can process far more events than a human. In addition, our anomaly detection models can go beyond what is normal vs. weird in a specific user environment to analyze trends across businesses similar to yours. We’ll come back to an example of why that’s useful in a minute.

Let’s look at some concrete examples. In the following scenarios, we’ll look at detections involving PowerShell commands. (For more information on PowerShell and other scripting that can be used maliciously, check out our Threat Insight.) In these examples, I will contrast the use of traditional rule- or heuristic-based approaches to detecting specific malicious PowerShell techniques with an anomaly detection approach.

• Example 1: Avoiding a False Negative from Obfuscation
• Malicious actors often try to trick rule-based detections by throwing in obfuscating characters — for example, a command line that begins “-w 1 dow`nlo`ad(bad.exe)” might trick a simple string match looking for the term “download.” A common countermeasure would be to preprocess the command line, removing unusual characters or punctuation and thereby increasing the likelihood of correctly matching specific words against the detection rules. The problem is that this approach risks throwing the baby out with the bathwater. In this case, the command is anomalous precisely because it has those extraneous characters in the middle of a word. Additionally, many script obfuscators will use elements like environment variables that are difficult to process safely. An anomaly detection model can pick up on the strange way the word is split, the presence of unusual characters, and the presence of the word “download” simultaneously, greatly increasing the probability of a detection.
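The trade-off in Example 1 can be sketched in a few lines of Python. The set of "filler" characters and the feature names below are assumptions made for illustration; a real model would use many more signals and would learn how to weigh them rather than relying on hand-written checks.

```python
import re

KEYWORD = "download"
# Assumed set of filler characters an obfuscator might inject.
OBFUSCATION_CHARS = "`^'\""


def naive_match(command_line):
    """Simple string match: misses the keyword when it is split up."""
    return KEYWORD in command_line.lower()


def normalized_match(command_line):
    """Strip filler characters first: finds the keyword, discards the evidence."""
    stripped = command_line.translate(str.maketrans("", "", OBFUSCATION_CHARS))
    return KEYWORD in stripped.lower()


def extract_features(command_line):
    """Keep every signal so a model can weigh them together."""
    return {
        "keyword_plain": naive_match(command_line),
        "keyword_after_stripping": normalized_match(command_line),
        "obfuscation_char_count": sum(
            command_line.count(c) for c in OBFUSCATION_CHARS
        ),
        # A backtick with a letter on each side suggests a deliberately split word.
        "split_word": bool(re.search(r"[A-Za-z]`[A-Za-z]", command_line)),
    }


sample = "-w 1 dow`nlo`ad(bad.exe)"
print(naive_match(sample))
print(normalized_match(sample))
print(extract_features(sample))
```

The stripped match recovers the keyword, but only the feature view retains the fact that the word was split in the first place.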
• Example 2: Avoiding False Positives from Benign Processes
• On the flip side, because many legitimate PowerShell scripts are so functionally similar to malicious PowerShell, they can often cause false positives in heuristic-based detection systems. Take this PowerShell command, common on machines running Visual Studio software: