While signature-based detection isn’t enough on its own to protect a network against structured attackers, it is one of the cornerstones of a successful network security monitoring capability. If you have ever managed a signature-based detection mechanism, then you know that you can’t simply turn the device on and let it work its magic. A signature-based detection mechanism like Snort, Suricata, or various commercial offerings requires careful deployment and tuning of signatures (often called rules) to ensure that you receive reliable, high-quality alerts. The ultimate goal, while never fully achievable, is that every alert generated by these detection mechanisms represents the activity it was designed to detect. This means that every alert will be actionable, and you won’t waste a lot of time chasing down false positives.
A key to achieving high fidelity alerting is having the ability to answer the question, “Do my signatures effectively detect the activity they are designed to catch?” In order to answer that question, we need a way to track the performance of individual signatures. In the past, organizations have relied on counting false positive alerts to determine how effective a signature is. While this is a step in the right direction, I believe it falls one step short of a useful statistic. In this article I will discuss how a statistic called precision can be used to measure the effectiveness of IDS signatures, regardless of the platform you are using.
Before we get started, I think it’s necessary to describe a few terms that are used throughout this article. If we are going to play ball, it helps if we are standing on the same field. This article is meant to be platform-agnostic, however, so the equipment you use doesn’t matter.
There are four main data points that are generally referred to when attempting to statistically confirm the effectiveness of a signature: true positives, false positives, true negatives, and false negatives.
True Positive (TP): An alert that has correctly identified a specific activity. If a signature was designed to detect a certain type of malware, and an alert is generated when that malware is launched on a system, this would be a true positive, which is what we strive for with every deployed signature.
False Positive (FP): An alert that has incorrectly identified a specific activity. If a signature was designed to detect a specific type of malware, and an alert is generated for an instance in which that malware was not present, this would be a false positive.
True Negative (TN): An alert has correctly not been generated when a specific activity has not occurred. If a signature was designed to detect a certain type of malware, and no alert is generated when that malware has not been launched, then this is a true negative, which is also desirable. This is difficult, if not impossible, to quantify in terms of NSM detection.
False Negative (FN): An alert has incorrectly not been generated when a specific activity has occurred. If a signature was designed to detect a certain type of malware, and no alert is generated when that malware is launched on a system, this would be a false negative. A false negative means that we aren’t detecting something we should be detecting, which is the worst-case scenario. False negatives aren’t detectable unless you have a secondary detection mechanism or signature designed to detect the activity that was missed.
The Fallacy of Relying Solely on False Positives
Historically, the effectiveness of an IDS signature has been measured by counting its false positives over an arbitrary time period and comparing that count to a certain threshold. Let’s consider a scenario for this statistic. In this scenario, we’ve deployed signature 100, which is designed to detect the presence of a specific command and control (C2) string in network traffic. Over the course of 24 hours, this signature has generated 500 false positive alerts, which results in an FP rate of 20.8/hour. You have determined that an acceptable threshold for false positives is 0.5/hour, so this signature would be deemed ineffective based upon that threshold.
At face value, this approach may seem effective, but it is flawed. If you remember from a few paragraphs ago, the question we want to answer centers on how well a signature detects the activity it is designed to catch. When we only consider FPs, we are actually only measuring how well a signature DOESN’T detect what it is designed to catch. While it sounds similar, measuring how often something fails is not the same as measuring how often it succeeds.
Earlier, we stated that signature 100 was responsible for 500 false positives over a 24-hour period, meaning that the signature was not effective. However, what if I told you that this signature was also responsible for 5,000 true positives during the same time period? In this case, the FPs were well worth it in order to catch 5,000 actual infections! This is the key to precision: taking both false positives and true positives into consideration.
At this point it’s probably important to say that I’m not a statistician. As a matter of fact, I only took the bare minimum required math courses in college, but fortunately precision isn’t too tricky to understand. In short, precision measures how many of the positive results produced by a signature are actually correct. Often referred to as positive predictive value, precision is the proportion of true positives among all positive results (both true positives and false positives), given by the formula:
Precision = TP / (TP + FP)
This value is expressed as a percentage, and can be used to determine the probability that, given an alert being generated, the activity that has been detected has truly occurred. Therefore, if a signature has a high precision and an alert is generated, then the activity has very likely occurred. On the other hand, if a signature with low precision generates an alert, then it is unlikely that the activity you are attempting to detect has actually occurred. Of course, this is an aggregate statistic, so more data points will result in a more reliable precision value.
In the example above, we can determine that signature 100, which had 500 FPs and 5,000 TPs, has a precision of 90.9%.
5000 / (5000 + 500) = 0.909, or 90.9%
That's pretty good!
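To make the formula concrete, here is a minimal Python sketch; the function name and the guard against zero reviewed alerts are my own additions, not part of any particular tool:

def precision(tp, fp):
    """Return precision as a percentage, or None if no alerts have been reviewed."""
    total = tp + fp
    if total == 0:
        return None  # precision is undefined until at least one alert is reviewed
    return 100.0 * tp / total

# Signature 100 from the example above: 5,000 TPs and 500 FPs
print(precision(5000, 500))  # 90.909...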
Using Precision in the SOC
Now that you understand precision, it is helpful to think about ways it can be used in a SOC environment. After all, you can have the best statistics in the world, but if you don’t use them effectively then they aren’t providing a lot of value. This begins by making the commitment to track a precision value for each signature deployed on a given detection mechanism. Again, it doesn’t matter what detection mechanism you are using. Whenever you deploy a new signature, it is assigned a precision of 0. As analysts review the alerts generated by that signature, they mark each alert as either a TP or an FP. As time goes on, these data points determine the precision value, expressed as a percentage.
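As a rough illustration of that workflow, the sketch below tracks analyst verdicts per signature and recomputes precision as alerts are reviewed. The class and method names are hypothetical and not tied to any particular console or SIEM:

from collections import defaultdict

class PrecisionTracker:
    """Track analyst TP/FP verdicts per signature ID and compute precision."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"tp": 0, "fp": 0})

    def record(self, sig_id, is_true_positive):
        # Called each time an analyst classifies an alert from this signature.
        key = "tp" if is_true_positive else "fp"
        self.counts[sig_id][key] += 1

    def precision(self, sig_id):
        c = self.counts[sig_id]
        total = c["tp"] + c["fp"]
        if total == 0:
            return 0.0  # newly deployed signature with no reviewed alerts yet
        return 100.0 * c["tp"] / total

# Example: analysts classify three alerts generated by signature 100
tracker = PrecisionTracker()
tracker.record(100, True)
tracker.record(100, True)
tracker.record(100, False)
print(tracker.precision(100))  # 66.7 (2 TPs, 1 FP)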
Precision for Signature Tuning
First and foremost, precision provides a mechanism that can be used to track the effectiveness of individual signatures. This provides quite a bit of flexibility that you can tune to the capacity of your own staff. If you are a small organization with only a single analyst, you can choose to keep only signatures with a high precision, say >80%. If you have a larger staff and more resources to devote to alert review, or if you simply have a low risk tolerance, you can choose to keep signatures with lower precision values, say >30%. In some instances, you can even define your precision threshold based upon the nature of the threat you are dealing with. Signatures for unstructured or commodity threats might be required to maintain precision >90%, while signatures for structured or targeted threats might be considered effective at anything >20%.
There are a few different paths that can be taken when you determine that a signature is ineffective based upon the precision standards you have set. One path is to spend more time researching whatever it is you are trying to detect with the signature, and attempt to strengthen it by making its content more specific. Alternatively, you can configure low precision signatures to simply log instead of alert (an option in a lot of popular IDS software), or you can configure your analysis console (SIEM, etc.) to hide alerts from signatures below your acceptable precision threshold.
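A simple way to apply those thresholds might look something like the sketch below. The category names, threshold values, and default are illustrative assumptions, and whether a low precision signature gets tuned, demoted to log-only, or hidden is ultimately a policy decision for your SOC:

# Illustrative precision thresholds (in percent) by signature category
THRESHOLDS = {
    "commodity": 90.0,  # unstructured / commodity threats
    "targeted": 20.0,   # structured / targeted threats
}

def disposition(precision_pct, category):
    """Suggest what to do with a signature given its precision and threat category."""
    threshold = THRESHOLDS.get(category, 50.0)  # arbitrary default for unlisted categories
    if precision_pct >= threshold:
        return "keep alerting"
    # Below threshold: tune the signature, demote it to log-only,
    # or hide its alerts in the analysis console.
    return "tune, log-only, or hide"

print(disposition(90.9, "commodity"))  # keep alerting
print(disposition(15.0, "targeted"))   # tune, log-only, or hide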
Precision for Signature Comparison
On occasion, you may run into situations where you have more than one signature capable of detecting similar activity. When that happens, precision provides a really useful statistic for comparing those signatures. While some instances might warrant deploying both signatures, resources on detection systems can be scarce at times. As such, it might be helpful for performance to enable only the most useful signature. In a scenario where signature 1 has a precision of 85% and signature 2 has a precision of 22%, signature 1 would be the best candidate for deployment.
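Continuing the sketch, choosing between overlapping signatures then reduces to comparing their precision values; the numbers here are made up for illustration:

# Hypothetical precision values for two signatures that detect similar activity
candidates = {"signature_1": 85.0, "signature_2": 22.0}

# When resources only allow one, deploy the signature with the highest precision
best = max(candidates, key=candidates.get)
print(best)  # signature_1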
Precision for Analyst Confidence
Whenever you write a signature, you should consider the analyst who will be reviewing the alerts it generates. After all, they are the ultimate consumer of the product you are generating, so any context or help you can provide the analyst in relation to the signature is valuable.
One item in particular that can help the analyst is the confidence rating associated with a signature. A lot of detection mechanisms allow their signatures to include some form of confidence rating. In some cases, this is automatically calculated based on magic, which is how a lot of commercial SIEMs operate. In other scenarios, it might be a blank field where you can input a numeric value, or a simple low/medium/high option. I would propose that a good value for this field is the precision statistic. While precision might not be the only factor that goes into a confidence calculation, I certainly believe it should be one of them. Another approach might be to have both a signature developer’s subjective confidence rating (human confidence) and a precision-based confidence rating (calculated confidence).
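One way to carry both ratings might be a simple weighted blend, sketched below. The function name and the 50/50 weighting are assumptions on my part, not a feature of any particular IDS or SIEM:

def combined_confidence(human_confidence, precision_pct, weight=0.5):
    """Blend a developer's subjective confidence (0-100) with calculated precision (0-100)."""
    return weight * human_confidence + (1 - weight) * precision_pct

# A signature its author rates at 70, with an observed precision of 90.9%
print(combined_confidence(70, 90.9))  # 80.45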
A confidence rating based on precision is useful for an analyst because it helps them direct their efforts while analyzing alerts. That way, when an alert is generated and the analyst is on the fence regarding whether it might be a TP or FP, they can rely somewhat on that confidence value to determine how much investigative effort should be put into classifying the alert. If they aren’t finding much supporting evidence that proves the alert is a TP and it has a precision-based confidence rating of 10%, then it is likely an FP and it might not be worth spending too much time on the alert. On the other hand, if it has a rating of 98%, then the analyst should spend a great deal of time and perform thorough due diligence to determine whether the alert is indeed a true positive.
Signature management can be a tricky business, but it is such a critical part of NSM detection and analysis. Because of that, it should be a heavily tracked and constantly improving process. While the precision statistic is by no means a new concept, it is one that I think most people are unaware of as it relates to signature management. Having seen precision used very successfully to help manage IDS signatures in a few organizations of varying sizes, I think it is a statistic that could find a home in many SOC environments.