Guest Column | May 25, 2017

How To Find The Right Next-Gen Security Solution


By David Corlette, director of product management, VIPRE

Working in the cybersecurity world, I’ve noticed a lot of grandiose messaging from many security vendors lately, particularly in the wake of the recent #WannaCry ransomware attack. A non-skeptical observer might believe any one vendor could solve all your security problems — and probably mow your lawn too.

One vendor struck me as particularly emblematic of this systemic trend. Paraphrasing, a booth at this year’s RSA Conference invited me to “go beyond next-gen.” I think the vendor in question has recognized that “next-gen” has become a trite, meaningless phrase and decided to update it with a completely new trite, meaningless phrase. Go figure.

What do vendors mean by terms like “next-gen” and “advanced” and others of that ilk? Clearly, they intend to communicate something more than the claim that they’re the best. Let’s examine some of the advancements in anti-malware technology and what we vendors mean when we talk this way.

First, let’s look back at the earliest techniques used to detect malicious files. In the early days, malware was simple: someone would create a file, maybe “fun.exe,” and then send it around to a bunch of people. The file was always the same, so it was easy to construct a list that said, “If you see a file called ‘fun.exe’ that is 12345 bytes, block it.”

Eventually, defenses got more sophisticated: we could calculate file “hashes” (essentially a unique fingerprint, very unlikely to match that of any other file) to quickly identify known malware. That worked for a little while, but malware authors got craftier and started shipping multiple versions of a file, changing the filename or adding a bit of junk to change the file size.
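To make the idea concrete, here is a minimal sketch of what a hash-based blocklist check might look like; the SHA-256 value and the filename are invented for illustration, and real AV engines are of course far more elaborate.

```python
import hashlib

# Hypothetical blocklist of SHA-256 fingerprints of known malware (value invented).
KNOWN_BAD_HASHES = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def sha256_of_file(path):
    """Compute the file's SHA-256 "fingerprint", reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_known_malware(path):
    """Block the file if its fingerprint matches a known-bad one."""
    return sha256_of_file(path) in KNOWN_BAD_HASHES

# Usage: is_known_malware("fun.exe") returns True only if that exact file,
# byte for byte, has been seen and hashed before -- change one byte and it slips through.
```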

In some cases malware mutates itself this way automatically every time it executes, right before it grabs your contact list and sends a copy to everyone you know. Defenders needed to get a little more sophisticated, so we developed ways to look for key patterns within the malware file. These “signatures” range from static chunks of bits within a file to blocks of code we find by disassembling or emulating the malware, both of which are hopefully unique to that particular class of malware.

Of course, since we’re now looking at only part of the malware, we have to be careful in constructing these signatures, or so-called “expert rules,” as we may accidentally match perfectly benign files that happen to use the same constructs, generating false positives. You may have heard of “Yara rules,” which are one embodiment of this kind of detection, as are many of the antivirus “definitions” you receive regularly from your AV vendor.
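As a rough illustration, here is what a static byte-pattern check might look like in Python; the rule names and patterns are invented, and real signature languages such as Yara support far richer conditions.

```python
# Invented "signatures": a rule name mapped to a byte pattern that is hopefully
# unique to one malware family.
SIGNATURES = {
    "Trojan.FunExe.A": bytes.fromhex("4d5a9000deadbeef"),
    "Worm.Example.B": b"send_copy_to_contacts",
}

def match_signatures(path):
    """Return the names of any signatures whose pattern appears in the file."""
    with open(path, "rb") as f:
        data = f.read()
    return [name for name, pattern in SIGNATURES.items() if pattern in data]

# Note the false-positive risk described above: a perfectly benign file that
# happens to contain one of these byte sequences would match too.
```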

More recently, malware authors have developed ways to circumvent both hash-based detection and even very good expert rules. They do this through a variety of techniques, including anti-disassembly, code re-ordering, packed and encoded/encrypted binaries, sandbox environment detection, and so on. In many cases, particularly with targeted malware, it is now very hard to accurately block malware by looking for a single known-bad feature such as a hash or a block of code.

WannaCry is a good example of this: when it burst onto the scene late on May 12, no one had seen it before, so many antivirus engines that depend solely on signatures failed to stop it, leading to 200,000 infections in just a few hours.

Which brings us to today’s advanced techniques, a.k.a. the “next-gen” solutions mentioned above. The idea is to make anti-malware software more sophisticated: instead of looking for a single feature to identify malware, it examines a variety of features in combination to make a determination.

The problem is it’s very hard for humans to figure out which features are relevant (Is it the filename? Is it the compiler options? Is it a particular block of code? Is it the PE32 header?) and, if they are relevant, how much weight to give each indicator. Enter machine learning and Big Data.

We can use machine-learning techniques to “train” a system to identify malware. To do so, we build a detection engine that can “learn,” and then we send lots of known malware plus some benign files through that engine. We tell the engine in advance which is which. The engine extracts a set of features (possibly thousands) from each file ranging from header data and code snippets to behavior based on sandbox execution, and then figures out how much any particular feature contributes to the likelihood that a file is malicious.
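A toy version of that training step might look like the following; the feature columns, the sample values, and the choice of scikit-learn’s random forest are all assumptions for illustration, not a description of any particular vendor’s engine.

```python
from sklearn.ensemble import RandomForestClassifier

# Each row is one file; each column is an extracted feature, e.g.
# [is_executable, deletes_mbr, suspicious_api_calls, section_entropy]
X_train = [
    [1, 0,  1, 2.4],   # benign sample
    [1, 1, 14, 7.8],   # malware sample
    [1, 0,  2, 3.1],   # benign sample
    [1, 0, 11, 7.2],   # malware sample
]
y_train = [0, 1, 0, 1]  # we tell the engine in advance which is which

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# For a new, unknown file we extract the same features and ask the model
# how likely the file is to be malicious.
unknown = [[1, 0, 9, 7.5]]
print(model.predict_proba(unknown)[0][1])  # estimated probability of "malware"
```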

For example, imagine a feature such as “file is executable.” For any given malware file and any given benign file this will be true, so the feature is 50/50 in terms of predictive capability — in other words, not predictive at all. On the other hand, consider something like “file deletes the MBR.” That might be present in a lot of malware but is very unlikely to appear in benign files, save for a very few disk formatters. It’s quite predictive, but not 100 percent. Combine this predictive power across dozens of features and you can start to build up a good probability model of what is and isn’t malware — provided your engine is well written, uses the right algorithms, is properly trained and pruned, and so on; in other words, provided it’s a “good” engine.
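Here is a back-of-the-envelope sketch of that combination step, using a naive log-odds calculation; the per-feature rates are invented numbers, and real models weigh features in far more sophisticated ways.

```python
import math

# Invented rates: how often a feature shows up in malware vs. benign files.
# "is_executable" is roughly even (not predictive); "deletes_mbr" is rare in
# benign files (highly, but not perfectly, predictive).
FEATURE_RATES = {
    "is_executable": (0.99, 0.99),    # (P(feature | malware), P(feature | benign))
    "deletes_mbr":   (0.30, 0.001),
    "packed_binary": (0.70, 0.05),
}

def malice_log_odds(observed_features):
    """Sum the log-odds contributed by each observed feature (assumes a 50/50 prior)."""
    return sum(math.log(FEATURE_RATES[f][0] / FEATURE_RATES[f][1])
               for f in observed_features)

print(malice_log_odds(["is_executable"]))                  # ~0: tells us nothing
print(malice_log_odds(["is_executable", "deletes_mbr"]))   # strongly suggests malware
```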

So what makes a “good” detection engine? Well, there are four core predictors:

  1. The quality of the model, which depends on the statistical techniques used.
  2. Which features the model analyzes (some use just a few features about a given file, and some can even use deep behavioral analytics).
  3. The size and quality of the training data that informs the model.
  4. The speed at which the model can be matched against new, unknown files (which again depends on the techniques used in the software).

A “good” engine does a good job of examining a new file sample and predicting whether it will exhibit malicious behavior, catching as much malware as possible while at the same time not producing too many false positives.
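Those two qualities boil down to two measurable numbers, which a sketch like the following could compute over a labeled test set; the verdicts and labels here are made up for illustration.

```python
def detection_and_fp_rates(verdicts, truth):
    """verdicts/truth: parallel lists of booleans (True = flagged / actually malware)."""
    malware_verdicts = [v for v, t in zip(verdicts, truth) if t]
    benign_verdicts  = [v for v, t in zip(verdicts, truth) if not t]
    detection_rate = sum(malware_verdicts) / len(malware_verdicts)
    false_positive_rate = sum(benign_verdicts) / len(benign_verdicts)
    return detection_rate, false_positive_rate

# 3 of 4 malware samples caught, 1 of 4 benign files wrongly flagged.
print(detection_and_fp_rates(
    [True, True, True, False,  True, False, False, False],
    [True, True, True, True,   False, False, False, False],
))
```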

If it’s not obvious, numbers 1 and 2 above often conflict with number 4 — the more detailed and presumably accurate the model, the slower it usually is. Most of the factors that determine whether an engine is “good” are hidden from the end user, which makes selecting the right vendor a challenge. But as trusted security providers looking to deliver the best solution to your customers, here are two things you can evaluate:

  1. Training a model requires a lot of input data — a LOT. Newer vendors might have exciting technology, but it is likely prohibitively expensive for them to gain access to the billions of malware samples they need to really train their engines. Look for companies that have been around for a while and have had a chance to collect the many, many samples needed.
  2. In the end, a good model will do a good job of matching both prevalent and zero-day malware. I encourage you to review a number of independent testing organizations that specifically test this sort of thing using diverse sample sets. One of the best and clearest such tests is AV-Comparatives’ Real World Protection test. As an example, the “beyond next-gen” vendor we introduced at the start of this article does a decent job of detecting malware under this test, but for all their talk, they also blocked 67 known-good files and domains, which would annoy your users to no end.

In the end, terms like “next-gen” and “advanced” don’t really mean much by themselves. What matters is whether your anti-malware solution works for you, and can block emerging zero-day threats like WannaCry. Detection rates and false positives should be a key component of your buying decision. Look to independent testers and company longevity for guidance, but don’t forget about other aspects as well: performance, usability, manageability, and total cost of ownership over the long term. The more things you consider in a security partner, the more your customers will appreciate it.