Virus Counting Is a Numbers Game

The malware seen by antivirus companies will jump by at least half this year. But what does that mean?

Robert Lemosarchive page

July 23, 2009

Duck and cover.

That might be the first reaction to the latest statistics on the fast expansion of malicious software–“malware” for short–from antivirus firm McAfee. The company announced on Wednesday that it had found and analyzed more than 1.2 million unique malware programs in the first half of the year, compared to 500,000 in the first six months of 2008. In total, McAfee processed between 1.5 million and 1.6 million pieces of malware last year.

The data suggest that malware is on track to grow by 50 to 150 percent this year. Of course, the question remains: Just what is a unique piece of malware? Cybercriminals regularly compress their programs or obfuscate them in ways to elude detection by antivirus companies. In some cases, two victims can go to the same compromised website and get the same program packaged in two different ways, potentially leading an antivirus company to classify it as two different malicious programs.

“Literally, every time you hit the refresh button, you get a new build,” says David Marcus, director of security research and communications for McAfee.

McAfee defines unique malware as a program that requires a separate “driver,” which other companies might call a signature. There typically are three types of drivers: a generic driver that recognizes whole classes of viruses or Trojan horses, a heuristic driver that recognizes certain bad or exploitive behavior, and a specific driver for a single program–essentially a binary hash.

Each piece of malware has to go through McAfee’s analysis system, which manages workflow and automatically processes some 90 to 95 percent of the distinct files it receives from its installed base. Without the system, the company–and other antivirus firms–would be buried under an avalanche of code: McAfee’s analysis system, dubbed Artemis, must process tens of thousands of potentially malicious programs every day, of which nearly 7,000 are considered “new.”

Writing generic drivers to detect tens of thousands of the binary-distinct programs helps compress the sheer volume of the work, Marcus says.

“We are going to look at all the things we saw in a particular day, and ask, ‘How can we write a generic driver to detect all of those samples?’” he says. In addition, generic drivers help speed the firm’s virus-scanning engine.

In some ways, the expansion of viruses equates to a greater workload for antivirus firms’ analysts. Most of the firms have opened overseas analysis centers to help them deal with the multiplying challenges of classifying the malicious from the valid. In addition, better software tools–such as McAfee’s Artemis–help manage the workload more efficiently.

Yet, the numbers lie as well. Efforts to make McAfee’s database more efficient–three generic drivers are far faster to apply than 100,000 individual signatures–requires a great deal of work not reflected in the current tally. In fact, such work actually reduces the apparent increase in viruses and malicious code. “The 1.2 million that you are seeing is not inclusive of the generic and heuristic number,” Marcus says.

While Marcus acknowledges that such issues make it almost irrelevant to keep track of the numbers, he points out that customers want the data. “A customer wants to know how much stuff we are protecting them from today.” He adds, “At the end of the day, [cybercriminals] are writing more malware than they were a year ago.”

Keep Reading

Large language models can do jaw-dropping things. But nobody knows exactly why.

And that's a problem. Figuring it out is one of the biggest scientific puzzles of our time and a crucial step towards controlling more powerful future models.