When Nothing is Better than Something

Any reasonable information security system consists of two fundamental components: (1) a risk assessment; & (2) controls that minimize those risks. In this article I want to talk about the risk side of that equation, & about the companies that sell cybersecurity products & services, which supply the controls.

Understanding the concept of “risk” is tricky for a number of reasons. It is not black & white.  Risk is nuanced.

First, we need to understand that the study of risk is not just a dry, boring exercise confined to finance, cybersecurity, insurance, program management, & the like. Each of us informally manages & evaluates risk every day of our lives, & we do it unconsciously.

Consider the potentially lethal bowl of cereal you ate this morning. The grain could contain ergot that could poison you. The milk may not have been pasteurized properly, exposing you to pathogenic Listeria bacteria. You rely upon people or organizations – risk proxies – to provide you with milk & grain that are safe to consume. Even then, you are still ultimately responsible; a miscalculated ratio of milk to cereal could cause you to choke to death. The point is, we evaluate so many risks so often that the process becomes a habit, & because we perform this risk calculus so automatically, we often develop bad habits along the way.

These bad risk-calculation habits, which often lead to adverse consequences, can be grouped into two categories: (1) an inability to discriminate between good data & bad data; & (2) an emotional rather than reasoned decision-making process. Individuals who have mastered these bad habits are often the recipients of a Darwin Award – an award that recognizes individuals who have supposedly contributed to human evolution by selecting themselves out of the gene pool through their own fatal actions. Organizations that manifest these same bad habits usually fail.

Regarding the perception of risk: it has been well established that people’s perception of risk is different from the reality of risk. For example, your risk of dying in a bathtub or from a fall (or even from falling in the bathtub) is far, far greater than the risk that you will be killed by terrorists; yet the visceral threat of airplanes hitting buildings, beheadings, or immolation is much more emotionally compelling than the threat of a bathtub.

Regarding bad data: all programmers are familiar with the concept of GIGO – Garbage In, Garbage Out. Economists & statisticians understand that no amount of clever manipulation will ever allow them to extract meaningful information from bad or dirty data. The concept is so idiomatic that even pigs understand it: “You can’t make a silk purse out of a sow’s ear.”

And yet, almost every day, silk purses of one sort or another are published by providers of cybersecurity products & services.

My concern in this article is with the almost complete lack of transparency surrounding the claims – & in particular the quality of the raw numbers & statistics – that firms whose very existence is predicated upon the sale of security products & services make available to their buyers & to the public.

Scott Adams (the author of the comic strip Dilbert) coined the portmanteau confusopoly, which aptly describes the current information security products & services marketplace:

“A confusopoly is a situation in which companies pretend to compete on price, service, and features but in fact they are just trying to confuse customers so no one can do comparison shopping.”

Within the information security industry, everyone is aware of the term FUD, an acronym for Fear, Uncertainty, & Doubt. FUD & confusopoly may not have identical meanings, but both have had, & continue to have, a major influence on purchasing decisions & risk assessments within the field.

Quantifying risk in any environment – finance, health care, insurance, home mortgages, even breakfast – requires good data. In the field of cybersecurity, we almost always see data that is derived from surveys, & we are almost always denied access to the raw survey data.

Let’s step back & evaluate these data for a moment. A descriptive statistic is one step removed from the raw data: it summarizes a body of raw data that is often hard to make sense of in its raw form. A simple example is a baseball player’s batting average. Ted Williams had 7706 at-bats during his 19 years in baseball, & of those at-bats he had 2654 hits. These two numbers are the raw data that define the descriptive statistic of “batting average”; in Ted’s case, his BA was 2654/7706, or .344. Most baseball aficionados won’t remember the raw data, but they will remember & understand that a BA of .344 was exceptional.

A short list of additional simple descriptive statistics includes the mean (the average of a distribution), the median (the point in a distribution at which half of the data lie below it & half lie above it), the correlation coefficient (a normalized measurement of how two variables are linearly related), the variance of a distribution (an indication of the spread of the data), & the standard deviation, which is simply the square root of the variance.
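
To make these statistics concrete, here is a minimal Python sketch that computes each of them for a small set of loss figures. The numbers are invented purely for illustration; they do not come from any survey.

    import statistics

    # Hypothetical annual losses (in $ thousands) reported by ten firms,
    # and the number of incidents each reported. Invented for illustration only.
    losses    = [12, 15, 9, 22, 18, 14, 11, 25, 16, 13]
    incidents = [3, 4, 2, 6, 5, 3, 2, 7, 4, 3]

    mean     = statistics.mean(losses)        # the average of the distribution
    median   = statistics.median(losses)      # half the values lie below it, half above
    variance = statistics.variance(losses)    # spread of the data (sample variance)
    std_dev  = statistics.stdev(losses)       # square root of the variance
    corr     = statistics.correlation(losses, incidents)  # linear relationship (Python 3.10+)

    print(f"mean={mean:.1f}  median={median:.1f}  variance={variance:.1f}  "
          f"stdev={std_dev:.1f}  corr={corr:.2f}")

Given real raw data, those five numbers tell you far more about a survey than any headline figure ever will.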

The point I’d like to make here is that, given a set of raw data, the descriptive statistics described above are almost always necessary in order to understand the data & to draw conclusions from it.

After reviewing about a dozen of the major cybercrime surveys that have come out in the past two years, I’ve come to the realization that not a single one of them contains data that can be trusted, or that is statistically significant in any way. In the only two studies I have been able to find that provide both the mean & the median values for losses or incidents, the data (such as it is) is in both cases extraordinarily heavy-tailed – that is, the mean (or average) is far higher than the median.

This means that a few outliers (cases with either an extraordinary amount of losses or an extraordinary number of security incidents) account for a disproportionate share of the total. This skews the mean to the point that it becomes meaningless, in that it no longer reflects the typical respondent. These skewed figures do, however, provide a significant FUD boost.
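
A small simulation shows how little it takes. The figures below are again invented: 98 respondents report modest losses while two report catastrophic ones, & the mean ends up dozens of times larger than the median.

    import statistics

    # Invented survey responses, in $ thousands: 98 firms with modest losses
    # plus two catastrophic outliers.
    losses = [25] * 50 + [50] * 30 + [100] * 18 + [50_000, 120_000]

    mean   = statistics.mean(losses)
    median = statistics.median(losses)

    print(f"median loss : {median:10,.1f}")         # reflects the typical respondent
    print(f"mean loss   : {mean:10,.1f}")           # dragged upward by two outliers
    print(f"mean/median : {mean / median:10.1f}x")  # the heavy tail in a single number

Quote the mean alone from a distribution like this & you have a headline; quote the median & you have something much closer to the typical respondent’s experience.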

In addition to the lack of publicly available statistical backing for cybercrime figures, there is a further significant complication. Almost all publicly available cybercrime statistics rely upon survey data, & survey data – particularly in the area of cybercrime – is difficult to get right. The best arguments for this can be found in the paper Sex, Lies and Cyber-crime Surveys, from Microsoft Research.

Our risk-based thinking & calculations are only as robust as the answers given in cybercrime surveys are representative & reliable. Survey error, sample error, heavy-tailed distributions, & methodological bias, combined with an almost complete lack of statistical context, give us “data” that is neither representative nor reliable. As if this weren’t bad enough, almost all of the statistics we do see are presented in the form of “infographics”, which are thrice removed from the raw data & are consequently impossible to evaluate fairly.

For example, consider two widely quoted surveys from two security industry heavyweights: in 2012, Symantec estimated cybercrime losses at roughly US$110bn, while in 2009, McAfee’s estimate of the same type of losses was an order of magnitude higher. The $890bn difference is certainly not a rounding error, & it suggests that there are significant errors (or bias) in one or both studies. Thus, we are obligated to make our risk calculations & assessments based entirely on FUD.
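
For the record, the arithmetic behind that comparison is simple; the McAfee figure below is just the ~US$1tn implied by the order-of-magnitude gap & the $890bn difference above.

    # Headline estimates of global cybercrime losses, in US dollars.
    symantec_2012 = 110e9    # Symantec, 2012: ~$110bn
    mcafee_2009   = 1_000e9  # McAfee, 2009: an order of magnitude higher (~$1tn)

    gap   = mcafee_2009 - symantec_2012
    ratio = mcafee_2009 / symantec_2012

    print(f"difference : ${gap / 1e9:,.0f}bn")   # ~$890bn
    print(f"ratio      : {ratio:.1f}x")          # ~9x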

While this situation works in favor of security vendors, it negatively impacts the community of security practitioners who rely upon vendor statistics to plan & implement security controls. And frankly, it is embarrassing, or it should be, to the industry as a whole. Cybercrime is a growth industry. It is simple to find real data on property crime, personal crime, automobile crime, & so on. Isn’t it about time we had data of the same quality on cybercrime?