Sunday, 18 May 2008

Quantitative risk requires statistics

Quantitative risk requires statistics (a rant).
It is not hard to do Quantitative risk modelling in IT Security as long as you have the maths. The difficulties are missing values (requiring longitudinal data analysis and multivariate methods) and incomplete risk profiling (requiring Bayesian methods).

The risk is a survival function with compounding time factors (heteroscedasticity).

Pulling a number "out of your ass" is qualitative. If another person cannot recalculate the same value, it is qualitative and NOT quantitative. Quantitative methods are not based in subjectivity.

Why is this not commonly done in IT risk assessment (BASEL II DOES require a quantitative risk assessment)?

  • Lack of math skills in IT people
  • SAS and other quant people earn more (2.5-3x IT salaries)
A good quant in a hedge fund can earn $300-500k US without too much trouble. This type of person rarely cares to do IT security. Hence few people who are statisticians AND security people.

Hence few quantitative risk reviews.
Some standards (BASEL II, GLBA) have requirements for quantitative risk. This is mainly banks, hedge funds etc. Few others can afford it.

The large ones do some. The smaller ones issue fake numbers more often than not.

As for being delusional if you trust in quant-based methods, that is for anyone who trusts a qualitative assessment where people pull numbers. These assess perceived risk - these do not assess risk. There is a distinction.

Qualitative = Perceived risk
Quantitative = Risk (within confidence bounds)

Some say that ncircle IP360 is quantitative; it does nothing of the sort. ncircle IP360 is a fluffy qualitative assessment.

You need to feed in all the data you can and do a little dimensionality reduction, letting the numbers choose the factors and including the errors.

If you want to start learning how:

It has been stated in one form or another that:
"Bottom line: I personally do not believe that it is possible to do a quantitative risk assessment and anyone who thinks otherwise either does not understand today's risk environment, or is delusional."

No, the opposite. Qualitative risk is for those who like to think they know. The data is far too complex to be assessed by ANY person and requires computational methods. I have yet to see a qualitative assessment that comes close when compared to a REAL quantitative one. The issue is the many naive qualitative methods that are falsely called quantitative.

Look at ARO, ALE etc. This relies on a risk calculation. The likelihood of an event for the type of organisation. The ONLY way to do this is to use survival analysis with multivariate analysis taking compounding factors into account. The issue is that people pull a figure out of their proverbial as was stated. ANY addition of non-quantitative data makes the ENTIRE calculation qualitative. ALE is ONLY a quant measure if the likelihood calcs are completed using hazard factors and survival calcs.
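As a rough illustration of driving ALE from a hazard rate rather than a guessed likelihood, here is a minimal sketch. It assumes a single constant hazard (an exponential survival function), which is far simpler than the multivariate survival analysis described above. The event count, exposure time and single-loss figure are hypothetical, as are the function names.

```python
import math

def constant_hazard_rate(events, exposure_years):
    """Maximum-likelihood estimate of a constant hazard:
    observed events divided by total time at risk."""
    if exposure_years <= 0:
        raise ValueError("exposure_years must be positive")
    return events / exposure_years

def survival(rate, years):
    """S(t) = exp(-rate * t): probability of no event up to time t."""
    return math.exp(-rate * years)

def annualised_loss_expectancy(single_loss, rate):
    """ALE = SLE x ARO, with ARO taken as the expected events per year."""
    return single_loss * rate

# Hypothetical observations: 6 compromises across 40 host-years of monitoring.
rate = constant_hazard_rate(events=6, exposure_years=40.0)
p_event_within_year = 1.0 - survival(rate, 1.0)
# Hypothetical single loss expectancy of $25,000 per compromise.
ale = annualised_loss_expectancy(single_loss=25_000.0, rate=rate)

print(f"hazard rate: {rate:.3f} per host-year")
print(f"P(at least one compromise in a year): {p_event_within_year:.3f}")
print(f"ALE: ${ale:,.2f}")
```

The point of the sketch is only that the likelihood term comes from observed events and exposure, so another analyst with the same data recalculates the same value.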

The difficulty is the cost. I have seen PCA, PLS, SIR and k-dimensional factorisation for 80+ dimensions that can take a few weeks of computer time, and this costs $. Look at the rates of C++ programmers with quant skills. The question is why use these skills for security risk when market risk pays $600-$800 an hour. Even at the rates for security risk calcs, few want to pay. My charge rate for this is $370 ex tax. At 80-plus hours of work per system, the cost of the process is often greater than the asset value and risk for smaller firms.

However, once done, the model generally only needs to be updated yearly, with the principal 5-6 components accounting for over 98% of risk by asset. This leaves an error of 1-2%, which is not material for most organisations.
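To make the "principal components accounting for over 98%" idea concrete, here is a small sketch of choosing how many components to keep by cumulative explained variance. The eigenvalues are hypothetical stand-ins for what a PCA of the covariance matrix would return, and `components_needed` is an illustrative name, not part of any library.

```python
def components_needed(eigenvalues, target=0.98):
    """Return (k, share): the smallest number of leading components whose
    cumulative share of total variance reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k, cumulative / total
    # Target not reached (e.g. rounding when target is 1.0): keep everything.
    return len(ordered), 1.0

# Hypothetical eigenvalues from a 10-factor covariance matrix.
eigenvalues = [41.0, 22.0, 14.0, 9.0, 7.0, 5.5, 0.6, 0.4, 0.3, 0.2]
k, share = components_needed(eigenvalues, target=0.98)
print(f"{k} components explain {share:.1%} of the variance")
```

With these made-up values, six components pass the 98% threshold and the remaining four together carry the residual error.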

In theory and practice.
How do you model historical data? Well the answer is multivariate means. Longitudinal data analysis.

The whole basis of what most people consider to be probability theory is starting from a flawed foundation and is being built without substance. The chance based theory being implemented is basing a determination on methods from the 17th century (literally). Although we still teach this in high school, it is not the basis of a modern curriculum in statistics.

The how is the same how used in heteroscedastic financial modelling, biostatistics and similar disciplines.

You are starting with a qualitative assumption. You have in your own mind decided on the risk factors. As I specified - you can not do this. This is another flaw in understanding quantitative methods. You need to use a dimensional reduction technique and allow the data itself to determine the correlative factors.

"For example: Historically, the chances of a Windows box on a secure network getting rooted were less than 1 in 100,000."

I am sorry, how do people make this type of figure up? I see no basis in reality. These are perceptions from analogy. The assumptions have also been incorrectly factored into a single dimension. Wrong. "Bad Zutt, Naughty Zutt".

You need all the data. For a start:
  • Type of industry
  • Location
  • Traffic volume and patterns
  • Router and firewall rulesets
  • (and it is easy to feed these into a correlation engine)
  • ... (no skimping)
As for factoring video card bios root kits, I have done this for many years.

I did a paper on the use of ARIMA (autoregressive integrated moving average) methods for the prediction of malware a couple years ago. Although people such as yourself scoff at this type of modelling, my predicted model is still accurate after 2 years (based on a 95% confidence).

"Except perhaps for risks associated with Mother Nature. And with climate change"

Please, are people kidding? IT risk is simple compared to weather modelling. The dimensionality in IT risk comes at most in the order of 60-100 factors. Weather modelling comes in the tens of thousands.

The problem with this type of attitude is that you see this as hocus pocus just because people do not understand it. Yet the maths is the same in many cases as that which allows a GPS to not drift the microseconds a day that relativity theory dictates it must due to velocity differentials to earth. It is the same that allows our phones to work.

You cannot reduce the dimensions down to "a Windows host has a 1 in x chance of being compromised".

You need to model EACH host:
  • Workstations in network A,
  • Servers on DMZ with config A,
  • Servers on DMZ with config A that are patched a week later,
  • Workstations on a hub
  • Workstations on a switch
  • Workstations on the same network as a win 95 box
  • ...
As I stated, this type of modelling is not cheap. Doing it is not hard, it just requires more maths than most have. In fact I have a problem getting staff for this reason. I had a grad a year ago. He left as one of the investment banks offered him 150% of my salary. Now he models hedge funds.

Most end up doing BI (Business Intelligence) modelling for banks and telcos to predict client churn. Same maths, but IT people with maths are rare. I am not talking B.Sc. I mean a good post grad research math degree.

In Australia, we produce fewer than 250 of these per year. Of those, about 5% are in any field of IT, and most of these go to bioinformatics.

So is there a great volume of quant snake oil? Answer yes and you are correct. The issue is that few can do the maths to see if it works.

How do you tell what is real? Well, look at track records. Those who are willing to publish their models, who have a track record over the years and who can be validated are more likely to keep doing this. Those who refuse to publish their models and algorithms as they are "proprietary" are basically snake oil sales organisations.

As for future aspects, my models take EVERYTHING into account and I let a dimensional reduction method choose which factors remain, namely those that have a statistically significant effect.

As an example, I am already factoring the impact of 3d printing technology on IP (intellectual property) protection.

"how do you base risk on historical data"
Again, people are thinking high school stats. I have pointed out a few methods. LDA and other methods are used for missing data projections. These have been around for 15 years or so now and have proven themselves. My data analytic team is learning these, most do not learn them. Just because most people do not know grad level statistics, does not make it magic.

Multivariate data analysis using Bayesian techniques accounts for the gaps in data. What you get is a range and confidence interval. As an example, a calculation would provide something of the type (based on real data):

System                              Expected Risk at 95% CI
Windows host A (patching daily)     $3,521 - $4,210
Windows host A (patching weekly)    $5,422 - $6,585
Windows host A (patching monthly)   $13,895 - $15,510

System                              Expected Risk at 99% CI
Windows host A (patching daily)     $3,219 - $4,512
Windows host A (patching weekly)    $5,002 - $6,905
Windows host A (patching monthly)   $10,275 - $22,130

The trade-off is that the higher the confidence level, the wider the range. What this then allows is a determination of the benefits.
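The widening of the range with the confidence level can be seen in a few lines. This is only a sketch using a normal approximation for the mean of some hypothetical loss estimates; it is not the Bayesian multivariate calculation that produced the figures above, and `normal_interval` is an illustrative helper.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_interval(samples, confidence):
    """Two-sided normal-approximation interval for the mean of `samples`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    centre = mean(samples)
    half_width = z * stdev(samples) / math.sqrt(len(samples))
    return centre - half_width, centre + half_width

# Hypothetical annual loss estimates (in dollars) for one host.
losses = [3600.0, 4100.0, 3900.0, 3750.0, 4050.0, 3800.0, 3950.0, 3700.0]

ci95 = normal_interval(losses, 0.95)
ci99 = normal_interval(losses, 0.99)
print(f"95% CI: ${ci95[0]:,.0f} - ${ci95[1]:,.0f}")
print(f"99% CI: ${ci99[0]:,.0f} - ${ci99[1]:,.0f}")
```

Both intervals share the same centre; asking for 99% instead of 95% only pushes the bounds further apart.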

For instance, if the patching cost for Windows host A at a 95% CI is estimated at $35 a day (+/- $2.50), we have a year's cost range for daily patching of ($11,862.50, $13,687.50). So we are 95% confident that patching the system on a daily basis will cost us between $11,862.50 and $13,687.50.

The calculated costs of patching weekly are ($4,225, $5,362.50)
The calculated costs of patching monthly are ($1,482.20, $1,596.21)

So looking at the expected benefits:
System                              Expected total cost, risk plus patching (CI = 95%)
Windows host A (patching daily)     $16,640.50 (+/- $1,257.00)
Windows host A (patching weekly)    $10,797.25 (+/- $1,150.25)
Windows host A (patching monthly)   $16,241.71 (+/- $864.50)

So we see that, for this organisation, the additional effort to patch the system daily is a net cost, and that patching as rarely as monthly is also a net cost. So the best (lowest cost) strategy is to patch weekly.
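The arithmetic behind this comparison can be checked with a short sketch. It takes the 95% risk ranges and patching cost ranges quoted in this post, adds the midpoints, and adds the half-widths to get the +/- figure. Treating the combined margin as a simple sum of half-widths is an assumption on my part, though it does reproduce the totals quoted above; the variable names are illustrative.

```python
def midpoint_and_half_width(low, high):
    """Convert a (low, high) range into (centre, +/- margin)."""
    return (low + high) / 2.0, (high - low) / 2.0

# Expected risk at 95% CI, as quoted in the post.
risk = {
    "daily": (3521.0, 4210.0),
    "weekly": (5422.0, 6585.0),
    "monthly": (13895.0, 15510.0),
}

# Annual patching cost ranges: daily is $35 +/- $2.50 over 365 days,
# weekly and monthly are the ranges quoted in the post.
patching = {
    "daily": ((35.0 - 2.50) * 365, (35.0 + 2.50) * 365),
    "weekly": (4225.0, 5362.50),
    "monthly": (1482.20, 1596.21),
}

totals = {}
for schedule in risk:
    risk_mid, risk_half = midpoint_and_half_width(*risk[schedule])
    cost_mid, cost_half = midpoint_and_half_width(*patching[schedule])
    totals[schedule] = (risk_mid + cost_mid, risk_half + cost_half)

best = min(totals, key=lambda s: totals[s][0])
for schedule, (total, margin) in totals.items():
    print(f"{schedule:8s} ${total:,.2f} (+/- ${margin:,.2f})")
print("lowest expected total cost:", best)
```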

The results were statistically significant at the alpha=5% level for a determination that the effort to patch daily would cost more than it saved. Equally, the cost "savings" of patching the system on a monthly basis added additional risk.

If the client had wanted to pay more we could have modelled this to the inflection point and determined the exact benefits, but the model was not significantly better than the simple model in any event and did not justify the cost addition.

(So Matt and others at iDefense, McAfee, the CERTs etc, this is what I do with that zero day data.)
