Wednesday, 17 February 2010

A response to modeling risk

This post follows from a prior post.

“P(compromise) already has the possibility of a vulnerability and the possibility of it being attacked by someone with sufficient motivation and resources built in, doesn't it?”
Correct.

A normal or Gaussian curve is a good fit for white noise and random error. But the number of software bugs in each iteration is fixed, so using a normal distribution here is an error. The number may be unknown, but it exists; an unknown but fixed number is not random. The rate at which vulnerabilities are discovered is random, but it is not normally distributed.

“There are two problems with your approach that I think I would encounter while trying to use it. First, my company does a lot of in-house development. When a new web application is deployed, assuming we've tested thoroughly and remediated the vulnerabilities we've found, it would appear that the number of vulnerabilities is zero. This is false, of course.”

This is the wrong approach. In developing software, you already have a priori information. You have statistics on the native coding-error rate from prior exercises, and this rate will be greater than zero. SLOC (source lines of code) data is also available, and you should have some idea of the number of users.

This means that you should be using a Poisson decay model for bugs and vulnerabilities. The model can be made more accurate as a Weibull function that incorporates users, SLOC, and data from past coding assignments (based on lines of code and, where available, error rates correlated by programmer).
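As a rough illustration of the idea, the sketch below seeds an estimate of latent defects from a historical error rate per KSLOC, then applies a Weibull-shaped decay to the count of still-undiscovered defects over time. All the figures (50,000 SLOC, 6 defects per KSLOC, the scale and shape parameters) are hypothetical placeholders; you would fit them from your own prior projects.

```python
import math

def expected_defects(sloc, defects_per_ksloc):
    """Prior estimate of latent defects from a historical coding-error rate."""
    return sloc / 1000.0 * defects_per_ksloc

def remaining_after(t, n0, scale, shape):
    """Weibull decay: expected defects still undiscovered at time t.

    shape=1 reduces to a plain exponential (Poisson-process) decay;
    shape>1 models a discovery rate that falls off more slowly at first.
    """
    return n0 * math.exp(-((t / scale) ** shape))

# Hypothetical figures: 50,000 SLOC at 6 defects per KSLOC from past projects.
n0 = expected_defects(50_000, 6.0)
print(n0)                                        # 300 latent defects expected
print(remaining_after(12, n0, scale=18, shape=1.5))
```

The point is only that the starting estimate is non-zero and derived from data you already hold, not from the (false) assumption that testing removed everything.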

The assumption that all bugs are remediated is flawed. It would require that no software bug had ever been discovered post-remediation. Unlikely at best.

So, the very beginning of the modeling exercise must start from an unknown number of possibly theoretical vulnerabilities.

Unless you are starting with formally verified software, the start is an unknown but estimable number of flaws. For remote compromise, you need to include all paths. This is a network analysis (network in the mathematical sense, not the hardware sense)[1]. For a web application, assuming a non-local attacker, this needs to incorporate the OS, your app, the services used, and any applications that an attacker can access remotely.
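The path analysis can be sketched with a plain depth-first enumeration of simple paths through an exposure graph. The graph below (internet-facing web app and SSH service, both reaching an app server that fronts a database) is a hypothetical example, not a prescribed architecture.

```python
def attack_paths(graph, src, dst, path=None):
    """Enumerate all simple paths from an attacker's entry point to a target."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    paths = []
    for nxt in graph.get(src, []):
        if nxt not in path:                 # skip cycles
            paths.extend(attack_paths(graph, nxt, dst, path))
    return paths

# Hypothetical exposure graph: remotely reachable services and what they touch.
exposure = {
    "internet": ["web_app", "ssh"],
    "web_app": ["app_server"],
    "ssh": ["app_server"],
    "app_server": ["database"],
}

for p in attack_paths(exposure, "internet", "database"):
    print(" -> ".join(p))
```

Each enumerated path is a chain whose individual components (OS, service, application) carry their own flaw estimates, which is why all of them belong in the model.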

“What I don't know is the density function to apply when estimating (since I have no historical data at time zero).”
Actually, you do. There is data from other products, and for your own: unless this is the first exercise the company has ever done, data will exist.

The simplest method is to create a Poisson decay model, based on data from prior projects.
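One minimal way to extract the decay rate from prior-project data: for an exponential decay n(t) = n0·exp(−rt), the log of the monthly discovery counts is linear in time, so ordinary least squares on the log-counts recovers r. The monthly counts below are hypothetical.

```python
import math

# Hypothetical monthly defect-discovery counts from a prior project.
counts = [40, 26, 17, 11, 7]

# log-counts are linear in t under exponential decay; fit the slope by OLS.
t = list(range(len(counts)))
y = [math.log(c) for c in counts]
tbar = sum(t) / len(t)
ybar = sum(y) / len(y)
slope = sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y)) \
        / sum((ti - tbar) ** 2 for ti in t)
r = -slope
print(round(r, 3))   # estimated decay rate per month
```

With a rate estimated from history, the new project starts from a defensible non-zero curve on day one instead of a claimed zero.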

“In addition to this, I may not have a detailed understanding of my user base to factor into the chance of vulnerability.”

What matters is the number of users. Since an Internet application is open, this is difficult to model, but it can be estimated from traffic. You can also model the risk based on how widely the site is known, using the number of people who come to it.
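The reason the user count matters can be made concrete: if each visitor independently has some small chance of being a motivated attacker, the chance that the population contains at least one attacker is 1 − (1 − p)^N, which grows quickly with N. Both numbers below are hypothetical.

```python
def p_at_least_one_attacker(n_users, p_attacker):
    """Chance a population of n_users contains at least one attacker,
    assuming each user is independently an attacker with prob. p_attacker."""
    return 1 - (1 - p_attacker) ** n_users

# Hypothetical: 10,000 visitors, a 1-in-100,000 chance any one of them attacks.
print(round(p_at_least_one_attacker(10_000, 1e-5), 3))
```

Even with a restricted population, a rough traffic estimate bounds N and so bounds this probability; the smaller and better-known the population, the tighter the bound.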

“I can choose a standard density function, like the Gaussian for example, but I believe there will be cases where it's difficult to predict with certainty what the risk will be, due to not knowing what key factors will push a particular population of users to produce even one attacker, especially when the population is smaller and more restricted than the Internet at large.”

You have a fixed but unknown distribution of bugs. This is not a Gaussian (normal) distribution.

“I believe that better models might be produced with more data, but I also believe those models will be influenced by observation. Risk modeling in the financial sector has to be a sure sign of this. Taleb predicted in 1997 that the model being used wouldn't be accurate and he had analysis to back that position up.”

There are a few issues here. Models need to be tested against real-world conditions. At the least, hold back some of the data used to create the model and test the completed model against it.

Black Swans are not the issue in financial models. Freddie and Fannie have been basket cases for years, and models have demonstrated the problems for years. But bailouts and subsidies have hidden these failings from many people.

As with the financial crisis, the data exists in the case you have noted. The issue is whether people are honest enough to use it.

[1] See also Graph Theory.
