Thursday, 18 February 2010

Vulnerability Modelling

Vulnerability rates can be modelled extremely accurately for major products. Those with an extremely small user base can also be modelled, but the results will fluctuate.

What most people miss is that the number of vulnerabilities or bugs in software is fixed at release. Once the software has been created, the number of bugs is a set value. What varies stochastically is the number of bugs discovered at any time.

This is also simple to model, with the variance depending on the number of users (both benign and malicious) of the software. As this value tends to infinity (a large user-base), the addition of further users makes only a marginal difference to the function. Small user-bases, of course, show large variations as more people pay attention (such as after the release of a vulnerability).

As I have noted in prior posts, this is a Cobb-Douglas function with the number of users and the rate of decay as variables. For widely deployed software (such as Microsoft’s Office suite or the Mozilla browser), the function can be approximated as an exponential decay function.

That is, for a static software system under uniform usage, the rate of change in N, the number of defects, is directly proportional to the number of defects in the system.


Here, a static system is defined as one that experiences no new development, only defect repair. Likewise, uniform usage means the same number of runs per unit time. As the user-base of the product tends to infinity, this becomes a better assumption.

If we set time T to be any reference epoch, then N decays exponentially from its value at T.


This means we can observe A(t), the accumulated number of defects at time t.
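Written out under the proportionality assumption above, the model takes the familiar exponential decay form. The symbol \lambda for the unknown decay rate is notation assumed here, and this is a sketch of the standard result rather than anything specific to one product:

```latex
\begin{align}
\frac{dN}{dt} &= -\lambda N, \qquad \lambda > 0 \\
N(t) &= N(T)\, e^{-\lambda (t - T)} \\
A(t) &= N(T) - N(t) = N(T)\left(1 - e^{-\lambda (t - T)}\right)
\end{align}
```

Here N(T) is the number of undiscovered defects at the reference epoch, so A(t) counts the defects found between T and t.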



With continuous development, an added function to model the ongoing addition of code is also required. Each instantaneous additional code segment (patch fix or feature) can be modelled in a similar manner.
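As a rough illustration of this, each code drop can be given its own decay curve and the curves summed. The function name, the input format, and the use of a single shared decay rate for every drop are assumptions made for this sketch, not part of the model as stated:

```python
import math


def remaining_defects(t, releases, decay_rate):
    """Undiscovered defects at time t, summing one decay curve per code drop.

    releases is a list of (release_time, defects_introduced) pairs.
    Assumption: every code drop decays at the same rate.
    """
    total = 0.0
    for release_time, defects in releases:
        # A code drop contributes nothing before it has been released.
        if t >= release_time:
            total += defects * math.exp(-decay_rate * (t - release_time))
    return total


# Illustrative numbers only: an initial release plus two later patches.
example_releases = [(0.0, 200.0), (6.0, 30.0), (12.0, 20.0)]
print(remaining_defects(12.0, example_releases, decay_rate=0.1))
```

A fuller treatment would let the decay rate differ between the original code and later patches.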

What we do not have is the decay rate and we need to be able to calculate this.

For software with a large user-base that has been running for a sufficient period of time, this is simple.
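One way to do this, assuming the exponential decay approximation holds and that discoveries are counted over equal-length intervals, is a straight-line fit to the logarithm of the counts. The function name and the sample counts below are hypothetical, and this is a minimal sketch rather than a full estimator:

```python
import math


def estimate_decay_rate(counts, interval=1.0):
    """Estimate the decay rate from discoveries per equal-length interval.

    Under exponential decay, log(count) falls linearly with the interval
    index with slope -decay_rate * interval, so a least-squares line
    through the logged counts recovers the rate.
    """
    # Zero counts are skipped because their logarithm is undefined.
    points = [(i, math.log(c)) for i, c in enumerate(counts) if c > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-zero counts")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope / interval


# Hypothetical monthly discovery counts for a widely deployed product.
monthly_counts = [40, 33, 27, 22, 18, 15]
print(estimate_decay_rate(monthly_counts, interval=1.0))
```

With a small user-base the counts are noisy and often zero, which is where this simple fit stops being reliable.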

This problem is the same as having a jar with an unknown but set number of red and white balls. If we have a selection of balls that have been drawn, we can estimate the ratio of red and white balls in the jar.
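The jar estimate can be sketched in a few lines. This assumes the draws are independent (a jar that is large relative to the sample) and uses a normal approximation for the interval; the function name is made up for the example:

```python
import math


def estimate_red_fraction(red_drawn, white_drawn):
    """Point estimate and rough 95% interval for the fraction of red balls."""
    n = red_drawn + white_drawn
    if n == 0:
        raise ValueError("no balls drawn")
    p = red_drawn / n
    # Normal approximation to the binomial; crude for very small samples.
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Illustrative draw: 30 red and 70 white balls.
print(estimate_red_fraction(30, 70))
```

The more balls that have been drawn, the narrower the interval, which is the same reason a large user-base gives a stable estimate.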

Likewise, if we have two jars with approximately the same number of balls in approximately the same ratio, and we add balls from the second jar to the first periodically, we have a most mathematically complex and difficult problem, but one that has a solution.

This reflects the updating of existing software.

In addition, where we have a new software product, we have prior information. We can calculate the defect rate per SLOC, the rate for other products from the same team, the size of the software (in SLOC), and so on.

This information becomes the prior distribution, which is updated into a posterior distribution as defects are discovered. This is where Bayesian calculations are used.
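As one possible illustration of such a calculation, historical defect rates can be encoded as a Gamma prior and updated with observed defect counts treated as Poisson. The choice of a Gamma-Poisson model, the function names, and all the numbers are assumptions made for this sketch:

```python
def update_defect_rate(prior_shape, prior_rate, defects_found, ksloc):
    """Gamma-Poisson conjugate update for a defect rate per KSLOC.

    Prior: rate ~ Gamma(prior_shape, prior_rate), mean prior_shape / prior_rate.
    Data: defects_found observed in ksloc thousand lines, assumed Poisson.
    """
    post_shape = prior_shape + defects_found
    post_rate = prior_rate + ksloc
    return post_shape, post_rate


def gamma_mean(shape, rate):
    """Mean of a Gamma distribution in the shape/rate parameterisation."""
    return shape / rate


# Hypothetical prior from the team's earlier products: mean 5 defects per KSLOC.
shape, rate = update_defect_rate(10.0, 2.0, defects_found=180, ksloc=50.0)
print(shape, rate, gamma_mean(shape, rate))
```

The weaker the prior (a small prior_rate), the faster the estimate moves towards what is actually observed in the new product.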
