## Saturday, 1 November 2008

### Entropy

By testing the distribution of characters on a Hard Disk Drive (HDD), it is possible to determine its entropy score. Entropy is the relative randomness of a given data unit. As data stored on a computer HDD is anything but random (i.e. patterns in data occur even in data of apparent randomness, thereby providing a less than perfect entropy score), the occurrence of relatively higher randomness, or higher entropy, is indicative of intervention, namely encryption.

Entropy as applied to computer forensics is a measure of the randomness on the HDD. Normal data has an entropy value between 1.0 and 7.85. For the most part, data will tend towards the lower side the distribution. The more random the data distribution and hence the higher the entropy value, the lower the compression rate that can be achieved by compression algorithms (such as Zip and RAR). Any value greater than a 7.85 is related to an encryption process (including Pseudo-Random Number Generators, or PRNG’s).

Entropy is the information density of the contents of the file, expressed as a number of bits per character. When the entropy of a file is high, compression of the file is unlikely to reduce its size. Hamming (pp. 104–108) offer a sample of C source code with entropy of about 4.9 bits per character as an example. This level of entropy would indicate an optimal compression of the file could reduce its size by 38%.

Perfect entropy demonstrates perfectly random data. Most good encryption programs achieve a level of near perfect entropy, this is around 8.0 bits of randomness for each 8 bit word. With programs such as PGP Disk and TrueCrypt, the entropy of the drive section can be calculated to equal an entropy of 8.000000. The likelihood of this level of entropy occurring naturally is less than one chance in 100 billion[1]. This may be calculated using a multinomial distribution with eight (8) degrees of freedom. Entropy calculations should be conducted using both a bit stream (an analysis of the 0 and 1 values) and a byte wise analysis (this is a character analysis). For details of how entropy is derived, see: Shannon, C. E. (1948) “A Mathematical theory of communication”, Bell Systems Tech. J. 27, 379-423 and 623-656 ].

The “Source Coding Theorem” is used to calculate the expected code-length given a prefix-free source code. Having calculated the Kolmogoriv complexity of a random possible value, the lower bound of an expected value of complexity can be calculated [see also the works of Vitanyi and Li (2000)[2]].

The reasons that both stream (bitwise) and bytewise entropy calculations (and also slice variations) are calculated is that the standard entropy calculations for an encrypted file or HDD partition have a narrow range with a low standard deviation. This is the entropy of the tested slice remains similar for small and large slices, bitwise and bytewise streams. Tests for variation is the standard deviation may thus be used to determine the likelihood of a file or partition being encrypted. In the case of a product such as PGPDisk, this is not important as the existence of the file is not generally the issue. TrueCrypt and similar products attempt to hide partitions. This method (although processing intensive) may be used to determine the existence of hidden partitions.

The selected entropy scores can be determined by utilising the extensions to Kolmogorov (1956)[3] proposed by Cover and Thomas (1991)[4]. The chai squared calculation can help in validating the results[5].

Testing Randomness

The following process involves a test of the entropy obtained from an encrypted partition (bytewise entropy = 8.000) against the standard UNIX random generators (/dev/random and /dev/urandom).

A test of the general level of randomness that can be obtained from a computer system was conducted to create a test baseline. To conduct this test, 5,000 files were created with random content. These were created as follows:

dd if=/dev/urandom of=/mount/data/urandom.\$1 count=2000

The variable \$1 was used in a for-next loop to create 5,000 files (urandom.1, …, urandom.5000) that contain random data. This process was repeated using the Linux /dev/random device through which the remaining 5,000 files where created (random.1, …, random.5000).

Both /dev/random and /dev/urandom have been included in the sample to ensure that all possible distributions of random data have been included in the sample. I will follow this post with an analysys of /dev/random against /dev/urandom another time.

A test of the entropy was conducted for each of the 10,000 files and a distribution of the results was entered into “R”[6].

> summary(entropy)

Min. 1st Qu. Median Mean 3rd Qu. Max.

7.992 7.995 7.996 7.996 7.996 7.999

> max(entropy) # The maximum value recorded in 10,000 samples

[1] 7.999082

> mean(entropy) # The mean value of Entropy

[1] 7.995761

> sd(entropy) # The standard Deviation

[1] 0.0009401717

>

At this level (which is over 4.509 standard deviations away), we find a p-value[7] = (2.2e-16) that any 1Mb random section of data will have en entropy calculation of 8.000000. A test of the hypothesis that the entropy of the hidden partition may be conducted by differencing the values from the encrypted drive partition and that of the test sample[8]. The hypothesis test is conducted using the t-test statistical function in R. An array (diff) is created by subtracting the value to be tested (the array called entropy) from the hypothesised value (8.00).

Using Student’s T-Test (below) we see that we have to reject the hypothesis that the means could be equal (that is that the entropy of the encrypted partition could occur naturally). The difference in entropy calculated is small, but it is statistically significant at a level of alpha =0.0001. This is a confidence level of 99.99%.

These results are a clear indication that a file or HDD is encrypted and not just randomly occurring information. It should also be noted that data created using a random number generator is significantly more random than that created through normal use.

> t.test(diffs, conf.level =0.95)

One Sample t-test

data: diffs

t = 450.8372, df = 9999, p-value <>

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

0.004220214 0.004257073

sample estimates:

mean of x

0.004238643

>

> t.test(diffs, conf.level =0.9999)

One Sample t-test

data: diffs

t = 450.8372, df = 9999, p-value <>

alternative hypothesis: true mean is not equal to 0

99.99 percent confidence interval:

0.004202050 0.004275236

sample estimates:

mean of x

0.004238643

>

These tests can be used to distinquish random files from encrypted data. Further, an analysis of the distributions and standard deviations of each of these can be extended into determining the type of encryption used in some cases. Work is continuing on this.

References / Footnotes
[1] See Gallager’s Theorem [Gallager, R. G. (1978) “Variations on a theme by Huffman”, IEEE Trans. on Information Theory, 24, 668-674] which details the process need to calculate the Huffman code for a source. The Kullback-Lieber or Mahalanobis distance “d” may also be used to calculate the expected value ”Ep” and hence the likelihood of the entropy occurring naturally. The expected probability using this method is incalculably small.

[2] Vitanyi, P. M. B. and Li, M. (2000), “Minimum description length, induction, Bayesianism, and Kolmogorov complexity”, IEEE Trans. Information Theory, 46, 446-464.

[3] Kolmogorov, A. (1956) “On the Shannon theory of information transmission in the case of continuous signals” Information Theory, IRE Transactions on Volume 2, Issue 4, December 1956 Page(s):102 - 108

[4] Cover T. M. and Thomas, J. S. (1991) “Elements of Information Theory” Wiley Interscience

[5] This method is based on a variation of the ENT process derived by John Walker (2008) which are based on the work of Hamming, Richard W. (1980) “Coding and Information Theory”. Englewood Cliffs NJ: Prentice-Hall. See also: Park, Stephen K. and Keith W. Miller. (1988) “Random Number Generators: Good Ones Are Hard to Find”. Communications of the ACM, October 1988, p. 1192.

[6] R is a statistical analysis language and program. http://cran.r-project.org/

[7] This is an indication of the probability associated with finding the occurrence of this level of entropy naturally.

[8] Which has been designed to simulate the natural distribution of random data on a computer system through repeating the creation of random files 10,000 times.

Please use the public key below to discuss this and other crypto matters.

```-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: SKS 1.1.4
Comment: Hostname: pgp.mit.edu

OxsSHohRgZpG146slEpASDdD//TJCH72yTtModYKWI7z6WWgXSNQvwKb+q2G9SeJ2N+2t1Bk
FuM+WOGstGTQiEB+Oj6OlJPI9I3YoE8T2VC8m5BrHdIi2W4R1vbuCGry0To7L9MygtuxmGfl
qrUiG0teiKNy0mpZaMDXJHAyLLaamHj25HXcHhq8LyyQHCE6iCcXY8iD/Dma98+ZcEEEQDfO
rmK7HVSU/Rh29VNJ2fgnJM+hhsmJnIPpxt6NIwhtY66U0lKWozOHnJSc1xIMv562NMxQUs3P
Vzrqyd5I/3gSnU+dhoHTSkbjNKWvAhIpEHNNYQ/4lub5bEblGhtvVYp67DUgjYrQmxscK3da
dG9tYWlsLmNvbT6JAb8EEwECACkFAkkJIQMCGwMFCRSvv40HCwkIBwMCAQYVCAIJCgsEFgID
AQIeAQIXgAAKCRBJH5vfD3vUrfWXC/9arLWyt3zRKU7RMr8sGtD2Uh2gBsk2okqTgdWF+wn3
z8IPLER7zQ/sLPklTHtwi0lNzY7DS+w5NJTPSF4NbqcM8UOXOrQvqCatlLNHftbOuPCNoJpI
mIdJfsvW5HhKEUydvJCQn8Pm1i69MHB1Pv4ZLzbf/3iiH+/2A9Ug5upwB4+QJPBE7mC+88xn
YPWRNVCuGF6Bny0Q+b+MPvnD9rFxCzyQUlPhM229cDnVjwDRWSapEVvC2VDAki93x9fzOOiE
96sINhal0atie/9jwkMqAMIlgWCoVBX+4IrION2k0N562JSzwv1+TfpIURLu1dNuxK20uVWT
E1ltVJqkGVyK4JUKluXeDJORr9pDvkr2GLgLDCx/9ynZK/cR54Tt78d0RfYKzwPnTzTU0V5f
5lfHfHwhA4kuoGMdp9a/dmomq71RZ1SfQgeiPae5AY0ESQkhAwEMAMIve+FvEYR21WfWPLMR
qNibYK2jUkoaoZUIFTS/eQ9gN+F2UcaN4ncdh9ZkAFSB5hLNQ9UatLf76PizHFKbel42DyfJ
uajxozCsYcmZ4/7F6s3b0/GtrVM3HP+ed0zXBlg36EcD7OshCLJhD0iJRE69FK7PkJJNFiHk
/Ql197O5pMWS1Ewwg2LMGRI4o0zYrP87h7QnsUggLmgz+4i2EAy7Bi6K8NNHrXycKjQo5u7J
3gRqTtTG9iUDwvcNr9W3ZDbOC2r2SZFk2gKF/OWMFFBv7PjEj/Xc5XtGYgeNp0acL9kl3w5f
KWf5JZsE+mImZDf29k6bEBjjPwzbCS2jsgu2VTIv7cNJSNdJxgUlWrxUNbnmTla8oh0GOKlc
42L3/nktmck8qGW3kin9rMVCdnllxUMUrp1vbnvC8uiMuoEQc0WG5W5jPcA2Y96OWyNTjAmL