Monday, 28 November 2011

What is a hash

With conferences and the like I have been behind in writing up a load of material on topics such as NAP etc. I have not forgotten these. For now, I will start by detailing some terms I have seen used poorly.

To start, I will look at what a hash function is.

Formally, a hash function H is defined as a transformation that takes a variable-size input m and returns a fixed-size string. This fixed string is what we term the hash value h.

We can express this as:    h = H(m)

There are some things we need to know in developing our function h:

  • the input m can be or any length (this includes being smaller than the resulting hash output h, larger than the output or even of the same size).
  • The size of the output h remains the same no matter what the input is. That is, if the output of the hash function returns a length of L bits for a given input m, it must return an output of length L bits for ANY input m.
  • The hash function H(m) is one way. If we have a value h we cannot use this to determine the initial input m.
  • The function, H(x) must be simple and computationally inexpensive to compute. That is, making a table of mapped values and indexing calculated hashes against input must be expensive when compared to making the hash.

Hashing is used primarily in digital signatures and for integrity checks. They also aid in time stamping.


It is stated that a hash needs to be “collision free”. This is wrong. If we think that an 8-bit hash has only 256 possible values, we see that there is an infinite number of collisions as we can have an infinite variability in input. Collisions abound!

Collisions always EXIST IN A HASH FUNCTION. This is NOT the issue. The issue is not the existence of a collision, but the fact that we may be able to calculate the collision and predict it.

A hash function H(x) will have collisions, but the distribution of these collisions should be unpredictable.

If we constrain the length of the input m to a certain value, the number of collisions can be stated to be in the order of:

No. Input messages possible


No. Hashes by Hash length


  • No. Input messages possible = 2(length m)
  • No. Hashes by Hash length = 2^(hash length)
  • Collisions exist.

No comments: