Wednesday, August 11, 2021

Security Concepts: Hash Functions

A note on timing: I know that it's been a little while since I posted last, and I haven't given up on blogging, I'm just balancing life. My last post on MTE was a big one that I worked on pretty constantly for a while and I got a little burned out, to be honest. I've been focusing on work and personal projects that I may write up at some point. Just saying, I'm here, talking into the aether.

To the post! 

I'm starting with hash functions because they have the simplest explanation in theoretical terms, even if their implementations are among the most complex algorithms. I'm not going to go into how specific hash functions are implemented here.

If you've taken one of the most basic computer science courses, usually called something like "Data Structures", you are likely familiar with a structure called a "hash table". The key piece of a good hash table is the hash function. If you remember that class really well (or you're currently in that class) you'll remember that the general requirements are that the function must be fast and is usually desirable for it to have a flat distribution of outputs based on inputs. (Spread the keys across as many buckets as possible.) With these you're on the right course, but Cryptographic Hash Functions (the only ones we care about in security) have more complete requirements.

Pause for nomenclature:

hash - /haSH/

  1. (hash function) (N.) Function that performs a hash operation
  2. (V) to perform a hash operation via a hash function
  3. (N.) the output of a hash function
  4. (N., illicit) ask your stoner friends
I bring this up to clarify that the use of the word "hash" can be confusing because it is the actual function, the action taken by the function, and the output of the function, depending on the context within which it is used.

Now, a little background on why we want hash functions in the first place.


The purpose of a hash function is to take an arbitrary data input, having an arbitrary size, and process it and output data of a predetermined size. 

I know this sounds like saying "The purpose of a hammer is to hit things," which is technically true, but it obviously misses some important context. So...

Where would a hash function be used?

Generating Random Data

Generating random data from a hash function is the less common use case, but it's worth mentioning because it does happen, and it's important to know that this is in your toolbox.

There are plenty of reasons that you may need random data, but since hash functions guarantee that the output will be the same for any consistent input, they're useful for key perturbation (e.g. mixing stream specific attributes with a master key to derive stream key), generating seed data for RNG things (seeding rand with current time is great, but what if you seed it with super random things?), or any other case where you need a pseudo-random source of data that is either shared or re-calculable.

Obviously the first example is more clear that you would want a cryptographic hash function since it invokes a cryptographic use case, but the second may be just as important, what if the RNG is driving an online gambling platform? As close to pure random as possible can be extremely important sometimes...

Digesting Data for later Validation

Just the wording of this use implies something like a checksum, and this is true sometimes, hash functions can be used, and often are used, as checksums. The important thing here is that, while checksums traditionally protect the recipient from transmission errors, using a cryptographic hash will also protect the channel integrity from attackers who wish to modify the message. If an attacker was to modify the message but leave the hash alone, the recipient would know the message was invalid. Further, by using a cryptographic hash function (as opposed to a simpler checksum technique), the attacker has a harder (potentially impossible) time making the new data fit the original checksum.

Proof/Evidence of Existence/Knowledge at a Certain Point in Time

Imagine you could type some message, the tweet the digest and some metadata about your message (but not the message itself). Then later, you may choose to publish that message along with a reference to the tweet. As long as you haven't modified the message, it can be said that it is exceedingly likely that you had already typed that message out before you published the tweet.

Passwords work in a similar way, except the whole point is to obscure the "message" (password). Properly implemented password systems will use cryptographic hash functions in order to derive some data, then that data is stored, instead of storing the password. Then, when your identity needs to be verified, you supply your password, the data is derived again and compared to what is stored in the server. In fact, this can be implemented in such a way that the website never actually sees your password. This is a form of zero-knowledge proof. Basically, you're proving to the website that you have some data that can be used to derive some number, you know the solution to a complicated problem that, at the time, the server does not posses.

I can foresee myself writing a post at some point going into all the different ways, most of them incorrect and dangerous, that password management and validation can be handled, but this is not the place.

*Sigh*... I guess I should mention "blockchains"...

I may... MAY create a post about blockchains later, they're kind of played out on the explanation side, and it's become such a buzzword... At one point, I literally sat through a meeting with some business people and some engineers brainstorming how a company could market a "blockchain product"...

Anyway. Hash functions are at the core of blockchains. Each consecutive block in a blockchain has the hash of the previous block, this field must be included when calculating the hash for use in the next block to include. This is where the term "blockchain" comes from. The "block" is the data to be committed to the dataset, and the "chaining" happens when you include a hash of a previous block in a new block. The blocks are then all stored in a giant hash table by their cryptographic hashes. This way, if you would like to trace the chain from the head to the original block, you can just fetch by hash.

How would one use a "hash function"?

Hash functions are typically pretty easy to use, actually. Here's some examples:

echo "asdf" | md5um


import Crypto.Hash.MD5 as MD5
md5 = MD5.hashFactory() # Get an MD5 context
md5.update(b"asdf\n") # add data to context
from binascii import hexlify # Get a way to print pretty
print(hexlify(md5.digest())) # return the hash and print it as hex

I would give a C example, but C can be dependent on which implementation you choose. Sometimes it uses uint8_t arrays for everything, sometimes there's typedef's for everything, sometimes you get a magic function that does it all, sometimes you have to construct and maintain and destroy a context, etc. It will usually look similar to the python above, construct a context, feed data through the hash function to update the context, get the output.

If you really want to play with it in C, you're better off finding the crypto library you wish to use and looking up an example of that specific library.

What makes a "Cryptographic Hash Function" a Cryptographic Hash Function?

I've talked a bit about use cases, how it's used, why it's used, why you should choose a cryptographic hash function over a non-cryptographic hash function, but what makes a "cryptographic hash function"?

Here, we'll have to go to the theoretical side and I'll describe the "theoretically perfect cryptographic hash function". We'll call this function: H. You know, for... Hash.

What does H look like? What is its signature?

H takes in some set of data and outputs a hash of that data that is always the same size, in pseudo-C, it would look like:

uint8_t[HASH_OUTPUT_SIZE] H(uint8_t* data, size_t data_size);

 How does it work?

This is actually not an important question. There are different approaches to hash functions that are used, but "how" is not actually important as long as it meets certain metrics.


All of these requirements are met in a theoretically perfect has function.

  1. Given the output, the input cannot be derived
  2. Output is only dependent on the input
  3. Outputs given two different but similar inputs are unpredictably different
  4. Output is collision resistant

Input Cannot be Derived

Hash functions are meant to be one-way. This should be fairly intuitive. H can take an infinite number of inputs, length ranging from 0 to infinity, on length alone, the number of possible inputs outnumbers the number of possible outputs by a factor of... well.. infinity. This means that, if you had a perfect mapping of inputs to outputs, it would be impossible to derive the input that caused a given output with 100% certainty because there would be an infinite number of inputs that could have caused that output.

So, by necessity, hash functions must be at least in part 1-way, but, the idea here is that for ANY output it should be exceedingly difficult to derive the possible inputs and choose the correct one.

Consider H is used to perturb the value of a password, you would hope that the original password could not be derived from H(password).

Output is Only Dependent on Input

This is exactly as it reads. Essentially, if you take the same input data and feed it through H, regardless of what system it runs on, what time of day, date, timestamp, astrological sign, color of sky, anything else, it should ALWAYS return the EXACT SAME THING.

Outputs Given Two Different Inputs are Unpredictably Different

This requirement becomes more important in cryptographic systems. If some attacker has some possible set of message formats, and they may observe the hash of some message, you don't want them to be able to derive which format is in use based on the hash (so long as there is some unknown portion of the message). The real reasoning behind why this is really important is hard to explain without understanding other topics that I have yet to cover, so I'll exclude those reasons for now, but just know that this is an important property.

Output is Collision Resistant

Collision Resistant sounds more like a property that I wish my last car had than a property of cryptographic functions, but it's a pretty foundational concept in hash functions. The idea is that for completely random inputs, there should be a flat distribution of outputs. For two randomly selected inputs, there should be two random outputs to pair with them. This really comes down to an efficiency issue, if the output of your hash function is 32 bits wide, why actually output less than 32 bits of entropy?

Common Misconceptions

Hash Functions are Encryption Functions

This is just not true. It may seem like semantics, but there is a really important distinction. Encryption functions are reversible; they are two-way functions. Remember earlier when I said that hash functions map an arbitrary size (potentially very large) of input to a constant size output? And remember how I said that this contributed to hash functions being irreversible? Hashing is not the same thing as encrypting. Every time I hear some cyber news source talk about how some company leaked a database of passwords but not to worry because "the passwords were encrypted" it makes me cringe. If they are wrong, and the password hashes were leaked, should this person be reporting the news? If they're correct and the passwords were protected by a reversible algorithm... Why shouldn't I worry?...
Again, it may seem like I'm nit-picking here, but, especially when you start talking about different security algorithms, hash functions are absolutely not encryption.

Hash Functions Add Entropy

Reference above where I said that hash functions always output the same data for a given input. Imagine if you have 2 bits of entropy. So, you feed a a hash function 4 different inputs: 0x00, 0x01, 0x01, 0x03. Every time you randomly select one of those inputs and feed it into the hash function, you will get one of four outputs. If you were publishing those hash results, while an observer may not be able to derive the mapping of output to input, they would be able to discern that there were four possible values. This is due to the amount of entropy in the input and the fact that the hash function doesn't change it. In fact, in actuality, hash functions may dump entropy. Sounds weird, but if you think about it, it makes sense. A data set cannot have more entropy than its length. So if you got data from a theoretically perfect random number generator that gave you 4KB of entropy in a 4KB message, then you calculated the hash of that message, you would loose almost all of that entropy. The hash would only contain at most as much entropy as its size allows.

That said, hash functions can be used to expand, disguise, or condense entropy. If you can generate a pure 64-bit random number immediately but must wait a long time to generate the next, you may feed that number into a hash function, take the lower 64 bits of the result and release it as a next number, re-hash the output to get a new output, release the lowest 64 bits, repeat. This does not add entropy, knowing what's happening, there's still a maximum search space of 64-bits, but if a pseudo-random stream that only has 64-bits of entropy is ok with you, then...
Disguising entropy goes along the same lines as expanding or stretching in the last example, basically, you may have some relatively small entropy pool and for functional reasons that is ok, but you want it to appear more random. To a human. That is NOT doing any numerical or statistical analysis.

Condensing entropy is actually an interesting concept and my favorite use of hash functions because it really is neat. Imagine you have some data source that has lower entropy than its length. imagine a book. Human text is really low entropy, especially when considering that even if you had a random stream of english, ASCII characters, you're using at most 7 bits out of every byte and not even the full range of those 7 bits. But, if the text to be hashed is longer than the output of the hash, you can start to get closer to the maximum entropy of the hash function. So, if the input text and output size were matched, the output would have less entropy than its length, but if the input were 500 times the length of the output, it may have the maximum entropy allowable in the hash.

HMAC is the same thing as Hash Functions

This is like saying that the ground beef is a burger. While this is technically... sort of correct... There's a lot more to it and the beef is only part of the burger.
First, HMAC is something called a Keyed Hash Function. Essentially, it's a hash function that derives extra perturbation from a secret key. This has some of the same uses as IVs do in encryption (if you know what those are for). HMAC works by running a hash function multiple times taking both a message and a key as input. I direct you to search HMAC on wikipedia for more information, it's a quick read.

That's it!

You should be pretty set at this point. Obviously I'm not covering all different hash function or how any hash function works in one blog post but... Hopefully someone finds this helpful.

No comments:

Post a Comment

News Updates: MTE

 For anyone who reads that is actually tracking this issue, this won't be news, but I haven't seen that much buzz about it online an...