A note on timing: I know that it's been a little while since I posted last, and I haven't given up on blogging, I'm just balancing life. My last post on MTE was a big one that I worked on pretty constantly for a while and I got a little burned out, to be honest. I've been focusing on work and personal projects that I may write up at some point. Just saying, I'm here, talking into the aether.
To the post!
I'm starting with hash functions because they have the simplest explanation in theoretical terms, even if their implementations are among the most complex algorithms. I'm not going to go into how specific hash functions are implemented here.
If you've taken one of the most basic computer science courses, usually called something like "Data Structures", you are likely familiar with a structure called a "hash table". The key piece of a good hash table is the hash function. If you remember that class really well (or you're currently in that class) you'll remember that the general requirements are that the function must be fast and is usually desirable for it to have a flat distribution of outputs based on inputs. (Spread the keys across as many buckets as possible.) With these you're on the right course, but Cryptographic Hash Functions (the only ones we care about in security) have more complete requirements.
Pause for nomenclature:
hash - /haSH/
- (hash function) (N.) Function that performs a hash operation
- (V) to perform a hash operation via a hash function
- (N.) the output of a hash function
- (N., illicit) ask your stoner friends
Now, a little background on why we want hash functions in the first place.
Purpose
The purpose of a hash function is to take an arbitrary data input, having an arbitrary size, and process it and output data of a predetermined size.
I know this sounds like saying "The purpose of a hammer is to hit things," which is technically true, but it obviously misses some important context. So...
Where would a hash function be used?
Generating Random Data
Generating random data from a hash function is the less common use case, but it's worth mentioning because it does happen, and it's important to know that this is in your toolbox.
There are plenty of reasons that you may need random data, but since hash functions guarantee that the output will be the same for any consistent input, they're useful for key perturbation (e.g. mixing stream specific attributes with a master key to derive stream key), generating seed data for RNG things (seeding rand with current time is great, but what if you seed it with super random things?), or any other case where you need a pseudo-random source of data that is either shared or re-calculable.
Obviously the first example is more clear that you would want a cryptographic hash function since it invokes a cryptographic use case, but the second may be just as important, what if the RNG is driving an online gambling platform? As close to pure random as possible can be extremely important sometimes...
Digesting Data for later Validation
Just the wording of this use implies something like a checksum, and this is true sometimes, hash functions can be used, and often are used, as checksums. The important thing here is that, while checksums traditionally protect the recipient from transmission errors, using a cryptographic hash will also protect the channel integrity from attackers who wish to modify the message. If an attacker was to modify the message but leave the hash alone, the recipient would know the message was invalid. Further, by using a cryptographic hash function (as opposed to a simpler checksum technique), the attacker has a harder (potentially impossible) time making the new data fit the original checksum.
Proof/Evidence of Existence/Knowledge at a Certain Point in Time
Imagine you could type some message, the tweet the digest and some metadata about your message (but not the message itself). Then later, you may choose to publish that message along with a reference to the tweet. As long as you haven't modified the message, it can be said that it is exceedingly likely that you had already typed that message out before you published the tweet.
Passwords work in a similar way, except the whole point is to obscure the "message" (password). Properly implemented password systems will use cryptographic hash functions in order to derive some data, then that data is stored, instead of storing the password. Then, when your identity needs to be verified, you supply your password, the data is derived again and compared to what is stored in the server. In fact, this can be implemented in such a way that the website never actually sees your password. This is a form of zero-knowledge proof. Basically, you're proving to the website that you have some data that can be used to derive some number, you know the solution to a complicated problem that, at the time, the server does not posses.
I can foresee myself writing a post at some point going into all the different ways, most of them incorrect and dangerous, that password management and validation can be handled, but this is not the place.
*Sigh*... I guess I should mention "blockchains"...
I may... MAY create a post about blockchains later, they're kind of played out on the explanation side, and it's become such a buzzword... At one point, I literally sat through a meeting with some business people and some engineers brainstorming how a company could market a "blockchain product"...
Anyway. Hash functions are at the core of blockchains. Each consecutive block in a blockchain has the hash of the previous block, this field must be included when calculating the hash for use in the next block to include. This is where the term "blockchain" comes from. The "block" is the data to be committed to the dataset, and the "chaining" happens when you include a hash of a previous block in a new block. The blocks are then all stored in a giant hash table by their cryptographic hashes. This way, if you would like to trace the chain from the head to the original block, you can just fetch by hash.
How would one use a "hash function"?
Hash functions are typically pretty easy to use, actually. Here's some examples:
Bash:
echo "asdf" | md5um
Python:
import Crypto.Hash.MD5 as MD5md5 = MD5.hashFactory() # Get an MD5 contextmd5.update(b"asdf\n") # add data to contextfrom binascii import hexlify # Get a way to print prettyprint(hexlify(md5.digest())) # return the hash and print it as hex
I would give a C example, but C can be dependent on which implementation you choose. Sometimes it uses uint8_t arrays for everything, sometimes there's typedef's for everything, sometimes you get a magic function that does it all, sometimes you have to construct and maintain and destroy a context, etc. It will usually look similar to the python above, construct a context, feed data through the hash function to update the context, get the output.
If you really want to play with it in C, you're better off finding the crypto library you wish to use and looking up an example of that specific library.
What makes a "Cryptographic Hash Function" a Cryptographic Hash Function?
I've talked a bit about use cases, how it's used, why it's used, why you should choose a cryptographic hash function over a non-cryptographic hash function, but what makes a "cryptographic hash function"?
Here, we'll have to go to the theoretical side and I'll describe the "theoretically perfect cryptographic hash function". We'll call this function: H. You know, for... Hash.
What does H look like? What is its signature?
H takes in some set of data and outputs a hash of that data that is always the same size, in pseudo-C, it would look like:
uint8_t[HASH_OUTPUT_SIZE] H(uint8_t* data, size_t data_size);
How does it work?
This is actually not an important question. There are different approaches to hash functions that are used, but "how" is not actually important as long as it meets certain metrics.
Requirements:
All of these requirements are met in a theoretically perfect has function.
- Given the output, the input cannot be derived
- Output is only dependent on the input
- Outputs given two different but similar inputs are unpredictably different
- Output is collision resistant
Input Cannot be Derived
Hash functions are meant to be one-way. This should be fairly intuitive. H can take an infinite number of inputs, length ranging from 0 to infinity, on length alone, the number of possible inputs outnumbers the number of possible outputs by a factor of... well.. infinity. This means that, if you had a perfect mapping of inputs to outputs, it would be impossible to derive the input that caused a given output with 100% certainty because there would be an infinite number of inputs that could have caused that output.
So, by necessity, hash functions must be at least in part 1-way, but, the idea here is that for ANY output it should be exceedingly difficult to derive the possible inputs and choose the correct one.
Consider H is used to perturb the value of a password, you would hope that the original password could not be derived from H(password).
Output is Only Dependent on Input
This is exactly as it reads. Essentially, if you take the same input data and feed it through H, regardless of what system it runs on, what time of day, date, timestamp, astrological sign, color of sky, anything else, it should ALWAYS return the EXACT SAME THING.
Outputs Given Two Different Inputs are Unpredictably Different
This requirement becomes more important in cryptographic systems. If some attacker has some possible set of message formats, and they may observe the hash of some message, you don't want them to be able to derive which format is in use based on the hash (so long as there is some unknown portion of the message). The real reasoning behind why this is really important is hard to explain without understanding other topics that I have yet to cover, so I'll exclude those reasons for now, but just know that this is an important property.
Output is Collision Resistant
Collision Resistant sounds more like a property that I wish my last car had than a property of cryptographic functions, but it's a pretty foundational concept in hash functions. The idea is that for completely random inputs, there should be a flat distribution of outputs. For two randomly selected inputs, there should be two random outputs to pair with them. This really comes down to an efficiency issue, if the output of your hash function is 32 bits wide, why actually output less than 32 bits of entropy?
No comments:
Post a Comment