What are hash functions? What are they good for? How do you pick them? A fast nearly-impromptu introduction, from the weekly npmjs engineering meeting.
integer key, and n is the number of buckets, not a power of 2: f(k) = k(k+3) mod n const buckets = 19; function knuth(k) { return (k * (k + 3)) % buckets; }
• initialize to 0 • take the input in chunks of the size that you want • do something to the chunk & XOR it into the result • when out of chunks, return The "something" can either be very simple, like the Knuth division hash, or very complex.
you want to stick into memory for fast access by a key. You use a hash function to map the keys into buckets, one for every output number, and put the matching data for a key into the key's bucket. There might be more than one thing in the bucket, but that's fine because you can look inside the bucket to find your item in a small collection, instead of having to search the whole thing. This is the data structure underlying associative arrays, caches, and a million other things in software.
package.tgz 0030be42121988078dca0ec982d04f72 She gives that output number to Bob, which he compares to the md5 sum of his download to see if he has the data she meant to give him.
• output is uniformly distributed, not clustered • output value has a fixed size • usually: fast • sometimes: similar inputs produce nearby outputs • sometimes: similar inputs produce distant outputs (avalanche effect)
into running her npm package instead of the one Alice wrote. Because Alice used the weak MD5 algorithm to sign her data, Carol is able to craft a tarball that has the same MD5 digest but different data inside. She man-in-the-middle attacks Bob and serves him her cleverly-crafted package.tgz instead of Alice's. Bob is now pwned.
• finding collisions might be feasible • ditto reconstructing the original • use when speed matters • use when defense against malicious input doesn't matter
output • store the output only • repeat the transformation when checking the password to see if you get the same result An attacker who can run that transformation frequently & quickly is one who can brute-force attack your users' passwords.
used (git shasums!) • designed by NSA in 1995, 1st attack in 2005 • better attack in 2015 • collision at > brute force speed by Google in 2017 • do not use as a cryptographic hash
sign her package tarballs. Carol is an employee of Google or maybe of a nation-state's spy agency. Carol has a lot of computing power available, and really wants to pwn Bob. She cleverly crafts a tarball that has the same shasum as Alice's, and serves it to Bob. In ten years, Carol will be able to do this with a Raspberry Pi 7 the size of her thumbnail instead of a fleet of cloud computers.
variants • SHA-256 & SHA-512 are the most-used • designed by NSA in 2001 • no feasible attacks known • do use freely • replace SHA-1 with this, generally
variants • won a competition to choose next SHA standard • adopted as standard in 2015 • chosen for its differences from SHA-2 • no feasible attacks known • do use freely
denial-of-service attacks by attacking a language's underlying hash implementation. • send crafted data to an app, which stores it in a hash table • the keys are designed to cause collisions • all the data goes into one hash bucket • hash table performance collapses into merely a linked list • which is, as you know, Bob, a truly awful data structure
are all fine choices • SHA-2 is safer because of implementation availability • we are not size-sensitive about the output • we care more about picking an algo that will last • SHA-512 is a solid, safe choice
when git was young, when it would've been easy. Linus rejected the suggestion and didn't seem to understand the threat. He wired assumptions about SHA1 deeply into git. In the next few years, nasty people will teach him the threat model, with ungentle manipulations of his and many other peoples' source trees. --Gilmore
do use SHA-2 and newer hashes • use bcrypt to store passwords • don't invent unless you're an expert • in which case you should be giving this talk not me