Hashing is the process of converting a lot of text into a really short piece of text. It is usually used to verify data integrity.
Note: Not the edibles.
A gentle introduction
Let’s say we need to compare two pages manually. This would be our algorithm:
- Take all words from the first page, take all words from the second page
- See if all words are the same in both pages
Joking. Who has time to read everything? More realistically, this is what we would do:
- Take first 2 words on the page (“good morning”), last 2 words on the page (“okay bye”)
- See if those 4 words are the same in both pages (“good morning”, “okay bye”)
Magic! Instead of checking all words on the page, we looked at 4 words and decided if two pages are the same. We have reduced the whole content of the page to just 4 words, kind of like an identifier that represents the whole page. These 4 words are called the hash.
Hash: A short text of a particular length that represents larger text.
Obviously, there is a high chance of duplicates.
Any page that starts with “good morning” and ends with “okay bye” will give us this hash.
When different content results in the same hash, it’s called a collision.
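To make the collision problem concrete, here is a minimal sketch of the "first 2 words, last 2 words" scheme. The function name `toy_hash` and the two sample pages are made up for illustration.

```python
def toy_hash(page: str) -> str:
    """Toy 'hash': the first two and the last two words of the page."""
    words = page.split()
    return " ".join(words[:2] + words[-2:])


page_a = "good morning everyone, the build is green, okay bye"
page_b = "good morning team, the build is on fire, okay bye"

print(toy_hash(page_a))  # good morning okay bye
print(toy_hash(page_b))  # good morning okay bye
print(page_a == page_b)  # False: different pages, same hash, so a collision
```

Two very different pages, one identical hash. That is exactly the weakness described above.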
Improving the algorithm
Can we improve our algorithm to reduce chances of collision?
- Instead of just the first and last words, take all the words on the page.
- Replace the letters with numbers (A with 1, B with 2 and so on) to get a large number.
- Do random mathy stuff. Add 19237, divide by 842, multiply by 91, divide by 1928 etc.
- We might get the number “8364181238938917”. I’d say that’s pretty unique. Better than “good morning okay bye”!
You get the idea: we generated the hash by considering only the first 2 and last 2 words, but the computer can generate a hash that considers all the letters in the content!
This means that even if 1 character is changed, the hash will vary by a large margin.
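Here is a rough sketch of that improved recipe. Everything in it (the name `toy_hash_v2`, the constant `MIXER`, the number of mixing rounds) is an arbitrary choice for illustration, and it is nowhere near a real cryptographic hash.

```python
MOD = 10**16           # keep the result to at most 16 digits
MIXER = 9_176_543_219  # arbitrary odd constant, chosen only for this sketch


def toy_hash_v2(text: str) -> int:
    """Fold every letter of the text into one number using simple arithmetic."""
    h = 19237  # arbitrary starting value
    for c in text.lower():
        if "a" <= c <= "z":
            n = ord(c) - ord("a") + 1  # A -> 1, B -> 2, ... Z -> 26
            h = (h * 91 + n) % MOD     # multiply first so letter order matters
    # A few extra rounds of "mathy stuff" so a small change spreads
    # across all the digits instead of nudging only the last one.
    for _ in range(3):
        h = (h * MIXER + 19237) % MOD
    return h


print(toy_hash_v2("good morning okay bye"))
print(toy_hash_v2("good morning okay byf"))  # one letter changed
```

Spaces and punctuation are skipped in this toy, so it still has silly collisions ("good morning" and "goodmorning" hash the same). Real algorithms work on every byte and mix far more thoroughly.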
That’s it, you now know what hashing means.
A quick review: what have we learnt from our 2 algorithms?
- Hashing is one way.
When we are given only the hash (“good-morning-okay-bye” or “8364181238938917”), there’s no way we can find the complete original content of the page.
- Hash value is repeatable.
No matter how many times we regenerate the hash: for a particular input, the hash will always be the same.
- (very) hard to find any input that can give us a particular hash.
If I give the hash 8364181238938917, how do you find an input that generates this exact hash?
The only way to find an input that gives that exact hash is to try different values repeatedly. And there could be like a billion values, so…yes, pretty hard. As long as the algorithm is good.
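To get a feel for "try different values repeatedly", here is a small sketch. It uses Python's standard `hashlib` with SHA-256, but keeps only the first 4 hex digits of the hash on purpose, so the search actually finishes. The helper names `short_hash` and `find_input_for` are made up for this example.

```python
import hashlib
from itertools import count


def short_hash(text: str, hex_digits: int = 4) -> str:
    """First few hex digits of SHA-256; shortened so the search below finishes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:hex_digits]


def find_input_for(target: str) -> tuple[str, int]:
    """Guess inputs one by one until one of them produces the target hash."""
    for attempt in count(1):
        guess = f"guess-{attempt}"
        if short_hash(guess, len(target)) == target:
            return guess, attempt


target = short_hash("good morning ... okay bye")
guess, attempts = find_input_for(target)
print(f"{guess!r} also hashes to {target} (found after {attempts} attempts)")
```

With 4 hex digits there are only 65,536 possible hashes, so a match turns up quickly. Every extra hex digit makes the search about 16 times longer, and a full SHA-256 hash has 64 of them.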
Examples of popular cryptographic hash algorithms:
SHA, BCrypt, MD5.
With me so far? Congrats! You crossed some boring theory.
But where is this actually used? Let’s get practical!
Used to Verify Data Integrity - Checksums
(A checksum is just another name for a hash that is used this way. One cool word free.)
We upload product EXEs. But we want a way for customers to verify that what they downloaded is exactly the same as what we’ve uploaded.
The best way to do this would be to call them up to verify:
Support: “Hi, support speaking. Please open your license page”
New cx: “It doesn’t open, error says ‘sorry for the inconvenience please contact support’”
Support: “Cool, known issue on our build, can confirm you’re running our actual exe” hangs up
But a simpler way that gives support techs time for at least one Zoho meal a day would be:
- Generate a hash of the exe we’ve uploaded on our website (and call it checksum instead of hash, ofc)
- Ask customers to generate a hash of the exe they’ve downloaded.
- Ask them to check if that’s the same as the hash we’ve displayed on our website.
- If both hashes match, the downloaded file is the same as the one we uploaded.
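The customer's side of those steps could look roughly like this. It is a sketch that assumes the published checksum is a SHA-256 hex string (check which SHA variant the download page actually lists). The function names and the file name `setup.exe` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large installer need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_checksum: str) -> bool:
    """Compare the local hash with the one copied from the download page."""
    return sha256_of_file(path) == published_checksum.strip().lower()


# Demo with a throwaway file standing in for the downloaded exe.
with tempfile.TemporaryDirectory() as tmp:
    exe = Path(tmp) / "setup.exe"  # hypothetical file name
    exe.write_bytes(b"pretend this is an installer")
    published = sha256_of_file(exe)  # what "we" would show on the website
    print(matches_published(exe, published))  # True

    exe.write_bytes(b"pretend this is an installer!")  # tampered download
    print(matches_published(exe, published))  # False
```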
You can see this in our DesktopCentral upgrade page where we list the SHA hash for each file (SHA is the hashing algorithm used).
The steps to generate hashes, and the reasons for doing so, are also listed on our website!
Go on, open those two links in different tabs. Prove to yourself that you know how it works (and that at least some of what I say is right)
🏆Achievement unlocked:
Now you know why customers get so pissed when we re-tag a build and the checksum on the website changes for the same build without incrementing the version number.
When they check the website, the hash listed for their build will be different from the hash of what they have installed. For all they know, they could have downloaded a malicious update that we did not produce!
Used to easily compare data - like User Passwords.
Let’s say your password is “your_crush_from_2nd_grade”.
Instead of storing user passwords directly, we hash each password and store only the hash in the DB.
During login, we hash the entered password and compare it with the value in the DB. If it matches, you’re in.
The advantage here is that even if someone gets access to the DB, your password won’t be exposed. Your secret crush is safe.
But wait - oh no. Remember hash collisions where multiple inputs can give us the same hash value?
Yup, this means that if your password was “Stonebraker@123” and the hash for this is “8008135”, login will succeed if you enter any password that produces the exact hash “8008135” since we only compare hashes and not the actual passwords.
Don’t worry though - we use a good hashing algorithm, where collisions are extremely rare.
This is why many customers ask us which hashing algorithm we use.
MD5 is considered a weak hashing algorithm these days; we use BCrypt, which is stronger.
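Here is a minimal sketch of the store-then-compare flow. BCrypt needs a third-party package in Python, so this example swaps in PBKDF2 from the standard library instead; it is not what our product runs. The random salt, the iteration count and the function names are assumptions added for the sketch.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor for this sketch


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); only these two values would go into the DB."""
    salt = secrets.token_bytes(16)  # random salt, an addition not covered above
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(entered: str, salt: bytes, stored_digest: bytes) -> bool:
    """Hash the entered password the same way and compare the two hashes."""
    digest = hashlib.pbkdf2_hmac("sha256", entered.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, stored_digest)


salt, stored = hash_password("your_crush_from_2nd_grade")
print(check_password("your_crush_from_2nd_grade", salt, stored))  # True
print(check_password("your_crush_from_3rd_grade", salt, stored))  # False
```

Notice that the server never needs the original password after sign-up: it only ever compares hashes.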
🏆 Achievement unlocked:
You now know the answer to this seemingly trick question: “If you don’t store my password on the server, how do you check if the password I enter during login is correct?”
Used to prove you have put work into it - Bitcoin
I said it’s “hard to find inputs that can give us a particular hash”. But really, how hard can it be?
If you don’t know Bitcoin - just think of it as online money.
Anyway, if you’ve been in Zoho for at least 24 hours and you know who Sridhar is, you know new money notes are being printed left and right.
Who prints these new notes, and who owns them? Well, the government prints and owns the new notes of course. The US prints US Dollars and uses them, India prints Rupees and uses them.
However, no country “governs” Bitcoin. So who gets the $$ when a new Bitcoin is printed?
To decide that, they have a mechanism called “Proof of work”. It’s simple.
They give you a hash; you have to find an input value that gives a hash less than that. This process is called “mining”.
i.e., if they give you 009999, come up with any input value that gives you a hash less than that, and you get free Bitcoins!
E.g.: inputs that give 009998, 001111, 005612 etc.
If it feels funny, let’s get real: if you had figured out just one single hash last year, you would be richer now by about 3 crores!
That’s how hard it is to reverse a hash.
People buy farms of thousands of computers, trying input values one by one to be the lucky winner - and they still fail. It’s a lot of work (that’s why this is called proof of work - if you have a valid input, it proves you’ve put work into it).
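The guessing loop behind mining can be sketched in a few lines. This is a toy: the names `hash_as_number`, `mine` and `TARGET` are invented here, the target is made ridiculously easy so it finishes in about a second, and real Bitcoin hashes a structured block header with SHA-256 applied twice against a far smaller target.

```python
import hashlib
from itertools import count


def hash_as_number(data: str) -> int:
    """Interpret the SHA-256 hash of the data as one big integer."""
    return int(hashlib.sha256(data.encode("utf-8")).hexdigest(), 16)


def mine(block_data: str, target: int) -> tuple[int, int]:
    """Try nonces 0, 1, 2, ... until the hash falls below the target."""
    for nonce in count():
        value = hash_as_number(f"{block_data}|{nonce}")
        if value < target:
            return nonce, value


# Toy difficulty: the top 16 bits must be zero, i.e. the hash starts
# with 4 zero hex digits (roughly 1 chance in 65,536 per guess).
TARGET = 1 << (256 - 16)

nonce, value = mine("toy block", TARGET)
print(f"nonce={nonce} hash={value:064x}")
```

Anyone can verify the winning nonce with a single hash, but finding it takes tens of thousands of guesses even at this toy difficulty. That asymmetry is the "proof" in proof of work.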
Are you feeling lucky? Try your luck! (but not on company servers, you have been warned).
Example hashes
Note that even with a single character change, results differ completely.
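You can generate a pair of examples yourself. This sketch uses SHA-256 from Python's standard `hashlib`; the helper name `sha256_hex` and the two sample strings are just for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """SHA-256 hash of the text, as 64 hex digits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a = sha256_hex("good morning")
b = sha256_hex("good mornin9")  # only the last character differs

# Count how many of the 64 hex digits changed between the two hashes.
differing = sum(x != y for x, y in zip(a, b))

print(a)
print(b)
print(f"{differing} of {len(a)} hex digits differ")
```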
You can also play around with hashing online!
That’s it! You should now know enough about hashing to identify it around you, and also read more about it online and understand that geek-speak.
Thank you and good night. :)