On Encoding



🚧This is still under construction. Please share any feedback you might have! (and yes, I used marquee)🚧

TLDR: Encoding is just changes made to avoid confusion.
But if there’s one thing you should take away from this, it’s that encoding has nothing to do with hiding data.

Encoding in our everyday lives

I want to share a list of movie names with a friend who doesn’t know Tamil. Usually when sharing a list, we share it as comma-separated values. The comma here has a special duty: to separate one name from the next.
But what happens when the movie itself happens to have a normal comma in the name, like the movie “Ghilli - No guts, No Glory”?

> Me: Some movies - Shawshank redemption, Ghilli - No guts, No Glory, The Godfather
> You: There's a movie called "Ghilli - No guts"?
> Me: No, "Ghilli - No guts, No Glory" is a single movie 
> You: So "Shawshank redemption", "Ghilli - No guts, No Glory", "The Godfather"?

One possible way to avoid this confusion is to enclose all movie names in double quotes. This change, made to avoid confusion, is called encoding.
When we want to get the exact movie names, we remove the double quotes we added during encoding. This change, made to turn encoded data back into the original data, is called decoding.

The comma is an “extra-special character” for our list. A language that doesn’t use commas would separate names some other way, so these “extra-special characters” can differ from one language to the next.
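The quoting trick above is roughly what CSV tools do. Here is a minimal sketch using Python's built-in csv module; the choice of csv.QUOTE_ALL is my assumption, picked only because it matches the "put every name in double quotes" approach described above.

```python
import csv
import io

movies = ["Shawshank redemption", "Ghilli - No guts, No Glory", "The Godfather"]

# Encoding: wrap every name in double quotes so commas inside a name are harmless.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(movies)
encoded = buffer.getvalue().strip()
print(encoded)
# "Shawshank redemption","Ghilli - No guts, No Glory","The Godfather"

# Decoding: the reader removes the quotes and gives back the original names.
decoded = next(csv.reader([encoded]))
print(decoded == movies)  # True
```

Notice that the comma inside the second name survives the round trip, because the reader only treats commas outside the quotes as separators.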

URLs

The URL https://search.zoho.in/searchhome opens the Zoho Search page.
When we search for the word “completed” and select “Zoho Projects”, we get a page that says “no results found”. Nothing new, but the interesting part is that the URL becomes https://search.zoho.in/searchhome?q=completed&s=projects.
What’s the ?q=completed&s=projects part? Turns out that’s the way the data we type in is sent to Zoho Search via the URL. The “?” indicates the start of the extra data, “=” maps each name to its value, and “&” separates one name-value pair from the next.
(I’m assuming q is the short form for “query” and s is for “selected-app”. Why waste time say lot word when few word do trick?)
These extra parameters sent in the URL in the format ?param1=value1&param2=value2&param100=value100 are called Query Parameters.
So Zoho Search then processes those inputs (q = “completed”, s = “projects”) and displays the output.
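To see the ? & = rules in action, here is a small sketch that splits the URL above the way a server might. It uses Python's standard urllib.parse module; how Zoho Search actually parses its URLs is not something I know, so treat this as an illustration only.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://search.zoho.in/searchhome?q=completed&s=projects"

# Everything after the "?" is the query string.
parts = urlsplit(url)
print(parts.path)   # /searchhome
print(parts.query)  # q=completed&s=projects

# "&" separates the pairs, "=" separates each name from its value.
params = parse_qs(parts.query)
print(params)  # {'q': ['completed'], 's': ['projects']}
```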

We now know that the characters ? & = are “extra-special characters” for URLs (like commas for lists). We also know that whatever we search for is being sent directly in the URL.
Let’s get evil! We can search for “completed?” instead of “completed”, which would result in the URL containing two ?s - Search would get confused since it won’t know where the query params start from.
But when we search for it, the query-params don’t become ?q=completed?&s=projects like we expected - it becomes ?q=completed%3F&s=projects instead.

What does the %3F there mean?
Turns out Zoho Search smartly encodes the extra-special characters in our query before sending it; the %3F is the encoded value of ?.
This process of encoding extra-special characters in URLs to avoid confusion is called URL Encoding (and this happens in all browsers and websites, not a Search thing!).
When Search servers get this encoded data (completed%3F), they decode it back to the original characters (completed?), forward it to their expert librarians who find relevant results in Zoho Projects and return it to you with a footnote that says “don’t be evil”.
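The encode-then-decode round trip described above can be sketched in a few lines. This uses Python's standard urllib.parse functions as a stand-in; it is not Zoho's code, just the same idea.

```python
from urllib.parse import urlencode, parse_qs

# Encoding (what happens before the URL is sent): "?" becomes %3F.
query_string = urlencode({"q": "completed?", "s": "projects"})
print(query_string)  # q=completed%3F&s=projects

# Decoding (what the server does on arrival): %3F becomes "?" again.
received = parse_qs(query_string)
print(received["q"][0])  # completed?
print(received["s"][0])  # projects
```

Because the evil "?" travels as %3F, the only real "?" left in the URL is the one that starts the query params.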

Wondering what & or = look like when encoded? You can find a list here.
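If you'd rather generate the list yourself, here is a tiny sketch using Python's urllib.parse.quote. Passing safe="" is my choice, so that no character gets left alone.

```python
from urllib.parse import quote, unquote

# Each special character becomes "%" followed by its code in hexadecimal.
for char in ["?", "&", "=", " "]:
    print(repr(char), "->", quote(char, safe=""))
# '?' -> %3F
# '&' -> %26
# '=' -> %3D
# ' ' -> %20

# Decoding reverses it.
print(unquote("%26"))  # &
```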

Keep your eye out for URLs and you’ll notice it everywhere!

Practicals time! 🧪

Tamil URL, Hindi URL, Chinese URL
Open any of these URLs right now in a new tab.
Copy the URL, paste it in any text editor or chat app. What do you see? %🤯%
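Why so many % signs? URLs only carry a limited set of plain characters, so every other letter gets encoded too. A minimal sketch with Python's urllib.parse, where the word is just an example I picked:

```python
from urllib.parse import quote, unquote

word = "தமிழ்"  # "Tamil", written in Tamil

# Each Tamil character takes 3 bytes in UTF-8, and each byte becomes %XX.
encoded = quote(word)
print(encoded)
print(len(word), "characters ->", len(encoded), "characters")

# Decoding gives the original word back.
print(unquote(encoded) == word)  # True
```

So one Tamil character turns into nine characters in the URL, which is why the pasted URL looks so much longer than the one in the address bar.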

Learn HTML in 2 minutes with Cliq

A quick detour: There’s a markup language called HTML that considers characters like & < > " ' to be special characters.
HTML is what makes up websites, even this one. Cool cat pictures? Typing in your password? Watching a video? All this is built on HTML (with a little help from languages like JS and CSS, but that’s off topic).

Let’s focus on this tiny part when we open Cliq:
When we open cliq.zoho.com on our browser, data is sent from Cliq server to the browser. The browser then displays the data that Cliq sends.
Data is sent from Cliq in HTML format:

  1. HTML code is usually surrounded by <> brackets. <> marks the start of an instruction and </> marks the end of it.
  2. The browser processes HTML, hides that HTML code and shows us the finished product.
  3. What do I mean by “finished product?”
    The HTML code to make text look bold is <b>. Cliq server sends <b>show this in bold!</b> to the browser; the browser processes this HTML, understands that “show this in bold!” needs to be made thicker and displays “show this in bold!” in thick letters, with the <b> code hidden. Similarly, the code for italics is <i>xyz</i>, underline is <u>xyz</u> and so on.
    Beyond basic displaying, it also decides what actually happens when clicking on a button, moving your pointer over something, etc.

(1 minute left)

<b>Browser, show this in bold!</b>
<i>mamma mia, italics</i>
<script>alert('hello there!')</script>
  1. Open notepad or any text editor
  2. Copy-paste the three lines of code above into it and save the file as “anything.html”
  3. Open it in your browser

So HTML is basically instructions for the browser. But how do all browsers know what to do with those instructions? Do they all know that <b> is bold and that bold means that the text should look thicker?
Here’s the neat part - they don’t always know. Like people, they might have different interpretations for the same instructions.
This is the reason why some things look different on Firefox vs Chrome, or some things work on Internet Explorer but not Chrome.

Back to the topic - when we open cliq.zoho.com, Cliq server sends encoded usernames to our browser in HTML format, and the browser decodes them before displaying them on the screen.
However, there can be instances where encoding is done but the decoding step hasn’t been done for some reason - these instances give us a peek into the otherwise invisible world of encoding.

Presenting Dinesh: The Cliq name set here is Dinesh "Desmond" Miles. Cliq encodes the data (here, the name), so the double quotes are HTML-encoded into &quot;. The decoding step hasn’t happened here, which is why we’re seeing the encoded value Dinesh &quot;Desmond&quot; Miles directly.

The encoding for " is &quot;, < is &lt; (less than), > is &gt; and so on.
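You can reproduce Dinesh's name yourself. This sketch uses Python's built-in html module, which I'm assuming behaves like whatever Cliq uses for these particular characters.

```python
import html

name = 'Dinesh "Desmond" Miles'

# Encoding: the double quotes become &quot;
encoded = html.escape(name)
print(encoded)  # Dinesh &quot;Desmond&quot; Miles

# Decoding: the step that was skipped in the example above.
print(html.unescape(encoded))  # Dinesh "Desmond" Miles

# The angle brackets get the same treatment.
print(html.escape("<>"))  # &lt;&gt;
```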

Achievement unlocked: Congratulations, you have now learnt a skill you can use for the rest of your life across the internet. When you see the “%” you can say “dude the encoding is muddled” instead of wondering “wHaT aRe theSe wEiRd %%%% bruH am i bEinG HACKED????!”.

Encoding and Security

Broken encoding is a minor inconvenience. Or is it?

So far, it’s been all fun and games - a funny name here, a weird URL there. But can it do actual damage?
Turns out, yes. We know that HTML carries instructions that tell the browser how to behave. So if you control the HTML on the page, you control what happens on the page.

The HTML code to follow someone on Connect could look like this:
Follow: <on-clicking-enter-follow-this-person>[person name]</on-clicking-enter-follow-this-person>
When the browser reads <on-clicking-enter-follow-this-person>, it understands the instruction: when the person’s name is clicked, it should make you follow that person. It then hides the HTML code.
So what we see on the screen is only Follow: [person name]

Warning: Are your dev friends telling you this isn’t actually how it works? If not, ignore this. If yes - you’re lucky, ask them to explain how it actually works.

If you want to follow your friend “Kaipulla” on Connect, the code would be:
Follow: <on-clicking-enter-follow-this-person>[Kaipulla]</on-clicking-enter-follow-this-person>
What you see is Follow: [Kaipulla].

Here’s the important bit. If you don’t understand this, read it again. If you still don’t understand it, throw tomatoes (not eggs, no non-veg) at me and I’ll try to explain better.
Now let’s say [Kaipulla] changes his name to <b>[Kaipulla]</b>.
Then the code becomes:
Follow: <on-clicking-enter-follow-this-person><b>[Kaipulla]</b></on-clicking-enter-follow-this-person>
The browser is now confused! It does not know that the <b> is part of the name and not the HTML code, and so it’s treated as HTML code.
What you see is Follow: [Kaipulla], with the name in bold.

Kaipulla is a hacker. He now changes his name to <automatically-follow-this-person>[Kaipulla]</automatically-follow-this-person>.
The code becomes: Follow: <on-clicking-enter-follow-this-person><automatically-follow-this-person>[Kaipulla]</automatically-follow-this-person></on-clicking-enter-follow-this-person>

Woah, slow down. What happens now?
Similar to the previous <b> example, the browser treats <automatically-follow-this-person> as HTML code and takes it as an instruction.
Although all you see is Follow: [Kaipulla], just loading the page will cause you to follow Kaipulla, because he has “injected” HTML code into what’s supposed to be plain text.
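Here is a rough sketch of why the injection works. The build_page function and the tag names are the made-up ones from this article, not real Connect code, and Python's html.parser stands in for the browser.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Records every opening tag the parser treats as an instruction."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def build_page(name):
    # Unsafe (hypothetical) server code: the name is pasted into the HTML as-is.
    return (
        "Follow: <on-clicking-enter-follow-this-person>"
        + name
        + "</on-clicking-enter-follow-this-person>"
    )


def tags_seen(page):
    parser = TagLister()
    parser.feed(page)
    parser.close()
    return parser.tags


evil_name = (
    "<automatically-follow-this-person>[Kaipulla]"
    "</automatically-follow-this-person>"
)

print(tags_seen(build_page("[Kaipulla]")))
# ['on-clicking-enter-follow-this-person']
print(tags_seen(build_page(evil_name)))
# ['on-clicking-enter-follow-this-person', 'automatically-follow-this-person']
```

The second list has an extra instruction in it that the page's author never wrote. It came from the name.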

Luckily, Cliq encodes user inputs such as usernames.
So when the name is <b>[Kaipulla]</b>, Cliq does not send <b>[Kaipulla]</b> directly to the browser. Instead, it sends &lt;b&gt;[Kaipulla]&lt;/b&gt;. This encoding is called HTML encoding.
When the browser sees the code Follow: <on-clicking-enter-follow-this-person>&lt;b&gt;[Kaipulla]&lt;/b&gt;</on-clicking-enter-follow-this-person>, it knows only the content inside <> is code. After removing the HTML code, it decodes &lt;b&gt;[Kaipulla]&lt;/b&gt; into <b>[Kaipulla]</b> and displays it like normal text instead of treating it as instructions.
So you would actually see “Follow: <b>[Kaipulla]</b>” on the screen, brackets and all.
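The fix can be sketched the same way. Here html.escape plays the role of the server's encoding step and html.parser plays the browser; the tag name is the made-up one from this article, and none of this is actual Cliq code.

```python
import html
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Separates the tags (instructions) from the text shown on screen."""

    def __init__(self):
        super().__init__()  # by default, &lt; and friends in text get decoded
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


name = "<b>[Kaipulla]</b>"

# Encode the user-controlled name before putting it into the HTML.
page = (
    "Follow: <on-clicking-enter-follow-this-person>"
    + html.escape(name)
    + "</on-clicking-enter-follow-this-person>"
)

reader = PageReader()
reader.feed(page)
reader.close()

print(reader.tags)           # ['on-clicking-enter-follow-this-person']
print("".join(reader.text))  # Follow: <b>[Kaipulla]</b>
```

Only one instruction is seen this time, and the name shows up on screen exactly as it was typed.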

Encoding, Encoding everywhere

Encoding is needed wherever any info passes from one medium or kind of technology to another.
The meaning of ? in Tamil (human language) is different from what it means in URLs (browser language). So when we send info that the user types in Tamil to URLs (a technology that uses a fundamentally different language), it needs to be encoded to prevent confusion while reading URLs.
<b> has no special meaning in English but means “make this bold” in the HTML language. So when something typed by a user in English is sent to a technology that uses HTML, it needs to be encoded to prevent confusion while reading HTML.

So understandably, there are a lot of types of encoding: URL encoding, HTML encoding, SQL encoding, JS encoding etc.
Attacks that abuse missing or broken encoding are generally called “injection”: HTML injection, SQL injection, JS injection.

Want to encode some data? A common format used across technologies is base64 encoding - it turns any data, special characters included, into plain letters, digits and a few symbols (+, / and =).
Try it here!
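Or try it at home. A minimal sketch with Python's built-in base64 module, where the sample text is just something I made up with lots of special characters in it:

```python
import base64

original = 'Dinesh "Desmond" Miles <b>?&=</b>'

# Encoding: base64 works on bytes, so the text is turned into bytes first.
encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")
print(encoded)

# Decoding needs no password or key - anyone can do it.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == original)  # True
```

The encoded text looks scrambled, but remember the takeaway from the top: this is encoding, not hiding. Anyone who sees it can decode it.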