The written word is built not only on proper spelling, but on proper punctuation too. It is no surprise, then, that when the world’s languages are expressed in written form, they feature letters (or characters), punctuation marks and special diacritics. We use these every day when we write by hand. What you interpret as symbols, though, your computer interprets as numbers stored in binary — a mapping known as a character encoding. Although these are ultimately just 1’s and 0’s, they form an intricate interface between our language and that of the machine. In the past, the encodings that helped computers process symbols and characters were rather compact, making it difficult to represent all the world’s languages and their constituent symbols in text files. At some point, it was difficult to represent even English consistently in a single encoding! Then came Unicode. But…
What is Unicode text?
Unicode is a universal character encoding standard. It specifies how characters are represented in text files, web pages and many other types of documents. Above, we mentioned the more compact encoding standards of the past, a popular one being ASCII. Unlike ASCII, which was built around the English language alone, Unicode was designed to represent symbols and characters from languages all around the globe, with room for over 1.1 million code points (of which roughly 150,000 are currently assigned to characters). By contrast, ASCII supports only 128 characters. Essentially, Unicode text is capable of unambiguously representing any character, punctuation mark or diacritic from any of the world’s known written languages.
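To make the idea of code points concrete, here is a minimal Python sketch (the characters chosen are arbitrary examples) showing that every character, from any script, has a single unique number:

```python
# Every character has one unique code point -- its "ID" in Unicode.
# Python's ord() returns that number; chr() goes the other way.
for ch in ["A", "é", "م", "中", "🙂"]:
    print(f"{ch!r} -> U+{ord(ch):04X}")

# The mapping is reversible: the ID alone identifies the character.
print(chr(0x1F642))  # the slightly smiling face emoji
```

The `U+XXXX` notation is how the Unicode standard itself writes code points, in hexadecimal.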
To toy around with the idea, consider what really happens when you change the font in a document. Selecting “Times New Roman” does not change your Unicode text at all: the underlying code points stay exactly the same, and the font merely supplies different glyphs, the visual shapes used to draw each character. That said, Unicode does contain genuine styled variants of the basic alphabet as separate characters (for example, the Mathematical Alphanumeric Symbols block). These variants are formally linked to their plain equivalents, which is how software and “fancy text” generators know which ordinary letter each stylized character corresponds to.
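A short sketch of the distinction using Python’s standard `unicodedata` module: the styled variant below is a real, separate Unicode character, and normalization reveals which plain letter it corresponds to.

```python
import unicodedata

# MATHEMATICAL BOLD CAPITAL A: a distinct code point, not a font effect.
bold_a = "\U0001D400"
print(unicodedata.name(bold_a))

# NFKC (compatibility) normalization maps the styled variant
# back to the plain letter it is linked to.
plain = unicodedata.normalize("NFKC", bold_a)
print(plain)  # "A"
```

This is why pasting “fancy” text into a search box often still matches the ordinary word: software can normalize the variants away.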
How it Works
As we mentioned earlier, your computer speaks the language of numbers, specifically binary code. In the case of character encodings, your computer assigns a number to each of the characters included in the encoding standard. Unicode provides each character with a unique number (think of it as the character’s ID) such that regardless of the platform or device on which the language is used, the character remains unambiguously defined. When stored in a common encoding such as UTF-8, each character can take up to 4 bytes. To understand the implications of this size allowance, think back to how ASCII only supports 128 characters. That limit is no surprise given that ASCII uses only 7 bits (one byte) per character. In essence, it has far fewer “IDs” to dedicate to individual characters than Unicode.
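A character’s ID and its stored size are separate things: the code point is fixed, but the number of bytes depends on which encoding you choose. A small Python sketch (using the euro sign as an arbitrary example):

```python
ch = "€"

# The code point (the character's ID) never changes.
print(ord(ch))                 # 8364, i.e. U+20AC

# The byte representation depends on the encoding.
print(ch.encode("utf-8"))      # 3 bytes in UTF-8
print(ch.encode("utf-16-le"))  # 2 bytes in UTF-16
```

This separation is exactly what lets different systems store the same text differently yet still agree on which characters it contains.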
Popular Unicode Encodings
The most popular Unicode encodings are the UTF-8 and UTF-16 standards (although there are others, such as UTF-32). Many software programs and web pages now use UTF-8 as their default encoding. You will find that although it allows up to 4 bytes per character, UTF-8 assigns fewer bytes to the most commonly used characters in the name of efficiency. Basic Latin characters, including the English alphabet, are represented in one byte. Characters from scripts such as Arabic, Hebrew, Greek and Cyrillic, along with accented Latin letters, take 2 bytes, and most Chinese, Japanese and Korean characters take 3 bytes. The full allowance of four bytes is reserved for the rarest characters outside this range, including most emoji.
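The variable widths described above are easy to verify. Here is a short Python sketch (the sample characters are arbitrary picks from each script) printing how many bytes UTF-8 uses for each:

```python
# UTF-8 uses more bytes for characters with larger code points.
samples = {
    "English": "A",   # basic Latin
    "Arabic":  "ب",
    "Hebrew":  "א",
    "Chinese": "中",
    "Emoji":   "🙂",
}

for script, ch in samples.items():
    size = len(ch.encode("utf-8"))
    print(f"{script}: {ch!r} -> {size} byte(s)")
```

Running this shows the 1/2/2/3/4-byte progression, which is why mostly-English text is very compact in UTF-8.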
The main advantage of Unicode is the uniformity it has brought to data interpretation the world over. Previously, text files and web pages were vulnerable to conflicting encodings that assigned the same number to different characters, or different numbers to the same one! Computers were burdened with supporting multiple encodings to understand documents and web pages, all while running a constant risk of data corruption during transfers. With Unicode, the computing world has gained cross-platform compatibility through a single universal character encoding standard. It underpins the world’s major operating systems, browsers and search engines. In fact, the Internet and the World Wide Web owe their universal definition of characters to Unicode.