Characters and Fonts
Before getting into Indian language typing, let’s understand how the computer makes sense of what we type. We will do this with the help of examples from English typing, since that is the language computers handle best.
The computer understands individual characters. Every letter, number and symbol is a character to the computer. ‘A’ is a character, ‘a’ is another character, “,” (comma) is another character and so on.
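For the curious, here is a minimal Python sketch of what “a character to the computer” means. It only shows the number the computer stores for each of the three characters mentioned above; the variable name `codes` is just for illustration.

```python
# Each character is stored as a number (its "code point").
# 'A' and 'a' look related to us, but they are two different characters.
codes = {ch: ord(ch) for ch in ["A", "a", ","]}

for ch, number in codes.items():
    print(repr(ch), "is stored as", number)
```

Nothing in these numbers says anything about how the character should look on screen. That part is left to the font.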
Then there are fonts, which tell the computer how to display a particular character. So, the same set of characters “Pothi.com” will appear different in different fonts.
As you can see, the “P” displayed by the font “Times New Roman” is different from the “P” displayed by the font “Arial”, which in turn is different from the “P” displayed by the font “Monotype Corsiva”. The same is true of other characters like “o”, “t” etc.
What is important here is that the computer still knows that a “P” is a “P”, irrespective of the font it is displayed in. That’s why, when you use the “Find” or “Search” feature while typing, it finds the word or character you searched for, irrespective of how the font displays it. In fact, the underlying system that recognizes the character does not care at all how the font displays it. In the following image, the first line is “Pothi.com” in a font called “MT Extra”. The second line is the name of the font!
As you can see, the display makes no sense to an English-reading human. But the computer does not care.
What this means is that you can design a font that displays certain English characters as certain letters of one of the Indian languages. We’ll take Hindi as an example.
The same set of characters “Pothi.com” in a font called “Kruti Dev 010” becomes the following:
Of course, it does not look anything like “Pothi.com” to English readers. Hindi readers can see Hindi letters (though not a meaningful combination). But to the computer, it is still “Pothi.com”.
However, with this font to my aid, I can now concoct character combinations that look like meaningful Hindi words to Hindi readers. For example, the characters “dje #i” generate the following:
Hindi readers can identify meaningful words here, even though for the computer it is just “dje #i”.
This is one way of typing Hindi. Most Hindi books are typeset this way, using one of the fonts that display an English character as a Hindi letter.
When the ultimate aim is to print, this method works just fine. Once the book is printed, nobody cares what the original character stored in the computer was.
But this method has big issues. For example:
- No standardization: When the computer has no characters assigned to the letters of your language, each font developer is free to decide which character should be displayed as which letter. So, one font decides to display “A” as “अ” and another font decides to display “d” as “अ”. What do you do then? In English, you can write something and then change the font at the click of a button. But in Hindi, if you change the font after writing, you will get totally different letters on the screen, which are likely to be meaningless. Plus, for each font you have to learn typing all over again! Lack of standardization is also a problem on the Internet. If you type the content in one font and send it to someone, the recipient must have the same font on their computer in order to see the meaningful text you have written. No other font will do. Compare this to English, where you may type in one font and the other person may not have that font. They can still read it, because whatever English font they have understands the underlying characters and displays the correct letters for an English reader.
- Not searchable: In this system, the computer does not understand underlying characters of Hindi language. It is just the English language characters wearing a different look as far as the computer is concerned. So, there is no good way of searching through this content. In the Internet age, this is a major disadvantage. A lot of content available on Internet today is discovered only by search and if you want your content to be discovered, it is important that it is typed in a way so that it is searchable.
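Both problems can be sketched in a few lines of Python. The two “fonts” below are made-up tables, not the real mapping of Kruti Dev or any other font; they only reuse the example above, where one font shows “A” as “अ” and another shows “d” as “अ”. The names `FONT_ONE`, `FONT_TWO` and `render` are assumptions for illustration.

```python
# Hypothetical glyph tables for two legacy (non-Unicode) Hindi fonts.
# These are NOT real font mappings; they only illustrate the idea.
FONT_ONE = {"A": "अ", "d": "क"}
FONT_TWO = {"d": "अ", "A": "क"}


def render(stored_text, font):
    """Return what a reader would see for the stored English characters."""
    return "".join(font.get(ch, ch) for ch in stored_text)


stored = "Ad"  # what the computer actually keeps

seen_with_font_one = render(stored, FONT_ONE)
seen_with_font_two = render(stored, FONT_TWO)
print(seen_with_font_one)  # one sequence of Hindi letters
print(seen_with_font_two)  # a different sequence for the very same file

# Searching looks at the stored characters, not at the picture on screen,
# so a search for the Hindi letter finds nothing.
search_hit = "अ" in stored
print(search_hit)
```

The stored text never changes; only the picture drawn for it does. That is why switching fonts scrambles the text and why search cannot find Hindi letters in it.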
Unicode
It is to solve such problems that Unicode came into the picture. You can think of Unicode as something that enables computers to understand characters beyond the English language. So, if your computer supports Unicode, it understands not only the characters corresponding to “A”, “d”, “,” etc. but also the ones for “क्”, “अ” etc. And it’s not just the Indian languages: it also understands the characters of Chinese, Japanese, Arabic, Russian and most other major languages of the world!
So, with Unicode you do not need to pass off a random English character as a Hindi letter. Characters are available for Hindi, and the font can now display those characters as the corresponding Hindi letters. Such fonts are called “Unicode compatible fonts”. To repeat, Unicode compatible Hindi fonts are the ones that do not display an English character as a Hindi letter, but display the Hindi characters as the corresponding Hindi letters.
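To make this concrete, here is a small Python sketch using the standard `unicodedata` module. It prints the number and the official Unicode name for two English and two Hindi characters. The sample word “अब” and the variable names are assumptions for illustration.

```python
import unicodedata

# With Unicode, Hindi letters are characters in their own right,
# with their own numbers and names, just like English letters.
names = {ch: unicodedata.name(ch) for ch in ["A", "d", "अ", "क"]}

for ch, name in names.items():
    print(f"{ch}  U+{ord(ch):04X}  {name}")

# Because the Hindi character itself is stored, search works
# no matter which Unicode compatible font displays the text.
search_hit = "अ" in "अब"
print(search_hit)
```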
Typing and Input Method Editors (IMEs)
So far, so good. The computer, somehow, understands the characters for Hindi and other languages. But how do you type in those languages? Your keyboard still has only English letters on it. So, when you press the key labeled “A”, the computer knows that you want to type the character “a”. But how do you tell the computer that you want to type the character “अ”?
Multi-language keyboards are a design challenge, and at least for Indian languages, nothing great has come out. So, different ways have been devised to use the same English keyboard for entering non-English characters. To understand how these work, consider this: as far as the computer is concerned, “A” and “a” are two different characters. But on the keyboard, the same key is used to type either of them. How? “A” gets typed if either CAPS LOCK is on or the Shift key is pressed. Otherwise, it is “a” that gets typed. So, the computer decides whether the character typed is “A” or “a” depending not only on the key pressed, but also on the state of CAPS LOCK and the Shift key.
Following a similar tactic, we can give the computer some other signal that when the key labeled “A” is pressed, it should enter neither “a” nor “A”, but “अ”. How do we give that signal? There are multiple methods. Basically, computer programs have been created that sit between the keyboard and the part of the computer that stores the characters and, depending on certain signals, tell the computer which character has been entered. These programs are typically called “Input Method Editors (IMEs)”.
These IMEs do two things:
- They give you a way to specify the language you will be typing in
- They assign particular keys on the keyboard to particular characters, depending on the language selected
Two examples of IMEs for Indian languages are:
- Microsoft’s Indic Language IME
- Google’s Indic IME
I have the first one installed on my computer and I’ll use it as an example to illustrate how an IME works. Once I install and configure Microsoft’s Indic IME for Hindi, I get a language selection button on my taskbar.
If I select English here, then things work as usual. If I select Hindi, then pressing Shift+D on my keyboard types “अ” instead of “D”. Pressing “j” types “र” and so on. I can keep switching the language and type a piece of text that uses both languages (as I am doing now).
To use an Indic IME like this, you still need to learn the key combinations that type the right characters for you. These combinations may vary between different IMEs. In fact, even the same IME may offer different options for mapping keys to characters. Microsoft’s Indic IME provides at least two such layouts for Hindi. One is called the “INSCRIPT” layout, which I use (the key-character combinations I described in the previous paragraph were according to this layout). The other is Phonetic and, as the name suggests, its key-character bindings are more phonetic (e.g. “A” will be “अ”, “R” will be “र” and so on).
But the advantage over the earlier scheme of using a non-Unicode compatible font is that once you have learned to use one IME, you can use any Unicode compatible font. You don’t need to learn the map for every font separately. Plus your text is standards-compliant and searchable!
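The two jobs of an IME (choosing a language, and mapping keys to characters for that language) can be sketched roughly as below. This is a toy, not how Microsoft’s IME is actually built. The tables contain only the four key bindings mentioned in this article, and the names `LAYOUTS` and `type_keys` are assumptions for illustration.

```python
# Toy key tables holding only the bindings mentioned in this article.
# "D" stands for Shift+D, "j" for the plain j key, and so on.
LAYOUTS = {
    "inscript": {"D": "अ", "j": "र"},
    "phonetic": {"A": "अ", "R": "र"},
}


def type_keys(keys, language, layout="inscript"):
    """Return the characters stored for a sequence of key presses."""
    if language == "english":
        return keys  # things work as usual
    table = LAYOUTS[layout]
    # Keys missing from this tiny table are passed through unchanged.
    return "".join(table.get(key, key) for key in keys)


print(type_keys("Dj", "english"))
print(type_keys("Dj", "hindi", layout="inscript"))
print(type_keys("AR", "hindi", layout="phonetic"))
```

Note that what comes out is real Hindi characters, not English characters in disguise. Different layouts only change which keys you press to get them.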
Google’s Transliteration Technology – Saviour for beginners!
If you are a beginner with Hindi typing, you would probably want to use an IME with Phonetic key-character combinations, for example one where “A” types “अ”, “R” types “र” etc. It is easier than using other combinations where the mapping may be very arbitrary.
But you still need to learn the exact key combinations for typing something. If you need to type Pothi (पोथी) in Hindi, do you type “Pothi” or “Pothee” or “Pothii”? With most IMEs only one of these will work.
Google’s IME is different here. It works more intelligently. Instead of assigning fixed keys to the characters, it guesses the correct word from the combination you have entered. Basically, from the various words you could possibly be writing, it picks one based on grammatical correctness and frequency of use in the language. If it guesses the wrong word, you have a way to change it to a different word. In Google Transliteration all three of “Pothi”, “Pothee” and “Pothii” produce the same (and correct) word पोथी.
So, you can essentially type words and, so long as what you type is phonetically close, this IME will find the suitable word for you. This makes it a great tool for beginners. You can get started right away, write Hindi the way you do while chatting with your friends or in SMS, and start getting output in a Unicode compatible font.
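The idea of “many spellings, one word” can be illustrated with a toy Python sketch. To be clear, this is not Google’s method, which is not described here; it is a much cruder stand-in that treats “ee” and “ii” as “i” and then looks the result up in a one-word dictionary. The names `WORDS`, `normalize` and `transliterate` are assumptions for illustration.

```python
# A one-word dictionary, keyed by a simplified English spelling.
WORDS = {"pothi": "पोथी"}


def normalize(roman):
    """Collapse spelling variants of the long 'i' sound into a single 'i'."""
    key = roman.lower()
    for variant in ("ee", "ii"):
        key = key.replace(variant, "i")
    return key


def transliterate(roman):
    """Return the Hindi word if we know it, else the input unchanged."""
    return WORDS.get(normalize(roman), roman)


for typed in ["Pothi", "Pothee", "Pothii"]:
    print(typed, "->", transliterate(typed))
```

A fixed-key IME corresponds to a lookup with no `normalize` step, which is why only one exact spelling works there.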
Google actually has an online service for this – http://www.google.co.in/transliterate . So you don’t even have to download and install anything.
All is not well here though. Once you start typing in Hindi regularly, you will start feeling the limitations of Google Transliteration. We will not get into the details here. But if that happens at some point, it may make sense to invest some time in learning another IME, which uses fixed combinations.
Complex Text Layout (CTL)
There is one major difference between Hindi (and most Indian languages) and English; in fact, the same difference exists between Hindi and Russian, Chinese or Japanese. In Hindi and many other Indian languages, the way a character is displayed changes depending on the context. The representation of “द्”, for example, is different in the words तद्भव and विद्या. One has “द्” before “भ” and the other has “द्” before “य”. Compare this to English, where how “d” is displayed does not depend on which letter comes before or after it. So, to display Hindi correctly, the computer needs to understand all the possible ways of displaying the characters in different contexts. The technical term for this is “Complex Text Layout”. Most computers with modern operating systems have this ability now and, in all likelihood, you will not have to do anything special about it. But if you find a display problem where a character is identified correctly but is not displayed correctly, then you will know that the issue is the computer not understanding “Complex Text Layout”.
In Windows XP and Vista, complex text layout is enabled by default. In Windows 2000, you needed to enable it specifically. I have not tested it on Windows 7, but hopefully things have not gone backwards.
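A short Python sketch shows that the difference is only in display. In both words the computer stores exactly the same two characters for “द्”; it is the layout step that draws them differently. The variable names are assumptions for illustration.

```python
half_da = "द्"  # the letter द followed by the sign that removes its vowel
words = ["तद्भव", "विद्या"]

# The stored characters behind "द्", shown as Unicode numbers.
half_da_code_points = [f"U+{ord(ch):04X}" for ch in half_da]
print(half_da_code_points)

# Both words contain exactly this stored sequence,
# even though it looks different on screen in each of them.
found_in = {word: half_da in word for word in words}
for word, found in found_in.items():
    print(word, found)
```

So a search for “द्” finds both words, whatever shape the font and the layout engine give it.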
Finally
The description here is intended for a non-technical audience. Many concepts have been simplified and a purist technical person may be tempted to correct my usage of various terms (“You mean OS when you say computer!”). Let me just clarify that it is totally intentional. I just hope it has not become too technical for the non-technical audience 🙂
Questions are welcome as comments!