19/08/05

Lost in translation: getting India’s languages online

Copyright: WHO/TDR/Crump

By: Frederick Noronha

Imagine getting an email or accessing a website, and finding you cannot read it without downloading extra fonts. Speed — the heart and soul of internet access — suddenly becomes impossible.

This is a daily reality for millions in India, a country that has 18 official languages, 1,652 mother tongues (33 of them spoken by over 100,000 people), and dozens of different scripts (see table).

Each Indian-language script differs from the others, and each can be written in different ways. Some, like Urdu and Sindhi, are written from right to left; others run from left to right. Still others, like Hindi, have extra flourishes that act as vowels or modify pronunciation.
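
For readers curious how these differences look to software, here is a minimal Python sketch. It assumes the text is stored as Unicode, a standard this article does not discuss, and it relies only on Python's built-in unicodedata module. The function name describe and the two sample strings are illustrative choices, not anything used by the projects described here.

```python
import unicodedata


def describe(text):
    """Return (Unicode name, direction class, general category) per character."""
    return [
        (
            unicodedata.name(ch),
            unicodedata.bidirectional(ch),  # "L" = left to right, "AL" = Arabic-script letter (right to left)
            unicodedata.category(ch),       # categories starting with "M" are marks attached to a base letter
        )
        for ch in text
    ]


# Hindi syllable "ki": the consonant KA followed by a dependent vowel sign.
hindi = "\u0915\u093F"
# A single Urdu (Arabic-script) letter, BEH.
urdu = "\u0628"

for name, direction, category in describe(hindi + urdu):
    print(f"{name:28} direction={direction:2} category={category}")
```

The vowel sign is reported as a mark rather than a letter, which is one way of seeing the "flourishes" in Hindi, while the Urdu letter carries a right-to-left direction class.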

Finding ways to make these native tongues available to computer users is vital to bringing communication and information technologies to India’s entire population.

Other non-English-speaking countries face similar problems. Pakistan, for instance, has been seeking software in its national language, Urdu, while Bangladesh wants solutions in Bengali.

Multinational corporations, arguably, are largely responsible for the problem. They rarely bother to translate software into local languages because of the lack of commercial gain: few of the people who speak them can afford expensive software.

Nine out of ten Indians do not speak fluent English, making computing in their native tongue essential
Photo Credit: www.fiveyards.com

Targeted localisation

South Asia isn’t alone. Dwayne Bailey at the non-profit organisation Translate.org.za says South Africa has similar problems.

Bailey and his team are translating ‘open source’ software, which is distributed and modified for free, into all 11 South African languages: Afrikaans, English, Xhosa, Ndebele, Northern Sotho, Siswati, Southern Sotho, Tsonga, Tswana, Venda and Zulu.

All these languages are written using the Latin alphabet, so the task is not as complex as it is in India.

The golden rule, says Bailey, is that applications chosen for translation should be appropriate for a general audience. “Our logic is that [it should benefit] the people whom language would most affect,” he says. “Someone who can program could have probably mastered English already. Localisation must be aimed at the end-user.”

It also needs to take into account the needs of the people who use the software. Ravishankar Shrivastava, for instance, has been writing fiction in Hindi for two decades, but putting his written work into an electronic format that he could submit to publishers has proven difficult.

Shrivastava recalls his excitement when, in the late 1980s, he came across a personal computer that allowed him to type in Hindi. “I thought it was a gift to Hindi speakers.”

But as computing technology progressed, new software that enabled users to write in their Indian mother tongues became more expensive – and also came with limitations.

Shrivastava tried several computer packages for writing in Hindi. Some were too time-consuming, demanding that you press several keys to type a single character. Other programs made it impossible to exchange text unless the person on the receiving end had the same software.

In India’s dotcom surge of the late 1990s, various Hindi newspapers went online. But they all used different, incompatible fonts.

Computer icons with Hindi application names on a PC desktop
Photo Credit: IndLinux

The open source connection

The open source avenue could be one way out of the problem. Cutting across traditional South Asian rivalries and distrust, groups of open source enthusiasts — some in India — are talking to each other on how they can collaborate to build solutions.

For Shrivastava, this involves the ‘Indian Linux’ or IndLinux project. Linux is an open source operating system that has been around since 1991. The people working on IndLinux want to tailor it to local Indian languages.

They have teams working on Bengali, Gujarati, Gurmukhi, Hindi, Kannada, Malayalam, Marathi, Oriya, Tamil and Telugu.

“We want to make technology accessible to the majority of India that does not speak English,” says G. Karunakar, a volunteer at IndLinux.

So far, the project has designed operating systems in Bengali and Tamil, and the Hindi version is nearly complete.

But there have been complications. For instance, the alphabets are laid out differently on different keyboards.

And even languages spoken by hundreds of millions, such as Hindi, were devoid of IT terms. When these terms were introduced, other difficulties arose. Take, for instance, the commonly used computer term ‘file’. This alone was called faeel, suchika, sanchika or reti by different translators.
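
The terminology problem described above lends itself to a simple automated check. The following Python sketch is only an illustration under stated assumptions: the function find_inconsistent_terms, the list of volunteer submissions and the "folder" entry are hypothetical, and nothing here is taken from IndLinux's actual tooling. It groups translations by their English source term and reports any term that has been rendered in more than one way.

```python
from collections import defaultdict


def find_inconsistent_terms(translations):
    """Return source terms that were translated in more than one way.

    `translations` is an iterable of (english_term, translated_term) pairs.
    English terms are compared case-insensitively.
    """
    renderings = defaultdict(set)
    for source_term, translated in translations:
        renderings[source_term.lower()].add(translated)
    return {
        term: sorted(forms)
        for term, forms in renderings.items()
        if len(forms) > 1
    }


# Hypothetical submissions; the four renderings of "file" are those quoted in the article.
submissions = [
    ("file", "faeel"),
    ("File", "suchika"),
    ("file", "sanchika"),
    ("file", "reti"),
    ("folder", "folder"),  # invented entry with a single, consistent rendering
]

print(find_inconsistent_terms(submissions))
# prints {'file': ['faeel', 'reti', 'sanchika', 'suchika']}
```

A shared glossary could then record one agreed rendering for each flagged term.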

Another obstacle has been keeping continuity among the volunteers doing the translation. Volunteers would join the venture with great enthusiasm, translate a dozen strings, and promise to do more, only to move on after realising that “translation is a tedious, thankless, glamourless, revenueless, highly boring job”, as Shrivastava puts it.

A rural revolution

Shrivastava believes that Indian-language computing will revolutionise a rural India where English is practically non-existent. States like Kerala and Madhya Pradesh are already introducing ‘e-governance’ projects based in local languages.

As part of Kerala’s campaign to familiarise its citizens with electronic communication, ‘e-centres’ are being set up throughout the state. These will be connected to the Internet and linked through to a central operating centre. The goal is for at least one person in each family in Kerala to become computer-literate.

Meanwhile, Madhya Pradesh is introducing an online initiative called ‘Gramsampark’, meaning ‘village contact’. The website offers information in local languages on how the state is governed, and is available to all 51,000 villages in Madhya Pradesh.

Getting the message out

Karunakar believes IndLinux’s major challenge will be to make sure the work is widely used. This means identifying the actual users, reaching them, and finding those who stay away from computers because of language barriers.

“We need to properly package the whole thing in a simple, installable format and easy-to-use interface,” he says.

Some Indian languages, such as Urdu, Kashmiri, Konkani, Manipuri and Sindhi, have yet to be tackled by the open source taskforce.

Looming challenges

There is a lot of work ahead for the IndLinux team. For now, translations of the basic interface into several local languages have been completed. Next, the volunteers need to work on user manuals, ‘help’ files and more.

Many hands make light work, and IndLinux’s current band of volunteers believe there are too few of them involved in this ambitious task. Beyond that, they need support — both financial and moral.

Surprisingly, rather than one of the more important, government-supported languages such as Hindi, the south Indian language Tamil was the first to be localised through open source. It was only then followed by Hindi.

Those working in Indian-language computing suggest the speedy translation of software into Tamil might be explained, in part, by the work put in by expatriate Tamil-speaking communities settled in places like Malaysia, North America, Singapore and Sri Lanka.

Om Vikas, head of the Indian government’s Human Centred Computing Division at its Ministry of Communication and Information Technology in New Delhi, says that IndLinux has also spawned Indix2 — a compact disc of Indian-language software solutions — that supports 11 languages.

Vikas says the effort’s weak point is that there is no single national-level consortium pushing it forward and creating standards.

He says creating such a consortium should be a top priority, and that part of its role should be to deploy the translated products immediately in schools under central government administration.

Vikas says implementation efforts are slow, but “satisfactory” for more important languages like Hindi, Tamil and Bengali.

Overall, it’s a tall task. But if India is to maintain its position as one of the IT leaders in the world, it has no choice but to win this battle.

Language	Number of speakers
Hindi	340 million
Bengali	70 million
Telugu	66 million
Marathi	62 million
Tamil	53 million
Urdu	43 million
Gujarati	40 million
Kannada	32 million
Malayalam	30 million
Oriya	28 million
Punjabi	23 million
Assamese	13 million
Source: Malayala Manorama Yearbook