Software boosts online access to Indian scripts

[NEW DELHI] Scientists from India and the United States have developed new software that speeds up the accurate computer interpretation of texts written in Devanagari — a centuries-old script that is still used as the basis of a large number of contemporary Indian languages.

The software eliminates the manual checking of digital interpretations of Indian scripts that existing conversion programmes require. According to its developers, it could be an important step towards bridging the digital divide between developed and developing nations.

Devanagari is the script used in the main Indian language Hindi and in Marathi, which is spoken in western India. It is also used in Nepali, which is spoken in neighbouring Nepal.

“The half billion people around the world whose main language is Hindi or based on Devanagari are totally missing out on the information revolution,” says Venugopal Govindraju, professor of computer science and engineering at the University at Buffalo in the United States, whose team developed the software in collaboration with the Indian Statistical Institute (ISI) in Kolkata.

Converting any written or printed text into a digital format acceptable to computers requires optical character recognition (OCR) software, which allows computers to interpret the images of a particular alphabet, using numerous scanned images of characters and words.

About 10 years ago, a team led by Bidyut Baran Chaudhuri from the ISI developed OCR systems for Devanagari and two east Indian languages, Bengali and Oriya. The accuracy was 97 per cent and errors have, until now, been checked manually.

The new software is a kind of “benchmarking system” that replaces manual verification with computerised verification. “Benchmark softwares are useful for testing OCR systems,” says Chaudhuri.

The scientists constructed a dataset of 400 pages of Hindi and Sanskrit documents from ancient and contemporary books and periodicals that are representative of the huge variety of documents available in these languages. They used their new software to record information about these documents that indicates how OCR for Devanagari should interpret each word.
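The idea of checking OCR output against a recorded reference can be illustrated with a short sketch. It is not the researchers' software: the function names `word_accuracy` and `mismatches` and the sample sentences are hypothetical, and the sketch simply assumes that a correct word-by-word transcription (the "ground truth") exists for each page.

```python
from difflib import SequenceMatcher


def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Return the fraction of ground-truth words the OCR output reproduces in order."""
    truth_words = ground_truth.split()
    ocr_words = ocr_output.split()
    if not truth_words:
        return 1.0
    # Align the two word sequences; autojunk is disabled so no word is ignored.
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


def mismatches(ground_truth: str, ocr_output: str):
    """Return (expected words, OCR words) pairs wherever the two texts differ."""
    truth_words = ground_truth.split()
    ocr_words = ocr_output.split()
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    return [
        (truth_words[i1:i2], ocr_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical example: the OCR misreads one of five words.
truth = "यह एक परीक्षण वाक्य है"
ocr = "यह एक परीक्षन वाक्य है"
print(word_accuracy(truth, ocr))  # 0.8
print(mismatches(truth, ocr))
```

Run over a whole dataset, a score like this would stand in for a person reading each page to spot errors, which is the kind of manual step the benchmarking approach aims to replace.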

In Indian languages there are more than 50 characters in each alphabet, compared with 26 in English. In addition, the shape of a character changes depending on the vowel accompanying the consonant, giving rise to about 350 differently shaped characters in each language. Furthermore, some characters are joined to varying degrees at the top, making it difficult to apply OCR accurately. Standardisation of characters’ shapes and spellings is 20 times more difficult in Indian languages than in English.
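How a vowel changes the shape of a consonant can be seen in the way Unicode encodes Devanagari: a consonant is followed by a separate vowel sign, and the two are drawn together as one shape. The sketch below is only an illustration of that encoding, not of how the researchers' software represents text; the helper name `syllables` and the choice of five vowel signs are assumptions made for the example.

```python
import unicodedata

KA = "\u0915"  # DEVANAGARI LETTER KA
# Five dependent vowel signs: AA, I, II, U, UU (an arbitrary sample for illustration).
VOWEL_SIGNS = ["\u093E", "\u093F", "\u0940", "\u0941", "\u0942"]


def syllables(consonant: str, signs: list[str]) -> list[str]:
    """Combine one consonant with each vowel sign to form written syllables."""
    return [consonant + sign for sign in signs]


for syllable in syllables(KA, VOWEL_SIGNS):
    # Each syllable displays as a single shape but is stored as two code points.
    names = [unicodedata.name(ch) for ch in syllable]
    print(syllable, len(syllable), names)
```

Each printed syllable looks different on the page even though the underlying consonant is the same, which is why an OCR system has to recognise many more shapes than there are letters in the alphabet.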

The scientists plan to extend the use of the new software to south Indian languages that do not use Devanagari, as well as Urdu and Arabic, under a project funded by a US$487,000 grant from the National Science Foundation’s International Digital Libraries Initiative. Eventually it may make ancient Sanskrit texts available online.

Details of the new software were presented by Govindraju at an International Workshop on Research issues on Data Engineering in Hyderabad in March and will eventually be available for free on the Web.
