Software boosts online access to Indian scripts
The software eliminates the manual checking of digital interpretations of Indian scripts that existing conversion programmes require. According to its developers, it could be an important step towards bridging the digital divide between developed and developing nations.
Devanagari is the script used in the main Indian language Hindi and in Marathi, which is spoken in western India. It is also used in Nepali, which is spoken in neighbouring Nepal.
“The half billion people around the world whose main language is Hindi or based on Devanagari are totally missing out on the information revolution,” says Venugopal Govindraju, professor of computer science and engineering at the University at Buffalo in the United States, whose team developed the software in collaboration with the Indian Statistical Institute (ISI) in Kolkata.
Converting written or printed text into a digital format that computers can process requires optical character recognition (OCR) software, which allows computers to interpret the images of a particular alphabet, using numerous scanned images of characters and words.
About 10 years ago, a team led by Bidyut Baran Chaudhuri from the ISI developed OCR systems for Devanagari and two east Indian languages, Bengali and Oriya. The accuracy was 97 per cent and errors have, until now, been checked manually.
The new software is a kind of “benchmarking system” that replaces manual verification with computerised verification. “Benchmark softwares are useful for testing OCR systems,” says Chaudhuri.
The scientists constructed a dataset of 400 pages of Hindi and Sanskrit documents from ancient and contemporary books and periodicals that are representative of the huge variety of documents available in these languages. They used their new software to record information about these documents that indicates how OCR for Devanagari should interpret each word.
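As a rough illustration of what computerised verification against such a dataset can look like, the Python sketch below scores an OCR result against a known-correct transcription. It is not the researchers' software: the function name `word_accuracy`, the sample sentences and the choice of word-level alignment are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Return the fraction of ground-truth words reproduced by the OCR output.

    Words are aligned in order, so a single wrong word does not
    penalise the correctly recognised words that follow it.
    """
    truth_words = ground_truth.split()
    ocr_words = ocr_output.split()
    if not truth_words:
        # An empty reference is only matched by an empty OCR result.
        return 1.0 if not ocr_words else 0.0
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


# Hypothetical sample: the OCR output gets one word out of five wrong.
truth = "यह एक परीक्षण वाक्य है"
ocr = "यह एक परीक्षा वाक्य है"
score = word_accuracy(truth, ocr)
print(f"word accuracy: {score:.0%}")
```

Run over every page of a reference dataset, a score of this kind can stand in for a person comparing the OCR output with the printed page by eye.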
In Indian languages there are more than 50 characters in each alphabet, compared with 26 in English. In addition, the shape of a character changes depending on the vowel accompanying the consonant, giving rise to about 350 differently shaped characters in each language. Furthermore, some characters are joined to varying degrees at the top, making it difficult to apply OCR accurately. Standardising the shapes and spellings of characters is 20 times more difficult in Indian languages than in English.
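The effect of vowels on character shape can be seen directly in Unicode, where a Devanagari consonant followed by a dependent vowel sign is drawn as a single combined shape. The Python sketch below builds these forms for one consonant and counts them across a consonant set. The lists are assumptions chosen for illustration (33 consonants commonly taught for Hindi and 10 common vowel signs); the count is only meant to show how a figure of that order can arise, not to reproduce the article's estimate, which would also depend on conjunct consonants and on what counts as a distinct shape.

```python
import unicodedata

# Assumption: consonants KA..HA, minus four letters not in the usual Hindi set.
EXCLUDED = {0x0929, 0x0931, 0x0933, 0x0934}
CONSONANTS = [chr(cp) for cp in range(0x0915, 0x093A) if cp not in EXCLUDED]

# Assumption: ten commonly used dependent vowel signs (matras).
VOWEL_SIGNS = [chr(cp) for cp in (0x093E, 0x093F, 0x0940, 0x0941, 0x0942,
                                  0x0943, 0x0947, 0x0948, 0x094B, 0x094C)]


def syllable_forms(consonant: str) -> list[str]:
    """Return the bare consonant plus the consonant with each vowel sign."""
    return [consonant] + [consonant + sign for sign in VOWEL_SIGNS]


ka_forms = syllable_forms("\u0915")  # DEVANAGARI LETTER KA
total_forms = sum(len(syllable_forms(c)) for c in CONSONANTS)

print(" ".join(ka_forms))
for sign in VOWEL_SIGNS[:3]:
    print(f"U+{ord(sign):04X}", unicodedata.name(sign))
print(len(CONSONANTS), "consonants x", len(VOWEL_SIGNS) + 1, "forms =", total_forms)
```

Under these assumptions the total is 33 × 11 = 363 consonant-vowel forms, each of which an OCR system has to tell apart from the others.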
The scientists plan to extend the use of the new software to south Indian languages that do not use Devanagari, as well as to Urdu and Arabic, under a project funded by a US$487,000 grant from the National Science Foundation’s International Digital Libraries Initiative. Eventually it may make ancient Sanskrit texts available online.
Govindraju presented details of the new software at an International Workshop on Research Issues on Data Engineering in Hyderabad in March; they will eventually be available for free on the web.