Text processing with Indian languages.
In the context of text processing
with Indian languages, the basic quantum of information
to be processed is a syllable. The writing systems of India are based on
syllables. Computation with text in Indian languages is hence a question
of working with syllables. The representation of a syllable in the computer
assumes significance in this context.
Text processing algorithms
have generally been written for English, since most of computing has been
based on the English language and most of the information available
electronically is in English. These algorithms work on one character
at a time. Text is represented as a string of characters specified through
codes (typically ASCII) for the letters of the alphabet and special symbols.
For example, an algorithm to check whether a word is a palindrome simply
reverses the string and compares it with the original. The length of a word
is specified as the number of characters in the word.
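As a minimal sketch of the character-at-a-time approach described above, the
following Python fragment checks a word for a palindrome by reversing the
string, and reports its length in characters. The function names are
illustrative and are not taken from any particular library.

```python
def is_palindrome(word: str) -> bool:
    """Reverse the string and compare it with the original."""
    return word == word[::-1]


def length_in_characters(word: str) -> int:
    """Length of a word counted as the number of characters in it."""
    return len(word)


if __name__ == "__main__":
    for word in ("level", "syllable"):
        print(word, is_palindrome(word), length_in_characters(word))
```

For English text held as ASCII, one character is one code, so both functions
behave as a reader would expect.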
The approach required for
Indian languages has to be different since all processing has to be done
with syllables. Text in any Indian language is reckoned only in this manner
and syllable identification is critical to determining the linguistic content.
Therefore the algorithm to identify a syllable gains significance.
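One way to picture syllable identification is sketched below for Devanagari
text stored as Unicode code points. It is only an approximation under stated
assumptions: a syllable is taken to be either an independent vowel, or a
consonant cluster (consonants joined by the virama) with an optional vowel
sign, followed by optional modifiers such as the anusvara or visarga. The
pattern and the names split_syllables and is_syllable_palindrome are
illustrative. They are not the algorithm used by the IITM software, and they
ignore many script-specific cases.

```python
import re

# Approximate Devanagari character classes (assumptions for illustration).
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                                  # suppresses the inherent vowel
VOWEL_SIGN = "[\u093E-\u094C]"                     # dependent vowel signs
INDEPENDENT_VOWEL = "[\u0904-\u0914]"
MODIFIER = "[\u0900-\u0903]"                       # candrabindu, anusvara, visarga

SYLLABLE = re.compile(
    "(?:"
    f"(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{VOWEL_SIGN}|{VIRAMA})?"
    f"|{INDEPENDENT_VOWEL}"
    f"){MODIFIER}*"
)


def split_syllables(text: str) -> list[str]:
    """Return the approximate syllables found in the text, in order."""
    return SYLLABLE.findall(text)


def is_syllable_palindrome(word: str) -> bool:
    """Palindrome test that reverses syllables instead of code points."""
    syllables = split_syllables(word)
    return syllables == syllables[::-1]


if __name__ == "__main__":
    sanskrit = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924"  # "samskrta"
    print(len(sanskrit), len(split_syllables(sanskrit)))    # 7 code points, 3 syllables

    navajivan = "\u0928\u0935\u091C\u0940\u0935\u0928"       # na-va-jii-va-na
    print(navajivan == navajivan[::-1])                      # False at code point level
    print(is_syllable_palindrome(navajivan))                 # True at syllable level
```

The second word reads the same in both directions when taken syllable by
syllable, yet a reversal of its code points does not match the original,
which is why the unit of processing matters.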
Regrettably, the approaches
to representing text in Indian languages do not lend themselves to easy
implementations of text processing algorithms. There have been virtually
no accepted standards for coding schemes, though one is constantly reminded
of ISCII, Unicode or even font-based schemes.
While ISCII and Unicode
have been shown to be viable in implementations, they suffer from fairly
serious problems in representing syllables unambiguously. For effective text
processing it is desirable to use codes of fixed size for a syllable.
Fixed-length codes lend themselves to easy processing through "Regular
Expression Matching", which is the very basis of text processing. Both
ISCII and Unicode are variable-length codes as far as a syllable is
concerned. Moreover, with these codes
the display of a syllable involves decision making in the application
handling the text. This leads to a situation where the same set of
syllables gets rendered differently by different applications. When the
display of a syllable cannot be traced back to the syllable without
ambiguity, string processing of displayed text suffers.
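The variable-length point can be illustrated with a short sketch. In Unicode,
a Devanagari syllable may occupy one code point or several, so a regular
expression that matches "one unit" matches a code point and not a syllable.
The fragment below shows this, and then shows a purely hypothetical
fixed-size scheme in which each distinct syllable is assigned one integer
code. The function assign_fixed_codes is an assumption made for illustration
and is not the coding scheme used by the IITM software.

```python
import re

KA = "\u0915"                      # one syllable, one code point
KI = "\u0915\u093F"                # one syllable, two code points
KSHI = "\u0915\u094D\u0937\u093F"  # one syllable, four code points

# Code points and UTF-8 bytes needed for each single syllable.
for syllable in (KA, KI, KSHI):
    print(len(syllable), len(syllable.encode("utf-8")))

# "." matches one code point, so it cuts the syllable into four pieces.
pieces = re.findall(".", KSHI)


def assign_fixed_codes(syllables):
    """Hypothetical scheme: give each distinct syllable one integer code."""
    table = {}
    for syllable in syllables:
        table.setdefault(syllable, len(table))
    return table


codes = assign_fixed_codes([KA, KI, KSHI, KI])
encoded = [codes[s] for s in (KSHI, KI, KA)]  # one code per syllable
```

With one code per syllable, the length of the encoded list is the number of
syllables, and matching or reversing operates on whole syllables.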
Leaving these problems aside,
the following are representative of the computations one would
perform from a linguistic point of view:
String processing and pattern matching (regular expressions)
Indexing text and generating concordances
Search applications (including searches on the web)
Database applications (MySQL, SQL, etc.)
Grammatical analysis of text (e.g., morphological analysis)
Parsing text and translation
Taggers and generating linguistic corpora
Frequency of occurrence of syllables
Transliteration across scripts
On-the-fly conversion of text into different formats (images, PDF, etc.)
Web interfaces to Indian languages
Text processing applications available with the IITM Software.
The syllable frequency count
application is a particularly useful one. This specially written application
takes care of alternate forms of writing a syllable (linguistically
equivalent but differing in appearance). The results of using the application
on different texts in Sanskrit and Tamil can be seen in the linked page.
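The idea of a frequency count that folds alternate forms together can be
sketched as follows. This is not the IITM application. As a stand-in for its
handling of alternate written forms, the sketch uses Unicode canonical
normalization (NFC), which only covers alternate code sequences for the same
letter, and it uses an approximate Devanagari syllable pattern. The pattern
and the function name syllable_frequencies are assumptions made for
illustration.

```python
import re
import unicodedata
from collections import Counter

# Approximate syllable: consonant cluster plus optional vowel sign or
# virama, or an independent vowel, followed by optional modifiers.
SYLLABLE = re.compile(
    "(?:(?:[\u0915-\u0939]\u093C?\u094D)*[\u0915-\u0939]\u093C?[\u093E-\u094D]?"
    "|[\u0904-\u0914])[\u0900-\u0903]*"
)


def syllable_frequencies(text: str) -> Counter:
    """Count syllables after folding canonically equivalent sequences."""
    normalized = unicodedata.normalize("NFC", text)
    return Counter(SYLLABLE.findall(normalized))


if __name__ == "__main__":
    # "kamala kama": ka and ma occur twice, la once.
    print(syllable_frequencies("\u0915\u092E\u0932 \u0915\u092E"))

    # U+0958 and U+0915 U+093C are two encodings of the same letter;
    # after normalization they are counted as one syllable form.
    print(syllable_frequencies("\u0958 \u0915\u093C"))
```

Normalizing before counting keeps two encodings of the same syllable from
being reported as separate entries.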
The applications which perform
on-the-fly conversion of text into different formats will be very useful
for serving content on the web, where the most appropriate format for the
content can be decided before it is sent to the browser.
The "Learn Sanskrit through self study" lessons at this site have become
popular all over the world since they can be viewed on almost any browser.
Here the lessons are sent in the form of images, converted on the fly when
the browser requests a page containing Devanagari text.
Search applications are easy
to implement using the software base developed at IITM. The fixed-size
syllable-level code makes string processing much simpler. In
fact, conventional indexing software such as Swish-E can be used directly
to index the local language text prepared by the IITM editor or equivalent
software.