Indexing text and generating concordances
The syllable-level coding scheme used in the IITM software lends itself
to direct use with the algorithms used in indexing text. Text is usually
indexed by hashing methods that handle collisions between keys.
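As a rough illustration of hash-based indexing, the sketch below builds a positional index and a simple concordance (keyword in context) over a list of words. It is not the IITM implementation: the function names `build_index` and `concordance` and the `width` parameter are hypothetical, and the words are assumed to be already-encoded strings. Python's built-in dictionary is a hash table that resolves collisions internally, so it stands in for the hashing step.

```python
from collections import defaultdict


def build_index(words):
    """Map each word to the list of positions where it occurs.

    The dict is a hash table; collision handling is done by Python itself.
    """
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word].append(pos)
    return index


def concordance(words, index, key, width=2):
    """Return (left context, key, right context) for every occurrence of key.

    `width` (hypothetical parameter) is the number of context words per side.
    """
    lines = []
    for pos in index.get(key, []):
        left = words[max(0, pos - width):pos]
        right = words[pos + 1:pos + 1 + width]
        lines.append((left, key, right))
    return lines
```

With this layout a lookup costs one hash probe, and each concordance line is recovered from the stored positions without rescanning the text.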
The syllable-level codes used in the IITM-developed software are 16 bits
long, one code per syllable. The 16 bits are divided into three fields of
6, 5 and 4 bits, preceded by the most significant bit. Each fixed-length
field can be mapped to a standard ASCII character, so that a syllable
becomes a 3-byte string. The bytes of this string correspond to the
consonant, conjunct and vowel values of the syllable.
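The field split described above can be sketched as follows. Two things here are assumptions rather than facts from the IITM software: that the 6-, 5- and 4-bit fields hold the consonant, conjunct and vowel in that order, and that each field is shifted into the ASCII range by adding a base of 0x30 (the constant `ASCII_BASE` is hypothetical). The function names are likewise illustrative.

```python
ASCII_BASE = 0x30  # assumed offset; the actual mapping to ASCII is not given


def split_syllable(code):
    """Split a 16-bit syllable code into (msb, consonant, conjunct, vowel).

    Assumed layout, from most to least significant bit:
    1 bit | 6 bits consonant | 5 bits conjunct | 4 bits vowel
    """
    msb = (code >> 15) & 0x01
    consonant = (code >> 9) & 0x3F
    conjunct = (code >> 4) & 0x1F
    vowel = code & 0x0F
    return msb, consonant, conjunct, vowel


def syllable_to_bytes(code):
    """Map the three fields to a 3-byte ASCII string (the msb is dropped)."""
    _, consonant, conjunct, vowel = split_syllable(code)
    return bytes([ASCII_BASE + consonant,
                  ASCII_BASE + conjunct,
                  ASCII_BASE + vowel])
```

With a base of 0x30 the largest field value (63) maps to 0x6F, so every byte stays within 7-bit ASCII.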
The scheme allows standard regular expression matching to be applied
for string processing of the text. An interesting aspect of this
representation is that it allows string search with the vowel or conjunct
masked. Such masking is useful when the string being searched for may
contain errors in the vowel or conjunct part.
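A masked search of this kind might look like the sketch below, which uses Python's standard `re` module on byte strings. It assumes text stored as consecutive 3-byte syllables in consonant, conjunct, vowel order. The helper names `masked_pattern` and `find_aligned` are hypothetical. A masked position is replaced by `.`, which matches any byte, and matches that do not start on a syllable boundary are discarded.

```python
import re


def masked_pattern(syllables, mask_vowel=False, mask_conjunct=False):
    """Compile a bytes regex for a sequence of 3-byte syllables.

    A masked position becomes '.', so any value is accepted there.
    """
    parts = []
    for syl in syllables:
        consonant = re.escape(syl[0:1])
        conjunct = b"." if mask_conjunct else re.escape(syl[1:2])
        vowel = b"." if mask_vowel else re.escape(syl[2:3])
        parts.append(consonant + conjunct + vowel)
    return re.compile(b"".join(parts), re.DOTALL)


def find_aligned(pattern, text):
    """Return syllable indices of matches that start on a 3-byte boundary."""
    hits = []
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            break
        if match.start() % 3 == 0:
            hits.append(match.start() // 3)
        pos = match.start() + 1  # step by one byte so overlaps are not missed
    return hits
```

For example, searching the text `b"532541532"` for the syllable `b"530"` finds nothing exactly, but with the vowel masked it matches the first and third syllables.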
The indexing software developed at IITM works in the following way.
Create the required local
language files and organize them into a meaningful directory structure.
The IITM Multilingual Editor could be used for this purpose or conversion
utilities could be used to convert Indian language text in other formats
into the .llf form.
Standard indexing programs (e.g., Swish-e) may then be used to index
the text. A simple front end can then be used to pass user queries to
the created index.
The search applications hosted on this site (Gita, Kural) have been
created using the approach discussed here.