Text processing: Indexing and Concordance Generation

Indexing text and generating concordances
The syllable-level coding scheme used in the IITM software lends itself to direct use with the algorithms commonly used for indexing text. Text is usually indexed with hashing methods that include collision avoidance.
The syllable-level codes used in the IITM-developed software are 16 bits long, one code per syllable. Following the most significant bit, the remaining 15 bits are divided into three fields of 6, 5 and 4 bits. Each fixed-length field can be mapped to a standard ASCII character, so a syllable becomes a 3-byte string whose bytes correspond to the consonant, conjunct and vowel values of the syllable.
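The field layout described above can be sketched in Python. This is only an illustration: the source gives the field widths and their order, but the exact bit positions, the role of the most significant bit, and the ASCII mapping used here (adding a hypothetical offset of 0x30 so each field lands in the printable range) are assumptions, not the IITM software's actual tables.

```python
from typing import Tuple

# Field widths as stated in the text: 6, 5 and 4 bits after the top bit.
CONSONANT_BITS, CONJUNCT_BITS, VOWEL_BITS = 6, 5, 4

# Assumption for illustration only: shift each field into printable ASCII.
ASCII_OFFSET = 0x30


def split_code(code: int) -> Tuple[int, int, int]:
    """Split a 16-bit syllable code into (consonant, conjunct, vowel).

    Assumed layout: bit 15 is the leading bit (ignored here),
    bits 14-9 consonant, bits 8-4 conjunct, bits 3-0 vowel.
    """
    if not 0 <= code <= 0xFFFF:
        raise ValueError("syllable code must fit in 16 bits")
    vowel = code & ((1 << VOWEL_BITS) - 1)
    conjunct = (code >> VOWEL_BITS) & ((1 << CONJUNCT_BITS) - 1)
    consonant = (code >> (VOWEL_BITS + CONJUNCT_BITS)) & ((1 << CONSONANT_BITS) - 1)
    return consonant, conjunct, vowel


def code_to_ascii(code: int) -> str:
    """Map one syllable code to a 3-character ASCII string."""
    return "".join(chr(ASCII_OFFSET + field) for field in split_code(code))
```

Under these assumptions, a code built as `(5 << 9) | (3 << 4) | 2` splits into the fields 5, 3 and 2, and every syllable occupies exactly three characters, which keeps syllable boundaries at fixed offsets in the resulting string.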

The scheme allows standard regular expression matching to be applied when processing the text as strings. An interesting consequence of the representation is that a string search can be run with the vowel or conjunct masked. Such masking is useful when the string being searched for may contain errors in the vowel or conjunct part.
The indexing software developed at IITM works as follows.
Create the required local language files and organize them into a meaningful directory structure. The IITM Multilingual Editor can be used for this purpose, or conversion utilities can be used to convert Indian language text in other formats into the .llf form.
Standard indexing programs (e.g., Swish-e) may then be used to index the text. A simple front end can then connect the user to the created index.
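A masked search of this kind might look like the following sketch. It assumes the text has already been converted to 3-character syllable strings (consonant, conjunct, vowel, in that order); the function names are hypothetical and the example strings are made up for illustration. A masked field is simply replaced by the regular expression wildcard `.`.

```python
import re


def masked_pattern(query: str, mask_conjunct: bool = False,
                   mask_vowel: bool = False) -> "re.Pattern":
    """Build a regex from a syllable string, optionally masking fields."""
    if len(query) % 3 != 0:
        raise ValueError("query must be a whole number of 3-character syllables")
    parts = []
    for i in range(0, len(query), 3):
        consonant, conjunct, vowel = query[i:i + 3]
        parts.append(re.escape(consonant))  # the consonant is always matched exactly
        parts.append("." if mask_conjunct else re.escape(conjunct))
        parts.append("." if mask_vowel else re.escape(vowel))
    return re.compile("".join(parts), re.DOTALL)


def find_syllable_aligned(pattern: "re.Pattern", text: str) -> list:
    """Return syllable indexes where the pattern matches on a syllable boundary."""
    return [pos // 3 for pos in range(0, len(text), 3)
            if pattern.match(text, pos)]
```

For example, with the made-up text `"532611537"` (three syllables), the query `"530"` finds nothing when matched exactly, but with `mask_vowel=True` it matches the first and third syllables, which differ from the query only in the vowel field. Stepping through the text three characters at a time keeps matches from straddling two syllables.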
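To make the idea of a hash-based index and a concordance concrete, here is a minimal in-memory sketch. It is not Swish-e and not the IITM software: a Python dictionary (a hash table that resolves collisions internally) stands in for the indexer, documents are plain whitespace-separated strings rather than .llf files, and all names are hypothetical.

```python
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each word to a list of (document name, word position) pairs."""
    index = defaultdict(list)
    for name, text in documents.items():
        for position, word in enumerate(text.split()):
            index[word].append((name, position))
    return index


def concordance(documents: dict, index: dict, word: str, width: int = 2) -> list:
    """Return (document, left context, word, right context) for every hit."""
    lines = []
    for name, position in index.get(word, []):
        words = documents[name].split()
        left = " ".join(words[max(0, position - width):position])
        right = " ".join(words[position + 1:position + 1 + width])
        lines.append((name, left, word, right))
    return lines
```

A front end in this sketch would only need to pass the user's search term to `concordance` and display each hit with its surrounding words. In the syllable-coded setting, the words would be the 3-byte-per-syllable strings rather than ordinary text.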

The search applications hosted on this site (Gita, Kural) have been created using the approach discussed here.


Last updated on 08/17/20