The approach to creating the Concordance
When an index is generated for a book, it takes the form
of an alphabetical list in which each word is accompanied by where it appears
in the book (typically the page number). When a concordance has to be
generated for an electronic document, page numbers do not make
sense. Web applications that provide search features usually
identify the web page in which the word appears.
The volumes of Deivathin Kural are available as printed books and
the different chapters of the volumes are also served in electronic
form from the known authoritative site www.kamakoti.org. Hence
generating a concordance poses difficulties in identifying a page
number as well as a web page reference depending on the needs
of the person searching for information.
This has been handled at this site as explained below.
Common words in use are not included in an index as the number of
occurrences of such words will be huge. It may be kept in mind that
the number of words in the volumes of Deivathin Kural ranges from 1,30,000
to as many as 2,10,000.
Therefore one has to provide for possible ways to restrict
the search to special words and also words which are emphasized in
text.
The group attempting to handle this decided to take an approach that
would reflect reasonably well the nature of the query that someone
has in mind. Typically one might be looking for words from scriptures,
names of places, etymological derivations, proverbs and the like.
These would have to be culled out from the huge list in each volume.
The selection mechanism suited for the above was based on the
following assumptions. The assumptions
are reasonable but may not conform to accepted norms
followed by linguistic experts.
Words with fewer than 3 aksharas are most likely to be common words
and may not merit a search unless they happen to be special, say
as seen in quotation marks when dealing with topics on grammar.
Words from scriptures, historical anecdotes, names of Kings and names
of places generally tend to have 6 or more aksharas.
Words between 3 and 6 aksharas may include many common words
and possibly words of interest in specific contexts.
Words in English are likely to be significant when they are used in the
text to explain or give a meaning to a word in Tamil or other languages.
Besides, these are essential when references are made to authoritative
sources written in English.
Words or short phrases in quotes also become candidates for search
as the quotation marks imply some sort of importance for their
presence in a sentence.
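The akshara-count assumptions above can be illustrated with a small sketch. It is not the program used for this site: it assumes Unicode Tamil input and approximates an akshara as a base character plus any combining marks that follow it, whereas the site's programs worked on a special coding scheme. The function names and category labels are hypothetical, and the boundaries (3 and 7 aksharas) are the ones used for the generated lists on this page.

```python
import unicodedata

SHORT_MIN = 3  # assumed lower bound for "short" words (3 to 6 aksharas)
LONG_MIN = 7   # assumed lower bound for "long" words (7 or more aksharas)

def split_aksharas(word):
    """Approximate akshara split: a base character plus trailing combining marks."""
    clusters = []
    for ch in word:
        # Vowel signs and the pulli are combining marks (Unicode categories M*),
        # so they are attached to the preceding base character.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def classify(word):
    """Return 'common', 'short' or 'long' according to the akshara count."""
    count = len(split_aksharas(word))
    if count < SHORT_MIN:
        return "common"
    if count < LONG_MIN:
        return "short"
    return "long"

print(split_aksharas("அம்மா"))   # ['அ', 'ம்', 'மா']
print(classify("அம்மா"))         # short
print(classify("திருவண்ணாமலை"))  # long
```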
Keeping all this in mind, computer programs were written to analyse
the text and generate the following. The source for analysis for each
volume consisted of electronic versions of the volume coded in a
special way to identify an akshara uniquely and process it using
conventional methods of string matching. The coding scheme was
developed at IIT Madras during the 1990s to provide fast text
processing of multilingual text in Indian languages.
Full list of all the words in each volume
List of all words with 7 or more aksharas
List of all words with 3 to 6 aksharas
List of all quoted words and short quoted phrases having 2 or 3 words
List of all quoted longer phrases having 3-20 words
List of all English words
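As an illustration of how the quoted and English lists might be pulled out of the text, here is a minimal sketch using regular expressions. It is only an assumption about one possible approach, working on plain Unicode text rather than the coded text used for the site. The names `quoted_items` and `english_words` are hypothetical. The word-count ranges given above overlap at three words; this sketch places three-word phrases in the short list.

```python
import re

# Straight or curly double quotes; single quotes are skipped to avoid apostrophes.
QUOTED = re.compile(r'"([^"]+)"|“([^”]+)”')
ENGLISH = re.compile(r"[A-Za-z]+")

def quoted_items(text):
    """Return (short, long) lists of quoted items, split by word count."""
    short_items, long_items = [], []
    for match in QUOTED.finditer(text):
        phrase = (match.group(1) or match.group(2)).strip()
        n_words = len(phrase.split())
        if 1 <= n_words <= 3:
            short_items.append(phrase)   # quoted words and short phrases
        elif 4 <= n_words <= 20:
            long_items.append(phrase)    # longer quoted phrases
        # Empty quotes and phrases longer than 20 words are ignored.
    return short_items, long_items

def english_words(text):
    """Return every run of Latin letters found in the text."""
    return ENGLISH.findall(text)
```

For example, `quoted_items('அவர் "dharma" என்றார்')` returns `(['dharma'], [])`.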
Duplicate words within a paragraph were eliminated. Short words
sharing the same first 3 aksharas were treated as variations on a root word, and only
a few (probable) root words were retained. This is adequate
for the purpose of search,
since all the words matching the root are returned when searching
through the full list of short words in a volume. This
forms the first stage of filtering of the list. In the case of long words, root
words were identified by matching the first four aksharas.
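The first stage of filtering can be sketched as follows. This is a hedged illustration, not the site's program: the akshara split is an approximation for Unicode text, all function names are hypothetical, and the rule for choosing a probable root (the shortest word in each group) is an assumption, since the page does not say how the retained roots were chosen.

```python
import unicodedata

def split_aksharas(word):
    """Approximate akshara split: a base character plus trailing combining marks."""
    clusters = []
    for ch in word:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def dedupe_paragraph(words):
    """Drop repeated words within one paragraph, keeping first-seen order."""
    return list(dict.fromkeys(words))

def group_by_root(words, prefix_len):
    """Group words that share their first `prefix_len` aksharas."""
    groups = {}
    for word in words:
        key = "".join(split_aksharas(word)[:prefix_len])
        groups.setdefault(key, []).append(word)
    return groups

def probable_roots(groups):
    """Assumed rule: keep the word with the fewest aksharas from each group."""
    return [min(ws, key=lambda w: len(split_aksharas(w))) for ws in groups.values()]

words = dedupe_paragraph(["தர்மம்", "தர்மத்தை", "தர்மம்", "தர்மத்தின்"])
groups = group_by_root(words, 3)   # 3 aksharas for short words, 4 for long words
print(probable_roots(groups))      # ['தர்மம்']
```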
The list of long words filtered this way would now be restricted to about 15,000-20,000
and the short words to about 30,000-35,000 for each volume. Quoted words were seen to be
about 1000-2000 depending on the volume and English words were
seen in the range 500-1000 typically.
The filtered lists of long words and short words were further
manually scanned to
identify important ones from the point of view of what one might want
as a quick reference list. Manual scanning selected about 1500-2500
long words and about 2000-3000 short
words for each volume. These were identified as the most important
words one is likely to search for.
This manually prepared list would merit printing in
hard copy form as the Concordance of most important words. This
list is made available for download for each volume.
During the manual scanning process, typographical errors seen in the
words were tagged but not corrected. It is therefore likely that words with
spelling errors may be returned for a query.
The web application will return
matching words based on the options
chosen from the drop down menu. Search by volume may appear
redundant when all the qualifying words from all the volumes have
been included for concordance. In practice it turns out that many
short words will return tens of matches if all volumes are included.
Very long lists returned by the application will require one to
scroll through the words and this can be a bit tedious. Search by
volume will be helpful here.
All the manually filtered lists
for long words (and separately for Short words) in all the volumes
were combined together as a global set and this represents the choice
in the "ALL-7" option for the Filtered Long (Short) words type. In this
selection approximately 20,000 words are included for each
option i.e., Long or Short.
A typical search may be effected as follows.
Select the All-7 option for
long or short words and submit the query. If the returned list satisfactorily
reflects the needed results, one need not go further.
The next search
would be a volume-wise search of filtered root words, and this is likely
to return many more words. If required, the search could be extended to
the full list of Long and Short words within a volume, resulting in many more
matches.
If matches are not returned for a specific query from either
of the Filtered lists for a volume, one can always look for the
word from the much larger and full set of each type.
It is quite likely that
with this full set, multiple occurrences of the same word will be
returned as the full set will include all the words in the volume
matching the word type (Long or Short words). In these sets
one will see duplications as well as many different variations
for a root word, mostly constituting commonly spoken words.
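The search sequence described above (global filtered list, then the filtered list for a volume, then the full list for that volume) can be sketched as below. This is an illustration under stated assumptions and not the web application's code: the function names, the in-memory word lists, and prefix matching as the match rule are all hypothetical. The sketch also stops at the first stage that returns anything, whereas a person searching would decide at each stage whether the results are satisfactory.

```python
def find_matches(query, words):
    """Assumed match rule: a word matches if it starts with the query."""
    return [word for word in words if word.startswith(query)]

def staged_search(query, global_filtered, volume_filtered, volume_full):
    """Try progressively larger word lists and return the first non-empty result."""
    stages = [
        ("ALL-7 filtered", global_filtered),
        ("volume filtered", volume_filtered),
        ("volume full", volume_full),
    ]
    for stage_name, words in stages:
        hits = find_matches(query, words)
        if hits:
            return stage_name, hits
    return None, []

stage, hits = staged_search(
    "dhar",
    global_filtered=["karma"],
    volume_filtered=["dharma"],
    volume_full=["dharma", "dharmam", "dharma"],
)
print(stage, hits)  # volume filtered ['dharma']
```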
To avoid going through all the seven volumes for Short or Long words,
one could attempt an advanced search across
all the volumes for both Long and Short words. Obscure words very
rarely seen in common use could be searched this way. Usually
English words written in Tamil will merit search in this manner. The
advanced search facility is on a separate page. The advanced
search will return results (Long as well as Short words) for the query from
all the seven volumes. Results will be from the filtered sets.
The advanced search may be used to check if a word is present in
any of the volumes. One can then return to the standard search to
find multiple occurrences in a volume.
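A minimal sketch of the advanced search idea follows: one query is checked against the filtered Long and Short lists of every volume, and the result shows which volumes contain a match. The data layout (a dictionary keyed by volume number) and the function name are assumptions made for illustration only.

```python
def advanced_search(query, filtered_by_volume):
    """Return {volume: matching words} across the filtered long and short lists."""
    results = {}
    for volume, lists in filtered_by_volume.items():
        hits = [
            word
            for kind in ("long", "short")
            for word in lists.get(kind, [])
            if word.startswith(query)  # assumed match rule: prefix match
        ]
        if hits:
            results[volume] = hits
    return results

# Hypothetical data for two volumes.
data = {
    1: {"long": ["dharmasastram"], "short": ["karma"]},
    2: {"long": [], "short": ["dharma"]},
}
print(advanced_search("dhar", data))  # {1: ['dharmasastram'], 2: ['dharma']}
```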
In the search application, the drop down menu for the word-type
allows the selection of the list of interest. There are six
choices: Filtered Long, Filtered Short, Quoted, Full list of
Long words, Full list of Short words and English words. By
selecting the full lists one after another, one would be searching
for a match from about 60,000 - 90,000 words based on the
volume selected.
The table in the page on concordance has the details of the
different word types in each volume.
Concordance Generation
This page discusses the factors taken into consideration while generating the
Concordance for the volumes of Deivathin Kural. The approach may not
conform to conventional indexing of documents or web pages. Viewers are
encouraged to offer their views on this.
Concordance here implies that the prepared list relates words to their occurrences
in an essay in one of the volumes of Deivathin Kural. The search here does not
extend to searching for phrases or conditional searches as one might see in
search engines on the web.
The structure of words in Tamil is based on the principle of adding prefixes
and suffixes to roots to derive variations. This linguistic property (Tamil is an
agglutinative language) is useful while searching for words, since the algorithms for
word selection can be written to match the roots.
The algorithms written for concordance generation for Deivathin Kural are
based on the scheme of text representation developed at IIT Madras during
the 1990s. This scheme, which represents each syllable using a fixed-length
code (16 bits, but quite different from Unicode), allows regular expression
matching to be effected with ease on Tamil text (or, for that matter, text in all
Indian languages).
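To illustrate why one fixed-length code per syllable makes regular expression matching easy, here is a small sketch. It does not reproduce the IIT Madras scheme, whose code values are not given on this page. Instead it uses a hypothetical stand-in: each distinct akshara (approximated from Unicode text) is assigned one private-use character, so that a single regex `.` always stands for exactly one akshara. The class and function names are assumptions.

```python
import re
import unicodedata

def split_aksharas(word):
    """Approximate akshara split: a base character plus trailing combining marks."""
    clusters = []
    for ch in word:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

class AksharaCodec:
    """Hypothetical stand-in codec: one private-use code point per distinct akshara."""

    def __init__(self):
        self._codes = {}

    def encode(self, word):
        encoded = []
        for akshara in split_aksharas(word):
            if akshara not in self._codes:
                # Private Use Area starts at U+E000; enough room for a demo.
                self._codes[akshara] = chr(0xE000 + len(self._codes))
            encoded.append(self._codes[akshara])
        return "".join(encoded)

codec = AksharaCodec()
LONG_WORD = re.compile(r".{7,}")  # seven or more aksharas

# Length test: one '.' per akshara.
print(bool(LONG_WORD.fullmatch(codec.encode("திருவண்ணாமலை"))))  # True
print(bool(LONG_WORD.fullmatch(codec.encode("அம்மா"))))          # False

# Root test: the first three aksharas of the word equal the encoded root.
root = re.compile(re.escape(codec.encode("தர்ம")))
print(bool(root.match(codec.encode("தர்மத்தை"))))                # True
```

With variable-length Unicode sequences the same patterns would have to account for vowel signs and the pulli explicitly; with one code per akshara the pattern stays a plain character count.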