A Stemming Algorithm Development For Tigrigna Language Text Documents



Variant word forms encountered in indexing and retrieval are one of the causes of the problems involved in using free-text retrieval systems. The variant word forms used in indexing and searching are likely to be of comparable importance in determining the relevance of a document to a user query that specifies just a single form. Reducing the variant words to one form improves the performance of an IR system. This can be achieved by a conflation technique, usually stemming, which is the approach taken in this work. Stemmers are used in information retrieval to reduce as many related words and word forms as possible to a common form, which can then be used in the retrieval process.

This research explores the possibility of developing a stemmer to conflate variant words of the Tigrigna language for use in IR of the language. Tigrigna belongs to the Semitic language group. These languages have a common grammatical system based on a root-pattern structure. Consonants bear the basic meanings while vowels form different patterns. Stems are built from consonantal roots before other words are built from stems. Tigrigna uses affixation to derive different word forms from stems. Common affixations are prefix, suffix, prefix-suffix pair and reduplication. Tigrigna uses extensive concatenation of affixes, which can result in relatively long words that often contain an amount of semantic information equivalent to a whole English phrase, clause or sentence. Due to this complex morphological structure, a single Tigrigna word can have thousands of variants.

To design the stemmer, a sample text was collected from three different sources. The word-distribution experiment on the sample data shows that words occur in their variant forms across the text and that singleton words constitute a large percentage of the text. This resulted in a low word-ratio and a deviation from Zipf's law.

The stemmer developed is iterative and uses context-sensitive rules that remove prefixes, suffixes, prefix-suffix pairs and reduplication of single and double letters. A semi-automated procedure was used to compile stop words and affixes. The stemmer was tested on sample data of 1568 words, selected randomly from the sample texts. In this experiment the stripping procedures were applied in the order of prefix-suffix pair, double-letter reduplication, prefix, suffix and single-letter reduplication. The results show that the stemmer performs at an accuracy of 84% and brings a dictionary reduction of 32.40% and 54.6% for stem and root respectively.
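
The stripping order described in the abstract can be illustrated with a minimal sketch. It is not the stemmer developed in this work. The affix lists, the minimum stem length used as a stand-in for the context-sensitive rules, and all function names are hypothetical placeholders. The example words are Latin-letter strings invented for illustration, whereas a real stemmer would operate on Tigrigna text. The dictionary-reduction formula (share of distinct word forms eliminated by stemming) is likewise an assumption about how such a figure is commonly computed.

```python
from typing import Callable, Iterable, List

# Hypothetical placeholder affixes; NOT the lists compiled in the thesis.
PREFIX_SUFFIX_PAIRS = [("ay", "n")]
PREFIXES = ["zey", "te"]
SUFFIXES = ["tat", "na"]
MIN_STEM_LEN = 3  # assumed guard standing in for context-sensitive rules


def strip_pair(word: str) -> str:
    """Remove a matching prefix-suffix pair if the remainder is long enough."""
    for pre, suf in PREFIX_SUFFIX_PAIRS:
        if word.startswith(pre) and word.endswith(suf):
            stem = word[len(pre):len(word) - len(suf)]
            if len(stem) >= MIN_STEM_LEN:
                return stem
    return word


def strip_reduplication(word: str, size: int) -> str:
    """Collapse the first immediately repeated run of `size` letters."""
    for i in range(len(word) - 2 * size + 1):
        if word[i:i + size] == word[i + size:i + 2 * size]:
            candidate = word[:i + size] + word[i + 2 * size:]
            if len(candidate) >= MIN_STEM_LEN:
                return candidate
    return word


def strip_prefix(word: str) -> str:
    """Remove the longest matching prefix, if any."""
    for pre in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(pre) and len(word) - len(pre) >= MIN_STEM_LEN:
            return word[len(pre):]
    return word


def strip_suffix(word: str) -> str:
    """Remove the longest matching suffix, if any."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= MIN_STEM_LEN:
            return word[:len(word) - len(suf)]
    return word


# Order taken from the abstract: pair, double-letter reduplication,
# prefix, suffix, single-letter reduplication.
STEPS: List[Callable[[str], str]] = [
    strip_pair,
    lambda w: strip_reduplication(w, 2),
    strip_prefix,
    strip_suffix,
    lambda w: strip_reduplication(w, 1),
]


def stem(word: str) -> str:
    """Apply the steps repeatedly until the word stops changing."""
    while True:
        previous = word
        for step in STEPS:
            word = step(word)
        if word == previous:
            return word


def dictionary_reduction(words: Iterable[str]) -> float:
    """Percentage of distinct word forms removed by stemming (assumed formula)."""
    unique_words = set(words)
    unique_stems = {stem(w) for w in unique_words}
    return 100 * (len(unique_words) - len(unique_stems)) / len(unique_words)
```

Every step either shortens the word or leaves it unchanged, so the loop in `stem` always terminates. With the placeholder lists above, `stem("tesebere")` and `stem("seberetat")` both reduce to `"sebere"`, which is the kind of conflation the retrieval process relies on.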
