Word Sense Disambiguation For Amharic Text Retrieval A Case Study For Legal Documents



This study demonstrates how linguistic disambiguation based on semantic vector analysis can improve the effectiveness of an Amharic document query retrieval algorithm.

Accurate document retrieval based on query criteria is important in every knowledge domain. Retrieving appropriate documents is made more difficult by the fact that many words have different meanings in different contexts. If search engines could disambiguate those words, more accurate document retrieval should be achievable.

For this study, an Amharic disambiguation algorithm was developed based on the principles of semantic vectors and implemented in Java. The disambiguation algorithm was then used to develop a document search engine.

A set of 865 Ethiopian Amharic-language legal statute documents was selected as the document population to be searched. Ten queries containing Amharic keywords with ambiguous meanings were selected. An expert identified which documents should ideally be retrieved by each query. Depending on the query, the expert identified between 6 and 25 documents that should be retrieved.

The semantic vector query algorithm created in this study was compared to the well-known Lucene algorithm. Each query was run using both algorithms, and the 20 most relevant documents were identified for each query from each algorithm. For each query, the list of documents retrieved by each algorithm was compared to the list of documents identified by the expert, and the number of correct documents (those consistent with the expert’s choices) retrieved by each algorithm was measured.

The semantic vector algorithm was superior for 6 of the 10 queries (Lucene was superior on 2 queries, and on 2 they were tied). This difference was not statistically significant.
However, if the total number of correct document identifications is taken into account (not just which algorithm was superior for each query), then the semantic vector algorithm averaged 82% correct identification of documents, whereas the Lucene algorithm was only 49% accurate. This difference was highly statistically significant (p
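
The per-query measurement described in the abstract (counting how many of the top 20 retrieved documents also appear in the expert's list) can be sketched in Java, the language the study reports using. This is a minimal illustration under stated assumptions, not the study's code: the class name `TopKEvaluation`, the method `countCorrect`, and the document IDs are all hypothetical, and the abstract does not say how the counts were normalized into the reported percentages.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not taken from the study.
final class TopKEvaluation {

    /**
     * Counts how many of the first k ranked document IDs appear in the
     * set of documents the expert marked as relevant for the query.
     */
    static int countCorrect(List<String> ranked, Set<String> expertRelevant, int k) {
        int limit = Math.min(k, ranked.size());
        int correct = 0;
        for (int i = 0; i < limit; i++) {
            if (expertRelevant.contains(ranked.get(i))) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Made-up IDs for illustration only.
        List<String> ranked = List.of("d3", "d7", "d1", "d9");
        Set<String> expert = Set.of("d1", "d3", "d5");
        // Only the top 3 results are inspected: d3 and d1 match the expert list.
        System.out.println(countCorrect(ranked, expert, 3)); // prints 2
    }
}
```

Running the same count for both algorithms on each query, with k set to 20, would yield the per-query comparison the abstract describes.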
