A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages

The dominant languages in the Ethio-Semitic family are Amharic, Geez, Guragigna and Tigrigna. Language identification studies on European languages have concluded that most classifiers reach an accuracy of 100%. Local and global studies, however, have confirmed that the Naïve Bayes Classifier (NBC) does not reach 100% accuracy in language identification, especially on shorter test strings. Comparative language identification studies on European languages show that Cumulative Frequency Addition (CFA) achieves accuracies close to 100% and outperforms the NBC.

The purpose of our study is to assess the performance of CFA compared to NBC on Ethio-Semitic languages, to validate the published research findings on the CFA and NBC classifiers, and to recommend the classifier, language model, evaluation context and value of N that perform best in language identification.

In this research we employed an experimental study to measure the performance of the CFA and NBC classifiers. We developed training and test corpora from online Bibles written in Amharic, Geez, Guragigna and Tigrigna, and used them to generate five different character-based n-gram language models. We measured the classifiers' performance under two different evaluation contexts using 10-fold cross-validation. The F-score is used as the measure of performance for comparing the classifiers.

Both classifiers generally performed better as the test phrase grew from a single word to two, three and more words, reaching F-scores above 99%. Both classifiers performed similarly under each evaluation context for the language models and n-grams tested. The language model based on fixed-length character n-grams with location features exhibited the highest F-score for both classifiers under each evaluation context, even on test strings as short as one word. For this language model N=5 is the optimal value of N, whereas N=2 is optimal for the remaining language models, for both the CFA and NBC classifiers and under both evaluation contexts. Based on our findings, we recommend CFA over NBC, both because it is founded on sound theoretical assumptions and because of its performance in language identification.
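
To make the comparison concrete, the following is a minimal sketch of how the two classifiers might score a test string against character n-gram models. It is not the implementation used in the study. The function names, the toy corpus, the per-language relative-frequency normalization used for CFA, and the add-one smoothing used for NBC are all assumptions made for illustration; the study's five language models (including the location features) and its normalization may differ.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n):
    """Build one n-gram count table per language.

    `corpus` maps a language name to its training text (hypothetical layout).
    """
    return {lang: Counter(char_ngrams(text, n)) for lang, text in corpus.items()}


def cfa_scores(models, test, n):
    """CFA sketch: add up the relative frequencies of the test n-grams.

    Assumption: each count is normalized by the total n-gram count of its
    own language. Unseen n-grams simply contribute zero.
    """
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        scores[lang] = sum(counts[g] / total for g in char_ngrams(test, n))
    return scores


def nbc_scores(models, test, n):
    """NBC sketch: add up log-probabilities of the test n-grams.

    Assumption: add-one (Laplace) smoothing over the shared n-gram
    vocabulary, so unseen n-grams do not produce log(0).
    """
    vocab_size = len(set().union(*models.values()))
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_ngrams(test, n)
        )
    return scores


def identify(scores):
    """Pick the language with the highest score."""
    return max(scores, key=scores.get)


# Toy placeholder corpus, not real Ethio-Semitic data.
toy_corpus = {"lang_a": "abababab abab", "lang_b": "cdcdcdcd cdcd"}
models = train(toy_corpus, 2)
print(identify(cfa_scores(models, "abab", 2)))
print(identify(nbc_scores(models, "abab", 2)))
```

The two scorers differ only in how per-n-gram evidence is combined: CFA adds frequencies, so an unseen n-gram contributes nothing, while NBC multiplies probabilities (adds logs), so an unseen n-gram is penalized according to the smoothing chosen. That difference matters most on very short test strings, where few n-grams are available.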
