Author Identification Of Amharic Online Text Using Stylometry And N-gram Features And Different Classification Techniques



Users in cyberspace generate vast amounts of text while hiding their identity, and these anonymous online writers distribute misinformation throughout the world. In Ethiopia, too, the number of writers who hide their identity keeps growing. Such writers use different languages and different social media accounts. Amharic is one of more than 80 Ethiopian languages in which misinformation is spread by anonymous online writers.

Author identification is a scientific method of identifying the author of an anonymous text by recognizing and extracting features of the author's writing style. To our knowledge, there is no author identification model or published work for identifying anonymous Amharic writers so that the necessary measures can be taken. This thesis therefore explores the development of a model for identifying the authors of Amharic texts using stylometric features, n-gram features, or both, together with three classification algorithms: support vector machine (SVM), Naive Bayes, and a neural network multilayer perceptron (NN-MLP). In addition, the research investigates how the number of articles per author and the number of authors affect the performance of the author identification model. To achieve this aim, an experimental research methodology was followed. The data needed to train the model (Amharic online texts) were collected and pre-processed, and features were extracted and selected. The effects of increasing the number of authors and the number of articles per author were investigated in two experiments. The discrimination capability of the features and models was then tested using an anonymous Amharic online text from a list of suspected authors. The results of the two experiments show that accuracy, precision, recall, and F1-score decrease as the number of authors increases, and increase as the number of articles per author increases.
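
The idea of merging stylometric and n-gram features can be illustrated with a minimal sketch. The four style markers, the feature names, and the choice of character trigrams below are assumptions made for illustration only; the abstract does not list the thesis's actual feature set. The sketch also assumes that words are separated by spaces and that sentences end with the Ethiopic full stop (።).

```python
from collections import Counter

# Common Ethiopic punctuation marks: full stop, comma, semicolon,
# colon, preface colon, question mark.
ETHIOPIC_PUNCT = "።፣፤፥፦፧"


def stylometric_features(text):
    """Return a few illustrative style markers (hypothetical feature set)."""
    words = text.split()  # assumes space-separated words; punctuation stays attached
    n_words = max(len(words), 1)
    sentences = [s for s in text.split("።") if s.strip()]
    return {
        "sty:avg_word_len": sum(len(w) for w in words) / n_words,
        "sty:avg_sent_len": n_words / max(len(sentences), 1),
        "sty:punct_rate": sum(text.count(p) for p in ETHIOPIC_PUNCT) / max(len(text), 1),
        "sty:type_token_ratio": len(set(words)) / n_words,
    }


def char_ngrams(text, n=3):
    """Return relative frequencies of overlapping character n-grams."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1  # avoid division by zero on short texts
    return {"ng:" + gram: count / total for gram, count in counts.items()}


def merged_features(text, n=3):
    """Combine both feature families into one dictionary per document."""
    features = stylometric_features(text)
    features.update(char_ngrams(text, n))
    return features


sample = "ሰላም ነው። ሰላም ነው።"
features = merged_features(sample)
print(len(features), "features extracted")
```

Dictionaries of this kind, one per article, could then be vectorized and passed to any of the three classifier families named above. The key prefixes (`sty:` and `ng:`) keep the two feature families separable, so a model can be trained on either family alone or on both.
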
The research findings indicate that merged features outperform the individual feature sets for almost all models. NN-MLP-logistics achieves 90.47% accuracy and a 90% model performance score with merged features and 27 authors. SVM Linear achieves 97.52% accuracy and a 98% model performance score with merged features and 100 articles per author. Based on these results, we conclude that neural network models are preferable to the other classification models for authorship identification when only a small number of online texts per author is available; their results are also stable and show the best identification capability across the numbers of suspects tested. The experiments were conducted with a limited number of authors, so we recommend further study with a larger number of Amharic online text authors.
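
For readers unfamiliar with the metrics used to compare the models, the following sketch shows one way accuracy, precision, recall, and F1-score might be computed for a multi-author problem. Macro-averaging over authors is an assumption; the abstract does not state which averaging the thesis used, and the function name `evaluate` and the labels below are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,  # macro average: every author counts equally
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up labels for two hypothetical authors, for illustration only.
true_authors = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
print(evaluate(true_authors, predicted))
```

With macro-averaging, every author contributes equally to the score regardless of how many articles they wrote, which matters when the number of articles per author varies.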
