Information Extraction Model From Amharic News Texts


As the volume of unstructured documents on the web and on intranets keeps growing, tools that can extract relevant data to support decision making are becoming crucial. Information extraction (IE) is concerned with extracting relevant information from text and storing it in a database for easy use and management. As the first comprehensive work on IE from Amharic text, we designed a model that is general enough to deal with different domains in the Amharic language. The proposed model has four main components: document preprocessing, text categorization, learning and extraction, and post-processing. The document preprocessing component handles the normalization of the document. The text categorization component categorizes the news text, and the learning and extraction component extracts the predefined relevant information from the categorized text. The post-processing component formats the extracted data and saves it to the database.

The evaluation techniques commonly used to measure the performance of machine learning classifiers are applied to both the IE and the text categorization components. Among the classification algorithms used for the text categorization component, the Naïve Bayes algorithm performs best, correctly classifying 92.83% of the 1200 news texts used as a dataset. For the information extraction component, 1422 instances are used for training and testing. Different scenarios are used to evaluate the role of the different features in predicting the category of the candidate texts. Among the scenarios we considered and the machine learning algorithms we employed, the SMO algorithm correctly classified 94.58% of the instances when all the features are considered, which also yields higher precision and recall for the attributes considered for extraction.

Keywords: Amharic text information extraction, machine learning approach to information extraction, Amharic text categorization, information extraction
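
The four components named in the abstract can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `preprocess`, `NaiveBayes`, `extract`, `run_pipeline` and the `news` table are hypothetical; the homophone folding is one assumed form of Amharic normalization; the extraction step is a regular-expression stand-in rather than the trained SMO classifier; and the toy training data is English for readability.

```python
import math
import re
import sqlite3
from collections import Counter, defaultdict

# Assumption: normalization folds homophone Ethiopic series onto one canonical
# series (for example the U+1210 series onto the U+1200 series), seven orders each.
_SERIES = {0x1210: 0x1200, 0x1280: 0x1200, 0x1220: 0x1230,
           0x12D0: 0x12A0, 0x1340: 0x1338}
_FOLD = {src + i: dst + i for src, dst in _SERIES.items() for i in range(7)}


def preprocess(text):
    """Document preprocessing: fold homophones and collapse whitespace."""
    return re.sub(r"\s+", " ", text.translate(_FOLD)).strip()


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (text categorization)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())

        def score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.class_counts[label] / n_docs)
            return prior + sum(math.log((counts[w] + 1) / total)
                               for w in doc.split())

        return max(self.class_counts, key=score)


def extract(text):
    """Stand-in for learning and extraction: pull out numeric tokens only."""
    return {"numbers": re.findall(r"\d+", text)}


def run_pipeline(text, classifier, conn):
    """Preprocess, categorize, extract, then format and save (post-processing)."""
    clean = preprocess(text)
    category = classifier.predict(clean)
    fields = extract(clean)
    conn.execute("INSERT INTO news (category, numbers) VALUES (?, ?)",
                 (category, ",".join(fields["numbers"])))
    return category, fields


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE news (category TEXT, numbers TEXT)")
    clf = NaiveBayes().fit(["goal match team", "bank market price"],
                           ["sport", "business"])
    print(run_pipeline("team  won 3 goal", clf, conn))
```

In a fuller system, `extract` would be replaced by a trained classifier over candidate texts and their features, as the abstract describes, and the database schema would carry one column per predefined attribute.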
