Amharic Document Image Retrieval Using Linguistic Features



Modern computers play an important role in processing and managing electronic information in the form of text, images, audio, and video. With the rapid development of computer technology, digital documents have become a popular option for storage, access, and transmission. To meet the needs of fast-evolving digital libraries, an increasing number of historical documents, newspapers, books, and similar materials are being digitized for easy archival and dissemination. Optical Character Recognition (OCR) and Document Image Retrieval (DIR), as part of the information retrieval (IR) paradigm, are the two means of accessing document images that have received attention in the IR community.

Amharic has been the official language of Ethiopia since the 19th century, and as a result many religious and government documents are written in Amharic. Huge collections of machine-printed Amharic documents are found in almost every institution in the country, and accessing them has become increasingly difficult. To address this problem, only a few research works have recently been attempted using OCR and DIR methods. The aim of this research is to develop a system model that enables users to find relevant Amharic document images in a corpus of digitized documents in an easy, accurate, fast, and efficient manner. This work therefore presents the architecture of an Amharic DIR system that allows users to search scanned Amharic documents without the need for OCR. The proposed model was designed after a detailed analysis of the specific nature of the Amharic language. Amharic belongs to the Semitic language family and is morphologically rich; surface word formation involves prefixation, suffixation, infixation, circumfixation, and reduplication.

In this work, a model for searching Amharic document images is proposed, and word image features are systematically extracted for automatic indexing, retrieval, and ranking of document images stored in a database. A new approach is incorporated into the proposed system model: it applies an NLP tool, an Amharic word generator. Given an Amharic root word, this Amharic-specific surface word synthesizer produces a number of possible surface words. The descriptions of these surface word images are then used for indexing and searching. The system itself passes through several phases: noise removal, binarization, text line and word boundary identification, word segmentation, resizing to normalize different font types, sizes, and styles, feature extraction, and finally matching of the query word image against document word images. The proposed method was tested on real-world Amharic documents from sources such as magazines, textbooks, and newspapers, with various font styles, types, and sizes. Precision-recall evaluation was conducted for sample queries on sample document images, and promising results were achieved.
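As a rough illustration of three of the phases named above (binarization, word boundary identification, and precision-recall evaluation), the following minimal Python sketch may help. It is not the method used in this work: the fixed threshold, the column-projection segmentation, the `min_gap` parameter, and all function names are assumptions chosen for illustration only.

```python
from typing import List, Set, Tuple

# Assumed representation: a grayscale image is a list of pixel rows,
# each pixel an int from 0 (black) to 255 (white).
Image = List[List[int]]


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Mark each pixel as ink (1) if darker than the threshold, else background (0).

    A fixed global threshold is an assumption; the source does not say
    which binarization technique was used.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_words(line: Image, min_gap: int = 2) -> List[Tuple[int, int]]:
    """Find word boundaries in one binarized text line.

    Uses a vertical projection profile: columns with no ink are gaps, and
    a run of at least `min_gap` empty columns separates two words.
    Returns (start, end) column spans with `end` exclusive.
    """
    width = len(line[0]) if line else 0
    profile = [sum(row[c] for row in line) for c in range(width)]
    spans: List[Tuple[int, int]] = []
    start = None  # first column of the word currently being scanned
    gap = 0       # number of consecutive empty columns seen so far
    for c, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = c
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, c - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, width - gap))
    return spans


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall for one query's retrieved document set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

In a full pipeline the resulting word spans would be cropped, resized to a common size, and passed to feature extraction and matching; those steps are left out here because the abstract does not specify the features or the matching measure.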
