Ontology Based Text Categorization - Telugu Documents

In this paper, we introduce a new method of ontology-based text classification for Telugu documents and a retrieval system. Many text categorization techniques are based on word and/or phrase analysis of the text. Term frequency analysis captures the importance of a term within a document, but two terms can have the same frequency while one contributes more to the meaning of a sentence than the other. Our aim is to capture the semantics of a text. The proposed model identifies the terms that represent the concepts in the text and thereby identifies the topic of the document. We introduce a new concept-based model that analyzes terms at both the sentence and document levels. This model effectively discriminates between terms that are unimportant with respect to sentence semantics and terms that hold the concepts representing the sentence meaning. Using an ontology overcomes the limitations of keyword-based search, which is a motivation for semantic IR. The retrieval model is an adaptation of the classic vector-space model. In a learning stage, each concept in the ontology is associated with related words and their weights, derived from pre-classified documents. In the main process, the words and their mutual relations are extracted from the target documents, and the ontology concepts are then used to map each target document. The paper describes the test results in detail and explains how concept-based classification is far superior to word-based classification for Telugu documents.

Index Terms: Concept-based model, IR, Ontology, Retrieval model, Term frequency, Text categorization, Telugu documents.
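
The two stages described in the abstract (learning concept weights from pre-classified documents, then mapping a target document to a concept with a vector-space comparison) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of relative term frequency as the weight, and cosine similarity as the matching score are all assumptions made for illustration, and the tokens are English placeholders standing in for Telugu terms produced by some tokenizer that is out of scope here.

```python
import math
from collections import Counter, defaultdict


def learn_concept_weights(labeled_docs):
    """Learning stage (assumed form): for each category, weight every term
    by its relative frequency across that category's pre-classified docs."""
    totals = defaultdict(Counter)
    for category, tokens in labeled_docs:
        totals[category].update(tokens)
    weights = {}
    for category, counts in totals.items():
        n = sum(counts.values())
        weights[category] = {term: c / n for term, c in counts.items()}
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, weights):
    """Main process (assumed form): score the target document against each
    concept's weight vector and return the best-matching category."""
    doc_vec = Counter(tokens)
    scores = {cat: cosine(doc_vec, w) for cat, w in weights.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy corpus; real input would be tokenized Telugu text.
training = [
    ("sports", ["cricket", "batting", "team"]),
    ("sports", ["team", "match", "cricket"]),
    ("politics", ["election", "party", "vote"]),
    ("politics", ["party", "minister", "election"]),
]
concept_weights = learn_concept_weights(training)
label, scores = classify(["cricket", "match", "team", "vote"], concept_weights)
print(label, scores)
```

The sketch covers only the document-level vector-space matching. The sentence-level analysis of conceptual terms and the relations between words that the abstract mentions would need additional structure that is not shown here.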