Federated Document Summarization Using Probabilistic Approach for Kannada Language

The number of documents and the amount of information available online continue to grow rapidly. Over the last decade the volume of information has been doubling in size, giving rise to the concept of big data; at the same time, much of it is stored in an unstructured manner. People collect large amounts of information on many issues and areas, whether or not it is useful at that moment, and when specific information must later be retrieved from the collection, a summary of the relevant document can help. Summaries of large documents help readers find the correct information. In this work, we present a method to produce extractive summaries of documents in the Kannada language, limited to the number of sentences specified by the user. This paper proposes a federated approach to summarization that combines the TextRank algorithm with a Naive Bayesian approach. TextRank uses keyword extraction to rank the sentences with Jaccard’s similarity score. The sentences with higher ranks are expected to be part of the summary. Since TextRank is unsupervised, the proposed work uses a Naive Bayesian classifier to incorporate supervised learning aspects. Training sets are prepared for certain categories of Kannada documents and are then used to train the system.
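The TextRank step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`jaccard`, `textrank_summary`), the damping factor of 0.85, the fixed iteration count, and the use of whitespace-separated word sets as a stand-in for extracted keywords are all assumptions made for illustration. The Naive Bayesian component is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |a & b| / |a | b|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def textrank_summary(sentences, n, damping=0.85, iterations=50):
    """Return the n highest-ranked sentences, kept in document order.

    Assumption: each sentence is represented by its set of
    whitespace-separated tokens; a real system would use extracted
    keywords (and Kannada-specific preprocessing) instead.
    """
    count = len(sentences)
    if count == 0:
        return []
    tokens = [set(s.split()) for s in sentences]

    # Weighted, undirected sentence graph: edge weight = Jaccard similarity.
    weights = [
        [jaccard(tokens[i], tokens[j]) if i != j else 0.0 for j in range(count)]
        for i in range(count)
    ]
    out_sum = [sum(row) for row in weights]

    # Power iteration of the weighted PageRank update used by TextRank.
    scores = [1.0 / count] * count
    for _ in range(iterations):
        new_scores = []
        for i in range(count):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(count)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / count + damping * rank)
        scores = new_scores

    # Pick the top-n sentences, then restore their original order.
    top = sorted(range(count), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```

Calling `textrank_summary(sentences, n)` with a list of sentence strings and the user-specified sentence count `n` returns the extractive summary. Sentences that share many tokens with other sentences receive higher scores, while a sentence with no overlap keeps only the baseline score `(1 - damping) / count`.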