Semantic document classification and keyword spotting in digital repositories

Digital repositories hold thousands of documents, and the number grows constantly. Organizing and retrieving these documents in a way that matches how people think about them therefore becomes an important problem. In this paper, we present a novel approach that classifies the documents in a digital repository and finds the semantically significant keywords related to them, making organization and retrieval faster. We approach the problem with a probabilistic model trained on incomplete training data, which we use to organize the documents and mark the relevant keywords. This approach speeds up classification and, instead of unlabeled clusters, yields classes with well-defined topics.
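The abstract does not specify the model, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a multinomial naive Bayes classifier fitted with EM, where only some documents carry a topic label (the "incomplete training data"), and it ranks keywords by a simple score comparing a word's probability within a topic to its overall probability. The function names `train_em` and `keywords`, the smoothing constant `alpha`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def train_em(docs, labels, topics, iters=10, alpha=1.0):
    """Fit a multinomial naive Bayes model with EM.

    docs:   list of token lists
    labels: topic name for labeled documents, None for unlabeled ones
    """
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    # Responsibilities: one-hot for labeled docs, uniform for unlabeled docs.
    resp = [
        {t: (1.0 if t == y else 0.0) for t in topics}
        if y is not None
        else {t: 1.0 / len(topics) for t in topics}
        for y in labels
    ]
    for _ in range(iters):
        # M-step: re-estimate topic priors and word distributions (smoothed).
        prior = {
            t: (sum(r[t] for r in resp) + alpha) / (len(docs) + alpha * len(topics))
            for t in topics
        }
        word_p = {}
        for t in topics:
            tot = {w: alpha for w in vocab}
            for r, c in zip(resp, counts):
                for w, n in c.items():
                    tot[w] += r[t] * n
            z = sum(tot.values())
            word_p[t] = {w: v / z for w, v in tot.items()}
        # E-step: update responsibilities of unlabeled documents only.
        for i, (y, c) in enumerate(zip(labels, counts)):
            if y is not None:
                continue
            logp = {
                t: math.log(prior[t])
                + sum(n * math.log(word_p[t][w]) for w, n in c.items())
                for t in topics
            }
            m = max(logp.values())
            z = sum(math.exp(v - m) for v in logp.values())
            resp[i] = {t: math.exp(logp[t] - m) / z for t in topics}
    return prior, word_p, resp


def keywords(prior, word_p, topic, k=3):
    """Rank words by p(w|topic) * log(p(w|topic) / p(w))."""
    def score(w):
        p_w = sum(prior[t] * word_p[t][w] for t in prior)
        return word_p[topic][w] * math.log(word_p[topic][w] / p_w)
    return sorted(word_p[topic], key=score, reverse=True)[:k]


# Toy corpus: two labeled documents, two unlabeled ones.
docs = [
    ["kernel", "margin", "svm", "classifier"],
    ["protein", "gene", "cell", "dna"],
    ["svm", "kernel", "training", "classifier"],
    ["gene", "dna", "sequence", "cell"],
]
labels = ["ml", "bio", None, None]
topics = ["ml", "bio"]

prior, word_p, resp = train_em(docs, labels, topics)
predicted = [max(r, key=r.get) for r in resp]
ml_keywords = keywords(prior, word_p, "ml")
print(predicted)
print(ml_keywords)
```

Because the labeled documents fix the meaning of each topic from the start, the resulting classes have names rather than being anonymous clusters, which mirrors the contrast the abstract draws with unlabeled clustering.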

Ratna Sanyal | Manish Kumar | Nikunj Yadav | Yanu Gupta

[1] Thomas Hofmann, et al. Unsupervised Learning by Probabilistic Latent Semantic Analysis, 2004, Machine Learning.

[2] Corinna Cortes, et al. Support-Vector Networks, 1995, Machine Learning.

[3] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (plus discussions on the paper), 1977.

[4] Mark A. Girolami, et al. A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections, 2004, Journal of Intelligent Information Systems.