Semantic document classification and keyword spotting in digital repositories
The number of documents in digital repositories runs into the thousands and is constantly increasing. In such a scenario, it becomes important to organize and retrieve these documents in a way that matches how people think about them. In this paper, we present a novel approach that classifies the documents in a digital repository and finds the semantically significant keywords related to each document, so that organizing and retrieving the documents becomes faster. We approach the problem with a probabilistic model trained on incomplete training data, which is used both to organize the documents and to mark the relevant keywords. This approach makes classification faster and, rather than producing unlabeled clusters, yields a classification into well-defined topics.
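The abstract does not spell out the model, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a multinomial naive Bayes model fitted by expectation-maximization, in which a few documents carry topic labels and the rest are unlabeled (marked with -1). The function names, the smoothing constant `alpha`, the keyword scoring rule, and the toy vocabulary and counts are all hypothetical choices made for the example.

```python
import numpy as np

def semi_supervised_nb(X, y, n_topics, n_iter=20, alpha=1.0):
    """Fit a multinomial naive Bayes model with EM.

    X: (n_docs, n_words) word-count matrix.
    y: topic label per document, -1 where the label is missing.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_docs = X.shape[0]
    labeled = y >= 0
    # Responsibilities: one-hot for labeled documents, uniform otherwise.
    R = np.full((n_docs, n_topics), 1.0 / n_topics)
    R[labeled] = np.eye(n_topics)[y[labeled]]
    for _ in range(n_iter):
        # M-step: re-estimate topic priors and per-topic word distributions,
        # with additive smoothing alpha.
        prior = (R.sum(axis=0) + alpha) / (n_docs + alpha * n_topics)
        counts = R.T @ X + alpha
        p_word = counts / counts.sum(axis=1, keepdims=True)
        # E-step: posterior over topics, updated for unlabeled documents only.
        log_post = np.log(prior) + X @ np.log(p_word).T
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        R[~labeled] = post[~labeled]
    return R.argmax(axis=1), p_word

def top_keywords(p_word, vocab, k=3):
    """Rank words by how much likelier they are in a topic than on average."""
    score = np.log(p_word) - np.log(p_word.mean(axis=0))
    return [[vocab[i] for i in np.argsort(-row)[:k]] for row in score]

# Toy data (made up): topic 0 is sport-like, topic 1 is politics-like.
vocab = ["goal", "match", "team", "vote", "party", "election"]
X = [[3, 2, 1, 0, 0, 0],
     [0, 0, 0, 2, 3, 1],
     [2, 1, 2, 0, 0, 0],
     [0, 0, 0, 1, 2, 3],
     [1, 3, 2, 0, 0, 0],
     [0, 0, 1, 3, 1, 2]]
y = [0, 1, -1, -1, -1, -1]  # only the first two documents are labeled

labels, p_word = semi_supervised_nb(X, y, n_topics=2)
keywords = top_keywords(p_word, vocab)
print(labels, keywords)
```

Because the labeled documents fix which topic each component stands for, the result is a classification into named topics rather than anonymous clusters, and the same per-topic word distributions are used to pick out keywords.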