Ontology Construction Based on Latent Topic Extraction in a Digital Library

This paper discusses automatic ontology construction in a digital library. Traditional automatic ontology construction uses hierarchical clustering to group similar terms, and the resulting hierarchy is usually unsatisfactory from a human reader's point of view. Human-built knowledge networks carry strong semantic features, but building them is labor-intensive and yields inconsistent results at large scale. The method proposed in this paper combines statistical correction with latent topic extraction over the textual data in a digital library and produces a semantically oriented, OWL-based ontology. The experimental document collection is the Chinese Recorder, which served as a link between the various missions that were part of the rise and heyday of the Western effort to Christianize the Far East. The paper describes the ontology construction process and presents the final ontology in OWL format.
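
To make the last step of such a pipeline concrete, the following is a minimal sketch of how extracted topics might be serialized as OWL classes in RDF/XML. It is not the paper's implementation: the topic labels, the terms, the `http://example.org/recorder` namespace, the `topics_to_owl` helper, and the choice to model each term as a subclass of its topic are all assumptions made for illustration. In practice the topic-to-term groupings would come from a latent topic model rather than being written by hand.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
ONTOLOGY_IRI = "http://example.org/recorder"  # hypothetical namespace


def topics_to_owl(topics):
    """Serialize {topic_label: [term, ...]} as OWL classes in RDF/XML.

    Assumption: each term is modeled as a subclass of its topic.
    """
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("rdfs", RDFS)
    ET.register_namespace("owl", OWL)

    root = ET.Element(f"{{{RDF}}}RDF")
    ET.SubElement(root, f"{{{OWL}}}Ontology", {f"{{{RDF}}}about": ONTOLOGY_IRI})

    for topic, terms in topics.items():
        topic_iri = f"{ONTOLOGY_IRI}#{topic}"
        # One class per latent topic.
        ET.SubElement(root, f"{{{OWL}}}Class", {f"{{{RDF}}}about": topic_iri})
        for term in terms:
            # One class per term, linked to its topic by rdfs:subClassOf.
            term_class = ET.SubElement(
                root, f"{{{OWL}}}Class", {f"{{{RDF}}}about": f"{ONTOLOGY_IRI}#{term}"}
            )
            ET.SubElement(
                term_class, f"{{{RDFS}}}subClassOf", {f"{{{RDF}}}resource": topic_iri}
            )

    return ET.tostring(root, encoding="unicode")


# Hand-written placeholder topics; not results from the paper.
example_topics = {"Education": ["School", "College"], "Medicine": ["Hospital"]}
owl_xml = topics_to_owl(example_topics)
print(owl_xml)
```

The output can be opened in an ontology editor such as Protégé for inspection. A flat topic-to-term mapping gives only a two-level hierarchy; deeper structure would need additional relations between topics.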
