Personalized concept hierarchy construction

A concept hierarchy is a set of concepts and the relations between those concepts. Since ancient times, concept hierarchies have been used to organize and access information. In some situations, task-specific and user-specific concept hierarchies are necessary to provide an overview of, and easy access to, a large set of documents. For example, in regulatory reform, rule-makers in government regulatory agencies must quickly identify and respond to issues raised in public comments. A concept hierarchy constructed for a set of public comments organizes the comments hierarchically, so that a user can easily "drill down" into the documents that discuss a specific topic. In particular, this dissertation addresses how to construct concept hierarchies from text collections, either automatically or with a human in the loop. The novel metric-based concept hierarchy construction framework transforms concept hierarchy construction into a multi-criterion optimization problem. It incrementally clusters concepts based on minimum evolution of the hierarchy structure, as well as on optimization criteria derived from modeling concept abstractness and concept coherence. Moreover, this dissertation represents the semantic distance between concepts using a wide range of features, each of which corresponds to a state-of-the-art concept hierarchy construction technique, such as lexico-syntactic patterns, contextual information, and co-occurrence. The use of multiple features allows further study of the interaction between features and different types of semantic relations, as well as the interaction between features and concepts at different abstraction levels. Besides the automatic framework, this dissertation also proposes an effective human-guided concept hierarchy construction framework that addresses personalization by learning from periodic manual guidance and directing the learned models towards personal preferences.
Through human-computer interaction, the human and the machine work together to organize concepts into hierarchies. The machine's predictions not only save the user effort but also provide sensible suggestions to assist the user. This is one of the first works on real-time machine learning for organizing personalized and task-specific information in an interactive paradigm. This dissertation also studies user behavior during concept hierarchy construction. It explores whether people create concept hierarchies more quickly or more consistently using the proposed frameworks, whether there are consistent dataset-specific or user-specific differences in the hierarchies that people construct, whether people are self-consistent, and how these factors interact with different construction methods. The user study shows that dataset difficulty is a major factor affecting how people organize information into concept hierarchies. It also reveals that people are quite self-consistent in building hierarchies. This novel finding provides a foundation for studying the differences in concept hierarchy construction behavior between individuals. Last but not least, the dissertation proposes a novel metric for measuring hierarchy similarity. Fragment-based Similarity (FBS) employs a unique bag-of-words representation for hierarchies and takes a fragment-based view to calculate hierarchy similarity. FBS approximates tree edit distance well and greatly reduces the computational cost, from NP-hard for tree edit distance to O(n^3), and to O(n) if pairwise node similarities are pre-calculated. The research in this dissertation is an important step forward in concept hierarchy construction.
It addresses important problems in concept hierarchy construction. In particular, it considers how to better model these problems on sound theoretical foundations, how to study them through extensive empirical experiments and user studies, and how to solve them by developing practical applications for constructing personal concept hierarchies.
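
As a rough illustration of the incremental, minimum-evolution idea mentioned above, the following Python sketch inserts concepts one at a time and attaches each new concept where the total edge distance of the hierarchy changes least. It is a simplified reading, not the dissertation's implementation: the abstractness and coherence criteria are omitted, a new concept may only be attached as a leaf, and the function names (`edge_cost`, `insert_min_evolution`) and all distance values are hypothetical.

```python
# Toy pairwise semantic distances (invented values, for illustration only).
DIST = {
    frozenset(pair): value
    for pair, value in {
        ("animal", "dog"): 0.4,
        ("animal", "cat"): 0.4,
        ("animal", "poodle"): 0.5,
        ("dog", "cat"): 0.6,
        ("dog", "poodle"): 0.1,
        ("cat", "poodle"): 0.7,
    }.items()
}


def edge_cost(parent_of, dist):
    """Sum the semantic distance over all parent-child edges."""
    return sum(
        dist[frozenset((child, parent))]
        for child, parent in parent_of.items()
        if parent is not None
    )


def insert_min_evolution(parent_of, concept, dist):
    """Attach `concept` under the node that changes the total edge cost least.

    Assumption: the new concept is only ever added as a leaf, so the
    change in cost is simply the distance of the one new edge.
    """
    before = edge_cost(parent_of, dist)
    best_parent, best_change = None, float("inf")
    for candidate in parent_of:
        trial = dict(parent_of)
        trial[concept] = candidate
        change = abs(edge_cost(trial, dist) - before)
        if change < best_change:
            best_parent, best_change = candidate, change
    parent_of[concept] = best_parent
    return best_parent


# Build a small hierarchy incrementally, starting from a single root.
hierarchy = {"animal": None}
for new_concept in ["dog", "cat", "poodle"]:
    insert_min_evolution(hierarchy, new_concept, DIST)
```

With these toy distances, "dog" and "cat" attach to the root and "poodle" attaches under "dog". In a human-guided setting, the distance table is the part one would expect to be re-learned from the user's manual edits.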


[43]  Grace Hui Yang,et al.  Near-duplicate detection by instance-level constrained clustering , 2006, SIGIR.

[44]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[45]  CHENGXIANG ZHAI,et al.  A study of smoothing methods for language models applied to information retrieval , 2004, TOIS.

[46]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[47]  Kaizhong Zhang,et al.  Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems , 1989, SIAM J. Comput..

[48]  Jaime Teevan,et al.  Information re-retrieval: repeat queries in Yahoo's logs , 2007, SIGIR.

[49]  Philipp Cimiano,et al.  Automatic Acquisition of Ranked Qualia Structures from the Web , 2007, ACL.

[50]  Luc Steels,et al.  Aibo''s first words. the social learning of language and meaning. Evolution of Communication , 2002 .

[51]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[52]  Patrick Pantel,et al.  Discovering word senses from text , 2002, KDD.

[53]  Grace Hui Yang,et al.  Next steps in near-duplicate detection for eRulemaking , 2006, DG.O.

[54]  Dominic Widdows,et al.  A Graph Model for Unsupervised Lexical Acquisition , 2002, COLING.

[55]  Jürgen Ziegler,et al.  Matrix browser: visualizing and exploring large networked information spaces , 2002, CHI Extended Abstracts.

[56]  Dan I. Moldovan,et al.  Automatic Discovery of Part-Whole Relations , 2006, CL.

[57]  Eduard Hovy,et al.  Towards terascale knowledge acquisition , 2004, COLING 2004.

[58]  Tom M. Mitchell,et al.  Text clustering with extended user feedback , 2006, SIGIR.

[59]  Jeffrey P. Bigham,et al.  Combining Independent Modules to Solve Multiple-choice Synonym and Analogy Problems , 2003, ArXiv.

[60]  Peter Clark,et al.  Knowledge entry as the graphical assembly of components , 2001, K-CAP '01.

[61]  Monica N. Nicolescu,et al.  Natural methods for robot task learning: instructive demonstrations, generalization and practice , 2003, AAMAS '03.

[62]  Erhard Rahm,et al.  Schema and ontology matching with COMA++ , 2005, SIGMOD '05.

[63]  James Allan,et al.  Flexible intrinsic evaluation of hierarchical clustering for TDT , 2003, CIKM '03.

[64]  Grace Hui Yang,et al.  Learning the distance metric in a personal ontology , 2008, ONISW '08.

[65]  Henry Lieberman,et al.  A goal-oriented web browser , 2006, CHI.

[66]  W. Bruce Croft,et al.  Deriving concept hierarchies from text , 1999, SIGIR '99.

[67]  P. Mahalanobis On the generalized distance in statistics , 1936 .

[68]  Doug Downey,et al.  Unsupervised named-entity extraction from the Web: An experimental study , 2005, Artif. Intell..

[69]  Tao Jiang,et al.  Some MAX SNP-Hard Results Concerning Unordered Labeled Trees , 1994, Inf. Process. Lett..

[70]  Hans-Peter Kriegel,et al.  Efficient Similarity Search for Hierarchical Data in Large Databases , 2004, EDBT.

[71]  Daniel Jurafsky,et al.  Semantic Taxonomy Induction from Heterogenous Evidence , 2006, ACL.

[72]  Chris Mattmann,et al.  ACE: improving search engines via Automatic Concept Extraction , 2004, Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, 2004. IRI 2004..

[73]  Ido Dagan,et al.  Scaling Web-based Acquisition of Entailment Relations , 2004, EMNLP.

[74]  Patrick Pantel,et al.  Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations , 2006, ACL.

[75]  David A. Cohn,et al.  Active Learning with Statistical Models , 1996, NIPS.

[76]  Dekang Lin,et al.  Automatic Retrieval and Clustering of Similar Words , 1998, ACL.

[77]  R. Bhatia Positive Definite Matrices , 2007 .

[78]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[79]  Erhard Rahm,et al.  Matching large XML schemas , 2004, SGMD.

[80]  Dan I. Moldovan,et al.  Learning Semantic Constraints for the Automatic Discovery of Part-Whole Relations , 2003, NAACL.

[81]  Andruid Kerne,et al.  Generative semantic clustering in spatial hypertext , 2005, DocEng '05.

[82]  Daniel Jurafsky,et al.  Learning Syntactic Patterns for Automatic Hypernym Discovery , 2004, NIPS.

[83]  Ellen M. Voorhees,et al.  Evaluation by highly relevant documents , 2001, SIGIR '01.

[84]  Philip Bille,et al.  A survey on tree edit distance and related problems , 2005, Theor. Comput. Sci..

[85]  Grace Hui Yang,et al.  Ontology generation for large email collections , 2008, DG.O.

[86]  Sharon A. Caraballo Automatic construction of a hypernym-labeled noun hierarchy from text , 1999, ACL.

[87]  Eduard H. Hovy,et al.  Learning surface text patterns for a Question Answering System , 2002, ACL.

[88]  Tom M. Mitchell,et al.  Exploring Hierarchical User Feedback in Email Clustering , 2008 .

[89]  Yimin Wang,et al.  Towards Semi-automatic Ontology Building Supported by Large-Scale Knowledge Acquisition , 2006, AAAI Fall Symposium: Semantic Web for Collaborative Knowledge Acquisition.