Interpretation of text patterns

Patterns are a fundamental means of analysing data in many text mining applications, and many efficient techniques have been developed to discover them. However, the excessive number of discovered patterns and their lack of grounded (e.g. a priori defined) semantics make it difficult for a user to interpret and explore them. Insight into the meanings of the patterns can help users explore them. This paper presents a model that automatically interprets patterns by pursuing two goals: (1) describing the meanings of patterns in terms of ontology concepts and (2) providing a new method for generating and extracting features from an ontology so that the relevant information is described more effectively. By drawing on a domain ontology and a set of relevant statistics (e.g. term frequency in a document, inverse term frequency in a domain ontology, etc.), the proposed model can give an insight into the hidden meanings of the patterns. The model is evaluated against different baseline models on three standard datasets. The results show that the proposed model performs significantly better than the baseline models.
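To make the role of the two statistics named above more concrete, the following is a minimal illustrative sketch, not the model proposed in the paper. It assumes a toy ontology given as a mapping from concept labels to term sets, and it assumes that "inverse term frequency in a domain ontology" can be approximated by an IDF-style weight computed over concepts. The function name `concept_scores`, the smoothing constant, and the example data are all hypothetical.

```python
import math
from collections import Counter


def concept_scores(pattern_terms, doc_tokens, ontology):
    """Rank ontology concepts for one pattern (illustrative only).

    pattern_terms: terms that make up the discovered pattern
    doc_tokens:    tokens of the document the pattern was mined from
    ontology:      dict mapping a concept label to a set of terms
                   (a hypothetical, simplified ontology structure)
    """
    tf = Counter(doc_tokens)  # term frequency in the document
    n_concepts = len(ontology)
    # Number of concepts in which each term occurs.
    concept_freq = Counter(t for terms in ontology.values() for t in set(terms))

    scores = {}
    for concept, terms in ontology.items():
        score = 0.0
        for t in pattern_terms:
            if t in terms:
                # Assumed IDF-style weight: terms spread over many concepts
                # say less about any single one of them.
                iof = math.log(n_concepts / concept_freq[t]) + 1.0
                score += tf[t] * iof
        if score > 0:
            scores[concept] = score
    # Highest-scoring concepts first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data.
ontology = {
    "Economics": {"market", "trade", "price"},
    "Sports": {"match", "trade", "team"},
    "Weather": {"rain", "storm"},
}
doc_tokens = ["market", "price", "trade", "market", "team"]
ranked = concept_scores(["market", "trade"], doc_tokens, ontology)
for concept, score in ranked:
    print(f"{concept}: {score:.3f}")
```

In this toy setting the pattern {market, trade} is described mainly by the concept "Economics", because "market" is frequent in the document and specific to that concept, while "trade" is shared with "Sports" and therefore weighted lower. A real domain ontology would also carry hierarchical relations, which this sketch ignores.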
