MetaPAD: Meta Pattern Discovery from Massive Text Corpora

Mining textual patterns in news, tweets, papers, and many other kinds of text corpora has been an active theme in text mining and NLP research. Previous studies adopt a dependency parsing-based approach to pattern discovery. However, the parsing results lose the rich context around entities in the patterns, and the process is costly for large-scale corpora. In this study, we propose a novel typed textual pattern structure, called a meta pattern, which extends to a frequent, informative, and precise subsequence pattern in a certain context. We propose an efficient framework, called MetaPAD, which discovers meta patterns from massive corpora with three techniques: (1) it develops a context-aware segmentation method that determines pattern boundaries with a learnt pattern quality assessment function, which avoids costly dependency parsing and generates high-quality patterns; (2) it identifies and groups synonymous meta patterns from multiple facets: their types, contexts, and extractions; and (3) it examines the type distributions of entities in the instances extracted by each group of patterns, and looks for appropriate type levels to make the discovered patterns precise. Experiments demonstrate that the proposed framework efficiently discovers high-quality typed textual patterns from different genres of massive corpora and facilitates information extraction.
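To make the notion of a typed subsequence pattern concrete, the following is a minimal sketch and not the MetaPAD implementation. It assumes entity mentions have already been replaced by type tokens written with a leading `$` (a notation chosen here for illustration), and it only counts frequent contiguous token sequences that contain at least one type token. The function name `candidate_meta_patterns` and the parameters `min_support` and `max_len` are hypothetical. The learnt quality function, the context-aware segmentation, the synonym grouping, and the type-level adjustment described in the abstract are not modeled.

```python
from collections import Counter


def candidate_meta_patterns(sentences, min_support=2, max_len=5):
    """Count contiguous token n-grams that contain at least one $TYPE token.

    Illustrative frequency filter only; the quality assessment described
    in the abstract is not implemented here.
    """
    counts = Counter()
    for tokens in sentences:
        seen = set()  # count each n-gram at most once per sentence
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if any(tok.startswith("$") for tok in gram):
                    seen.add(gram)
        counts.update(seen)
    # Keep only n-grams that appear in at least min_support sentences.
    return {g: c for g, c in counts.items() if c >= min_support}


# Toy corpus (made up for illustration); entities are already typed.
sentences = [
    "president $PERSON of $COUNTRY visited".split(),
    "the president $PERSON of $COUNTRY said".split(),
    "$PERSON , president of $COUNTRY".split(),
]
patterns = candidate_meta_patterns(sentences)
```

On this toy input, `("president", "$PERSON", "of", "$COUNTRY")` survives the support filter, while `("president", "of", "$COUNTRY")` occurs in only one sentence and is dropped. Frequency alone cannot decide where a pattern should begin and end, which is the gap the learnt quality function is described as filling.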
