Successful Data Mining Methods for NLP

Historically, Natural Language Processing (NLP) has focused on understanding unstructured data (speech and text), while Data Mining (DM) has mainly focused on massive structured or semi-structured datasets. The two fields have also followed different research philosophies and principles. For example, NLP aims at deep understanding of individual words, phrases and sentences (“micro-level”), whereas DM, when applied to text data, aims at high-level understanding, discovery and synthesis of the most salient information from a large set of documents (“macro-level”). Nevertheless, they share the same goal of distilling knowledge from data. In the past five years, the two areas have interacted intensively and enhanced each other through many successful text mining tasks. This progress has benefited mainly from innovative intermediate representations such as “heterogeneous information networks” [Han et al., 2010, Sun et al., 2012b].

However, successful collaboration between any two fields requires substantial mutual understanding, patience and passion among researchers. As with the application of machine learning techniques in NLP, there is usually a gap of at least several years between the creation of a new DM approach and its first successful application in NLP. More importantly, many DM approaches, such as gSpan [Yan and Han, 2002] and RankClus [Sun et al., 2009a], have demonstrated their power on structured data but remain relatively unknown in the NLP community, even though they have many obvious potential applications. On the other hand, compared to DM, the NLP community has paid more attention to developing large-scale data annotations, resources and shared tasks that cover a wide range of genres and domains. NLP can also provide the basic building blocks for many DM tasks, such as text cube construction [Tao et al., 2014]. Therefore, in many scenarios, the NLP experimental setting for a given approach is often much closer to real-world applications than its DM counterpart.

We would like to share the experiences and lessons from our extensive interdisciplinary collaborations over the past five years. The primary goal of this tutorial is to bridge the knowledge gap between these two fields and speed up the transition process. We will introduce two types of DM methods: (1) state-of-the-art DM methods that have already been proven effective for NLP; and (2) newly developed DM methods that we believe will fit specific NLP problems. In addition, we aim to suggest new research directions that better marry these two areas and lead to more fruitful outcomes. The tutorial will thus be useful for researchers from both communities. We will try to provide a concise roadmap of recent perspectives and results, and point to related DM software and resources as well as NLP data sets that are available to both research communities.

[1] Jiawei Han, et al. Constructing topical hierarchies in heterogeneous information networks, 2013, 2013 IEEE 13th International Conference on Data Mining.

[2] Jiawei Han, et al. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents, 2014, SDM.

[3] Joost N. Kok, et al. A quickstart in frequent structure mining can make a difference, 2004, KDD.

[4] Philip S. Yu, et al. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks, 2012, KDD.

[5] Yizhou Sun, et al. RankClus: integrating clustering with ranking for heterogeneous information network analysis, 2009, EDBT '09.

[6] Christian Borgelt, et al. Mining molecular fragments: finding relevant substructures of molecules, 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.

[7] Michael R. Lyu, et al. A generalized Co-HITS algorithm and its application to bipartite graphs, 2009, KDD.

[8] Taylor Cassidy, et al. The Wisdom of Minority: Unsupervised Slot Filling Validation based on Multi-dimensional Truth-Finding, 2014, COLING.

[9] Philip S. Yu, et al. Mining top-K large structural patterns in a massive network, 2011, Proc. VLDB Endow.

[10] Yizhou Sun, et al. NewsNetExplorer: automatic construction and exploration of news information networks, 2014, SIGMOD Conference.

[11] Yizhou Sun, et al. Personalized entity recommendation: a heterogeneous information network approach, 2014, WSDM.

[12] Philip S. Yu, et al. Substructure similarity search in graph databases, 2005, SIGMOD '05.

[13] Philip S. Yu, et al. Graph indexing: a frequent structure-based approach, 2004, SIGMOD '04.

[14] Philip S. Yu, et al. PathSim, 2011, Proc. VLDB Endow.

[15] Charu C. Aggarwal, et al. When will it happen?: relationship prediction in heterogeneous information networks, 2012, WSDM '12.

[16] Jiawei Han, et al. Mining heterogeneous information networks, 2010, Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '10.

[17] Philip S. Yu, et al. Mining significant graph patterns by leap search, 2008, SIGMOD Conference.

[18] C. V. Ramamoorthy, et al. Knowledge and Data Engineering, 1989, IEEE Trans. Knowl. Data Eng.

[19] Xiang Li, et al. Learning Hierarchical Relationships among Partially Ordered Objects with Heterogeneous Attributes and Links, 2012, SDM.

[20] Yizhou Sun, et al. Mining Heterogeneous Information Networks: Principles and Methodologies, 2012.

[21] Charu C. Aggarwal, et al. Co-author Relationship Prediction in Heterogeneous Bibliographic Networks, 2011, 2011 International Conference on Advances in Social Networks Analysis and Mining.

[22] Yizhou Sun, et al. Ranking-based clustering of heterogeneous information networks with star network schema, 2009, KDD.

[23] Jiawei Han, et al. gSpan: graph-based substructure pattern mining, 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.

[24] Heng Ji, et al. Collective Tweet Wikification based on Semi-supervised Graph Regularization, 2014, ACL.

[25] George Karypis, et al. Frequent subgraph discovery, 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[26] Yang Li, et al. Mining evidences for named entity disambiguation, 2013, KDD.

[27] Philip S. Yu, et al. PathSelClus: Integrating Meta-Path Selection with User-Guided Object Clustering in Heterogeneous Information Networks, 2013, TKDD.

[28] Yizhou Sun, et al. Co-Evolution of Multi-Typed Objects in Dynamic Star Networks, 2014, IEEE Transactions on Knowledge and Data Engineering.

[29] Philip S. Yu, et al. Graph OLAP: a multi-dimensional framework for graph data analysis, 2009, Knowledge and Information Systems.

[30] Sangkyum Kim, et al. Authorship classification: a discriminative syntactic tree mining approach, 2011, SIGIR.

[31] Heng Ji, et al. Exploring and inferring user-user pseudo-friendship for sentiment analysis with heterogeneous networks, 2014, Stat. Anal. Data Min.

[32] Jiawei Han, et al. CloseGraph: mining closed frequent graph patterns, 2003, KDD '03.

[33] Yinghui Wu, et al. Schemaless and Structureless Graph Querying, 2014, Proc. VLDB Endow.

[34] Ahmed El-Kishky, et al. Bringing structure to text: mining phrases, entities, topics, and hierarchies, 2014, KDD.

[35] Heng Ji, et al. Resolving Entity Morphs in Censored Data, 2013, ACL.

[36] Clare R. Voss, et al. Scalable Topical Phrase Mining from Text Corpora, 2014, Proc. VLDB Endow.