Insights from Network Structure for Text Mining

Text mining and data harvesting algorithms have become popular in the computational linguistics community. They employ patterns that specify the kind of information to be harvested, and usually bootstrap either the pattern learning or the term harvesting process (or both) in a recursive cycle, using data learned in one step to generate more seeds for the next. They therefore treat the source text corpus as a network, in which words are the nodes and the relations linking them are the edges. Results from computational network analysis, especially from studies of the World Wide Web, are thus applicable. Surprisingly, these results have not yet been broadly introduced into the computational linguistics community. In this paper we show how various results apply to text mining, how they explain some previously observed phenomena, and how they can be helpful for computational linguistics applications.
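
To make the network view concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single hypothetical surface pattern ("X and Y"), a toy corpus, and illustrative function names (`build_graph`, `bootstrap`) that do not come from the paper. Each pattern match becomes a directed edge between two words, and the bootstrapping cycle becomes a breadth-first expansion in which the terms learned in one iteration serve as the seeds for the next.

```python
import re
from collections import defaultdict

# Hypothetical harvesting pattern: "X and Y" yields an edge X -> Y.
PATTERN = re.compile(r"(\w+) and (\w+)")


def build_graph(sentences, pattern=PATTERN):
    """Turn pattern matches into a directed graph: words are nodes,
    each match (x, y) is an edge x -> y."""
    graph = defaultdict(set)
    for sentence in sentences:
        for x, y in pattern.findall(sentence):
            graph[x].add(y)
    return graph


def bootstrap(graph, seed, max_iterations=10):
    """Harvest terms reachable from a seed, one bootstrapping step at a time.

    Terms found in one iteration become the seeds of the next; the loop
    stops when an iteration yields nothing new or the limit is reached.
    """
    harvested = {seed}
    frontier = {seed}
    for _ in range(max_iterations):
        new_terms = set()
        for term in frontier:
            # graph.get avoids creating empty entries for unseen terms
            new_terms |= graph.get(term, set()) - harvested
        if not new_terms:
            break
        harvested |= new_terms
        frontier = new_terms
    return harvested


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        "lions and tigers",
        "tigers and leopards",
        "leopards and lions",
        "cars and trucks",
    ]
    graph = build_graph(corpus)
    print(sorted(bootstrap(graph, "lions")))
    print(sorted(bootstrap(graph, "trucks")))
```

In this toy graph, the seed "lions" reaches every term in its cycle, while "trucks" has no outgoing edges and harvests nothing beyond itself. The sketch is meant only to show how the choice of seed interacts with the structure of the graph.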
