Efficient Generation and Processing of Word Co-occurrence Networks Using corpus2graph

Corpus2graph is an open-source NLPapplication-oriented Python package that generates a word co-occurrence network from a large corpus. It not only contains different built-in methods to preprocess words, analyze sentences, extract word pairs and define edge weights, but also supports user-customized functions. By using parallelization techniques, it can generate a large word co-occurrence network of the whole English Wikipedia data within hours. And thanks to its nodes-edges-weight three-level progressive calculation design, rebuilding networks with different configurations is even faster as it does not need to start all over again. This tool also works with other graph libraries such as igraph, NetworkX and graph-tool as a front end providing data to boost network generation speed.

[1]  Aric Hagberg,et al.  Exploring Network Structure, Dynamics, and Function using NetworkX , 2008, Proceedings of the Python in Science Conference.

[2]  Michalis Vazirgiannis,et al.  Graph-of-word and TW-IDF: new approach to ad hoc IR , 2013, CIKM.

[3]  Xiaojin Zhu,et al.  Improving Diversity in Ranking using Absorbing Random Walks , 2007, NAACL.

[4]  Gaël Varoquaux,et al.  The NumPy Array: A Structure for Efficient Numerical Computation , 2011, Computing in Science & Engineering.

[5]  Olivier Ferret,et al.  Discovering word senses from a network of lexical cooccurrences , 2004, COLING.

[6]  Ramon Ferrer i Cancho,et al.  The small world of human language , 2001, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[7]  Gábor Csárdi,et al.  The igraph software package for complex network research , 2006 .

[8]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[9]  Dragomir R. Radev,et al.  Book Review: Graph-Based Natural Language Processing and Information Retrieval by Rada Mihalcea and Dragomir Radev , 2011, CL.

[10]  David J. Aldous,et al.  Lower bounds for covering times for reversible Markov chains and random walks on graphs , 1989 .

[11]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[12]  Roberto Navigli,et al.  Entity Linking meets Word Sense Disambiguation: a Unified Approach , 2014, TACL.

[13]  Tiago P. Peixoto,et al.  The graph-tool python library , 2014 .