Parallel Suffix Arrays for Corpus Exploration

This paper describes how recently developed techniques for suffix array construction and compression can be extended to yield a new data structure, the parallel suffix array, which is suitable as an in-memory representation of large annotated corpora and enables complex queries and fast extraction of the context of matching substrings. It also shows how parallel suffix arrays are superior to existing corpus search engines, in particular for sequential queries and for corpora that are hard to tokenize.
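As a rough illustration of the kind of lookup a suffix array supports, the following minimal sketch builds a plain, uncompressed suffix array over a raw string and uses binary search to find every occurrence of a substring together with its surrounding context. It does not implement the parallel suffix array or the compression described here; the function names (`build_suffix_array`, `find_occurrences`, `context`) and the naive sort-based construction are assumptions made for illustration only.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffixes (illustrative assumption);
    linear-time construction algorithms are considerably more involved.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return all start positions of pattern in text, in ascending order."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


def context(text: str, pos: int, length: int, width: int) -> str:
    """Return the match at pos plus up to width characters on either side."""
    return text[max(0, pos - width):pos + length + width]


if __name__ == "__main__":
    corpus = "the cat sat on the mat"
    sa = build_suffix_array(corpus)
    for pos in find_occurrences(corpus, sa, "the"):
        print(pos, repr(context(corpus, pos, len("the"), 4)))
```

Because the index works on character positions rather than tokens, the same lookup applies unchanged to text without reliable word boundaries, which is one reason suffix arrays are attractive for corpora that are hard to tokenize.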
