Scalable Visual Analytics of Massive Textual Datasets

This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools help information analysts interact with and understand large volumes of textual content through visual interfaces. By developing a parallel implementation of the text processing engine, we enable visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes the key elements of our parallelization approach and demonstrates nearly linear scaling when processing multi-gigabyte datasets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.
