ParaText: scalable solutions for processing and searching very large document collections: final LDRD report.

This report summarizes the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. To that end, we designed, implemented, and demonstrated ParaText, a scalable framework for text analysis, as a major project deliverable. We also demonstrated the benefits of visual analysis in developing text analysis algorithms, the improved performance of heterogeneous ensemble models on data classification problems, and the advantages of information-theoretic methods for user analysis and interpretation in cross-language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, and additional projects are currently in proposal.
