Document Informatics for Scientific Learning and Accelerated Discovery

Abstract This chapter presents a concept paper describing methods to accelerate the discovery and optimization of new materials by enabling faster recognition and use of important theoretical, computational, and experimental information aggregated from peer-reviewed, published materials-related scientific documents online. To gain insights for discovering new materials and to study existing ones, research and development scientists and engineers rely heavily on an ever-growing body of materials research publications, most of which are available online and which date back many decades. The major thrust of this concept paper is therefore the use of technology to (i) extract “deep” meaning from a large corpus of relevant materials science documents; (ii) navigate, cluster, and present documents in a meaningful way; and (iii) evaluate and revise the responses to materials-related queries until researchers are guided to their information destination. While the proposed methodology targets the interdisciplinary field of materials research, the tools to be developed can be generalized to enhance scientific discovery and learning across a broad swath of disciplines. The research will advance machine learning by developing hierarchical, dynamic topic models to investigate trends in materials discovery over user-specified time periods. Image-based document analysis will also benefit from machine learning tools, such as deep belief networks for classification and for separating text from document images. Finally, an interactive visualization tool that can display modeling results from both a large materials network perspective and a time-based perspective would advance visualization studies.
