Using Layout Data for the Analysis of Scientific Literature

It is said that the world's knowledge is on the Internet, while scientific knowledge resides in books, journals and conference proceedings. Both repositories are far too large to skim through manually, so we need algorithms that can cope with this volume of information. To filter, sort and ultimately mine the available information, it is vital to use every source of information we have. A common technique is to mine the text of publications, but publications are more complex than the text they contain. The position of words on the page gives clues about their meaning. Images either supplement the text or offer evidence for a proposition. Tables cannot be understood without first deciphering their rows and columns. To exploit this additional information, classic text mining techniques have to be coupled with spatial data and image data. In this chapter, we give some background on the various techniques, explain the necessary pre-processing steps, and present two case studies, one from image mining and one from table identification.
