Scientific Data and Document Processing in ChemXSeer

ChemXSeer is a digital library and data repository for the chemistry domain. The data deposited into our repository is linked with digital documents to create aggregates of resources that represent the links between the data and the articles in which the data is reported. ChemXSeer lets users annotate the data with a metadata-capturing tool; the metadata is indexed and searched to return relevant datasets. ChemXSeer also extracts chemical formulae and chemical names, disambiguates them, and indexes them to support search enhanced with domain knowledge. As search engines mature, we foresee that such vertical search engines, which employ domain-specific knowledge for information extraction and indexing, will become more popular, especially in scientific domains. Although substantial research has been pursued on information extraction from text, extracting information from tables and figures has received little attention. In the ChemXSeer project, we are building tools for the automatic extraction of tables and figures.
