IPKB: a digital library for invertebrate paleontology

In this paper, we present the Invertebrate Paleontology Knowledgebase (IPKB), an effort to digitize and share the Treatise on Invertebrate Paleontology. The Treatise is the most authoritative compilation of invertebrate fossil records. Unfortunately, its PDF version is simply a copy of the print volumes, and its content is not organized to support search or knowledge discovery. We extracted text and images from the Treatise, stored them in a database, and built a system for efficient browsing and searching. For image processing in particular, we segmented fossil photos from figures, recognized the embedded labels, and linked the images to the corresponding data entries. Detailed information on each genus, including fossil images, is delivered to users through a web access module. External applications (e.g., Google Earth) are integrated through web service APIs to improve the user experience. Given the rich information in the Treatise, analyzing, modeling, and understanding paleontological data matters in many areas, such as understanding evolution, understanding climate change, and finding fossil fuels. IPKB provides a general framework that aims to facilitate knowledge discovery in invertebrate paleontology and a solid foundation for future exploration. In this article, we report our initial accomplishments. The specific techniques we employed in the project, such as text parsing, image-label association, and metadata extraction, may offer insight and serve as examples for other researchers.
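
The image-label association step mentioned above can be illustrated with a minimal sketch. The abstract does not state how IPKB performs this linking, so the following is only an assumption: each segmented photo is represented by a bounding box, each recognized label by its text and center point, and a label is assigned to the photo whose box contains it or, failing that, lies nearest to it. The names `Box`, `associate_labels`, and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a segmented photo (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def distance_to(self, x: float, y: float) -> float:
        """Euclidean distance from point (x, y) to the box; 0 if the point is inside."""
        dx = max(self.x0 - x, 0.0, x - self.x1)
        dy = max(self.y0 - y, 0.0, y - self.y1)
        return hypot(dx, dy)


def associate_labels(photos, labels):
    """Map each label text to the id of the nearest photo.

    photos: dict of photo id -> Box
    labels: list of (text, x, y) tuples, where (x, y) is the label center
    Assumed heuristic only; not necessarily the method used in IPKB.
    """
    links = {}
    for text, x, y in labels:
        links[text] = min(photos, key=lambda pid: photos[pid].distance_to(x, y))
    return links


# Illustrative data: two photos side by side and two recognized labels.
photos = {"photo_a": Box(0, 0, 100, 100), "photo_b": Box(120, 0, 220, 100)}
labels = [("1a", 10, 90), ("1b", 115, 50)]
links = associate_labels(photos, labels)
```

In this example the label "1a" falls inside `photo_a`, while "1b" sits in the gap between the two photos and is assigned to the closer one, `photo_b`. A real figure layout would likely need extra rules, for instance when a label is equidistant from several photos.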
