Searching online book documents and analyzing book citations

Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections, although several books are freely available online. Academic books differ from papers in length, content, and structure. We argue that accounting for academic books is important for understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work presents a systematic study of building a search engine for books. We propose a hybrid approach for extracting the title and authors of a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules that exploit multiple regularities in numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
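To make the idea of numbering and ordering regularities concrete, the following is a minimal sketch of how a "table of contents" block might be located in extracted page text. It is not the rule set used in this work, which the abstract does not specify. The regular expression, the function names `parse_toc_line` and `find_toc_block`, and the `min_entries` threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical line pattern (an assumption, not the paper's rule):
# optional section number, a title starting with a non-digit,
# optional dot leaders, and a trailing page number.
TOC_LINE = re.compile(
    r"^\s*(?P<num>\d+(?:\.\d+)*)?\.?\s*"
    r"(?P<title>[^\d\s].*?)[\s.]*(?P<page>\d+)\s*$"
)


def parse_toc_line(line):
    """Return (section_number_tuple_or_None, title, page) or None."""
    m = TOC_LINE.match(line)
    if not m:
        return None
    num = m.group("num")
    num = tuple(int(p) for p in num.split(".")) if num else None
    return num, m.group("title").strip(), int(m.group("page"))


def find_toc_block(lines, min_entries=3):
    """Find the longest run of consecutive lines that parse as entries,
    with non-decreasing page numbers (ordering regularity) and strictly
    increasing section numbers (numbering regularity).
    Returns (start, end) with end exclusive, or None."""
    best = None
    start = prev_page = prev_num = None
    for i, line in enumerate(list(lines) + [""]):  # "" is a sentinel
        entry = parse_toc_line(line)
        ok = entry is not None
        if ok and prev_page is not None and entry[2] < prev_page:
            ok = False  # page numbers went backwards
        if ok and prev_num is not None and entry[0] is not None:
            if entry[0] <= prev_num:
                ok = False  # section numbering did not advance
        if ok:
            if start is None:
                start = i
            prev_page = entry[2]
            if entry[0] is not None:
                prev_num = entry[0]
            continue
        # The current run ends here; keep it if it is long enough.
        if start is not None and i - start >= min_entries:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
        # A valid entry that broke the ordering starts a new run.
        if entry is not None:
            start, prev_page, prev_num = i, entry[2], entry[0]
        else:
            start = prev_page = prev_num = None
    return best


sample = [
    "Contents",
    "1 Introduction ..... 1",
    "1.1 Motivation ... 3",
    "2 Methods 15",
    "Preface text here without numbers",
    "3 Results 10",
]
toc_span = find_toc_block(sample)
```

On the sample above, `toc_span` covers the three consecutive entries following the "Contents" heading. A real extractor would also need to handle multi-line entries, Roman numeral pages, and unnumbered front matter, which this sketch ignores.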
