Finding information in books: Characteristics of full-text searches in a collection of 10 million books

Searching large collections of digitized books is a relatively new area in information-seeking and retrieval research, made possible by initiatives such as Google Books and the HathiTrust Digital Library. The availability of large full-text book collections is transforming how users search and interact with information in books, but the characteristics of these changes are unknown. This paper aims to provide insight into the characteristics of full-text searches in a large collection of digitized books and is the first step in a broader research agenda intended to improve book retrieval. To better understand the types of queries that users issue to full-text book collections, we analyzed a full year of anonymized query logs from the HathiTrust Digital Library full-text search engine. We also manually classified a random sample of 600 queries to develop a taxonomy of book search query types. We found that users are beginning to search for information in books instead of searching for books. Searches still largely follow bibliographic models, but, as expected, new types of searches are beginning to take advantage of full-text capabilities. Comparing the results of our query log analysis with searches in other domains, we also found similar search patterns, including short queries, sessions with only a few queries, and users viewing only a few pages of results per query. We discuss how these findings can be used to characterize users of large full-text book collections.
