Using Context to Assist in Personal File Retrieval (CMU-CS-06-147)

Personal data is growing at an ever-increasing rate, fueled by a growing market for personal computing and by dramatic growth in the storage available on these platforms. Users, no longer limited in what they can store, now face the problem of organizing their data so that they can find it again later. Unfortunately, as data sets grow, so does the complexity of organizing them. This problem has driven rapid growth in search tools for the personal computing space, designed to help users locate data within their disorganized file space. Despite this growth, local file search tools are often inaccurate. Such inaccuracy is a long-standing problem for file data, as evidenced by the downfall of attribute-based naming systems, which often relied on content analysis to assign meaningful attributes to files for automated organization. While file search tools have lagged behind, search tools designed for the World Wide Web have found widespread acclaim. Interestingly, despite significant increases in non-textual data on the web (e.g., images, movies), web search tools continue to be effective. This is because the web contains key information that is currently unavailable within file systems: context. By capturing context information, e.g., the links describing how data on the web is interrelated, web search tools can significantly improve the quality of search over content analysis techniques alone. This work describes Connections, a context-enhanced search tool that uses temporal locality among file accesses to infer inter-file relationships in the local file system. Once identified, these inter-file relationships provide context information similar to that available on the World Wide Web. Connections leverages this context to improve the quality of file search results. Specifically, user studies with Connections show improvements in both precision and recall (i.e., fewer false positives and false negatives) over content-only search, and a live deployment found that users experienced reduced search time with Connections compared to content-only search.
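
To make the idea of deriving inter-file relationships from temporal locality more concrete, the following is a minimal sketch, not the algorithm used by Connections. It assumes a file-access trace of `(timestamp, path)` pairs sorted by time, links files that are accessed within a fixed time window of each other, and then extends a set of content-search hits with strongly linked neighbors. The function names, the 30-second window, and the `min_weight` threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_relation_graph(trace, window_seconds=30.0):
    """Build a weighted, directed graph from a time-sorted access trace.

    trace: iterable of (timestamp_in_seconds, path) pairs, sorted by timestamp.
    An edge earlier -> later gains weight each time `later` is accessed
    within `window_seconds` after `earlier` (an assumed notion of locality).
    """
    graph = defaultdict(lambda: defaultdict(int))
    recent = []  # accesses that still fall inside the sliding window
    for ts, path in trace:
        # Drop accesses that are too old to count as temporally local.
        recent = [(t, p) for t, p in recent if ts - t <= window_seconds]
        for _, earlier in recent:
            if earlier != path:
                graph[earlier][path] += 1
        recent.append((ts, path))
    return graph


def expand_results(content_hits, graph, min_weight=2):
    """Append files strongly related to the content-only hits.

    content_hits: list of paths returned by a content-only search.
    min_weight: hypothetical threshold below which a link is ignored.
    """
    expanded = list(content_hits)
    seen = set(content_hits)
    for hit in content_hits:
        neighbors = sorted(graph.get(hit, {}).items(), key=lambda kv: -kv[1])
        for neighbor, weight in neighbors:
            if weight >= min_weight and neighbor not in seen:
                seen.add(neighbor)
                expanded.append(neighbor)
    return expanded


# Illustrative trace: a paper and its figure are edited together twice.
trace = [
    (0, "paper.tex"),
    (5, "figure.png"),
    (100, "paper.tex"),
    (110, "figure.png"),
    (500, "notes.txt"),
]
graph = build_relation_graph(trace)
results = expand_results(["paper.tex"], graph)
```

In this sketch a content-only search that matches only `paper.tex` would also surface `figure.png`, a non-textual file that content analysis alone could not match. A real system would need to decide how to weight, age, and rank such links; those choices are left open here.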
