Predicting Download Directories for Web Resources

Browsing the web is one of the most common activities users engage in, and downloading web resources of interest, such as images, documents, and music, is part of this process. However, users often prefer to save a resource temporarily to a default path that is easy to access (e.g., their "Desktop") rather than select the directory where they would eventually place it. This suggests that existing user interfaces are not as effective for this task as users would like. Instead of proposing a different user interface, in this paper we address the problem at its core and propose a methodology that suggests the most likely directory in which the user would eventually save the file. Future interfaces can therefore also benefit from our technique. We provide a formal definition of the problem and propose a classification framework to tackle it. We present our overall solution, the Directory Download PrediCtor, or DiDoCtor for short. We give experimental evidence of its effectiveness by implementing our approach as part of a widely used browser and evaluating it with real user activity. We also discuss lessons learned from this process regarding efficiency.
