InfoSuggest: A System for Automated Information Gathering: With a Real-World Case Study

Departments in many organizations treat the World Wide Web as an important information source and need to stay up to date with current information in their domain. Because of the volume of available information, gathering it is time-consuming, and many organizations dedicate entire teams to the task. In this paper, we present InfoSuggest, a system for end-to-end information gathering from the web. InfoSuggest uses machine learning to make this focused information-gathering process more efficient. We employ a semi-supervised document classification method, Transductive Support Vector Machines (TSVMs), to learn user preferences from example articles that users provide. We also devise TSVM-Meta, a strategy for selecting unlabeled data that is suited to an information-gathering setting. We describe the system architecture and present a case study of information gathering for food safety in the environmental health department of a government agency. Our experiments show that the system improves efficiency by as much as 35% by making relevant content easier to find.
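The general idea of learning preferences from a few labeled example articles plus a pool of unlabeled ones can be illustrated with a minimal sketch. Note the assumptions: this is not TSVM or TSVM-Meta, which the abstract does not specify in detail. It swaps in a much simpler technique, self-training with a nearest-centroid classifier over bag-of-words vectors, and all function names, the `margin` threshold, and the sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average term-count vector over a non-empty list of documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(doc, pos_c, neg_c):
    """Positive when the document is closer to the relevant centroid."""
    return cosine(doc, pos_c) - cosine(doc, neg_c)


def self_train(relevant, irrelevant, unlabeled, rounds=3, margin=0.1):
    """Self-training: confidently scored unlabeled documents are given
    pseudo-labels and added to the training sets. `margin` is an
    illustrative confidence threshold, not a value from the paper."""
    pos = [vectorize(t) for t in relevant]
    neg = [vectorize(t) for t in irrelevant]
    pool = [vectorize(t) for t in unlabeled]
    for _ in range(rounds):
        pos_c, neg_c = centroid(pos), centroid(neg)
        remaining = []
        for doc in pool:
            s = score(doc, pos_c, neg_c)
            if s > margin:
                pos.append(doc)
            elif s < -margin:
                neg.append(doc)
            else:
                remaining.append(doc)
        if len(remaining) == len(pool):
            break  # nothing was pseudo-labeled in this round
        pool = remaining
    return centroid(pos), centroid(neg)


def is_relevant(text, model):
    """Classify a new article with the (pos_centroid, neg_centroid) model."""
    pos_c, neg_c = model
    return score(vectorize(text), pos_c, neg_c) > 0


# Hypothetical example articles for a food-safety setting.
relevant = ["salmonella outbreak linked to contaminated eggs",
            "food recall after listeria contamination"]
irrelevant = ["football team wins championship match",
              "stock market closes higher today"]
unlabeled = ["restaurant closed after salmonella contamination",
             "championship football final today"]

model = self_train(relevant, irrelevant, unlabeled)
baseline = self_train(relevant, irrelevant, [])
```

In this toy setup, the phrase "restaurant closed" shares no words with the labeled examples, so `baseline` cannot score it; `model` can, because the unlabeled restaurant article was pseudo-labeled as relevant during self-training. A real system would need a proper classifier and careful selection of which unlabeled documents to trust.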
