Paper information - InfoSuggest: A System for Automated Information Gathering: With a Real-World Case Study

Departments in many organizations treat the World Wide Web as an important information source and need to stay up to date with current information in their domain. Such information gathering is time-consuming because of the overload of available information, and many organizations have dedicated teams for this task. In this paper, we present InfoSuggest, a system for end-to-end information gathering from the web. InfoSuggest improves the efficiency of this focused information gathering process through machine learning. We employ a semi-supervised document classification method, Transductive Support Vector Machines (TSVMs), to learn user preferences from example articles that users provide. We also devise TSVM-Meta, a strategy for unlabeled data selection that is applicable in an information gathering setting. We discuss the system architecture and present a case study of information gathering for food safety in an environmental health department of a government agency. Our experiments demonstrate that the system improves efficiency by as much as 35% by making it easier to find relevant content.
