Cooperative strategy for web data mining and cleaning

The Internet and World Wide Web give information gathering systems easy access to a huge volume of information, much of it of low quality, so filtering out irrelevant information has become a major challenge. This paper proposes a Web data mining and cleaning strategy for information gathering. A data-mining model is presented for data that come from multiple agents. Based on this model, a data-cleaning algorithm is presented to eliminate irrelevant data. To evaluate the data-cleaning strategy, the mining model is interpreted in terms of evidence theory. An experiment on Web data is also conducted to evaluate the strategy. The experimental results show that the proposed strategy is efficient and promising.
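
The abstract does not spell out the mining model or its evidence-theoretic interpretation. As a rough illustration only, the sketch below shows one standard operation from evidence (Dempster-Shafer) theory, Dempster's rule of combination, applied to relevance judgments from two agents. It is not the paper's algorithm: the function names `combine` and `belief`, the two-element frame, the mass values, and the 0.5 cleaning threshold are all assumptions made for this example.

```python
from itertools import product

# Hypothetical frame of discernment: a Web document is relevant or irrelevant.
REL = frozenset({"relevant"})
IRR = frozenset({"irrelevant"})
FRAME = REL | IRR  # mass on the whole frame expresses ignorance


def combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            # Mass assigned to contradictory hypotheses.
            conflict += wa * wb
    norm = 1.0 - conflict
    if norm <= 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalise so the combined masses sum to 1.
    return {s: w / norm for s, w in combined.items()}


def belief(m, hypothesis):
    """Belief in a hypothesis: total mass of the focal sets contained in it."""
    return sum(w for s, w in m.items() if s <= hypothesis)


# Assumed (made-up) evidence from two agents about the same document.
agent_1 = {REL: 0.6, FRAME: 0.4}
agent_2 = {REL: 0.5, IRR: 0.2, FRAME: 0.3}

fused = combine(agent_1, agent_2)

# Assumed cleaning rule for this sketch: keep the document only if the
# combined belief in "relevant" reaches a chosen threshold.
keep_document = belief(fused, REL) >= 0.5
```

Under these assumptions, a cleaning step would discard documents whose combined belief in relevance falls below the threshold. The threshold and the shape of the mass functions are design choices that the abstract leaves open.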
