Mining for Information Discovery on the Web: Overview and Illustrative Research

The Web has become fertile ground for numerous research activities in mining. In this chapter, we discuss research on finding targeted information on the Web. First, we briefly survey the research area, focusing on two key issues: (a) mining to impose structure over Web data, for example by building taxonomies and portals, to aid Web navigation, and (b) mining to build information processing systems, such as search engines, question answering systems, and data integration systems. Next, we describe two recent Web mining projects that illustrate the use of mining techniques to address these two issues. We conclude by briefly discussing novel research opportunities in mining for information discovery on the Web.
