First European Web Mining Forum

Abstract. This paper presents a novel method for extracting information from collections of Web pages across different sites. Our method uses a standard wrapper induction algorithm and exploits named entity information. We introduce the idea of post-processing the extraction results to resolve ambiguous facts and improve the overall extraction performance. Post-processing exploits two additional sources of information: fact transition probabilities, based on a trained bigram model, and confidence probabilities, estimated for each fact by the wrapper induction system. A multiplicative model based on the product of those two probabilities is also considered for post-processing. Experiments were conducted on pages describing laptop products, collected from many different sites and in four different languages. The results highlight the effectiveness of our approach.

1 Introduction

Wrapper induction (WI) [7] aims to generate extraction rules, called wrappers

[1]  Johannes Gehrke,et al.  DEMON: mining and monitoring evolving data , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[2]  Wynne Hsu,et al.  Discovering the set of fundamental rule changes , 2001, KDD '01.

[3]  Bettina Berendt,et al.  Using Site Semantics to Analyze, Visualize, and Support Navigation , 2004, Data Mining and Knowledge Discovery.

[4]  Steffen Staab,et al.  OntoEdit: Collaborative Ontology Development for the Semantic Web , 2002, SEMWEB.

[5]  Jaideep Srivastava,et al.  Web mining: information and pattern discovery on the World Wide Web , 1997, Proceedings Ninth IEEE International Conference on Tools with Artificial Intelligence.

[6]  Jaideep Srivastava,et al.  Web usage mining: discovery and applications of usage patterns from Web data , 2000, SKDD.

[7]  Jaideep Srivastava,et al.  Selecting the right interestingness measure for association patterns , 2002, KDD.

[8]  Steven L. Lytinen,et al.  Concept Based Query Enhancement in the ARCH Search Agent , 2003, International Conference on Internet Computing.

[9]  Rahul Singh,et al.  Browsing Schedules - An Agent-Based Approach to Navigating the Semantic Web , 2002, SEMWEB.

[10]  Loriene Roy,et al.  Content-based book recommending using learning for text categorization , 1999, DL '00.

[11]  Xiaodong Chen,et al.  Mining Temporal Features in Association Rules , 1999, PKDD.

[12]  Myra Spiliopoulou,et al.  Analysis of navigation behaviour in web sites integrating multiple information systems , 2000, The VLDB Journal.

[13]  William W. Cohen,et al.  A flexible learning system for wrapping tables and lists in HTML documents , 2002, WWW.

[14]  Georgios Paliouras,et al.  Annotating Web pages for the needs of Web Information Extraction Applications , 2003, WWW.

[15]  Rajeev Rastogi,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD 2000.

[16]  Jaideep Srivastava,et al.  Automatic personalization based on Web usage mining , 2000, CACM.

[17]  Oren Etzioni,et al.  Adaptive Web Sites: Automatically Synthesizing Web Pages , 1998, AAAI/IAAI.

[18]  Maria L. Gini,et al.  A Multi-Agent Negotiation Testbed for Contracting Tasks with Temporal and Precedence Constraints , 2002, Int. J. Electron. Commer.

[19]  I. V. Ramakrishnan,et al.  Extraction techniques for mining services from Web sources , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.

[20]  Myra Spiliopoulou,et al.  Data Mining for Measuring and Improving the Success of Web Sites , 2004, Data Mining and Knowledge Discovery.

[21]  Constantin V. Negoita,et al.  On Fuzzy Systems , 1978 .

[22]  Giuseppe Psaila,et al.  Querying Shapes of Histories , 1995, VLDB.

[23]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[24]  Timothy W. Finin,et al.  Yahoo! as an ontology: using Yahoo! categories to describe documents , 1999, CIKM '99.

[25]  Mark Levene,et al.  An Heuristic to Capture Longer User Web Navigation Patterns , 2000, EC-Web.

[26]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[27]  Sunita Sarawagi,et al.  Mining Surprising Patterns Using Temporal Description Length , 1998, VLDB.

[28]  Fabio Abbattista,et al.  Extraction of User Profiles by Discovering Preferences through Machine Learning , 2003, IIS.

[29]  Myra Spiliopoulou,et al.  Efficient Monitoring of Patterns in Data Mining Environments , 2003, ADBIS.

[30]  Philip S. Yu,et al.  Caching on the World Wide Web , 1999, IEEE Trans. Knowl. Data Eng.

[31]  Vassilis Christophides,et al.  Benchmarking RDF Schemas for the Semantic Web , 2002, SEMWEB.

[32]  Pattie Maes,et al.  Design and implementation of an agent-based intermediary infrastructure for electronic markets , 2000, EC '00.

[33]  C. Coombs.  A theory of data , 1965, Psychology Review.

[34]  Ernestina Menasalvas Ruiz,et al.  Subsessions: a granular approach to click path analysis , 2002, 2002 IEEE World Congress on Computational Intelligence. 2002 IEEE International Conference on Fuzzy Systems. FUZZ-IEEE'02. Proceedings (Cat. No.02CH37291).

[35]  Fabio Ciravegna,et al.  Adaptive Information Extraction from Text by Rule Induction and Generalisation , 2001, IJCAI.

[36]  S.H.G. ten Hagen,et al.  Exploration/exploitation in adaptive recommender systems , 2003 .

[37]  Andreas Hotho,et al.  Conceptual User Tracking , 2003, AWIC.

[38]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[39]  Craig A. Knoblock,et al.  Hierarchical Wrapper Induction for Semistructured Information Sources , 2004, Autonomous Agents and Multi-Agent Systems.

[40]  Bamshad Mobasher,et al.  Discovery of Aggregate Usage Profiles for Web Personalization , 2000 .

[41]  T. W. Anderson.  An Introduction to Multivariate Statistical Analysis , 1959 .

[42]  Alex Alves Freitas,et al.  On Objective Measures of Rule Surprisingness , 1998, PKDD.

[43]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[44]  Michael J. Pazzani,et al.  Learning and Revising User Profiles: The Identification of Interesting Web Sites , 1997, Machine Learning.

[45]  D. Cheung,et al.  Maintenance of Discovered Association Rules: When to update? , 1997, DMKD.

[46]  Ernestina Menasalvas Ruiz,et al.  A Granular Approach for Analyzing the Degree of Affability of a Web Site , 2002, Rough Sets and Current Trends in Computing.

[47]  Alberto O. Mendelzon,et al.  Database techniques for the World-Wide Web: a survey , 1998, SGMD.

[48]  Yiming Ma,et al.  Analyzing the interestingness of association rules from the temporal dimension , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[49]  Myra Spiliopoulou,et al.  The Impact of Site Structure and User Environment on Session Reconstruction in Web Usage Analysis , 2002, WEBKDD.

[50]  William W. Cohen,et al.  Learning Page-Independent Heuristics for Extracting Data from Web Pages , 1999, Comput. Networks.

[51]  Jian Pei,et al.  Mining Access Patterns Efficiently from Web Logs , 2000, PAKDD.

[52]  Ke Wang,et al.  Discovering Patterns from Large and Dynamic Sequential Data , 1997, Journal of Intelligent Information Systems.

[53]  Andreas Hotho,et al.  Towards Semantic Web Mining , 2002, SEMWEB.

[54]  Myra Spiliopoulou,et al.  Monitoring Change in Mining Results , 2001, DaWaK.

[55]  Mário J. Silva,et al.  Web Access Mining from an On-line Newspaper Logs , 2001 .

[56]  Oren Etzioni,et al.  A scalable comparison-shopping agent for the World-Wide Web , 1997, AGENTS '97.

[57]  Andrew McCallum,et al.  Building Domain-Specific Search Engines with Machine Learning Techniques , 1999 .

[58]  David Wai-Lok Cheung,et al.  Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules , 1998, Data Mining and Knowledge Discovery.

[59]  Nicola Fanizzi,et al.  Multistrategy Theory Revision: Induction and Abduction in INTHELEX , 2004, Machine Learning.

[60]  Ian Horrocks,et al.  Querying the Semantic Web: A Formal Approach , 2002, SEMWEB.

[61]  Carlos Bento,et al.  A Metric for Selection of the Most Promising Rules , 1998, PKDD.

[62]  Cyrus Shahabi,et al.  Knowledge discovery from users Web-page navigation , 1997, Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications.

[63]  Hans-Peter Kriegel,et al.  Incremental Clustering for Mining in a Data Warehousing Environment , 1998, VLDB.

[64]  Ed H. Chi,et al.  The scent of a site: a system for analyzing and predicting information scent, usage, and usability of a Web site , 2000, CHI.

[65]  D. F. Morrison,et al.  Multivariate Statistical Methods , 1968 .

[66]  Prabhakar Raghavan,et al.  A Linear Method for Deviation Detection in Large Databases , 1996, KDD.

[67]  Anupam Joshi,et al.  Mining web access logs using a fuzzy relational clustering algorithm based on a robust estimator , 1999, WWW 1999.

[68]  Donato Malerba,et al.  A Logic Framework for the Incremental Inductive Synthesis of Datalog Theories , 1997, LOPSTR.

[69]  Bamshad Mobasher,et al.  Using Ontologies to Discover Domain-Level Web Usage Profiles , 2002 .

[70]  Mike P. Papazoglou,et al.  Agent-oriented technology in support of e-business , 2001, CACM.

[71]  Edward Omiecinski,et al.  Efficient Mining of Association Rules in Large Dynamic Databases , 1998, BNCOD.

[72]  Bettina Berendt,et al.  Detail and Context in Web Usage Mining: Coarsening and Visualizing Sequences , 2001, WEBKDD.

[73]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[74]  Dino Pedreschi,et al.  Web log data warehousing and mining for intelligent web caching , 2001, Data Knowl. Eng.

[75]  Bamshad Mobasher,et al.  A Road Map to More Effective Web Personalization: Integrating Domain Knowledge with Web Usage Mining , 2003, International Conference on Internet Computing.

[76]  Prem Melville, Raymond J. Mooney, Ramadass Nagarajan.  Content-Boosted Collaborative Filtering , 2001 .

[77]  Daniel A. Keim,et al.  On Knowledge Discovery and Data Mining , 1997 .

[78]  Dunja Mladenic,et al.  Text-learning and related intelligent agents: a survey , 1999, IEEE Intell. Syst.

[79]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[80]  GeunSik Jo,et al.  Collaborative Information Filtering by Using Categorized Bookmarks on the Web , 2001, INAP.

[81]  Johannes Gehrke,et al.  A framework for measuring changes in data characteristics , 1999, PODS '99.

[82]  Michael J. A. Berry,et al.  Data mining techniques - for marketing, sales, and customer support , 1997, Wiley computer publishing.

[83]  Myra Spiliopoulou,et al.  Improving the Effectiveness of a Web Site with Web Usage Mining , 1999, WEBKDD.

[84]  David Wai-Lok Cheung,et al.  A General Incremental Technique for Maintaining Discovered Association Rules , 1997, DASFAA.

[85]  Sanjay Ranka,et al.  An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases , 1997, KDD.

[86]  R. Watson,et al.  THE WORLD WIDE WEB AS AN ADVERTISING MEDIUM , 1996 .

[87]  D. Kleinbaum,et al.  Applied Regression Analysis and Other Multivariate Methods , 1978 .

[88]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[89]  Bradley N. Miller,et al.  Using filtering agents to improve prediction quality in the GroupLens research collaborative filtering system , 1998, CSCW '98.

[90]  Niall M. Adams,et al.  The impact of changing populations on classifier performance , 1999, KDD '99.

[91]  Roni Rosenfeld,et al.  Learning Hidden Markov Model Structure for Information Extraction , 1999 .

[92]  Ron Kohavi,et al.  Integrating e-commerce and data mining: architecture and challenges , 2000, Proceedings 2001 IEEE International Conference on Data Mining.

[93]  Lars Schmidt-Thieme,et al.  Mining Web Navigation Path Fragments , 2002 .

[94]  José Oncina,et al.  Learning Stochastic Regular Grammars by Means of a State Merging Method , 1994, ICGI.