Advanced Techniques in Web Data Pre-processing and Cleaning

Central to successful e-business is the construction of web sites that attract users, capture their preferences, and entice them into making a purchase. Web mining applies a diverse set of data mining techniques to categorize both the content and the structure of web sites, with the goal of aiding e-business. It requires knowledge of the web site structure (the hyperlink graph), the web content (the vector model), and user sessions (the sequence of pages each user visits on a site). Much of the data for web mining can be noisy. The noise originates from many sources, for example, undocumented changes to the web site structure and content, differing interpretations of text and media semantics, and web logs that lack individual user identification. There may also be no record of how many times a specific page was visited in a session, because the page may have been served from a proxy or web browser cache rather than from the web server. Such noise presents a challenge for web mining. This chapter presents the issues involved in cleaning web data, and approaches to doing so, in preparation for web mining analysis.
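As a rough illustration of the session problem described above, the following Python sketch groups log records into sessions with a simple inactivity timeout. This is a commonly used heuristic, not necessarily the approach taken in the chapter. The record layout, the visitor key built from IP address and user agent, the `sessionize` function name, and the 30-minute threshold are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group (visitor_key, timestamp, url) records into sessions.

    A new session starts whenever the gap between two consecutive
    requests from the same visitor key exceeds `timeout`.
    The visitor key is a hypothetical stand-in for a real user id,
    which web logs often lack.
    """
    by_visitor = defaultdict(list)
    for key, ts, url in records:
        by_visitor[key].append((ts, url))

    sessions = []
    for key, hits in by_visitor.items():
        hits.sort(key=lambda h: h[0])  # order requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Gap too long: close the current session, start a new one.
                sessions.append((key, current))
                current = []
            current.append(url)
        sessions.append((key, current))
    return sessions


# Hypothetical log records: (IP|user agent, timestamp, requested page).
records = [
    ("10.0.0.1|Mozilla", datetime(2010, 1, 1, 9, 0), "/home"),
    ("10.0.0.1|Mozilla", datetime(2010, 1, 1, 9, 5), "/products"),
    ("10.0.0.1|Mozilla", datetime(2010, 1, 1, 11, 0), "/home"),
    ("10.0.0.2|Opera", datetime(2010, 1, 1, 9, 1), "/cart"),
]
sessions = sessionize(records)
for key, pages in sessions:
    print(key, pages)
```

A heuristic of this kind cannot recover page views that were answered by a cache, and it merges distinct users who share an IP address and user agent, which is exactly the sort of noise the abstract refers to.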
