Web Document Modeling

A common issue in adaptive Web-based systems is the modeling of documents. Such documents represent domain-specific information for a number of purposes. Application areas such as Information Search, Focused Crawling and Content Adaptation (among many others) benefit from several techniques and approaches to model documents effectively. For example, a document usually needs preliminary processing to extract the relevant information in a format that the system can process automatically. The objective of this chapter is to support other chapters by providing a basic overview of the most common and useful techniques and approaches related to document modeling. It describes high-level techniques to model Web documents, such as the Vector Space Model, and a number of AI approaches, such as Semantic Networks, Neural Networks and Bayesian Networks. This chapter is not meant to be a substitute for more comprehensive discussions of the topics presented. Rather, it provides a brief and informal introduction to the main concepts of document modeling, with a focus on the systems presented in the rest of the book as concrete examples of those concepts.
