Information categorization in web pages and sites

Today, surfing the net is no longer limited to searching for scientific information; a generic user is interested in many kinds of information, about business, music, travel and so on. When accessing web documents, however, the lack of explicit structure makes data semantics hard to understand, so comprehension of the logical organization of web data relies on the user's intuition of the author's underlying schema. In this paper, we present an approach to web structuring based on the analysis of the structure and the semantics of both web pages and sites, in order to discover hidden schemas and provide them to users. The intended benefits of this work are to facilitate navigation inside web documents and sites, to promote the use of more powerful, semantic-based search methods, and to allow better management and redesign of pages and sites.
