The WebCAT framework: automatic generation of meta-data for Web resources

Automated methods for resource annotation are a clear necessity, as the success of the Semantic Web depends on the availability of Web resources with meta-data conforming to known standards and ontologies. This paper describes the WebCAT framework for automatically generating RDF descriptions of Web pages. We present an overview of the system and the algorithms involved, with emphasis on typical issues in processing Web data.
