A case for automated large-scale semantic annotation

Abstract. This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application built on that platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau that provides metadata about these annotations. To our knowledge, this is the largest-scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results, with information about acquiring and using the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
