SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation

This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding these annotations. To our knowledge, this is the largest-scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
