Mining for and from the Semantic Web 2004 (SWM 2004)

Although vast amounts of textual data are freely available, many NLP algorithms exploit only a minute percentage of it. In this paper, we study the challenges of working at the terascale and survey the reasons why researchers are not fully utilizing available resources. As a case study, we present a terascale algorithm for mining is-a relations that outperforms a state-of-the-art, linguistically rich method.
