Combining Data-Driven and Semantic Approaches for Text Mining

While the amount of structured data published on the Web keeps growing (fostered in particular by the Linked Open Data initiative), the Web still consists mainly of unstructured, in particular textual, content and is therefore a Web for human consumption. An important question is thus which techniques are most suitable for enabling people to effectively access the large body of unstructured information available on the Web, whether it is semantic or not. While the hope is that semantic technologies can be combined with standard Information Retrieval approaches to enable more accurate retrieval, some researchers have argued against this view. They claim that only data-driven or inductive approaches are applicable to tasks that require organizing unstructured (mainly textual) data for retrieval purposes. We argue that the dichotomy between data-driven/inductive and semantic approaches is in fact a false one. We further argue that bottom-up or inductive approaches can be successfully combined with top-down or semantic approaches, and we illustrate this for a number of tasks such as Ontology Learning, Information Retrieval, Information Extraction and Text Mining.
