MF-Ontology: an Ontology for the Text Mining Domain

Text mining (TM) has emerged as a definitive technique for knowledge acquisition from text. The TM process consists of several phases that prepare the text for mining, process the text, and analyze the results. Combining TM algorithms and techniques effectively and efficiently is a challenge, and most research focuses on developing new data structures, algorithms, and methods to that end. However, the TM process still lacks modeling support. The TM analyst faces many options when modeling a TM process. For instance, the analyst needs to choose the most effective solution to extract the desired knowledge. This is a complex decision: each phase of the TM process offers many algorithms and implementations for composition, and several parameters must be tuned. This scenario tends to be chaotic, and each time a new modeling effort starts, the whole ad hoc process is repeated. A first step towards modeling support is to add semantics to the TM process and to record modeling results. Using ontologies to describe the TM domain can help structure the systematic composition of the algorithms and techniques of the text mining process. When the same structure is adopted, similar models can be identified and the reuse of TM software components (web services, local applications) is facilitated. In this paper we describe MF-Ontology, an ontology for modeling activity flows tailored to the TM domain. MF-Ontology can be used to simplify the development of knowledge discovery applications based on text. It provides a reference model for the different phases of text mining tasks and for the methodologies and software available to solve a problem. Thus, MF-Ontology offers semantic help to the TM analyst in finding the most appropriate solution. We describe the design of MF-Ontology and analyze its different levels of abstraction for semantically representing the TM process. We also present an evaluation of MF-Ontology and show techniques for revising the ontology concepts based on interviews with specialists.
