Introducing a New Scalable Data-as-a-Service Cloud Platform for Enriching Traditional Text Mining Techniques by Integrating Ontology Modelling and Natural Language Processing

A good deal of the digital data produced in academia, commerce and industry consists of raw, unstructured text, such as Word documents, Excel tables, emails and web pages, often written in natural language. An important analytical task in many scientific and technological domains is to retrieve information from such text, gaining deeper insight into its content in order to obtain useful knowledge and facts about a particular domain of interest, which are often not stated explicitly. The major challenge is the size, structural complexity, and update frequency of the analysed text sets (i.e., the ‘big data’ aspect), which makes the use of traditional analysis techniques and tools infeasible. We introduce an innovative approach to analysing unstructured text data. It improves on traditional data mining techniques by adopting algorithms from ontological domain modelling, natural language processing, and machine learning. The technique is designed with parallelism in mind, which allows for high performance on large-scale Cloud computing infrastructures.
