Semantic HMC: A Predictive Model Using Multi-label Classification for Big Data

One of the biggest challenges in Big Data is extracting value from large volumes of data. To exploit that value, one must focus on extracting knowledge from Big Data sources. In this paper we present a new, simple but highly scalable process that automatically learns a label hierarchy from huge sets of unstructured text. We aim to extract knowledge from these sources using a Hierarchical Multi-Label Classification process called Semantic HMC. Semantic HMC is composed of five steps: Indexation, Vectorization, Hierarchization, Resolution and Realization. The first three steps construct the label hierarchy from the data sources. The last two steps classify new items according to the labels of that hierarchy. To perform the classification without relying heavily on the user, the process is unsupervised: no thesaurus or labeled examples are required. The process is implemented on a scalable, distributed platform in order to process Big Data.
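
To make the five steps concrete, the following is a minimal single-machine sketch in Python. It is an illustration only, not the paper's implementation: the abstract does not specify how each step works, so the co-occurrence counts, the subsumption-style parent test, the `threshold` parameter, the use of raw terms as labels, and all function names are assumptions made here.

```python
from collections import defaultdict


def index_documents(docs):
    """Indexation (assumed form): map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def cooccurrence(index):
    """Vectorization (assumed form): for each ordered term pair, count shared documents."""
    terms = list(index)
    return {(a, b): len(index[a] & index[b])
            for a in terms for b in terms if a != b}


def build_hierarchy(index, cooc, threshold=0.8):
    """Hierarchization (assumed rule): a is a parent of b when
    P(a|b) >= threshold and P(a|b) > P(b|a)."""
    parents = defaultdict(set)
    for (a, b), shared in cooc.items():
        p_a_given_b = shared / len(index[b])
        p_b_given_a = shared / len(index[a])
        if p_a_given_b >= threshold and p_a_given_b > p_b_given_a:
            parents[b].add(a)
    return parents


def classify(text, index, parents):
    """Resolution and Realization (assumed form): label a new item with the
    known terms it contains, then add every ancestor of those labels."""
    labels = {t for t in text.lower().split() if t in index}
    frontier = list(labels)
    while frontier:
        for parent in parents.get(frontier.pop(), ()):
            if parent not in labels:
                labels.add(parent)
                frontier.append(parent)
    return labels


# Hypothetical toy corpus; no labels or thesaurus are supplied.
docs = ["sport football", "sport tennis", "sport football goal", "politics election"]
index = index_documents(docs)
parents = build_hierarchy(index, cooccurrence(index))
tennis_labels = classify("tennis match", index, parents)
goal_labels = classify("goal", index, parents)
```

In this toy run the hierarchy is learned from the corpus alone, which mirrors the unsupervised setting described above. A distributed version would presumably express the counting steps as map and reduce operations, but that mapping is not described in the abstract.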
