Classifying Short Unstructured Data Using the Apache Spark Platform

People worldwide use Twitter to post updates about events that concern them directly or indirectly. Studying these posts can help identify important global events and trends. Similarly, e-commerce applications organize their products in ways that facilitate management and satisfy the needs and expectations of their customers. However, classifying data such as tweets or product descriptions remains a challenge. These data consist of short texts whose vocabulary includes abbreviated sentences, emojis, hashtags, implicit codes, and other non-standard uses of written language. Consequently, traditional text classification techniques are not effective on them. In this paper, we describe our use of the Spark platform to implement two classification strategies for processing large data collections in which each datum is a short textual description. One of our solutions uses an associative classifier, while the other is based on a multiclass Logistic Regression classifier that uses Word2Vec as a feature selection and transformation technique. Our associative classifier captures the relationships among words that uniquely identify each class, and Word2Vec captures the semantic and syntactic context of the words. In our experimental evaluation, we compared our solutions with each other and with Spark MLlib classifiers, assessing effectiveness, efficiency, and memory requirements. The results indicate that our solutions can effectively classify the millions of data instances, composed of thousands of distinct features and classes, found in our digital libraries.
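
To make the idea of an associative classifier concrete, the following is a minimal single-machine sketch in plain Python. It is not the Spark implementation described in this paper; it only illustrates the general technique of mining rules of the form "word set -> class" and classifying with the strongest matching rule. The function names (`tokenize`, `train_rules`, `classify`), the thresholds (`min_support`, `min_confidence`, `max_len`), and the toy product descriptions are all assumptions introduced for illustration.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase, split on whitespace, and return the sorted distinct words."""
    return sorted(set(text.lower().split()))


def train_rules(samples, min_support=2, min_confidence=0.6, max_len=2):
    """Mine rules (word set -> label) from (text, label) pairs.

    Hypothetical thresholds: a rule is kept when its word set co-occurs with
    the label at least `min_support` times and the fraction of the word set's
    occurrences carrying that label is at least `min_confidence`.
    """
    itemset_counts = Counter()
    itemset_label_counts = Counter()
    label_counts = Counter()
    for text, label in samples:
        words = tokenize(text)
        label_counts[label] += 1
        for size in range(1, max_len + 1):
            for itemset in combinations(words, size):
                itemset_counts[itemset] += 1
                itemset_label_counts[(itemset, label)] += 1

    rules = []
    for (itemset, label), support in itemset_label_counts.items():
        confidence = support / itemset_counts[itemset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, label, confidence, support))

    # Fall back to the most frequent training label when no rule matches.
    default_label = label_counts.most_common(1)[0][0]
    return rules, default_label


def classify(text, rules, default_label):
    """Return the label of the best matching rule, or the default label."""
    words = set(tokenize(text))
    best_key, best_label = None, default_label
    for itemset, label, confidence, support in rules:
        if words.issuperset(itemset):
            # Prefer higher confidence, then longer word sets, then support.
            key = (confidence, len(itemset), support)
            if best_key is None or key > best_key:
                best_key, best_label = key, label
    return best_label


# Toy training data (invented for this example).
samples = [
    ("apple iphone 64gb black", "phones"),
    ("samsung galaxy phone 128gb", "phones"),
    ("apple iphone case leather", "accessories"),
    ("leather case samsung galaxy", "accessories"),
    ("iphone 64gb unlocked", "phones"),
]
rules, default_label = train_rules(samples)
prediction = classify("leather case for iphone", rules, default_label)
```

In this sketch, ranking rules by confidence first means that a word set seen only with one class (such as `case` and `leather` together) outweighs a more frequent but ambiguous word (such as `iphone`). A distributed version would need to count word sets across partitions, which this example does not attempt.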
