Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

In recent years, big data analysis has become a popular means of exploiting multiple (initially valueless) sources to find relevant knowledge about real-world domains. However, many big data sources provide unstructured textual data, and analysing them properly requires tools that adequately combine big data and text-analysis techniques. With this in mind, we combined a pipelining framework, BDP4J (Big Data Pipelining For Java), with implementations of a set of text preprocessing techniques to create NLPA (Natural Language Preprocessing Architecture), an extensible open-source plugin whose preprocessing steps can be easily combined into a pipeline. NLPA can also generate datasets using either a classical token-based representation or a newer synset-based representation that can be further processed using semantic information (i.e., using ontologies). This work presents a case study of NLPA in operation, covering the transformation of raw heterogeneous big data into different dataset representations (synsets and tokens) and the use of the Weka application programming interface (API) to launch two well-known classifiers.
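The idea of chaining preprocessing steps and then emitting either a token-based or a synset-based representation can be illustrated with a minimal, self-contained Java sketch. It does not use the BDP4J or NLPA APIs: the class `PipelineSketch`, the three steps, and the hand-written token-to-synset dictionary are all hypothetical names introduced here for illustration, and the synset identifiers are placeholders rather than real ontology entries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative sketch only: NOT the BDP4J/NLPA API. All names are hypothetical.
final class PipelineSketch {

    // Step 1 (assumed): replace anything that looks like a markup tag with a space.
    static final UnaryOperator<String> STRIP_TAGS = s -> s.replaceAll("<[^>]*>", " ");

    // Step 2 (assumed): normalise case.
    static final UnaryOperator<String> TO_LOWER = s -> s.toLowerCase(Locale.ROOT);

    // Step 3 (assumed): replace every character that is not a letter, digit or whitespace.
    static final UnaryOperator<String> STRIP_PUNCT = s -> s.replaceAll("[^\\p{L}\\p{Nd}\\s]", " ");

    // Applies the steps in order; each step receives the output of the previous one.
    static String runPipeline(String raw, List<UnaryOperator<String>> steps) {
        String text = raw;
        for (UnaryOperator<String> step : steps) {
            text = step.apply(text);
        }
        return text.trim();
    }

    // Token-based representation: token -> number of occurrences.
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (text.isEmpty()) {
            return counts;
        }
        for (String token : text.split("\\s+")) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Synset-based representation: tokens that share a synset identifier are merged.
    // Tokens missing from the (hypothetical) dictionary are kept under their own name.
    static Map<String, Integer> synsetCounts(Map<String, Integer> tokens, Map<String, String> synsetOf) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        tokens.forEach((token, n) -> counts.merge(synsetOf.getOrDefault(token, token), n, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> steps = Arrays.asList(STRIP_TAGS, TO_LOWER, STRIP_PUNCT);
        String cleaned = runPipeline("<p>Cheap car, cheap AUTO!</p>", steps);

        Map<String, String> synsetOf = new HashMap<>();
        synsetOf.put("car", "syn:car");   // placeholder synset id, not a real one
        synsetOf.put("auto", "syn:car");  // synonym mapped to the same placeholder id

        Map<String, Integer> tokens = tokenCounts(cleaned);
        System.out.println("tokens:  " + tokens);
        System.out.println("synsets: " + synsetCounts(tokens, synsetOf));
    }
}
```

In this sketch the synset view collapses `car` and `auto` into one feature, which is the kind of semantic grouping a synset-based dataset is meant to expose. A real system would obtain synsets from a lexical resource rather than a hard-coded map.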
