Automatic extraction of topics on big data streams through scalable advanced analysis

Extracting words, data patterns, and topic models from streaming big data in real time is a challenging task. Many machine learning techniques applied in data mining aim to exploit online feedback by making model updates faster. However, existing solutions such as Mahout and Massive Online Analysis (MOA) do not support streaming machine learning distributed across multiple machines and are therefore not suitable for scalable deployment. This paper presents enhanced machine learning algorithms for extracting words and generating topic models from continuously arriving data. A major advantage of the proposed algorithm is that it scales to multiple machines, which makes it well suited to real-time processing of streaming data. The algorithm comprises two main methods: (a) The first method introduces a principled approach to pre-processing documents in their associated time sequence. It implements a class that detects identical files among the input files in order to reduce computation time. (b) The second method supports real-time monitoring and control of the process across multiple asynchronous text streams. In the experiments, the two methods were executed alternately, and monotonic convergence was guaranteed over the iterations. The experiments use a real-world dataset drawn from the TREC KBA Stream Corpus 2012. The results indicate that the proposed method is more robust to noise and reduces computation.
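The duplicate-detection step in method (a) can be illustrated with a minimal sketch. The abstract does not say how identical files are detected, so the content-hash approach below is an assumption, and the names `content_digest` and `drop_identical` are hypothetical rather than taken from the paper.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw document bytes."""
    return hashlib.sha256(data).hexdigest()


def drop_identical(
    docs: Iterable[Tuple[str, bytes]],
) -> Tuple[List[Tuple[str, bytes]], Dict[str, str]]:
    """Keep the first occurrence of each distinct document, in arrival order.

    Assumption: "identical" means byte-for-byte equal content.
    Returns the unique documents and a map from each dropped
    duplicate's name to the name of the document it repeats.
    """
    first_seen: Dict[str, str] = {}   # digest -> name of first document
    unique: List[Tuple[str, bytes]] = []
    duplicates: Dict[str, str] = {}
    for name, data in docs:
        digest = content_digest(data)
        if digest in first_seen:
            duplicates[name] = first_seen[digest]
        else:
            first_seen[digest] = name
            unique.append((name, data))
    return unique, duplicates


if __name__ == "__main__":
    stream = [("a.txt", b"hello"), ("b.txt", b"world"), ("c.txt", b"hello")]
    kept, dropped = drop_identical(stream)
    print([name for name, _ in kept])
    print(dropped)
```

Because the function consumes its input one document at a time and preserves arrival order, it fits a time-ordered stream; only the digest table grows with the number of distinct documents.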
