LDA Topic Modeling Based Dataset Dependency Matrix Prediction

Classification of text-based datasets has many applications in computer science, including scientific article recommendation, news article tagging, and multimedia content search assistance. We are interested in the problem of placing text-based datasets in a distributed storage system. Distributed data placement entails storing related data together at a local site, so distinguishing related data from unrelated data is a prerequisite for any such placement system. Datasets can be classified using information supplied to the system about the relatedness of each pair of datasets. When such information is not available, however, the relatedness of dataset pairs must be inferred from the content of the datasets themselves. In the literature, topic modeling has been used to measure similarity between text documents and to classify documents according to that similarity. We intend to develop a novel classification system for text-based datasets using topic modeling, as a precursor to a data placement scheme for distributed data storage systems.
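To illustrate how relatedness might be inferred from content, the following is a minimal sketch and not the authors' method. It assumes that each dataset has already been reduced to a topic distribution by an LDA model, and it derives a binary dependency matrix by thresholding a pairwise similarity. The choice of Hellinger distance, the `threshold` value, the function names, and the sample distributions are all illustrative assumptions.

```python
import math


def hellinger_similarity(p, q):
    """Return 1 minus the Hellinger distance between two discrete distributions."""
    # Bhattacharyya coefficient: equals 1.0 when the distributions are identical.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp at 0 to guard against tiny negative values from rounding.
    distance = math.sqrt(max(0.0, 1.0 - bc))
    return 1.0 - distance


def dependency_matrix(topic_dists, threshold=0.7):
    """Mark dataset pairs as related (1) when their topic similarity meets the threshold.

    The threshold of 0.7 is an arbitrary illustrative value.
    The diagonal is left at 0 because self-pairs are not considered.
    """
    n = len(topic_dists)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if hellinger_similarity(topic_dists[i], topic_dists[j]) >= threshold:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


# Hypothetical per-dataset topic distributions over three topics
# (in practice these would come from a trained LDA model).
topic_dists = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.05, 0.10, 0.85],
]

matrix = dependency_matrix(topic_dists)
for row in matrix:
    print(row)
```

In this sketch the first two datasets share a dominant topic and are marked as related, while the third is not related to either. A placement scheme could then co-locate datasets whose matrix entries are 1.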
