A New Framework for High Performance Processing of Voluminous Multisource Datasets

In this paper we present a new framework for processing high volumes of data generated by heterogeneous sources in different formats (text, image features, etc.). The framework consists of three phases. The first phase selects an appropriate data reduction technique, one that closely preserves the relevant information in the original dataset. The second phase determines a suitable algorithm for applying the selected data reduction technique. The third phase integrates the reduced datasets and prepares them for use in different models (visualization, reports, decision making, and prediction). The framework is well suited to knowledge management in data-intensive applications.
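
The three phases can be illustrated with a minimal sketch. It is not the implementation described in the paper: the selection rule, the use of Gaussian random projection as the reduction technique, the assumption that records are aligned by position across sources, and all function names (`select_technique`, `select_algorithm`, `integrate`, `run_pipeline`) are hypothetical choices made only for illustration.

```python
import math
import random


def random_projection(rows, k, seed=0):
    """Project each d-dimensional row onto k random Gaussian directions."""
    rng = random.Random(seed)
    d = len(rows[0])
    # k x d matrix with N(0, 1/k) entries, so squared lengths are
    # preserved in expectation.
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
            for _ in range(k)]
    return [[sum(p * x for p, x in zip(prow, row)) for prow in proj]
            for row in rows]


def identity(rows, k):
    """Leave the rows unchanged (k is ignored)."""
    return [list(row) for row in rows]


def select_technique(n_features, target_dim):
    # Phase 1 (hypothetical rule): reduce only sources that are wider
    # than the target dimension.
    return "random_projection" if n_features > target_dim else "identity"


def select_algorithm(technique):
    # Phase 2 (hypothetical): map the chosen technique to an implementation.
    algorithms = {"random_projection": random_projection, "identity": identity}
    return algorithms[technique]


def integrate(reduced):
    # Phase 3: concatenate the reduced feature vectors of each record.
    # Assumes record i refers to the same entity in every source.
    names = sorted(reduced)
    n_records = len(reduced[names[0]])
    return [[v for name in names for v in reduced[name][i]]
            for i in range(n_records)]


def run_pipeline(sources, target_dim):
    reduced = {}
    for name, rows in sources.items():
        technique = select_technique(len(rows[0]), target_dim)
        reduce_fn = select_algorithm(technique)
        reduced[name] = reduce_fn(rows, target_dim)
    return integrate(reduced)


# Two toy sources describing the same four records.
sources = {
    "text": [[float(i + j) for j in range(50)] for i in range(4)],
    "image": [[float(i * j) for j in range(3)] for i in range(4)],
}
combined = run_pipeline(sources, target_dim=5)
```

In this toy run the 50-feature "text" source is reduced to 5 features while the 3-feature "image" source is passed through unchanged, so each integrated record has 8 features and can be handed to a downstream model.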
