Visual Terrain Analysis of High-Dimensional Datasets

Most real-world datasets are skewed to some degree. When they are also large, they pose one of the hardest challenges in data analysis. Such datasets cannot be ignored, because they arise frequently in a wide variety of applications. Regardless of the analytic technique used, the effectiveness of the analysis can often be improved if the characteristics of the dataset are known in advance. In this paper, we propose a novel technique for preprocessing such datasets to obtain this insight. Our work is inspired by the resonance phenomenon, in which similar objects resonate to a given response function. The key analytic result of our work is the data terrain, which reveals properties of the dataset that enable effective and efficient analysis. We demonstrate our work in the context of various real-world problems. In doing so, we establish it as a tool for preprocessing data before applying computationally expensive algorithms.
