Learning techniques for information retrieval and mining in high-dimensional databases

The main focus of my research is the design of effective learning techniques for information retrieval and mining in high-dimensional databases. Retrieval and mining research has two main aspects: accuracy and efficiency. The accuracy problem is how to return results that better match the ground truth; the efficiency problem is how to evaluate users’ requests and execute learning algorithms as fast as possible. These problems are non-trivial because of the complexity of high-level semantic concepts, the heterogeneous nature of the feature space, the high dimensionality of the data representations, and the size of the databases. My dissertation is dedicated to addressing these issues. Specifically, my work makes five main contributions. The first contribution is a novel manifold learning algorithm, Local and Global Structures Preserving Projection (LGSPP), which defines salient low-dimensional representations for high-dimensional data. A small number of projection directions are sought so as to preserve both the local and the global structure of the original data. Specifically, two groups of points are extracted for each point in the dataset: the first group contains the point's nearest neighbors, and the second contains a few sampled points far away from it. These two point sets characterize the local and global structure, respectively, around the data point. The objective of the embedding is to minimize the distances between points within each local neighborhood while keeping each point far, in the embedding, from the points that are remote from it in the original space. In this way, the relationships among the data in the original space are preserved with little distortion. The second contribution is a new constrained clustering algorithm.
Conventionally, clustering is an unsupervised learning problem that partitions a dataset into a small set of clusters such that the data within each cluster are more similar to one another than to the data in other clusters. In the proposed algorithm, partial human knowledge is exploited to find better clustering results. Two kinds of constraints are integrated into the clustering
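
The LGSPP idea described above (nearest neighbors to capture local structure, a few sampled distant points to capture global structure) can be illustrated with a small NumPy sketch. This is not the dissertation's formulation, which the abstract does not give. It assumes a linear projection obtained from a generalized eigenproblem that maximizes the scatter between each point and its remote points relative to the scatter within each neighborhood. The function names `neighbor_scatter` and `lgspp_sketch`, the parameters `k_near`, `k_far` and `reg`, and the choice to sample remote points from the farther half of the distance ranking are all hypothetical.

```python
import numpy as np


def neighbor_scatter(X, k_near=5, k_far=5, seed=0):
    """Build scatter matrices from near and far point sets (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, -np.inf)  # guarantee each point ranks first in its own row
    order = np.argsort(D, axis=1)
    A_near = np.zeros((n, n))
    A_far = np.zeros((n, n))
    for i in range(n):
        # Local group: the k_near nearest neighbors (position 0 is the point itself).
        A_near[i, order[i, 1:k_near + 1]] = 1.0
        # Global group: assumption, sample k_far points from the farther half.
        far = rng.choice(order[i, n // 2:], size=k_far, replace=False)
        A_far[i, far] = 1.0
    # Symmetrize the two graphs and form their graph Laplacians.
    A_near = np.maximum(A_near, A_near.T)
    A_far = np.maximum(A_far, A_far.T)
    L_near = np.diag(A_near.sum(axis=1)) - A_near
    L_far = np.diag(A_far.sum(axis=1)) - A_far
    # X^T L X sums (x_i - x_j)(x_i - x_j)^T over graph edges.
    return X.T @ L_near @ X, X.T @ L_far @ X


def lgspp_sketch(X, n_components=2, k_near=5, k_far=5, reg=1e-6, seed=0):
    """Return a d x n_components projection matrix (hypothetical objective)."""
    S_near, S_far = neighbor_scatter(X, k_near, k_far, seed)
    d = X.shape[1]
    # Small ridge term keeps the Cholesky factorization well defined.
    S_near = S_near + reg * np.trace(S_near) / d * np.eye(d)
    C = np.linalg.cholesky(S_near)  # S_near = C C^T
    # Whitened problem: M = C^{-1} S_far C^{-T}, which is symmetric.
    M = np.linalg.solve(C, np.linalg.solve(C, S_far).T)
    M = (M + M.T) / 2.0
    _, V = np.linalg.eigh(M)  # eigenvalues in ascending order
    V_top = V[:, ::-1][:, :n_components]
    # Map back to the original coordinates: w = C^{-T} v.
    return np.linalg.solve(C.T, V_top)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 20))
    W = lgspp_sketch(X, n_components=3)
    Y = X @ W  # low-dimensional representation
    print(Y.shape)
```

The ratio form is only one way to combine the two goals; a difference of the two scatter terms, or a nonlinear embedding, would be equally consistent with the description in the abstract.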
