Condition Number Analysis of Kernel-based Density Ratio Estimation

The ratio of two probability densities can be used to solve various machine learning tasks such as covariate shift adaptation (importance sampling), outlier detection (likelihood-ratio test), and feature selection (mutual information). Recently, several methods for directly estimating the density ratio have been developed, e.g., kernel mean matching, maximum-likelihood density ratio estimation, and least-squares density ratio fitting. In this paper, we consider a kernelized variant of the least-squares method and investigate its theoretical properties from the viewpoint of the condition number, using smoothed analysis techniques: the condition number of the Hessian matrix determines both the convergence rate of optimization and the numerical stability of the solution. We show that the kernel least-squares method has a smaller condition number than a version of kernel mean matching and other M-estimators, implying that the kernel least-squares method has preferable numerical properties. We further give an alternative formulation of the kernel least-squares estimator, which is shown to possess an even smaller condition number. Numerical experiments agree with our theoretical analysis.
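To make the analyzed quantity concrete, the following minimal sketch (our own illustration, not the paper's implementation; the function names, the Gaussian-kernel basis centered at the numerator samples, and the ridge regularizer are all assumptions) fits a least-squares density-ratio model to two samples and reports the condition number of the regularized Hessian that governs optimization speed and numerical stability.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """Gaussian kernel matrix between rows of X and centers C."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ls_density_ratio(x_nu, x_de, sigma=1.0, lam=1e-3):
    """Least-squares fit of the density ratio w(x) = p_nu(x) / p_de(x).

    Basis functions are Gaussian kernels centered at the numerator
    samples; the estimator solves (H + lam*I) alpha = h, where H is the
    Hessian of the empirical quadratic objective.
    """
    C = x_nu                                   # kernel centers (assumed choice)
    K_de = gaussian_kernel(x_de, C, sigma)     # (n_de, b) design on denominator sample
    K_nu = gaussian_kernel(x_nu, C, sigma)     # (n_nu, b) design on numerator sample
    H = K_de.T @ K_de / x_de.shape[0]          # Hessian of the least-squares loss
    h = K_nu.mean(axis=0)                      # linear term
    A = H + lam * np.eye(H.shape[0])           # regularized Hessian
    alpha = np.linalg.solve(A, h)              # density-ratio coefficients
    cond = np.linalg.cond(A)                   # condition number studied in the paper
    ratio = lambda x: gaussian_kernel(x, C, sigma) @ alpha
    return ratio, cond

# Toy example: numerator N(0, 1), denominator N(0.5, 1.5^2)
rng = np.random.default_rng(0)
x_nu = rng.normal(0.0, 1.0, size=(200, 1))
x_de = rng.normal(0.5, 1.5, size=(300, 1))
w_hat, cond = ls_density_ratio(x_nu, x_de, sigma=0.7, lam=1e-2)
print("condition number of regularized Hessian:", cond)
```

Larger regularization lam shrinks the condition number of A at the cost of additional bias, which is the trade-off the condition-number analysis makes precise.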
