DESIGN BASED INCOMPLETE U-STATISTICS

U-statistics are widely used in fields such as economics, machine learning, and statistics. Although they enjoy desirable statistical properties, their computation becomes impractical as the data size $n$ increases: a U-statistic of order $d$ must evaluate $O(n^d)$ combinations. Since Blom (1976), who called such an approximation an incomplete U-statistic, many efforts have been made to approximate the original U-statistic using only a small subset of, say, $m$ combinations. To the best of our knowledge, all existing methods require $m$ to grow faster than $n$, albeit more slowly than $n^d$, for the corresponding incomplete U-statistic to be asymptotically efficient in terms of the mean squared error. In this paper, we introduce a new type of incomplete U-statistic that can be asymptotically efficient even when $m$ grows more slowly than $n$. In some cases, $m$ is only required to grow faster than $\sqrt{n}$. Our theoretical and empirical results both show significant improvements in the statistical efficiency of the new incomplete U-statistic.
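To make the computational contrast concrete, the following is a minimal Python sketch of a complete U-statistic of order $d = 2$ next to an incomplete one that averages the kernel over $m$ randomly drawn index subsets. It illustrates only the generic random-subsampling idea, not the design-based construction introduced in this paper. The function names, the variance kernel $h(x, y) = (x - y)^2 / 2$, and the `seed` parameter are assumptions chosen for illustration.

```python
import itertools
import random


def variance_kernel(x, y):
    # Order-2 kernel whose complete U-statistic is the unbiased sample variance.
    return 0.5 * (x - y) ** 2


def complete_u_statistic(data, kernel, d=2):
    # Average the kernel over all C(n, d) subsets: O(n^d) kernel evaluations.
    total = 0.0
    count = 0
    for combo in itertools.combinations(data, d):
        total += kernel(*combo)
        count += 1
    return total / count


def incomplete_u_statistic(data, kernel, m, d=2, seed=0):
    # Average the kernel over m subsets of size d, each drawn uniformly at
    # random (independently across draws). This is plain random subsampling,
    # used here only as an illustrative baseline.
    rng = random.Random(seed)
    n = len(data)
    total = 0.0
    for _ in range(m):
        idx = rng.sample(range(n), d)  # d distinct indices
        total += kernel(*(data[i] for i in idx))
    return total / m


if __name__ == "__main__":
    gen = random.Random(42)
    sample = [gen.gauss(0.0, 1.0) for _ in range(200)]
    print("complete  :", complete_u_statistic(sample, variance_kernel))
    print("incomplete:", incomplete_u_statistic(sample, variance_kernel, m=100))
```

With $n = 200$ the complete statistic evaluates $19{,}900$ pairs, whereas the incomplete version above evaluates only $m = 100$; the price is additional variance, which is the trade-off the abstract discusses.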

[1] O. Linton, et al. Testing for Stochastic Monotonicity, 2006.

[2] Contributions to Probability and Statistics in Honour of Gunnar Blom, 1985.

[3] P. Sen. Almost Sure Convergence of Generalized $U$-Statistics, 1977.

[4] M. Hušková, et al. Generalized bootstrap for studentized U-statistics: A rank statistic approach, 1993.

[5] N. Lin, et al. Fast surrogates of U-statistics, 2010, Comput. Stat. Data Anal.

[6] H. Dehling, et al. Random quadratic forms and the bootstrap for U-statistics, 1994.

[7] Tie-Yan Liu, et al. Ranking Measures and Loss Functions in Learning to Rank, 2009, NIPS.

[8] Stéphan Clémençon, et al. SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk, 2015, NIPS.

[9] Bruce G. Lindsay, et al. Variance estimation of a general U-statistic with application to cross-validation, 2014.

[10] Boxin Tang, et al. Strong orthogonal arrays and associated Latin hypercubes for computer experiments, 2013.

[11] Hang Li. Learning to Rank, 2017, Encyclopedia of Machine Learning and Data Mining.

[12] R. V. Mises. On the Asymptotic Distribution of Differentiable Statistical Functions, 1947.

[13] E. L. Lehmann, et al. Consistency and Unbiasedness of Certain Nonparametric Tests, 1951.

[14] G. K. Eagleson. Orthogonal Expansions and U-Statistics, 1979.

[15] Gunnar Blom. Some properties of incomplete U-statistics, 1976.

[16] W. Hoeffding. A Class of Statistics with Asymptotically Normal Distribution, 1948.

[17] On the asymptotic distribution of U-statistics, 1979.

[18] N. Herrndorf. An invariance principle for reduced U-statistics, 1986.

[19] Andrew Trotman, et al. Learning to Rank, 2005, Information Retrieval.

[20] P. Sen. Weak Convergence of Generalized U-statistics, 2008.

[21] S. Janson. The asymptotic distributions of incomplete U-statistics, 1984.

[22] D. Freedman, et al. Some Asymptotic Theory for the Bootstrap, 1981.

[23] G. Rempała, et al. Minimum variance rectangular designs for U-statistics, 2004.

[24] Martin Hilbert, et al. The World’s Technological Capacity to Store, Communicate, and Compute Information, 2011, Science.

[25] On Incomplete U-Statistics Having Minimum Variance, 1982.

[26] Boxin Tang. Orthogonal Array-Based Latin Hypercubes, 1993.

[27] Kengo Kato, et al. Randomized incomplete $U$-statistics in high dimensions, 2017, The Annals of Statistics.

[28] B. M. Brown, et al. Reduced $U$-Statistics and the Hodges-Lehmann Estimator, 1978.

[29] Paul Janssen, et al. Consistency of the Generalized Bootstrap for Degenerate $U$-Statistics, 1993.