Statistical Industry Classification

We give complete algorithms and source code for constructing (multilevel) statistical industry classifications, including methods for fixing the number of clusters at each level (and the number of levels). Under the hood there are clustering algorithms (e.g., k-means). However, what should we cluster? Correlations? Returns? The answer turns out to be neither, and our backtests suggest that these details make a sizable difference. We also give an algorithm and source code for building "hybrid" industry classifications, which improve off-the-shelf "fundamental" industry classifications by applying our statistical industry classification methods to them. The presentation is intended to be pedagogical and is geared toward practical applications in quantitative trading.
