Modified Machine Learning Model and Stock Classification Research Based on Unbalanced Data

With the development of the Chinese financial market, more and more investors are paying attention to the stock market. How to analyze stocks scientifically is a crucial issue for investors. For stock selection, the financial indicators of listed companies are particularly important. In practice, however, high-quality stocks are far fewer than ordinary stocks, so the dataset is unbalanced. In addition, a company's financial data is often high-dimensional and contains many irrelevant features. In this paper, we first propose a hybrid algorithm, BASMOTE, based on the borderline-SMOTE and ADASYN algorithms: it introduces the adaptive idea of ADASYN into borderline-SMOTE so as to obtain more effective and reasonable new minority examples. Second, we propose a hybrid feature selection method, HPMG, which introduces the wrapper and ensemble ideas into traditional feature selection methods. Using multi-dimensional financial indicators of A-share data from the Chinese market, we compare BASMOTE with existing over-sampling methods and HPMG with existing feature selection methods. The experimental results indicate that BASMOTE and HPMG outperform the over-sampling and feature selection methods they are compared with.
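
The abstract does not give the details of BASMOTE, so the following is only a minimal sketch of one plausible reading of "borderline-SMOTE with ADASYN's adaptive idea", not the authors' implementation. It assumes that seeds are restricted to the borderline-SMOTE "danger" set (minority points for which at least half, but not all, of the k nearest neighbours are majority) and that the number of synthetic points per seed is allocated in proportion to its majority-neighbour ratio, as ADASYN does. The function name `basmote_sketch` and its parameters are hypothetical.

```python
import numpy as np


def basmote_sketch(X, y, minority_label=1, k=5, seed=0):
    """Illustrative hybrid of borderline-SMOTE and ADASYN (assumed design)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    idx_min = np.flatnonzero(y == minority_label)
    X_min = X[idx_min]
    n_min = len(X_min)
    n_new = (len(X) - n_min) - n_min  # points needed to balance the classes
    if n_min < 2 or n_new <= 0:
        return X.copy(), y.copy()

    # k nearest neighbours of each minority point among ALL points (self excluded).
    d_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    d_all[np.arange(n_min), idx_min] = np.inf
    nn_all = np.argsort(d_all, axis=1)[:, :k]
    maj_ratio = (y[nn_all] != minority_label).mean(axis=1)

    # Borderline-SMOTE step: keep only "danger" points as seeds.
    danger = (maj_ratio >= 0.5) & (maj_ratio < 1.0)
    if not danger.any():
        return X.copy(), y.copy()

    # ADASYN step: harder seeds (higher majority ratio) get more synthetic points.
    weights = np.where(danger, maj_ratio, 0.0)
    weights = weights / weights.sum()
    counts = rng.multinomial(n_new, weights)

    # Interpolate between each seed and one of its minority-class neighbours.
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d_min, np.inf)
    nn_min = np.argsort(d_min, axis=1)[:, : min(k, n_min - 1)]

    synthetic = []
    for i, c in enumerate(counts):
        for _ in range(c):
            j = rng.choice(nn_min[i])
            gap = rng.random()
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))

    X_syn = np.array(synthetic).reshape(-1, X.shape[1])
    y_syn = np.full(len(X_syn), minority_label, dtype=y.dtype)
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])
```

In this sketch the original samples are returned unchanged, followed by the synthetic minority samples. If no minority point falls in the danger set, the data is returned as is; a fuller implementation would need a fallback for that case.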
