Instance Selection by Identifying Relevant Events Using Domain Knowledge and Minimal Human Involvement

Billions of events (images, videos, tweets, purchases, deliveries, or failures) are captured in the era of Big Data and used as training data in business analytic models. However, the relevance of an event and its consequent effect on a target variable are difficult to ascertain. Contemporaneous events and unspecified market latencies can lead to a fuzzily defined training data set and poor classification results. Instance selection (IS) methods aim to choose relevant training examples while reducing the training set to a subset. Both goals are valuable in Big Data projects. However, the traditional approaches identified in a literature review are purely heuristic and do not consider fuzzy effects, and therefore fail in this setting. As a result, relevant instances must be classified by experts. Because the content of events changes quickly, the manual assignment needed to maintain a realistic model is time- and cost-intensive. We propose an alternative approach in which the relevance of an event is deduced from the effect it causes on a target variable after its publication. Such additional business domain knowledge can be expected to allow a more precise selection of instances and thus a more successful prediction. We present an application in the natural gas market in which the approach identified more relevant tickers than other approaches. This work contributes to the scientific discussion by integrating domain knowledge into IS. Until now, only a few such approaches have been introduced, all of which demand human involvement and are expensive. Our approach is automatic, thereby saving analysts' time and expense.
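The core idea, deducing an event's relevance from the movement of the target variable after publication, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_relevant_events`, the fixed observation `window`, the relative-change `threshold`, and the toy data are all assumptions made for illustration, and the sketch does not attempt to handle the contemporaneous events or market latencies mentioned above.

```python
from bisect import bisect_right


def select_relevant_events(events, prices, window, threshold):
    """Keep events followed by a sufficiently large move in the target variable.

    events:    list of (timestamp, text) tuples (hypothetical format)
    prices:    list of (timestamp, price) tuples, sorted by timestamp
    window:    how long after publication to look for an effect (assumed)
    threshold: minimum absolute relative price change to count as relevant (assumed)
    """
    times = [t for t, _ in prices]
    selected = []
    for ts, text in events:
        # Index of the last price observed at or before publication.
        i = bisect_right(times, ts) - 1
        # Index of the last price observed at or before the end of the window.
        j = bisect_right(times, ts + window) - 1
        if i < 0 or j <= i:
            continue  # no price before the event, or no later observation
        before = prices[i][1]
        after = prices[j][1]
        change = (after - before) / before
        if abs(change) >= threshold:
            selected.append((ts, text, "up" if change > 0 else "down"))
    return selected


# Toy data: integer timestamps (e.g. hours) and made-up prices.
prices = [(0, 100.0), (1, 100.0), (2, 103.0), (3, 103.0), (4, 103.5)]
events = [(1, "pipeline outage"), (3, "routine maintenance notice")]

relevant = select_relevant_events(events, prices, window=1, threshold=0.02)
print(relevant)
```

In this toy run only the first event is kept, because the price rises by 3% within the window after its publication, while the second event is followed by a change of less than 1%. The selected events, together with the direction of the move, could then serve as automatically labeled training instances.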
