The Critical Dimension Problem: No Compromise Feature Selection

The problem of selecting important features has been studied extensively, and a variety of algorithms have been proposed for data analysis and mining tasks in diverse applications. As the era of "big data" arrives, effective techniques for identifying important features or attributes in very large datasets will be highly valuable in dealing with many of the challenges that accompany it. This paper describes work in progress on a related general problem: for a given dataset, is there a "Critical Dimension", i.e., a minimum number of features necessary for achieving good results? In other words, for a dataset with many features, how many are truly relevant and important enough to be included in, say, machine learning and/or data mining tasks to ensure that acceptable performance is achieved? Moreover, if a Critical Dimension does exist, how can the features that need to be included be identified? The problem is first analyzed formally and shown to be intractable. An ad hoc method is then designed for obtaining approximate solutions; experiments are then performed on a selection of datasets of varying sizes to demonstrate that for many datasets a Critical Dimension does indeed exist. The significance of the existence, or absence, of a Critical Dimension in a dataset is explained.
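As a rough illustration of the question posed above (and not of the paper's own method, which is not detailed here), one could probe for an approximate Critical Dimension by ranking features with a simple univariate criterion and then finding the smallest subset size whose cross-validated accuracy stays within a tolerance of the full-feature score. The ranking criterion, classifier, and tolerance below are illustrative assumptions.

```python
# Illustrative sketch (not the paper's method): estimate an approximate
# "Critical Dimension" as the smallest k such that a model trained on the
# top-k ranked features stays within `tol` of the full-feature accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score

def approx_critical_dimension(X, y, tol=0.02, cv=5, random_state=0):
    """Return (k, selected_indices, subset_acc, full_acc).

    The univariate F-score ranking is an approximation; searching all
    feature subsets exactly is intractable in general.
    """
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    full_acc = cross_val_score(clf, X, y, cv=cv).mean()

    # Rank features once by univariate relevance to the class label.
    scores = f_classif(X, y)[0]
    order = np.argsort(scores)[::-1]

    # Grow the subset until performance is "good enough".
    for k in range(1, X.shape[1] + 1):
        subset = order[:k]
        acc = cross_val_score(clf, X[:, subset], y, cv=cv).mean()
        if acc >= full_acc - tol:
            return k, subset, acc, full_acc
    return X.shape[1], order, full_acc, full_acc

if __name__ == "__main__":
    # Synthetic data: 20 features, of which only 5 are informative.
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=5, n_redundant=5,
                               random_state=0)
    k, subset, acc, full_acc = approx_critical_dimension(X, y)
    print(f"Full-feature CV accuracy: {full_acc:.3f}")
    print(f"Approximate critical dimension: {k} features "
          f"(CV accuracy {acc:.3f}) -> indices {sorted(subset.tolist())}")
```

On the synthetic example, the reported subset size would be expected to land near the number of informative features; whether a comparable plateau exists for a real dataset is exactly the question the paper investigates.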