This is an introductory chapter in which:
(i) The goals of core data analysis, as a tool that helps to enhance and augment knowledge of the domain, are outlined. Since knowledge is represented by concepts and statements of relation between them, the two main pathways for data analysis are summarization, for developing and augmenting concepts, and correlation, for enhancing and establishing relations.
(ii) A set of eight cases involving small datasets and related data analysis problems is presented. The datasets are taken from various fields such as monitoring market towns, computer security protocols, bioinformatics, and cognitive psychology.
(iii) An overview of data visualization, its goals and some techniques, is given.
(iv) A general view of the strengths and pitfalls of data analysis is provided.
(v) An overview of the concept of classification as a soft knowledge structure widely used in theory and practice is given.

© Springer Nature Switzerland AG 2019
B. Mirkin, Core Data Analysis: Summarization, Correlation, and Visualization, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-030-00271-8_1

1.1 Summarization and Correlation: Main Goals of Core Data Analysis

1.1.1 The Goals for Knowledge Enhancing

The term Data Analysis has been in use for quite a while, since before the advent of the computer era, as an extension of mathematical statistics. It started from developments in cluster analysis and other multivariate techniques before WWII and brought forth the concepts of “exploratory” data analysis and “confirmatory” data analysis in statistics (see, for example, Tukey 1977). The former was supposed to cover a set of techniques for finding patterns in data, and the latter to cover more conventional mathematical statistics approaches for hypothesis testing. “A possible definition of data analysis is the process of computing various summaries and derived values from the given collection of data” and, moreover, the process may become more intelligent if attempts are made to automate some of the reasoning of skilled data analysts and/or to utilize approaches developed in the Artificial Intelligence areas (Berthold and Hand 2003, p. 3). Overall, the term Data Analysis is usually applied as an umbrella to cover all the various activities mentioned above, with an emphasis on mathematical statistics and its extensions.

The situation can be seen as follows. Classical statistics views data as a vehicle for fitting and testing mathematical models of the phenomena the data refer to. The data mining and knowledge discovery discipline claims to use data so that new knowledge is added. What counts as knowledge remains undefined, though knowledge of fact, knowledge of God, and knowledge of regularities in nature can be distinguished. It should be sensible, then, to look at those methods that relate to an intermediate level and contribute to the theoretical, rather than any, knowledge of the phenomenon. These would focus on ways of augmenting or enhancing theoretical knowledge of the specific domain to which the data being analyzed relate.

The term “knowledge” encompasses many diverse layers or forms of information, ranging from individual facts to those of literary characters to major scientific laws. But when one focuses on the particular domain the dataset in question comes from, its “theoretical” knowledge structure can be considered as comprising two main types of elements: (i) concepts and (ii) statements relating concepts. Concepts are terms referring to aggregations of similar entities, such as apples or plums, or of similar categories, such as fruit, which comprises both apples and plums, among others. When created over data objects or features, these are referred to, in data analysis, as clusters or factors. Statements of relation between concepts express regularities relating different categories.
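A regularity relating two features can be quantitative. As a minimal sketch, assuming purely synthetic data (the feature names `xs` and `ys`, the noise level, and the helper functions `pearson` and `fit_square` are illustrative assumptions, not taken from the text), the following checks a relation in which one feature is roughly the square of the other. It compares the conventional linear (Pearson) coefficient with a least-squares fit of y ≈ a·x².

```python
import random
import statistics


def pearson(xs, ys):
    """Conventional linear correlation coefficient between two features."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit_square(xs, ys):
    """Least-squares coefficient a in y ~ a * x**2 (no intercept),
    plus 1 - residual/total sum of squares as a rough quality measure."""
    zs = [x * x for x in xs]
    a = sum(z * y for z, y in zip(zs, ys)) / sum(z * z for z in zs)
    my = statistics.fmean(ys)
    residual = sum((y - a * z) ** 2 for z, y in zip(zs, ys))
    total = sum((y - my) ** 2 for y in ys)
    return a, 1 - residual / total


random.seed(0)  # fixed seed so the sketch is repeatable
xs = [i / 10 for i in range(-50, 51)]            # assumed: values symmetric about zero
ys = [x * x + random.gauss(0, 0.5) for x in xs]  # assumed: y is roughly x squared

r = pearson(xs, ys)
a, explained = fit_square(xs, ys)
print(f"linear correlation: {r:.3f}")
print(f"fitted a: {a:.3f}, share of variation captured: {explained:.3f}")
```

With values of the first feature placed symmetrically about zero, the linear coefficient comes out close to zero even though the quadratic pattern accounts for nearly all the variation in the second feature. A regularity of this kind is therefore better checked against the specific pattern of interest than against a linear measure alone.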
Two features are said to correlate when a co-occurrence of specific patterns in their values is observed, for instance, when one feature’s value tends to be the square of another’s. Observing a correlation pattern can sometimes lead to investigation of a broader structure behind the pattern, which may further lead to finding or developing a theoretical framework from which the correlation follows. It is useful to distinguish between quantitative correlations, such as algebraic expressions involving data features, and categorical ones expressed in a non-quantitative way, for example, as logical production rules or more complex structures such as decision trees. Correlations may be used for both understanding and prediction. In applications, the latter has until recently been by far the more important. Moreover, the prediction problem is much easier to make sense of operationally, hence machine learning has concentrated on it.

What is said above suggests that there are two main pathways for data analysis to augment theoretical knowledge: (i) developing new concepts by “summarizing” data and (ii) deriving new relations between concepts by analyzing “correlation” between various aspects of the data. The quotation marks are used here to point out that each of the terms, summarization and correlation, significantly extends its conventional meaning. Indeed, while everybody would agree that the average mark does summarize the marking scores on test papers, it would be more daring to see in the same light the derivation of students’ hidden talent scores by approximating their test marks, or the finding of a cluster of similarly performing students. Still, the mathematical structures behind each of these three activities (calculating the average, finding a hidden factor, and designing a cluster structure) are analogous, which suggests that classing them all under the “summarization” umbrella may be

1 Topics in Substance of Data Analysis