Getting to Know Your Data

Publisher Summary

This chapter is about getting familiar with the data. Knowledge about the data is useful for data preprocessing, the first major task of the data mining process. The chapter studies the various attribute types: nominal, binary, ordinal, and numeric. Basic statistical descriptions can be used to learn more about each attribute's values. Given a temperature attribute, for example, one can determine its mean (average value), median (middle value), and mode (most common value). These are measures of central tendency, which give an idea of the "middle" or center of the distribution. Knowing such basic statistics for each attribute makes it easier to fill in missing values, smooth noisy values, and spot outliers during data preprocessing. Knowledge of the attributes and attribute values can also help in fixing inconsistencies incurred during data integration.

Plotting the measures of central tendency shows whether the data are symmetric or skewed. Quantile plots, histograms, and scatter plots are other graphic displays of basic statistical descriptions. These can all be useful during data preprocessing and can provide insight into areas for mining. The field of data visualization provides many additional techniques for viewing data through graphical means. These can help identify relations, trends, and biases "hidden" in unstructured data sets.

The similarity or dissimilarity between objects may also be used to detect outliers in the data or to perform nearest-neighbor classification. There are many measures for assessing similarity and dissimilarity. In general, such measures are referred to as proximity measures.
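As a minimal sketch of the ideas in the summary, the following Python snippet computes the mean, median, and mode of a temperature attribute and one simple proximity measure (Euclidean distance) between two objects. The temperature readings, the `euclidean_distance` helper, and the choice of Euclidean distance are illustrative assumptions and do not come from the chapter itself.

```python
import math
import statistics

# Hypothetical temperature readings in degrees Celsius (illustrative only).
temperatures = [18, 21, 21, 23, 24, 25, 36]

# Measures of central tendency for the attribute.
mean_temp = statistics.mean(temperatures)      # average value
median_temp = statistics.median(temperatures)  # middle value of the sorted data
mode_temp = statistics.mode(temperatures)      # most common value

# A mean above the median hints at a right (positive) skew,
# here caused by the single large reading of 36.
looks_right_skewed = mean_temp > median_temp


def euclidean_distance(x, y):
    """Dissimilarity between two numeric objects of equal length."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(mean_temp, median_temp, mode_temp, looks_right_skewed)
print(euclidean_distance((0, 0), (3, 4)))
```

Comparing the mean with the median is only a rough check for skew; a histogram or quantile plot of the same values would give a fuller picture. Other proximity measures may suit nominal, binary, or ordinal attributes better than Euclidean distance.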
