Visual cleaning of genotype data

While some data cleaning tasks can be performed automatically, many more require expert human guidance to steer the cleaning process, especially when erroneous or unclean data arises from relationships between entities. An example is pedigree genotype data: inheritance hierarchies in which the correctness of any individual's genotype data is judged by comparison with the genotypes of their relatives, as individuals should inherit DNA from their assumed ancestors. Cleaning this data must therefore take the relationships between individuals into account; sometimes this means that more data must be cleaned than first assumed, while in other situations errors across many individuals can be remedied by cleaning the data of a single shared relative. Such judgements require a domain expert to hypothesise the effect that changing particular data has on the wider data set. A visualisation tool that supports what-if interactions can help a user clean such data correctly. We achieve this by closely coupling an existing pedigree visualisation technique, VIPER, with a genotype cleaning algorithm, and then developing the extensions to the visualisation needed to allow interactive data cleaning. A comparative user evaluation with biologists shows the advantages of this visualisation design over an existing cleaning tool, and we discuss the challenges in designing visual cleaning tools in which errors may be transitive.
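To make the idea of relationship-dependent errors concrete, the following is a minimal sketch of a Mendelian inheritance check at a single marker, followed by a what-if step that masks one individual's genotype and re-checks the pedigree. It is an illustration only, not the cleaning algorithm coupled with VIPER: the function names (`is_consistent`, `find_inconsistencies`, `what_if_masked`), the data layout, and the example pedigree are all assumptions made for this sketch.

```python
def is_consistent(child, sire, dam):
    """Return True if the child's genotype could be inherited from both parents.

    Each genotype is a pair of alleles, or None if it is unknown or masked.
    An unknown parent is assumed to be able to supply any allele.
    """
    if child is None:
        return True
    a, b = child

    def can_give(parent, allele):
        return parent is None or allele in parent

    # One allele must come from the sire and the other from the dam.
    return (can_give(sire, a) and can_give(dam, b)) or (
        can_give(sire, b) and can_give(dam, a)
    )


def find_inconsistencies(pedigree, genotypes):
    """List the offspring whose genotypes conflict with their assumed parents.

    pedigree maps each offspring to a (sire, dam) pair (hypothetical layout).
    """
    flagged = []
    for child, (sire, dam) in pedigree.items():
        if not is_consistent(
            genotypes.get(child), genotypes.get(sire), genotypes.get(dam)
        ):
            flagged.append(child)
    return sorted(flagged)


def what_if_masked(pedigree, genotypes, individual):
    """Re-check the pedigree as if one individual's genotype were removed."""
    trial = dict(genotypes)  # leave the original data untouched
    trial[individual] = None
    return find_inconsistencies(pedigree, trial)


# Hypothetical example: one sire shared by two offspring.
pedigree = {"O1": ("S", "D1"), "O2": ("S", "D2")}
genotypes = {
    "S": ("A", "A"),
    "D1": ("B", "B"),
    "D2": ("B", "B"),
    "O1": ("B", "B"),
    "O2": ("B", "B"),
}

before = find_inconsistencies(pedigree, genotypes)
after = what_if_masked(pedigree, genotypes, "S")
print(before, after)
```

In this made-up pedigree both offspring are flagged at first, yet masking the shared sire's genotype clears both at once. This is the situation described above, in which errors across many individuals trace back to one shared relative, and it is the kind of hypothesis a what-if interaction lets an expert test before committing a change.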
