Why Visualize Data When Coding? Preliminary Categories for Coding in Jupyter Notebooks

Data visualization has become a crucial component of data analytics, especially for data exploration, understanding, and analysis. Effective data visualization informs decision-making and aids in discovering and understanding relationships in data. It also benefits data-intensive software development tasks such as feature engineering in machine learning-based software projects. However, it is unknown how visualizations are used in competitive programming. This paper reports early results on which visualizations are prevalent in competitive programming. Grandmaster is the highest of the four competitor levels (novice, expert, master, and grandmaster). By analyzing the visualizations of 7 high-ranking competitors (i.e., Grandmasters) on Kaggle, we identify and present a catalog of visualizations used both to tell a story from the data and to explain the processes and pipelines behind their coding solutions. Our taxonomy includes nine types derived from over 821 visualizations in 68 Jupyter notebooks. Furthermore, the most used visualizations are those for data analysis of distribution (DA Distribution) and frequency (DA Frequency). We envision that this catalog can help readers better understand the situations in which to employ these visualizations.
