Advanced data science toolkit for non-data scientists – A user guide

Abstract: Modern data analytics is attracting considerable attention in materials research and shows great potential for enabling data-driven design. Data generated by the high-throughput CALPHAD approach help researchers understand underlying mechanisms and generate novel hypotheses, but the increasing volume of data makes analysis extremely challenging. Herein, we introduce an easy-to-use, versatile, and open-source data analytics frontend, ASCENDS (Advanced data SCiENce toolkit for Non-Data Scientists), designed to accelerate data-driven materials research and development. The toolkit is also of value beyond materials science, as it can analyze the correlation between input features and target values, train machine learning models, and make predictions from the trained surrogate models for any scientific dataset. The algorithms implemented in ASCENDS allow users to perform quantified correlation analyses and supervised machine learning on any dataset of interest without an extensive computing or data science background. The detailed usage of ASCENDS is introduced with an example of experimental high-temperature alloy data.
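The workflow the abstract describes (quantify the correlation between input features and a target, train a surrogate model, then predict) can be illustrated with a minimal sketch. This is not the ASCENDS interface: the function names, the feature names, and the numbers below are all hypothetical, and a one-variable least-squares line stands in for the toolkit's machine learning models.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical, made-up data (not taken from the paper): two input
# features and one target value for five imaginary alloy samples.
features = {
    "temperature": [600.0, 650.0, 700.0, 750.0, 800.0],
    "unrelated_feature": [3.0, 1.0, 4.0, 1.0, 5.0],
}
target = [1000.0, 800.0, 600.0, 400.0, 200.0]

# Step 1: correlation analysis between each input feature and the target.
correlations = {name: pearson(values, target) for name, values in features.items()}

# Step 2: train a surrogate model on the most strongly correlated feature.
best_feature = max(correlations, key=lambda name: abs(correlations[name]))
slope, intercept = fit_line(features[best_feature], target)

# Step 3: predict the target for an unseen input value.
def predict(x):
    return slope * x + intercept

prediction = predict(725.0)
```

Ranking features by the absolute value of the correlation is only one simple selection heuristic. The Pearson coefficient captures linear relationships only, so a non-linear dependence would need a different measure.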
