An Empirical Study of Challenges in Converting Deep Learning Models

There is an increase in deploying Deep Learning (DL)-based software systems in real-world applications. Usually, DL models are developed and trained using DL frameworks such as TensorFlow and PyTorch. Each framework has its own internal mechanisms/formats for representing and training DL models (deep neural networks), and those formats usually cannot be recognized by other frameworks. Moreover, trained models are usually deployed in environments different from where they were developed. To solve this interoperability issue and make DL models compatible with different frameworks/environments, exchange formats such as ONNX and CoreML have been introduced. However, ONNX and CoreML have never been empirically evaluated by the community to reveal their prediction accuracy, performance, and robustness after conversion. Poor accuracy or non-robust behavior of converted models may degrade the quality of deployed DL-based software systems. In this paper, we conduct the first empirical study to assess ONNX and CoreML for converting trained DL models. In our systematic approach, two popular DL frameworks, Keras and PyTorch, are used to train five widely used DL models on three popular datasets. The trained models are then converted to ONNX and CoreML and transferred to the two runtime environments designated for those formats, to be evaluated. We investigate the prediction accuracy before and after conversion. Our results show that the prediction accuracy of converted models is at the same level as that of the originals. The performance (time cost and memory consumption) of converted models is studied as well. Model size is reduced after conversion, which can result in optimized DL-based software deployment. We also study the adversarial robustness of converted models to verify the robustness of deployed DL-based software. Leveraging state-of-the-art adversarial attack approaches, we find that converted models are generally as robust as the originals.
However, the obtained results show that CoreML models are more vulnerable to adversarial attacks than ONNX models. The general message of our findings is that DL developers should be cautious about deploying converted models, which may 1) perform poorly when switching from one framework to another, 2) be challenging to deploy robustly, or 3) run slowly, leading to poor quality of deployed DL-based software, including DL-based software maintenance tasks such as bug prediction.
