An Empirical Study on Program Failures of Deep Learning Jobs
Deep learning has made significant achievements in many application areas. To train and test models more efficiently, enterprise developers submit and run their deep learning programs on a shared, multi-tenant platform. However, some of these programs fail after a long execution time due to code/script defects, which reduces development productivity and wastes expensive resources such as GPU and network I/O.
This paper presents the first comprehensive empirical study on program failures of deep learning jobs. We collect 4960 real failures from Platform-H of Company-X, manually examine their failure messages, and classify them into 20 categories. In addition, we identify the common root causes and bug-fix solutions on a sample of 400 failures. To better understand the current testing and debugging practices for deep learning, we also conduct developer interviews. Our major findings include: (1) 48.0% of the failures occur in interaction with the platform rather than in the execution of code logic, mostly due to discrepancies between local and platform execution environments; (2) deep-learning-specific failures (13.5%) are mainly caused by inappropriate model parameters/structures and misunderstanding of framework APIs; (3) current debugging practices are not efficient for fault localization in many cases, so developers need more deep-learning-specific tools. Based on our findings, we further suggest possible research topics and tooling support that could facilitate future deep learning development.
Tue 7 Jul (times are displayed in UTC, Coordinated Universal Time)
|08:05 - 08:17|
|08:17 - 08:29|
Peixin Zhang (Zhejiang University); Jingyi Wang (National University of Singapore, Singapore); Jun Sun (Singapore Management University); Guoliang Dong (Computer College of Zhejiang University); Xinyu Wang (Zhejiang University); Xingen Wang (Zhejiang University); Jin Song Dong (National University of Singapore); Dai Ting (Huawei Corporation)
|08:29 - 08:32|
|08:32 - 08:35|
Yongqiang TIAN (The Hong Kong University of Science and Technology); Zhihua Zeng (Zhejiang University); Ming Wen (Huazhong University of Science and Technology, China); Yepang Liu (Southern University of Science and Technology); Tzu-yang Kuo (The Hong Kong University of Science and Technology); Shing-Chi Cheung (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology)
|08:35 - 08:47|
Nargiz Humbatova (Università della Svizzera italiana); Gunel Jahangirova (Università della Svizzera italiana); Gabriele Bavota (Università della Svizzera italiana); Vincenzo Riccio (Università della Svizzera italiana); Andrea Stocco (Università della Svizzera italiana); Paolo Tonella (Università della Svizzera italiana)
|08:47 - 08:59|