We are at an inflection point where software engineering meets the data-centric world of big data, machine learning, and artificial intelligence. As software development gradually shifts toward the development of data analytics with AI and ML technologies, existing software engineering techniques must be re-imagined to provide the productivity gains that developers desire. We conducted a large-scale study of almost 800 professional data scientists in the software industry to investigate what a data scientist is, what data scientists do, and what challenges they face. The study found that ensuring correctness is a major problem in data analytics. Software developers currently lack a good method for increasing confidence in the quality of data analytics development. We argue for re-targeting software engineering tools and research directions to address new challenges in the era of data-centric software development. We showcase a few examples of software engineering techniques that re-invent automated debugging and testing for data analytics. In particular, we discuss interactive debugging primitives, automated debugging, data provenance, and symbolic-execution-based test generation for big data analytics in Apache Spark. We conclude with open problems in software engineering to support data-centric software development and the needs of the AI and ML workforce.
Miryung Kim is a Professor in the Department of Computer Science at the University of California, Los Angeles, and Director of the Software Engineering and Analysis Laboratory. She is known for her research on code clones, including solutions for detecting, managing, and removing code duplication. Recently, she has taken a leadership role in defining the emerging area of software engineering for data science.
She received her B.S. in Computer Science from the Korea Advanced Institute of Science and Technology (KAIST) in 2001 and her M.S. and Ph.D. in Computer Science and Engineering from the University of Washington in 2003 and 2008, respectively. She ranked No. 1 among all engineering and science students at KAIST in 2001 and received the Korean Ministry of Education, Science, and Technology Award, the highest honor given to an undergraduate student in Korea. Her awards include an NSF CAREER Award, a Google Faculty Research Award, and an Okawa Foundation Research Award. She was previously an assistant professor at the University of Texas at Austin. Her research is funded by the National Science Foundation, the Air Force Research Laboratory, Google, IBM, Intel, the Okawa Foundation, and Samsung. She is currently leading a 4.9M Office of Naval Research project on synergistic software customization. She is a Program Co-Chair of the 35th IEEE International Conference on Software Maintenance and Evolution and an Associate Editor of IEEE Transactions on Software Engineering and Empirical Software Engineering.