Improving Data Scientist Efficiency with Provenance (Technical)
Data scientists frequently analyze data by writing scripts. We conducted a contextual inquiry of interdisciplinary researchers, which revealed that parameter tuning is a highly iterative process and that debugging is time-consuming. As analysis scripts evolve and become more complex, analysts have difficulty conceptualizing their workflow. In particular, after editing a script, it becomes difficult to determine precisely which code blocks depend on the edit. Consequently, scientists frequently re-run entire scripts instead of re-running only the necessary parts. We present ProvBuild, a tool that leverages language-level provenance to streamline the debugging process by reducing programmer cognitive load and decreasing subsequent runtimes, leading to an overall reduction in elapsed debugging time. ProvBuild uses provenance to track dependencies in a script. When an analyst debugs a script, ProvBuild generates a simplified script that contains only the information necessary to debug a particular problem. We demonstrate that debugging the simplified script lowers a programmer’s cognitive load and permits faster re-execution when testing changes. The combination of reduced cognitive load and shorter runtime reduces the time necessary to debug a script. We quantitatively and qualitatively show that even though ProvBuild introduces overhead during a script’s first execution, it is a more efficient way for users to debug and tune complex workflows. ProvBuild demonstrates a novel use of language-level provenance, in which it is used to proactively improve programmer productivity rather than merely providing a way to retroactively gain insight into a body of code.
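The idea of determining which code blocks depend on an edit can be illustrated with a minimal sketch. This is not ProvBuild's implementation: the abstract describes provenance-based tracking, whereas the sketch below uses a simple static approximation over top-level statements, and it ignores function side effects, attribute and subscript writes, and control flow. The names `defs_and_uses`, `affected_statements`, and `SCRIPT` are hypothetical and chosen only for illustration.

```python
import ast


def defs_and_uses(stmt):
    """Return (names written, names read) by one top-level statement."""
    defs, uses = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            else:
                uses.add(node.id)
        # An augmented assignment such as `x += 1` also reads its target.
        if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            uses.add(node.target.id)
    return defs, uses


def affected_statements(source, edited_index):
    """Indices of top-level statements that must re-run after an edit.

    Assumption: straight-line code where dependencies flow only through
    variable names.
    """
    stmts = ast.parse(source).body
    tainted = set(defs_and_uses(stmts[edited_index])[0])
    affected = [edited_index]
    for i in range(edited_index + 1, len(stmts)):
        defs, uses = defs_and_uses(stmts[i])
        if uses & tainted:
            # Reads a value influenced by the edit, so its outputs are stale too.
            affected.append(i)
            tainted |= defs
        else:
            # Overwrites with a value unrelated to the edit.
            tainted -= defs
    return affected


# Hypothetical analysis script used only to exercise the sketch.
SCRIPT = """
raw = list(range(10))
threshold = 5
kept = [v for v in raw if v > threshold]
total = sum(raw)
report = len(kept)
"""

if __name__ == "__main__":
    # Editing `threshold` (statement 1) leaves `total` (statement 3) untouched.
    print(affected_statements(SCRIPT, 1))
```

In this sketch, tuning `threshold` marks only the filtering and reporting statements for re-execution, while the unrelated `total = sum(raw)` statement is skipped. That is the kind of saving the abstract attributes to re-running only the necessary parts.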
Sat 11 Jul. Displayed time zone: (UTC) Coordinated Universal Time
00:00 - 01:00 | P27-Applications | Software Engineering in Practice / Technical Papers at Silla | Chair(s): Ganesha Upadhyaya (Harmony.one)
00:00 | 12m | Talk | Big Code != Big Vocabulary: Open-Vocabulary Models for Source code | Technical Papers | Rafael-Michael Karampatsis (The University of Edinburgh), Hlib Babii (Free University of Bozen-Bolzano), Romain Robbes (Free University of Bozen-Bolzano), Charles Sutton (Google Research), Andrea Janes (Free University of Bozen-Bolzano)
00:12 | 12m | Talk | Engineering for a Science-Centric Experimentation Platform | Software Engineering in Practice | Nikos Diamantopoulos (Netflix, Inc.), Jeffrey Wong (Netflix, Inc.), David Issa Mattos (Chalmers University of Technology), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Matthew Wardrop (Netflix, Inc.), Tobias Mao (Netflix, Inc.), Colin McFarland (Netflix, Inc.)
00:24 | 12m | Talk | Managing data constraints in database-backed web applications | Technical Papers | Junwen Yang (University of Chicago), Utsav Sethi (University of Chicago), Cong Yan (University of Washington), Alvin Cheung (University of California, Berkeley), Shan Lu (University of Chicago)
00:36 | 12m | Talk | Improving Data Scientist Efficiency with Provenance | Technical Papers | Jingmei Hu (Harvard University), Jiwon Joung (Harvard University), Maia Jacobs (Harvard University), Margo Seltzer (University of British Columbia), Krzysztof Gajos (Harvard University)