ProvBuild: Improving Data Scientist Efficiency with Provenance (An Extended Abstract) (ICSE 2020 - Posters)

Wed 24 June - Thu 16 July 2020

Who

Jingmei Hu, Jiwon Joung, Maia Jacobs, Krzysztof Gajos, Margo Seltzer

Track

ICSE 2020 ICSE Posters

Time Zone

The program is currently displayed in (UTC) Coordinated Universal Time.

When

Wed 8 Jul 2020 02:10 - 03:00 at Poster Special Room - P306-Posters

Abstract

Data scientists frequently analyze data by writing scripts. We conducted a contextual inquiry with interdisciplinary researchers, which revealed that parameter tuning is a highly iterative process and that debugging is time-consuming. As analysis scripts evolve and become more complex, analysts have difficulty conceptualizing their workflow. In particular, after editing a script, it becomes difficult to determine precisely which code blocks depend on the edit. Consequently, scientists frequently re-run entire scripts instead of re-running only the necessary parts. We present ProvBuild, a tool that leverages language-level provenance to streamline the debugging process by reducing programmer cognitive load and decreasing subsequent runtimes, leading to an overall reduction in elapsed debugging time. ProvBuild uses provenance to track dependencies in a script. When an analyst debugs a script, ProvBuild generates a simplified script that contains only the information necessary to debug a particular problem. We demonstrate that debugging the simplified script lowers a programmer’s cognitive load and permits faster re-execution when testing changes. The combination of reduced cognitive load and shorter runtime reduces the time necessary to debug a script. We quantitatively and qualitatively show that even though ProvBuild introduces overhead during a script’s first execution, it is a more efficient way for users to debug and tune complex workflows. ProvBuild demonstrates a novel use of language-level provenance, in which provenance is used proactively to improve programmer productivity rather than merely to provide a way to gain insight into a body of code retroactively. ProvBuild is a data analysis environment that uses change impact analysis to improve the iterative debugging process in script-based workflow pipelines. It is the first debugging tool to leverage language-level provenance to reduce cognitive load and execution time.
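
The idea of tracking dependencies and keeping only the code that an edit affects can be illustrated with a small sketch. This is not ProvBuild's implementation: the abstract says ProvBuild relies on language-level provenance, whereas the sketch below assumes a much simpler static, name-based analysis of top-level statements using Python's standard `ast` module. The function names (`names_defined_and_used`, `affected_statements`, `simplified_script`) and the sample script are hypothetical, and the sketch ignores in-place mutation, aliasing, and control flow.

```python
import ast


def names_defined_and_used(stmt):
    """Return (defined, used) variable names for one top-level statement."""
    defined, used = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        # `x += 1` reads x as well as writing it, but ast marks x as Store only.
        if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            used.add(node.target.id)
    return defined, used


def affected_statements(source, edited_index):
    """Indices of top-level statements that may depend on the edited one."""
    tainted = set()  # names whose values may change because of the edit
    affected = []
    for i, stmt in enumerate(ast.parse(source).body):
        defined, used = names_defined_and_used(stmt)
        if i == edited_index or (i > edited_index and used & tainted):
            affected.append(i)
            tainted |= defined
    return affected


def simplified_script(source, edited_index):
    """Source text of only the affected statements (assumed simplification)."""
    stmts = ast.parse(source).body
    keep = affected_statements(source, edited_index)
    return "\n".join(ast.get_source_segment(source, stmts[i]) for i in keep)


# Hypothetical analysis script used only for illustration.
SCRIPT = """\
raw = list(range(10))
threshold = 3
kept = [v for v in raw if v > threshold]
total = sum(raw)
mean_kept = sum(kept) / len(kept)
"""

if __name__ == "__main__":
    # Editing statement 1 (threshold) leaves `raw` and `total` untouched.
    print(affected_statements(SCRIPT, 1))  # [1, 2, 4]
    print(simplified_script(SCRIPT, 1))
```

In this sketch, the reduced script still refers to `raw`, which the skipped statements would have produced. A real tool would need some way to supply such values, for example from results recorded during an earlier run; that part is outside what the sketch shows.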

Jingmei Hu

Harvard University

Jiwon Joung

Harvard University

Maia Jacobs

Harvard University

Krzysztof Gajos

Harvard University

Margo Seltzer

University of British Columbia, Canada

Session Program

Wed 8 Jul
Displayed time zone: (UTC) Coordinated Universal Time

02:10 - 03:00	P306-Posters, ICSE 2020 Posters at Poster Special Room

02:10 50m Poster		A Practical, Collaborative Approach for Modeling Big Data Analytics Application Requirements ICSE 2020 Posters Hourieh Khalajzadeh Monash University, Australia, Anj Simmons Deakin University, Mohamed Abdelrazek Deakin University, John Grundy Monash University, John Hosking University of Auckland, Qiang He, Prasanna Ratnakanthan, Adil Zia, Meng Law
02:10 50m Poster		ProvBuild: Improving Data Scientist Efficiency with Provenance (An Extended Abstract) ICSE 2020 Posters Jingmei Hu Harvard University, Jiwon Joung Harvard University, Maia Jacobs Harvard University, Krzysztof Gajos Harvard University, Margo Seltzer University of British Columbia
02:10 50m Poster		Elite Developers' Activities at Open Source Ecosystem Level ICSE 2020 Posters James Jones University of California, Irvine, David Redmiles University of California, Irvine
02:10 50m Poster		Semantic Analysis of Issues on Google Play and Twitter ICSE 2020 Posters Aman Yadav, Fatemeh Hendijani Fard University of British Columbia
02:10 50m Poster		An Intelligent Tool for Combatting Contract Cheating Behaviour by Facilitating Scalable Student-Tutor Discussions ICSE 2020 Posters Jake Renzella Deakin University, Andrew Cain Deakin University, Jean-Guy Schneider Deakin University
02:10 50m Poster		Poster: How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub ICSE 2020 Posters Shurui Zhou Carnegie Mellon University, USA / University of Toronto, CA, Bogdan Vasilescu Carnegie Mellon University, Christian Kästner Carnegie Mellon University
02:10 50m Poster		An Oracle Language for Autonomous Vehicles ICSE 2020 Posters Ana Nora Evans University of Virginia, USA, Mary Lou Soffa University of Virginia, Sebastian Elbaum University of Virginia, USA