Big Code != Big Vocabulary: Open-Vocabulary Models for Source code (ICSE 2020 - Technical Papers)

Wed 24 June - Thu 16 July 2020

Who

Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

Track

ICSE 2020 Technical Papers

When

Sat 11 Jul 2020 00:00 - 00:12 at Silla - P27-Applications Chair(s): Ganesha Upadhyaya

Abstract

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.
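
As a rough illustration of why a modelling choice can change the vocabulary, the sketch below compares whole-token and sub-token vocabularies on two toy Java-like snippets. It is not the paper's model or pipeline: the splitting rule (camelCase and underscore boundaries), the function names, and the snippets are all assumptions made up for this example.

```python
import re

def whole_tokens(code):
    """Split code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|\S", code)

def subtokens(code):
    """Split identifiers further on camelCase and underscore boundaries (lowercased).

    Hypothetical splitting rule, chosen only for illustration.
    """
    out = []
    for tok in whole_tokens(code):
        if re.match(r"[A-Za-z_]", tok):
            parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tok)
            # Keep tokens such as "_" that contain no letters or digits.
            out.extend(p.lower() for p in parts) if parts else out.append(tok)
        else:
            out.append(tok)
    return out

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)

# Toy snippets (made up): the second one introduces new identifier names.
train = "int fileCount = readFile(filePath);"
test = "int lineCount = readLine(linePath);"

for name, tokenize in [("whole", whole_tokens), ("sub", subtokens)]:
    tr, te = tokenize(train), tokenize(test)
    print(name, "vocabulary size:", len(set(tr) | set(te)),
          "out-of-vocabulary rate:", round(oov_rate(tr, te), 3))
```

With whole tokens, every new identifier is an unseen word; with sub-token splitting, parts such as `count` and `read` are shared, so fewer test tokens are unseen. Splitting alone still leaves a closed vocabulary (here `line` remains unseen), which is the gap an open-vocabulary model is meant to close.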

Link to Preprint

http://homepages.inf.ed.ac.uk/s1467463/documents/icse20-main-1325.pdf

DOI

https://doi.org/10.1145/3377811.3380342

Rafael-Michael Karampatsis

The University of Edinburgh

United Kingdom

Hlib Babii

Free University of Bozen-Bolzano

Italy

Romain Robbes

Free University of Bozen-Bolzano

Italy

Charles Sutton

Google Research

United States

Andrea Janes

Free University of Bozen-Bolzano

Session Program

Sat 11 Jul
Displayed time zone: (UTC) Coordinated Universal Time

00:00 - 01:00	P27-Applications (Software Engineering in Practice / Technical Papers / Paper Presentations) at Silla. Chair(s): Ganesha Upadhyaya (Harmony.one)

00:00 12m Talk	Big Code != Big Vocabulary: Open-Vocabulary Models for Source code [Technical Papers]. Rafael-Michael Karampatsis (The University of Edinburgh), Hlib Babii (Free University of Bozen-Bolzano), Romain Robbes (Free University of Bozen-Bolzano), Charles Sutton (Google Research), Andrea Janes (Free University of Bozen-Bolzano)
00:12 12m Talk	Engineering for a Science-Centric Experimentation Platform [Software Engineering in Practice]. Nikos Diamantopoulos (Netflix, Inc.), Jeffrey Wong (Netflix, Inc.), David Issa Mattos (Chalmers University of Technology), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Matthew Wardrop (Netflix, Inc.), Tobias Mao (Netflix, Inc.), Colin McFarland (Netflix, Inc.)
00:24 12m Talk	Managing data constraints in database-backed web applications [Technical Papers]. Junwen Yang (University of Chicago), Utsav Sethi (University of Chicago), Cong Yan (University of Washington), Alvin Cheung (University of California, Berkeley), Shan Lu (University of Chicago)
00:36 12m Talk	Improving Data Scientist Efficiency with Provenance [Technical Papers]. Jingmei Hu (Harvard University), Jiwon Joung (Harvard University), Maia Jacobs (Harvard University), Margo Seltzer (University of British Columbia), Krzysztof Gajos (Harvard University)