ICSE 2020
Wed 24 June - Thu 16 July 2020
Sat 11 Jul 2020 00:00 - 00:12 at Silla - P27-Applications Chair(s): Ganesha Upadhyaya

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.
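The vocabulary growth described above can be illustrated with a minimal sketch. It compares the vocabulary obtained by treating each identifier as a single token with the vocabulary obtained by splitting identifiers on underscores and camelCase boundaries. This is only one illustrative modelling choice and is not presented as the authors' method; the toy identifier list and the names `IDENTIFIERS` and `split_identifier` are hypothetical.

```python
import re
from typing import List

# Hypothetical toy corpus of identifiers; not data from the paper.
IDENTIFIERS = [
    "getUserName", "setUserName", "getUserId", "setUserId",
    "user_name", "user_id", "parseUserName", "parse_user_id",
]

# Matches an acronym run, a capitalized or lowercase word, or a run of digits.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(identifier: str) -> List[str]:
    """Split on underscores and camelCase boundaries, lowercasing each piece."""
    return [m.group(0).lower() for m in _SUBTOKEN.finditer(identifier)]


# Closed vocabulary: every distinct identifier is its own token.
whole_vocab = set(IDENTIFIERS)

# Subtoken vocabulary: identifiers share their constituent words.
sub_vocab = {part for ident in IDENTIFIERS for part in split_identifier(ident)}

print(len(whole_vocab), sorted(whole_vocab))
print(len(sub_vocab), sorted(sub_vocab))
```

On this toy list the eight whole identifiers reduce to six shared subtokens, and a previously unseen identifier such as `parseUserId` would be covered entirely by subtokens already in the vocabulary.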

Conference Day
Sat 11 Jul

Displayed time zone: (UTC) Coordinated Universal Time

00:00 - 01:00
00:00 (12m) Talk
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
Tags: ACM SIGSOFT Distinguished Paper Awards, Artifact Reusable, Technical, Artifact Available
Track: Technical Papers
Authors: Rafael-Michael Karampatsis (The University of Edinburgh), Hlib Babii (Free University of Bozen-Bolzano), Romain Robbes (Free University of Bozen-Bolzano), Charles Sutton (Google Research), Andrea Janes (Free University of Bozen-Bolzano)
00:12 (12m) Talk
Engineering for a Science-Centric Experimentation Platform
Tags: SEIP
Track: Software Engineering in Practice
Authors: Nikos Diamantopoulos (Netflix, Inc.), Jeffrey Wong (Netflix, Inc.), David Issa Mattos (Chalmers University of Technology), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Matthew Wardrop (Netflix, Inc.), Tobias Mao (Netflix, Inc.), Colin McFarland (Netflix, Inc.)
00:24 (12m) Talk
Managing data constraints in database-backed web applications
Tags: Artifact Reusable, Technical, Artifact Available
Track: Technical Papers
Authors: Junwen Yang (University of Chicago), Utsav Sethi (University of Chicago), Cong Yan (University of Washington), Alvin Cheung (University of California, Berkeley), Shan Lu (University of Chicago)
00:36 (12m) Talk
Improving Data Scientist Efficiency with Provenance
Tags: Artifact Reusable, Technical, Artifact Available
Track: Technical Papers
Authors: Jingmei Hu (Harvard University), Jiwon Joung (Harvard University), Maia Jacobs (Harvard University), Margo Seltzer (University of British Columbia), Krzysztof Gajos (Harvard University)