Statistical language modeling techniques have been successfully applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, readability improvement, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modeling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; and 2) presenting an open-vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and that outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.
Thu 9 Jul (displayed time zone: UTC, Coordinated Universal Time)
17:10 - 18:00
17:10 | 50m | Poster | Recognizing Developers' Emotions while Programming | ICSE 2020 Posters | Daniela Girardi (University of Bari), Nicole Novielli (University of Bari), Davide Fucci (Blekinge Institute of Technology), Filippo Lanubile (University of Bari)
17:10 | 50m | Poster | Importance-Driven Deep Learning System Testing | ICSE 2020 Posters | Simos Gerasimou (University of York, UK), Hasan Ferit Eniser (MPI-SWS), Alper Sen (Bogazici University, Turkey), Alper Çakan (Bogazici University, Turkey)
17:10 | 50m | Poster | Open-Vocabulary Models for Source Code (Extended Abstract) | ICSE 2020 Posters | Rafael-Michael Karampatsis (The University of Edinburgh), Hlib Babii (Free University of Bozen-Bolzano), Romain Robbes (Free University of Bozen-Bolzano), Charles Sutton (Google Research), Andrea Janes (Free University of Bozen-Bolzano)
17:10 | 50m | Poster | Do Preparatory Programming Lab Sessions Contribute to Even Work Distribution in Student Teams? | ICSE 2020 Posters | Markus Borg (RISE Research Institutes of Sweden AB)
17:10 | 50m | Poster | Building a Theory of Software Teams Organization in a Continuous Delivery Context | ICSE 2020 Posters | Leonardo Alexandre Ferreira Leite (University of São Paulo), Fabio Kon (University of São Paulo), Gustavo Pinto (UFPA), Paulo Meirelles (Federal University of São Paulo)
17:10 | 50m | Poster | Refactor4Green: A Game for Novice Programmers to Learn Code Smells | ICSE 2020 Posters