Open-Vocabulary Models for Source Code (Extended Abstract) (ICSE 2020 - Posters)

Wed 24 June - Thu 16 July 2020

Who

Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

Track

ICSE 2020 Posters

Time Zone

The program is currently displayed in (UTC) Coordinated Universal Time.

When

Thu 9 Jul 2020 17:10 - 18:00 at Poster Special Room - A310-Posters

Abstract

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.

In this paper, we address this issue by: 1) studying how various modeling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; and 2) presenting an open-vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.
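
To make the out-of-vocabulary problem concrete, here is a minimal Python sketch. It is an illustration only: the abstract does not say which modeling choices were studied, so splitting identifiers on underscores and camelCase boundaries is an assumed example of one such choice, and the helper names (`split_identifier`, `vocabulary`, `oov_rate`) and the toy token lists are hypothetical.

```python
import re


def split_identifier(token):
    """Split an identifier on underscores and camelCase boundaries (lowercased).

    Assumed, illustrative splitting rule; not taken from the paper.
    """
    parts = []
    for chunk in token.split("_"):
        # Acronym runs (HTTP), capitalized or lowercase words (Server, get), digits.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    # Tokens with no letters or digits (e.g. punctuation) are kept whole.
    return [p.lower() for p in parts] or [token]


def vocabulary(tokens, split=False):
    """Collect the set of modeling units seen in a token stream."""
    vocab = set()
    for tok in tokens:
        vocab.update(split_identifier(tok) if split else [tok])
    return vocab


def oov_rate(train_vocab, test_tokens, split=False):
    """Fraction of test units that never appeared in the training vocabulary."""
    units = []
    for tok in test_tokens:
        units.extend(split_identifier(tok) if split else [tok])
    unseen = sum(1 for u in units if u not in train_vocab)
    return unseen / len(units)


# Hypothetical toy data: identifiers seen in training vs. new ones at test time.
train = ["getUserName", "setUserName", "user_id", "fileName"]
test = ["getFileName", "setUserId", "user_name"]

whole_token_oov = oov_rate(vocabulary(train), test)
subtoken_oov = oov_rate(vocabulary(train, split=True), test, split=True)
print(whole_token_oov, subtoken_oov)
```

On this toy data every test identifier is new as a whole token, yet each is built from pieces already seen in training. That gap is the intuition behind an open vocabulary: a model that works on smaller units can still represent identifiers it has never seen.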

Rafael-Michael Karampatsis

The University of Edinburgh

United Kingdom

Hlib Babii

Free University of Bozen-Bolzano

Italy

Romain Robbes

Free University of Bozen-Bolzano

Italy

Charles Sutton

Google Research

United States

Andrea Janes

Free University of Bozen-Bolzano

Italy

Session Program

Thu 9 Jul
Displayed time zone: (UTC) Coordinated Universal Time

17:10 - 18:00	A310-Posters, ICSE 2020 Posters at Poster Special Room

17:10 50m Poster		Recognizing Developers' Emotions while Programming. ICSE 2020 Posters. Daniela Girardi (University of Bari), Nicole Novielli (University of Bari), Davide Fucci (Blekinge Institute of Technology), Filippo Lanubile (University of Bari)
17:10 50m Poster		Importance-Driven Deep Learning System Testing. ICSE 2020 Posters. Simos Gerasimou (University of York, UK), Hasan Ferit Eniser (MPI-SWS), Alper Sen (Bogazici University, Turkey), Alper Çakan (Bogazici University, Turkey)
17:10 50m Poster		Open-Vocabulary Models for Source Code (Extended Abstract). ICSE 2020 Posters. Rafael-Michael Karampatsis (The University of Edinburgh), Hlib Babii (Free University of Bozen-Bolzano), Romain Robbes (Free University of Bozen-Bolzano), Charles Sutton (Google Research), Andrea Janes (Free University of Bozen-Bolzano)
17:10 50m Poster		Do Preparatory Programming Lab Sessions Contribute to Even Work Distribution in Student Teams? ICSE 2020 Posters. Markus Borg (RISE Research Institutes of Sweden AB)
17:10 50m Poster		Building a Theory of Software Teams Organization in a Continuous Delivery Context. ICSE 2020 Posters. Leonardo Alexandre Ferreira Leite (University of São Paulo), Fabio Kon (University of São Paulo), Gustavo Pinto (UFPA), Paulo Meirelles (Federal University of São Paulo)
17:10 50m Poster		Refactor4Green: A Game for Novice Programmers to Learn Code Smells. ICSE 2020 Posters. Vartika Agrahari, Sridhar Chimalakonda (Indian Institute of Technology Tirupati)