ICSE 2020
Wed 24 June - Thu 16 July 2020
Tue 7 Jul 2020 15:00 - 15:12 at Silla - A3-Code Summarization Chair(s): Shaohua Wang

Software developers use a mix of source code and natural language text to communicate with each other, and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, Posit, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize Posit, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. Posit improves the state-of-the-art on language identification by 10.6% and PoS/AST tagging by 23.7% in accuracy.
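
To illustrate what a Conditional Random Field output layer contributes on top of per-token scores, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF. It is not Posit's implementation: the function name `viterbi_decode`, the score matrices, and the three-tag inventory in the usage note are assumptions made for illustration, and in a real model the emission scores would come from the biLSTM and the transition scores would be learned.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][k]   : score of tag k at token position t
                        (assumed to come from an upstream network)
    transitions[j][k] : score of moving from tag j to tag k
    """
    if not emissions:
        return []

    num_tags = len(emissions[0])
    # Best score of any path ending in each tag at the current position.
    score = list(emissions[0])
    # backptrs[t - 1][k] = best previous tag when position t carries tag k.
    backptrs = []

    for t in range(1, len(emissions)):
        new_score = []
        ptrs = []
        for k in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_prev]
                             + transitions[best_prev][k]
                             + emissions[t][k])
            ptrs.append(best_prev)
        score = new_score
        backptrs.append(ptrs)

    # Follow the back-pointers from the best final tag to recover the path.
    best = max(range(num_tags), key=lambda k: score[k])
    path = [best]
    for ptrs in reversed(backptrs):
        best = ptrs[best]
        path.append(best)
    path.reverse()
    return path
```

As a usage example, assume a hypothetical tag inventory `["NOUN", "VERB", "CODE_IDENT"]` and three tokens. With all transition scores set to zero, the decoder simply picks the best tag per token; adding a strong penalty on the transition from `VERB` to `CODE_IDENT` changes the tag chosen for the second token, which is the kind of sequence-level constraint a CRF layer can encode.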

POSIT Slides (slides_icse20_posit.pdf), 3.25 MiB

