ICSE 2020
Wed 24 June - Thu 16 July 2020
Thu 9 Jul 2020 01:17 - 01:25 at Baekje - P16-Security and Learning Chair(s): Lingming Zhang

Application Programming Interfaces (APIs) are widely discussed on socio-technical platforms (e.g., Stack Overflow). Extracting API mentions from such informal software texts is a prerequisite for API-centric search and summarization of programming knowledge. Machine learning based API extraction has been shown to outperform rule-based methods on informal software texts that lack consistent writing forms and annotations. However, machine learning based methods carry a significant overhead in preparing training data and effective features. Training a reliable machine learning based API extraction model for a library often requires several hundred manually labeled sentences mentioning this library’s APIs. The effort to prepare training data for hundreds of libraries would be prohibitive. Furthermore, it may be difficult to prepare sufficient high-quality training data for the APIs of less frequently discussed libraries or frameworks. A related challenge is selecting effective features for a machine learning model to recognize a particular library’s APIs: designers of such a model have to manually select the most effective features for each library’s APIs.

In our paper, we propose a multi-layer neural network-based architecture for API extraction. Our architecture automatically learns character-, word- and sentence-level features from the input texts, thus removing the need for manual feature engineering and the dependence on advanced features (e.g., API gazetteers) beyond the input texts. It is composed of a character-level convolutional neural network (CNN), word-level embeddings, and a sentence-level Bi-directional Long Short-Term Memory (Bi-LSTM) network, which learn the character-, word- and sentence-level features, respectively. We also propose to adopt transfer learning to adapt a source-library-trained model to a target library, thus reducing the overhead of manual training-data labeling when the software text of multiple programming languages and libraries needs to be processed.
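To make the input and output of such an extractor concrete, here is a minimal sketch of how a sentence might be prepared for the character-level and word-level layers and how per-token predictions could be turned into API mentions. It is an illustration under stated assumptions, not the paper's implementation: the function names, the reserved ids, the `max_word_len` parameter, the binary `API`/`O` label set, and the example sentence are all hypothetical, and the network itself is not modeled here.

```python
from typing import Dict, List, Tuple

PAD, UNK = 0, 1  # assumed reserved ids for padding and unknown symbols


def build_vocab(items: List[str]) -> Dict[str, int]:
    """Map each distinct item to an integer id, skipping the reserved ids."""
    vocab: Dict[str, int] = {}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab) + 2
    return vocab


def encode_sentence(
    tokens: List[str],
    word_vocab: Dict[str, int],
    char_vocab: Dict[str, int],
    max_word_len: int,
) -> Tuple[List[int], List[List[int]]]:
    """Return word ids (word-level input) and padded char ids (character-level input)."""
    word_ids = [word_vocab.get(tok.lower(), UNK) for tok in tokens]
    char_ids = []
    for tok in tokens:
        # Truncate long tokens, then pad short ones to a fixed width.
        ids = [char_vocab.get(ch, UNK) for ch in tok[:max_word_len]]
        ids += [PAD] * (max_word_len - len(ids))
        char_ids.append(ids)
    return word_ids, char_ids


def extract_mentions(tokens: List[str], labels: List[str]) -> List[str]:
    """Collect the tokens that a tagger labelled as API mentions."""
    return [tok for tok, lab in zip(tokens, labels) if lab == "API"]


# Hypothetical sentence in which "apply" appears both as an API and as a plain verb.
tokens = ["Call", "df.apply()", "to", "apply", "a", "function", "row-wise"]
word_vocab = build_vocab([t.lower() for t in tokens])
char_vocab = build_vocab([ch for t in tokens for ch in t])
word_ids, char_ids = encode_sentence(tokens, word_vocab, char_vocab, max_word_len=12)

# Hypothetical per-token output of a trained tagger for this sentence.
labels = ["O", "API", "O", "O", "O", "O", "O"]
mentions = extract_mentions(tokens, labels)
```

In this framing, the character ids carry surface cues such as dots and parentheses, the word ids feed the embedding layer, and the sentence-level layer sees the whole sequence, which is what would let a model tell the API mention apart from the ordinary verb.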

We conduct extensive experiments with six libraries of four programming languages, which support diverse functionalities and have different API-naming and API-mention characteristics. Our experiments involve three Python libraries (Pandas, NumPy and Matplotlib), one Java library (JDBC), one JavaScript library (React), and one C library (OpenGL). We manually label API mentions in 3600 Stack Overflow posts (600 for each library) for the experiments. Our experiments investigate the performance of our neural architecture for API extraction in informal software texts, the importance of different features, and the effectiveness of transfer learning. Our results confirm not only that our neural architecture outperforms existing machine learning based methods for API extraction in informal software texts, but also that it is easy to deploy.
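The text above does not say which metrics are used to measure performance. As a sketch only, assuming token-level precision, recall and F1 over a binary `API`/`O` labeling (both the metric choice and the label names are assumptions made for illustration), a comparison of predicted and manually labeled tokens could look like this:

```python
from typing import List, Tuple


def token_prf(
    gold: List[str], pred: List[str], positive: str = "API"
) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 for the positive label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against empty denominators when nothing is predicted or labeled.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for six tokens: two hits, one false alarm, one miss.
gold = ["O", "API", "O", "API", "O", "API"]
pred = ["O", "API", "API", "API", "O", "O"]
precision, recall, f1 = token_prf(gold, pred)
```

The labels here are invented for the example and are not results from the experiments.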

Our paper makes the following four contributions:

Our work is the first one to consider not only the performance of machine learning based API extraction methods but also the easy deployment of such methods for the software text of multiple programming languages and libraries.

We propose a multi-layer neural architecture to automatically learn to extract effective features from the input texts for API extraction, thus removing the need for manual feature engineering as well as the dependence on features beyond the input texts.

We adopt transfer learning to reduce the overhead of manual labeling of the training data of a subject library. We evaluate the effectiveness of transfer learning across libraries and programming languages and analyze the factors that affect its effectiveness.

We conduct extensive experiments to evaluate our architecture as a whole as well as its components. Our results reveal insights into the design of effective mechanisms for API extraction tasks.

Thu 9 Jul
Times are displayed in time zone: (UTC) Coordinated Universal Time

01:05 - 02:05: Paper Presentations - P16-Security and Learning at Baekje
Chair(s): Lingming Zhang (The University of Texas at Dallas)
icse-2020-papers, 01:05 - 01:17
Talk
Jinyin Chen (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China); Keke Hu (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China); Yue Yu (College of Computer, National University of Defense Technology, Changsha 410073, China); Zhuangzhi Chen (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China); Qi Xuan (Institute of Cyberspace Security, Zhejiang University of Technology, Hangzhou 310023, China); Yi Liu (Institute of Process Equipment and Control Engineering, Zhejiang University of Technology, Hangzhou 310023, China); Vladimir Filkov (University of California at Davis, USA)
icse-2020-Journal-First, 01:17 - 01:25
Talk
Suyu Ma (Monash University); Zhenchang Xing (Australian National University); Chunyang Chen (Monash University); Cheng Chen (PricewaterhouseCoopers Firm); Lizhen Qu (Monash University); Guoqiang Li (Shanghai Jiao Tong University)
icse-2020-papers, 01:25 - 01:37
Talk
Xueling Zhang (University of Texas at San Antonio); Xiaoyin Wang (University of Texas at San Antonio, USA); Rocky Slavin (University of Texas at San Antonio); Travis Breaux (Carnegie Mellon University); Jianwei Niu (University of Texas at San Antonio)
icse-2020-papers, 01:37 - 01:49
Talk
Peiming Liu (Texas A&M University); Gang Zhao (Texas A&M University); Jeff Huang (Texas A&M University)
icse-2020-papers, 01:49 - 02:01
Talk
Ana Nora Evans (University of Virginia, USA); Bradford Campbell (University of Virginia); Mary Lou Soffa (University of Virginia)