Predicting Software Defect Type using Concept-based Classification
This is an extended abstract and presentation proposal for the manuscript ID EMSE-D-18-00360R1 accepted by the Empirical Software Engineering journal. The journal paper is not yet online. The accepted manuscript is uploaded along with this proposal.
Automatically predicting the type of a software defect from its description can significantly speed up and improve the software defect management process. The standard supervised-learning approach for this task (Thung et al., WCRE 2012) needs 90% of the data to be labeled for training the classifier. Creating such labeled data is expensive and effort-intensive, and it requires domain-specific expertise.
In this paper, we propose to circumvent this problem by carrying out concept-based classification (CBC) of software defect reports with the help of the Explicit Semantic Analysis (ESA) framework. We first create concept-based representations of a software defect report and of the defect types in the software defect classification scheme by projecting their textual descriptions into a concept space spanned by Wikipedia articles. Then, we compute the semantic similarity between these concept-based representations and assign the defect type that has the highest similarity with the defect report. The proposed CBC approach achieves an F1 score of 63.16%, which is comparable to the state-of-the-art semi-supervised and active learning approach (Thung et al., ICPC 2015) for this task, without requiring labeled training data. The state-of-the-art approach requires labels for 15% of the input defects and achieves an F1 score of 62.3%.
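The pipeline described above (project text into a concept space, compare by semantic similarity, pick the most similar defect type) can be illustrated with a minimal Python sketch. It is only a toy approximation of ESA and not the implementation used in the paper: the four-entry CONCEPTS dictionary stands in for Wikipedia, the defect-type names and keywords in DEFECT_TYPES are hypothetical, and the smoothed TF-IDF weighting and cosine similarity are assumptions chosen for simplicity.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for Wikipedia: concept title -> article text.
CONCEPTS = {
    "Control flow": "loop branch condition if else statement execution order loop",
    "User interface": "button window display screen layout click menu",
    "Memory management": "memory allocation leak pointer buffer heap free",
    "Computer performance": "slow latency performance speed memory timeout",
}

# Hypothetical defect types, each described only by a few keywords.
DEFECT_TYPES = {
    "control-flow": "loop condition branch",
    "interface": "button screen display",
    "resource": "memory leak performance",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(concepts):
    """Map each term to a {concept: TF-IDF weight} dictionary."""
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(concepts)
    index = {}
    for name, counts in tf.items():
        for term, freq in counts.items():
            # Smoothed IDF (an assumption) so that shared terms keep some weight.
            idf = math.log(n / df[term]) + 1.0
            index.setdefault(term, {})[name] = freq * idf
    return index


def concept_vector(text, index):
    """Project a text into concept space by summing its terms' concept weights."""
    vec = Counter()
    for term, freq in Counter(tokenize(text)).items():
        for concept, weight in index.get(term, {}).items():
            vec[concept] += freq * weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(report, defect_types, index):
    """Return the defect type most similar to the report, plus all scores."""
    report_vec = concept_vector(report, index)
    scores = {
        name: cosine(report_vec, concept_vector(keywords, index))
        for name, keywords in defect_types.items()
    }
    return max(scores, key=scores.get), scores


index = build_inverted_index(CONCEPTS)
label, scores = classify(
    "Application hangs because the loop condition never becomes false",
    DEFECT_TYPES,
    index,
)
print(label, scores)
```

No labeled defect reports are used anywhere in this sketch: the only inputs are the concept articles, the keywords describing each defect type, and the text of the report to classify. In this toy setting the example report shares concepts only with the hypothetical "control-flow" type, so that label is selected.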
Unlike the state-of-the-art approach, our method does not need access to the source code used to fix the defect. We use only the textual description of the defect reports and the keywords describing the defect types in the defect classification scheme. Note that learning a classifier without labeled training data is known as zero-shot learning, and it is a significantly harder task than learning a classifier using labeled data. The proposed concept-based classification of software defect types is the first application of the zero-shot learning philosophy in the software defect analytics domain.