Duplicate Bug Report Detection with a Combination of
Information Retrieval and Topic Modeling

Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, David Lo, and Chengnian Sun




Detecting duplicate bug reports reduces triaging effort and saves developers' time spent fixing the same bugs. Among the many automated detection approaches, text-based information retrieval (IR) approaches have been shown to outperform others in terms of both accuracy and time efficiency. However, these IR-based approaches perform poorly at detecting duplicate reports that describe the same technical issue using different descriptive terms.

Our tool, DBTM, is a duplicate bug report detection approach that takes advantage of both IR-based and topic-based features. In DBTM, a bug report is treated as a textual document describing one or more technical issues (topics) in a system. Duplicate bug reports describe the same technical issue(s) even when the issue(s) are reported in different terms. In our topic model, we extend Latent Dirichlet Allocation (LDA) to represent the topic structure of bug reports as well as the duplication relations among them. Two duplicate bug reports must describe the shared technical issue(s)/topic(s), in addition to their own topics on different phenomena and their own terms. The topic selection of a bug report is affected not only by the topic distribution of that report, but also by the buggy topic(s) for which the report is intended. We also apply the Ensemble Averaging technique to combine IR and topic modeling in DBTM. We use Gibbs sampling to train DBTM on historical data with identified duplicate bug reports and then detect other, not-yet-identified duplicates.
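To illustrate the combination step, the following is a minimal sketch of ensemble averaging over two similarity scores. It is not the DBTM implementation: the function names, the single weight parameter `ir_weight`, and the assumption that both scores are already normalized to [0, 1] and that the two weights sum to 1 are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple


def ensemble_average(ir_score: float, topic_score: float,
                     ir_weight: float = 0.5) -> float:
    """Weighted average of an IR-based and a topic-based similarity score.

    Assumption: both scores lie in [0, 1] and the weights sum to 1,
    so the topic-based weight is (1 - ir_weight).
    """
    if not 0.0 <= ir_weight <= 1.0:
        raise ValueError("ir_weight must be in [0, 1]")
    return ir_weight * ir_score + (1.0 - ir_weight) * topic_score


def rank_candidates(ir_scores: Dict[str, float],
                    topic_scores: Dict[str, float],
                    ir_weight: float = 0.5,
                    top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank existing reports as duplicate candidates for a new report.

    Only reports that have both an IR score and a topic score are ranked.
    Ties are broken by report id so the ordering is deterministic.
    """
    shared_ids = ir_scores.keys() & topic_scores.keys()
    combined = {
        report_id: ensemble_average(ir_scores[report_id],
                                    topic_scores[report_id],
                                    ir_weight)
        for report_id in shared_ids
    }
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical scores of three existing reports against one new report.
    ir = {"a": 1.0, "b": 0.5, "c": 0.0}
    topic = {"a": 0.0, "b": 0.5, "c": 1.0}
    for report_id, score in rank_candidates(ir, topic, ir_weight=0.25):
        print(report_id, score)
```

In this sketch, report "c" shares no descriptive terms with the new report (IR score 0.0) but has a high topic-based score, so weighting the topic-based score more heavily moves it to the top of the candidate list. In practice the weight would have to be tuned on historical data with known duplicates.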