Duplicate Bug Report Detection with a Combination of
Information Retrieval and Topic Modeling

Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, David Lo, and Chengnian Sun


DBTM's Approach

Combination of Topic Model and BM25F


Our model has two prediction experts: y1, an expert based on the topic model (T-model), and y2, an expert based on textual features (BM25F). The two experts have complementary strengths in predicting duplicate bug reports. The textual expert (y2) compares reports more strictly, so it is better at detecting duplicates written with the same textual tokens. However, it does not work well on bug reports that describe the same technical issue in different terms. T-model, on the other hand, can detect that two bug reports share topics even when their texts are not similar. However, because topics are a dimension-reduced representation of the text contents, a comparison on topics is less strict than a comparison on texts. By combining both models, DBTM gets the best of both worlds: it detects duplicate bug reports based on both topic similarity and textual similarity.

The combined expert is a linear combination of the two experts:

                                       y = α1 × y1 + α2 × y2

where α1 and α2 are parameters that control the significance of each expert in estimating duplicate bug reports. They satisfy α1 + α2 = 1 and are project-specific. In the extreme cases, when α1 = 1 and α2 = 0, only the topic-based expert is used, and when α1 = 0 and α2 = 1, only the text-based expert is used.
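
The linear combination can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names combine_experts and rank_candidates, the dictionary layout of the scores, and the score values in the usage example are hypothetical and chosen only for illustration. The sketch also assumes that α1 lies in [0, 1], which the extreme cases above suggest but the text does not state explicitly, and that the two expert scores are on comparable scales.

```python
from typing import Dict, List, Tuple


def combine_experts(y1: float, y2: float, alpha1: float) -> float:
    """Return y = alpha1 * y1 + alpha2 * y2, with alpha2 = 1 - alpha1.

    y1 is the topic-based (T-model) score and y2 the text-based (BM25F) score.
    """
    # Assumption: alpha1 is restricted to [0, 1] so both weights are non-negative.
    if not 0.0 <= alpha1 <= 1.0:
        raise ValueError("alpha1 must lie in [0, 1]")
    alpha2 = 1.0 - alpha1  # enforces alpha1 + alpha2 = 1
    return alpha1 * y1 + alpha2 * y2


def rank_candidates(
    scores: Dict[str, Tuple[float, float]], alpha1: float
) -> List[str]:
    """Rank candidate report ids by combined score, highest first.

    `scores` maps a (hypothetical) report id to its (y1, y2) pair.
    """
    combined = {
        report_id: combine_experts(y1, y2, alpha1)
        for report_id, (y1, y2) in scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


if __name__ == "__main__":
    # Made-up scores: report "A" matches on topics, report "B" on text.
    example = {"A": (0.9, 0.1), "B": (0.2, 0.8)}
    print(rank_candidates(example, alpha1=1.0))  # topic-based expert only
    print(rank_candidates(example, alpha1=0.0))  # text-based expert only
```

Setting alpha1 to 1 or 0 reproduces the two extreme cases described above, and intermediate values trade off the two experts. Because the weights are project-specific, a suitable alpha1 would have to be chosen separately for each project.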