Duplicate Bug Report Detection with A Combination of
Information Retrieval and Topic Modeling

Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, David Lo, and Chengnian Sun


Empirical Evaluation


Table 1. Statistics of All Bug Report Data


The evaluation setting is the same as in REP. All bug reports were sorted in chronological order, and the data set was divided into two parts. The training set consists of the first M reports in the repository, of which 200 are duplicates; it was used to train the parameters of T-Model, BM25F, and DBTM. The remaining reports were used for testing. In each run, we ran DBTM over the testing reports in chronological order. When it reaches a duplicate report b, it returns a list of the top-k potential duplicate report groups. If the true duplicate report group G of b appears in that top-k list, we count it as a hit. We then add b to that group so that it can be used in later retrieval. The top-k accuracy (i.e., the recall rate) is the ratio of the number of hits to the total number of bug reports considered.
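The evaluation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `top_k_accuracy`, `rank`, and `groups` are hypothetical, `rank` stands in for any trained model (such as DBTM) that returns candidate group identifiers ordered best first, only duplicate reports are assumed to count toward the denominator, a duplicate is assumed to join its true group whether or not it was a hit, and a non-duplicate report is assumed to start a new group of its own.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# (report_id, true_group_id); true_group_id is None for a non-duplicate report.
Report = Tuple[str, Optional[str]]
# Hypothetical ranker: maps a report id and the current groups to
# candidate group ids, ordered from most to least likely.
Ranker = Callable[[str, Dict[str, List[str]]], Sequence[str]]


def top_k_accuracy(
    test_reports: Sequence[Report],
    groups: Dict[str, List[str]],
    rank: Ranker,
    k: int,
) -> float:
    """Replay test reports in chronological order and return hits / considered."""
    hits = 0
    considered = 0
    for report_id, true_group in test_reports:
        if true_group is None:
            # Assumption: a non-duplicate report opens a new group keyed by its id.
            groups.setdefault(report_id, [report_id])
            continue
        considered += 1
        candidates = list(rank(report_id, groups))[:k]
        if true_group in candidates:
            hits += 1
        # Assumption: the report joins its true group regardless of the outcome,
        # so that later queries can retrieve against it.
        groups.setdefault(true_group, []).append(report_id)
    return hits / considered if considered else 0.0
```

Because `groups` is updated in place as reports are replayed, the ranker always sees the repository state as it would have been when each report arrived, which is the point of processing the test set chronologically.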