Duplicate Bug Report Detection with a Combination of
Information Retrieval and Topic Modeling

Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, David Lo, and Chengnian Sun


Motivating Examples


Generally, a bug report is a record in a bug-tracking database, containing several descriptive fields about the reported bug(s). Important fields in a bug report include a unique identification number of the report (ID), creation time (CreationDate), the reporting person (Reporter), and most importantly, a short summary (Summary) and a full description (Description) of the bug(s).


Observations on a bug report


Figure 1. Bug report BR2 in Eclipse project


Figure 1 displays an example of an already-fixed bug report in the Eclipse project. This bug report was assigned ID 2 and was filed on 10/10/2001 by Andre Weinand for a bug in Eclipse v2.0. It describes how the system always uses the default text editor to open and display any resource file stored in the repository (e.g. a GIF image), regardless of its type. Analyzing the contents of BR2, we make the following observations:

  1.  This bug report is about two technical functions in Eclipse: artifact manipulation (MAN) and versioning (VCM).

  2.  The bug occurred in the code implementing MAN. That is, the operation that opens a resource file in the repository was incorrectly implemented. We can consider MAN as the technical issue reported in BR2. Andre Weinand reported that issue in the context of opening a resource in the version repository (VCM). He also described the phenomenon in the context of opening a GIF image file.

  3.  In bug report BR2, the technical function MAN can be recognized in the contents via the words that are relevant to MAN, such as editor, open, view, content, resource, file, and text. Similarly, the terms relevant to VCM in the report are repository, resource, and file. Considering bug reports as textual documents, we can view the described technical issues as the topics of those documents.


Observations on a duplicate bug report


Figure 2. Bug report BR9779, a duplicate of BR2


Figure 2 presents bug report #9779, filed on 02/13/02 by a different person, Jeff Brown. Eclipse's developers determined that this report describes the same bug as BR2. Analyzing the content of BR9779 and comparing it to BR2, we can see that:
  1.  BR9779 also reports the same bug in the file-manipulation technical function MAN. Jeff Brown reported the issue in the context of opening remote files, with more detailed information on the code realizing that function: OpenRemoteFileAction, the class responsible for opening a remote file, is directly associated with org.eclipse.ui.DefaultTextEditor, i.e. it always uses the default editor to open a remote file. He also suggested a fix: querying the Workbench's IEditorRegistry for the default editor given the file name.
  2.  In addition to the similar terms used to describe the MAN topic in both reports (e.g. open, editor, file), there are different terms expressing similar meanings, such as honor, mapping, resource type, and sensible way in BR2, versus appropriate, registered editor, and file type in BR9779. The terms for VCM in BR9779 (e.g. remote, revision, history) differ from those in BR2. Importantly, because BR9779 carries additional information, it introduces new terms and topics (e.g. registered editor, user preference, navigator).


The detection of duplicate bug reports benefits software maintenance. Duplicate bug reports, filed by people with different points of view and experience, can provide different kinds of information about the bug(s) and thus help in the fixing process. Importantly, detecting such duplicates helps avoid redundant bug-fixing effort. Because the same bug is exposed and discovered by different reporters in different contexts and through different phenomena, the technical issue may be reported with different terms in addition to similar ones. Duplicate reports describe the same technical issue; however, reporters might also discuss other relevant topics and phenomena, and provide insights on the bug, including suggested fixes and relevant technical functions.

Those observations suggest that the detection of duplicate bug reports could be based not only on the technical terms, but also on the technical topics in the reports. Intuitively, topics are latent, semantic features, while terms are visible, textual features of the documents. The two would complement each other, and we could combine them to achieve higher detection accuracy.
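
The intuition of combining textual and topical evidence can be illustrated with a minimal Python sketch. It is not the model used in this work: the hand-written topic word lists (drawn from the terms listed in the observations above), the Jaccard and cosine measures, the equal weighting, and the three short example texts (loose paraphrases, not the actual report contents) are all assumptions made for illustration. A real system would learn topics from the bug repository with a topic model rather than from fixed word lists.

```python
import math
import re

# Hypothetical topic vocabularies, assembled from the terms mentioned in the
# observations above; a topic model would learn these instead.
TOPIC_TERMS = {
    "MAN": {"editor", "open", "view", "content", "resource", "file", "text"},
    "VCM": {"repository", "resource", "file", "remote", "revision", "history"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def term_similarity(a, b):
    """Jaccard overlap of the two word sets (a simple textual feature)."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def topic_vector(text):
    """Share of topic-word occurrences that fall into each topic."""
    tokens = tokenize(text)
    counts = [sum(1 for t in tokens if t in terms)
              for terms in TOPIC_TERMS.values()]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def topic_similarity(a, b):
    """Cosine similarity of the two topic vectors (a simple topical feature)."""
    va, vb = topic_vector(a), topic_vector(b)
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def combined_score(a, b, weight=0.5):
    """Linear mix of both features; the weight is an arbitrary assumption."""
    return weight * term_similarity(a, b) + (1 - weight) * topic_similarity(a, b)


# Invented example texts that loosely paraphrase the two reports.
br2_like = ("repository resource opened with default text editor "
            "should honor resource type mapping")
br9779_like = ("opening remote file should use appropriate registered "
               "editor for file type")
unrelated = "crash when printing page margins"

print(term_similarity(br2_like, br9779_like))
print(topic_similarity(br2_like, br9779_like))
print(combined_score(br2_like, br9779_like))
print(combined_score(br2_like, unrelated))
```

In this toy setting the two paraphrased reports share few words, so their term overlap is low, while their topic vectors are nearly parallel; the combined score therefore separates them from the unrelated text more clearly than term overlap alone.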