Generally, a bug report is a record in a bug-tracking database, containing several descriptive fields about the reported bug(s). Important fields in a bug report include a unique identification number of the report (ID), creation time (CreationDate), the reporting person (Reporter), and most importantly, a short summary (Summary) and a full description (Description) of the bug(s).
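As a minimal sketch of this record structure, the fields named above can be modeled as a small Python data class. The class name `BugReport`, the helper `text()`, and the snake_case attribute names are illustrative assumptions, not part of any bug tracker's schema. The ID, date, and reporter are taken from the BR2 example discussed in this section; the summary and description strings are paraphrases, not the report's actual wording.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BugReport:
    """One record in a bug-tracking database (hypothetical field names)."""
    id: int               # unique identification number (ID)
    creation_date: date   # creation time (CreationDate)
    reporter: str         # the reporting person (Reporter)
    summary: str          # short summary (Summary)
    description: str      # full description (Description)

    def text(self) -> str:
        """Concatenate the two free-text fields for textual analysis."""
        return f"{self.summary}\n{self.description}"


# ID, date, and reporter follow the BR2 example; the summary and
# description are paraphrased for illustration only.
br2 = BugReport(
    id=2,
    creation_date=date(2001, 10, 10),
    reporter="Andre Weinand",
    summary="Resource files always open in the default text editor",
    description="Any resource file in the repository, e.g. a GIF image, "
                "is opened with the default text editor regardless of its type.",
)
```

Keeping the summary and description together in one helper reflects the text's emphasis on these two fields as the most important ones; how they are actually weighted or combined is left open here.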
Observations on a bug report
Figure 1. Bug report BR2 in the Eclipse project
Figure 1 displays an example of an already-fixed bug report in the Eclipse project. This bug report was assigned ID 2 and was reported on 10/10/2001 by Andre Weinand for a bug in Eclipse v2.0. It described that the system always uses the default text editor to open and display any resource file (e.g., a GIF image) stored in the repository, regardless of its type. Analyzing the contents of BR2, we have the following observations:
Observations on a duplicate bug report
Figure 2. Bug report BR9779, a duplicate of BR2
Figure 2 presents bug report #9779, filed on 02/13/02 by a different person, Jeff Brown. Eclipse's developers determined that this report describes the same bug as BR2. Analyzing the content of BR9779 and comparing it to BR2, we can see that:
The detection of duplicate bug reports has benefits in software maintenance. Duplicate bug reports, filed by people with different points of view and experience, could provide different kinds of information about the bug(s) and thus help in the fixing process. Importantly, detecting such duplicates helps avoid redundant bug-fixing effort. Because the same bug was exposed and discovered by different reporters in different contexts and through different phenomena, the technical issue could be reported with different terms in addition to similar terms. Duplicate reports describe the same technical issue; however, reporters might discuss other relevant topics and phenomena and provide insights into the bug, including suggested fixes and relevant technical functions.
These observations suggest that the detection of
duplicate bug reports could be based not only on the technical terms but also on
the technical topics in the reports. Intuitively, topics are latent, semantic
features, while terms are visible, textual features of the documents. They would
complement each other, and we could combine them to achieve higher detection