Empirical Study - Results

For each collected bug in the oracle, we ran MkFault on the buggy Makefile (and the included ones) to produce the list of suspicious lines in the Makefiles with their suspiciousness scores. We measured MkFault's performance by its effectiveness score and top-n accuracy.
The effectiveness score E is defined as follows:

$$E = \left(1 - \frac{Rank(fault)}{TrLines}\right) \times 100\%$$

where Rank(fault) is the rank of the faulty line in MkFault's ranked list and TrLines is the number of lines involved in the evaluation and execution traces of the Makefile (i.e., the number of lines that a developer would need to inspect with a debugger to find the fault). That is, the effectiveness score E is the percentage of lines that the developer need not inspect when using MkFault's results. A higher effectiveness score indicates that more fault-localization effort is saved. If the faulty line has the same suspiciousness score as other lines, its rank can vary from the smallest to the largest ranking number within that set of tied lines (called the S_fault set). For example, suppose lines L1 to L5 are assigned the suspiciousness scores L1=0.7, L2=0.3, L3=0.5, L4=0.9, L5=0.7, and L1 is faulty. Then S_fault = {L1, L5}, and L1 can rank either second or third out of the five lines. To handle such cases, we compute the effectiveness score as both E_high and E_low, for the highest and lowest ranks in S_fault respectively (in the previous example, E_high = 60% and E_low = 40%).
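The tie handling above can be illustrated with a short Python sketch. It is not MkFault's implementation: the function names `rank_range` and `effectiveness` are hypothetical, and the score is assumed to be E = (1 - Rank(fault) / TrLines) × 100%, which reproduces the 60% and 40% of the five-line example when TrLines is taken to be 5.

```python
def rank_range(scores, faulty):
    """Return the (best, worst) 1-based rank of the faulty line.

    Lines are ranked by descending suspiciousness score; lines tied with
    the faulty line (the S_fault set) may be ordered arbitrarily.
    """
    target = scores[faulty]
    higher = sum(1 for s in scores.values() if s > target)
    tied = sum(1 for s in scores.values() if s == target)  # includes faulty
    return higher + 1, higher + tied


def effectiveness(rank, tr_lines):
    """Percentage of traced lines that need not be inspected.

    Algebraically equal to (1 - rank / tr_lines) * 100.
    """
    return 100.0 * (tr_lines - rank) / tr_lines


# Scores from the example in the text; L1 is assumed to be the faulty line.
scores = {"L1": 0.7, "L2": 0.3, "L3": 0.5, "L4": 0.9, "L5": 0.7}
best, worst = rank_range(scores, "L1")

# Assumption: TrLines equals the five scored lines in this toy example.
e_high = effectiveness(best, len(scores))
e_low = effectiveness(worst, len(scores))
print(best, worst, e_high, e_low)
```

Here the best rank yields E_high and the worst rank yields E_low, so the two values bracket the effort a developer would actually save.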

For top-n accuracy, we count the number of times (or hits) that the faulty line is ranked among the top n of the ranked list returned by MkFault. The top-n accuracy is measured by the ratio of the number of hits over the total number of localization cases.
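As a rough sketch of this metric, the Python snippet below computes top-n accuracy as a pair of values, one for the worst and one for the best possible rank of the faulty line among tied lines, matching the two numbers reported per cell in the results table. The function name `top_n_accuracy` and the rank data are hypothetical and serve only as an illustration; they are not taken from the study.

```python
def top_n_accuracy(rank_ranges, n):
    """Return (low, high) top-n accuracy in percent.

    rank_ranges holds one (best_rank, worst_rank) pair per localization
    case, where the two ranks differ only if the faulty line is tied
    with other lines.
    """
    total = len(rank_ranges)
    # A case is a hit if the faulty line falls within the top n.
    hits_low = sum(1 for _, worst in rank_ranges if worst <= n)
    hits_high = sum(1 for best, _ in rank_ranges if best <= n)
    return 100.0 * hits_low / total, 100.0 * hits_high / total


# Hypothetical rank ranges for four localization cases (illustrative only).
cases = [(1, 1), (1, 3), (2, 6), (7, 12)]

top1 = top_n_accuracy(cases, 1)
top5 = top_n_accuracy(cases, 5)
top10 = top_n_accuracy(cases, 10)
print(top1, top5, top10)
```

A case whose rank range straddles n, such as (2, 6) for n = 5, counts as a hit only in the upper value, which is why the two numbers in a cell can differ widely.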

The results:

System	List	LOC	TrLines	E_high	E_low	Top-1	Top-5	Top-10
Actiongame	Cases	691	513	100%	100%	68-89%	100-100%	100-100%
Blood Frontier	Cases	769	665	100%	100%	82-100%	100-100%	100-100%
Dream Toolbox	Cases	400	127	97%	59%	7-73%	53-87%	60-93%
GMod	Cases	430	23	88%	68%	15-65%	25-75%	25-85%
X10	Cases	262	36	95%	76%	5-75%	10-80%	20-95%
Average	-	510	273	96%	80%	35-81%	58-88%	61-95%

For top-n accuracy, each cell has two numbers, corresponding to the cases where the faulty line is ranked lowest or highest among the set of lines with the same score. As seen, E_high is within 88-100%, and E_low is within 59-100%. On average, E_high is 96% and E_low is 80%. Thus, MkFault is highly effective and could save up to 96% of the debugging effort on average. MkFault also achieves high accuracy. In up to 81% of the cases, the single top-ranked line contains the fault. In 58-88% of the cases, one could find the fault within the first 5 recommended lines. The wide range in top-n accuracy (e.g., 58-88% for top-5 accuracy) arises because the faulty line often shares its suspiciousness score with other lines, so its possible ranks span from within the top n to beyond the n-th rank (i.e., falling out of the top-n result).

Fault Localization for Build Crashes

Jafar Al-Kofahi, Hung Viet Nguyen, Tien N. Nguyen
