Experiments on Parameter Sensitivity Analysis
We first aimed to measure the accuracy of GraPacc with various values of its parameters. GraPacc has four parameters:
- δ: the feature similarity threshold (Section
- α, β, γ: the name-based similarity weights used in comparing the package, class, and method names of two features
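The weighted, thresholded comparison that these parameters control can be sketched as follows. This is a minimal illustration, not GraPacc's actual implementation: the `(package, class, method)` triple representation and the `name_sim` comparator (a simple character-level ratio) are assumptions made for the sketch.

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Hypothetical stand-in for GraPacc's name-based similarity:
    # a character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def feature_similarity(q, p, alpha=1/3, beta=1/3, gamma=1/3):
    # q and p are (package, class, method) name triples; the weights
    # alpha, beta, gamma scale the package-, class-, and method-name
    # similarities, respectively.
    return (alpha * name_sim(q[0], p[0])
            + beta * name_sim(q[1], p[1])
            + gamma * name_sim(q[2], p[2]))

def is_match(q, p, delta, alpha=1/3, beta=1/3, gamma=1/3):
    # Two features are considered matched when their weighted
    # similarity reaches the threshold delta.
    return feature_similarity(q, p, alpha, beta, gamma) >= delta
```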
Figure 1. Accuracy with different thresholds on Dom4J
In our first experiment, we ran GraPacc on the subject
system Dom4J, with 127 source files, 565 test methods that
use Java Util, and a total of 660 recommended API usages
with 1,637 API elements. We fixed α = β = γ = 1/3, varied
δ from 0.3 to 1 in steps of 0.1, and measured precision,
recall, and f-score at each value of δ. Figure 1 shows
the result. As shown, precision ranges from 80% to 93%,
recall from 70% to 79%, and f-score from 75% to 85%. In
general, the f-score is high and quite stable. As
seen, when the name-based feature similarity threshold δ increases,
the f-score also increases slightly and peaks at δ = 0.9. This
is reasonable because a high threshold on name-based feature
similarity helps GraPacc avoid incorrect matches between
types, variables, and other API elements in a query and those in the
patterns, i.e., it improves precision. However, strict matching
(δ = 1) reduces recall, since it allows only identical
names to be matched and thus reduces the number of
matched features and pattern candidates.
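The sweep just described can be sketched as follows. Here `evaluate` is a hypothetical stand-in for running GraPacc on the test set at a given δ and returning (precision, recall); the f-score is the standard harmonic mean of the two.

```python
def f_score(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def sweep_delta(evaluate, deltas=None):
    """Evaluate at each threshold delta and collect the metrics.

    `evaluate(delta)` is a placeholder for running the tool on the
    test set at that threshold; it must return (precision, recall).
    """
    if deltas is None:
        # delta = 0.3, 0.4, ..., 1.0, as in the experiment.
        deltas = [round(0.3 + 0.1 * i, 1) for i in range(8)]
    results = {}
    for delta in deltas:
        p, r = evaluate(delta)
        results[delta] = (p, r, f_score(p, r))
    return results
```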
In our next experiment, we ran GraPacc on Dom4J with
various values of the weighting parameters α, β, and γ.
The higher the values of α, β, and γ, the more
weight is given to the similarity of package, class, and
method names, respectively, in the feature comparison between
a query and a pattern. Since the three parameters
must sum to 1, we varied only β and γ, each from 0 to
1 in steps of 0.1. We fixed δ at 0.5 to allow
α, β, and γ to have more impact on the accuracy.
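Under the constraint α + β + γ = 1, enumerating the grid of weight combinations can be sketched as follows. This is an illustrative sketch: it varies β and γ as in the experiment, derives α, and skips combinations where α would be negative (the paper does not state how such combinations were handled).

```python
def weight_grid(step=0.1):
    """Enumerate (alpha, beta, gamma) triples with alpha + beta + gamma = 1.

    Only beta and gamma are varied, each from 0 to 1 in `step`
    increments; alpha is derived from the sum constraint.
    """
    n = round(1 / step)
    grid = []
    for i in range(n + 1):          # beta = 0.0, 0.1, ..., 1.0
        for j in range(n + 1):      # gamma = 0.0, 0.1, ..., 1.0
            beta, gamma = i * step, j * step
            alpha = 1.0 - beta - gamma
            if alpha < -1e-9:       # beta + gamma would exceed 1
                continue
            grid.append((round(alpha, 1), round(beta, 1), round(gamma, 1)))
    return grid
```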
Table 1. Accuracy with different β and γ
Table 1 shows the f-score with various values of β
and γ. Each row shows how the f-score changes with γ
when β is fixed at a value from 0 to 1. As seen, when
β is zero (i.e., the similarity between class names is disregarded),
the f-score values are lower than in the other cases. However,
when β is non-zero, the f-score increases as γ increases,
i.e., as more weight is put on method-name similarity.
Moreover, as seen in each column (i.e., when γ is fixed),
the f-score does not change much as (non-zero) β varies from 0.1 to 1.
Importantly, the f-score peaks at 81% when γ = 0.6 and β = α = 0.2.
In addition, the region of maximum f-scores lies around
0.5-0.8 for γ and 0.1-0.5 for β.
Thus, this result shows that assigning a higher weight to
method-name similarity than to package- and class-name
similarity yields higher accuracy. However, we
cannot disregard the package and class names (i.e., α and β) when
comparing the features.