Usefulness in Code Completion
We conducted a controlled experiment to evaluate how well our
code completion tool GraPacc can assist developers in programming, in
comparison with standard tool support: Google and Google Code Search
(GCS) for tutorials and code examples, and Eclipse's built-in code
completion. Our evaluation approach is to have human subjects with
comparable programming experience perform coding tasks with the code
completion support from GraPacc and with the support from the standard
tools Google/GCS/Eclipse (GCE), and then to compare their resulting
code.
1. Experiment settings:
We prepared six different programming tasks (see the task descriptions).
The tasks mainly involve standard Java libraries, including java.util
and java.lang. To reduce bias, the tasks were designed by the second
author of our paper, while the first author independently mined and
prepared the API usage patterns of those libraries from our collected
data set without knowing the tasks.
To avoid situations in which a subject chooses an algorithm that does
not use any of those Java libraries (and thus needs neither GraPacc nor
GCE for those libraries), we designed the tasks such that they could be
easily implemented using API elements of those libraries. Each task
description includes brief guidance on the algorithm and the general
data structures to be used, such as arrays, lists, maps, sets, and
stacks. The tasks cover several API elements related to many functions
of the libraries.
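For concreteness, the following is a hypothetical illustration (not one
of the actual tasks) of the kind of java.util API usage pattern the
tasks were designed around:

    import java.util.HashMap;
    import java.util.Map;

    public class WordCounter {
        // Count the occurrences of each word using a Map, a typical
        // java.util usage pattern; this class is illustrative only.
        public static Map<String, Integer> countWords(String[] words) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : words) {
                // getOrDefault avoids a separate containsKey check.
                counts.put(word, counts.getOrDefault(word, 0) + 1);
            }
            return counts;
        }
    }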
We invited four Ph.D. students majoring in Software Engineering
at Iowa State University to complete those tasks. Subjects 1 and 2 have
8-9 years of Java programming experience, while subjects 3 and 4 have
5-6 years. All subjects are familiar with the aforementioned Java
libraries. For comparison, subjects 1 and 3 formed Group 1, and the
other two formed Group 2, so that each group had a comparable mixture
of experience. To further reduce any imbalance between the two groups,
we applied a crossover technique and required the two groups to
exchange their roles (using GraPacc vs. using GCE) after each task.
2. Evaluation metrics:
We measured the usefulness of the tool support in two dimensions:
code quality and developer effort.
- For code quality, we used the number of programming errors (i.e.,
bugs). To measure it, we tested the code submitted by the subjects:
for each task, we designed a white-box test suite to exercise the
functions of the completed code and reveal bugs.
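As a minimal sketch, a white-box test case of this kind could look as
follows, using JUnit and the hypothetical WordCounter from the earlier
illustration; the actual per-task test suites are not shown here:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.Map;

    import org.junit.jupiter.api.Test;

    public class WordCounterTest {
        // A bug such as overwriting a count instead of incrementing it
        // would be revealed by the repeated-word assertion below.
        @Test
        public void countsRepeatedWords() {
            Map<String, Integer> counts =
                    WordCounter.countWords(new String[] {"a", "b", "a"});
            assertEquals(2, counts.get("a").intValue());
            assertEquals(1, counts.get("b").intValue());
        }
    }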
- Developer effort is generally measured via the amount of code
actually written. Thus, the usefulness of a tool can be measured as
the ratio of the amount of code provided by the tool to the total
amount of written code: the higher the ratio, the more code is filled
in via the tool support, and thus the more useful the tool. We used
the number of tokens as the metric for code volume.
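The following is a minimal sketch of this ratio computation; the class
and method names are illustrative, not part of GraPacc's toolchain:

    public class EffortMetric {
        // Ratio of tool-provided tokens to all tokens in the final code;
        // a higher ratio means the tool filled in more of the code.
        public static double toolContributionRatio(int tokensFromTool,
                                                   int totalTokens) {
            if (totalTokens <= 0) {
                throw new IllegalArgumentException(
                        "total token count must be positive");
            }
            return (double) tokensFromTool / totalTokens;
        }

        public static void main(String[] args) {
            // Example: if the tool provided 120 of 300 tokens, the ratio is 0.4.
            System.out.println(toolContributionRatio(120, 300));
        }
    }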