Graph-based Statistical Language Model for Code

Anh Tuan Nguyen, and Tien N. Nguyen


Source code is repetitive. The programming patterns detected from code have been shown to be useful in many software engineering (SE) applications including integrated development environment (IDE) supports, code search, documentation, etc. Recognizing their importance, many approaches have been proposed to detect code patterns in various forms. They can generally be classified into two categories: 1. deterministic pattern mining algorithms, e.g., mining frequent item sets, frequent graphs, etc. 2. statistical language models like n-gram. Recent research in SE has shown that n-gram language model is useful in capturing fine-grained code patterns to support code suggestion.

n-gram is popularly used for text analysis in natural language processing (NLP). However, source code in a program has well-defined syntax and semantics according to the programming languages. The syntax and many levels of semantics can be represented by tree-based data structures or graph-based data structures. Using n-gram model to capture API usage patterns faces the following issues.

  1. It is not always that there is a required order between two API method calls. However, to model that pattern, n-gram requires a total order among API calls.
  2. The code of a pattern often interleaves with the code for project-specific logic. An n-gram could incorrectly include the nearby tokens of the project-specific usage into the pattern.
  3. In many cases, the prior n API calls do not sufficiently provide the context for code suggestion since the API calls in the same usage/pattern could be far apart and cannot be captured within n elements.

In this work, we introduce GraLan, a graph-based statistical language model, that is able to learn from a corpus of graphs to compute the probability that a graph B would appear given a set of context graphs Ctxt. To demonstrate the applications of GraLan, we used it to build a code suggestion engine for the “next” API element. API elements are API method calls and field accesses, and related control units such as if, while, for used in an API usage.

We also developed ASTLan, an AST-based language model, that was adapted from GraLan to support the suggestion of the next valid syntactic template and the detection of common syntactic templates. IDEs could provide to developers such commonly used syntactic templates during editing.

We have conducted an empirical evaluation on GraLan’s and ASTLan’s code suggestion capability. Our empirical evaluation on a large corpus of open-source projects has shown that our engine is more accurate in API code suggestion than the state-of-the-art approaches, and in 75% of the cases, it can correctly suggest the API with only five candidates. ASTLan has high accuracy in suggesting syntactic template and is able to detect many common syntactic templates.