Statistical Semantic Language Model for Source Code

Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen



Previous research has shown that source code in programming languages exhibits a good level of repetition. For example, a for loop such as for (int i = 0; i < n; i++) or a printing statement such as System.out.println(...) occurs frequently in many source files. Hindle et al. found that such code regularities/patterns can be captured by the n-gram statistical language model via training on existing codebases.
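As a minimal illustration of how an n-gram model picks up such repetition, the sketch below counts which token follows each two-token context in a tiny, made-up corpus of tokenized for loops. The function names (train_ngram, next_token_probs), the "<s>" padding symbol, and the corpus are assumptions made for this example; the estimate is unsmoothed maximum likelihood, not the model of Hindle et al.

```python
from collections import Counter, defaultdict

def train_ngram(token_seqs, n=3):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for tokens in token_seqs:
        # Pad the start so the first tokens also have a full context.
        padded = ["<s>"] * (n - 1) + list(tokens)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            counts[context][padded[i]] += 1
    return counts

def next_token_probs(counts, context):
    """Maximum-likelihood estimate of P(token | context), without smoothing."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in followers.items()}

# Hypothetical training data: three pre-tokenized loop headers.
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int j = 0 ; j < m ; j ++ )".split(),
    "for ( int i = 0 ; i < len ; i ++ )".split(),
]
model = train_ngram(corpus, n=3)
probs_after_for = next_token_probs(model, ["for", "("])
probs_after_int = next_token_probs(model, ["(", "int"])
```

In this toy corpus the context (for, "(") is always followed by int, whereas the context ("(", int) is followed by either i or j. This is the kind of purely lexical variation that a model built on lexemes treats as distinct patterns.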

The state-of-the-art statistical n-gram language model for capturing such code repetitions/patterns and supporting code suggestion relies on the lexical information and local context of code tokens. It cannot capture patterns whose instances differ at the lexical level (e.g., str versus s, len versus l), and it provides only local context.

The second kind of pattern involves multiple co-occurring tokens that often appear together to realize the same technical functionality/concern. Knowing the technical concerns of a source file could benefit the prediction of the next token.

SLAMC (Semantic LAnguage Model for source Code) incorporates semantic information into code tokens and models the regularities/patterns on their semantic values/annotations (called sememes), rather than on their lexical values (lexemes). A token is annotated with its data type and semantic role, if available. In addition to the local context of code tokens, we also consider the global technical concerns of the source files and the pairwise associations of code tokens. We develop an n-gram topic model to capture the influence of both local context and global concerns on the next token's occurrence.
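The idea of modeling sememes rather than lexemes can be sketched as follows. The Token record, the role tags "VAR" and "CALL", and the bracketed string notation are hypothetical choices made for this illustration and are not SLAMC's actual annotation format. The point is only that two fragments differing in variable names map to the same annotated sequence.

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    lexeme: str
    role: Optional[str] = None   # hypothetical role tag, e.g. "VAR" or "CALL"
    dtype: Optional[str] = None  # data type, when it can be resolved

def sememe(tok: Token) -> str:
    """Map a token to an illustrative sememe string."""
    if tok.role is None:
        # Keywords and punctuation carry no annotation here: keep the lexeme.
        return tok.lexeme
    if tok.role == "VAR":
        # The variable's name is abstracted away; only role and type remain.
        return f"VAR[{tok.dtype}]"
    # Other roles (e.g. calls) keep the member name alongside the type.
    return f"{tok.role}[{tok.dtype},{tok.lexeme}]"

# Two fragments that differ only in the variable's name: str.length vs. s.length
frag_a = [Token("str", "VAR", "String"), Token("."), Token("length", "CALL", "String")]
frag_b = [Token("s", "VAR", "String"), Token("."), Token("length", "CALL", "String")]

sememes_a = [sememe(t) for t in frag_a]
sememes_b = [sememe(t) for t in frag_b]
```

Under these assumptions both fragments yield the same sememe sequence, so an n-gram model trained on sememes would count them as one pattern even though their lexeme sequences differ.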

Based on SLAMC's next-token suggestion ability, we developed a new code suggestion engine, configurable for Java or C#, that suggests completions of the current code that form a meaningful code unit and are most likely to appear next. Meaningful code units are defined based on the language, and the likelihood of appearance is computed from the generating probabilities of the sequences. Our key contributions are:

  1. SLAMC, a novel statistical semantic language model for source code that integrates semantic n-grams, global concerns, and pairwise association,
  2. A code suggestion engine based on SLAMC, and
  3. An empirical evaluation of its accuracy and a comparison to the state-of-the-art lexical n-gram model.
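To make the ranking step of the suggestion engine concrete, the following sketch scores candidate completions by the generating probability of their token sequences and sorts them accordingly. It is an illustration under stated assumptions, not the engine's implementation: the probability table cond_prob, the floor value for unseen events, and the function names are all hypothetical, and the global-concern and pairwise-association components of SLAMC are omitted.

```python
import math

def sequence_log_prob(tokens, cond_prob, context, n=3, floor=1e-6):
    """Sum log P(token | previous n-1 tokens) over a candidate sequence."""
    history = list(context)
    total = 0.0
    for tok in tokens:
        start = max(0, len(history) - (n - 1))
        key = (tuple(history[start:]), tok)
        # Unseen (context, token) pairs get a small floor probability (assumed).
        total += math.log(cond_prob.get(key, floor))
        history.append(tok)
    return total

def rank_candidates(candidates, cond_prob, context, n=3):
    """Order candidate completions from most to least likely."""
    return sorted(
        candidates,
        key=lambda cand: sequence_log_prob(cand, cond_prob, context, n),
        reverse=True,
    )

# Hypothetical conditional probabilities for completing "i <" in a loop header.
cond_prob = {
    (("i", "<"), "n"): 0.6,
    (("<", "n"), ";"): 0.9,
    (("i", "<"), "len"): 0.3,
    (("<", "len"), ";"): 0.9,
}
context = ["i", "<"]
candidates = [["len", ";"], ["n", ";"]]
ranked = rank_candidates(candidates, cond_prob, context)
```

Because probabilities multiply, longer candidates tend to score lower than shorter ones in a scheme like this. A practical engine would need some way to compare code units of different lengths, which this sketch does not address.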