Statistical Translation of English Texts to API Code Usage Templates

Anh Tuan Nguyen, Peter Rigby, Thanh Nguyen, Mark Karanfil and Tien N. Nguyen

(please try the web-based tool here)



Introduction

Software reuse is vital in developing modern software products, and an important and practical form of reuse is realized through software libraries and frameworks. Developers access the functionality of a library or framework through its Application Programming Interface (API) elements, including classes and methods. Library designers intend many usages for their APIs, but not all of these usages are well covered in the documentation or programming guides. Therefore, developers who want to use a library or framework properly often have to search for API usage examples on the Web or seek answers on developers' discussion boards and forums (e.g., StackOverflow).

To support users in searching for source code examples, several code search engines have been developed (e.g., Black Duck Open Hub and Codase). Despite their successes, existing code search approaches are limited to the code examples that have already been processed and indexed in their databases. Existing program synthesis approaches, while capable of generating a potentially new program that satisfies given constraints, are limited to domain-specific code.

In this work, we introduce T2API, a context-sensitive, statistical machine translation approach that takes an English textual description of a programming task as a query and learns from a large code corpus to synthesize the corresponding API usage template for that task. The resulting combination of API elements, i.e., an API usage, is synthesized from API elements or smaller API usages that exist in the training code corpus; as a whole, however, the synthesized usage might not exist anywhere in the corpus.
T2API is the first work toward using machine translation between texts and source code. In T2API, we focus only on API usage templates for Java and Android. While existing approaches treat code synthesis as an information retrieval or constraint solving problem, we view it as a statistical machine translation problem: we aim to translate a given textual description of a programming task into code fragments of API usages realizing that task.
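To make the notion of an API usage template concrete, the sketch below shows the kind of Java code that such a query-to-code translation could target for a query such as "read a text file line by line". The query, the class name, and the concrete file path are our own illustration, not actual output of T2API.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical example of an API usage template for the query
// "read a text file line by line"; illustration only, not tool output.
public class ReadFileExample {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // placeholder for task-specific processing
            }
        }
    }
}
```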

Specifically, T2API works in two steps. Given a text, the first step infers the relevant API elements, i.e., those most likely to be used to realize the task described in the text. In the second step, T2API assembles the API elements produced in the first step into an API usage template. For this step, we develop a novel algorithm that synthesizes the API usages by maximizing the likelihood of those elements being used together, in particular orders, in a large-scale code corpus.
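The following sketch illustrates this two-step idea under simplified assumptions: step one scores API elements by learned word-to-API probabilities, and step two greedily orders the selected elements by how likely each is to follow the previous one in the corpus. The class, method names, and probability tables (translationProb, followProb) are hypothetical and are not T2API's actual implementation.

```java
import java.util.*;

/** Illustrative sketch of a two-step text-to-API translation; hypothetical, not T2API's code. */
public class TwoStepSketch {

    // Step 1: score each API element by the learned word-to-API translation
    // probabilities, summed over the words of the query, and keep the top k.
    static List<String> inferRelevantElements(
            List<String> queryWords,
            Map<String, Map<String, Double>> translationProb,   // word -> (API element -> p)
            int k) {
        Map<String, Double> score = new HashMap<>();
        for (String w : queryWords) {
            translationProb.getOrDefault(w, Map.of())
                           .forEach((api, p) -> score.merge(api, p, Double::sum));
        }
        return score.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    // Step 2: greedily order the selected elements so that each next element
    // maximizes the likelihood of following the previously placed one,
    // estimated from pairwise ordering frequencies in the corpus.
    static List<String> assembleUsage(
            List<String> elements,
            Map<String, Map<String, Double>> followProb) {       // prev -> (next -> p)
        List<String> usage = new ArrayList<>();
        Set<String> remaining = new LinkedHashSet<>(elements);
        String prev = null;
        while (!remaining.isEmpty()) {
            String best = null;
            double bestP = -1;
            for (String cand : remaining) {
                double p = (prev == null) ? 1.0
                        : followProb.getOrDefault(prev, Map.of()).getOrDefault(cand, 0.0);
                if (p > bestP) { bestP = p; best = cand; }
            }
            usage.add(best);
            remaining.remove(best);
            prev = best;
        }
        return usage;
    }
}
```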

Machine translation brings to T2API the following key advances. First, to derive the relevant API elements, we do not try to match words against API elements via textual similarity; instead, we use the statistical IBM translation model to learn which API elements are relevant to which words. Second, the words in a textual query are considered in the context of the other words in the query when finding the relevant API elements. That is, to find the relevant API elements for each word, we consider the API elements for the other words in the query as its context.
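For reference, the IBM models estimate word-level translation probabilities; in our setting the target tokens are API elements rather than foreign words. The standard IBM Model 1 formulation below is a general illustration of this statistical alignment idea (T2API may train a different variant of the IBM models), and the context-sensitive selection of API elements is built on top of such word-to-API probabilities.

```latex
% Standard IBM Model 1 (general reference, not necessarily T2API's exact variant).
% q = q_1 ... q_m : words of the textual query (q_0 is the null word)
% a = a_1 ... a_n : candidate sequence of API elements
% t(a_j | q_i)    : learned word-to-API translation probability
P(a \mid q) \;=\; \frac{\epsilon}{(m+1)^{n}} \prod_{j=1}^{n} \sum_{i=0}^{m} t(a_j \mid q_i)
```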

Our experiments, based on similarity comparisons and a developer survey, show both the high accuracy and the usefulness of T2API.