“Automatic patch generation by learning correct code”, POPL 2016, by Fan Long and Martin Rinard from MIT
Background: this paper presents an approach to automatically fixing software; the terminology is “generating patches”. The first work that initiated this line of research was published in 2011, and it is currently a hot topic in the software engineering research community. Generating patches can be divided into two steps: 1) “fault localization”: localizing where changes should be made to fix the bug, and 2) “code transformation”: adding, deleting, and modifying code at the fault location. To automate patch generation, existing fault localization tools accomplish the first step; this step is an approximation. The second step is done by constructing a search space of all possible code transformations at the candidate locations and applying search algorithms to find the most likely correct one (this is where this paper makes its contributions). The correctness of a patch is determined using test suites, which are not ideal specifications.
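To make the two-step process concrete, here is a minimal sketch of the generate-and-validate loop it describes. All names (`localize`, `candidate_transforms`, `rank`) are my own illustrative assumptions, not APIs from any actual repair tool:

```python
# Hypothetical sketch of the generate-and-validate repair loop; the
# function parameters and data shapes are illustrative assumptions.

def generate_patch(program, test_suite, localize, candidate_transforms, rank):
    """Return the first candidate patch that passes the test suite, or None."""
    # Step 1: fault localization (approximate) -- a ranked list of
    # suspicious locations, produced by an off-the-shelf tool.
    locations = localize(program, test_suite)

    # Step 2: enumerate the search space of code transformations
    # applicable at each suspicious location.
    candidates = [(loc, t)
                  for loc in locations
                  for t in candidate_transforms(program, loc)]

    # Rank candidates (e.g. by a learned correctness model) and validate
    # each one against the test suite -- an imperfect specification.
    for loc, transform in rank(candidates):
        patched = transform(program)
        if all(test(patched) for test in test_suite):
            return patched  # "plausible" patch: passes all tests
    return None
```

The loop returns the first plausible patch in rank order, which is why good ranking (this paper's contribution) directly reduces repair time.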
There have been many approaches to constructing the search space (examples include GenProg and SPR). For example, GenProg assumes that the fault locations can be repaired by copying statements from elsewhere in the current program. This is a very strong assumption, yet it surprisingly works in some cases. The search has been done via genetic algorithms and machine learning. For further details, see the article “Automatic Software Repair: A Survey”.
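The GenProg-style assumption can be sketched as follows. This is my own illustration of the idea (statements copied from the same program), not GenProg's actual implementation, which also mutates via insertion and deletion at the AST level:

```python
# Illustrative sketch of the "copy from elsewhere" assumption: build
# repair candidates by replacing the statement at a suspicious location
# with statements already present in the same program.

def replacement_candidates(statements, fault_index):
    """Yield variants of `statements` where the statement at `fault_index`
    is replaced by another statement from the program."""
    for i, donor in enumerate(statements):
        if i == fault_index:
            continue  # replacing a statement with itself changes nothing
        variant = list(statements)
        variant[fault_index] = donor
        yield variant
```

Even this tiny space grows quadratically with program size, which is why ranking candidates well matters.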
Position of the paper: this paper is the first to demonstrate that there are patterns in patches that can be learned across projects, independent of the types of errors. The feature engineering is one of the important contributions of this paper. It is also the first paper to demonstrate the success of automatic repair on large-scale real-world code.
Summary of the method: the paper designs a feature extraction method applicable to all patches, capturing (1) the co-occurrence of values in the patch and values in its surrounding context, and (2) the relations between the code transformation types in the patch and the types of statements near the patch. It constructs a probabilistic model that combines fault localization information with the feature vector of a patch to predict the probability that the patch is correct. It explores maximum likelihood estimation and a hinge-loss objective to learn the model parameters.
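A minimal sketch, under my own assumptions, of the kind of log-linear model this describes: each candidate patch maps to a feature vector, a weight vector scores it, a softmax over the whole search space gives the maximum-likelihood view, and a hinge-loss subgradient step gives the alternative training objective. The feature encoding and training loop are illustrative, not the paper's exact formulation:

```python
import math

def score(weights, features):
    """Unnormalized log-score of a patch: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))

def patch_probabilities(weights, candidate_features):
    """Softmax over all candidate patches in the search space
    (the maximum-likelihood view: probability each patch is the correct one)."""
    scores = [score(weights, f) for f in candidate_features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]

def hinge_update(weights, correct_feats, wrong_feats, lr=0.1, margin=1.0):
    """One subgradient step on the hinge loss: push the correct patch's
    score above an incorrect patch's score by at least `margin`."""
    loss = margin - (score(weights, correct_feats) - score(weights, wrong_feats))
    if loss > 0:
        weights = [w + lr * (c - x)
                   for w, c, x in zip(weights, correct_feats, wrong_feats)]
    return weights
```

At inference time, candidates would be ranked by `score` (or equivalently by probability) and validated against the test suite in that order.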
Scale of the experiments and results: for training, the paper used 777 human-written patches from 8 C repositories. The paper tested 69 real-world defects, generating correct patches for 18 of them and plausible patches for 39. A correct patch is semantically equivalent to the human-written patch, as judged by manual code inspection; a plausible patch is one that passes the test suite. On average, it takes 12 hours to generate the search space and train the model, and about 2 hours to find the first plausible/correct patch (i.e., validating the ranked patches one at a time until one passes; the performance numbers are not very clear from the paper and I need to double-check this).
Why this work is great: (1) it addresses a trending, well-defined problem, so little effort is needed to motivate the work; (2) the idea is novel, non-trivial, and backed up by general intuition; (3) it has large-scale experiments, detailed results, and a replication package; (4) the paper is easy to understand thanks to good structure and examples. The paper is insightful in that it explains why its design decisions make sense, and it contains enough detail to reproduce the results.
Final thoughts: 13 of the 18 correct patches are for the same program, php, whose data were also used in training. That said, the current results do not provide very strong evidence that learning patches across different programs works well. How much does the training data matter? If we had millions of patches instead of 777, how would the results change? I have not yet examined how complicated the generated patches are. The data is in the Amazon S3 bucket fan-rep-data.