You Are Hired — State of Code Completion

August 06, 2022

Machine-learning-powered code completion tools have slowly but surely started to make their way into software engineers’ everyday lives.

Earlier versions of some of these tools have been around for years. However, the release of GitHub Copilot, which delivered quality multi-line suggestions, raised the bar for what deep-learning-based autocompletion could do and was generally met with positive sentiment (our team has happily been using it for a few months now).

Since then we’ve seen advances from other tools as well, such as IntelliCode, whose team has stated that it collaborates with the Copilot team; Tabnine, which is updating its models; and Kite.

Still, these tools generally either offer a fixed set of models, require you to send your code through their API, or adapt poorly to your codebase. Considering the boost such tools can give software engineers’ workflows as they mature, it is important that they remain open and accessible to everyday companies.

Fortunately, there have been efforts to do just that. These are:

  • InCoder — A project by Daniel Fried et al. that was trained on 670,000 open-source repositories and is capable of both code completion and infilling; it is available in 1.3B and 6B parameter versions.
  • CODEGEN — A family of models by Erik Nijkamp, Bo Pang, Hiroaki Hayashi et al. at Salesforce Research, aimed at generating code from a given problem statement.
  • GPT-Code-Clippy — A project that gathered its own dataset of open-source GitHub code and trained multiple models for code completion.

In this blog, we explore the proposed architectures, their usability, and how these openly available models perform compared to GitHub Copilot.

What makes them tick?

InCoder was developed by researchers from Facebook AI Research, the University of Washington, UC Berkeley, TTI-Chicago, and Carnegie Mellon. It uses a causal masking objective, which allows the model to condition on context on both sides of an insertion point, enabling code infilling in addition to standard left-to-right code generation.
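As a rough sketch of the idea (the sentinel names below are illustrative, not the exact special tokens used by the released model): a span is cut out of the training document, replaced by a sentinel, and appended at the end, so a left-to-right model learns to regenerate the missing middle and can later fill in gaps at inference time.

original = "def add(a, b):\n    return a + b\n"
span = "return a + b"  # span chosen to be masked out

# Document with a hole, followed by the span moved to the end.
masked = original.replace(span, "<MASK:0>")
training_sequence = masked + "<MASK:0>" + span + "<EOM>"

# At inference time the prompt ends right after the second sentinel, and the
# model's continuation (up to "<EOM>") is pasted back into the hole, which is
# what enables infilling as well as ordinary left-to-right completion.
print(training_sequence)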

CODEGEN was created at Salesforce Research and is a family of transformer models trained in the same way as typical large-scale NLP models, with generations produced via greedy decoding or top-p sampling. The creators of the project discussed various approaches to how code generation should be performed and opted for a human-in-the-loop approach, where the programmer guides the process by describing the next steps in natural language while the model generates the corresponding code blocks. A multi-turn benchmark was developed for the project to better evaluate model performance in this setting.

GPT-Code-Clippy is an open-source project that also releases the datasets used during training along with the related processing scripts, allowing further experimentation by interested individuals. The dataset extends The Pile, an 800 GB text modeling corpus, with GitHub repositories filtered by metrics such as the number of stars, commits, and license. GPT-Neo, whose models can be expected to perform in the GPT-2 to GPT-3 range, was used as the basis for fine-tuning.

Holding an Interview

After setting up the models for inference, all of which are available through Hugging Face, we started playing around with the generators.

There are some tunable parameters during prediction, such as the length of the generated output and the temperature. We found that the models performed better with higher temperatures; otherwise they were relatively “shy” on our problems and offered either no suggestions or very short ones, comparable to IntelliSense.
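For reference, here is a minimal sketch of how such a model can be queried through the Hugging Face transformers library; the checkpoint name and sampling values are illustrative rather than the exact settings we used.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name for illustration; the CodeGen or Code-Clippy
# checkpoints can be swapped in the same way.
model_name = "facebook/incoder-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def two_sum(nums, target):\n    "
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling with a higher temperature made the models less "shy" in our runs;
# lower temperatures tended to yield very short, IntelliSense-like suggestions.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))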

After playing with them for a bit we decided to test them in the most objective way we knew how — by holding LeetCode interviews.

We used two problems, of easy and medium difficulty, to test the models. Of course, there is a concern that a model has already seen these problems through GitHub repositories, LeetCode itself, or, in the case of Copilot, user feedback, which would make the comparison unfair. Still, LeetCode questions cover a broad range of topics and scenarios, and we changed the wording of the descriptions a bit, so even if a model has seen the problem, accurate recall and adjustment to the given description would be considered a plus.

We evaluate the models by two criteria. Since we are primarily interested in usability, we qualitatively assess the extent to which they are able to solve the problem independently, and then how much they speed up a programmer’s attempt at solving it.

The first problem we used was the “Two Sum” problem, a fairly common programming question that asks the developer to find indices of numbers in an array that add up to a provided target value.

Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.

Input: nums = [1,3,8], target = 4
Output: [0,1]
Explanation: Because nums[0] + nums[1] == 4, we return [0, 1].

This problem can be solved naively with nested loops in O(n²) time, or in O(n) time using a hash table.
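A minimal Python sketch of the hash-table solution (our own wording rather than an exact reproduction of the reference code) looks something like this:

def two_sum(nums, target):
    seen = {}  # maps a value to the index where it was seen
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:  # the matching value was already visited
            return [seen[complement], i]
        seen[num] = i
    return []  # no two numbers add up to target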

The resulting solutions can be seen below.

Out of the generated solutions:

Copilot — ✅

InCoder — ✅ / ❌ (minor errors in the brute-force solution)

CodeGen — ❌

Code-Clippy — ❌

We then tried to generate solutions allowing a human in the loop. In this case, we evaluate how many corrections the human had to make to reach a correct solution (the lower the better).

Copilot — 0

InCoder — 2

CodeGen — 5

Code-Clippy — X (Most of the code was written by the human).

In this test, Copilot performed very well, with InCoder also speeding up coding time. Unfortunately, fixing the generations of CodeGen and Code-Clippy took some time, so their positive effect on productivity in their current state is more questionable, though future iterations of the models might improve that.

The second problem we used, of medium difficulty, is Longest Substring Without Repeating Characters. As the name implies, the challenge is to find the length of the longest substring in a given string that has no repeating characters.

Given a string s, find the length of the longest substring without repeating characters.

The naive solution involves nested loops through all possible combinations, giving it O(n²) complexity. A more optimal, linear-time solution can be achieved using a sliding window.

The solution can look similar to this:
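(A minimal Python sketch of the sliding-window approach, in our own wording; the exact reference code may have differed.)

def length_of_longest_substring(s):
    last_seen = {}  # maps a character to the most recent index where it appeared
    start = 0       # left edge of the current window without repeats
    best = 0
    for i, ch in enumerate(s):
        # If ch already appears inside the current window, move the window
        # start just past its previous occurrence.
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best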

Unfortunately, only Copilot came close, producing reasonable-looking code with a correct base case, sensible variables, and structure.

However, due to incorrect indexing, even its solution was wrong. Overall, it seems that all of the tested code generators still need some development before they can correctly tackle trickier problems, and considering that debugging such generations can be quite a challenge, their positive impact on this kind of work might be limited.

Another thing to note is that predictions are fairly slow when running locally, since common batching-based accelerations are not utilized; but that is more of an MLOps problem, and even for a small team the models could be set up to serve predictions in a more timely and efficient manner. Fortunately, there are also projects that improve integration into existing tools, such as the Code Clippy VSCode extension for Code-Clippy.

Summary

Overall, Copilot performed the best among the models, which could be expected given its large user base and the constant feedback and updates the model gets, compared to the other solutions, which seem to exist relatively frozen in time with only occasional updates.

Still, InCoder performed well enough that, with some improvements, we can see the benefit of using it in day-to-day coding tasks. It is our hope that projects such as Code Clippy VSCode will be picked up and supported to help developers spend less time on menial work and to keep the field competitive.

