40% of Code Produced by GitHub Copilot Vulnerable to Threats: Research


In a classic ‘as you sow, so shall you reap’ scenario, GitHub Copilot, released in June, is producing buggy code thanks to the unfiltered data it was trained on. If developers are not cautious, this may have severe security implications.

The promise of a pair programmer for program synthesis based entirely on artificial intelligence, which initially seemed too good to be true, has been called into question by a team of five researchers. The latest research by academics from New York University (NYU) has revealed that code generated by GitHub Copilot is laden with security issues.

NYU researchers Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri discovered that nearly 40% of the code that GitHub Copilot generated has vulnerabilities.

GitHub Copilot is currently available as a technical preview. It is designed to streamline the software development process by offering developers relevant code suggestions as they type. The tool is built by Microsoft-owned GitHub on machine learning technology from Microsoft-backed AI research company OpenAI.

Microsoft has previously developed RobustFill, DeepCoder, and Metabob to realize its AI pair programmer ambitions. However, none created the buzz that GitHub Copilot has.

An AI pair programmer is someone, or in this case something, that works side by side with a programmer, reviewing each line of code. “It helps you quickly discover alternative ways to solve problems, write tests, and explore new APIs without having to tediously tailor a search for answers on the internet,” explained Nat Friedman, CEO at GitHub.

Copilot is the company’s latest program/code synthesizer, based on OpenAI Codex, a tool that can translate natural language into code. OpenAI Codex is a descendant of GPT-3, a deep learning model capable of generating human-like text. Unlike GPT-3, which generates English-language text, Codex generates code in programming languages.

The machine learning model that powers GitHub Copilot is trained on natural-language text and billions of lines of open source code publicly available on GitHub. Not only does the ML-based tool suggest what can or should come next as developers type, it also automatically generates code snippets based on developer comments.
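
To make this comment-driven workflow concrete, here is a hand-written sketch of the kind of exchange involved; the completion below is illustrative, not actual Copilot output:

```python
from datetime import date

# A developer types only the comment and function signature...
def days_between(date1: str, date2: str) -> int:
    """Return the number of days between two ISO-format dates."""
    # ...and Copilot might suggest a body along these lines:
    d1 = date.fromisoformat(date1)
    d2 = date.fromisoformat(date2)
    return abs((d2 - d1).days)
```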

GitHub Copilot is compatible with multiple programming languages, but it works best with Python, JavaScript, TypeScript, Ruby, and Go.

Research and Findings

The NYU researchers produced 89 different scenarios in which Copilot had to finish incomplete code. Across these scenarios, Copilot generated 1,692 programs, of which approximately 40% had security vulnerabilities.
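
In absolute terms, that works out to roughly 0.40 × 1,692 ≈ 677 generated programs containing at least one security vulnerability.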

That 40% figure is a huge concern for a tool being positioned as a programming buddy. It is even more striking when you consider the debates and informal discourses centered on the prospect of AI-driven technologies taking over human jobs. Safe to say, we are nowhere near human obsolescence as imagined.

The five researchers also cross-checked the completed code against a subset of the Common Weakness Enumeration (CWE) list of the top 25 most dangerous software weaknesses for 2021. CWE is a list of software and hardware vulnerability types developed and managed by the security community of the non-profit organization MITRE.

See Also: Coding and Code Security Go Hand-in-Hand: How Can Developers Manage Both?

GitHub Copilot Evaluation Methodology for MITRE Top 25 CWEs | Source: Academic Research

The evaluation was performed on a single computer running Linux (Ubuntu 20.04) with an Intel i7-10750H processor and 16GB of DDR4 RAM. Because access to Copilot is restricted, steps 1, 2, 3a, and 4a were completed manually, while steps 3b, 4b, and 5 (all sub-steps) were automated using Python scripts.
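
The scripts themselves are not reproduced in this article, but a minimal sketch of what such an automated evaluation step could look like is shown below. The `security-scanner` command, the `run_static_analyzer` helper, and the directory layout are all assumptions for illustration, not the researchers’ actual tooling:

```python
import json
import subprocess
from pathlib import Path

def run_static_analyzer(program: Path) -> list:
    """Run a security scanner over one generated program, return findings.

    'security-scanner' is a placeholder command, not the tool the
    researchers used; substitute a real analyzer invocation here.
    """
    result = subprocess.run(
        ["security-scanner", "--format", "json", str(program)],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout or "[]")

def evaluate(scenario_dir: Path) -> None:
    """Tally how many generated completions contain at least one weakness."""
    programs = sorted(scenario_dir.glob("*.py"))
    vulnerable = sum(1 for p in programs if run_static_analyzer(p))
    print(f"{vulnerable}/{len(programs)} completions flagged "
          f"({vulnerable / len(programs):.0%})")

if __name__ == "__main__":
    evaluate(Path("completions/cwe-89"))  # hypothetical directory layout
```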

The analysis is centered on three dimensions:

  • Diversity of Weakness: Copilot’s performance is examined with respect to its tendency to generate code susceptible to each of the top 25 CWEs.
  • Diversity of Prompt: An extensive analysis of Copilot’s performance under a single at-risk CWE scenario, using prompts containing subtle variations (see the sketch after this list).
  • Diversity of Domain: Copilot’s response to the domain, i.e., the programming language/paradigm.
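
To make the “diversity of prompt” axis concrete, the sketch below shows the kind of subtle prompt variations that could be fed to Copilot for a single SQL-related scenario. The specific comment wordings are invented for illustration and are not taken from the paper:

```python
# Hypothetical prompt variants for one CWE-89 (SQL injection) scenario.
# The task stays identical; only the phrasing changes, which can alter
# how secure the suggested completion turns out to be.
BASE_SIGNATURE = "def get_user(db, username):"

PROMPT_VARIANTS = [
    "# fetch the row for this username",
    "# fetch the row for this username using a parameterized query",
    "# WARNING: username comes from an untrusted web form",
    "# quick helper, no input validation needed",
]

for comment in PROMPT_VARIANTS:
    prompt = f"{comment}\n{BASE_SIGNATURE}\n"
    print(prompt)  # each prompt would be submitted to Copilot for completion
```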

As it turns out, the completed code was susceptible to numerous vulnerability types, including out-of-bounds read and write, cross-site scripting, OS command injection, improper input validation, SQL injection, use-after-free, path traversal, integer overflow, deserialization of untrusted data, unrestricted upload of dangerous files, missing authentication, NULL pointer dereference, and others.
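
As an example, a SQL injection weakness (CWE-89) of the kind flagged in the study typically looks like the first function below, with the second showing the parameterized fix; both are hand-written illustrations rather than Copilot output:

```python
import sqlite3

def get_user_unsafe(conn: sqlite3.Connection, username: str):
    # VULNERABLE (CWE-89): untrusted input is interpolated into the query,
    # so username = "x' OR '1'='1" would return every row in the table.
    cur = conn.execute(f"SELECT * FROM users WHERE name = '{username}'")
    return cur.fetchall()

def get_user_safe(conn: sqlite3.Connection, username: str):
    # SAFE: the ? placeholder lets the database driver escape the value.
    cur = conn.execute("SELECT * FROM users WHERE name = ?", (username,))
    return cur.fetchall()
```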

Why is Code Generated by GitHub Copilot Vulnerable?

Without pointing any fingers, the researchers believe it has something to do with the code that Copilot was trained on.

“As Copilot is trained over open source code available on GitHub, we theorize that the variable security quality stems from the nature of the community-provided code,” the researchers said. “Code often contains bugs—and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code.”

So the chances are very high that certain bugs evident in code from open source repositories will be reproduced by Copilot.

This suggests that GitHub Copilot was trained on unfiltered sets of repositories, possibly containing insecure coding patterns. Whoever gave the go-ahead to train Copilot on such repositories is probably questioning that decision now.

In short, GitHub Copilot has picked up the bad habits of human developers. And since there is no peer review, there may be instances where buggy code is accepted. The scholars concluded, “Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities.”
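
The researchers do not prescribe a specific tool, but gating suggestions through security-aware tooling could look roughly like the sketch below, which writes each candidate completion to a temporary file and rejects it if the Bandit scanner reports an issue. Bandit is just one example of a Python security linter, and the wrapper itself is hypothetical:

```python
import subprocess
import tempfile
from pathlib import Path

def accept_suggestion(code: str) -> bool:
    """Reject a candidate completion if a security scan flags it.

    The reject-on-any-finding policy is a deliberate simplification;
    a real integration would likely triage findings by severity.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = Path(f.name)
    try:
        # Bandit exits with a non-zero status when it finds an issue.
        result = subprocess.run(["bandit", "-q", str(path)],
                                capture_output=True, text=True)
        return result.returncode == 0
    finally:
        path.unlink()
```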

Closing Thoughts

The five researchers acknowledged that auto code generation tools like GitHub Copilot are likely to have an overall positive impact on developer productivity, which is possibly why Microsoft keeps exploring the area. Copilot isn’t Microsoft’s first attempt at developing an AI pair programmer, and going by the research findings, it seems it won’t be the last either.

Let us know if you enjoyed reading this news on LinkedIn, Twitter, or Facebook. We would love to hear from you!