What 4,300 JavaScript Projects Reveal About Code Copying
This article is based on a research collaboration between the Software Engineering Lab at the University of São Paulo and the Institute of Computing at the University of Campinas. The study analyzed 4,317 popular JavaScript repositories from GitHub.

Two Kinds of Copying

In 2018, a team of Brazilian researchers set out to answer a surprisingly difficult question: how much code in the open source ecosystem is actually original?

The question sounds straightforward. But defining "original" in software is like defining "original" in academic writing—where borrowing, adaptation, and reuse are not just common but expected. The line between acceptable reuse and problematic copying is blurry. And for institutions trying to enforce academic integrity—or companies auditing their codebases for license compliance—that blurriness is a real operational problem.

The researchers focused on JavaScript, the most popular language on GitHub by repository count. They downloaded 4,317 repositories, each with at least 1,000 stars, and ran them through DECKARD, a tree-based code clone detector that works at the abstract syntax tree (AST) level. Their goal was not to catch cheating but to measure the baseline: what does normal, everyday code copying actually look like?

The results were published in a 2019 paper titled "On the Copying and Originality of Open Source Code" by André Hora, Eduardo Figueiredo, and colleagues. The findings should interest anyone who works with code plagiarism detection—whether in a university, a startup, or a Fortune 500 engineering organization.

The 40% Baseline

The headline statistic: 40.6% of all files in the dataset contained code cloned from another project. That's more than two out of every five files—but it's also the least surprising number in the study.

The more revealing data came from drilling down into what was being copied and how.

The researchers classified clones into four standard categories:

  • Type 1: Exact copies with minor whitespace or formatting changes
  • Type 2: Copies with renamed identifiers
  • Type 3: Copies with added, removed, or reordered lines
  • Type 4: Semantic clones — different code that implements the same logic

DECKARD detects Type 1, 2, and some Type 3 clones. It does not catch Type 4.
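To make the taxonomy concrete, here is a toy illustration (the functions are invented for this article, not taken from the dataset): an original utility, a Type 2 rename of it, and a Type 3 modification.

```javascript
// Hypothetical "original" utility: sums an array of numbers.
function sumValues(numbers) {
    var total = 0;
    for (var i = 0; i < numbers.length; i++) {
        total += numbers[i];
    }
    return total;
}

// Type 2 clone: identical structure, every identifier renamed.
function addAll(xs) {
    var acc = 0;
    for (var j = 0; j < xs.length; j++) {
        acc += xs[j];
    }
    return acc;
}

// Type 3 clone: same purpose, but lines added and restructured
// (a defensive guard plus a switch from a loop to reduce).
function sumValuesSafe(numbers) {
    if (!Array.isArray(numbers)) {
        return 0;
    }
    return numbers.reduce(function (total, n) { return total + n; }, 0);
}
```

A token matcher catches the Type 2 clone once identifiers are normalized; the Type 3 clone only falls out of structural comparison.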

Within the cloned files, the distribution looked like this:

Clone Type          Percentage of Cloned Files
Type 1 (exact)      23.8%
Type 2 (renamed)    34.1%
Type 3 (modified)   42.1%

What does that tell us? Most copying is not verbatim. People rename variables. People restructure control flow. People adapt code to fit their existing architecture.

This is the central challenge for code plagiarism detection. A tool that only catches exact string matches will miss most real-world copying. And a tool that catches semantic clones but generates false positives on idiomatic patterns becomes operationally unusable.

NPM Modules and the Attribution Gap

One finding in particular stood out for its implications. The researchers checked whether copied files included any attribution—a license header, a comment crediting the original author, or a URL back to the source repository.

Only 1.2% of cloned files contained any form of attribution.

This is not necessarily malicious. In the JavaScript ecosystem, the standard practice is to use npm packages, declare them in package.json, and let the dependency tree handle attribution. But the study was analyzing copying that happened outside the formal dependency mechanism—files copied manually into a new project.

Here's a concrete example from the dataset. The async library by Caolan McMahon is one of the most widely cloned modules. The researchers found its code in 87 different repositories. In most cases, developers had extracted a single function—say, async.waterfall—and pasted it into their own utility file. No npm dependency. No attribution. Just the code.

// Original source: async.js, lines 512-530
// Copied into 23 different repositories without attribution

exports.waterfall = function (tasks, callback) {
    callback = callback || function () {};
    if (!Array.isArray(tasks)) {
        var err = new Error('First argument to waterfall must be an array of functions');
        return callback(err);
    }
    if (tasks.length === 0) {
        return callback();
    }
    var wrapIterator = function (iterator) {
        return function (err) {
            if (err) {
                callback.apply(null, arguments);
            } else {
                var args = Array.prototype.slice.call(arguments, 1);
                var next = iterator.next();
                if (next) {
                    args.push(wrapIterator(next));
                } else {
                    args.push(callback);
                }
                iterator.apply(null, args);
            }
        };
    };
    wrapIterator(async.iterator(tasks))();
};

For a university using a tool like Codequiry to scan programming assignments, this pattern is familiar. A student finds a solution on GitHub or Stack Overflow, extracts the relevant function, and drops it into their own submission. The code isn't copied wholesale—it's a single function, maybe renamed, maybe tweaked. To a human grader it might not stand out. To a similarity detector that only checks for file-level matches, it's invisible.

The Long Tail of Copying

The researchers also looked at who was copying from whom. They constructed a directed graph of copy relationships between repositories.

The graph revealed a clear power-law distribution. A small number of repositories—mostly popular utility libraries like lodash, async, underscore, and request—were the sources for the vast majority of copying. These are the "hub" projects. Everyone copies from them.

But the tail was long. Thousands of repositories were involved in at least one copying relationship, usually as the copier rather than the copied-from source. The implication: copying is not a concentrated phenomenon. It's pervasive but diffuse.

For a department chair trying to design a plagiarism-resistant programming assignment, this is both encouraging and discouraging. Encouraging because it means most students aren't copying from a single master repository of answers—they're copying from many different sources. Discouraging because it means the search space for detecting copying is enormous. You can't just check a handful of known cheat sites and call it done.

Dependency Boundaries and the Copying Blind Spot

One of the most interesting methodological challenges in the study was distinguishing between legitimate dependency use and unauthorized copying. In the JavaScript ecosystem, npm manages dependencies automatically. If a project declares lodash in its package.json, the code ends up in node_modules, and it's not plagiarism—it's a dependency.

But the researchers found that developers often bypass npm entirely. They copy a function from lodash directly into their own source tree. The code is the same. The license implications might be identical. But the traceability is completely different.

The study measured this gap. Of all files that matched lodash code, only 34% were in node_modules. The remaining 66% were manually copied into the project's own source folders.
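That dependency-boundary check is straightforward to approximate: classify each file path that matched library code by whether it lives under node_modules. A minimal sketch (the function name and the idea of working from a pre-computed list of matching paths are assumptions for illustration, not the study's actual tooling):

```javascript
// Split matched file paths into vendored copies (inside node_modules)
// and manual copies (anywhere else in the project's source tree).
function splitByDependencyBoundary(paths) {
    var vendored = [];
    var copied = [];
    paths.forEach(function (p) {
        // Split on both separators so Windows-style paths are handled too.
        var parts = p.split(/[\\/]/);
        if (parts.indexOf('node_modules') !== -1) {
            vendored.push(p);
        } else {
            copied.push(p);
        }
    });
    return { vendored: vendored, copied: copied };
}
```

Given the paths that matched lodash code, copied.length / paths.length is the manual-copy fraction—the quantity the study reports as 66%.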

Why does this matter for an instructor or an engineering manager? Because it shows that developers—including students—routinely take code without tracking its provenance. The habit of "I just need this one function" is universal. It's not necessarily cheating. But it creates a world where code origins are opaque.

And opacity is where plagiarism thrives.

Implications for Detection Strategies

The study's findings suggest several practical recommendations for anyone running a code plagiarism detection program—whether in a university with 500 CS students or a company with 50 developers.

1. Function-level detection beats file-level detection

Because most copying happens at the function or even sub-function level, tools that compare entire files will miss a significant fraction of copied code. The researchers found that 37% of cloned code involved only a single function from a larger file.

"File-level clone detection is necessary but not sufficient. You need granularity at the function or block level to catch the most common copying patterns." — André Hora, lead author of the study

2. Attribution is rare — do not rely on it

If you're scanning student submissions for plagiarism, do not assume that unattributed code is original. The baseline attribution rate in the open source world is 1.2%. In student assignments, it's likely even lower—not because students are more dishonest, but because they don't know the convention.

3. Cross-language detection is the next frontier

The study was limited to JavaScript, but the researchers noted that copying patterns likely differ by language. Python, with its extensive standard library, might show less function-level copying. Java, with its verbose class structure, might show more. A robust detection system needs to account for language-specific idioms.

4. The difference between Type 2 and Type 3 clones matters

In the study, 34.1% of clones were Type 2 (renamed) and 42.1% were Type 3 (modified). Combined, these account for over three-quarters of all cloning. A plagiarism detector that only matches exact tokens will miss both groups; even one that normalizes identifiers catches the Type 2 renames but still misses the Type 3 modifications.

AST-based detection, like what DECKARD uses, is more resistant to renaming because it strips identifier names and looks at structural patterns. But AST-based tools can still be fooled by control-flow restructuring. The cat-and-mouse game continues.
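The rename-resistance idea can be shown with a toy normalizer: replace every identifier with a placeholder before comparing, so Type 2 clones collapse to the same string. (This is a simplified token-level stand-in for DECKARD's actual AST characteristic vectors, written for illustration; the keyword list is deliberately abbreviated.)

```javascript
// Keywords to preserve; any other word-like token becomes the placeholder ID.
var KEYWORDS = ['function', 'return', 'var', 'let', 'const', 'if', 'else',
                'for', 'while', 'new', 'typeof', 'null', 'true', 'false'];

// Collapse identifiers so renamed (Type 2) clones normalize identically.
function normalize(source) {
    return source
        .replace(/[A-Za-z_$][\w$]*/g, function (word) {
            return KEYWORDS.indexOf(word) !== -1 ? word : 'ID';
        })
        .replace(/\s+/g, ' ')
        .trim();
}

function isRenamedClone(a, b) {
    return normalize(a) === normalize(b);
}
```

It also demonstrates the limit noted above: a Type 3 edit—an added guard clause, a reordered line—changes the normalized string, which is why structural comparison is still needed.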

What Open Source Copying Teaches Us About Academic Integrity

The study's most useful legacy might be its framing of copying as a spectrum rather than a binary.

In academic integrity contexts, we often talk about plagiarism as if it's a discrete event: either you plagiarized or you didn't. The reality, as this study shows, is that code reuse happens on a continuum. A student who extracts an algorithm from a GitHub repository and reimplements it in their own style is doing something different from a student who copies a file verbatim. Both are forms of reuse. Only one is clearly plagiarism.

The researchers proposed a simple framework for thinking about this:

  1. Acknowledged dependency: Using an npm package declared in package.json — acceptable
  2. Attributed copy: Copying a function with a comment crediting the source — borderline but often acceptable
  3. Unattributed copy: Copying code without credit — problematic, but varies by context
  4. Obfuscated copy: Copying with intentional renaming or restructuring to hide the source — almost always plagiarism

For instructors, this framework suggests that the right response to detected copying is not always punitive. Sometimes it's educational. A student who copied a function without attribution might not know that attribution is expected. That's a teaching moment. A student who renamed variables and restructured control flow to evade detection is demonstrating intent—and that requires a different response.

The Engineering Takeaway

For engineering managers, the study's data reinforces a hard truth about codebases: you probably have more untracked copied code than you think. If 40% of files in popular, high-quality open source repositories contain cloned code, the number is likely higher in less scrutinized internal repositories.

This matters for two reasons. First, license compliance. If that utility function from npm's MIT-licensed async package ends up in your proprietary codebase without attribution, you're technically in violation of the license terms. Most companies don't enforce this strictly—but in a litigation scenario, it matters.

Second, maintainability. Code that was copied from somewhere else and never updated becomes a maintenance burden. When the upstream library fixes a bug, your inlined copy doesn't get the fix. The study didn't measure this, but it's a well-documented downstream effect of unchecked copying.

Tools like Codequiry that combine token-level, AST-level, and semantic analysis are designed to handle exactly this range of copying. The challenge, as this study shows, is that the detection strategy must match the copying strategy. If students or developers are mostly doing Type 3 copying—adding and removing lines, restructuring logic—then a detection tool must be robust to those transformations.

Where the Research Goes Next

The Brazilian team's study is now five years old, and the ecosystem has changed. JavaScript tooling has improved. npm now ships with npm audit. GitHub has dependency graphs and security alerts. But the fundamental question—how much code is copied, and how much of that copying is problematic—remains open.

There are clear next steps. Replicating the study for Python and Java would be valuable. So would a similar analysis of student submissions, to see if the copying patterns match the open source patterns. (Preliminary evidence from several universities suggests they do, but no large-scale study has been published.)

And there's the AI question. As LLMs produce code that is itself a statistical blend of training data, "copying" takes on a new meaning. A model might generate a perfect replica of a Stack Overflow answer without citing it. Is that plagiarism? The legal and ethical frameworks haven't caught up.

But those are questions for another study—and another article.

For now, the message from 4,300 JavaScript repositories is clear: copying is normal. The challenge is finding the line between productive reuse and problematic plagiarism. And that requires understanding not just that copying happens, but how it happens.