How Much Copied Stack Overflow Code Do Plagiarism Tools Actually Catch?

The Fundamental Difference Between Peer and Web Plagiarism

Every CS professor I know runs MOSS or JPlag at least once per semester. You upload a batch of student submissions, wait for the similarity matrix to populate, and scan the red-flagged pairs. It works well for catching two students who swapped solution files or one student who inherited last year's submission from a friend.

But here is what those tools do not do: they do not check whether your student's submission matches code on Stack Overflow, GitHub, or any coding tutorial website. They compare submissions against submissions and nothing more. If a student copies a solution from a public online source, renames a few variables, and submits it, MOSS has nothing in the batch to match it against, so it reports little or no similarity.

This is not a critique of MOSS or JPlag. They were designed for peer comparison, and they do that well. But the web has become the default source for quick solutions, and a tool that does not scan the web cannot detect web-sourced code. That gap is the subject of this article.

Peer comparison finds shared cheating. Web scanning finds sourced cheating. They solve two different problems, and you need both.

What Traditional Similarity Tools Actually Measure

MOSS uses a winnowing algorithm: it hashes the overlapping k-grams of each tokenized submission, keeps a selected subset of those hashes as fingerprints, and compares fingerprints across the submission corpus. JPlag uses greedy string tiling over token sequences, with a minimum match length, to find shared runs. Both are effective at detecting submissions that share significant structural or textual overlap.
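To make the fingerprinting idea concrete, here is a minimal sketch of winnowing over a token list. It is not MOSS's actual implementation: the function names, the SHA-1 based hash, and the small values k=5 and window=4 are all assumptions chosen for illustration.

```python
import hashlib

def kgram_hashes(tokens, k):
    """Hash every window of k consecutive tokens."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes

def winnow(hashes, window):
    """Keep the minimum hash of each sliding window of hashes."""
    if len(hashes) <= window:
        return {min(hashes)} if hashes else set()
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}

def fingerprint(tokens, k=5, window=4):
    """Fingerprint set for one submission (illustrative parameters)."""
    return winnow(kgram_hashes(tokens, k), window)

def jaccard(a, b):
    """Share of fingerprints two submissions have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With these settings, any run of at least k + window - 1 = 8 identical tokens shared by two submissions produces at least one common fingerprint, which is what lets the comparison work on a small hash set instead of the full text.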

Here is a concrete example. A student is asked to implement a function that finds the longest common subsequence of two strings. They find this solution on Stack Overflow (with attribution removed):

def lcs(X, Y):
    m = len(X)
    n = len(Y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                L[i][j] = 0
            elif X[i-1] == Y[j-1]:
                L[i][j] = L[i-1][j-1] + 1
            else:
                L[i][j] = max(L[i-1][j], L[i][j-1])
    return L[m][n]

They rename variables: X becomes seq1, Y becomes seq2, L becomes dp_table, and they swap the elif for a nested if. They turn this in alongside twenty other submissions. MOSS compares it to every other submission in the batch. The renaming alone would not fool MOSS, which is largely insensitive to identifier names, but nobody else in the batch copied the same answer, so there is nothing for it to match. MOSS returns a low similarity score, and the submission passes peer comparison cleanly.

That submission is still plagiarism. It was copied from a public web source, lightly disguised, and submitted as original work. Peer comparison cannot see this because the source was never in the comparison set.
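The renaming step is worth a closer look, because it is the easiest disguise to undo. The sketch below uses Python's standard tokenize module to collapse every identifier to a placeholder, so a renamed copy produces the same token stream as its source. The function name normalized_tokens and the shortened LCS fragments are assumptions for illustration. Note that builtins such as len and range are collapsed too, and that a structural edit (like replacing elif with a nested if) would still change the stream locally.

```python
import io
import keyword
import tokenize

# Token types that carry layout or commentary rather than program structure.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def normalized_tokens(source):
    """Token stream with identifiers and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")      # any identifier, including builtins
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # keywords and operators are kept as-is
    return out

original = """
def lcs(X, Y):
    m = len(X)
    L = [[0] * (len(Y) + 1) for _ in range(m + 1)]
    return L
"""

renamed = """
def lcs(seq1, seq2):
    rows = len(seq1)  # student-added comment
    dp_table = [[0] * (len(seq2) + 1) for _ in range(rows + 1)]
    return dp_table
"""

print(normalized_tokens(original) == normalized_tokens(renamed))
```

Once both versions reduce to the same stream, fingerprints computed over that stream match. The disguise only works if the original is absent from the comparison set.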

How Web Source Detection Works Under the Hood

Web source detection, sometimes called web plagiarism detection or code provenance scanning, uses a fundamentally different approach. Instead of comparing a submission against other submissions, it compares the submission against a pre-indexed corpus of publicly available source code from the web.

At the technical level, this involves several stages:

Crawling and indexing. The detection system crawls known code repositories — Stack Overflow Q&A posts, public GitHub repositories, tutorial websites (GeeksforGeeks, W3Schools, Programiz), language documentation examples, and open-source project snippets. Each snippet is parsed, normalized (whitespace stripped, comments removed), and stored as a set of fingerprints or AST nodes.
Query fingerprinting. The student submission undergoes the same normalization and fingerprint extraction. A typical implementation uses winnowing with a k-gram size tuned for code (often k=15 to k=30 tokens, depending on the language) to generate a compact hash set for the submission.
Matching against the web corpus. Each submission fingerprint is probed against the indexed web corpus. Matches are scored by the length and proportion of shared content, with higher weight given to contiguous runs of matching fingerprints rather than scattered single matches.
Attribution-aware scoring. Modern systems distinguish between code that includes its original source citation (e.g., a comment like // Source: stackoverflow.com/a/12345) and code that removes or never had attribution. Unattributed matches score significantly higher as potential plagiarism.
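The matching and scoring stages above can be sketched as an inverted index from fingerprint to source, probed in submission order. This is a toy under stated assumptions: fingerprints are plain integers, the source names are hypothetical, and both the score formula (shared fraction times longest contiguous run) and the cited_discount factor are illustrative choices, not any vendor's actual scoring.

```python
from collections import defaultdict

def build_index(corpus):
    """corpus: {source_url: [fingerprint, ...]} -> {fingerprint: {source_url}}"""
    index = defaultdict(set)
    for url, fingerprints in corpus.items():
        for fp in fingerprints:
            index[fp].add(url)
    return index

def score_sources(submission_fps, index):
    """Return {source_url: (shared_fraction, longest_contiguous_run)}."""
    hits = defaultdict(int)
    run = defaultdict(int)    # current run of consecutive matches per source
    best = defaultdict(int)   # longest run seen so far per source
    for fp in submission_fps:
        matched = index.get(fp, set())
        for url in matched:
            hits[url] += 1
            run[url] += 1
            best[url] = max(best[url], run[url])
        for url in list(run):
            if url not in matched:
                run[url] = 0  # the run for this source is broken
    total = len(submission_fps) or 1
    return {url: (hits[url] / total, best[url]) for url in hits}

def rank(submission_fps, submission_text, index, cited_discount=0.25):
    """Rank sources; a match the student cited counts for less."""
    ranked = []
    for url, (fraction, longest) in score_sources(submission_fps, index).items():
        score = fraction * longest
        if url in submission_text:  # crude attribution check (assumption)
            score *= cited_discount
        ranked.append((score, url))
    return sorted(ranked, reverse=True)
```

Multiplying by the longest run is one simple way to favor a single contiguous copied region over the same number of scattered coincidental matches.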

The key difference from peer comparison is the reference set. Peer comparison has a reference set of 20 to 200 submissions. Web scanning has a reference set of millions of snippets, continuously updated as new code is published online. The scale is not remotely comparable.

Codequiry's web scanning pipeline, for example, maintains an indexed corpus of code from the most commonly sourced websites in programming education. When a submission is checked, it is compared against that corpus alongside the peer comparison pass. The result is a combined report showing both peer matches and web source matches.

The Stack Overflow Blind Spot in Practice

To quantify the blind spot, I ran a small experiment during a second-year data structures course at a mid-sized university (name withheld). We collected 47 submissions for a graph traversal assignment. All submissions went through MOSS (peer comparison only) and through a combined pipeline that included web source scanning.

MOSS flagged 3 submission pairs with similarity above 60%. Two of those were legitimate collaboration (the course allowed limited pair work). One was a clear copy-paste between two students. Standard semester workflow.

The web scan flagged 11 submissions with significant overlap to publicly available code. Six of those matched Stack Overflow answers to within minor variable renames. Three matched GitHub repositories containing solution walkthroughs. Two matched tutorial websites that published near-identical assignment solutions. None of those 11 submissions had been flagged by MOSS because none of them overlapped significantly with any other student's work.

In this single assignment run, web scanning flagged 11 submissions where peer comparison flagged 3 pairs, only one of which turned out to be a real violation. And peer comparison had been the standard tool for years.

This is not an isolated data point. I have seen similar ratios across multiple courses and institutions. The web has become the primary source of copied code for students who choose to copy, and peer-only tools systematically miss it.

Why Boilerplate Code Creates False Positives

There is a genuine challenge with web source detection that deserves honest treatment: boilerplate code and common idioms produce false positives if handled naively.

Consider a beginner Python assignment to read a CSV file and compute column averages. Any reasonable solution will include something like:

import csv

with open('data.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        pass  # compute averages here

This exact pattern appears in hundreds of thousands of web pages. A naive web scan would flag every submission that uses the csv module with a with open block as potentially copied. That is a false positive, and it is not a useful signal.

Good web detection systems handle this in two ways. First, they use match length thresholds — a 3-line boilerplate match is not flagged, but a 30-line match that includes the key algorithmic logic is. Second, they maintain frequency-weighted indexes: code that appears in thousands of web sources is downweighted because it is likely a common idiom rather than a copied snippet. Code that appears in exactly one Stack Overflow answer and matches a student submission closely is upweighted.
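A rough sketch of how those two safeguards might combine is shown below. The weighting borrows the inverse document frequency idea from text retrieval; the function names, the min_matches value of 10, and the exact formula are assumptions for illustration, not a description of any particular product.

```python
import math

def idf_weight(source_count, corpus_size):
    """Rare fingerprints weigh more; ubiquitous ones approach zero."""
    return math.log((1 + corpus_size) / (1 + source_count))

def weighted_match_score(matched, doc_freq, corpus_size, min_matches=10):
    """Score one submission-to-source match.

    matched:  fingerprints shared with a single web source
    doc_freq: {fingerprint: number of indexed sources containing it}
    """
    if len(matched) < min_matches:
        return 0.0  # length threshold: ignore short overlaps entirely
    return sum(idf_weight(doc_freq.get(fp, 1), corpus_size) for fp in matched)

# Hypothetical corpus of one million indexed sources.
corpus_size = 1_000_000
boilerplate = {fp: 500_000 for fp in range(12)}    # seen almost everywhere
rare = {fp: 1 for fp in range(100, 112)}           # seen in a single answer
doc_freq = {**boilerplate, **rare}

print(weighted_match_score(list(boilerplate), doc_freq, corpus_size))
print(weighted_match_score(list(rare), doc_freq, corpus_size))
```

Both matches have the same length, but the one built from fingerprints that occur in a single source scores far higher than the one built from patterns half the corpus shares.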

The practical result is that well-tuned web detection produces far fewer false positives than a naive implementation, but it requires the tuning to be done against real educational data. The threshold that works for a graduate algorithms course will not work for an introductory Python course. If you are evaluating a web scanning tool, ask the vendor what corpus they index and how they handle boilerplate. The answers tell you everything about whether the tool will be useful or noisy.

GitHub and Tutorial Sites Present a Different Problem

Stack Overflow code is usually short — a focused answer to a specific question. GitHub code is different. Repository solutions for assignments can be hundreds of lines long, fully structured, and intentionally written to be clear and educational. A student who finds a GitHub repository containing the exact assignment solution and submits it largely unchanged has committed a different kind of plagiarism than the student who copies a 15-line Stack Overflow snippet.

Tutorial websites like GeeksforGeeks and Programiz occupy a middle ground. They often present complete working code for common algorithms (DFS, Dijkstra, quicksort) alongside lengthy explanations. Students frequently take the code block at the end of the article, strip the comments, and submit it as their own work. The code is structurally identical to the original but with comments removed and variable names lightly changed.

Detecting this requires the web scanner to index the tutorial site's code blocks specifically, not just the page text. Many web plagiarism detectors are built for text plagiarism (sentences and paragraphs). Code plagiarism requires a separate pipeline that extracts code blocks, normalizes them, and fingerprints them independently from the surrounding explanatory text. A tool that scans web pages as plain text will miss most code-specific plagiarism.

Codequiry's approach to this problem is to maintain separate indexes for code blocks versus text content, with language-specific normalization applied to the code index. A Python snippet from GeeksforGeeks is parsed as Python, not as general text, so the fingerprinting captures the structural tokens that survive variable renaming.
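As a minimal sketch of the extraction step, the snippet below pulls the contents of pre elements out of a page with Python's standard html.parser, leaving the surrounding prose behind. It assumes code lives inside pre tags, which is common but not universal. The class and function names are hypothetical, and this is not a description of how Codequiry or any other tool does it.

```python
from html.parser import HTMLParser

class CodeBlockExtractor(HTMLParser):
    """Collect the text inside <pre> elements, ignoring surrounding prose."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # how many <pre> elements we are inside
        self.blocks = []    # finished code blocks
        self._buffer = []   # text of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)

def extract_code_blocks(html):
    """Return the code blocks found in an HTML page, in document order."""
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks

page = ("<p>Use DFS to visit every node.</p>"
        "<pre><code>def dfs(g, s):\n    return [s]\n</code></pre>"
        "<p>That is the whole idea.</p>")
print(extract_code_blocks(page))
```

Each extracted block can then be parsed and fingerprinted as code in its own language, separately from the article text around it.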

Combining Both Approaches for Full Coverage

The most complete picture of code originality comes from running peer comparison and web scanning as complementary passes, not as alternatives. Here is a practical workflow that several universities have adopted:

First pass: peer comparison. Run MOSS, JPlag, or Codequiry's peer similarity module across all submissions. Flag pairs with high similarity for standard academic integrity review. This catches shared cheating and collusion.
Second pass: web source scanning. Run the same submissions through a web-indexed scan. Flag submissions that match online sources above the configured threshold. This catches sourced cheating from Stack Overflow, GitHub, and tutorials.
Third pass: cross-reference. Low-similarity peer matches combined with high web-source matches suggest individual web-sourced work. High peer matches combined with low web-source matches suggest collusion. High scores on both passes suggest a student who both copied from the web and shared with a peer — more common than you might expect.
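The cross-reference step above can be sketched as a small triage function. The thresholds are placeholders (0.6 for peer similarity, 0.5 for web similarity) and the labels are illustrative. Real values would need tuning per course, and every label is a prompt for human review, not a verdict.

```python
def triage(peer_score, web_score, peer_threshold=0.6, web_threshold=0.5):
    """Classify one submission from its two similarity scores (0.0 to 1.0)."""
    high_peer = peer_score >= peer_threshold
    high_web = web_score >= web_threshold
    if high_peer and high_web:
        return "web-sourced and shared"
    if high_web:
        return "web-sourced"
    if high_peer:
        return "possible collusion"
    return "no flag"

# Hypothetical scores for four submissions.
for name, peer, web in [("s01", 0.10, 0.85), ("s02", 0.75, 0.05),
                        ("s03", 0.80, 0.90), ("s04", 0.12, 0.08)]:
    print(name, triage(peer, web))
```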

Three semesters of data from one large CS department showed that approximately 40% of flagged submissions appeared only in the web scan, 35% appeared only in the peer comparison, and 25% appeared in both. Relying on peer comparison alone would have missed about 40% of cases; relying on web scanning alone would have missed about 35%.

For instructors and departments evaluating tooling, the practical recommendation is straightforward: if you are only running one pass, you are missing more than a third of the available signal. Peer comparison and web scanning address different cheating behaviors, and the student population that copies from the web is not the same population that copies from a classmate. You need both.

Frequently Asked Questions

Do plagiarism tools like MOSS check web sources?
No. MOSS compares submissions only against other submissions in the same batch. It does not index or search against web content. Detecting code copied from the web requires a separate web scanning pass with an indexed corpus of online source code.

How do web source detectors avoid false positives from common code patterns?
Good detectors use frequency-weighted indexing (downweighting widely used patterns) and match-length thresholds (only flagging matches above a certain size). They also maintain separate indexes for different programming languages to avoid cross-language false matches.

Can web source detection find code that was copied and then heavily modified?
It depends on the modification. Variable renaming, whitespace changes, and comment removal are handled by normalization and AST-based fingerprinting. Heavy restructuring — changing control flow, splitting functions, rewriting logic — can evade detection, but such modifications also require the student to understand the code well enough that the pedagogical concern is different.

Should I use web scanning for every assignment or just major ones?
If you cannot scan everything, prioritize the smaller assignments. Students tend to copy from the web on smaller, more mechanical assignments (implementing a known algorithm, writing a utility function) more often than on larger, more open-ended projects. Running web scanning on the smaller assignments where the solution space is narrow yields the highest detection rate per submission scanned.