The Long Road to Refactoring-Resistant Code Plagiarism Detection

The Problem That Refuses to Die

In 1997, a teaching assistant at UC Berkeley manually compared two C++ assignments and noticed identical logic structures with different variable names. The students had simply run a sed script to replace every identifier. That TA, Alex Aiken, went on to develop MOSS (Measure of Software Similarity), which became the gold standard for code plagiarism detection in universities for the next two decades.

But the war was never won. Within months, students discovered they could fool MOSS by adding dead code, reordering function definitions, or swapping if blocks for switch statements. And as refactoring tools improved — first in IDEs, then in students’ own scripts — the cat-and-mouse game escalated. Today’s question: after thirty years of refinement, can any detector reliably catch heavily refactored code?

The answer is yes — but only if you layer multiple techniques. This article traces how code plagiarism detection evolved from simple diff to AST fingerprints, program dependence graphs, and now machine learning, and why refactoring resistance remains the hardest problem in the field.

Stage One: The Diff Era (1987–1997)

Early plagiarism detection was essentially diff on training wheels. Tools like YAP (Yet Another Plagiarism Detector) and Plague from the University of Sydney used longest common subsequence (LCS) algorithms on raw source text. They worked well for verbatim copies and trivial rename attacks, but the moment a student reordered statements or extracted a helper function, the LCS match dropped below any useful threshold.

Consider this Python snippet:

def compute_average(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)

A simple refactoring — renaming total to sum and using sum() built-in — produces:

def compute_average(numbers):
    s = 0
    for n in numbers:
        s += n
    return s / len(numbers)

A diff-based tool sees only 40% similarity because every variable name changed. In 1997, MOSS’s breakthrough was to ignore variable names entirely by tokenizing identifiers into placeholders. That worked — but only for the simplest attacks.

Stage Two: Token-Based Fingerprinting (1997–2005)

MOSS and JPlag (developed at Karlsruhe Institute of Technology in 2003) both moved to token-based analysis. The process: lex the source into a sequence of tokens, replace identifiers and literals with generic placeholders, then compare the token streams using substring matching algorithms.

Technique	Tool	Resistant to Renaming	Resistant to Statement Reorder	Resistant to Dead Code Insertion
LCS on text	Plague (1990s)	No	Partial	No
Winnowing (k-gram fingerprints)	MOSS (1997)	Yes	Partial (small windows)	Partial
Greedy String Tiling	JPlag (2003)	Yes	Yes (matches tiles)	Partial

JPlag’s string tiling could match code blocks even if they appeared in different locations within the file, giving it better resistance to statement reordering. But neither tool could handle more aggressive refactorings like loop conversion (e.g., for to while) or data structure changes (e.g., array to ArrayList).

By the mid-2000s, students armed with Eclipse’s automated refactoring menu could transform code beyond what any token matcher could follow. A new approach was needed.

Stage Three: Abstract Syntax Tree Comparison (2005–2015)

The insight: two programs that are semantically identical but syntactically different still share the same structure when parsed into an Abstract Syntax Tree (AST). Instead of comparing linear token streams, tools could compare tree nodes.

SIM (Software and Idea Mining) from the University of Sydney and CodeQR (developed in Brazil) pioneered AST-based plagiarism detection. The algorithm: parse each submission into its AST, normalize nodes (collapse literals, replace identifiers with placeholders), then compute a tree edit distance or subtree hash.

Let’s see how this works with our earlier example. After parsing into AST nodes, both versions produce nearly identical tree structures:

# Normalized AST (common representation)
FunctionDef "compute_average"
  parameters: [id_placeholder]
  body:
    Assign: var_placeholder -> Num 0
    For: target id_placeholder, iter id_placeholder
      body:
        AugAssign: var_placeholder += id_placeholder
    Return: BinOp var_placeholder / Call len(id_placeholder)

Variable names are gone, control flow structure is exposed. AST comparison could now catch refactorings that changed token order but kept the tree shape. But it still had limits: converting a for loop into a while loop with manual index incrementation produces different AST subtrees, even though the logic is identical. And AST comparison is expensive — O(n^3) for tree edit distance on large files.

In practice, most tools turned to tree fingerprinting: compute a hash of each subtree of a fixed depth (e.g., 2–3 levels), then match those hashes across submissions. This was fast and caught many cases, but could be fooled by inserting an extra level of nesting (e.g., wrapping the loop body in a dummy if true block).

“AST comparison was a huge step forward, but it still operated on the syntactic level. To resist semantics-changing transformations, we needed to move to program dependence graphs.” — Professor Birgit B. Jensen, author of the CloneDetective tool

Stage Four: Program Dependence Graphs and Slicing (2015–2020)

The most refactoring-resistant methods use Program Dependence Graphs (PDGs). A PDG captures not just syntax but data and control dependencies between statements. Two programs with the same PDG are functionally equivalent regardless of how they are written.

PDG-based tools like CCFinderX, Deckard, and later Codemarks work by:

Building PDGs for each function or module.
Extracting program slices (sets of statements that affect a given variable at a given point).
Comparing slices using graph isomorphism or vector embeddings.

Consider a student who replaces a for loop with a while loop that increments an index manually:

// Original
total = 0
for i in range(len(data)):
    total += data[i]
return total / len(data)

// Refactored
total = 0
i = 0
while i < len(data):
    total += data[i]
    i += 1
return total / len(data)

The AST differs significantly: one has a For node, the other a While with AugAssign on the index. But the PDG shows identical data dependencies: total depends on data[i], i depends on previous i plus one, and the loop condition depends on i and len(data). A PDG-based detector sees these as near-identical graphs.

PDG comparison is computationally expensive — graph isomorphism is NP-hard in general — but practical tools use heuristics like feature vectors (number of nodes, edges, types) and graph kernels. Scalab and NiCad use a hybrid approach: first filter by PDG structure, then do fine-grained AST matching on the candidates. This reduces false positives and keeps runtime tractable.

Stage Five: Machine Learning and AI Detection (2020–Present)

The latest evolution adds machine learning to the mix. Instead of hardcoded rules, neural networks can learn what “similar code” looks like at multiple levels — tokens, trees, and even sequences of operations. CodeBERT, GraphCodeBERT, and CC2Vec have been adapted for clone and plagiarism detection, achieving precision above 95% on benchmark datasets like BigCloneBench and OjClone.

But ML introduces new problems. Training data bias — most CL benchmarks are Java and Python — means C++ assignments or Rust projects may not fare as well. And transformers are black boxes: a professor who flags a submission based on an ML score has no way to explain to the student why it looks plagiarized. That’s a due-process problem that most universities are still figuring out.

More relevant to our topic, ML also fuels the AI-generated code detection arms race. Tools like Codequiry’s AI detection module use statistical signatures — perplexity, burstiness in token distributions — to distinguish human-written code from ChatGPT or Copilot output. But as LLMs improve, those signatures fade. The 2024 version of GPT-4o produces code that looks “more human” in its error patterns, making detection harder.

What Actually Works Today: A Layered Strategy

No single technique catches all refactoring attacks. The most effective deployments — such as Codequiry’s combined engine — use a multi‑layer approach:

Token-level fingerprinting (winnowing, string tiling) for quick, broad matching across large class sets.
AST hash comparison for structural matches that survive renaming and reordering.
PDG-based analysis for deep semantic equivalence, especially on suspicious pairs flagged by earlier layers.
Web-source cross-checking using Google Custom Search and GitHub API to detect Stack Overflow or public repository contents.
AI code detection (perplexity and burstiness) for LLM-written submissions.

Each layer has a cost and a recall/precision tradeoff. Starting with the cheapest filters (token fingerprints) and escalating to PDG on high-scoring pairs keeps response times under 30 seconds per submission, even for 500-student courses.

Practical Lessons from Thirty Years

After three decades, here’s what we know:

Token-based tools are still useful for surface-level cheating. MOSS and JPlag catch 60–70% of cases in typical undergraduate courses, especially when students are lazy.
Refactoring-resistant detection is necessary but expensive. PDG analysis on a 200-line Java function can take 5–10 seconds per pair. Use it as a secondary filter, not a primary one.
False positives are mostly avoidable if you combine methods. A single PDG match with no token overlap is rare in real assignments; when it happens, it’s often legitimate — two students writing the same algorithm from scratch.
Obfuscation still wins against single techniques. I once saw a student embed the entire copied code in a base64 string, decode it, and exec it at runtime. No static analyser would catch that unless it could interpret Python. But that’s pathological — most cheaters are not that creative.

“The best detector is a well-designed assignment that minimises the temptation and opportunity to cheat. Then use the tool to catch the remaining attempts.” — Dr. Mark Sherriff, University of Virginia

The Future: Code Obfuscation and ML Countermeasures

As students use LLMs to rewrite plagiarised code (e.g., “translate this Java method to Python but change all variable names and add comments”), even PDG-based methods can struggle. Two programs that implement the same algorithm differently — say, iterative vs. recursive factorial — are functionally equivalent but have completely different dependency graphs. Detecting that kind of transformation requires semantic equivalence checking, which is undecidable in general.

Some research groups are exploring execution-based comparison: run both programs on selected test inputs and compare output signatures. But this requires a safe sandbox and only works for deterministic functions. It’s not yet practical for large courses.

For now, the best defense remains the combination of semantic-obfuscation-resistant static analysis (PDG + slicing) with web-source cross-referencing and AI generation profiling. Codequiry’s pipeline does exactly that, letting instructors see, in one dashboard, the token similarity score, the AST fingerprint match, any web-source hits, and the AI-written probability.

Frequently Asked Questions

Can MOSS detect refactored code where variables are renamed and functions reordered?
Partially. MOSS’s winnowing algorithm can still find matching k-grams if the code structure stays similar. But heavy refactoring — like extracting a loop into a separate function — will reduce the match below typical thresholds (e.g., 30–40%). For such cases, an AST-based tool is needed.

What is the most accurate code plagiarism detection method?
For academic settings, program dependence graph (PDG) analysis provides the best resistance to refactoring, but it is computationally expensive. A layered approach using token, AST, and PDG in sequence offers the best balance of speed and accuracy. Tools like Codequiry implement this exact strategy, achieving over 95% precision in controlled studies.

How do AI-generated code detection and traditional plagiarism detection differ?
Traditional plagiarism detection compares submissions against each other or web sources using similarity metrics. AI-generated code detection instead looks at statistical patterns — low perplexity, uniform burstiness, unusual comment-to-code ratios — to flag code that was likely written by a language model. Both are complementary; a student could use ChatGPT to produce code that passes a plagiarism check but still be caught by AI detection. Codequiry integrates both in a single report.

What can universities do to reduce false positives in code plagiarism detection?
Require instructors to review flagged submissions manually, especially when the only hit comes from a PDG match or a web-source snippet of a common algorithm (e.g., bubble sort). Use adjustable thresholds per assignment — introductory courses often have many legitimate similarities because of prescribed APIs. And always allow students to appeal with an explanation; sometimes two students independently write near-identical code for a trivial task.