How Code Similarity Detection Advanced From Strings to Semantics

The Problem That Won’t Go Away

Every CS professor who has graded a hundred introductory programming assignments knows the feeling: two submissions that are structurally identical but differ in variable names, whitespace, and comment text. In the 1990s, the standard defense was a manual diff or a simple string comparison. Both failed spectacularly against any student who knew to rename identifiers and swap loop bodies.

Code plagiarism detection had to evolve. It moved from literal strings to abstract representations, from shallow features to deep structural analysis. Today, the same tools must also contend with code written by large language models—ChatGPT, Copilot, Claude—where the plagiarist isn’t a student but an algorithm. The story of that evolution is a case study in applied computer science: how we learned to ignore surface detail and measure semantic similarity.

Phase 1: Strings and Diffs

In the early days (roughly 1980–1994), most detection was manual. Instructors would open two files side by side in an editor and visually compare. The first automated attempts used Unix diff or custom line-by-line comparison. These tools treat code as a sequence of characters or lines. A simple rename like

int main() {
    int a = 10;  // original
    int b = a + 5;
    printf("%d", b);
}

int start() {
    int first = 10;  // renamed version
    int second = first + 5;
    printf("%d", second);
}

would be reported as almost entirely different, because every line except the closing brace changed. Students quickly learned that such trivial transformations defeated string matching. This phase also had high false-positive rates: two students independently writing a standard for loop could trigger a match. By 1993, the CS education community knew it needed something better.
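The failure is easy to reproduce. The sketch below uses Python's standard difflib module as a stand-in for a line-based comparison; that choice is an assumption, as is the helper name line_similarity, and the two snippets are the ones shown above with comments removed.

```python
import difflib

ORIGINAL = """int main() {
    int a = 10;
    int b = a + 5;
    printf("%d", b);
}"""

RENAMED = """int start() {
    int first = 10;
    int second = first + 5;
    printf("%d", second);
}"""


def line_similarity(left, right):
    """Return difflib's similarity ratio computed over whole lines."""
    matcher = difflib.SequenceMatcher(None, left.splitlines(), right.splitlines())
    return matcher.ratio()


if __name__ == "__main__":
    # Only the closing brace survives the rename, so the score is low.
    print(f"line-level similarity: {line_similarity(ORIGINAL, RENAMED):.2f}")
```

A line-based score this low would let the renamed copy pass unnoticed, even though the two programs do exactly the same thing.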

Phase 2: Tokenization and the Birth of MOSS

The watershed moment came in 1994 with Alex Aiken’s MOSS (Measure Of Software Similarity) at UC Berkeley. MOSS introduced a two-step process: tokenization followed by fingerprinting.

Tokenization discards comments and whitespace and maps each remaining token to a symbol from a fixed alphabet: keywords, operators, and literals stand for themselves, while every identifier collapses to a single placeholder (written ID here). The code above becomes a token sequence like:

int ID ( ) { int ID = 10 ; int ID = ID + 5 ; ID ( "%d" , ID ) ; }

Now both the original and renamed versions map to the same token stream. MOSS then applies a winnowing algorithm: it hashes every run of k consecutive tokens (a k-gram, typically k=5), slides a window over those hashes, and keeps the smallest hash in each window. The selected hashes form a compact fingerprint that is robust against many simple obfuscations: renaming, whitespace changes, comment removal.
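To make the two steps concrete, here is a minimal sketch of tokenization followed by winnowing. It is not MOSS's actual implementation, which is not public in this form. The keyword list, the regex tokenizer, the NUM and STR placeholders for literals, the window size of 4, and every function name are assumptions chosen for illustration.

```python
import hashlib
import re

# Assumed keyword subset; a real tokenizer would follow the language grammar.
KEYWORDS = {"int", "char", "void", "for", "while", "if", "else", "return"}

TOKEN_RE = re.compile("|".join([
    r"//[^\n]*",                       # line comment (dropped later)
    r"/\*.*?\*/",                      # block comment (dropped later)
    r'"(?:\\.|[^"\\])*"',              # string literal
    r"[A-Za-z_]\w*",                   # identifier or keyword
    r"\d+",                            # integer literal
    r"\+\+|--|[-+*/%=<>!]=|&&|\|\|",   # multi-character operators
    r"[^\s\w]",                        # any other punctuation character
]), re.DOTALL)

ORIGINAL = 'int main() {\n    int a = 10;  // original\n    int b = a + 5;\n    printf("%d", b);\n}'
RENAMED = 'int start() {\n    int first = 10;  // renamed version\n    int second = first + 5;\n    printf("%d", second);\n}'


def tokenize(source):
    """Map C-like source to a normalized token stream."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        text = match.group(0)
        if text.startswith(("//", "/*")):
            continue  # comments carry no structure
        if text.startswith('"'):
            tokens.append("STR")
        elif text[0].isdigit():
            tokens.append("NUM")
        elif text[0].isalpha() or text[0] == "_":
            tokens.append(text if text in KEYWORDS else "ID")
        else:
            tokens.append(text)
    return tokens


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        hashes.append(int(hashlib.sha1(gram.encode("utf-8")).hexdigest()[:8], 16))
    return hashes


def winnow(hashes, window):
    """Keep the minimum hash from each sliding window."""
    if len(hashes) <= window:
        return {min(hashes)} if hashes else set()
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def fingerprint(source, k=5, window=4):
    """Tokenize, hash k-grams, and winnow down to a set of hashes."""
    return winnow(kgram_hashes(tokenize(source), k), window)


def jaccard(a, b):
    """Share of fingerprint hashes the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


if __name__ == "__main__":
    print(" ".join(tokenize(ORIGINAL)))
    print("similarity:", jaccard(fingerprint(ORIGINAL), fingerprint(RENAMED)))
```

Because the renamed copy produces the same token stream, its fingerprint is identical to the original's. Keeping only one hash per window is what makes the fingerprint compact enough to index a whole class.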

JPlag, developed at the University of Karlsruhe in 1996, also compares token streams, but with its own tokenization and a Greedy String Tiling comparison in place of fingerprinting. It specifically targeted structured programming languages (Java, C++, Python) and could even detect extracted method bodies: if a student took a 20-line function and split it into three sub-functions, JPlag would still find unusually high substring overlap in the token stream.

Token-based methods remain the most widely deployed plagiarism checkers in universities today. MOSS handles over 30,000 submissions per day at peak semesters. But they have a fundamental blind spot: refactoring that changes control flow.

A student who replaces a for loop with a while loop, or inlines a function call, will produce a different token sequence even when the logic is identical. Tokenization captures the syntactic form, not the semantic structure.

Phase 3: Abstract Syntax Trees and Structural Fingerprints

By the early 2000s, researchers pushed toward structure-aware methods. The key insight: an abstract syntax tree (AST) captures how a program’s statements and expressions nest, independent of layout and, once identifiers are normalized, of names. Two pieces of code that differ syntactically (e.g., for vs while) can still have very similar ASTs if you normalize certain constructs.

Consider:

// Original
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// Refactored
int i = 0;
while (i < n) {
    sum += arr[i];
    i++;
}

Token-based tools see two different streams. An AST-based comparison that normalizes loop constructs sees two trees with the same shape: a loop node containing a condition, an array access, an addition, and an increment. Before normalization, the only differences are the kind of loop node and where the counter is initialized and incremented. Tools like Simian (Software Similarity Analyzer) and later Sherlock from the University of Melbourne compute tree edit distances or hash subtrees to find near-duplicate subtrees.
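Here is a rough sketch of that normalization, written against Python's standard ast module because it ships with a parser. The loops are therefore Python translations of the C-like pair above. The ForRangeToWhile rewrite, the shape serializer, and all names are illustrative assumptions, and the rewrite handles only the simplest `for x in range(n)` form.

```python
import ast

FOR_VERSION = """
total = 0
for i in range(n):
    total += arr[i]
"""

WHILE_VERSION = """
acc = 0
k = 0
while k < n:
    acc += arr[k]
    k += 1
"""


class ForRangeToWhile(ast.NodeTransformer):
    """Rewrite `for x in range(n): body` as `x = 0` plus an equivalent while loop."""

    def visit_For(self, node):
        self.generic_visit(node)
        call = node.iter
        is_simple_range = (
            isinstance(node.target, ast.Name)
            and isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "range"
            and len(call.args) == 1
            and not call.keywords
            and not node.orelse
        )
        if not is_simple_range:
            return node
        var = node.target.id
        # Parse a template so the new nodes look exactly like parsed ones.
        init, loop = ast.parse(f"{var} = 0\nwhile {var} < 0:\n    {var} += 1").body
        loop.test.comparators = [call.args[0]]  # replace the placeholder bound
        loop.body = node.body + loop.body       # original body, then the increment
        return [init, loop]


def shape(node, names):
    """Serialize a tree, renaming identifiers in order of first appearance."""
    if isinstance(node, ast.Name):
        return names.setdefault(node.id, f"v{len(names)}")
    if isinstance(node, ast.Constant):
        return repr(node.value)
    children = ", ".join(shape(child, names) for child in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"


def raw_shape(source):
    """Shape of the tree as parsed, with only identifiers normalized."""
    return shape(ast.parse(source), {})


def normalized_shape(source):
    """Shape of the tree after the for-to-while rewrite."""
    return shape(ForRangeToWhile().visit(ast.parse(source)), {})


if __name__ == "__main__":
    print("raw trees equal:       ", raw_shape(FOR_VERSION) == raw_shape(WHILE_VERSION))
    print("normalized trees equal:", normalized_shape(FOR_VERSION) == normalized_shape(WHILE_VERSION))
```

The rewrite is deliberately approximate (it ignores, for instance, what value the counter holds after the loop), which is acceptable for similarity scoring but would not be for a compiler.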

AST methods resist many types of structural refactoring. They can also detect cross-language plagiarism: a Java for loop and a C++ for loop produce nearly identical ASTs after language-specific node removal. This was a crucial advance for courses that taught multiple languages.

However, AST comparison is computationally more expensive. A submission with hundreds of lines generates a tree with thousands of nodes; a class of 200 students yields 19,900 submission pairs, and comparing subtrees within each pair can push the total into millions of tree-edit-distance calculations. Practical implementations use hashing and fingerprinting again, but on AST nodes: you hash a subtree and store it in an inverted index. Any two submissions sharing many subtree hashes are flagged.
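The subtree-hash index can be sketched in a few lines. As before, this uses Python's ast module on small Python submissions for convenience. The min_nodes cutoff, the three sample submissions, and the function names are invented for the example, and the shape string ignores identifiers and literal values entirely.

```python
import ast
import hashlib
import math
from collections import defaultdict
from itertools import combinations

SUBMISSIONS = {
    "alice": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "bob": "def add_all(items):\n    acc = 0\n    for it in items:\n        acc += it\n    return acc\n",
    "carol": "def greet(name):\n    print('hello ' + name)\n",
}


def subtree_hashes(source, min_nodes=4):
    """Hash the shape of every subtree with at least min_nodes nodes."""
    hashes = set()

    def visit(node):
        parts, size = [], 1
        for child in ast.iter_child_nodes(node):
            child_shape, child_size = visit(child)
            parts.append(child_shape)
            size += child_size
        # Only node types enter the shape, so names and literals are ignored.
        node_shape = f"{type(node).__name__}({','.join(parts)})"
        if size >= min_nodes:
            hashes.add(hashlib.sha1(node_shape.encode("utf-8")).hexdigest()[:12])
        return node_shape, size

    visit(ast.parse(source))
    return hashes


def shared_subtree_counts(submissions):
    """Count shared subtree hashes per pair using an inverted index."""
    index = defaultdict(set)  # hash -> names of submissions containing it
    for name, source in submissions.items():
        for h in subtree_hashes(source):
            index[h].add(name)
    counts = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            counts[pair] += 1
    return dict(counts)


if __name__ == "__main__":
    print("pairs in a class of 200:", math.comb(200, 2))
    print(shared_subtree_counts(SUBMISSIONS))
```

The index only ever touches pairs that actually share a hash, so unrelated submissions such as carol's never need to be compared directly.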

Phase 4: Program Dependence Graphs and Semantics

The most sophisticated pre-AI method relied on program dependence graphs (PDGs). A PDG encodes data and control dependencies between statements. Two code snippets that implement the same algorithm (e.g., bubble sort) in the same way will often have isomorphic or nearly isomorphic PDGs even if the loops are structured differently or variable names differ.

This level of analysis catches the hard cases: replacing a recursive function with an iterative one, or swapping the order of two independent statements. PDG-based tools like GPlag (2006) and later commercial offerings achieved high recall against manual obfuscation, but at even higher computational cost, because matching graphs means searching for subgraph isomorphisms. For large courses, PDG analysis is rarely used in practice except for spot-checking suspicious cases.
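The statement-reordering case is the easiest one to illustrate. The sketch below is far simpler than a real PDG tool: it tracks only data dependencies, handles only straight-line code with plain assignments, and compares edge sets directly instead of testing graph isomorphism. The function name and sample programs are assumptions made for the example.

```python
import ast

VERSION_A = """
width = 3
height = 4
area = width * height
print(area)
"""

# Same program with the two independent assignments swapped.
VERSION_B = """
height = 4
width = 3
area = width * height
print(area)
"""


def data_dependence_edges(source):
    """Return (defining statement, using statement) pairs for straight-line code."""
    last_def = {}  # variable name -> label of the statement that last assigned it
    edges = set()
    for stmt in ast.parse(source).body:
        label = ast.dump(stmt)  # position-independent label for the statement
        reads, writes = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    writes.add(node.id)
                else:
                    reads.add(node.id)
        for var in reads:
            if var in last_def:
                edges.add((last_def[var], label))
        for var in writes:
            last_def[var] = label
    return edges


if __name__ == "__main__":
    same = data_dependence_edges(VERSION_A) == data_dependence_edges(VERSION_B)
    print("same dependence edges:", same)
```

Both versions yield the same three edges (each dimension feeds the area, and the area feeds the print), so the swap that changes the token stream leaves the dependence structure untouched.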

The trade-off is clear: deeper semantic understanding requires more time and more sophisticated tooling. Most university workflows settle on a pipeline: a fast token-based filter (MOSS or JPlag) to narrow candidates, then manual review or AST-based deep dive for borderline cases.

Phase 5: The AI Detection Frontier

Starting in 2022, a new form of “plagiarism” emerged: code generated entirely by large language models. A student can prompt ChatGPT with “write a Java class that implements a binary search tree” and receive a unique piece of code that shares no token overlap with any other student. Traditional similarity checkers give it a 0% match—but the code was not written by the student.

Detecting AI-generated code requires a different lens. Researchers identified statistical signals in LLM output: perplexity (how unpredictable each token is to a language model) and burstiness (how much line and token lengths vary). Human-written code tends to have varying identifier lengths, occasional typographical errors, and a natural rhythm of short and long lines. LLM output is more uniform: consistent lengths, no typos, and a predictable pattern of comment density.
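Both signals reduce to short formulas once the inputs exist. In the sketch below, perplexity is the exponential of the average negative log-probability, computed from per-token probabilities that some language model is assumed to have supplied already. Burstiness is approximated as the coefficient of variation of line lengths, which is one possible definition among several and not a standard one.

```python
import math
import statistics


def perplexity(token_probs):
    """exp of the mean negative log-probability; lower means more predictable."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def line_length_burstiness(source):
    """Standard deviation of non-blank line lengths divided by their mean."""
    lengths = [len(line) for line in source.splitlines() if line.strip()]
    if not lengths:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


if __name__ == "__main__":
    # Hypothetical probabilities a model might assign to four tokens.
    print("perplexity:", perplexity([0.5, 0.5, 0.5, 0.5]))
    uniform = "x = compute(a)\ny = compute(b)\nz = compute(c)"
    uneven = "x=1\nresult = some_function(alpha, beta, gamma, delta)\ny=2"
    print("uniform burstiness:", line_length_burstiness(uniform))
    print("uneven burstiness: ", line_length_burstiness(uneven))
```

A model that assigns every token a probability of one half has a perplexity of 2, as if it were choosing between two equally likely options at each step.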

Modern platforms, including Codequiry, combine traditional plagiarism detection with AI-specific metrics. They run a three-layer pass:

  1. Token-based similarity (MOSS-style) against the class corpus and known web sources.
  2. AST fingerprinting to catch cross-version refactoring.
  3. Perplexity and burstiness scoring to identify AI-generated code even when it appears original.

No single signal is definitive. The best practice is stacking evidence: low perplexity plus low similarity to other students plus unusually consistent formatting flags a submission for human review.
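Stacking evidence can be as simple as requiring several independent signals to agree before anyone is asked to look. The sketch below is a toy rule, not any platform's actual scoring. The Signals fields and all three thresholds are placeholders chosen only to make the example run. Because LLM output tends to be more predictable, the perplexity signal fires when the score is low.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    perplexity: float           # lower means more predictable tokens
    max_peer_similarity: float  # highest similarity to any classmate, 0 to 1
    burstiness: float           # variation in line lengths

# Placeholder thresholds; real values would have to be calibrated on labeled data.
PERPLEXITY_MAX = 5.0
SIMILARITY_MAX = 0.10
BURSTINESS_MAX = 0.25


def triggered_signals(s):
    """List the individual signals that point toward AI generation."""
    reasons = []
    if s.perplexity < PERPLEXITY_MAX:
        reasons.append("low perplexity")
    if s.max_peer_similarity < SIMILARITY_MAX:
        reasons.append("low similarity to peers")
    if s.burstiness < BURSTINESS_MAX:
        reasons.append("unusually uniform formatting")
    return reasons


def needs_human_review(s):
    """Flag only when all three signals agree; a flag is a prompt, not a verdict."""
    return len(triggered_signals(s)) == 3


if __name__ == "__main__":
    suspicious = Signals(perplexity=3.0, max_peer_similarity=0.02, burstiness=0.10)
    ordinary = Signals(perplexity=3.0, max_peer_similarity=0.60, burstiness=0.10)
    print(needs_human_review(suspicious), triggered_signals(suspicious))
    print(needs_human_review(ordinary), triggered_signals(ordinary))
```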

The evolution hasn’t ended. As LLMs improve, they may learn to mimic human variability, forcing detection tools to move beyond statistical surfaces to deeper semantic patterns—maybe even verifying that code contains the kinds of iterative debugging artifacts (e.g., leftover print statements, off-by-one fixes) that humans produce and LLMs rarely do.

What This Means for Instructors and Teams

Understanding this history helps you choose the right tools for the right threat model:

  • If you’re worried about students copying from each other (classic plagiarism), a token-based checker (MOSS, JPlag, Codequiry’s similarity engine) is sufficient and fast.
  • If you see sophisticated refactoring—extracting methods, replacing loops—you need AST or PDG analysis. Run a deep check on the top 10% of suspicious matches.
  • If you suspect AI generation, you need statistical signals. No conventional similarity tool can catch original AI code.
  • For industry code integrity (e.g., contractor submissions), combine all three layers plus license/policy scans.

The tools have come a long way from diff. But the principle remains: find the signal under the noise. Whether the noise is renamed variables or an LLM’s smoothed-over output, the job is the same—and it keeps getting harder.

Frequently Asked Questions

What is the strongest code plagiarism detection technique today?

No single technique is best. A combination of token-based fingerprinting (for speed) and AST-based comparison (for structural detection) catches the majority of student plagiarism. For AI-generated code, perplexity and burstiness analysis are added.

Can AI detection tools distinguish between ChatGPT and Copilot?

Not reliably. The statistical signatures are similar across major LLMs. Current detectors flag “likely AI-generated” without attributing to a specific model. The most effective approach is to cross-reference with web-source checks (e.g., did the code appear in a GitHub repo before the model’s training cutoff?).

How often do AST-based tools produce false positives?

AST normalization can over-match trivial code. For example, a common boilerplate loop (e.g., iterating over an array) will match many students precisely because it’s the standard idiom. Good tools weight each match by how rare the subtree pattern is across the class.
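One plausible way to weight by rarity borrows inverse document frequency from text retrieval: a pattern found in every submission contributes nothing, and a pattern shared by only two students contributes a lot. The function names and sample counts below are hypothetical, and real tools may weight matches quite differently.

```python
import math


def rarity_weight(submissions_with_pattern, total_submissions):
    """IDF-style weight: zero for a pattern everyone has, larger for rare ones."""
    return math.log(total_submissions / submissions_with_pattern)


def weighted_match_score(shared_patterns, pattern_counts, total_submissions):
    """Sum the rarity weights of the patterns two submissions share."""
    return sum(
        rarity_weight(pattern_counts[p], total_submissions) for p in shared_patterns
    )


if __name__ == "__main__":
    # Hypothetical counts: how many of 200 submissions contain each pattern.
    counts = {"array-loop": 200, "swap-idiom": 150, "odd-helper": 2}
    print(weighted_match_score({"array-loop", "swap-idiom"}, counts, 200))
    print(weighted_match_score({"odd-helper"}, counts, 200))
```

Under this scheme a single rare shared pattern outweighs two common ones, so matches on standard idioms stop dominating the ranking.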

Do any tools detect code that is paraphrased by an LLM?

Some emerging research looks at “semantic fingerprinting” based on functional equivalence: treating two pieces of code as the same if they produce the same output for all inputs. Equivalence is undecidable in general, so such tools must approximate it, and they are not yet production-ready. For now, statistical AI detection combined with manual review remains the best defense.
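The simplest approximation is differential testing: run both programs on many random inputs and see whether they ever disagree. The sketch below uses invented sample functions and an arbitrary input generator. It can refute equivalence, but agreement on a sample never proves it.

```python
import random


def sum_loop(values):
    """Iterative sum."""
    total = 0
    for v in values:
        total += v
    return total


def sum_recursive(values):
    """Recursive rewrite of the same computation."""
    if not values:
        return 0
    return values[0] + sum_recursive(values[1:])


def sum_of_squares(values):
    """A genuinely different function, for contrast."""
    return sum(v * v for v in values)


def agree_on_samples(f, g, trials=200, seed=0):
    """Return False as soon as f and g disagree on a random integer list."""
    rng = random.Random(seed)
    for _ in range(trials):
        sample = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if f(list(sample)) != g(list(sample)):
            return False
    return True


if __name__ == "__main__":
    print("loop vs recursive:", agree_on_samples(sum_loop, sum_recursive))
    print("loop vs squares:  ", agree_on_samples(sum_loop, sum_of_squares))
```

A paraphrased submission that keeps the original behavior would pass this check against its source no matter how thoroughly the text was rewritten, which is why the idea is attractive despite its limits.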