From Paper Traces to Abstract Syntax Trees: Code Similarity Then and Now

Before the Tools: Paper and Eyeballs

In the early 1990s, a computer science professor at a large state university faced a familiar problem. Two C++ assignments, submitted by different students, compiled without error and produced identical output. But the code itself looked different — variable names changed, comments removed, a few function bodies reordered. Could the professor prove they were copied?

She printed both submissions. She spent an afternoon with two highlighters, marking structurally equivalent blocks. She found that the same for-loop pattern appeared in both, just with reversed iteration direction and renamed counters. She wrote a note to the department chair. The students were referred to the honor board.

That process — manual, labor-intensive, and ultimately dependent on the grader's patience — was the state of the art for code plagiarism detection in 1992. It worked for egregious cases. It missed everything else.

The Diff Era: Unix as Primitive Plagiarism Detector

The first automation attempt was almost too obvious: use diff. The Unix utility, designed to find line-level differences between text files, seemed like a natural fit. Compare two student submissions. If the diff output is small, they're similar. Case closed.

$ diff student1.c student2.c
1,2c1,2
< // Final Project Submission
< // Student: Alice Smith
---
> // My awesome C program
> // Student: Bob Jones
5c5
< int sum = 0;
---
> int total = 0;
[...]

This fails spectacularly, of course. diff compares lines as opaque strings. Rename a variable, and the diff grows. Reorder two function definitions, and the diff explodes. Reformat the whitespace, and diff sees a deliberate rewrite.

Some professors tried diff -b -w to ignore whitespace. Others ran submissions through a preprocessor that stripped comments and normalized indentation. These hacks bought marginal improvement, but the fundamental limitation remained: diff sees text, not code. It has no concept of a variable declaration, a control flow structure, or a function signature.
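
The preprocessing step itself was usually a few lines of scripting. Here is a minimal sketch in Python of that kind of normalizer (a hypothetical helper, not any particular department's tool): it strips C comments and collapses whitespace before the files are handed to diff.

# Hypothetical pre-diff normalizer: strip C comments, collapse whitespace.
# Note: the regex approach will also mangle comment-like text inside string
# literals, which is one reason these hacks stayed hacks.
import re
import sys

def normalize(source: str) -> str:
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    lines = (" ".join(line.split()) for line in source.splitlines())
    return "\n".join(line for line in lines if line)             # drop blank lines

if __name__ == "__main__":
    # usage: python normalize.py student1.c > student1.norm
    print(normalize(open(sys.argv[1]).read()))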

The realization that spread through CS departments in the mid-1990s was this: code similarity detection is fundamentally different from text similarity detection. A C program and its copy with renamed identifiers share zero identical lines but share 100 percent of the same structure.

The Fingerprinting Revolution: Token-Based Similarity

In 1994, Alex Aiken at UC Berkeley developed a tool that would define the field for the next two decades. MOSS (Measure of Software Similarity) took a radically different approach: ignore the text, analyze the tokens.

MOSS works by tokenizing source code and hashing sequences of tokens into fingerprints. It then compares these fingerprints across submissions using an algorithm inspired by document fingerprinting in information retrieval. The critical insight: token sequences preserve program structure while being invariant to identifier names, comments, and whitespace.

Consider two equivalent code fragments:

// Student A
for (int i = 0; i < n; i++) {
    sum += array[i];
}
// Student B — Identical structure, renamed
for (int counter = 0; counter < length; counter++) {
    total += values[counter];
}

Text-based comparison sees completely different code. Token-based comparison sees the same sequence: FOR LPAREN INT ID ASSIGN INT_LITERAL SEMICOLON ID LT ID SEMICOLON ID INC RPAREN LBRACE ID PLUS ASSIGN ID LBRACKET ID RBRACKET SEMICOLON RBRACE.
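
A toy lexer makes this invariance concrete. The sketch below is hypothetical and far simpler than MOSS's real front end (it also treats += as a single PLUS_ASSIGN token, so its output differs slightly from the listing above), but it shows the idea: every identifier collapses to ID and every number to INT_LITERAL, so both fragments reduce to the same token sequence.

# Toy C tokenizer: identifiers and numeric literals become placeholders, so
# renaming variables leaves the token sequence unchanged.
import re

TOKEN_SPEC = [
    ("FOR", r"\bfor\b"), ("INT", r"\bint\b"),
    ("ID", r"[A-Za-z_]\w*"), ("INT_LITERAL", r"\d+"),
    ("INC", r"\+\+"), ("PLUS_ASSIGN", r"\+="), ("ASSIGN", r"="), ("LT", r"<"),
    ("LPAREN", r"\("), ("RPAREN", r"\)"), ("LBRACE", r"\{"), ("RBRACE", r"\}"),
    ("LBRACKET", r"\["), ("RBRACKET", r"\]"), ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(code: str) -> list[str]:
    return [m.lastgroup for m in MASTER.finditer(code) if m.lastgroup != "SKIP"]

student_a = "for (int i = 0; i < n; i++) { sum += array[i]; }"
student_b = "for (int counter = 0; counter < length; counter++) { total += values[counter]; }"
assert tokenize(student_a) == tokenize(student_b)   # identical token sequences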

MOSS also introduced the concept of winnowing — selecting a subset of fingerprints for comparison, which dramatically reduced the computational cost while maintaining accuracy. A submission of 1,000 lines generates many thousands of k-gram hashes, but only a fraction of those are selected as fingerprints. The result: MOSS could compare an entire class of 200 students in minutes, not hours.
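
The fingerprint selection itself is compact enough to sketch. In the winnowing scheme, every k-gram of the token stream is hashed, and each sliding window of w consecutive hashes contributes its minimum as a fingerprint. The version below is simplified: the k and w values are illustrative, and a real tool would use a stable rolling hash rather than Python's per-process-salted hash.

# Simplified winnowing: hash every k-gram of the token stream, then keep the
# minimum hash from each window of w consecutive k-gram hashes.
def winnow(tokens: list[str], k: int = 5, w: int = 4) -> set[int]:
    grams = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [hash(g) for g in grams]          # real tools: stable rolling hash
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        fingerprints.add(min(hashes[i:i + w])) # one representative per window
    return fingerprints

def fingerprint_similarity(a: list[str], b: list[str]) -> float:
    fa, fb = winnow(a), winnow(b)
    return len(fa & fb) / max(len(fa | fb), 1) # Jaccard overlap of fingerprints

Comparing a whole class then reduces to intersecting these small fingerprint sets, which is what lets a fingerprinting tool stay fast as the number of submissions grows.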

By the late 1990s, MOSS had become the de facto standard in academic settings. Harvard used it. Stanford used it. The University of Texas used it. Most institutions lacked the infrastructure to run it themselves, so Aiken ran a public web service that persists to this day. The service is still free, still run by the same person, and still processing millions of submissions per year.

But MOSS had blind spots. It could not detect structure-preserving transformations that altered the token stream. Inlining a function, for example, would change the token sequence significantly. Inserting redundant loops or branches that perform no useful work — a simple form of control flow obfuscation — could break token-based matching entirely.

The JPlag Alternative: Greedy String Tiling

Around the same time, Lutz Prechelt and his team at the University of Karlsruhe developed JPlag. Rather than fingerprinting, JPlag used a technique called greedy string tiling to cover two tokenized programs with the longest contiguous runs of matching tokens.

The algorithm works like this: find the largest contiguous block of matching tokens between two submissions, mark it as matched, then repeat on the remaining unmatched regions until no sizable match remains. The percentage of matched tokens becomes the similarity score.

Algorithm: Greedy String Tiling (simplified)
1. Tokenize both programs A and B
2. Find the longest run of matching tokens between A and B
3. Mark those tokens as "matched" in both programs
4. Repeat step 2 on unmatched tokens until no run longer than threshold T exists
5. Similarity = (total matched tokens) / (total tokens in A)
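
A naive rendition of those steps in Python (a sketch of the idea only; JPlag's actual implementation uses a heavily optimized tiling algorithm) looks like this:

# Naive greedy string tiling: repeatedly take the longest run of matching,
# still-unmarked tokens, mark it in both programs, and stop once no run of at
# least min_match tokens remains.
def greedy_string_tiling(a: list[str], b: list[str], min_match: int = 3) -> float:
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    matched = 0
    while True:
        best = None                            # (length, start_a, start_b)
        for i in range(len(a)):
            for j in range(len(b)):
                length = 0
                while (i + length < len(a) and j + length < len(b)
                       and not marked_a[i + length] and not marked_b[j + length]
                       and a[i + length] == b[j + length]):
                    length += 1
                if length >= min_match and (best is None or length > best[0]):
                    best = (length, i, j)
        if best is None:
            break
        length, i, j = best
        for offset in range(length):           # mark the tile in both programs
            marked_a[i + offset] = True
            marked_b[j + offset] = True
        matched += length
    return matched / max(len(a), 1)            # fraction of A covered by tiles

Fed the token sequences from the toy lexer sketched earlier, the two renamed for-loops tile completely and score 1.0; a single shared function inside otherwise different programs shows up as one long tile and a proportionally lower score.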

JPlag proved particularly effective at detecting partial copying — a student taking one function from a classmate and integrating it into their own work. The greedy tiling approach naturally identified these islands of similarity within otherwise different codebases.

For nearly a decade, MOSS and JPlag coexisted as the two dominant tools. CS professors developed preferences. Some swore by MOSS's speed and winnowing efficiency. Others trusted JPlag's more transparent similarity reporting. Both tools ran up against the same fundamental limitation: they operated on flattened token sequences and lost information about program structure.

The AST Revolution: Structure-Aware Detection

The early 2010s brought a paradigm shift. Researchers and tool developers began asking: why analyze a linearized representation of code when we can analyze the tree structure directly?

Abstract Syntax Trees (ASTs) capture the hierarchical structure of programs — the nesting of loops inside conditionals, the scoping of variables, the call graph of functions. Two programs that implement the same algorithm through different variable naming and expression ordering will produce ASTs that are structurally similar, even when their token sequences diverge significantly.

// Token sequence (heavily obfuscated):
// FOR LPAREN ... RPAREN LBRACE IF LPAREN ... RPAREN LBRACE ... RBRACE RBRACE

// AST structure:
// ForStatement
//   Init: Assignment
//   Condition: BinaryOp
//   Body: Block
//     IfStatement
//       Condition: BinaryOp
//       Then: Block
//         ReturnStatement

Modern AST-based tools compute similarity by comparing subtrees; a short sketch of the idea follows the list below. The approach handles many obfuscation techniques that defeated token-based methods:

  • Variable renaming — AST nodes contain identifiers, but comparison can normalize or ignore them
  • Expression restructuring — a + b / c and a + (b / c) produce identical ASTs
  • Statement reordering — independent statements shuffled around still match at the subtree level
  • Code insertion — inserted code adds new subtrees but leaves existing ones unmodified
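
Here is a minimal sketch of that comparison using Python's standard ast module (so it applies to Python submissions; production tools carry a parser per supported language). It renames every identifier to a placeholder and compares the remaining tree skeletons, which also demonstrates the expression-restructuring point from the list above.

# Structural comparison with Python's ast module: parse, strip identifier
# names, and compare what is left of the trees.
import ast

class Normalizer(ast.NodeTransformer):
    def visit_Name(self, node):                # variable references
        node.id = "_ID_"
        return node
    def visit_arg(self, node):                 # function parameters
        node.arg = "_ARG_"
        return node
    def visit_FunctionDef(self, node):         # function names
        self.generic_visit(node)
        node.name = "_FN_"
        return node

def skeleton(source: str) -> str:
    return ast.dump(Normalizer().visit(ast.parse(source)), annotate_fields=False)

student_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
student_b = "def acc(values):\n    t = 0\n    for v in values:\n        t += v\n    return t\n"
assert skeleton(student_a) == skeleton(student_b)       # same structure, new names
assert skeleton("a + b / c") == skeleton("a + (b / c)") # parentheses vanish in the AST

Real tools go further, hashing subtrees so that matching fragments can be located even when they appear at different positions in the two programs.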

Codequiry's analysis pipeline, for example, incorporates AST comparison alongside token-based methods to catch both surface-level and structural similarity. This layered approach catches students who have learned to evade simpler tools by renaming everything and shuffling function order while leaving the underlying program structure intact.

The Arms Race Continues: Modern Obfuscation and Mitigation

Students adapt. Every detection technique eventually faces countermeasures. We have observed several patterns in recent years:

Obfuscation Technique                         | Detection Method Defeated | Mitigation Strategy
Variable renaming + comment removal           | Token-based (MOSS, JPlag) | Normalize identifiers to placeholders
Function inlining                             | Token-sequence matching   | AST subtree matching on decomposed functions
Dead code insertion                           | Greedy string tiling      | Ignore low-frequency token patterns
Control flow obfuscation                      | All token-based methods   | Program dependence graph analysis
Semantic equivalence via different constructs | All structural methods    | Input-output analysis, dynamic traces

The hard truth: no detection method is invulnerable to a determined, technically sophisticated student. A student who rewrites the assignment from scratch, preserving only the logic and data structures, will produce code that no existing tool flags as suspicious. This is not a limitation of the tools — it is a fundamental property of the problem. Semantic equivalence is undecidable for Turing-complete languages.

What the tools do catch — and catch reliably — is the lazy copy. The student who renames identifiers and resubmits. The pair who split the work and then individually reformatted their halves. The group that shared a single solution and each tried to disguise it differently.

"The goal of plagiarism detection is not to catch every instance of academic dishonesty. The goal is to make dishonesty sufficiently detectable that the honest path remains the path of least resistance." — Lutz Prechelt, JPlag co-author

What the History Teaches Us About Building Better Detection

Three lessons emerge from thirty years of code similarity detection:

First, granularity matters. Line-level comparison is worthless. Token-level comparison catches renaming. AST-level comparison catches structural reorganization. Program dependence graphs catch control flow obfuscation. Each layer of the analysis adds a dimension of robustness, but also adds complexity and potential false positives.

Second, context is everything. A similarity score of 80 percent means different things in an introductory Python course versus a senior-level compiler construction project. In the former, students are working from constrained examples and may legitimately produce similar code. In the latter, the design space is vast and high similarity is genuinely suspicious. Modern tools must allow instructors to calibrate thresholds to their specific context.

Third, the tools are only one part of the system. The most effective academic integrity programs combine detection with clear communication of expectations, assignment design that makes copying harder than completing, and a fair, transparent adjudication process when matches are flagged. Codequiry's platform, for instance, provides not just similarity scores but detailed side-by-side comparisons that help instructors understand why two submissions are flagged — enabling informed decisions rather than automated judgments.

Looking Forward: The Next Decade

We are now seeing the emergence of hybrid approaches that combine multiple detection methods into unified pipelines. A modern analysis might run token-based fingerprinting, AST subtree matching, and control flow graph comparison in parallel, then aggregate results with machine-learned weights tuned to the specific assignment and programming language.
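
Reduced to its simplest form, the aggregation step might look like the sketch below; the detectors, weights, and threshold here are hypothetical placeholders, and a real system would fit them to labeled past cases rather than hard-coding them.

# Hypothetical aggregation of per-method similarity scores into one suspicion
# score. The weights and bias are placeholders, not tuned values; in practice
# they would be learned from instructor-adjudicated examples.
import math

def aggregate(token_sim: float, ast_sim: float, cfg_sim: float,
              weights=(2.0, 3.0, 1.5), bias=-4.0) -> float:
    score = bias + sum(w * s for w, s in zip(weights, (token_sim, ast_sim, cfg_sim)))
    return 1.0 / (1.0 + math.exp(-score))      # squash to a 0..1 suspicion score

needs_review = aggregate(token_sim=0.82, ast_sim=0.91, cfg_sim=0.40) > 0.5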

We are also seeing increased interest in cross-language plagiarism detection — identifying when a Java program has been transcribed into Python with preserved logic. This requires analysis at the level of algorithm structure rather than syntax, and remains an open research challenge.

And, of course, the recent emergence of large language models has introduced a new dimension: code that may be original to the submitter but was generated by an AI system. This presents different challenges from traditional plagiarism — the code is not copied from another student, but it may violate assignment policies about independent work. Detecting AI-generated code requires different techniques again, drawing on statistical analysis of token distributions and stylistic patterns that differ from human-written code.

The history of code similarity detection is a microcosm of computer science itself: an escalating series of problems and solutions, each new generation of tools responding to the limitations of the previous. The paper highlighters are long gone. The diff commands are retired. But the fundamental challenge remains: distinguishing learning from copying in a field where learning often looks like copying, and where the most efficient path to a working program is rarely the most educational one.