It was the third week of CS 101 at a large Midwestern university. Alex, a senior CS major working as a teaching assistant, had just received his first batch of 300 Java submissions for a basic sorting algorithm assignment. Most were straightforward implementations of bubble sort and insertion sort. But something felt off. Two submissions had identical variable names, identical method signatures, and identical comments—except one student had renamed a single method from sortArray to arrSort. A simple find-and-replace job.
That was the obvious case. The harder cases came later: submissions where students had rewritten loops as recursive functions, changed all variable names, and reordered method declarations. These looked like different code to the untrained eye—and to many plagiarism detection tools. Alex needed a deeper approach.
This article walks through the practical workflow Alex developed over a semester to detect refactored code plagiarism in large Java course sections. It covers the algorithms, tools, and manual inspection techniques that turned a laborious weekly task into a manageable, repeatable process. If you’re a TA, professor, or grader dealing with similar volumes, the patterns here will translate directly to your context.
The Assignment That Triggered the Shift
The assignment was straightforward: implement a program that reads a list of integers from a file, sorts them using a choice of three algorithms (bubble sort, insertion sort, quick sort), and outputs the sorted list with timing information. Students had to write the sorting algorithms themselves—no Collections.sort() or Arrays.sort() allowed. The spec required specific method signatures, a main class structure, and proper file I/O.
Alex ran the first batch through a basic token-based similarity checker. The tool produced a report ranking pairs by similarity percentage. The top pair scored 97%—the obvious find-and-replace case. Easy catch. But the next fifteen pairs scored between 55% and 70%. The tool flagged them, but Alex couldn’t immediately tell if they were legitimate independent work or well-disguised copies.
“Token-based tools are great for catching copy-paste with minor edits. But when a student refactors code—renaming everything, restructuring loops, changing control flow—the token sequence changes entirely. You need a different lens.” —Alex, TA for CS 101
This is the central problem in code plagiarism detection: two programs can be semantically identical but syntactically unrecognizable. Students who understand the assignment material can intentionally obfuscate their copied code to evade detection. The challenge is to find the structural fingerprint that survives refactoring.
Token-Based Detection: The Baseline
Most plagiarism detection tools—MOSS, JPlag, and Codequiry among them—start with tokenization. The idea is simple: strip away whitespace, comments, and identifier names, then map the remaining syntactic elements to a sequence of token types. For Java, this means converting something like:
public class SortingApp {
    public static void bubbleSort(int[] arr) {
        int n = arr.length;
        for (int i = 0; i < n; i++) {
            for (int j = 1; j < (n - i); j++) {
                if (arr[j - 1] > arr[j]) {
                    int temp = arr[j - 1];
                    arr[j - 1] = arr[j];
                    arr[j] = temp;
                }
            }
        }
    }
}
Into a token sequence like: PUBLIC CLASS ID PUBLIC STATIC VOID ID LP INT LBRACKET RBRACKET ID RP INT ID EQUALS ID DOT ID SEMI FOR LP INT ID EQUALS NUM SEMI ID LESS ID SEMI ID PLUS PLUS RP FOR LP INT ID EQUALS NUM SEMI ID LESS LP ID MINUS ID RP SEMI ID PLUS PLUS RP IF LP ID LBRACKET ID MINUS NUM RBRACKET GREATER ID LBRACKET ID RBRACKET RP INT ID EQUALS ID LBRACKET ID MINUS NUM RBRACKET SEMI ...
This token sequence captures the structure of the code without being sensitive to identifier names. Two programs with identical token sequences are almost certainly plagiarized, even if all variables are renamed. Token-based detection catches the obvious refactoring attempt: changing variable names.
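If you want to experiment with the idea yourself, here is a minimal sketch, assuming the javalang library is installed (the names token_types and token_similarity are mine, not part of any particular tool). It lexes Java source, collapses identifiers and literal values to placeholders, and scores two token sequences with Python's difflib:

import difflib
import javalang

def token_types(code):
    # Lex the source; comments and whitespace disappear during lexing.
    # Identifiers and literal values collapse to placeholders, so renaming
    # a variable or changing a constant leaves the sequence unchanged.
    seq = []
    for tok in javalang.tokenizer.tokenize(code):
        if isinstance(tok, javalang.tokenizer.Identifier):
            seq.append('ID')
        elif isinstance(tok, javalang.tokenizer.Literal):
            seq.append('LIT')
        else:
            seq.append(tok.value)   # keywords, operators, separators keep their text
    return seq

def token_similarity(code_a, code_b):
    # Ratio in [0, 1] based on the longest matching runs of tokens.
    return difflib.SequenceMatcher(None, token_types(code_a),
                                   token_types(code_b)).ratio()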
But consider what happens when a student restructures the loops. Suppose they replace the inner for loop with a while loop:
for (int i = 0; i < n; i++) {
    int j = 1;
    while (j < (n - i)) {
        if (arr[j - 1] > arr[j]) {
            int temp = arr[j - 1];
            arr[j - 1] = arr[j];
            arr[j] = temp;
        }
        j++;
    }
}
The token sequence changes substantially: the compact FOR header is replaced by a separate variable declaration, a WHILE header, and an increment statement at the end of the loop body. The sequence no longer lines up with the original, so a token-based tool reports a lower similarity score (perhaps 60-70%), and a TA might dismiss the pair as independent work. This is where structural analysis becomes essential.
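Feeding the two loop variants through the token_similarity sketch above makes the effect concrete (the exact number depends on how much surrounding code the fragments share):

for_version = """
for (int i = 0; i < n; i++) {
    for (int j = 1; j < (n - i); j++) {
        if (arr[j - 1] > arr[j]) { int temp = arr[j - 1]; arr[j - 1] = arr[j]; arr[j] = temp; }
    }
}
"""

while_version = """
for (int i = 0; i < n; i++) {
    int j = 1;
    while (j < (n - i)) {
        if (arr[j - 1] > arr[j]) { int temp = arr[j - 1]; arr[j - 1] = arr[j]; arr[j] = temp; }
        j++;
    }
}
"""

# The ratio comes back visibly below 1.0 even though the fragments do the same
# work; across a whole submission, many such rewrites can compound into the
# 60-70% range described above.
print(token_similarity(for_version, while_version))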
AST-Based Analysis: Seeing Past Syntax
Abstract Syntax Trees (ASTs) represent the structure of code independently of its surface syntax. Every Java program can be parsed into an AST where nodes represent constructs like method declarations, loops, conditionals, and expressions. Two programs that perform the same operations will have similar ASTs even if their textual representations differ.
For the bubble sort example, the AST for both the for-loop version and the while-loop version will contain:
- A method declaration node for bubbleSort
- A loop node representing the outer iteration (regardless of loop type)
- A nested loop node representing the inner iteration
- A conditional node for the comparison
- Assignment nodes for the swap
The key insight: AST structure is far more resistant to refactoring than token sequences. Changing a for to a while doesn’t change the AST’s topology—it just changes the type of the loop node. A comparison algorithm that normalizes loop and conditional types can treat for and while as equivalent, focusing on the nesting structure.
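A toy version of that normalization, again assuming javalang (the function name is mine): walk the parsed tree and emit one label per statement node, mapping every loop variant to the same LOOP label, so the for-to-while rewrite above barely changes the output.

import javalang

LOOP_NODES = (javalang.tree.ForStatement,
              javalang.tree.WhileStatement,
              javalang.tree.DoStatement)

def normalized_statement_sequence(code):
    # One label per statement node, in tree-walk order. Loop variants collapse
    # to LOOP and conditionals to IF, so mostly the nesting pattern remains.
    tree = javalang.parse.parse(code)
    labels = []
    for _, node in tree.filter(javalang.tree.Statement):
        if isinstance(node, LOOP_NODES):
            labels.append('LOOP')
        elif isinstance(node, javalang.tree.IfStatement):
            labels.append('IF')
        else:
            labels.append(type(node).__name__)
    return labels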
Tools like Codequiry use AST-based fingerprinting internally. When Alex ran his batch of 300 submissions through a tool that combined token and AST analysis, the suspicious pairs jumped in similarity. The pair that scored 63% on tokens alone moved to 87% on combined AST-token analysis. That was enough to justify a closer look.
Building a Simple AST Fingerprint
You don’t need a production-grade tool to apply AST analysis. Even a simple fingerprint extracted with a standard Java parser can reveal structural similarity. Here’s a Python script that uses the javalang library to record, for each method in a Java file, its name and its maximum loop nesting depth:
import javalang
from collections import defaultdict

def extract_ast_fingerprint(filepath):
    with open(filepath, 'r') as f:
        code = f.read()

    tree = javalang.parse.parse(code)
    fingerprint = defaultdict(list)
    loop_types = (javalang.tree.ForStatement,
                  javalang.tree.WhileStatement,
                  javalang.tree.DoStatement)

    for _, method in tree.filter(javalang.tree.MethodDeclaration):
        max_depth = 0
        # Walk only this method's subtree, so loops in other methods are not counted.
        for path, node in method.filter(javalang.tree.Statement):
            if isinstance(node, loop_types):
                # Nesting depth = enclosing loop statements on the path, plus this one.
                depth = 1 + sum(1 for ancestor in path
                                if isinstance(ancestor, loop_types))
                max_depth = max(max_depth, depth)
        fingerprint[method.name].append(max_depth)

    return fingerprint
This yields a fingerprint like: {'bubbleSort': [2], 'main': [0], 'readFile': [1]}. Two programs that both have a method called bubbleSort with loop depth 2, a main method with depth 0, and a readFile method with depth 1 share a structural signature. If you normalize method names (e.g., by hashing their internal AST subtree), you can compare the structural topology without relying on identifier names at all.
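One way to do that normalization, sketched under the same javalang assumptions (the helper name is mine), is to hash the node-type sequence of each method's subtree and use the hashes in place of the method names:

import hashlib
import javalang

def method_structure_hashes(code):
    # Replace each method's name with a hash of its subtree's node-type names,
    # so the fingerprint no longer depends on what the student called the method.
    # Collapsing loop node types to a generic label first (as in the earlier
    # sketch) would also make the hash survive for/while rewrites.
    tree = javalang.parse.parse(code)
    hashes = []
    for _, method in tree.filter(javalang.tree.MethodDeclaration):
        labels = ' '.join(type(node).__name__ for _, node in method)
        hashes.append(hashlib.sha1(labels.encode()).hexdigest()[:12])
    return sorted(hashes)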
Alex built a simple tool that computed AST fingerprints for all 300 submissions. He then clustered submissions by fingerprint similarity. The results were illuminating: several clusters contained submissions from students in the same lab section. That alone wasn’t evidence—students in the same section might independently converge on similar structures—but it narrowed the search space considerably.
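A rough sketch of that comparison step, building on extract_ast_fingerprint above (the multiset-overlap measure and the 0.9 threshold are my own illustrative choices, not Alex's exact metric):

from collections import Counter
from itertools import combinations

def fingerprint_similarity(fp_a, fp_b):
    # Compare the two submissions' multisets of per-method loop depths,
    # ignoring method names entirely.
    a = Counter(d for depths in fp_a.values() for d in depths)
    b = Counter(d for depths in fp_b.values() for d in depths)
    overlap = sum((a & b).values())           # multiset intersection size
    total = max(sum(a.values()), sum(b.values())) or 1
    return overlap / total

def suspicious_pairs(fingerprints, threshold=0.9):
    # fingerprints: dict mapping submission path -> extract_ast_fingerprint(path)
    for (name_a, fp_a), (name_b, fp_b) in combinations(fingerprints.items(), 2):
        score = fingerprint_similarity(fp_a, fp_b)
        if score >= threshold:
            yield name_a, name_b, score

Loop-depth multisets are a weak signal on their own, which is why this serves only as a coarse pre-filter before closer inspection.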
Fingerprinting Beyond AST: Normalized Code Representations
AST analysis alone can still be fooled by more aggressive refactoring. Consider a student who takes a bubble sort implementation and converts it to use recursion instead of iteration:
public static void bubbleSortRecursive(int[] arr, int n) {
    if (n == 1) return;
    for (int i = 0; i < n - 1; i++) {
        if (arr[i] > arr[i + 1]) {
            int temp = arr[i];
            arr[i] = arr[i + 1];
            arr[i + 1] = temp;
        }
    }
    bubbleSortRecursive(arr, n - 1);
}
The AST for this recursive version is quite different from the iterative version. A tool that only compares tree structure might miss the connection. This is where normalized code representations come in: you transform both programs into a canonical form before comparison.
Common normalization techniques include:
- Loop-to-recursion normalization: Detect recursive patterns and convert them to equivalent iterative forms (or vice versa)
- Control flow graph normalization: Convert both programs to a control flow graph and compare graph edit distance
- Data flow normalization: Extract data dependencies independent of execution order
- Operator reordering normalization: Sort commutative operations (e.g., a + b and b + a) into a canonical order; a short sketch of this one follows the list
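As a taste of the last item, here is a sketch (javalang again, names mine) that renders binary expressions with commutative operands in sorted order, so a + b and b + a normalize to the same string:

import javalang

COMMUTATIVE = {'+', '*', '==', '!=', '&', '|', '^', '&&', '||'}

def canonical_expr(node):
    # Render an expression tree, sorting the operands of commutative operators.
    # (Treating + as commutative ignores string concatenation; acceptable for a
    # similarity heuristic, not for program transformation.)
    if isinstance(node, javalang.tree.BinaryOperation):
        left = canonical_expr(node.operandl)
        right = canonical_expr(node.operandr)
        if node.operator in COMMUTATIVE:
            left, right = sorted((left, right))
        return '(' + left + ' ' + node.operator + ' ' + right + ')'
    if isinstance(node, javalang.tree.MemberReference):
        return 'VAR'   # identifiers collapse, consistent with token normalization
    if isinstance(node, javalang.tree.Literal):
        return 'LIT'
    return type(node).__name__   # fallback for anything not handled above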
Each normalization step increases the computational cost of comparison. For a batch of 300 submissions, Alex had to balance thoroughness with the time available between grading deadlines. He developed a tiered approach:
- First pass: Token-based similarity. Quick, catches obvious cases. Flag pairs above 70%.
- Second pass: AST-based fingerprinting on the remaining pairs. Flag pairs above 80%.
- Third pass: Normalized code representation on the flagged pairs from step 2. Manual review of the top matches.
This tiered workflow reduced the manual review load from 300²/2 ≈ 45,000 pairwise comparisons to a manageable 20-30 pairs per week.
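Stitched together from the sketches above, the tiered pipeline might look something like this (thresholds and helper names are illustrative, not Alex's exact tool):

from itertools import combinations

def tiered_review(paths, sources, token_thresh=0.70, ast_thresh=0.80):
    # paths: submission file paths; sources: dict of path -> source text.
    fps = {p: extract_ast_fingerprint(p) for p in paths}   # parse each file once

    review = []        # pairs headed straight for manual inspection
    remaining = []     # pairs the cheap token pass did not flag
    for a, b in combinations(paths, 2):
        if token_similarity(sources[a], sources[b]) >= token_thresh:
            review.append((a, b, 'token'))
        else:
            remaining.append((a, b))

    # Second pass: structural fingerprints on whatever the token pass missed.
    flagged = [(a, b, 'ast') for a, b in remaining
               if fingerprint_similarity(fps[a], fps[b]) >= ast_thresh]

    # Third pass (not shown): run a normalized-form comparison on the flagged
    # pairs and hand the top matches, plus the token hits, to a human.
    return review + flagged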
Manual Inspection: The Human Component
No automated tool can definitively determine plagiarism. The final step always involves human judgment. Alex developed a checklist for manual inspection of flagged pairs:
- Structural parallelism: Do both programs follow the same unusual algorithmic choices? For example, using bubble sort with a swapped flag when the spec didn’t require it.
- Idiomatic quirks: Do both programs use the same non-standard idiom? E.g., using ++i in all loops when most students use i++, or using continue in the same unconventional places.
- Error handling patterns: Do both programs handle edge cases in the same way? E.g., both use return Optional.empty() when the spec didn’t mention optionals; both catch NumberFormatException in the same location.
- Comment and whitespace structure: Even if comments are removed, the pattern of logical groupings (blank lines between sections, indentation style) can reveal a shared source.
- Variable name migration: If one program uses count and the other uses counter, that’s a trivial rename. But if one uses elementIndex and the other uses currentPos, that suggests a deliberate attempt to obscure the original variable name while preserving the conceptual role.
Alex found that the most reliable indicator was unnecessary complexity. Students who copy code often add irrelevant transformations to avoid detection. One memorable pair had a student who wrapped every method body in a try-catch block that caught Exception and did nothing—a pattern that served no functional purpose but dramatically changed the token sequence.
“The best indicator isn’t how similar the code looks—it’s why the code looks different. If the differences are cosmetic, trivial, or add no value, that’s a red flag. If the differences show genuine understanding—like using a more efficient algorithm—then it’s probably independent work.” —Alex
Scaling the Workflow with Tooling
Alex’s manual approach worked for 300 submissions, but not for the 1,200-student course other TAs handled. For larger courses, automation becomes essential. Tools like Codequiry provide an API that can run AST-based comparison across thousands of submissions and produce a ranked list of suspect pairs with detailed structural evidence.
The key feature Alex looked for in a tool was the ability to view normalized code side-by-side. Many tools show the original source with highlighting of matching tokens. That’s fine for copy-paste cases. But for refactored code, you need to see the AST structure, the control flow graph, or a normalized form where loops are collapsed and variable names are replaced with generic identifiers.
One approach Alex experimented with was to generate a “canonical form” for each submission by stripping all identifiers, replacing them with placeholder names (var0, var1, etc.), and then running a diff tool. Two submissions that differ only in identifier names will have identical canonical forms. Two submissions that differ in loop structure but share the same logical pattern will have very similar canonical forms.
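A stripped-down version of that canonicalizer (javalang assumed; function names mine) takes only a couple of dozen lines:

import difflib
import javalang

def canonical_form(code):
    # Re-emit the token stream with every identifier replaced by var0, var1, ...
    # in order of first appearance; comments and whitespace vanish during lexing.
    names = {}
    out = []
    for tok in javalang.tokenizer.tokenize(code):
        if isinstance(tok, javalang.tokenizer.Identifier):
            out.append(names.setdefault(tok.value, 'var%d' % len(names)))
        else:
            out.append(tok.value)
    return ' '.join(out)

def canonical_similarity(code_a, code_b):
    return difflib.SequenceMatcher(None,
                                   canonical_form(code_a).split(),
                                   canonical_form(code_b).split()).ratio()

Because the lexer discards comments and whitespace, tricks like the random blank lines in the case described next contribute nothing to the canonical diff.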
The canonical form approach caught a case that had slipped through token-based analysis: a student who had taken a classmate’s code, replaced every for loop with an equivalent while loop, changed all variable names, and added random blank lines. The token-based tool gave a similarity of 58%—below the threshold. The canonical form gave 94%. Alex pulled both students aside, showed them the canonical diff, and received a confession within minutes.
Practical Recommendations for TAs and Professors
Based on Alex’s semester-long experience, here are actionable recommendations for anyone grading programming assignments at scale:
- Use multiple detection layers. Token-based alone is insufficient. Combine it with AST analysis, normalized form comparison, or both. Codequiry and similar platforms offer these layers out of the box.
- Set appropriate thresholds. Don’t set a single similarity threshold for all assignments. A 60% similarity might be suspicious in a heavily scaffolded assignment where most code is provided, but normal in an open-ended project where students independently converge on similar solutions.
- Track false positives over time. Keep a log of pairs you reviewed and cleared. Patterns will emerge. If two students consistently have moderately similar code across multiple assignments, they’re either studying together legitimately or collaborating when they shouldn’t be.
- Design assignments with detection in mind. Include required unique elements: a specific output format, a mandatory test harness, or a unique combination of algorithms. These create structural anchors that make plagiarism harder to hide.
- Educate students before detection. Run an in-class demonstration of how plagiarism detection works. Show them an example of refactored code that was caught by AST analysis. Students who understand the system are less likely to attempt evasion.
- Build a reference library. Save exemplar solutions from previous semesters. Run new submissions against this library. Students occasionally copy from an older student’s work posted on GitHub or shared in a study group.
The Bottom Line for Graders
Refactored code plagiarism is not an unsolvable problem. The techniques exist—AST analysis, normalized code representations, careful manual inspection—to catch even the most determined evaders. The challenge is adopting a workflow that fits your course size, assignment structure, and grading timeline.
Alex’s final advice to fellow TAs: Don’t trust a single similarity score. Use multiple tools, multiple passes, and your own judgment. The goal isn’t to catch every instance of plagiarism—it’s to create a fair process that rewards original work and deters cheating. When students know that a TA can spot a refactored submission by looking at the AST fingerprint, they’re far less likely to try.
And if you’re managing a course with hundreds or thousands of submissions, consider integrating a platform like Codequiry that does the heavy lifting of AST parsing, fingerprinting, and normalization across submissions. The time you save on detection is time you can spend on the parts of teaching that matter more: writing better assignments, giving detailed feedback, and helping students actually learn to code.