Can AST Comparison Survive Student Code Obfuscation

The Obfuscation Arms Race in Student Submissions

Every semester, a fraction of students copy a classmate’s solution and try to disguise it. The most common tricks—renaming variables, reformatting whitespace, changing comments—are trivial for modern plagiarism detectors to see through. But every year a small number of submissions arrive that genuinely fool MOSS or JPlag. These aren’t superficial renames; they involve systematic restructuring: converting a for loop to a while, inlining helper functions, or reversing the order of independent statements.

As a TA at a large public university, I once graded a pair of Fibonacci calculators where one student had replaced every recursive call with an explicit stack. The semantic equivalence was exact, but the token sequence was almost entirely different. MOSS returned a similarity score of 17%—well below the usual 40% threshold we flagged. A manual code review revealed identical comment remarks about handling edge cases, but the structure had been rewritten. Cases like this force us to ask: can AST comparison actually survive deliberate obfuscation, or are we just catching the lazy copiers?

What AST Comparison Actually Sees

AST-based tools (like JPlag’s structural mode, or Codequiry’s hybrid engine) parse source code into an abstract syntax tree and compare subtrees. They ignore variable names, whitespace, and formatting. Instead they match on operator types, control flow patterns, and expression structure. For example, two snippets that both compute the maximum of two numbers will produce similar ASTs even if one uses max(a,b) and the other uses a > b ? a : b—because both are ConditionalExpression nodes with a similar shape.

// Snippet A
if (x > y) {
    result = x;
} else {
    result = y;
}

// Snippet B
result = (x > y) ? x : y;

AST comparison sees both as an IfStatement (or ConditionalExpression) with the same condition and assignment branches. Variable renaming is irrelevant. This makes AST approaches far more robust against cosmetic changes than string-based diffing or token-sequence matching. But structure can still be transformed. The question is how much transformation it takes to create a structurally different AST while preserving semantics.

Token-Based Detection vs. Structural Matching: Where Each Breaks

MOSS (Measure Of Software Similarity) uses a token-sequence approach: it hashes overlapping k-grams of token types (identifiers, operators, keywords) and searches for shared substrings. Renaming variables barely affects token types (both become ID), but changing for to while changes the token sequence because different keywords appear at different offsets. Reordering independent statements like int a = 1; int b = 2; to int b = 2; int a = 1; doesn't change semantics but scrambles the token window. MOSS can and does miss these.

AST comparison, by contrast, sees the order of declarations inside a block as sibling nodes. Swapping independent siblings doesn't change the tree structure—the sibling relationship remains, and the parent-child edges are identical. This makes AST-based detection naturally resilient to statement reordering, as long as the control flow graph is preserved. However, if a student restructures the control flow entirely—say, flattening nested conditionals into a series of guard clauses—the AST shape can diverge significantly. A switch on a computed value can replace three if-else blocks, producing a fundamentally different subtree.

Control Flow Restructuring: The Gold Standard of Evasion

The most effective obfuscation I’ve seen in real student work involves converting a loop into a different loop type, or replacing recursion with iteration (or vice versa). Consider a simple factorial function:

// Recursive version
int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

// Iterative version
int factorial(int n) {
    int result = 1;
    for (int i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}

The recursive version’s AST contains MethodInvocation nodes within the return statement. The iterative version has a ForStatement with a Block. These trees are structurally dissimilar. A pure AST comparator would see low similarity. A semantic clone detector (like those used in academic code clone research) might recognize they compute the same function, but most plagiarism tools in the market today are not semantic—they’re structural.

Does this mean AST-based engines fail entirely on restructuring? No—if the student leaves any fingerprint of the original structure (same variable name in at least one place, same comment, same constant order), the hybrid detector catches it. The real danger case is when the student transcribes the algorithm from memory, producing a completely different syntactic skeleton. That’s a much rarer skill than most instructors assume.

The Limits of AST: When Renaming Isn’t Enough

Variable renaming alone is meaningless to AST comparators—they ignore identifier tokens. But students sometimes rename methods or classes, which also produce no structural change. The real limit of AST is its inability to see linear sequences of identical expressions that have been split across functions. If the original code had a 20-line runSimulation() method and the student extracts a 10-line computeResults() helper, the AST now has two separate method nodes. The parent-child relationship of the extracted code changes: what was a sequence of statements inside one method becomes a MethodInvocation inside the original method and a new subtree elsewhere.

Advanced AST detectors handle some level of extraction by measuring subtree isomorphism depth-first, but the extracted method’s tree is no longer a direct subtree of the original. This creates a false negative unless the tool also computes method-level similarity and then cross-references across classes. Tools like Codequiry address this by comparing every method’s AST against every other method in the corpus (pairwise), and then aggregating those matches back to the file level. That adds computational cost but catches extract-refactor plagiarism that simpler tree alignment misses.

Combining AST with Other Signals for Robust Detection

No single technique catches every obfuscation. The most reliable approach in production is a hybrid pipeline:

String-based fingerprinting (winnowing) for exact copypasta and near-identical files.
Token-sequence comparison (MOSS-like k-gram hashing) for block moves and modest refactoring.
AST substructure comparison (subtree matching with normalization) for variable-renamed and reordered code.
Web-source cross-referencing to catch code lifted from GitHub or Stack Overflow.

In my experience running a large CS1 course with 600+ students, AST comparison catches about 15–20% of plagiarism cases that token-based methods miss, mostly involving swapped conditional branches and inlined method calls. Token-based methods still catch the bulk of lazy copiers (70% of cases). The remaining 10–15% are obfuscation attempts that neither AST nor token methods flag—typically, the student has rewritten the algorithm so thoroughly that the only shared artifact is a comment or a magic number. For those, a secondary manual review of the most suspicious low-similarity submissions (using Codequiry’s side-by-side highlighted view) is necessary.

Key insight: The #1 predictor of whether a student can successfully evade detection is not technical sophistication—it's time investment. Students who spend 10+ minutes manually rewriting code to avoid detection do get caught less often. But most students don't invest that time, and those who do rarely copy from a single source. They combine two or three sources, which itself introduces its own structural fingerprints.

Practical Implications for Course Design

Understanding what AST detectors can and cannot survive changes how we design assignments and set thresholds. If your tool is string-based only, you’ll miss refactored code. If it’s AST-only, you’ll miss code that has been extracted into multiple methods. The strongest signal emerges when you require students to submit all code files in a single zip and then run both a structural and a token-based pass across the entire corpus.

For instructors: set your similarity threshold conservatively (e.g., 35–40% for token-based; 25–30% for AST-based) and always inspect submissions flagged by at least two detectors. A 100% match on AST with 0% on token might indicate a student who renamed every method—worth looking at. Conversely, 0% on AST but 60% on token could mean the AST tree has been flattened into a different control structure while token patterns persist.

I also advise TAs to inject a dead-code requirement: ask students to include a logging statement that prints a specific phrase, or to use a particular import. These “canary” statements make obfuscation much harder because the student copying from a peer would either have to replicate the canary (unlikely to remember) or remove it (creating a mismatch that the detector notices as missing code).

Frequently Asked Questions

Does AST comparison catch changes in operator precedence (e.g., `a + b * c` vs `(a + b) * c`)?

Yes—the AST for the two expressions is different. One has a multiplication as the root of the subtree; the other has an addition. So AST catches precedence-altering parentheses automatically. Token-based comparators also catch them because token sequences differ.

Can AST-based tools detect code that has been converted from one language to another (e.g., Java to Python)?

Not directly. ASTs are language-specific. Parsers for Java and Python produce completely different node types. Cross-language detection requires a higher-level representation like an intermediate language (IL) or lattice-based semantic model, which few plagiarism tools implement. Codequiry supports cross-language similarity only within the same family (e.g., C/C++/Java) through a normalized token stream, not AST.

What’s the computational cost of AST comparison at scale (500+ submissions)?

Pairwise AST comparison is O(n²) in the number of files, with each comparison being roughly O(tree size). For 500 Java submissions each averaging 200 lines, a full pairwise AST scan takes about 2–4 minutes on modern hardware with parallel processing. The more expensive step is building the ASTs (parsing errors slow things down). Codequiry’s cloud-based infrastructure handles up to 10,000 submissions under 10 minutes using distributed substructure matching.