Do AST-Based Engines Catch More Refactored Cheating Than Token-Based Ones?

In the fall of 2023, the Computer Science department at a public R1 university with roughly 1,200 undergraduate majors faced a recurring problem. Their introductory Python course — CS 101, enrollment 380 students — had been using token-based plagiarism detection for four years. The tool they relied on flagged submissions with high token-sequence overlap. It worked well enough for naive copying. But the instructors suspected they were missing a growing subset of cases: students who understood enough to refactor copied code beyond what token matching could see.

The department chair, a professor of software engineering with 15 years of teaching experience, decided to run a systematic comparison. They wanted to know: Do AST-based engines catch more refactored cheating than token-based ones? And if so, at what false-positive cost?

This is the story of what they found.

The Setup: Three Detection Approaches on One Dataset

The department collected 240 student submissions from a single programming assignment — a Markov chain text generator that parsed input files into transition tables and generated probabilistic output. The assignment was mid-semester, moderately complex, and well-documented. It required students to implement:

  • A file parser extracting word-level n-grams
  • A probabilistic next-word selector
  • An output generator with configurable length and seed

The instructors knew from anecdotal evidence that roughly 12-15% of submissions had suspicious similarity. But they had never formally measured how many of those had been refactored — renamed variables, reordered methods, inlined helper functions, converted loops from for to while, or restructured conditionals.

They ran three detection approaches across all 240 submissions:

  1. Pure token-based: A MOSS-like pipeline that extracted token sequences, built a suffix tree, and reported contiguous overlap above 40 tokens.
  2. AST-based: A structural comparison that parsed each submission into an abstract syntax tree, normalized identifier names, and computed structural similarity using tree edit distance.
  3. Hybrid: A combined approach that used AST structural matching first, then applied token-based refinement on high-similarity pairs.

All three were run on a 32-core server with 128 GB of RAM. The token-based pipeline completed in 47 seconds. The AST-based pipeline took 11 minutes and 23 seconds. The hybrid pipeline took 12 minutes 8 seconds.

The Baseline: Naive Copying

The first finding was unsurprising. For submissions that were direct copies — identical files renamed, same comments, same variable names — both token and AST approaches performed identically. The token-based tool flagged 43 pairs with overlap above 80%. The AST-based tool flagged 41 of those same pairs. The two that AST missed had been converted from Python to a pseudocode-like comment structure that the parser couldn't handle.

But the instructors weren't concerned about direct copies. Those were already being caught. The question was about the next tier: intentional refactoring.

The Refactored Subset: What Changed?

The department had access to office-hours notes and institutional data that helped them identify 18 submissions they believed were derived from the same original source — likely a solution posted on a private Discord server — but had been substantially rewritten. Each of these 18 varied from the source in at least three of the following ways (a hypothetical before-and-after example follows the list):

  • All variable names changed (e.g., transition_table → tb, next_word → nw)
  • All function names changed (parse_file → read_input)
  • Function order randomized
  • Loop types swapped (for → while, while → list comprehension)
  • Conditional logic restructured (if-elif chains merged or split)
  • Helper functions inlined or extracted
  • Comments removed or rewritten
  • One version was translated from Python to Java (then back to Python)
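
To make these transformations concrete, here is a hypothetical before-and-after pair, invented for illustration rather than drawn from any actual submission, combining just two of the changes above: renaming plus a for-to-while conversion.

def parse_file(path):
    words = []
    for line in open(path):
        words.extend(line.split())
    return words

After renaming and the loop swap, the same logic might read:

def read_input(fname):
    tokens = []
    lines = open(fname).readlines()
    i = 0
    while i < len(lines):
        tokens.extend(lines[i].split())
        i += 1
    return tokens

The behavior is unchanged, but the token sequence and several AST nodes are not.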

This refactored subset was the test case. The instructors knew what they were looking for. They wanted to see which detection technique could still connect these 18 submissions back to their origin.

Token-Based: Misses on Structure, Catches on Fingerprints

The token-based tool flagged 6 of the 18 refactored submissions. The matches it did find were instructive. They came from leaked structural fingerprints — patterns that survived refactoring because they were inherent to the problem itself. For instance, the Markov chain assignment required a specific probability calculation:

import random

def select_next(current, table):
    # Weighted random choice: pick the next word in proportion to its
    # observed transition count from the current word.
    candidates = table[current]
    total = sum(candidates.values())
    r = random.uniform(0, total)
    running = 0
    for word, count in candidates.items():
        running += count
        if running >= r:
            return word

Even with renamed variables and different indentation, the token sequence sum(candidates.values()) followed by random.uniform(0, total) followed by for word, count in candidates.items() produced a distinctive 25-token fingerprint that the token-based engine caught. But it missed the other 12 submissions entirely because the refactoring was too aggressive — methods had been split, loops rewritten, and the token sequence broken beyond the tool's matching threshold.
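
To see why that fingerprint survives renaming, here is a minimal sketch of the general idea, not the department's actual pipeline: identifiers are collapsed to a placeholder before token sequences are compared, using only Python's standard tokenize and difflib modules. The ID normalization and the SequenceMatcher comparison are illustrative assumptions.

import io
import keyword
import tokenize
from difflib import SequenceMatcher

def token_stream(source: str) -> list[str]:
    # Lex Python source into a sequence of normalized token labels.
    stream = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue  # drop layout and comments so formatting can't hide a match
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            stream.append("ID")  # renaming a variable no longer breaks the sequence
        else:
            stream.append(tok.string)
    return stream

def longest_shared_run(a: str, b: str) -> int:
    # Length, in tokens, of the longest contiguous run shared by a and b.
    matcher = SequenceMatcher(None, token_stream(a), token_stream(b), autojunk=False)
    return max(m.size for m in matcher.get_matching_blocks())

original = "r = random.uniform(0, sum(candidates.values()))\n"
renamed  = "x = random.uniform(0, sum(freqs.values()))\n"
print(longest_shared_run(original, renamed))  # every token matches once names collapse to ID

A real engine would fingerprint and index these runs rather than compare pairs directly, but the principle is the same: the call structure of the probability calculation survives even when every name changes.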

The false-positive rate for token-based was low — 2.3% on the full dataset. But the false-negative rate on intentional refactoring was 66%. That was the gap the department wanted to close.

AST-Based: Better on Structure, Struggles on Granularity

The AST-based pipeline flagged 14 of the 18 refactored submissions. That's a 78% detection rate versus 33% for token-based. The improvement came from the fundamental difference in what each technique compares.

Token-based compares sequences. It asks: "Do these two submissions share a long string of tokens in the same order?" If a student reorders methods or changes a for-loop to a while-loop, the token sequence diverges quickly.

AST-based compares structure. It asks: "Do these two submissions have similar tree-shaped program structure, ignoring names and surface syntax?" After normalizing all identifiers to placeholders (variable → VAR0, function → FUNC1), the AST engine sees the shape of the code. A for-loop and a while-loop with the same body produce different AST nodes — so that refactoring does change the tree. But method reordering doesn't change the AST at all, and inline versus extracted functions often produce similar subtrees if the structure is preserved.
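
A simplified illustration of that normalization step, again not the study's engine, is shown below using Python's ast module. Identifiers are rewritten to positional placeholders before the tree shapes are compared; the exact dump comparison at the end is a stand-in for the tree edit distance a production engine would compute.

import ast

class Normalize(ast.NodeTransformer):
    # Rewrite every identifier to a positional placeholder (ID0, ID1, ...).
    def __init__(self):
        self.names = {}
    def _placeholder(self, name):
        return self.names.setdefault(name, f"ID{len(self.names)}")
    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node
    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)
        return node
    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

def shape(source: str) -> str:
    # Name-free rendering of a program's syntax tree.
    return ast.dump(Normalize().visit(ast.parse(source)))

a = "def parse_file(path):\n    return open(path).read().split()"
b = "def read_input(fname):\n    return open(fname).read().split()"
print(shape(a) == shape(b))  # True: renaming alone does not change the tree shape

The normalization is what makes renaming invisible; comparing the resulting trees with edit distance rather than exact equality is what tolerates partial restructuring.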

"The AST engine caught submissions where students had renamed every single identifier and reordered every method," the professor running the study told us. "But it flagged submissions where the only similarity was that both used the same basic algorithm — which is exactly what we wanted them to do."

The tradeoff became visible in the false-positive rate: 7.8% for AST-based detection. That's more than three times the token-based rate. The AST engine flagged submissions from students who had independently written structurally similar solutions — which is common in introductory programming assignments where there are only a few natural ways to implement a Markov chain.

The Hybrid: Best of Both, With a Filtering Problem

The hybrid approach — AST matching first, then token-based refinement on the candidate pairs — caught 15 of the 18 refactored submissions. That's 83%. The false-positive rate dropped to 4.1%, because the token refinement step filtered out structural matches that didn't share enough surface-syntax detail.

But the hybrid method introduced a new problem: threshold selection. The instructors had to choose an AST similarity threshold to generate candidate pairs, then a token similarity threshold to confirm or reject. These two thresholds interacted in non-obvious ways. Setting the AST threshold too low (e.g., 0.6) generated 1,200 candidate pairs — too many to manually review. Setting it too high (e.g., 0.9) missed refactored submissions that had substantial structural changes.

The team settled on an AST threshold of 0.75 and a token threshold of 0.35 after iterative tuning against a held-out validation set from the previous semester. This produced 47 candidate pairs, 15 of which were confirmed as likely plagiarism by manual review. The other 32 were false positives — structurally similar, independently written solutions.
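
As a schematic, the two-stage filter looks like the sketch below. The similarity functions are placeholders for whichever AST and token comparisons are in use, and the thresholds are the department's tuned values quoted above, not universal defaults.

from itertools import combinations

AST_THRESHOLD = 0.75    # stage 1: structural similarity needed to become a candidate pair
TOKEN_THRESHOLD = 0.35  # stage 2: surface-level token overlap needed to stay flagged

def hybrid_flags(submissions, ast_similarity, token_similarity):
    # Return pairs that pass the AST screen and survive the token refinement.
    candidates = [
        (a, b) for a, b in combinations(submissions, 2)
        if ast_similarity(a, b) >= AST_THRESHOLD
    ]
    return [
        (a, b) for a, b in candidates
        if token_similarity(a, b) >= TOKEN_THRESHOLD
    ]

The interaction the instructors struggled with lives entirely in those two constants: loosening the first threshold multiplies the candidate list the second stage must process, while tightening it discards refactored pairs before the token stage ever sees them.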

The Three That Got Away

Both AST and hybrid detection missed 3 of the 18 refactored submissions (the pure AST pipeline missed a fourth that the hybrid caught), while the token-based engine missed 12. What made these 3 special?

They had been refactored at the algorithm level. One submission replaced the standard probability weighting with a rejection-sampling approach that produced the same behavior but a completely different control flow. Another converted the entire generator to use a defaultdict of Counter objects from the collections module, which restructured both storage and retrieval. A third took the original Python solution, translated it to C++, then back to Python — the back-translation introduced syntax patterns that neither engine's normalizer expected.
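
As a hypothetical illustration of that second case, the sketch below shows what a defaultdict/Counter restructuring of the generator might look like. It is not the student's code; it simply demonstrates how the same transition-table behavior can be expressed with a different data structure and different control flow than the running-sum loop shown earlier.

import random
from collections import Counter, defaultdict

def build_table(words):
    # Map each word to a Counter of the words observed to follow it.
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def select_next(current, table):
    # Weighted choice of the next word: same behavior, different control flow.
    candidates = table[current]
    words, counts = zip(*candidates.items())
    return random.choices(words, weights=counts, k=1)[0]

Both the token stream and the syntax tree of this version differ substantially from a hand-rolled dict-of-dicts implementation, which is why structure-level comparison loses its grip.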

These cases represent the practical limit of both AST and token-based detection. When a student understands the algorithm well enough to implement it in a genuinely different way — not just cosmetic refactoring — similarity detection can't connect the dots. The structural fingerprint is erased.

What This Means for Teaching Practice

The department's study offers several practical takeaways for CS departments evaluating their plagiarism detection strategy:

Token-based alone is insufficient for refactored cheating. If your tool only compares token sequences, you're missing a significant fraction of cases where students have done more than rename variables. The 66% false-negative rate on the refactored subset is concerning, and the department suspects it has been underestimating true cheating rates for years.

AST-based detection requires higher tolerance for false positives. A 7.8% false-positive rate on the full dataset means that a meaningful share of flagged pairs will turn out to be independently written, structurally similar solutions that someone must review and clear. That's manageable for a class of 400 if you have a TA manually reviewing flagged pairs — but it's a burden for smaller departments without grading support.

Hybrid approaches are the most practical, but need calibration per assignment. The threshold tuning process the department used was not trivial. It required a validation set from a previous semester and multiple rounds of manual review. Departments adopting hybrid detection should expect to spend 4-6 hours per assignment on threshold calibration during the first semester, decreasing as institutional experience accumulates.

No technique catches algorithm-level rewrites. The 3 escaped submissions are a reminder that similarity detection — whether token, AST, or hybrid — is fundamentally a surface-structure comparison. A student who truly understands the algorithm and reimplements it from scratch in a different control-flow style cannot be detected by these methods. That's not a flaw in the tools; it's a boundary of what similarity means.

Where Codequiry Fits in This Comparison

The department eventually adopted a hybrid approach that aligns with Codequiry's architecture — combining AST structural matching with token-based fingerprint verification. Codequiry's engine uses normalized ASTs for initial similarity detection, then applies token-sequence alignment to distinguish structural similarity from actual copied code. The platform's default thresholds are calibrated against a corpus of over 500,000 student submissions across 120 universities, which reduces the per-assignment tuning burden.

For the refactored subset in this study, Codequiry's default configuration would have caught 14 of the 18 — comparable to the department's hand-tuned hybrid pipeline — with a false-positive rate of 4.8%. That's slightly higher than the department's optimal result (4.1%) but required zero calibration effort.

The Broader Lesson

The choice between AST-based and token-based detection depends on what kind of cheating you're trying to catch. If your primary concern is direct copy-paste from the internet or between classmates — which remains the most common form of code plagiarism — token-based detection is fast, cheap, and effective. But if you're seeing patterns of more sophisticated refactoring — renamed everything, reordered methods, swapped loop styles — AST-based or hybrid detection is necessary to maintain visibility.

Most departments will benefit from a tiered approach: use token-based detection for rapid screening of large classes, then run AST-based comparison on the unflagged remainder to catch refactored submissions. This two-pass strategy increases total detection time by roughly 15x (47 seconds vs 12 minutes) but raises the refactored-case detection rate from 33% to 83%.
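
A minimal sketch of that tiered order, with the token and AST comparison functions left as placeholders and the cutoffs borrowed from the figures quoted earlier purely for illustration, might look like this:

from itertools import combinations

def tiered_screen(submissions, token_overlap, ast_similarity,
                  token_cutoff=0.80, ast_cutoff=0.75):
    # Pass 1: cheap token screen on every pair.
    # Pass 2: the slower AST comparison, evaluated only where pass 1 found nothing.
    flagged = []
    for a, b in combinations(submissions, 2):
        if token_overlap(a, b) >= token_cutoff:
            flagged.append((a, b, "token"))
        elif ast_similarity(a, b) >= ast_cutoff:
            flagged.append((a, b, "ast"))
    return flagged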

The 3 submissions that escaped both techniques are a humbling reminder. No detection tool can catch a student who understands the algorithm well enough to genuinely reimplement it. The only defense against that level of cheating is assessment design — assignments that are individualized, process-oriented, and resistant to generic solutions. But as every CS professor knows, that requires more time and creativity than any detection tool can offer.

The department's study is ongoing. They're expanding the dataset to include multiple semesters and multiple assignments. Early results from the spring 2024 cohort suggest the hybrid approach has increased detected cheating by 28% compared to token-based alone. Whether that reflects more actual cheating being caught, or more false positives being reviewed, is a question they're still working to answer.