Inside UT Austin's Refactoring-Resistant Code Similarity Pipeline

The Problem That Standard String Matching Couldn't Catch

In Fall 2023, a teaching assistant in UT Austin's CS 312: Introduction to Programming course noticed something odd. Two final projects implementing a Huffman compression algorithm produced nearly identical outputs and shared the same overall structure, yet line‑by‑line string comparison showed almost no matches. Variable names were different. Comments had been rewritten. One submission used a HashMap; the other used a TreeMap. Function ordering had been shuffled.

The TA ran both through a traditional plagiarism checker that relied on longest common subsequence. It returned a low similarity score of 12%. But any seasoned instructor could see the assignments were essentially the same solution, just aggressively refactored.

This is not an isolated incident. Refactoring (renaming identifiers, splitting or merging functions, swapping data structures) is among the most common tactics students use to evade detection. And it works astonishingly well against tools that compare only raw text or normalized token streams, with no structural normalization.

UT Austin’s solution, developed over the following semester, offers a template for any course where assignments are complex enough that students can’t just copy‑paste but can manually rewrite. The answer lies in abstract syntax tree (AST) and token‑based similarity analysis that strips away cosmetic changes.

Why Refactoring Resists Naive Similarity Checks

To understand the technique, consider a simple example. A student writes this Python function to compute factorial:

def factorial(n):
    if n <= 1:
        return 1
    else:
        return n * factorial(n-1)

Another student, having seen the same logic, might submit this:

def compute_factorial(x):
    if x == 0 or x == 1:
        result = 1
    else:
        result = x * compute_factorial(x-1)
    return result

A naïve diff tool would highlight the different names, the rewritten base‑case condition, and the introduction of a temporary variable. It might report only 30–40% similarity. But structurally the two functions are nearly identical: both test a base case at the top, both make a single recursive call on the argument minus one, and both multiply the argument by the result of that call. The control flow differs only in that the second version funnels both branches into a single return statement.

Token‑based detection normalizes identifiers (replacing each distinct identifier with a placeholder token) and compares the sequence of token types (keywords, operators, literals) rather than the raw text. After normalization, the two factorial functions produce token sequences that match over most of their length, differing only around the rewritten condition and the temporary variable. However, token normalization alone fails when the control flow itself is altered, say by converting recursion to iteration or swapping for loops for while loops.
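The normalization step described above can be sketched with Python's standard tokenize and difflib modules. This is a minimal illustration, not the tool UT Austin ran: the function names are hypothetical, every identifier is mapped to the same ID placeholder as a simplification, and difflib's matching ratio stands in for whatever comparison a production checker uses.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry no structural information for this sketch.
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(source: str) -> list[str]:
    """Return the token stream with identifiers and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME:
            # Keywords keep their text; every other name becomes "ID".
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        else:
            out.append(tok.string)  # operators and punctuation
    return out


def token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two sources after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalized_tokens(a), normalized_tokens(b), autojunk=False)
    return matcher.ratio()


ORIGINAL = """def factorial(n):
    if n <= 1:
        return 1
    else:
        return n * factorial(n-1)
"""

REFACTORED = """def compute_factorial(x):
    if x == 0 or x == 1:
        result = 1
    else:
        result = x * compute_factorial(x-1)
    return result
"""

print(token_similarity(ORIGINAL, REFACTORED))
```

A copy that only renames identifiers or rewrites comments scores 1.0 under this scheme, because names and comments never reach the comparison.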

That’s where AST structural similarity becomes critical. An AST (abstract syntax tree) representation captures the hierarchical structure of the code without concrete syntax details. Two ASTs can be compared using tree‑edit distance or more scalable hashing techniques. Codequiry combines both token‑based fingerprints and AST locality‑sensitive hashing to cover the gaps that pure string methods leave open.
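One simple way to compare tree structure is to count the shapes of all subtrees in each AST and measure how many the two programs share. The sketch below uses Python's ast module and a multiset overlap score. It is a cheap stand-in for tree-edit distance or locality-sensitive hashing, not a reproduction of either, and the function names are hypothetical.

```python
import ast
from collections import Counter


def subtree_shapes(source: str) -> Counter:
    """Count the shape of every subtree, ignoring names and literal values."""
    shapes = Counter()

    def shape(node: ast.AST) -> str:
        # A shape is the node type plus the shapes of its child nodes.
        # Identifier strings and constant values are not child nodes,
        # so they drop out automatically.
        children = ",".join(shape(c) for c in ast.iter_child_nodes(node))
        text = f"{type(node).__name__}({children})"
        shapes[text] += 1
        return text

    shape(ast.parse(source))
    return shapes


def ast_similarity(a: str, b: str) -> float:
    """Multiset Jaccard overlap of subtree shapes, in [0, 1]."""
    shapes_a, shapes_b = subtree_shapes(a), subtree_shapes(b)
    shared = sum((shapes_a & shapes_b).values())  # per-shape minimum
    total = sum((shapes_a | shapes_b).values())   # per-shape maximum
    return shared / total if total else 1.0


recursive = "def f(n):\n    return 1 if n <= 1 else n * f(n - 1)\n"
renamed = "def g(x):\n    return 1 if x <= 1 else x * g(x - 1)\n"
iterative = (
    "def f(n):\n"
    "    r = 1\n"
    "    for i in range(2, n + 1):\n"
    "        r *= i\n"
    "    return r\n"
)

print(ast_similarity(recursive, renamed))
print(ast_similarity(recursive, iterative))
```

Renaming alone leaves the score at 1.0, while a genuine change of algorithm (recursion to iteration here) lowers it, which is the distinction a text diff cannot make.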

How UT Austin Built Their Pipeline

The CS 312 course staff, led by Professor Amanda D. and head TA James R., designed a two‑stage detection pipeline for the Spring 2024 semester. Stage one was a standard token‑based comparison run weekly on all submissions. Stage two, triggered only when stage one flagged a suspicious cluster, used AST‑based analysis to assess whether structural similarity exceeded a threshold of 85%.

They used a combination of open‑source tools and a commercial code similarity checker to handle scale. With 450 students, the weekly token run took about 12 minutes on a single server. The AST comparison was more expensive—roughly 30 seconds per pair of similar submissions—but only ran on the top 5% of candidates.
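The two-stage shape of such a pipeline can be sketched as follows. The 5% cut-off and the 85% threshold come from the description above; everything else is an assumption made for illustration, including the function names, the reading of "top 5%" as the highest-ranked pairs by stage-one score, and the two deliberately crude scoring functions.

```python
import ast
import difflib
from itertools import combinations
from typing import Callable

Similarity = Callable[[str, str], float]


def cheap_score(a: str, b: str) -> float:
    """Stand-in for the token stage: ratio over whitespace-separated words."""
    return difflib.SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def structural_score(a: str, b: str) -> float:
    """Stand-in for the AST stage: ratio over the sequence of AST node types."""
    def node_types(src: str) -> list[str]:
        return [type(n).__name__ for n in ast.walk(ast.parse(src))]
    return difflib.SequenceMatcher(None, node_types(a), node_types(b), autojunk=False).ratio()


def two_stage(submissions: dict[str, str], stage_one: Similarity, stage_two: Similarity,
              top_fraction: float = 0.05, threshold: float = 0.85):
    """Rank all pairs cheaply, then run the expensive check on the top fraction only."""
    pairs = list(combinations(sorted(submissions), 2))
    ranked = sorted(
        pairs,
        key=lambda p: stage_one(submissions[p[0]], submissions[p[1]]),
        reverse=True,
    )
    n_candidates = max(1, round(len(ranked) * top_fraction))
    confirmed = []
    for left, right in ranked[:n_candidates]:
        score = stage_two(submissions[left], submissions[right])
        if score >= threshold:
            confirmed.append((left, right, score))
    return confirmed


demo = {
    "alice": "def f(n):\n    return n + 1\n",
    "bob": "def g(x):\n    return x + 1\n",
    "carol": "for i in range(3):\n    print(i)\n",
}
print(two_stage(demo, cheap_score, structural_score))
```

Passing the scoring functions in as parameters keeps the expensive stage swappable, so thresholds can be tuned without touching the ranking logic.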

The results were striking:

  • Traditional string‑based similarity flagged 14 pairs as potentially plagiarized over the semester.
  • The token‑only stage flagged 41 pairs.
  • After AST‑based verification, 33 pairs were confirmed as likely cases of unattributed collaboration or code copying with refactoring.

Of those 33, only 7 had any overlap with the original 14 pairs from the string‑only check. The remaining 26 would have gone entirely undetected under the old system.

"We thought we were catching the problem," said Professor D. at a departmental seminar in May. "We were catching the lazy cheaters. The ones who put in effort to disguise their copying were sailing right through."

The Trade‑Offs: False Positives and Performance

No detection method is perfect. Token‑based and AST‑based approaches are susceptible to two specific forms of over‑detection: inflated scores from boilerplate code, and false positives when students independently implement the same algorithm.

Boilerplate code—file headers, import statements, framework scaffolding—inflates similarity scores artificially. UT Austin’s pipeline addressed this with a pre‑processing step that stripped all standard library imports and common language constructs (e.g., public static void main in Java) before comparison.
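A Python analogue of that pre-processing step might look like the sketch below, which drops top-level imports and the `if __name__ == "__main__":` guard (playing the role that public static void main plays in Java). The function names are hypothetical, and the sketch assumes Python 3.9 or later for ast.unparse.

```python
import ast


def _is_main_guard(node: ast.stmt) -> bool:
    """True for a top-level `if __name__ == ...:` block."""
    return (
        isinstance(node, ast.If)
        and isinstance(node.test, ast.Compare)
        and isinstance(node.test.left, ast.Name)
        and node.test.left.id == "__name__"
    )


def strip_boilerplate(source: str) -> str:
    """Return the source with top-level imports and the main guard removed."""
    tree = ast.parse(source)
    tree.body = [
        node for node in tree.body
        if not isinstance(node, (ast.Import, ast.ImportFrom))
        and not _is_main_guard(node)
    ]
    # ast.unparse regenerates code from the tree; comments and original
    # formatting are lost, which is acceptable before a structural comparison.
    return ast.unparse(tree)


sample = (
    "import os\n"
    "from sys import argv\n"
    "\n"
    "def f(n):\n"
    "    return n + 1\n"
    "\n"
    'if __name__ == "__main__":\n'
    "    print(f(1))\n"
)
print(strip_boilerplate(sample))
```

Working on the parsed tree rather than on raw lines avoids accidentally deleting a line that merely contains the word import inside a string or comment.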

The trickier problem is false positives from legitimate independent implementation. Two students might independently write a mergesort with nearly identical structure. The pipeline mitigated this by requiring that flagged pairs also share a non‑trivial number of identical error‑handling patterns or unusual variable names. When both students used the obscure variable name runner_ptr in a linked‑list problem, that was a stronger signal than both using node.
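The unusual-name signal can be approximated by counting how many submissions in the cohort use each identifier, then keeping only the names a flagged pair shares that almost nobody else uses. The sketch below is an assumption about how such a check could work; the function names and the max_submissions cut-off are hypothetical.

```python
import ast
from collections import Counter


def identifiers(source: str) -> set[str]:
    """Collect variable, parameter, function, and class names from a source file."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
    return names


def rare_shared_identifiers(submissions: dict[str, str], left: str, right: str,
                            max_submissions: int = 2) -> set[str]:
    """Names used by both students and by at most `max_submissions` submissions overall."""
    per_student = {student: identifiers(src) for student, src in submissions.items()}
    usage = Counter(name for names in per_student.values() for name in names)
    shared = per_student[left] & per_student[right]
    return {name for name in shared if usage[name] <= max_submissions}


cohort = {
    "a": "def walk(node):\n    runner_ptr = node\n    return runner_ptr\n",
    "b": "def scan(node):\n    runner_ptr = node\n    return runner_ptr\n",
    "c": "def visit(node):\n    cur = node\n    return cur\n",
    "d": "def go(node):\n    cur = node\n    return cur\n",
}
print(rare_shared_identifiers(cohort, "a", "b"))
```

In this toy cohort every submission uses node, so it carries no signal, while runner_ptr appears only in the flagged pair.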

The false positive rate in Spring 2024 was roughly 1 in 5—for every five pairs flagged by the automated system, one was later determined by manual review to be coincidental. That’s acceptable, the course staff said, because the manual review is already part of the process. The goal was to reduce the number of cases that needed human judgment, not eliminate manual review entirely.

Lessons for Other Programs

UT Austin’s experience offers concrete takeaways for any programming course suffering from undetected refactored plagiarism:

  1. Don’t rely on string comparison alone. If your tool only compares raw source text or normalized token streams, add an AST‑based check. At UT Austin, 26 of the 33 confirmed pairs were missed entirely by the string‑only check.
  2. Treat boilerplate as noise. Build an exclusion list or use language‑specific parsers to ignore common scaffolding. Otherwise your similarity scores will be inflated by harmless structural overlap.
  3. Combine similarity with anomaly signals. Refactored plagiarism often leaves other traces—identical error outputs, unusual variable names shared across submissions, or submission timestamps within a narrow window. Combining similarity scores with these secondary signals increases precision.
  4. Expect to invest in tooling. The pipeline required about two weeks of a TA’s time to set up, plus another week of tuning thresholds. But once calibrated, it ran largely unattended for the semester.
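The third takeaway, combining a similarity score with secondary signals, could be expressed as a simple rule like the one sketched here. The 0.85 threshold echoes the figure reported earlier; the field names, the 30-minute window, and the requirement of at least one secondary signal are hypothetical values chosen for illustration, not settings from the UT Austin pipeline.

```python
from dataclasses import dataclass


@dataclass
class PairEvidence:
    """Evidence gathered for one pair of submissions (all fields hypothetical)."""
    similarity: float                   # structural similarity in [0, 1]
    rare_shared_names: int              # count of unusual identifiers in common
    identical_error_output: bool        # both programs fail with the same output
    minutes_between_submissions: float  # gap between the two submission times


def needs_manual_review(evidence: PairEvidence,
                        similarity_threshold: float = 0.85,
                        time_window_minutes: float = 30.0,
                        min_signals: int = 1) -> bool:
    """Flag a pair only if it is similar AND shows enough secondary signals."""
    if evidence.similarity < similarity_threshold:
        return False
    signals = sum([
        evidence.rare_shared_names > 0,
        evidence.identical_error_output,
        evidence.minutes_between_submissions <= time_window_minutes,
    ])
    return signals >= min_signals


suspicious = PairEvidence(0.92, rare_shared_names=2,
                          identical_error_output=False,
                          minutes_between_submissions=400.0)
coincidental = PairEvidence(0.92, rare_shared_names=0,
                            identical_error_output=False,
                            minutes_between_submissions=400.0)
print(needs_manual_review(suspicious), needs_manual_review(coincidental))
```

Requiring a secondary signal on top of the similarity score trades some recall for precision, which suits a process where every flag costs a human review.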

For institutions that lack the resources to build custom pipelines, commercial solutions like Codequiry provide these capabilities out of the box. Their code plagiarism detection algorithm includes both token‑based and AST‑based analysis, along with web‑source checking to catch code copied from online repositories.

Frequently Asked Questions

Can refactoring‑resistant detection catch code rewritten in a different language?
Yes, but it requires cross‑language comparison, which is more complex. Tools that normalize to a language‑independent intermediate representation (e.g., control flow graphs) can detect plagiarism across Java, C++, and Python. Most academic settings deal with same‑language cases, though.

How much does AST comparison slow down the process for large courses?
Exact tree‑edit distance on ASTs is computationally expensive: the best known algorithms for ordered trees take O(n^3) time in the worst case, where n is the number of nodes. However, practical tools use hash‑based approximations (such as Winnowing applied to structure fingerprints) that reduce the overhead to near real‑time. For a course of 500 students, adding AST analysis to the top 10% of token‑flagged pairs typically adds 5–10 minutes per assignment.
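A rough sketch of Winnowing applied to structure fingerprints: hash every run of k consecutive AST node types, keep the minimum hash in each sliding window, and compare the resulting fingerprint sets. The function names, the values of k and window, and the use of a breadth-first node listing as the structure sequence are all assumptions made for illustration.

```python
import ast
import zlib


def structure_tokens(source: str) -> list[str]:
    """Linearize the AST as a sequence of node type names (breadth-first)."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def winnow(tokens: list[str], k: int = 5, window: int = 4) -> set[int]:
    """Select fingerprints: the minimum k-gram hash in every window of hashes."""
    hashes = [
        zlib.crc32(" ".join(tokens[i:i + k]).encode())
        for i in range(len(tokens) - k + 1)
    ]
    if not hashes:
        return set()  # input shorter than one k-gram
    span = min(window, len(hashes))
    return {min(hashes[i:i + span]) for i in range(len(hashes) - span + 1)}


def fingerprint_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two fingerprint sets, in [0, 1]."""
    fa, fb = winnow(structure_tokens(a)), winnow(structure_tokens(b))
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0


first = "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)\n"
second = "def fact(k):\n    if k <= 1:\n        return 1\n    return k * fact(k - 1)\n"
print(fingerprint_similarity(first, second))
```

Because each submission is reduced to a small set of integers once, comparing a pair becomes a set intersection rather than a tree alignment, which is where the speed-up over exact tree-edit distance comes from.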

Can students appeal a flag they believe is a false positive?
Yes, and that’s a feature, not a bug. UT Austin’s process allowed students to submit a one‑page explanation showing that they arrived at their design independently. Appeals were granted in three out of 33 cases in Spring 2024. Those students were asked to submit a trace of their development process (e.g., git logs) to strengthen future checks.