What AST Comparison Does That Token Matching Cannot
Token-based plagiarism detectors like MOSS (Measure Of Software Similarity) and JPlag split source code into a stream of lexical tokens and then compare subsequences using winnowing or greedy string tiling. These techniques work well when students copy code and only change variable names or add comments. But they break down when a student reorders entire functions, introduces new control flow wrappers, or converts a while loop into a for loop. The tokens change radically, even though the logic remains identical.
Abstract syntax tree (AST) comparison solves this by parsing the code into a tree that captures the structural meaning of the program: control flow, expression hierarchy, and scoping rules. After normalization — stripping identifiers, literals, and whitespace — the tree retains the shape of the algorithm. Two programs that do the same thing in the same way produce nearly identical ASTs, no matter how the student renamed variables or reformatted the code.
// Original faculty code
public static int factorial(int n) {
int result = 1;
for (int i = 2; i <= n; i++) {
result *= i;
}
return result;
}
// Student submission with renamed variables and different spacing
public static int fact(int num) {
int prod = 1;
for (int j = 2; j <= num; j++) {
prod = prod * j;
}
return prod;
}
Token-based comparison sees IDENTIFIER(factorial) vs IDENTIFIER(fact), IDENTIFIER(result) vs IDENTIFIER(prod), and different operator order (*= vs *). After normalization, AST comparison sees two identical MethodDeclaration nodes with a ForStatement child containing a CompoundAssignment or equivalent expression. The structure matches, so the plagiarism is flagged.
How AST Normalization Works
The first step in AST-based detection is to parse each submission into a tree using a language-specific parser (JavaParser for Java, Python’s ast module, etc.). The tree is then normalized: all user‑defined identifiers are replaced with a placeholder like ID, all literals with LIT, and every operator is replaced with a generic OP token. Some detectors also collapse chain calls and flatten nested expressions to a canonical form.
After normalization, the tree becomes a representation of structural signatures: the depth and shape of loops, the number of branches in conditionals, the order of method calls. A web of these signatures is then hashed and compared across submissions. Tools like Codequiry combine AST normalization with token‑based winnowing and a web source database to catch both structural clones and exact copies from online tutorials.
AST normalization is not about understanding what the code does — it is about recognizing how the code is built. Two programs that compute the same value but use different algorithms (e.g., iterative vs. recursive) will have different ASTs, which is the correct behavior for a plagiarism detector.
Structural Invariants That Survive Most Manual Changes
AST comparison is particularly good at resisting three common obfuscation techniques that defeat simple token‑based tools:
- Variable and method renaming. Normalization removes all names, so renaming has no effect.
- Whitespace and comment changes. The parser strips these before building the tree.
- Statement reordering within a block. If the student swaps the order of two independent assignments, the AST changes — but only if the dependencies differ. AST comparison can still detect the remaining invariant structure.
However, AST comparison is not invincible. Inserting dead code (e.g., a never‑executed if (false) block) adds nodes and changes the tree structure. Converting a for loop to a while loop with manual counter update can preserve the structure in many parsers, but wrapping the entire body in a try‑catch block can break the match. These limitations are why mature plagiarism detectors stack multiple algorithms.
When AST Comparison Fails
Consider a student who takes the original factorial method and refactors it into a recursive approach:
public static int factorial(int n) {
if (n <= 1) return 1;
return n * factorial(n - 1);
}
This version has a completely different AST: a ConditionalReturn instead of a ForStatement, and a recursive call instead of an iterative loop. AST comparison correctly sees this as original work, because the structural approach changed. That is a good outcome — the student demonstrated independent thinking.
Where AST comparison struggles is against semantic preserving but structurally different clones. For example, replacing for with a while loop that initializes and increments the counter manually often produces the same tree shape (since both are loop constructs in many ASTs), but inserting a no‑op loop body or splitting a single statement into multiple lines with new temporary variables can add enough nodes to break the match. These advanced obfuscation tactics require either token‑based fingerprinting over small windows or data‑flow analysis to catch.
Comparison of AST, Token, and Fingerprinting Approaches
| Technique | Resists Renaming | Resists Statement Reorder | Resists Dead Code Injection | Resists Loop Conversion |
|---|---|---|---|---|
| Winnowing (MOSS) | Yes (k‑grams ignore names) | No (order matters) | Partial (low threshold, but long k‑grams break) | No |
| Greedy String Tiling (JPlag) | Yes | No | Partial | No |
| AST Comparison (normalized) | Yes | Partial (reorders same structure) | Weak (new nodes change tree) | Partial (many loops share AST shape) |
| Hybrid (AST + token + web) | Yes | Yes (multiple signals) | Strong (covers different weaknesses) | Strong |
No single technique covers all evasion strategies. The most reliable approach is a layered detector that runs AST normalization first, then token‑based winnowing on the original and de‑sugared code, and finally cross‑references with online source databases. Codequiry implements exactly this stacked pipeline, which is why it consistently reports higher detection rates than tools relying on a single algorithm.
Practical Advice for Educators and Engineering Managers
If you are responsible for code integrity in an academic course or a development team, start by understanding what your detection tool does under the hood. Token‑based systems are fast and catch the majority of lazy copy‑paste, but they miss the structurally rewritten cheats. AST comparison catches the restructured submissions but may also produce false positives when two students independently arrive at the same algorithmic pattern (e.g., a common data structure traversal).
For critical decisions — honor code violations or code theft claims — never rely on a single metric. Use a tool that provides the AST similarity score alongside token similarity and web source matches. Review the highlighted code sections yourself. In practice, a hybrid detector like Codequiry reduces false alarms because it shows you all three indicators in one report, and you can drill into the matched regions. For routine grading, you can set automatic thresholds: flag submissions above 70% structural similarity for manual review, and automatically accept anything below 30%.
A good rule of thumb: if two submissions share the same AST shape for more than 60% of their file, and the token similarity is also above 50%, the probability of independent work is near zero. Investigate every time.
Integrating AST Detection into Your Workflow
Running AST comparison on every submission or code commit is straightforward with modern APIs. Most plagiarism detection services, including Codequiry’s platform, accept batch uploads and return similarity scores per file pair. You can automate the process as part of your CI/CD pipeline or assignment submission system. For example, a university course running on Gradescope or Canvas can submit student ZIP files to Codequiry’s API and receive structured JSON results with AST, token, and web source signals.
When interpreting results, remember that AST comparison is most useful for identifying clones that share the same logical structure. It is less useful for detecting copied boilerplate code (e.g., import statements, getter/setter patterns) where token matching is sufficient. Combine the two: use AST for the core algorithmic sections, token matching for the peripheral code, and web source checks for code lifted directly from Stack Overflow or GitHub gists.
Frequently Asked Questions
Can AST comparison detect code rewritten in a different language?
No, because each language has a different grammar and tree structure. Cross‑language plagiarism requires semantic analysis or intermediate representations (three‑address code) that are beyond standard AST comparison. Some advanced tools use abstract semantic graphs (ASG) to bridge languages, but they are not widely deployed in education settings yet.
How many nodes do I need to match before flagging a submission?
There is no universal threshold. For short programs (50–100 lines), 70% structural similarity is highly suspicious. For long files with repeated patterns, the baseline similarity between unrelated submissions can be 20–30%. Use the tool’s built‑in statistical outlier detection to avoid manual guesswork.
Does AST comparison work for all languages?
Most mature detectors support Java, Python, C/C++, JavaScript, and C#. The more exotic the language, the less likely a fully normalized AST is available. Codequiry currently supports the major languages used in academic CS curricula.
What is the performance overhead of AST comparison?
Parsing and normalizing trees is CPU‑intensive, but for a class of 500 Java submissions (each ~500 lines), the entire pairwise comparison completes in under two minutes on a moderate server. The bottleneck is usually file I/O, not the tree matching itself.