What Code Similarity Metrics Actually Measure in Student Work

Every semester, thousands of computer science instructors run similarity detection tools on student submissions. They get back color-coded reports showing percentages like 73% or 41% or 89%. Then comes the hard part: deciding what those numbers actually mean.

The problem is that code similarity metrics are deceptively simple. A 90% match between two Java assignments might indicate a student copied from a classmate. Or it might indicate that both students independently wrote the same boilerplate loop structure, imported the same standard libraries, and followed the same assignment specification. The tool doesn't know. The instructor has to interpret the output.

This article breaks down what three common detection techniques actually measure, where they excel, and where they produce noise. Understanding these mechanics separates instructors who make accurate plagiarism judgments from those who chase false positives or miss genuine violations.

Token-Based Similarity and the MOSS Algorithm

The most widely deployed code similarity system in academia is MOSS (Measure Of Software Similarity), created by Alex Aiken at Stanford in the 1990s. MOSS uses a token-based approach combined with winnowing, a fingerprinting technique that selects a subset of hash values to represent each document.

The core insight is simple: normalize the source code by replacing every identifier with a generic placeholder and reducing keywords and operators to abstract token classes, then compare sequences of those tokens. Two programs that use different variable names but identical control flow will produce matching token sequences. Consider this transformation:

// Original student submission
int calculateSum(int a, int b) {
    int result = a + b;
    return result;
}

// Tokenized representation
TYPE IDENTIFIER LPAREN TYPE IDENTIFIER COMMA TYPE IDENTIFIER RPAREN
LBRACE TYPE IDENTIFIER ASSIGN IDENTIFIER PLUS IDENTIFIER SEMICOLON
RETURN IDENTIFIER SEMICOLON RBRACE

MOSS then hashes overlapping k-grams (subsequences of tokens) and selects a subset of those hashes through the winnowing algorithm. Documents that share many of the same selected hashes receive higher similarity scores. This technique is resilient to renaming variables, changing comment text, and reformatting whitespace. It is not resilient to restructuring control flow, replacing loops with equivalent recursion, or changing algorithm choices entirely.
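
To make the mechanics concrete, here is a minimal sketch of the k-gram hashing step, assuming the token stream has already been normalized as shown above. Real implementations use rolling hash functions for speed; String.hashCode is used here only for brevity.

import java.util.ArrayList;
import java.util.List;

public class KGramHasher {
    // Hash every overlapping window of k consecutive tokens.
    // A real tool would use a rolling hash so each window costs O(1).
    static List<Integer> kGramHashes(List<String> tokens, int k) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            hashes.add(String.join(" ", tokens.subList(i, i + k)).hashCode());
        }
        return hashes;
    }

    public static void main(String[] args) {
        // Token stream for "int result = a + b;" from the example above.
        List<String> tokens = List.of("TYPE", "IDENTIFIER", "ASSIGN",
                "IDENTIFIER", "PLUS", "IDENTIFIER", "SEMICOLON");
        System.out.println(kGramHashes(tokens, 3)); // one hash per 3-token window
    }
}

Because the hashes are computed over token classes rather than raw text, renaming a variable changes nothing: the k-grams, and therefore the hashes, stay identical.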

The practical consequence: MOSS catches the "copy the logic, rename everything" strategy that many students believe is undetectable. But it also flags two students who independently wrote the same mergesort implementation, because the token structure of any correct mergesort will look similar. Interpreting MOSS output requires understanding that the similarity percentage measures token-sequence overlap; it is not, by itself, evidence of copying.

AST Comparison and Structural Matching

Abstract Syntax Tree (AST) comparison operates at a deeper level than tokens. Rather than flattening code into a linear sequence, AST comparison builds a tree representation of the program's syntactic structure and then computes tree edit distance or subtree similarity between two programs.

An AST captures the hierarchical relationships between language constructs. Two programs that use different loop types can still match at the AST level when their trees share equivalent substructure. Consider these two functionally identical code fragments:

// Version A: for loop
for (int i = 0; i < n; i++) {
    total += array[i];
}

// Version B: while loop
int i = 0;
while (i < n) {
    total += array[i];
    i++;
}

These produce different token sequences, so MOSS would see only partial similarity. An AST-based tool, however, can recognize that both represent an iteration pattern with initialization, condition check, body execution, and increment. The tree structures differ, but subtree matching algorithms can still identify the shared iteration subpattern.

AST-based detection is more resistant to superficial code transformations. Changing a for to a while, moving variable declarations, or reordering independent statements produces different token streams but often preserves significant tree structure. Several modern plagiarism detectors, including Codequiry's similarity engine, combine AST analysis with token-based methods to catch both shallow and deep forms of code reuse.

The limitation of AST comparison is computational cost. Computing optimal tree edit distance is O(n³) in the worst case, so practical tools use approximation algorithms or restrict comparisons to subtrees. This means AST-based tools may miss plagiarism that reorders entire function bodies or splits a single function into multiple smaller functions with different tree structures.
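
One common workaround is to hash whole subtrees bottom-up and count how many subtree hashes two programs share, trading exact edit distance for a comparison that is linear in the number of nodes. The sketch below uses a hypothetical Node class with string labels; real tools operate on the parser's actual AST types.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical AST node: a label such as "LOOP" or "ASSIGN" plus children.
class Node {
    final String label;
    final List<Node> children;
    Node(String label, Node... kids) {
        this.label = label;
        this.children = List.of(kids);
    }
}

public class SubtreeSimilarity {
    // Compute a structural hash for every subtree, bottom-up, and count how
    // often each hash occurs. Child order matters in this simplified combine.
    static int collectHashes(Node n, Map<Integer, Integer> counts) {
        int h = n.label.hashCode();
        for (Node child : n.children) {
            h = 31 * h + collectHashes(child, counts);
        }
        counts.merge(h, 1, Integer::sum);
        return h;
    }

    // Fraction of A's subtrees whose exact structure also appears in B.
    static double similarity(Node a, Node b) {
        Map<Integer, Integer> ha = new HashMap<>();
        Map<Integer, Integer> hb = new HashMap<>();
        collectHashes(a, ha);
        collectHashes(b, hb);
        int shared = 0, total = 0;
        for (Map.Entry<Integer, Integer> e : ha.entrySet()) {
            total += e.getValue();
            shared += Math.min(e.getValue(), hb.getOrDefault(e.getKey(), 0));
        }
        return total == 0 ? 0.0 : (double) shared / total;
    }

    public static void main(String[] args) {
        Node a = new Node("FUNC", new Node("LOOP", new Node("ASSIGN")), new Node("RETURN"));
        Node b = new Node("FUNC", new Node("LOOP", new Node("ASSIGN")), new Node("RETURN"));
        System.out.println(similarity(a, b)); // 1.0: identical structure
    }
}

The trade-off is visible in the code: a small edit changes every hash on the path from that node to the root, so approximate tools count matches across many small subtrees rather than requiring whole-tree equality.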

Fingerprinting and Hash-Based Set Comparison

The third major technique uses document fingerprinting, where a hash function converts code fragments into compact digital fingerprints. Unlike token-based or AST-based approaches, fingerprinting can operate on raw source text, normalized tokens, or any other representation. The key question is what the fingerprint represents.

Winnowing, the algorithm used in MOSS, selects one fingerprint from each sliding window of w consecutive hashes: the minimum hash value in the window. This yields a fingerprint set whose size is roughly proportional to document length but whose contents are robust to small insertions and deletions. Two documents that share many fingerprint hashes are almost certainly similar.
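
The selection step itself is short. Here is a minimal sketch, assuming the k-gram hashes from the earlier example; tie-breaking rules and position tracking, which real implementations need, are omitted.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class Winnowing {
    // Select the minimum hash from each sliding window of w consecutive hashes.
    // Real implementations prefer the rightmost minimum on ties and record
    // each fingerprint's position so matches can be highlighted later.
    static Set<Integer> winnow(List<Integer> hashes, int w) {
        Set<Integer> fingerprints = new LinkedHashSet<>();
        for (int i = 0; i + w <= hashes.size(); i++) {
            int min = hashes.get(i);
            for (int j = i + 1; j < i + w; j++) {
                min = Math.min(min, hashes.get(j));
            }
            fingerprints.add(min);
        }
        return fingerprints;
    }
}

The original winnowing paper proves a useful guarantee: any match of at least w + k - 1 consecutive tokens is certain to share at least one selected fingerprint, which is why the parameters k and w directly control the shortest match the tool can promise to detect.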

Other fingerprinting approaches include SimHash, which creates a single 64-bit fingerprint for the entire document and measures similarity by the Hamming distance between fingerprints. SimHash is much faster for document-level comparison but provides no information about where similarities occur in the code. MOSS-style winnowing preserves positional information because each selected fingerprint is recorded along with its location in the document.
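
A minimal SimHash sketch appears below. Feature hashing is simplified here by spreading Java's 32-bit String.hashCode across 64 bits; production implementations use a proper 64-bit hash function.

import java.util.List;

public class SimHash {
    // Build a 64-bit fingerprint by letting each feature vote on each bit.
    static long simhash(List<String> features) {
        int[] vote = new int[64];
        for (String f : features) {
            long h = f.hashCode() * 0x9E3779B97F4A7C15L; // spread 32 bits across 64
            for (int bit = 0; bit < 64; bit++) {
                vote[bit] += ((h >>> bit) & 1) == 1 ? 1 : -1;
            }
        }
        long fingerprint = 0;
        for (int bit = 0; bit < 64; bit++) {
            if (vote[bit] > 0) fingerprint |= 1L << bit;
        }
        return fingerprint;
    }

    // Similar documents yield fingerprints with small Hamming distance.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}

Because every feature influences every bit, changing a few tokens flips only a few bits, which is what makes Hamming distance a usable proxy for document-level similarity.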

The practical implication for instructors: fingerprint-based tools produce precise matching locations but coarse similarity estimates. A 70% match in MOSS means that 70% of the selected fingerprints from document A appear in document B's fingerprint set. It does not mean 70% of the source code is identical. The actual proportion of shared code is often higher than the reported score, because winnowing discards most hashes for efficiency, and matches shorter than the guarantee threshold can be missed entirely.
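
In code, the reported percentage is plain set containment, which also explains why MOSS shows a different percentage for each file in a matched pair:

import java.util.Set;

public class FingerprintOverlap {
    // The similarity of A against B: the fraction of A's selected
    // fingerprints that also appear in B's fingerprint set.
    static double containment(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty()) return 0.0;
        long shared = a.stream().filter(b::contains).count();
        return (double) shared / a.size();
    }
}

The measure is asymmetric: a short file copied wholesale into a much longer one scores near 100% against the long file, while the long file scores far lower against the short one.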

What Similarity Percentages Actually Mean

Instructors need a decision rule: above what threshold should they investigate? Research from several universities suggests that the answer depends heavily on the assignment type, programming language, and class level.

In a 2021 study at the University of California, San Diego, researchers analyzed four semesters of introductory programming assignments and found that baseline similarity between independent submissions averaged 35-45% for first-year Java assignments. Assignments that required specific API calls or fixed function signatures pushed baseline similarity higher. Assignments that allowed multiple algorithmic approaches pushed it lower.

For a standard CS1 assignment like "write a program that reads integers from standard input and prints the average," two independent solutions often share 50-60% token similarity simply because the required structure is constrained. An instructor who flags anything above 50% would generate an unmanageable number of false positives.

More experienced instructors and automated grading platforms like Codequiry use multiple signals in combination: similarity percentage, the length of matching segments, the presence of identical error handling or edge cases, and the timing of submissions. A long matching segment that appears in submissions from students who used the same Wi-Fi network within minutes of each other is stronger evidence than the same matching segment appearing in submissions from different time zones.
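
As a purely illustrative sketch, the weights below are invented for exposition, not taken from any real platform; the point is only that the signals combine, so no single number decides the case.

public class SuspicionScore {
    // Hypothetical multi-signal heuristic. Every weight here is arbitrary;
    // real platforms tune such parameters against labeled past cases.
    static double score(double tokenSimilarity,    // 0.0 to 1.0 overlap
                        int longestMatchTokens,    // longest shared segment
                        boolean sharedWrongOutput, // identical edge-case failures
                        long submissionGapMinutes) {
        double s = tokenSimilarity;
        if (longestMatchTokens > 100) s += 0.2;    // one long block beats many short ones
        if (sharedWrongOutput) s += 0.3;           // shared bugs are strong evidence
        if (submissionGapMinutes < 30) s += 0.1;   // near-simultaneous submissions
        return Math.min(s, 1.0);
    }
}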

Cross-Language Similarity Detection

A growing challenge in academic integrity is cross-language plagiarism, where a student copies a solution written in one language and translates it to another. A Python solution becomes Java, or JavaScript becomes C++. Token-based tools fail here because the token vocabularies of different languages barely overlap. AST-based tools can sometimes succeed if both languages share similar syntactic structures.

Some modern detection systems address cross-language plagiarism by converting source code into an intermediate representation that abstracts away language-specific syntax. Control flow graphs, data flow graphs, or normalized ASTs can reveal structural similarity even when the source languages differ. A Java for loop and a Python for loop produce different ASTs, but their control flow graphs can be nearly identical.
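
As a toy illustration of that normalization idea, the sketch below maps hypothetical parser labels from two languages onto shared intermediate labels; real systems build full control flow or data flow graphs rather than flat label lists.

import java.util.List;
import java.util.Map;

public class IrNormalizer {
    // Hypothetical mapping from language-specific construct names to a
    // shared intermediate vocabulary. The label names are illustrative.
    static final Map<String, String> CANONICAL = Map.of(
            "for", "LOOP", "while", "LOOP",            // C-style / Java labels
            "for_stmt", "LOOP", "while_stmt", "LOOP",  // Python-style labels
            "if", "BRANCH", "if_stmt", "BRANCH");

    static List<String> normalize(List<String> constructs) {
        return constructs.stream()
                .map(c -> CANONICAL.getOrDefault(c, c.toUpperCase()))
                .toList();
    }

    public static void main(String[] args) {
        // A Java for loop and a Python for statement normalize identically.
        System.out.println(normalize(List.of("for", "if")));           // [LOOP, BRANCH]
        System.out.println(normalize(List.of("for_stmt", "if_stmt"))); // [LOOP, BRANCH]
    }
}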

The takeaway for instructors teaching multi-language courses: single-language detection tools will miss cross-language copying entirely. If your course allows students to choose between Python and Java, or if you teach a sequence of courses that transition between languages, you need either a multi-language detection tool or manual inspection of suspicious submissions.

Refactoring-Resistant Detection Techniques

Sophisticated students who understand how plagiarism detection works will try to defeat it. The most common evasion strategies include:

  • Splitting functions: Taking a single large function and dividing it into several smaller functions
  • Merging functions: Combining multiple small functions into one monolithic function
  • Reordering methods: Changing the order of method definitions in a class
  • Adding dead code: Inserting unnecessary variables, statements, or debug prints
  • Changing control structures: Converting if-else chains to switch statements or vice versa

Token-based tools are partially resistant to function splitting and method reordering, because token sequences within individual functions remain similar. But AST-based tools can be confused by function splitting, because the tree structure changes significantly.

Advanced detection systems use a combination of techniques. They may extract function-level fingerprints, compare functions across files regardless of ordering, and ignore dead code patterns. Some systems also analyze the distribution of programming errors: two submissions with the same incorrect implementation of a sorting algorithm are more suspicious than two submissions with the same correct implementation, because correct solutions converge while incorrect solutions diverge.
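
A minimal sketch of the function-level matching idea, assuming each function has already been reduced to a fingerprint set (for example, via winnowing as above):

import java.util.List;
import java.util.Set;

public class FunctionLevelMatcher {
    // For each function in submission A, find its best-matching function in
    // submission B, wherever it appears in the file. Reordering functions
    // does not change which pairs match.
    static double bestMatchAverage(List<Set<Integer>> funcsA, List<Set<Integer>> funcsB) {
        if (funcsA.isEmpty()) return 0.0;
        double total = 0;
        for (Set<Integer> fa : funcsA) {
            double best = 0;
            for (Set<Integer> fb : funcsB) {
                best = Math.max(best, overlap(fa, fb));
            }
            total += best;
        }
        return total / funcsA.size();
    }

    // Fraction of fa's fingerprints present in fb.
    static double overlap(Set<Integer> fa, Set<Integer> fb) {
        if (fa.isEmpty()) return 0.0;
        return (double) fa.stream().filter(fb::contains).count() / fa.size();
    }
}

Splitting a function still dilutes each piece's individual overlap, so this is a mitigation for reordering rather than a complete answer to every evasion strategy.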

Practical Recommendations for Interpreting Results

Based on years of experience across multiple institutions, here are actionable guidelines for instructors who use code similarity tools:

Set appropriate thresholds by assignment type. For highly constrained assignments (specific algorithm, fixed input/output formats), investigate matches above 70%. For open-ended assignments (design your own data structure, implement any sorting algorithm), investigate matches above 40%.

Examine the matching segments, not just the percentage. A 60% match consisting of many short matching segments is less suspicious than a 60% match consisting of one long segment. Tools that show highlighted matching lines are essential for this step.

Check for identical errors. Two submissions that produce the same wrong output on edge cases, or that contain the same typo in a variable name, provide strong evidence of copying. Independent implementations almost never share the same bugs.

Consider the submission context. Late submissions, submissions from students who withdrew from the course previously, and submissions with identical metadata timestamps all warrant more scrutiny for the same similarity level.

Use multiple detection techniques. If your institution uses only one tool, run suspicious submissions through a second tool with a different detection approach. The combination of token-based and AST-based detection catches more plagiarism while producing fewer false positives than either method alone.

The goal of code similarity detection is not to maximize the number of flagged submissions. It is to identify a manageable set of genuine violations while minimizing false accusations. Understanding what the metrics actually measure is the only reliable path to that balance.