K-gram Fingerprinting for Source Code Similarity…

How One Algorithm Catches Renamed and Reordered Code

Two students submit identical logic. One has renamed every variable from x to data. The other reordered function definitions and inserted blank lines. A naive line diff sees nothing. A k-gram fingerprinting system catches them both in under a second.

This algorithm — originally published by Schleimer, Wilkerson, and Aiken in 2003 — powers nearly every serious code plagiarism detector, including Stanford’s MOSS and Codequiry. It works because it compares structural fingerprints, not raw text. Variable names vanish. Formatting disappears. What remains is the skeleton of the program’s logic.

Here is exactly how it works, step by step, with working Python code you can run against real student submissions.

Step 1: Tokenize Source Code Into Meaningful Units

Raw text is useless for structural comparison. The first step is lexical tokenization — converting the source file into a stream of syntactic tokens. Every language has its own token rules, but the principle is universal: discard whitespace, comments, and string literals; keep keywords, operators, identifiers, and punctuation.

For example, this C++ snippet:

int max(int a, int b) {
    if (a > b) return a;
    else return b;
}

Tokenizes to something like:

int max ( int a , int b ) { if ( a > b ) return a ; else return b ; }

Each token is a symbol. In practice, tokens are assigned integer IDs to speed up later hashing. A simple Python tokenizer for a subset of Python or C can be written in ~50 lines using tokenize from the standard library, or for custom languages one can use ANTLR or Tree-sitter.

Key insight: by mapping identifiers (variable names, function names) to generic placeholder tokens — e.g., VAR — the system becomes resistant to trivial renaming. The token stream for int max(int a, int b) becomes int VAR ( int VAR , int VAR ) { ... }. Now renaming a to data produces the identical token sequence.

Step 2: Generate K-grams from the Token Stream

With a token sequence of length n, we slide a window of size k across it, producing a list of n - k + 1 overlapping subsequences called k-grams.

For token sequence T = [A, B, C, D, E] and k=3:

k-grams:
[A, B, C]
[B, C, D]
[C, D, E]

In code plagiarism detection, typical values for k range from 5 to 15 tokens. Smaller k catches more partial copying but increases false positives. Larger k is more specific but may miss heavily restructured code. MOSS uses k = 5 for most languages; Codequiry adjusts k per language based on typical statement length.

“Choosing k is a tradeoff. Too small, and common idioms like for (int i=0; i<n; i++) produce shared k-grams across unrelated submissions. Too large, and a student can evade detection by inserting one dummy statement every four lines.” — Schleimer et al., 2003

Here is a Python function to generate k-grams from a tokenized list:

def kgrams(tokens, k):
    return [tokens[i:i+k] for i in range(len(tokens) - k + 1)]

Step 3: Hash Each K-gram to a Numeric Fingerprint

K-grams are variable-length lists of tokens. Comparing them directly is memory-heavy and slow. Instead, we hash each k-gram to a fixed-size integer (e.g., a 32-bit hash). Many systems use rolling hash functions (like Rabin-Karp) for speed, but a simple polynomial hash works for moderate-sized corpora.

Why hash? Hashing turns a sequence of tokens into a compact fingerprint. Two identical k-grams produce the same hash (assuming no collision). Two different k-grams — even differing by one token — produce different hashes.

A straightforward Python implementation:

def hash_kgram(kgram, base=101, mod=2**32):
    h = 0
    for token in kgram:
        h = (h * base + hash(token)) % mod
    return h

For a typical programming assignment (500–2000 tokens) and k=10, we get several hundred hashes. That’s easily stored in memory for thousands of submissions.

Step 4: Apply Winnowing to Select Representative Fingerprints

If we kept every hash, comparing two submissions would still be expensive, and adjacent hashes are nearly identical (because k-grams overlap). We need a fingerprint selection strategy. Enter winnowing.

The winnowing algorithm slides a window of size w over the list of hashes. Inside each window, it selects the minimum hash. If multiple minima occur, it picks the rightmost one. This guarantees that any match of length k tokens will be detected, while also guaranteeing that the distance between selected fingerprints is at most w tokens.

“Winnowing guarantees that no matching substring of length ≥ k is missed, while still producing a sparse set of fingerprints.” — original paper

In practice, w is often set to 2k – 1 or slightly larger. For k=10, w=20 works well. The algorithm:

def winnow(hashes, w):
    fingerprints = []
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i+w]
        min_hash = min(window)
        # Pick rightmost occurrence of the minimum
        pos = window[::-1].index(min_hash)
        fingerprints.append((i + w - 1 - pos, min_hash))
    return fingerprints

Each fingerprint is now a pair (position, hash_value). Position matters for alignment later but is not used for matching itself — only the hash is compared across documents.

Step 5: Compare Fingerprints Across Two Submissions

To detect plagiarism between a query submission Q and a source submission S, we compute the fingerprint set for each, then find common hashes. Each common hash indicates that a k-gram of tokens appears in both documents — a potential copied passage.

But a single common k-gram could be coincidental (e.g., a common idiom like for (i=0; i<10; i++)). To reduce noise, most systems require a minimum match length — for example, at least m fingerprints that are consecutive or within a small positional offset. MOSS uses a “newline window” heuristic: after matching fingerprints, it expands the match region to the nearest newline boundaries and reports entire lines.

Here’s a simple comparison function:

def compare_submissions(tokens_q, tokens_s, k, w):
    hashes_q = [hash_kgram(g) for g in kgrams(tokens_q, k)]
    hashes_s = [hash_kgram(g) for g in kgrams(tokens_s, k)]
    fp_q = {h for _, h in winnow(hashes_q, w)}
    fp_s = {h for _, h in winnow(hashes_s, w)}
    common = fp_q & fp_s
    return len(common)

The raw number of common fingerprints is then normalized by total fingerprints (e.g., Jaccard similarity) to compute a similarity score. Real systems like Codequiry apply additional weighting: a match of 50 fingerprints in a 200-fingerprint document is far more significant than the same 50 in a 5000-fingerprint document.

Step 6: Map Matches Back to Source Lines for Reporting

Fingerprint hashes are opaque. For a useful plagiarism report, the system must map each matched hash back to the original source lines. This requires storing an inverted index: for each fingerprint hash, record the document ID, token position, and corresponding line numbers.

When a match is found, the system retrieves the line ranges from both documents and highlights them side by side. This is what you see in MOSS’s HTML output or Codequiry’s side-by-side comparison view: blue and red highlighted line pairs showing exactly where code was copied.

A minimal implementation stores:

{
   hash_value: [
        (doc_id, start_line, end_line),
        ...
   ]
}

Then, when a fingerprint is shared between two documents, you can immediately display the overlapping lines to a professor or reviewer.

Scaling to Hundreds of Thousands of Submissions

Pairwise comparison of all submissions is O(n²) — impracticable for large classes. The winnowing-based approach enables a linear-time corpus-wide comparison: hash all fingerprints into a global dictionary. For each hash that appears in more than one document, record the pair. This reduces the complexity to O(N) where N is total tokens across all submissions.

A real deployment — like Stanford’s MOSS handling 50,000 submissions per term — uses this trick. After fingerprint generation for each file, the server builds an inverted index. All document pairs sharing any fingerprint are then analyzed with a more detailed comparison (often using suffix arrays for exact matching of long runs).

Codequiry’s cloud infrastructure takes this further, adding parallel processing and caching for repeat reports. A 200-student Java assignment can be fully analyzed in under 30 seconds, including tokenization, fingerprinting, and report generation.

Limitations and Practical Workarounds

K-gram fingerprinting is not magic. It fails or degrades under certain conditions:

Full logic restructuring — swapping an if-else for a ternary operator or restructuring loops changes the token stream fundamentally. No token-level method catches this without cross-language AST alignment.
Insertion of dead code — sprinkling unrelated statements every four lines breaks k-grams if the base code is long enough. Winnowing helps because it selects minima, but determined students can beat it.
Cross-language copying — copying Python from Java produces completely different token sets. Language-specific tokenizers don’t help here; only AST-based or normalized intermediate representations work.
Token-level abstraction — if the tokenizer is too aggressive (e.g., remapping all identifiers to VAR), common code patterns like if (VAR) { ... } else { ... } become universal and generate false positives.

In practice, production systems combine k-gram fingerprinting with additional signals: string literal matching, commentary similarity, submission metadata timestamps, and — increasingly — perplexity analysis for AI-generated code. Codequiry layers all three: token-based fingerprinting, web-source cross-referencing, and AI-detection heuristics, giving instructors a three-way check.

Frequently Asked Questions

What is the difference between k-gram fingerprinting and AST comparison?

K-gram fingerprinting works on linear token sequences; it’s fast and captures local similarity. AST comparison builds parse trees and can detect deeper structural equivalence (e.g., reordered statements that produce the same tree structure). AST comparison is slower but more robust against certain restructurings.

How does winnowing guarantee no match is missed?

Winnowing guarantees that for any substring of length ≥ k that appears in both documents, at least one fingerprint hash from that substring will be selected in both. This is a formal property proven in the original 2003 paper — as long as the winnowing window size w ≥ k, the guarantee holds.

Can k-gram fingerprinting detect AI-generated code?

Partially. AI-generated code often contains identical boilerplate patterns (e.g., common imports, standard loop structures). K-gram fingerprinting compares against known AI training outputs or large code corpora. However, AI detectors that analyze perplexity and burstiness (statistical style) are more effective for differentiating AI from human writing. Codequiry offers both detection methods in a unified report.

What k value should I use for Python assignments?

For Python, where statements are often short, k=8 is a common starting point. For Java or C++, k=10–12 works well. A good practice is to test against a known-clean corpus of previous submissions and adjust k to maximize the gap between the highest inter-submission similarity and the lowest plagiarized-pair similarity.

K-gram Fingerprinting for Source Code Similarity Analysis