How Cross-Language Code Plagiarism Detection Actually Works

In 2019, a teaching assistant at a large public university ran MOSS on 180 Java submissions for an introductory programming assignment. The similarity scores looked clean. Every pair was below 20%. No obvious copy-paste cases. The TA approved the grades and moved on.

Three weeks later, a student in the next course emailed their professor. They had found a GitHub repository containing solutions for the exact assignment—written in Python. Several students had translated those Python solutions into Java, changed variable names, reordered methods, and submitted them as original work. MOSS never flagged them.

That incident is not unusual. Cross-language code plagiarism is one of the most under-addressed problems in academic integrity today. As students become more resourceful—and as programming education expands across multiple languages in the same curriculum—the ability to detect translated code becomes essential.

The Translation Problem

Traditional code plagiarism detectors operate on a simple premise: compare source code in the same language using some form of tokenization or string matching. MOSS uses winnowing, a technique that hashes k-grams of normalized tokens and selects a subset of those hashes as fingerprints. JPlag compares token sequences using a greedy string tiling algorithm. Both work well when the source and target are in the same language.
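
The winnowing idea is compact enough to sketch. The following is an illustrative Python toy, not MOSS's actual implementation; the token stream, hash choice, and window sizes are made up for the example:

```python
import hashlib

def kgram_hashes(tokens, k=5):
    """Hash each k-gram of the normalized token stream to a 32-bit value."""
    grams = ["\x00".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return [int(hashlib.md5(g.encode()).hexdigest(), 16) & 0xFFFFFFFF for g in grams]

def winnow(hashes, window=4):
    """Select fingerprints: keep the minimum hash in each sliding window."""
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

# Hypothetical normalized token stream for a small loop
tokens = "FOR ID LT ID IF ID MOD NUM EQ NUM ID ADDEQ ID END END".split()
fingerprints = winnow(kgram_hashes(tokens))
print(sorted(fingerprints))
```

The window-minimum selection is what makes the fingerprints robust: any two documents sharing a sufficiently long run of tokens are guaranteed to share at least one selected hash.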

But they break down when a student translates code from Python to Java, or from C++ to JavaScript. The token sets are different. The syntax trees look different. A for loop in Java maps onto a for loop in Python, but at the token level the two representations share almost nothing.

// Java version
for (int i = 0; i < array.length; i++) {
    if (array[i] % 2 == 0) {
        sum += array[i];
    }
}

# Python version
for value in array:
    if value % 2 == 0:
        sum += value

These two code blocks perform the same operation: summing the even numbers in an array. But a token-based detector sees completely different sequences. Java uses explicit indexing, type declarations, and C-style loop syntax. Python uses direct iteration, no type declarations, and indentation-based structure. A naive comparison produces a similarity score near zero.

The problem scales. A student who finds a solution on Stack Overflow in JavaScript can rewrite it in Java with minimal effort. A student who copies from a friend who took the course in a different semester—when it was taught in a different language—can translate and submit with near-zero risk of detection, provided the institution only runs single-language checks.

The Anatomy of a Translation-Resistant Detector

Cross-language plagiarism detection requires moving past surface-level token comparison and into the realm of semantic fingerprinting. The core insight is straightforward: regardless of the programming language, a solution to a given problem has a certain logical structure. Loops, conditionals, function calls, and data flow patterns reveal the author's intent. A good detector captures that intent, not the syntax.

AST-Based Normalization

The first step is to parse source code into an Abstract Syntax Tree (AST). Most modern languages have robust parser libraries: Python has ast, JavaScript has acorn and @babel/parser, Java has the Eclipse JDT compiler, and C++ has Clang's LibTooling. Parsing strips away whitespace, comments, and formatting; a subsequent renaming pass can then canonicalize identifiers so that variable and function names no longer matter.

But a raw AST is still language-specific. A Java AST node for a ForStatement looks different from a Python For node. The normalization step maps these language-specific constructs to a language-independent intermediate representation.

// Normalized IR for both Java and Python even-sum loops
LOOP (collection: array, element: item)
    CONDITION (operator: modulo, left: item, right: 2, comparison: equals, target: 0)
        ASSIGN (target: sum, operation: add, value: item)

This IR captures the what rather than the how. Both the Java and Python examples above produce the same IR. A comparison engine can then compute similarity on these normalized structures.
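
As a concrete sketch, the Python half of that pipeline fits in a few lines using the standard ast module. The IR constructor names (LOOP, CONDITION, ASSIGN) mirror the example above and are illustrative; a Java front end would map ForStatement and IfStatement nodes onto the same constructors:

```python
import ast

def normalize(node):
    """Map a Python AST node to a tiny language-independent IR (illustrative)."""
    if isinstance(node, ast.For):
        return ("LOOP", [normalize(s) for s in node.body])
    if isinstance(node, ast.If):
        return ("CONDITION", [normalize(s) for s in node.body])
    if isinstance(node, ast.AugAssign):
        return ("ASSIGN", [])
    return ("OTHER", [])

source = """
for value in array:
    if value % 2 == 0:
        sum += value
"""
ir = normalize(ast.parse(source).body[0])
print(ir)  # ('LOOP', [('CONDITION', [('ASSIGN', [])])])
```

A real normalizer would also carry the operands (the modulo test, the accumulation target) into each IR node, as the example above does; this sketch keeps only the structural skeleton.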

Semantic Fingerprinting

Once normalized, the system generates fingerprints from the IR. The technique resembles winnowing but operates on structural features rather than token sequences. Common features include:

  • Control flow structure: The sequence of loops, conditionals, and function calls, independent of specific syntax
  • Operator usage patterns: The set of arithmetic, logical, and comparison operators used
  • Data dependency graphs: How variables flow through the program—which values are read, modified, and produced
  • Function call signatures: The number of arguments, their types, and the order of calls

Each feature becomes a hash. The system then winnows the hash stream: within each sliding window, it keeps the minimum hash, and the selected hashes form the fingerprint. The resulting fingerprint represents the program's logical structure, not its textual form. Two programs that implement the same algorithm in different syntax produce similar fingerprints.
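
To make the comparison step concrete, here is a toy similarity measure over structural fingerprints: Jaccard overlap of root-to-node control-flow paths. The IR tuples and the choice of feature are illustrative, not any specific tool's format:

```python
def structural_features(ir):
    """Enumerate every root-to-node control-flow path in the IR as a feature."""
    kind, children = ir
    paths = [(kind,)]
    for child in children:
        paths += [(kind,) + p for p in structural_features(child)]
    return paths

def jaccard(a, b):
    """Overlap between two feature sets, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# After normalization, the Java and Python even-sum loops yield the same IR
ir_java = ("LOOP", [("CONDITION", [("ASSIGN", [])])])
ir_py = ("LOOP", [("CONDITION", [("ASSIGN", [])])])
print(jaccard(structural_features(ir_java), structural_features(ir_py)))  # 1.0
```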

Handling Common Transformations

A robust cross-language detector must handle the transformations that students typically apply when translating code:

Loop structure changes. A C-style for loop with an index variable can become a Python for loop over a range, or a while loop with manual increment. The IR should normalize these to a generic loop construct, parameterized by the loop bounds and step.

Array vs. list vs. vector. Java arrays, Python lists, and C++ vectors all represent ordered collections. The IR should treat them identically, focusing on access patterns (indexing, iteration, appending) rather than the specific data structure name.

Type system differences. Static typing in Java and C++ introduces type declarations that have no equivalent in Python or JavaScript. The normalization step should strip type information entirely, or map it to a common type hierarchy (integer, floating-point, string, collection, composite).

Function and method boundaries. A student might split a monolithic Python function into multiple Java methods, or inline a helper function. The detector should compare at the granularity of the entire submission, not individual functions, to catch these reorganizations.
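
The loop-structure case can be sketched with Python's ast module. The rewrite rule here, treating iteration over range(len(xs)) as iteration over xs itself, is one illustrative normalization among many, not an exhaustive one (ast.unparse requires Python 3.9+):

```python
import ast

def loop_signature(node):
    """Collapse Python loop variants to a generic LOOP construct (illustrative)."""
    if isinstance(node, ast.For):
        it = node.iter
        # C-style indexed form: for i in range(len(xs)) → iteration over xs
        if (isinstance(it, ast.Call) and getattr(it.func, "id", "") == "range"
                and it.args and isinstance(it.args[0], ast.Call)
                and getattr(it.args[0].func, "id", "") == "len"):
            return ("LOOP", ast.unparse(it.args[0].args[0]))
        # Direct iteration: for x in xs
        return ("LOOP", ast.unparse(it))
    return None

indexed = ast.parse("for i in range(len(xs)):\n    pass").body[0]
direct = ast.parse("for x in xs:\n    pass").body[0]
print(loop_signature(indexed), loop_signature(direct))  # both ('LOOP', 'xs')
```

A while loop with a manual index would need its own rule, parameterized by the detected bounds and step, as described above.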

Practical Implementations and Limitations

Several research projects and commercial tools address cross-language detection. The SIM tool by Dick Grune compares programs by counting common substrings in a normalized token stream, but applying it across languages requires manual normalization. Open-source detectors such as Plaggie follow JPlag's token-tiling approach and target a single language (Java); extending tools like these across languages means first converting every submission to a shared intermediate representation.

Codequiry's approach combines AST normalization with machine learning–trained feature extraction. The system parses submissions into language-independent IR, then computes similarity using both exact fingerprint matches and approximate nearest-neighbor search on feature vectors. This catches not only direct translation but also refactored versions that reorganize code structure.
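
Codequiry's internals are not public, so the following is an illustration only: the approximate side of such a comparison can be as simple as cosine similarity over bag-of-feature count vectors, with nearest-neighbor search layered on top for scale. The feature names are hypothetical:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-feature count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Structural features extracted from two submissions (hypothetical)
a = Counter(["LOOP", "CONDITION", "ASSIGN", "CALL"])
b = Counter(["LOOP", "CONDITION", "ASSIGN"])
print(round(cosine(a, b), 3))  # 0.866
```

Because count vectors ignore ordering, this measure tolerates reorganized code; the exact-fingerprint pass catches the cases where ordering does match.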

But cross-language detection has limits. No system can catch every translation. Here are the cases that remain difficult:

Radically different paradigms. A solution that uses recursion in Scheme but iteration in Java may share little structural similarity. The IR captures control flow, but recursion and iteration, while mathematically equivalent, produce different dependency patterns.

Algorithm substitution. A student who translates a quicksort from Python to Java, but also changes the pivot selection strategy from median-of-three to random, may produce code that looks sufficiently different even at the IR level. The algorithm's structure changes, even though the conceptual origin is the same.

Obfuscation through complexity. Adding dead code, pointless variable reassignments, or gratuitous nesting can distort the IR fingerprint. A determined student with enough knowledge of the detection system could craft submissions that evade identification.

"Cross-language detectors raise the bar significantly, but they are not silver bullets. The goal is to make plagiarism more detectable than it is convenient." — From a 2023 survey of academic integrity practices at R1 universities

Building Cross-Language Checks Into Your Course

If you teach programming courses across multiple languages, or if your institution sequences courses from one language to another, cross-language detection should be part of your integrity toolkit. Here is a practical approach:

Step 1: Check single-language pairs first. Run MOSS or JPlag on each language cohort independently. This catches the common copy-paste cases quickly. Address those before moving to cross-language analysis.

Step 2: Collect submissions across semesters and languages. Build a repository of all past student submissions, organized by assignment but not by language. When a new submission comes in, compare it against all historical submissions for that assignment—regardless of the language they were written in.
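
The comparison in Step 2 reduces to a ranking loop over the historical corpus. In this sketch the fingerprints are plain sets and the similarity measure is Jaccard overlap; the corpus keys and threshold are hypothetical stand-ins for whatever detector and metadata you use:

```python
def compare_against_corpus(new_fp, corpus, threshold=0.5):
    """Rank historical submissions by similarity, regardless of language."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    hits = [(sid, jaccard(new_fp, fp)) for sid, fp in corpus.items()]
    return sorted([h for h in hits if h[1] >= threshold],
                  key=lambda h: h[1], reverse=True)

# Toy corpus keyed by (semester, student, language) — identifiers are made up
corpus = {
    ("2022F", "s01", "python"): {1, 2, 3, 4},
    ("2023S", "s02", "java"): {1, 2, 9, 10},
}
print(compare_against_corpus({1, 2, 3, 5}, corpus))
```

Note that the language tag is carried only as metadata for the human reviewer; the comparison itself never branches on it.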

Step 3: Use a tool with cross-language support. Codequiry and a few other platforms now offer cross-language comparison natively. If you are building your own pipeline, consider using AST normalization with a tool like srcML (which converts C, C++, Java, and C# to XML) or Joern (which produces code property graphs for multiple languages).

Step 4: Manual review of high-similarity pairs. Cross-language results require more careful interpretation than same-language results. A 70% similarity score between a Python submission and a Java submission is more meaningful than the same score between two Java submissions. Look for structural correspondences: same function decomposition, same variable flow, same conditional logic.

Step 5: Educate students explicitly. Most students who translate code across languages do not believe they are plagiarizing. They reason: "I rewrote it in a different language. That's not copying." Your syllabus should define plagiarism to include translation, and your assignments should remind students that using another student's or online solution—even translated—violates academic integrity.

The Broader Landscape

Cross-language plagiarism detection is not just an academic concern. In industry, organizations that acquire codebases or contract development work face similar challenges. A vendor might deliver C# code that is clearly translated from an open-source Java library. An outsourced team might produce Python code with structural fingerprints matching a competitor's C++ implementation. The same techniques that catch student plagiarism can protect intellectual property in commercial settings.

Open-source license compliance also intersects with cross-language detection. A project that uses a GPL-licensed algorithm implemented in C may violate the license if it reimplements that algorithm in JavaScript without proper attribution. Semantic fingerprinting can identify these cases even when no direct code copying occurred.

The techniques continue to evolve. Recent research from the Software Improvement Group applies graph neural networks to code property graphs, enabling similarity detection across languages with high accuracy. Microsoft's CodeBERT and similar transformer models can embed code snippets from different languages into a shared vector space, where proximity indicates semantic similarity. These approaches promise even better results as the models mature.

Practical Recommendations

For educators and engineering managers considering cross-language detection, these guidelines apply:

  • Start with assignments that have known cross-language solutions. If an assignment is popular on GitHub, search for solutions in multiple languages before the semester begins. Build a reference corpus to seed your detection system.
  • Set appropriate thresholds. Cross-language similarity scores tend to be lower than same-language scores. A threshold of 50–60% may be more appropriate than the 70–80% threshold used for same-language detection.
  • Combine with code quality metrics. A submission that shows low similarity to any same-language source but high similarity to a cross-language source, combined with unusual variable naming or inconsistent coding style, warrants closer investigation.
  • Document your methodology. In academic settings, you may need to defend plagiarism cases before an honor board. Explain how cross-language detection works, what the similarity scores mean, and why translation constitutes plagiarism.

The student who translated that Python solution into Java in 2019 was never formally caught. The professor added a note to the syllabus the following semester: "Translating code from another programming language does not make it original work. We have tools that detect this. Do not attempt it." The number of cross-language cases dropped significantly.

Cross-language plagiarism detection is not about catching every dishonest student. It is about making the calculus clear: the effort required to evade detection exceeds the effort required to do the work honestly. When the detection gap closes, the integrity gap follows.