The Assignment That Broke Every Plagiarism Checker

The Perfect Storm in CS106A

Dr. Elena Vance knew something was wrong the moment she opened the tenth submission for the "Nim" game assignment. It was Autumn quarter at Stanford, and her section of CS106A, Programming Methodology, had 187 students. The assignment was classic: implement the game of Nim in Java, where players take turns removing objects from distinct heaps. The logic was straightforward—a few loops, some conditionals, solid practice for early programmers. But the submissions she was grading felt off.

“The code compiled. It ran. It even passed the basic autograder tests,” she told me later, from her office cluttered with whiteboard markers and decades of course binders. “But the strategic logic for the computer player was consistently, identically wrong. It would make a winning move impossible to achieve. Not just suboptimal—mathematically flawed in the same specific way.”

She pulled up the Stanford instance of MOSS (Measure of Software Similarity), the venerable plagiarism detection tool developed at the university itself. She ran a batch of 20 suspicious files. The report came back clean. Pairwise similarity scores hovered between 10% and 25%, well below the department’s 70% threshold for investigation. JPlag, another tool they used for cross-checking, showed similar, non-alarming results. The tools saw unique work. Elena’s experienced eye saw a carbon copy.

"The algorithms were telling me these were independent solutions. My gut, and the identical logical fallacies, were telling me they came from a single source. One of us was wrong, and it wasn't me."

Deconstructing the Obfuscation

Elena spent a Saturday morning comparing, side by side, two submissions from students who sat on opposite sides of the lecture hall. Superficially, they looked different. Variable names were unique. One used while loops where the other used for loops. The structure of the methods was shuffled. But the core algorithmic flaw was a fingerprint.

The bug was in the function that decided the computer's move. The correct Nim strategy, based on binary digital sums (nim-sum), requires XOR operations. The flawed submissions all contained a corrupted version of this logic, using a bizarre, incorrect bitwise AND operation that guaranteed a loss from certain positions. It was the kind of mistake a human might make once, but not 30 times independently.

Here’s a sanitized snippet of the flawed logic found in multiple submissions:

// Incorrect nim-sum calculation (found in suspect submissions)
public int computeComputerMove(int[] piles) {
    int nimSum = 0;
    for (int pile : piles) {
        nimSum = nimSum & pile; // BUG: Should be XOR (^), not AND (&)
    }
    // ... rest of flawed strategy based on wrong nimSum
}

And here is the same logical block, but from a submission that had been deliberately obfuscated to avoid detection:

// Obfuscated version of the same flaw
public int getAIMove(ArrayList<Integer> gameHeaps) {
    int binaryAggregate = 0;
    int index = 0;
    while (index < gameHeaps.size()) {
        binaryAggregate = binaryAggregate & gameHeaps.get(index).intValue(); // same AND bug
        index++;
    }
    int targetHeap = findTarget(binaryAggregate, gameHeaps);
    return targetHeap;
}
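For reference, the correct strategy XORs all heap sizes (the nim-sum); the player to move is in a losing position exactly when the nim-sum is zero. A minimal Python sketch of the correct calculation beside the AND corruption (the function names are mine, not from any submission):

```python
from functools import reduce

def nim_sum(piles):
    """Correct nim-sum: XOR of all heap sizes."""
    return reduce(lambda a, b: a ^ b, piles, 0)

def winning_move(piles):
    """Return (pile_index, new_size) that leaves nim-sum 0, or None
    if the current player is already in a losing position."""
    s = nim_sum(piles)
    if s == 0:
        return None
    for i, pile in enumerate(piles):
        if pile ^ s < pile:  # shrinking this pile restores nim-sum 0
            return (i, pile ^ s)

def buggy_nim_sum(piles):
    """The AND corruption: seeded at zero, 0 & anything == 0,
    so every position produces the same "nim-sum"."""
    acc = 0
    for pile in piles:
        acc = acc & pile
    return acc
```

With heaps [3, 4, 5], nim_sum is 2 and the winning move shrinks the 3-pile to 1; the buggy version reports 0 for every position, so the computer player can never tell a winning position from a losing one.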

“They had studied the detectors,” Elena concluded. “MOSS and JPlag, at their core, rely on tokenization and fingerprinting. They strip variable names, normalize whitespace, and compare sequences of language tokens. These students had used simple, methodical transformations that broke the token sequence while preserving the buggy semantic core.”

The Cheater's Playbook

Through subsequent investigation—reviewing forum posts, discreet conversations with TAs, and finally, discussions with the students involved—Elena pieced together the method. A student from a previous quarter had posted a complete, but subtly flawed, solution on a private forum. The new cohort had used it as a template, applying a checklist of obfuscation techniques designed to defeat token-based checkers:

  1. Systematic Renaming: Every variable and method name was changed, not randomly, but using a consistent thesaurus-based approach (e.g., calculate -> compute, pile -> heap).
  2. Control Flow Restructuring: Converting for loops to while loops, adding redundant conditional blocks that always evaluated to true, and altering the order of non-dependent statements.
  3. Comment and Whitespace Pollution: Adding unique, voluminous comments and irregular indentation to disrupt line-based matching.
  4. API Wrapping: Creating trivial helper methods to hide direct calls. Instead of piles.size(), they'd write getHeapCount(piles) that simply returned piles.size().

These transformations are trivial for a programmer. Together, they create enough syntactic noise to drop similarity scores below critical thresholds, while the program's semantics—and its unique errors—remain perfectly intact.
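The effect of these transformations on a token-based checker can be shown with a toy normalizer (a deliberately crude sketch, not how MOSS or JPlag actually tokenize): renaming is invisible once identifiers are collapsed, but restructured control flow changes the token sequence itself.

```python
import re

def normalize(code):
    """Toy token normalizer: collapse identifiers and numbers to
    placeholders, keep keywords and operators as-is."""
    KEYWORDS = {"for", "while", "if", "int", "return", "public"}
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    out = []
    for t in tokens:
        if t in KEYWORDS:
            out.append(t)
        elif re.match(r"[A-Za-z_]", t):
            out.append("ID")   # any identifier collapses to ID
        elif t.isdigit():
            out.append("NUM")
        else:
            out.append(t)      # operators and punctuation survive
    return out

a = "nimSum = nimSum & pile ;"
b = "binaryAggregate = binaryAggregate & heap ;"              # pure renaming
c = "while ( i < n ) { binaryAggregate = binaryAggregate & heap ; }"  # restructured
```

Here normalize(a) and normalize(b) are identical token streams, while normalize(c) is not: renaming alone changes nothing a tokenizing checker sees, but new control flow does.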

Why MOSS and JPlag Missed It

The incident exposed a fundamental limitation of the most common academic tools. MOSS uses a winnowing algorithm that creates fingerprints of overlapping k-grams of tokens. JPlag uses a greedy string tiling algorithm on token strings. Both are exceptionally good at detecting direct copy-paste or lazy renaming.
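The winnowing idea itself fits in a few lines. A sketch under simplifying assumptions (Python's built-in hash stands in for a proper rolling hash): hash every overlapping k-gram of tokens, then keep the minimum hash from each sliding window as a fingerprint; documents that share long token runs share fingerprints.

```python
def fingerprints(tokens, k=4, window=3):
    """Winnowing sketch: hash each k-gram of tokens, then record the
    minimum hash in every window of consecutive k-gram hashes."""
    grams = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [hash(g) for g in grams]
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}

def similarity(doc_a, doc_b):
    """Jaccard overlap of the two documents' fingerprint sets."""
    fa, fb = fingerprints(doc_a), fingerprints(doc_b)
    return len(fa & fb) / len(fa | fb)
```

A verbatim copy scores 1.0; edits that perturb enough windows drive the overlap down, which is exactly the property the obfuscation checklist exploits.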

They are not designed for semantic analysis. They cannot understand that nimSum = nimSum & pile and binaryAggregate = binaryAggregate & gameHeaps.get(i).intValue() are executing the same wrong idea. Renaming alone would not have fooled them—both tools normalize identifiers away—but the restructured control flow, the reordered statements, and the wrapper calls (a bare variable like pile versus a method chain like gameHeaps.get(i).intValue()) change the token sequence itself, scattering the fingerprints the tools compare.

“We were running a syntax checker against a semantic problem,” said Mark, a head TA on the course. “It was like using a spellcheck to find plagiarized essays. If you change every fifth word to a synonym, the spellcheck passes. The meaning is stolen, but the tool is blind.”

The Turning Point and a New Approach

Facing nearly 40 suspect submissions, the department had to act. The initial interviews followed the standard script: “The MOSS similarity is low. Can you explain your code?” Students, prepared, walked through their obfuscated code line by line, offering plausible explanations for their design “choices.” It was a stalemate.

Elena’s breakthrough came when she abandoned the tools and returned to first principles: the bug was the signature. She wrote a small analysis script that didn’t look for similarity, but for the presence of the specific erroneous pattern—the bitwise AND in the core strategic calculation. She scanned all 187 submissions.

# Simplified grep logic for the flawed semantic pattern
import re
pattern = r'(\w+)\s*=\s*\1\s*&'  # Looks for "var = var & expr" (accumulating AND)
# Then manually verified it was in the Nim-sum logic context

The script flagged 34 files. Confronted with this specific, undeniable commonality—a logical error so peculiar it defied coincidence—the first student broke. The story unraveled: a shared Google Doc with the original flawed solution and a list of obfuscation steps.
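Elena's full script was not shared with me; a minimal reconstruction of the approach (the directory layout and function name are my own guesses) scans every submission for the accumulating-AND signature and leaves the contextual check to a human:

```python
import re
from pathlib import Path

# Semantic signature: an accumulating bitwise AND, e.g. "x = x & y"
SIGNATURE = re.compile(r'(\w+)\s*=\s*\1\s*&')

def flag_submissions(submission_dir):
    """Return paths whose source matches the signature. Hits still
    need manual review to confirm they sit in the nim-sum logic."""
    return [path for path in Path(submission_dir).glob("**/*.java")
            if SIGNATURE.search(path.read_text(errors="ignore"))]
```

Note what this is not: it is not a similarity check. It looks for one specific wrong idea, so a correct XOR solution can never be flagged, no matter how it is written.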

Lessons for the Academic World

The fallout was significant. The involved students faced honor code sanctions. For the CS department, it triggered a policy and technology overhaul.

1. Tools are Assistants, Not Arbiters. “We stopped treating MOSS scores as verdicts,” Elena stated. “They are now anomaly detectors. A low score doesn’t mean innocence. It means we need to look deeper, sometimes manually.” The department now runs tools like Codequiry as one layer of analysis rather than a single gate, valuing its more nuanced fingerprinting and partial resistance to obfuscation while acknowledging that no tool is a silver bullet.

2. Design for Integrity. The Nim assignment was retired. New assignments are now built with plagiarism resistance in mind:

  • Unique Parameters: Each student gets slightly different problem specifications (e.g., different game rules, unique input/output formats).
  • Required Code Structures: Mandating specific, instructor-provided helper functions that must be integrated, creating a skeleton that is harder to replace wholesale.
  • Oral Assessments (“Code Vivas”): Randomly selected students must explain the logic of a randomly chosen line of their submitted code.
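One way to implement the unique-parameters idea (my sketch; the function and its parameters are hypothetical, not the department's actual tooling) is to derive each student's variant deterministically from a hash of their ID, so the autograder can regenerate the exact spec at grading time:

```python
import hashlib

def student_variant(student_id, min_heaps=3, max_heaps=5, max_heap_size=9):
    """Derive a per-student Nim variant from a hash of the student ID.
    Deterministic: the same ID always produces the same spec."""
    digest = hashlib.sha256(student_id.encode()).digest()
    num_heaps = min_heaps + digest[0] % (max_heaps - min_heaps + 1)
    heaps = [1 + digest[i + 1] % max_heap_size for i in range(num_heaps)]
    misere = bool(digest[10] % 2)  # normal play vs. misere (last mover loses)
    return {"heaps": heaps, "misere": misere}
```

Because two students rarely share a spec, a solution copied wholesale fails the copier's own test cases before it ever reaches a similarity checker.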

3. Teach the Line. The department now dedicates a lecture in CS106A to academic integrity, explicitly showing examples of obfuscation and explaining why it’s still plagiarism. “We show them the AND bug,” Elena said. “We say, ‘If you share this, or use it, we will find it. Not because our tools are perfect, but because your understanding isn’t.’”

The Final Analysis

The Stanford Nim incident is a cautionary tale for every CS department relying on automated similarity checkers. It highlights the arms race between detection and evasion. The most sophisticated cheating isn't copying; it's systematic translation designed to fool algorithms that don't understand meaning.

The solution isn't just a better algorithm—though that helps. It's a holistic strategy: smarter assignment design, tools used intelligently as probes rather than judges, and a culture that values the process of problem-solving over just the final, functional code. As Elena Vance put it, “We’re not in the business of catching cheaters. We’re in the business of creating environments where cheating is both harder to do and less valuable than actually learning. The detection is just the failsafe for when that ideal fails.”

The students who used the flawed template passed the assignment, technically. They also learned nothing about Nim, problem-solving, or algorithms. The one student who came forward after being caught and repeated the course the following year? She aced it, and later became a TA. “She told me the second time through, actually struggling, was when she learned how to program,” Elena said. That, in the end, was the whole point.