The Perfectly Anomalous Binary Search Tree
It was the third week of Spring quarter, and Professor Aris Thakker was grading the first major programming assignment for Stanford’s CS106B: Programming Abstractions. The task was classic—implement a BinarySearchTree class with insertion, deletion, and traversal methods. The submissions, however, were anything but.
“The first five I graded were flawless,” Thakker told me, leaning back in his chair in the Gates Computer Science building. “Not just functionally correct. They were stylistically pristine. Consistent, verbose error handling. Comments that read like technical documentation. A structural elegance you simply don’t see in sophomore-level code.”
His initial reaction was pride. Then, suspicion. A sixth submission appeared. Then a seventh. Each implemented the core remove() function—a notoriously tricky algorithm involving node promotion and subtree reattachment—with the same unsettling perfection and an almost identical architectural fingerprint.
“It wasn’t copy-paste plagiarism. It was like 200 students had all hired the same obsessive, mid-level software engineer as their personal tutor. The logic was correct, but the cognitive signature was absent.”
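For readers who have not written one, the remove() operation is fiddly precisely because of the two-child case: the deleted node’s slot must be filled either by promoting its only child or by splicing in its inorder successor. The following minimal sketch, assuming a bare integer Node struct, shows the shape of the solution the assignment asks for; it is illustrative only, not the course’s reference implementation or any student’s code.

// Minimal BST removal sketch (illustrative; assumes this simple node layout).
struct Node {
    int value;
    Node* left;
    Node* right;
};

// Removes `value` from the subtree rooted at `node`; returns the new subtree root.
Node* removeNode(Node* node, int value) {
    if (node == nullptr) return nullptr;                  // value not present
    if (value < node->value) {
        node->left = removeNode(node->left, value);
    } else if (value > node->value) {
        node->right = removeNode(node->right, value);
    } else if (node->left == nullptr) {
        Node* child = node->right;                        // zero or one child:
        delete node;                                      // promote the child
        return child;
    } else if (node->right == nullptr) {
        Node* child = node->left;
        delete node;
        return child;
    } else {
        Node* succ = node->right;                         // two children: copy the
        while (succ->left != nullptr) succ = succ->left;  // inorder successor's value,
        node->value = succ->value;                        // then delete the successor
        node->right = removeNode(node->right, succ->value);
    }
    return node;
}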
Thakker ran the batch through the department’s standard tool, the venerable MOSS (Measure of Software Similarity) system. The results came back clean. Low pairwise similarity scores across the board. MOSS, designed to catch copied code blocks, was blind to this new pattern. The cheating hadn’t been obfuscated; it had been outsourced.
Beyond Similarity: The Hallmarks of Synthetic Code
Thakker, a former software architect at Google, decided to audit the code manually. He pulled up ten submissions side-by-side. The differences in variable names and formatting masked deeper commonalities.
First, there was the commentary. Human students comment sparingly, often poorly. These submissions featured lavish, generic documentation comments like this one:
/**
* Recursively removes a node with the specified value from the binary search tree.
* Utilizes a helper function to handle the three distinct cases of node deletion:
* leaf node, node with one child, and node with two children.
* @param value The integer value to be removed from the tree.
* @throws std::runtime_error if the tree is empty or the value is not found.
*/
“No sophomore writes ‘utilizes’ or throws a std::runtime_error for a missing value in a classroom assignment,” Thakker noted. “They return false. Or they print ‘Not found’ and move on with their lives.”
Second, the error handling was bizarrely robust yet contextually naive. Multiple submissions included identical, unnecessary checks for memory allocation failure in an environment where new wasn’t configured to return nullptr.
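Reconstructed for illustration (and reusing the Node struct from the sketch above), the pattern looked roughly like this; with a standard, throwing new, the branch can never execute.

#include <stdexcept>

Node* makeNode(int value) {
    Node* newNode = new Node{value, nullptr, nullptr};
    if (newNode == nullptr) {
        // Dead code: a standard `new` signals failure by throwing std::bad_alloc
        // and never returns nullptr, yet this branch appeared nearly verbatim
        // across many of the flagged submissions.
        throw std::runtime_error("Memory allocation failed");
    }
    return newNode;
}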
Third, and most telling, was the algorithmic structure. The remove() function requires finding a successor node when deleting a node with two children. Every human solution Thakker had seen in a decade used a simple iterative or recursive search. Forty percent of this year’s submissions implemented an identical, overly elaborate private helper method called findInorderSuccessor() that included redundant parent pointer updates—a pattern lifted straight from canonical online explanations and AI training data.
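The contrast is easy to see side by side. Below is a reconstruction of the kind of helper the audit flagged (illustrative, based on the description above, and again reusing the earlier Node struct), followed by the terse search a student working alone typically writes. The parent bookkeeping is redundant in a design that, like the sketch earlier, deletes the successor by value afterwards.

// Reconstruction of the flagged pattern: an elaborate private helper that
// tracks and returns a parent pointer the caller never actually needs.
Node* findInorderSuccessor(Node* node, Node*& parent) {
    parent = node;
    Node* current = node->right;        // successor lives in the right subtree
    while (current->left != nullptr) {
        parent = current;               // redundant bookkeeping in this design
        current = current->left;
    }
    return current;
}

// What a student who derived it independently usually writes instead.
Node* getNext(Node* node) {
    Node* current = node->right;
    while (current->left != nullptr) current = current->left;
    return current;
}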
“I had a hypothesis,” Thakker said. “I needed proof.”
The Audit: From Hypothesis to Hard Data
Thakker enlisted two teaching assistants. They designed a three-layer audit for all 220 submissions.
Layer 1: Stylometric Analysis. They wrote scripts to measure metrics MOSS ignored (the simplest is sketched after this list):
- Comment-to-code ratio.
- Average identifier name length and semantic richness (e.g., findInorderSuccessor vs. getNext).
- Consistency of bracket placement and whitespace across the entire file.
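None of this requires heavy machinery. A rough cut at the first metric, the comment-to-code ratio, fits in a short program like the one below; this is a sketch of the idea, not the TAs’ actual script, and it ignores edge cases such as block comments without leading asterisks and comment markers inside string literals.

#include <fstream>
#include <iostream>
#include <string>

// Rough comment-to-code ratio: counts lines starting with //, /*, or * against
// all other non-blank lines. Good enough to separate sparse from lavish commentary.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: ratio <file.cpp>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::string line;
    int commentLines = 0, codeLines = 0;
    while (std::getline(in, line)) {
        size_t start = line.find_first_not_of(" \t");
        if (start == std::string::npos) continue;          // blank line
        std::string t = line.substr(start);
        if (t.rfind("//", 0) == 0 || t.rfind("/*", 0) == 0 || t.rfind("*", 0) == 0) {
            ++commentLines;
        } else {
            ++codeLines;
        }
    }
    double ratio = codeLines ? static_cast<double>(commentLines) / codeLines : 0.0;
    std::cout << "comment-to-code ratio: " << ratio << "\n";
    return 0;
}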
The results clustered submissions into two stark groups: human (messy, inconsistent) and synthetic (rigidly uniform).
Layer 2: Logical Redundancy Detection. They looked for “correct but taught” patterns—solutions that solved the problem as commonly explained in tutorials, not as a student would naturally reason. The elaborate successor-finder was a prime marker.
Layer 3: The Interview. For 30 students flagged by layers 1 and 2, Thakker scheduled brief, mandatory code walkthroughs. He asked them to explain their remove() function line-by-line.
“The collapse was immediate and dramatic. Students who had submitted elegant, commented code couldn’t explain basic pointer manipulation in their own work. One student, when asked about his error handling, stared at the screen for a full minute and said, ‘I guess the AI thought it was a good idea.’ He didn’t even realize he’d confessed.”
The final numbers were staggering: 47% of submissions showed clear, corroborated evidence of substantial AI-generated code. Only 8% were old-fashioned copy-paste plagiarism, caught easily by MOSS. The new problem was nearly six times the size of the old one.
The Aftermath: Policy, Pedagogy, and New Tools
Stanford’s administration was notified. The standard academic integrity process, built for catching copiers, was overwhelmed by scale. A blanket punishment was impossible. Thakker and the department chair crafted a nuanced response.
First, a course-wide email was sent: “An analysis of Assignment 2 has revealed widespread use of AI code generation tools in violation of the course policy. All students whose work falls into this category must come forward within 48 hours for a reduced penalty.” Over 80 students self-reported.
Second, the assignment was nullified. A new, in-class, paper-based exam on binary tree manipulation was scheduled. “The learning outcome had to be verified,” Thakker said. “The grade was secondary.”
Third, and most significantly, the department fast-tracked the evaluation of new detection tools. MOSS, last updated in 2013, was declared insufficient for the post-2022 world. They needed systems that could analyze code provenance, not just similarity.
“We tested several platforms,” Thakker explained. “We needed something that could flag the hallmarks we’d identified manually: the stylistic uniformity, the generic commentary, the algorithmic commonality from non-student sources. A tool like Codequiry, for instance, has moved into this space by building detectors for the statistical fingerprints left by LLM-based tools such as GPT-4 and GitHub Copilot. It looks for patterns in token prediction and structure that are invisible to syntactic similarity engines.”
The department is now piloting a new workflow: all submissions pass through an updated similarity checker and an AI-originality analyzer. Flags are reviewed by TAs before any professor-student confrontation.
A New Contract for Computer Science Education
The incident forced a painful but necessary evolution. Thakker revised his course policies, now explicitly defining “unauthorized AI assistance” and dedicating a lecture to its detection.
His assignments have changed, too. They now include “contextual anchors”—unique, course-specific requirements that aren’t easily found in online tutorials or AI training data.
“Instead of ‘implement a BST,’ the assignment now says, ‘implement a BST that uses the custom Widget class from our lecture notes as its data type, and log every insertion and removal to a global EventLogger object.’ It breaks the copy-paste chain, both human and synthetic.”
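The shape of such a spec might look like the declaration below. Widget and EventLogger here are stand-ins I have sketched for illustration; the real definitions live in the course materials, and WidgetBST is simply the anchored interface that a generic tutorial, or a code generator working from a generic prompt, would not produce by accident.

#include <string>
#include <vector>

// Stand-ins for the lecture-notes types (illustrative only).
struct Widget {
    int id;
    std::string label;
    bool operator<(const Widget& other) const { return id < other.id; }
};

struct EventLogger {
    std::vector<std::string> events;
    void log(const std::string& event) { events.push_back(event); }
};

extern EventLogger courseLogger;   // the "global EventLogger object" in the spec

// The anchored interface: every structural change must be reported to courseLogger.
class WidgetBST {
public:
    void insert(const Widget& w);
    bool remove(const Widget& w);
    std::vector<Widget> inorder() const;
private:
    struct Node { Widget data; Node* left; Node* right; };
    Node* root = nullptr;
};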
The lesson from Stanford is clear. The cheating has evolved from duplication to generation. Detection must evolve from finding sameness to identifying synthetic origin.
“We’re not just teaching coding anymore,” Thakker concluded. “We’re teaching intellectual self-reliance in an age of automation. The first step to upholding that is being able to see when it’s missing. And our old tools just don’t see it.”
The arms race continues. But for now, in at least one corner of Silicon Valley, professors are starting to see the code for what it really is.