What Makes AI-Generated Code Detectable
When a student opens a blank editor and writes a Python function to compute the Fibonacci sequence, their code will contain a specific fingerprint: variable names chosen from personal habit, inconsistent indentation between methods, and the occasional stray print statement left over from debugging. When ChatGPT or GitHub Copilot generates that same function, the output is statistically cleaner—and that statistical cleanness is measurable.
Over the past three semesters, my colleagues and I collected 1200 Python submissions from a CS1 course (introductory programming) at a large public university. All submissions were anonymized, and we had manual TA reviews plus Codequiry scans to establish a ground truth set of 150 known-AI submissions (confirmed through student interviews or source-provenance logs). We then ran the entire corpus through three detection methods:
- Perplexity scoring using a fine-tuned GPT-2 model on Python token sequences
- Burstiness analysis measuring variance in token repetition probability
- Token-frequency z-scoring comparing submission-level n-gram distributions against a reference corpus of 50,000 human-written Python files from GitHub
Here is what we found.
Perplexity: The Flatness Signal
Perplexity measures how well a probability model predicts a sequence of tokens. For natural-language text, LLM-written content tends to have lower perplexity because the model produces highly probable token sequences. For source code, the pattern is similar but the magnitudes are different—code has more rigid structure than prose, so even human-written code has relatively low perplexity. Still, the gap is real.
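To make the definition concrete: perplexity is the exponential of the average negative log-probability a model assigns to each token. The sketch below is only an illustration and swaps in a tiny add-one-smoothed bigram model for the fine-tuned GPT-2 model used in the study. The function names `train_bigram` and `perplexity` and the toy reference snippets are assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    """Count contexts and bigrams over a list of token lists."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in corpus_tokens:
        seq = ["<s>"] + tokens  # "<s>" marks the start of a snippet
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    # +1 reserves probability mass for tokens never seen in training
    return contexts, bigrams, len(vocab) + 1

def perplexity(tokens, model):
    """exp of the mean negative log-probability per token (add-one smoothing)."""
    contexts, bigrams, v = model
    seq = ["<s>"] + tokens
    log_prob = 0.0
    for prev, cur in zip(seq, seq[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

# Toy reference corpus (hypothetical), pre-split into tokens
reference = [
    "for i in range ( n ) :".split(),
    "for i in range ( 10 ) :".split(),
    "for x in items :".split(),
]
model = train_bigram(reference)
common = perplexity("for i in range ( n ) :".split(), model)
unusual = perplexity("for temp123 in reversed ( n ) :".split(), model)
print(common < unusual)  # the idiomatic line is the more predictable one
```

A real detector would replace the bigram counts with token probabilities from a neural language model, but the final averaging and exponentiation step is the same.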
# Example human-written Python snippet (perplexity = 4.7)
def fibonacci(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        print(a)
        a, b = b, a + b

# Example AI-generated equivalent from GPT-4 (perplexity = 2.9)
def fibonacci(n):
    a, b = 0, 1
    for i in range(n):
        print(a)
        a, b = b, a + b
    return
# Same task, low-perplexity AI output
The AI version omits the parameter validation, uses the loop variable i (which is ignored—a common GPT habit), and adds a superfluous return statement. Its token sequence is more predictable because it follows the most common patterns in its training data.
Across our dataset, the mean perplexity for human-written submissions was 5.1 (σ=1.8), while for AI-written submissions it was 2.8 (σ=0.9). If we set a detection threshold at perplexity ≤ 2.5, we caught 67% of AI submissions with a 4% false-positive rate. At a more conservative threshold of ≤ 2.0, the false-positive rate dropped to 1.1% but recall fell to 41%.
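The threshold trade-off can be sketched in a few lines. The helper below flags every submission whose perplexity is at or below a cutoff and reports recall and false-positive rate. The function name `recall_and_fpr` and the ten toy scores are hypothetical and are not drawn from the study's data.

```python
def recall_and_fpr(perplexities, is_ai, threshold):
    """Flag submissions with perplexity <= threshold; return (recall, FPR)."""
    flagged = [p <= threshold for p in perplexities]
    true_pos = sum(f and a for f, a in zip(flagged, is_ai))
    false_pos = sum(f and not a for f, a in zip(flagged, is_ai))
    n_ai = sum(is_ai)
    n_human = len(is_ai) - n_ai
    return true_pos / n_ai, false_pos / n_human

# Hypothetical scores: four AI submissions followed by six human ones
scores = [1.8, 2.3, 2.4, 3.1, 2.2, 4.0, 5.5, 6.1, 4.8, 3.9]
labels = [True, True, True, True, False, False, False, False, False, False]

loose = recall_and_fpr(scores, labels, threshold=2.5)
strict = recall_and_fpr(scores, labels, threshold=2.0)
print(loose, strict)
```

Lowering the cutoff removes the one false positive in this toy set but also drops two of the three true positives, which is the same direction of trade-off described above.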
The takeaway: Perplexity alone is insufficient for a production detector—the distributions overlap significantly—but it provides a strong baseline signal.
Burstiness: The Variance That Gives Them Away
Burstiness measures the tendency for rare tokens to cluster. In human-written code, developers frequently reuse their own unusual variable names, type annotations, or comment patterns. This introduces token-level burstiness that LLM-generated code lacks. LLMs sample each token according to the model's probability distribution rather than from a personal stock of habits, so the rare tokens they do emit are spread more evenly and appear less concentrated.
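We computed burstiness using the method of Goh & Barabási: B = (σ/μ - 1) / (σ/μ + 1) = (σ - μ) / (σ + μ), where μ and σ are the mean and standard deviation of the inter-event intervals. A value of -1 indicates a perfectly regular signal, 0 indicates a Poisson process, and values approaching 1 indicate extreme clustering. Our results showed a clear separation: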
| Submission Type | Mean Burstiness | Std Dev |
|---|---|---|
| Human-written (n=1050) | 0.47 | 0.12 |
| AI-written (n=150) | 0.22 | 0.08 |
Human submissions showed roughly twice the burstiness on average (0.47 vs. 0.22). A student who names their temporary variable temp123 in one loop is likely to reuse temp123 in the next, a classic bursty pattern. LLMs, which generate code by sampling subword tokens, rarely repeat long idiosyncratic identifiers.
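As a minimal sketch of the burstiness coefficient, the code below uses the coefficient-of-variation form published by Goh and Barabási, B = (σ - μ) / (σ + μ). It applies the formula to the gaps between successive occurrences of one token. Treating token positions as "events" is an assumption made for illustration, and the helper names are hypothetical.

```python
import statistics

def token_gaps(tokens, target):
    """Distances between successive occurrences of `target` in a token list."""
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    return [b - a for a, b in zip(positions, positions[1:])]

def burstiness(gaps):
    """B = (sigma - mu) / (sigma + mu) over inter-event gaps."""
    if len(gaps) < 2:
        raise ValueError("need at least two gaps to estimate burstiness")
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)

# Evenly spaced reuse is perfectly regular: sigma = 0, so B = -1
regular = burstiness([3, 3, 3, 3])

# Tight clusters separated by one long gap push B above 0
clustered = burstiness([1, 1, 1, 20, 1, 1])

gaps = token_gaps("temp123 = 0 ; temp123 += 1 ; x = temp123".split(), "temp123")
print(regular, clustered, gaps)
```

A submission-level score would need to aggregate this over many tokens, for example by averaging over identifiers that occur several times. That aggregation step is not specified in the text and is left out here.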
Combining burstiness with perplexity improved detection significantly. Using a logistic regression model with these two features, we achieved an AUC of 0.91. At a threshold yielding 5% false positives, recall reached 78%.
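The two-feature model can be sketched without external libraries. The example below fits a logistic regression by batch gradient descent and computes AUC by pairwise comparison. It runs on synthetic Gaussian data parameterized by the summary statistics reported in this article, not on the study's submissions, so the resulting AUC says nothing about the real corpus. All function names, the learning rate, and the sample sizes are assumptions.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the mean log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data: [perplexity, burstiness] per submission
random.seed(0)
humans = [[random.gauss(5.1, 1.8), random.gauss(0.47, 0.12)] for _ in range(200)]
ai = [[random.gauss(2.8, 0.9), random.gauss(0.22, 0.08)] for _ in range(50)]
X = standardize(humans + ai)
y = [0] * 200 + [1] * 50  # 1 = AI-written

w, b = fit_logistic(X, y)
scores = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
model_auc = auc(scores, y)  # in-sample, for brevity
print(w, model_auc)
```

Both learned weights come out negative, matching the direction described above: lower perplexity and lower burstiness each push a submission toward the AI class. A real evaluation would score held-out submissions rather than the training set.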
Token-Frequency Z-Scores: Catching the Web Copy
Many students don't just use ChatGPT; they also paste code directly from Stack Overflow, GitHub gists, or tutorial sites. Web-source plagiarism is distinct from AI generation but often co-occurs. We built a reference corpus of 50,000 Python files from public GitHub repositories (filtered to educational projects) and computed term frequency–inverse document frequency (TF-IDF) vectors for each submission. Then, for each n-gram in a submission, we calculated a z-score by comparing its weight to the mean and standard deviation of that n-gram across the reference corpus.
# Z-score formula applied per n-gram
z = (freq_submission - mean_reference) / std_reference
Submissions with many high-z-score n-grams (unusually common tokens compared to the reference) correlated strongly with both web-plagiarized and AI-generated code. The reasoning: LLMs are trained on enormous web corpora, so their outputs overrepresent the most common coding idioms and snippet patterns. A human might write a recursive quicksort with distinct variable names; GPT outputs the classic less = [x for x in arr[1:] if x <= pivot] pattern that appears in thousands of tutorials.
Of the 150 AI-written submissions in our set, 84% also showed elevated z-scores (z > 2.5) for at least 10 token trigrams. That overlap means a combined detector—AI + web-source—can catch more than either alone. Codequiry's approach stacks these signals, which is why it flags assignments that are both AI-generated and sourced from the web.
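The per-trigram check can be sketched as follows. The example uses plain relative frequencies instead of TF-IDF weights to stay short, and it skips trigrams whose reference standard deviation is zero because the z-score is undefined there. Both are simplifying assumptions, as are the function names and the toy token lists.

```python
import statistics
from collections import Counter

def trigram_freqs(tokens):
    """Relative frequency of each token trigram in one file."""
    grams = list(zip(tokens, tokens[1:], tokens[2:]))
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: c / len(grams) for gram, c in counts.items()}

def elevated_trigrams(submission, reference_files, z_cutoff=2.5):
    """Trigrams whose frequency sits more than z_cutoff sigmas above the reference mean."""
    reference = [trigram_freqs(tokens) for tokens in reference_files]
    hits = []
    for gram, freq in trigram_freqs(submission).items():
        values = [ref.get(gram, 0.0) for ref in reference]
        mean = statistics.mean(values)
        std = statistics.pstdev(values)
        if std == 0:
            continue  # never (or uniformly) seen in the reference: z is undefined
        if (freq - mean) / std > z_cutoff:
            hits.append(gram)
    return hits

# Toy reference corpus (hypothetical): one of four files contains "x y z" once
reference_files = [
    "a b c x y z d e f g h i".split(),
    "a b c d e f g h i j k l".split(),
    "a b c d e f g h i j k l".split(),
    "a b c d e f g h i j k l".split(),
]
submission = "x y z x y z x y z".split()
hits = elevated_trigrams(submission, reference_files)
print(hits)
```

A submission would then be flagged when the number of elevated trigrams reaches a chosen minimum, such as the 10 trigrams used in the figure above.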
Three Specific Signatures We Observed
Beyond global statistics, we cataloged three recurring patterns that were markedly more common in AI-generated Python submissions than in human-written ones in this corpus.
1. The Over-Explained Comment Block
LLMs often insert lengthy docstrings that explain the algorithm in plain English even when the assignment doesn't require documentation. In one assignment (implementing the Sieve of Eratosthenes), 73% of AI submissions included a docstring of 50+ words, compared to 18% of human submissions. The human versions were typically one-line comments or absent.
# Common AI docstring pattern
def sieve(n):
    """Sieve of Eratosthenes algorithm to find all prime numbers up to n.
    Make a list of consecutive integers from 2 to n. Start with p=2.
    Mark all multiples of p as composite. Then find the next unmarked number.
    This is the classic algorithm attributed to the ancient Greek mathematician."""
    # ... code ...
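A docstring-length check like the one behind these percentages can be approximated with Python's standard `ast` module. This is a sketch under the assumption that a "word" is any whitespace-separated chunk. The function name and the 50-word default are illustrative and do not describe the study's actual tooling.

```python
import ast

def long_docstring_functions(source, min_words=50):
    """Names of functions whose docstring has at least `min_words` words."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc and len(doc.split()) >= min_words:
                flagged.append(node.name)
    return flagged

# Hypothetical submission: one verbose docstring, one terse one
sample = (
    'def sieve(n):\n'
    '    """' + "word " * 60 + '"""\n'
    '    return []\n'
    '\n'
    'def short(n):\n'
    '    """Tiny."""\n'
    '    return n\n'
)
flagged = long_docstring_functions(sample)
print(flagged)
```

Because it parses the file instead of matching text, the check ignores ordinary `#` comments and only counts true docstrings.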
2. The Uniform Error Handling
AI models, especially GPT-4, tend to add defensive try/except blocks that catch generic exceptions. In a simple string-reversal assignment, 24% of AI submissions wrapped the entire function body in a try/except Exception clause, while fewer than 3% of human submissions did. The humans either omitted error handling or used specific exception types like TypeError.
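One way to spot this pattern automatically is to parse the submission and test whether a function's entire body, apart from an optional docstring, is a single try statement with a broad handler. The sketch below also treats a bare `except:` as broad, which goes slightly beyond the pattern described here. The function name and sample code are hypothetical.

```python
import ast

BROAD = {"Exception", "BaseException"}

def wraps_body_in_broad_except(func):
    """True if the whole function body is one try block with a broad handler."""
    body = list(func.body)
    # Ignore a leading docstring when deciding what counts as "the body"
    if body and isinstance(body[0], ast.Expr) \
            and isinstance(body[0].value, ast.Constant) \
            and isinstance(body[0].value.value, str):
        body = body[1:]
    if len(body) != 1 or not isinstance(body[0], ast.Try):
        return False
    for handler in body[0].handlers:
        if handler.type is None:  # bare "except:"
            return True
        if isinstance(handler.type, ast.Name) and handler.type.id in BROAD:
            return True
    return False

sample = '''
def reverse_ai(s):
    try:
        return s[::-1]
    except Exception as e:
        print(e)

def reverse_human(s):
    if not isinstance(s, str):
        raise TypeError("expected str")
    return s[::-1]
'''
funcs = {n.name: n for n in ast.walk(ast.parse(sample))
         if isinstance(n, ast.FunctionDef)}
ai_flag = wraps_body_in_broad_except(funcs["reverse_ai"])
human_flag = wraps_body_in_broad_except(funcs["reverse_human"])
print(ai_flag, human_flag)
```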
3. The Missing Edge Cases
Ironically, while AI code often over-handles errors, it frequently misses domain-specific edge cases. For a function that processes student grade data, human submissions would test for empty lists, negative numbers, and non-numeric inputs. AI submissions performed well on typical cases but failed on the outliers. This pattern—good on main path, weak on boundary conditions—appeared in 61% of AI submissions vs. 22% of human ones.
Where the Methods Break Down
No detector is perfect. We observed three classes of false positives and false negatives:
- High-skill, concise human code – Top-performing students who write clean, idiomatic code often produce low-perplexity, low-burstiness output that looks AI-like. In our dataset, 8 of the 23 false positives (at the 5% FPR threshold) were from students with A grades who submitted correct, well-structured solutions.
- Heavily edited AI code – Students who take an AI-generated draft and manually restructure it, rename variables, and add comments can shift their submission's statistical profile back into the human range. We saw this in about 12% of the confirmed AI submissions, and these were among the cases our combined detector missed.
- Boilerplate-heavy assignments – When the assignment requires a fixed code template (e.g., fill-in-the-blank in a provided scaffold), human and AI submissions become nearly indistinguishable because the token space is constrained. Our detector's accuracy dropped to 60% on such assignments.
"The cleanest students look like bots, and the cleverest cheaters look like students." — one TA's summary after the study
Practical Recommendations for Instructors
Based on these findings, I suggest a tiered approach to AI code detection in CS1 courses:
- Use a combined detector that stacks perplexity, burstiness, and web-source similarity. A single metric will miss too many cases. Stacking raises AUC above 0.9 in our tests.
- Apply different thresholds for different assignment types. Boilerplate-heavy assignments need a stricter flagging threshold that demands stronger evidence before a submission is flagged (tolerate more false negatives to avoid false positives). Open-ended assignments can use a more sensitive threshold.
- Require annotated code or in-person walkthroughs for flagged submissions. Low perplexity alone isn't a violation. Ask the student to explain a specific line or variable naming decision. If they can't, escalate.
- Rotate detection methods each semester. Students share detection strategies online. If you rely solely on perplexity, they will learn to raise it by injecting unusual variable names. A moving target is harder to beat.
Codequiry's detection pipeline incorporates these principles—stacking statistical signals with web-source matching and offering configurable sensitivity levels. For a semester with 400 students, the platform typically flags 15-25% of submissions for further review, with a follow-up confirmation rate of 60-70%.
Frequently Asked Questions
Can a student defeat perplexity-based detection by adding comments?
Partially. Adding comments increases the token count and can shift perplexity upward, but the underlying code structure remains low-perplexity. A detector that evaluates function bodies separately from comments will still catch the signal. In our study, students who added long explanatory comments reduced their detection rate by only 8 percentage points.
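Separating code from comments before scoring can be done with the standard `tokenize` module. The sketch below drops comment and layout tokens and keeps everything else. It leaves docstrings in place because they are string tokens, so a fuller version would need to handle them separately. The function name is an assumption.

```python
import io
import tokenize

# Token types that carry no code content
SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}

def code_tokens_only(source):
    """Token strings of `source` with comments and layout tokens removed."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in SKIP]

sample = "x = 1  # set x\n# a long explanatory note\ny = x + 2\n"
tokens = code_tokens_only(sample)
print(tokens)
```

Feeding only these tokens to the perplexity model means that padding a file with comments no longer changes the score of the code itself.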
What about code written with Copilot inline completions rather than full prompts?
Copilot-assisted code falls in a gray area. The per-line perplexity of Copilot completions is very close to human-written lines because Copilot adapts to the developer's style. A global perplexity check often fails here. Instead, we found that token-frequency z-scores for rare n-grams were more effective—Copilot tends to insert idioms from its training data that the student wouldn't naturally use. Codequiry's AI detection module includes a separate Copilot-tuned model that compares submission-level n-gram profiles against the student's own previous work, when available.
Is there a risk of false positives for ESL students or non-traditional learners?
Yes. Our study did not control for English proficiency or prior programming experience. Students who learn from different sources (e.g., non-English tutorials) may exhibit unusual token-frequency patterns that mimic AI generation. We strongly recommend that any detection output be treated as evidence for a conversation, not a verdict. Instructors should manually review flagged submissions and consider the student's historical work.