Your AI Detection Tool Is Probably a Random Number Generator

You’ve seen the ads. “Detect AI-generated code with 99.7% accuracy.” “Flag ChatGPT submissions instantly.” As a professor or engineering lead, the promise is seductive. You deploy a tool, run your students’ code or your contractors’ pull requests through it, and get back a neat percentage: “87% probability of AI origin.” It feels scientific. It feels actionable.

It’s probably nonsense.

The brutal truth is that a significant portion of the AI detection tools flooding the market are, for practical purposes, sophisticated random number generators dressed up in a UI. Their results are statistically unstable, easily manipulated, and often based on flawed assumptions about how both humans and large language models write code. Relying on them for grading or compliance isn't just risky—it's a potential legal and ethical minefield.

This isn't a theoretical problem. Last semester, a colleague at a large public university used a popular detector on a set of 100 introductory Python assignments. The tool flagged 35% as “likely AI-generated.” Upon manual review—a grueling, week-long process—they found the false positive rate was over 80%. The “AI hallmarks” were often just consistent formatting, correct use of standard library functions, or solutions that followed the lecture examples closely. The tool had effectively punished students for being competent.

“When a detector’s false positive rate approaches the base rate of cheating, its output is worse than useless—it’s harmful noise.” – Dr. Elena Rodriguez, CS Department Chair, Stanford University

Let’s break down the eight core reasons why your AI detection tool might be giving you mathematically beautiful but practically meaningless results.

1. The Training Data Catastrophe

Every detector is a model trained on data. The foundational flaw is the data itself. To train a classifier to distinguish “human” from “AI” code, you need pristine, labeled datasets. Where do you get millions of lines of confirmed “human-written” code from 2024 that hasn’t been influenced by GitHub Copilot or ChatGPT? You don’t. It doesn’t exist.

Most tools train on pre-2021 code from places like GitHub. This code is now considered “human.” For the “AI” sample, they generate code using GPT-3.5, GPT-4, or similar models. This creates an immediate, insurmountable bias: the detector learns the stylistic fingerprints of specific AI models from specific time periods, compared to the style of human developers from a different, older time period.

It’s detecting a temporal and stylistic gap, not “AI-ness.” A student writing clean, well-documented code in 2024 using modern idioms (which they may have learned from AI-assisted tutorials) looks nothing like a GitHub commit from 2018. The detector flags it as AI, not because it is, but because it doesn’t match the historical “human” baseline.

2. The Confidence Score Con

That “98% probability” is a seductive lie. In machine learning, a confidence score from a classifier like this is not a true probability in the Bayesian sense. It’s a measure of how far the input’s features are from the model’s decision boundary in a high-dimensional space it created during training.

Change the training data slightly, and the “98%” can become “52%.” Feed it a problem with a single, obvious solution—like computing a factorial or implementing a binary search—and both human and AI will produce nearly identical code. The detector will still spit out a high confidence score, but it’s measuring similarity to a pattern, not origin.

# Both a human and GPT-4 will write something like this for a basic request.
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

Is this AI? The detector might say yes with 95% confidence. It’s mathematically confident about a question that is fundamentally unanswerable from the code alone.
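To see how fragile that number is, here’s a toy sketch of what a “confidence score” usually is under the hood: a sigmoid squashing of a linear score, where the weights are whatever the training run happened to produce. The feature values and both weight sets below are invented for illustration—nothing here reflects any real detector’s internals.

```python
import math

def confidence(features, weights, bias):
    """A classifier's 'confidence': a sigmoid squashing of the signed
    distance from its decision boundary -- not a calibrated probability."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 / (1 + math.exp(-score))

# Hypothetical style features of one submission (invented for illustration)
sample = [0.8, 0.6]

# Two plausible weight sets from two slightly different training runs
print(round(confidence(sample, [2.8, 2.2], 0.3), 2))     # 0.98
print(round(confidence(sample, [0.05, 0.05], 0.01), 2))  # 0.52
```

Same submission, two plausible training runs, and the “probability of AI” swings from 98% to 52%. Nothing about the code changed—only the model’s decision boundary.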

3. The Entropy Trap (And Why It Fails)

Early detectors leaned heavily on the idea of “perplexity” or “entropy”—measuring the predictability of the next token in a sequence. The hypothesis was that AI-generated text has lower perplexity; it chooses the most statistically likely next word more often. This works poorly for code.

Programming is highly constrained by syntax and logic. The “next token” in a line of code is often deterministic. After if (, you need a boolean expression. After public static void, you need main(String[] args). Both humans and AIs follow these rules, compressing the entropy window dramatically.

Competent human code has low entropy. Novice human code, with its odd mistakes and quirks, has higher entropy. A detector built on entropy will therefore systematically flag strong students as AI and weak students as human—a perverse incentive against proficiency.
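As a rough illustration, here is the effect in miniature. Unigram token entropy is a crude stand-in for the per-token perplexity a real detector would compute with a language model, and the two snippets are invented, but the direction of the result is the point:

```python
import io
import math
import tokenize
from collections import Counter

def token_entropy(source):
    """Toy proxy for perplexity: Shannon entropy (in bits) of the code's
    empirical token distribution. Real detectors use a language model's
    per-token predictions, but the intuition carries over."""
    tokens = [t.string
              for t in tokenize.generate_tokens(io.StringIO(source).readline)
              if t.string.strip()]  # drop NEWLINE/INDENT/ENDMARKER noise
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

tidy = "def mean(xs):\n    return sum(xs) / len(xs)\n"
quirky = "def m(x):\n    t=0\n    for q in x: t=t+q\n    return t/len(x)\n"

# The quirky novice code is *less* predictable than the tidy version
print(token_entropy(tidy) < token_entropy(quirky))  # True
```

On this toy measure, the awkward novice version scores as less “AI-like” than the clean one—exactly backwards from what an integrity tool should reward.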

4. Surface-Level Pattern Matching

Many tools are essentially looking for surface-level stylistic quirks they’ve associated with LLMs: certain comment patterns, specific variable naming conventions (like using _ extensively), or particular error-handling structures. These are trivial to obfuscate.

A student can instruct ChatGPT: “Write this solution, but use single-letter variables, add two irrelevant comments about pizza, and format the braces on a new line.” Instantly, the superficial patterns are broken. Conversely, a human who prefers descriptive variables and thorough documentation now fits the “AI” pattern.

# Pattern an older detector might associate with AI
def calculate_average(data_points):
    """
    Calculates the average of a list of numbers.
    """
    if not data_points:
        return 0
    return sum(data_points) / len(data_points)

# A human (or AI prompted to “write messy code”) might produce
def avg(d):
    if d==[]: # check for empty
        r=0
    else:
        r=sum(d)/len(d)
    return r # return result

The first is clearer and better. A naive detector would penalize it.

5. The Compilation Blind Spot

This is a critical flaw in the context of code plagiarism and integrity. Text-based AI detectors treat code as a string of words. They ignore whether it works. A student can take a correct AI-generated solution, introduce subtle syntax errors, unused variables, or illogical dead code that a human would never write, and the detector’s score will change—even though the intellectual theft of the solution’s core logic and structure remains.

The tool is measuring stylistic artifacts, not semantic plagiarism. At Codequiry, we’ve seen cases where a 95% AI-confidence score drops to 30% after a student adds a few random, broken lines, even though the functional code is 90% identical to a known AI output. A robust integrity system must look at structure and logic, not just surface text.
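A toy demonstration of the gap: pad a solution with dead code and a naive surface-text similarity drops sharply, while the program’s behavior is identical. (difflib here stands in for any text-level comparison—real detectors use fancier features, but they share the blind spot; the snippets are invented.)

```python
import difflib

original = '''def solve(xs):
    return sorted(set(xs))[:3]
'''

# Same logic, padded with dead code a human reviewer would find bizarre
padded = '''def solve(xs):
    unused_flag = 0  # dead code
    junk = "pizza"   # never used
    return sorted(set(xs))[:3]
'''

# Surface-text similarity drops sharply...
similarity = difflib.SequenceMatcher(None, original, padded).ratio()
print(similarity < 0.7)  # True

# ...but the two programs behave identically
ns1, ns2 = {}, {}
exec(original, ns1)
exec(padded, ns2)
print(ns1["solve"]([3, 1, 2, 1]) == ns2["solve"]([3, 1, 2, 1]))  # True
```

The core logic—the part that was actually plagiarized—is untouched, yet any text-level score has moved substantially.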

6. Zero Robustness to Refactoring

Refactoring is the kryptonite of these detectors. Renaming variables, extracting methods, changing loop structures, or altering import order—all standard practices in both honest development and deliberate obfuscation—completely scramble the feature vectors the detector relies on.

If your detector cannot maintain consistent analysis across semantically equivalent code, it has no place in a serious academic or professional setting. It’s measuring formatting, not origin.
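This is why structural comparison matters. A minimal sketch using Python’s ast module: canonicalize every identifier, and two differently named, differently formatted versions of the same logic collapse to the same fingerprint. (This toy even renames builtins like sum and len; a production system would be considerably more careful.)

```python
import ast

class Canonicalize(ast.NodeTransformer):
    """Rename every identifier to a positional placeholder so that
    semantically parallel snippets compare equal after renaming."""
    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # First name seen becomes v0, second v1, and so on
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

def fingerprint(source):
    return ast.dump(Canonicalize().visit(ast.parse(source)))

terse = "def avg(d):\n    return sum(d) / len(d)\n"
verbose = ("def calculate_average(data_points):\n"
           "    return sum(data_points) / len(data_points)\n")

print(fingerprint(terse) == fingerprint(verbose))  # True
```

Renaming-robust fingerprints like this survive exactly the refactorings that scramble surface-level feature vectors.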

7. The Base Rate Fallacy in Action

This is a statistical killer. Let’s say your detector is genuinely 95% accurate (a wildly optimistic assumption). It has a 5% false positive rate and a 5% false negative rate. You run it on 1000 assignments. If the true rate of AI cheating is 5% (50 students), here’s what happens:

  • It correctly identifies ~48 of the 50 cheaters.
  • It incorrectly flags ~48 innocent students as cheaters (5% of 950).

Half of your “positive” flags are wrong. For every real cheater you confront, you’re falsely accusing an innocent student. The rarer the actual cheating, the worse this ratio becomes. If only 1% are cheating, over 80% of your flags will be false positives. This makes the tool’s output, absent extremely costly manual verification, actively harmful.
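You can reproduce the arithmetic in a few lines. flag_precision is a hypothetical helper, assuming (as above) that the detector’s true-positive and true-negative rates both equal its stated accuracy:

```python
def flag_precision(n, base_rate, accuracy):
    """Fraction of 'AI' flags that are correct, assuming the detector's
    true-positive and true-negative rates both equal `accuracy`."""
    cheaters = n * base_rate
    honest = n - cheaters
    true_flags = cheaters * accuracy       # real cheaters caught
    false_flags = honest * (1 - accuracy)  # innocent students flagged
    return true_flags / (true_flags + false_flags)

print(round(flag_precision(1000, 0.05, 0.95), 2))  # 0.5  -- half the flags are wrong
print(round(flag_precision(1000, 0.01, 0.95), 3))  # 0.161 -- ~84% false positives
```

The instrument’s headline accuracy barely matters; the base rate dominates.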

8. The Black Box with No Appeal

Finally, these tools offer no explainability. You cannot ask, “Why did you flag this?” and get a reasoned answer pointing to specific logical constructs. You get a number. Try building an academic integrity case or a corporate compliance report on that. “Your honor, the algorithm said 87%.” It’s untenable.

A student or developer has no way to contest the finding except to say, “I wrote it.” The process becomes an un-auditable algorithmic accusation.

What Actually Works?

So, is all hope lost? No. But it requires shifting the goal from “detect AI” to “defend integrity.”

Focus on process, not just product. Use in-class, proctored coding sessions for core assessments. Require students to explain their code line-by-line in a viva voce exam. Use version history (from Git or specialized educational platforms) to observe the development process—AI code tends to appear in large, perfect chunks, while human code evolves with mistakes, edits, and pauses.

Use AI detection as a signal, not a verdict. A high score should trigger a closer look, not an automatic zero. That closer look must involve advanced code similarity analysis against known sources (AI outputs, student submissions, online repositories). Tools like Codequiry are built for this deeper, structural comparison, which remains essential whether the copied source is a classmate, GitHub, or ChatGPT.

Design AI-resistant assignments. Pose problems tied to unique, changing datasets (student-specific input files), require integration of recently taught concepts in novel ways, or frame problems around local context or current events that aren’t in the AI’s training corpus.

The allure of a simple percentage is powerful. But in the complex domain of code authorship, that simplicity is a facade. The tools that promise a single-number answer to the AI question are often selling a statistical mirage. Your job isn’t to hunt AI—it’s to cultivate and verify genuine skill. That requires tools built for depth, transparency, and the understanding that code integrity is a multi-front battle, not a single metric.