Your AI Detection Tool Is Lying to You About False Positives

The Confidence Score Trap

You get the report. A student's Python assignment for a basic binary search tree implementation flags at 92% likelihood of being AI-generated. The tool highlights generic variable names, consistent commenting style, and a standard algorithmic structure as evidence. The interface suggests "High Confidence - Recommend Review." Now what? Do you send the standard academic integrity violation email? For a growing number of CS professors and TAs, the answer is a hesitant "maybe," followed by a sinking feeling. That feeling is your expertise clashing with a statistical model.

The most dangerous tool is one you trust blindly. AI detectors don't find plagiarism; they find patterns. Your job is to determine if those patterns are sinister or simply sensible.

This guide is for when the detector's verdict doesn't sit right. We'll move from automated alert to human verdict. The goal isn't to prove the tool wrong, but to use it as a starting point for a real investigation. We'll use three concrete examples, complete with code, to build your forensic checklist.

Step 1: Isolate the "Evidence" and Contextualize It

Never start with the accusation. Start with the code. Pull the flagged submission and the detector's highlighted "key indicators." Your first task is to ask: are these indicators actually suspicious, or just signs of competent coding?

Example 1: The Overly Clean Solution. A student submits this function for a mid-level data structures course:

def find_kth_smallest(root, k):
    """
    Finds the k-th smallest element in a BST.
    
    Args:
        root: The root node of the BST.
        k: The ordinal position of the element to find (1-indexed).
    
    Returns:
        The value of the k-th smallest element, or None if not found.
    """
    stack = []
    current = root
    
    while stack or current:
        while current:
            stack.append(current)
            current = current.left
        
        current = stack.pop()
        k -= 1
        if k == 0:
            return current.val
        
        current = current.right
    
    return None

The detector flags: "Excessive docstring formatting," "Canonical in-order traversal pattern," "Generic variable names (stack, current)."

Your Analysis:

  1. This is the exact iterative in-order traversal algorithm taught in every standard textbook (CLRS, Skiena) and on GeeksforGeeks. A student who studied would reproduce it.
  2. The docstring is good, not excessive. It matches the style explicitly required by the course's style guide (publicly available on the course website).
  3. `stack` and `current` are the conventional, correct names for this algorithm. Suggesting a student rename them to be "unique" would be pedagogically unsound.

Action: Check the student's previous submissions. Do they have a history of clean, well-documented code? If this is a sudden shift from messy to pristine, that's a data point. If it's consistent, you're likely looking at a good student, not a cheater. Compare against the assignment's provided skeleton code. Is the structure identical? If so, the "pattern" is from the course, not ChatGPT.
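That history check can be made concrete with a quick style fingerprint of past submissions. A minimal sketch, not a verdict engine: `style_profile` is a hypothetical helper, and the two metrics (docstring coverage, comment density) are illustrative assumptions, not calibrated thresholds.

```python
import ast

def style_profile(source: str) -> dict:
    """Crude style fingerprint of one Python submission.

    A sudden jump in these numbers between assignments is a data
    point worth noting, never proof on its own.
    """
    tree = ast.parse(source)
    funcs = [node for node in ast.walk(tree)
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    with_doc = sum(1 for f in funcs if ast.get_docstring(f))
    lines = source.splitlines()
    comments = sum(1 for line in lines if line.strip().startswith("#"))
    return {
        "docstring_coverage": with_doc / len(funcs) if funcs else 0.0,
        "comment_density": comments / len(lines) if lines else 0.0,
    }
```

Run it over a student's earlier submissions and the flagged one; consistent numbers support the "good student" reading, a sharp discontinuity is one more data point for the interview.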

Step 2: Hunt for Logical Inconsistencies, Not Stylistic Ones

AI models are proficient at surface-level syntax and common patterns but often fail at deeper, task-specific logic, especially on novel problems. This is your most powerful filter.

Example 2: The Conceptually Dissonant Submission. An assignment asks: "Write a function that processes a list of experimental sensor readings. Filter out any reading where the value is more than 3 standard deviations from the mean, but only if the preceding reading was not also an outlier."

The flagged submission:

def filter_sensor_outliers(readings):
    import statistics
    if not readings:
        return []
    
    mean_val = statistics.mean(readings)
    stdev_val = statistics.stdev(readings) if len(readings) > 1 else 0
    threshold = 3 * stdev_val
    
    filtered = []
    prev_was_outlier = False
    
    for i, val in enumerate(readings):
        is_outlier = abs(val - mean_val) > threshold
        if not (is_outlier and prev_was_outlier):
            filtered.append(val)
        prev_was_outlier = is_outlier
    
    return filtered

The detector reports: "High syntactic consistency, optimal library use, clean control flow."

Your Forensic Breakpoint: The logic is subtly wrong. Read the problem again: "filter out any reading... only if the preceding reading was not also an outlier." The double negative is key. The code filters out a reading if it IS an outlier AND the previous one WAS. The correct logic should be: filter out a reading if it IS an outlier AND the previous one WAS NOT an outlier. The AI has mis-parsed the complex conditional.
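To make the misreading concrete, here is a sketch of the condition the spec actually asks for, reusing the submission's own variable names (the function name is ours, added for contrast):

```python
import statistics

def filter_sensor_outliers_correct(readings):
    """Drop a reading only when it IS an outlier AND the preceding
    reading was NOT also an outlier, as the assignment specifies."""
    if not readings:
        return []
    mean_val = statistics.mean(readings)
    stdev_val = statistics.stdev(readings) if len(readings) > 1 else 0
    threshold = 3 * stdev_val

    filtered = []
    prev_was_outlier = False

    for val in readings:
        is_outlier = abs(val - mean_val) > threshold
        # Spec: remove the reading only if the PREVIOUS one was normal.
        if not (is_outlier and not prev_was_outlier):
            filtered.append(val)
        prev_was_outlier = is_outlier

    return filtered
```

A single spike following normal readings is now removed, where the flagged submission would keep it. The entire difference is one `not` inside the conditional, which is exactly why this class of bug survives a syntax-focused detector.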

Action: This is a critical moment. A student copying from a peer would likely share that peer's logic, right or wrong; this specific misreading of the conditional is characteristic of a language model parsing the prompt. Meet the student one-on-one. Do not show them the code. Say: "Walk me through your logic for the sensor filter problem, especially how you handled the condition about the preceding reading." If they articulate the same flawed logic the code contains, they may well have written it. If they seem confused, or articulate the correct logic, the code came from somewhere else. This investigative interview is more definitive than any detector score.

Step 3: The "Embedded Fingerprint" Check

Students using AI aren't always sophisticated. They often leave traces—comments, variable names, or structural quirks from the prompt they used. Look for anomalies that wouldn't make sense in your course context.

Example 3: The Ghost of a Prompt. A final project for a web development class includes a utility module. The detector gives a middling 65% score, but something feels off.

// utils/helpers.js
/**
 * Calculates the total price of items in a cart with tax and shipping.
 * @param {Array} cartItems - The items in the shopping cart.
 * @param {number} taxRate - The applicable tax rate (e.g., 0.08 for 8%).
 * @param {number} shippingCost - The base shipping cost.
 * @return {number} The final total cost.
 */
function calculateFinalTotal(cartItems, taxRate = 0.08, shippingCost = 5.99) {
    // As per the user's request, handling edge case for empty cart.
    if (!cartItems || cartItems.length === 0) {
        return 0;
    }
    const subtotal = cartItems.reduce((sum, item) => sum + (item.price * item.quantity), 0);
    const tax = subtotal * taxRate;
    // Applying shipping only if subtotal is less than 50 as per standard e-commerce rule.
    const shipping = subtotal < 50 ? shippingCost : 0;
    const total = subtotal + tax + shipping;
    // Rounding to two decimal places for currency display.
    return Math.round(total * 100) / 100;
}


The Tell: The comments. "As per the user's request..." and "...as per standard e-commerce rule." These are narrative explanations directed at an imaginary "user" (the person writing the ChatGPT prompt), not comments for a fellow developer on this specific project. No student in this course has been asked to write comments in this style. Furthermore, the default `taxRate=0.08` and `shippingCost=5.99` are arbitrary specifics never mentioned in the project spec.

Action: This is the closest thing to a smoking gun you'll get. Search the student's other files for similar narrative comments. Confront them directly: "Can you explain the business logic behind the `shippingCost` default of 5.99 and the 50-dollar free shipping threshold? I don't see that in the project requirements." Their inability to justify these embedded, project-specific decisions is highly indicative of external generation.
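That search across a student's files is easy to automate. A hedged sketch: `scan_for_prompt_residue` is a hypothetical helper, and the phrase list is a starting assumption you should extend with tells you actually see in your own cohort.

```python
import re
from pathlib import Path

# Assumed starter list of prompt-residue phrases; extend as you find more.
NARRATIVE_TELLS = [
    r"as per (the )?(user'?s? )?request",
    r"as (the )?user (asked|requested|specified)",
    r"per standard .* rule",
    r"as requested",
]
PATTERN = re.compile("|".join(NARRATIVE_TELLS), re.IGNORECASE)

def scan_for_prompt_residue(submission_dir, extensions=(".py", ".js")):
    """Return (file, line number, line) triples for comments that read
    like answers to a prompt rather than notes to a teammate."""
    hits = []
    for path in Path(submission_dir).rglob("*"):
        if path.suffix not in extensions:
            continue
        for lineno, line in enumerate(
                path.read_text(errors="ignore").splitlines(), 1):
            if PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

A hit is not a conviction; it is a specific, quotable line to put on the table in the interview.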

Step 4: Establish a Baseline with Cross-Cohort Analysis

This step requires a tool like Codequiry that can handle bulk analysis. The goal is not to find the single cheater, but to understand the distribution of solutions. Genuine collaborative learning or common resource use (like a popular YouTube tutorial) creates clusters.

Workflow:

  1. Run the entire class's submissions (200+ students) through the similarity detection system.
  2. Look at the resulting similarity matrix or cluster diagram.
  3. Key Insight: Is the flagged submission in a cluster of 5-10 highly similar solutions? That suggests shared human resources (a study group, a forum solution). Is it a lone outlier with high AI-detection score but near-zero similarity to any peer? That strongly suggests an AI source. Is it paired with one other near-identical submission? That's classic peer plagiarism.
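Codequiry's internals aren't public, so here is a toy sketch of that triage logic using token-set Jaccard similarity as a crude stand-in for a real similarity engine; the function names and the two thresholds are illustrative assumptions.

```python
def jaccard(code_a, code_b):
    """Token-set Jaccard similarity: a crude stand-in for a real
    similarity engine, good enough to illustrate the triage."""
    tokens_a, tokens_b = set(code_a.split()), set(code_b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def triage(submissions, ai_scores, ai_threshold=0.85, sim_threshold=0.30):
    """Classify each submission by its highest similarity to any peer.

    submissions: dict of student -> source text
    ai_scores:   dict of student -> detector confidence in [0, 1]
    """
    verdicts = {}
    for student, code in submissions.items():
        peer_sims = [jaccard(code, other)
                     for name, other in submissions.items() if name != student]
        max_sim = max(peer_sims, default=0.0)
        if ai_scores.get(student, 0.0) > ai_threshold and max_sim < sim_threshold:
            verdicts[student] = "lone outlier: possible AI source"
        elif max_sim >= sim_threshold:
            verdicts[student] = "clustered: shared human resource or peer copy"
        else:
            verdicts[student] = "no flag"
    return verdicts
```

The point of the sketch is the shape of the decision, not the metric: a high AI score paired with near-zero peer similarity is a different phenomenon from a tight cluster of similar submissions, and the two deserve different conversations.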

The platform's ability to visualize this landscape is crucial. It moves the conversation from "This looks AI-ish" to "This submission is an anomaly in three distinct dimensions: style, logic, and cohort similarity."

Building Your Institutional Process

Detection is the start, not the end. Based on this forensic review, your department should standardize its response.

  • Stage 1: Automated Flag. Tools scan all submissions. Reports are generated for any submission above a calibrated threshold (e.g., >85% AI confidence AND <30% similarity to any peer).
  • Stage 2: TA Forensic Review. The TA performs the 4-step check above, focusing on Step 2 (Logic) and Step 3 (Fingerprints). They write a brief memo.
  • Stage 3: Professor Interview. If the memo suggests concern, the professor meets the student for a technical walkthrough of the problem, not the code.
  • Stage 4: Committee Review. Only if the interview is damning does the case go to the academic integrity committee, with the detector's report, the forensic memo, and interview notes as evidence.

This process protects everyone. It prevents knee-jerk accusations based on a number. It gives students a chance to explain. It builds a robust, defensible case when cheating is real. The tool's role is to triage, not to judge. Your role as an educator is to investigate, not to prosecute. By mastering this forensic layer, you reclaim the authority that the promise of AI detection threatened to outsource. You stop fearing the false positive because you have the skills to recognize and disprove it.