What 4,200 Python Submissions Tell Us About Code…

Running a Semester-Scale Code Similarity Audit

Most code plagiarism checks happen assignment by assignment: run a tool, get a percentage, confront the student. That workflow catches individual cases but misses the bigger picture. What if you could aggregate similarity scores across an entire course, across multiple semesters, and answer questions like: Which assignment encourages the most copying? Do scores spike near deadlines? Are certain language features more commonly shared?

Over three semesters at a mid-sized university, we collected similarity data from 4,200 Python submissions using Codequiry’s similarity engine. We then wrote a Python analysis pipeline to extract patterns. This article shows you exactly how to replicate that process and what we found.

Step 1: Exporting Similarity Data from Codequiry in Bulk

Codequiry provides per-assignment similarity reports with pairwise scores and matched segment maps. To aggregate across many assignments, we wrote a small Python script that uses the Codequiry API to pull all reports for a given course ID. The endpoint returns JSON like this:

{
  "assignment_id": "cs101_lab4",
  "submissions": 147,
  "pairs": [
    {
      "student_a": "s_johnson",
      "student_b": "m_nguyen",
      "similarity": 0.87,
      "matches": [
        {"type": "token", "start_line": 23, "end_line": 40},
        {"type": "ast", "start_line": 55, "end_line": 67}
      ]
    },
    ...
  ]
}

We collected 38 assignment runs over three semesters — 14 individual labs, 8 group projects, and 16 exams. The script saved each run as a separate CSV with columns: assignment_id, student_a, student_b, similarity_score, num_token_matches, num_ast_matches.
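The flattening from the JSON report to those CSV columns can be sketched as follows. This is a minimal illustration rather than the export script itself: the function names are hypothetical, and it assumes that num_token_matches and num_ast_matches simply count the entries of each type in the "matches" list.

```python
import csv
import io

COLUMNS = [
    "assignment_id", "student_a", "student_b",
    "similarity_score", "num_token_matches", "num_ast_matches",
]

def flatten_report(report):
    """Turn one report dict (shaped like the JSON above) into CSV-ready rows."""
    rows = []
    for pair in report["pairs"]:
        matches = pair.get("matches", [])
        rows.append({
            "assignment_id": report["assignment_id"],
            "student_a": pair["student_a"],
            "student_b": pair["student_b"],
            "similarity_score": pair["similarity"],
            # Assumption: each count is the number of match entries of that type.
            "num_token_matches": sum(m["type"] == "token" for m in matches),
            "num_ast_matches": sum(m["type"] == "ast" for m in matches),
        })
    return rows

def write_rows(rows, handle):
    """Write flattened rows, with a header line, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

# The single pair from the sample response above.
report = {
    "assignment_id": "cs101_lab4",
    "submissions": 147,
    "pairs": [{
        "student_a": "s_johnson",
        "student_b": "m_nguyen",
        "similarity": 0.87,
        "matches": [
            {"type": "token", "start_line": 23, "end_line": 40},
            {"type": "ast", "start_line": 55, "end_line": 67},
        ],
    }],
}
rows = flatten_report(report)
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Writing to a handle rather than a fixed path makes it easy to produce one file per assignment run.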

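Key tip: When building review queues, filter out pairs with similarity below 0.30 early. We found that typical random similarity between unrelated Python submissions hovers around 0.10–0.20 due to shared boilerplate (imports, function stubs), and a floor of 0.30 removed 85% of the noise. Keep the unfiltered pairs as well: the medians and baselines in the later steps fall below 0.30, so they can only be computed from the full set.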
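A small sketch of that floor in pandas, assuming the column names from the CSV export. The helper name apply_floor and the sample scores are made up for illustration. It returns a filtered copy and leaves the original frame untouched, so the full distribution stays available for medians and baselines.

```python
import pandas as pd

def apply_floor(pairs, floor=0.30):
    """Return the pairs at or above the floor and the fraction that was dropped."""
    kept = pairs[pairs["similarity_score"] >= floor].reset_index(drop=True)
    removed_fraction = 1 - len(kept) / len(pairs) if len(pairs) else 0.0
    return kept, removed_fraction

# Made-up scores, for illustration only.
pairs = pd.DataFrame({
    "assignment_id": ["cs101_lab4"] * 5,
    "similarity_score": [0.12, 0.18, 0.45, 0.87, 0.30],
})
kept, removed = apply_floor(pairs)
print(len(kept), "pairs kept;", f"{removed:.0%} removed")
```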

Step 2: Aggregating per-Assignment Statistics

With 4,221 submissions and 2.4 million pairwise comparisons, the raw data is too large to read directly. We aggregated per assignment: median similarity, max similarity, standard deviation, and number of pairs above 0.70. Here’s the Python code we used:

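import pandas as pd

# One row per compared pair, combined from the per-assignment exports
df = pd.read_csv("all_pairs.csv")

# Summarize the similarity distribution for each assignment
agg = df.groupby("assignment_id").agg(
    median_sim=("similarity_score", "median"),
    max_sim=("similarity_score", "max"),
    std_sim=("similarity_score", "std"),
    # Number of pairs above the 0.70 high-similarity cutoff
    high_pairs=("similarity_score", lambda x: (x > 0.70).sum()),
    total_pairs=("similarity_score", "count"),
).reset_index()

# Share of each assignment's pairs that exceed 0.70
agg["high_ratio"] = agg["high_pairs"] / agg["total_pairs"]
print(agg.sort_values("high_ratio", ascending=False).head(10))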

The output immediately highlighted three assignments with high_ratio values above 0.12, meaning more than 12% of all pairwise comparisons exceeded 0.70 similarity. Those assignments were: a string-manipulation lab due at midnight on a Sunday, a group recursion project, and a take-home exam that was identical across two sections.

Step 3: Normalizing by Submission Count and Language Boilerplate

Raw similarity scores are misleading without normalization. The number of pairs grows quadratically with class size: 200 submissions produce 19,900 pairs while 40 produce only 780, so the larger class will show far more high-similarity pairs by chance alone. We adjusted by computing a z-score for each assignment against the expected distribution from shuffled submissions.
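To make the combinatorics and the z-score concrete, here is a minimal sketch. The pair count for n submissions is n(n-1)/2. The z-score helper assumes you already have the same statistic (for example high_ratio) measured on a handful of shuffled runs; how those shuffled runs are generated is not shown here, and the null values below are invented for illustration.

```python
from math import comb
from statistics import mean, stdev

def pair_count(n_submissions):
    """Number of distinct pairs among n submissions: n * (n - 1) / 2."""
    return comb(n_submissions, 2)

def z_score(observed, null_samples):
    """Standardize an observed statistic against values from shuffled runs."""
    mu = mean(null_samples)
    sigma = stdev(null_samples)
    if sigma == 0:
        raise ValueError("null samples have zero spread; z-score is undefined")
    return (observed - mu) / sigma

print(pair_count(40))    # 780
print(pair_count(200))   # 19900

# Hypothetical high_ratio values from five shuffled runs of one assignment.
null_ratios = [0.020, 0.030, 0.025, 0.035, 0.030]
print(z_score(0.12, null_ratios))
```

A large positive z-score says the assignment's high-similarity share is well above what shuffling alone produces, regardless of class size.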

We also accounted for Python’s high boilerplate in certain tasks. For simple file I/O, nearly every student writes with open(filename) as f:. Codequiry’s AST-based matching helps here — it compares structural similarity rather than token sequences — but we still needed to subtract a baseline. We ran a synthetic baseline by taking the first 50 submissions from each assignment, randomly pairing them, and measuring median similarity. That baseline ranged from 0.04 (refactoring tasks) to 0.28 (template-heavy labs).

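# df_random_pairs: similarity scores for the randomly paired submissions of one assignment
baseline = df_random_pairs["similarity_score"].median()
# assignment_median: that assignment's observed median from Step 2
adj_median = assignment_median - baseline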

After adjustment, the same three assignments still led the pack, but the gap narrowed. The string lab’s adjusted median dropped from 0.48 to 0.23 — still elevated, but less dramatic.
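The random-pairing baseline can be sketched as below. This is an illustration under assumptions, not the pipeline's actual code: the function name is hypothetical, scores are looked up from an already computed pairwise table keyed by unordered student pair, and the toy scores are invented.

```python
import random
from statistics import median

def random_pair_baseline(students, score_lookup, sample_size=50, seed=0):
    """Median similarity over randomly formed, non-overlapping pairs.

    students:      submission IDs in upload order (the first sample_size are used)
    score_lookup:  dict mapping frozenset({a, b}) to that pair's similarity
    """
    sample = list(students[:sample_size])
    rng = random.Random(seed)
    rng.shuffle(sample)
    scores = []
    # Pair neighbours in the shuffled order: (0, 1), (2, 3), ...
    for a, b in zip(sample[0::2], sample[1::2]):
        key = frozenset((a, b))
        if key in score_lookup:
            scores.append(score_lookup[key])
    if not scores:
        raise ValueError("no scored pairs in the sample")
    return median(scores)

# Toy data: four submissions, every pair scored 0.2.
students = ["a", "b", "c", "d"]
score_lookup = {
    frozenset(p): 0.2
    for p in [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
}
baseline = random_pair_baseline(students, score_lookup)
assignment_median = 0.45  # made-up observed median
adj_median = assignment_median - baseline
print(baseline, round(adj_median, 2))
```

Fixing the seed keeps the baseline reproducible between runs, which matters when comparing semesters.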

Step 4: Time Series Analysis Across the Semester

We wanted to see whether similarity scores trend upward as deadlines approach. For each assignment, we recorded how many days before the due date each pair's submissions were uploaded (Codequiry timestamps every submission). Then we binned by “days before deadline” and computed the median similarity for each bin.

Days Before Deadline	Median Similarity	Pairs Count
7+	0.12	1,205
4–6	0.14	3,087
2–3	0.19	8,920
1	0.27	11,344
0 (due day)	0.33	9,612

The pattern is unmistakable: similarity rises sharply in the last 24 hours. We found that 65% of all pairs above 0.70 were submitted within the final 12 hours before the deadline. This isn’t necessarily cheating — students working in close collaboration often finish together — but it’s a clear flag for instructors to investigate.
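The binning behind that table can be sketched with pandas.cut. The days_before column is an assumption: the article does not say which of a pair's two upload times was used, so this sketch expects one whole-day value per pair (for instance, from the later upload). The sample rows are invented.

```python
import pandas as pd

def median_by_deadline_bin(pairs):
    """Median similarity and pair count per days-before-deadline bin."""
    # Right-closed intervals: (-1, 0], (0, 1], (1, 3], (3, 6], (6, inf)
    bins = [-1, 0, 1, 3, 6, float("inf")]
    labels = ["0 (due day)", "1", "2-3", "4-6", "7+"]
    binned = pd.cut(pairs["days_before"], bins=bins, labels=labels)
    return pairs.groupby(binned, observed=False)["similarity_score"].agg(
        median_similarity="median",
        pairs_count="count",
    )

# Made-up pairs, for illustration only.
pairs = pd.DataFrame({
    "days_before": [0, 0, 1, 2, 3, 5, 9],
    "similarity_score": [0.40, 0.20, 0.30, 0.10, 0.30, 0.15, 0.10],
})
result = median_by_deadline_bin(pairs)
print(result)
```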

Step 5: Distinguishing Plagiarism from Legitimate Collaboration

High similarity doesn’t always mean plagiarism. We used Codequiry’s segment-level matching to differentiate: token-level matches across many short segments suggest copy-paste, while a few long AST-structural matches suggest independent solutions to a well-defined problem. We also cross-referenced with group assignments — if two students were in the same team, similarity above 0.70 was expected.

We built a simple classifier rule:

If students A and B were in the same group project: ignore the pair
If similarity > 0.80 AND token matches cover > 30% of the code: flag for human review
If similarity is 0.60–0.80 AND they submitted within 6 hours of each other: flag as suspicious
Otherwise: low priority

This reduced the false positive rate from 14% to 3% when we manually reviewed a random sample of 200 flagged pairs.
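The rule above translates almost directly into code. The sketch below is one possible reading: it assumes token coverage is the share of lines covered by token-type matches (with non-overlapping segments), treats the 0.60 and 0.80 boundaries and the 6-hour window as inclusive, and uses hypothetical function and label names.

```python
def token_coverage(matches, total_lines):
    """Fraction of lines covered by token-type matches (assumes no overlap)."""
    covered = sum(
        m["end_line"] - m["start_line"] + 1
        for m in matches
        if m["type"] == "token"
    )
    return covered / total_lines if total_lines else 0.0

def classify_pair(similarity, coverage, hours_apart, same_group):
    """Apply the four rules in order and return a priority label."""
    if same_group:
        return "ignore"
    if similarity > 0.80 and coverage > 0.30:
        return "human_review"
    if 0.60 <= similarity <= 0.80 and hours_apart <= 6:
        return "suspicious"
    return "low_priority"

# The matches from the sample JSON; the 45-line file length is made up.
matches = [
    {"type": "token", "start_line": 23, "end_line": 40},
    {"type": "ast", "start_line": 55, "end_line": 67},
]
coverage = token_coverage(matches, total_lines=45)
print(classify_pair(0.87, coverage, hours_apart=2.0, same_group=False))
```

Note that a pair above 0.80 with little token coverage falls through to low priority under this rule, which is worth revisiting if your course sees heavily restructured copies.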

Key Findings from 4,200 Submissions

Group projects had the highest raw similarity (median 0.52), but almost all of it was legitimate collaboration. Only 4% of group-project pairs exceeded our suspicious threshold after removing group members.
Take-home exams attracted 3× more high-similarity pairs than in-class exams. One exam was reused across two sections without changes — similarity shot to 0.90 among students who had a friend in the earlier section.
Assignments requiring external libraries (e.g., NumPy) showed lower similarity overall (median 0.14) because students had more ways to structure solutions. Boilerplate-only tasks (e.g., “read a CSV and compute average”) had median 0.35.
Submissions uploaded on the due date had a median similarity 0.15 higher than early submissions, even after controlling for assignment difficulty.

These numbers don’t mean your students are cheating en masse. They mean you can design assignments that naturally reduce the temptation. For example, we changed the string lab to require a custom algorithm step — similarity dropped 40% the next semester.

Turning Analysis into Action: A Repeatable Workflow

You can run this analysis on your own courses with minimal scripting. Here’s the workflow we recommend:

Set up a Codequiry course and import all submissions. Ensure each student has a unique identifier.
Export all pairwise reports via the API or download CSV. We used the Codequiry API documentation to automate this.
Run the aggregation script provided in Step 2. Adjust the similarity threshold based on your language and assignment templates.
Flag suspicious assignments where the high_ratio exceeds, say, 0.08. Investigate those first.
Review flagged pairs manually, focusing on token-level vs. AST-level matches and submission timestamps.
Track across semesters to see if new assignment designs reduce reuse.
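The flagging step in this workflow is a one-line filter on the Step 2 output. A minimal sketch, assuming a frame with the assignment_id and high_ratio columns; the helper name and the sample values are hypothetical.

```python
import pandas as pd

def flag_assignments(agg, threshold=0.08):
    """Return assignments whose high_ratio exceeds the threshold, worst first."""
    flagged = agg[agg["high_ratio"] > threshold]
    return flagged.sort_values("high_ratio", ascending=False).reset_index(drop=True)

# Made-up per-assignment summary, for illustration only.
agg = pd.DataFrame({
    "assignment_id": ["lab1", "lab2", "project1", "exam1"],
    "high_ratio": [0.02, 0.13, 0.09, 0.08],
})
flagged = flag_assignments(agg)
print(flagged)
```

Running the same function on each semester's summary gives a simple way to check whether a redesigned assignment drops off the flagged list.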

The full Python analysis notebook (including visualization) is available on our GitHub — contact support for access. We run this after every semester now, and the insights have reshaped how our department builds programming assignments.

Frequently Asked Questions

What is a normal similarity baseline for Python assignments?
For typical intro CS tasks (loops, conditionals, basic I/O), expect a median pairwise similarity of 0.10–0.15 after excluding students from the same group. Higher than 0.25 warrants attention.

How does Codequiry differentiate token matching from AST matching?
Token matching catches verbatim copy-paste (including renamed variables). AST matching compares the syntax tree structure, so it can detect reordered code or changed control flow. We used both metrics in our analysis — token-matched segments were better indicators of copy-paste, while AST matches often indicated shared problem-solving approaches.
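The general difference between the two views can be illustrated with Python's standard tokenize and ast modules. This is not Codequiry's implementation, only a sketch of the idea: a normalized token stream ignores identifier names, while the syntax tree also ignores surface details such as redundant parentheses.

```python
import ast
import io
import keyword
import tokenize

def normalized_tokens(source):
    """Token stream with identifiers and literals replaced by placeholders."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
    return out

def ast_shape(source):
    """Node-type names in the syntax tree, ignoring identifiers and constants."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

original = "total = 0\nfor x in items:\n    total += x\n"
renamed = "acc = 0\nfor value in data:\n    acc += value\n"
parenthesized = "acc = 0\nfor value in (data):\n    acc += (value)\n"

# Renaming variables leaves the normalized token stream unchanged.
print(normalized_tokens(original) == normalized_tokens(renamed))
# Extra parentheses change the tokens but not the tree structure.
print(normalized_tokens(original) == normalized_tokens(parenthesized))
print(ast_shape(original) == ast_shape(parenthesized))
```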

Should I penalize students for high similarity detected by this analysis?
No. The aggregate analysis is a diagnostic tool for course design, not an honor code enforcement mechanism. Always verify individual cases with manual review and give students the benefit of the doubt. The goal is to reduce opportunities for plagiarism, not to catch more cheaters.

Can I do this analysis without an API key?
Codequiry offers CSV exports for each assignment. You can manually download them and use the same Python scripts. The API just saves time when you have dozens of assignments.
