Independent Performance Study

Detecting more real plagiarism, faster.

A large-scale evaluation comparing Codequiry against legacy code-similarity tools and manual workflows across languages and assignment types.

Rated 4.8/5, trusted by instructors and integrity teams worldwide

  • 99.3% detected matches
  • < 30 min typical turnaround
  • 4B+ source corpus
  • 200 ZIPs per batch

Executive summary

Across a broad, multi-institution evaluation spanning introductory through capstone computer science courses, Codequiry surfaced substantially more actionable plagiarism signals than legacy approaches, while reducing investigator time through higher-quality clustering, collusion indicators, and verifiable source attribution. Evaluations were conducted on assignments covering algorithms, data structures, systems, web development, databases, and ML tooling, with submissions in Python, Java, C/C++, JavaScript/TypeScript, and mixed-language projects.

Key outcome: Institutions reported fewer appeals and faster resolutions due to transparent evidence and reproducible similarity explanations.

In aggregate, Codequiry achieved the highest match discovery rate while maintaining low false positive rates across multiple obfuscation strategies (identifier renaming, whitespace formatting, control-flow restructuring, dead code injection) and collaboration patterns. The platform's graph-based collusion analytics reliably highlighted tightly connected clusters even when individual pairwise similarities were modest, allowing reviewers to prioritize review sets with the highest expected yield.

Operationally, batch processing throughput consistently met instructional timelines. Median turnaround for cohorts under 5,000 files was under thirty minutes, enabling same-day feedback cycles and timely academic integrity interventions. Where legacy methods frequently required multi-day processing and manual triage of noisy matches, Codequiry consolidated evidence into a coherent, explainable report, lowering the cognitive load on faculty and academic integrity staff.

Background and motivation

Academic integrity in computing education is a persistent challenge. Code reuse and collaboration are part of professional practice, yet many courses require individual authorship to fairly assess learning. Public repositories, code sharing platforms, AI code assistants, and large language models have further blurred the line between legitimate reference and misrepresentation of authorship. Traditional string- and token-based similarity tools, while valuable, often struggle to balance recall and precision in modern contexts: small surface-level edits can mask near-identical logic, and conversely, shared scaffolding or framework code can inflate similarity between independent submissions.

This study was initiated in partnership with instructors and academic integrity teams who needed three things: higher recall on genuinely copied logic (even after significant cosmetic edits), lower false positive rates on boilerplate and instructor-provided code, and practical workflows that shorten time-to-resolution without sacrificing due process. The evaluation emphasizes actionable matches with source attribution and explanation quality over raw similarity percentages.

Study design

Scope and institutions

The study covered 24 months of coursework across multiple institutions (community colleges, four-year universities, and professional bootcamps). Courses ranged from CS1/CS2 to upper-division systems and capstones, ensuring representation of differing assignment sizes, structures, and grading rubrics.

Dataset construction
  • Submissions were anonymized and de-identified to comply with FERPA/GDPR standards.
  • Ground truth labels were produced via dual, blind reviewer annotation with adjudication on disagreements.
  • Public-source matches were seeded by querying GitHub, Q&A archives, tutorial code, and known paste sites.
  • Adversarial sets were synthesized to reflect real tactics: identifier renaming, control-flow reordering, comment/style changes, dead code insertion, and partial copying across files.
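
For illustration, the sketch below shows the kind of identifier-renaming transform that can generate one class of adversarial variants. It is a minimal sketch with placeholder names, not the study's actual generators.

    # Minimal sketch of an identifier-renaming transform for Python sources,
    # illustrating how adversarial variants can be synthesized.
    import ast

    class RenameIdentifiers(ast.NodeTransformer):
        def __init__(self):
            self.mapping = {}

        def _alias(self, name):
            # Assign each original name a neutral alias such as v0, v1, ...
            return self.mapping.setdefault(name, f"v{len(self.mapping)}")

        def visit_Name(self, node):
            node.id = self._alias(node.id)
            return node

        def visit_arg(self, node):
            node.arg = self._alias(node.arg)
            return node

        def visit_FunctionDef(self, node):
            node.name = self._alias(node.name)
            self.generic_visit(node)
            return node

    def rename_source(source: str) -> str:
        tree = RenameIdentifiers().visit(ast.parse(source))
        ast.fix_missing_locations(tree)
        return ast.unparse(tree)  # requires Python 3.9+

    print(rename_source("def total(prices):\n    s = 0\n    for p in prices:\n        s += p\n    return s"))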

Baselines and systems compared

We compare Codequiry against widely cited academic similarity detectors and manual review workflows. Each system was configured according to recommended guidelines or defaults, with minor parameter tuning when suggested by maintainers or documentation. Where tools produced scores rather than binary judgements, we evaluated across thresholds to compute precision-recall curves and F1.
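
To make the threshold sweep concrete, the sketch below computes precision, recall, and F1 at each candidate threshold from paired similarity scores and ground-truth labels. The scores and labels are placeholders, not study data.

    # Sketch of a threshold sweep used to derive precision-recall points and
    # the best F1 from scored pairs; scores/labels below are illustrative only.
    def sweep_thresholds(scores, labels):
        points = []
        for t in sorted(set(scores)):
            tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
            fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
            fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
            precision = tp / (tp + fp) if tp + fp else 1.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
            points.append((t, precision, recall, f1))
        return points

    scores = [0.91, 0.83, 0.40, 0.76, 0.12, 0.65]  # tool similarity scores (placeholder)
    labels = [1, 1, 0, 1, 0, 0]                    # 1 = labeled plagiarism pair
    best = max(sweep_thresholds(scores, labels), key=lambda p: p[3])
    print(f"best threshold {best[0]:.2f}: P {best[1]:.2f}, R {best[2]:.2f}, F1 {best[3]:.2f}")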

Reviewer workflow measurement

In addition to accuracy metrics, we measured investigator time per case, number of documents reviewed per resolved case, and appeal rates post-notification. These operational metrics reflect institutional priorities beyond raw detection, capturing downstream impact on staff workload and student outcomes.

Evaluation metrics

  • Precision / Recall / F1: Actionable-match precision emphasizes correctly attributed, policy-relevant matches.
  • Top-k precision: Quality of the first k matches presented to reviewers (k = 5 by default); see the sketch after this list.
  • False positive rate: Over-flagging risk on boilerplate/instructor code and coincidental similarities.
  • Time to first evidence: Median time from submission to first reviewer-visible evidence.
  • Investigative effort: Median minutes spent to reach a confident decision with documentation.
  • Appeal rate: Fraction of notified cases that resulted in formal appeal.
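
As referenced above, top-k precision is simply the share of the first k presented matches that reviewers confirm as policy-relevant. A minimal sketch, with placeholder labels:

    # Sketch of top-k precision: fraction of the first k matches shown to a
    # reviewer that are truly actionable. Ranked labels are placeholders.
    def top_k_precision(ranked_labels, k=5):
        head = ranked_labels[:k]
        return sum(head) / len(head) if head else 0.0

    # 1 = actionable match confirmed by reviewers, in the order presented
    ranked_labels = [1, 1, 0, 1, 1, 0, 1]
    print(top_k_precision(ranked_labels, k=5))  # 0.8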

All metrics were computed per-assignment and then aggregated with stratification by course level. Confidence intervals were estimated using bootstrap resampling. Inter-rater reliability for ground truth labels achieved Cohen's κ = 0.84 overall, indicating strong agreement.
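
For readers who want to reproduce the interval estimation, a minimal percentile-bootstrap sketch follows. The per-assignment F1 values are placeholders, and the percentile method shown is one common choice rather than necessarily the exact procedure used.

    # Minimal percentile-bootstrap sketch for a confidence interval on a metric
    # aggregated across assignments; values below are illustrative placeholders.
    import random

    def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                     n_resamples=10_000, alpha=0.05, seed=0):
        rng = random.Random(seed)
        stats = []
        for _ in range(n_resamples):
            resample = [rng.choice(values) for _ in values]
            stats.append(stat(resample))
        stats.sort()
        lo = stats[int(alpha / 2 * n_resamples)]
        hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
        return lo, hi

    per_assignment_f1 = [0.92, 0.88, 0.90, 0.85, 0.93, 0.89]  # placeholder values
    print(bootstrap_ci(per_assignment_f1))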

Systems compared

We included the following approaches:

  • Codequiry: Hybrid AI + structural similarity with collusion graph analytics, source attribution, and explainable evidence pages.
  • Academic similarity baseline A: Token-based detector commonly referenced in CS pedagogy literature.
  • Academic similarity baseline B: Tree/AST-oriented approach with configurable thresholds.
  • Manual review: Independent TA/instructor review with keyword/search heuristics.

Baselines were executed on identical corpora and provided with the same ignore lists for instructor scaffolding to ensure fairness.
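
To make the shared ignore-list handling concrete, here is a small sketch of a glob-based filter applied identically to every system's input. The patterns are hypothetical examples, not the lists used in the study.

    # Sketch of applying a shared ignore list (instructor scaffolding, vendored
    # libraries) before analysis; the patterns below are hypothetical examples.
    from fnmatch import fnmatch

    IGNORE_PATTERNS = [
        "starter/*",          # instructor-provided scaffolding
        "*_test_harness.py",  # grading harness shipped with the assignment
        "node_modules/*",     # vendored/third-party code
    ]

    def filter_submission(paths):
        return [p for p in paths
                if not any(fnmatch(p, pattern) for pattern in IGNORE_PATTERNS)]

    print(filter_submission([
        "starter/io_utils.py", "solution.py", "node_modules/lodash/index.js",
    ]))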

Per-language results

Language | Codequiry F1 | Baseline A F1 | Baseline B F1 | Manual review precision
Python | 0.92 | 0.78 | 0.74 | 0.81
Java | 0.90 | 0.75 | 0.72 | 0.79
C/C++ | 0.88 | 0.70 | 0.69 | 0.77
JavaScript/TypeScript | 0.89 | 0.73 | 0.71 | 0.76

Values represent medians across assignments within each language; variance was lowest in Python-heavy courses (narrow library surface and consistent style guides) and highest in mixed front-end stacks with generated artifacts.

Robustness to obfuscation

To simulate intentional evasion, we evaluated under five common obfuscation tactics: identifier renaming, whitespace/style reformatting, control-flow reordering, dead-code injection, and mixed-file copying. Codequiry maintained high recall by operating on normalized structural representations supplemented by semantic features and source alignment heuristics. Precision was protected through de-duplication of shared scaffolding and library code suppression.

  • Identifier renaming: Minimal impact on recall due to AST-level matching and token normalization.
  • Reformatting: No observable impact; formatting is removed in preprocessing.
  • Control-flow reordering: Moderate impact mitigated by structural similarity and sequence alignment; typical F1 reduction < 0.03.
  • Dead code injection: Negligible effect due to pruning of unreachable blocks and frequency penalties.
  • Mixed-file copying: Addressed by graph-based multi-file linkage and path-consistency checks.
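
The sketch below illustrates the general idea behind the normalization described above: identifiers and literals are collapsed before k-gram fingerprinting, in the spirit of winnowing (Schleimer et al.). It is a simplified illustration, not Codequiry's actual pipeline.

    # Simplified illustration of token normalization plus k-gram fingerprinting
    # in the spirit of winnowing; not Codequiry's implementation.
    import hashlib
    import io
    import keyword
    import tokenize

    def normalized_tokens(source: str):
        # Keep keywords and operators, collapse identifiers/numbers/strings so
        # renaming and literal tweaks do not change the stream; drop comments.
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
            elif tok.type == tokenize.NUMBER:
                out.append("NUM")
            elif tok.type == tokenize.STRING:
                out.append("STR")
            elif tok.type == tokenize.OP:
                out.append(tok.string)
        return out

    def fingerprints(tokens, k=5, window=4):
        hashes = [int(hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest(), 16)
                  for i in range(len(tokens) - k + 1)]
        # Winnowing: keep the minimum hash in each sliding window.
        return {min(hashes[i:i + window]) for i in range(max(1, len(hashes) - window + 1))}

    a = "def total(prices):\n    s = 0\n    for p in prices:\n        s += p\n    return s\n"
    b = "def acc(xs):\n    r = 0\n    for x in xs:\n        r += x\n    return r\n"
    fa, fb = fingerprints(normalized_tokens(a)), fingerprints(normalized_tokens(b))
    print(len(fa & fb) / len(fa | fb))  # Jaccard overlap of fingerprints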

Collusion analytics

Pairwise similarity alone often misses structured collaboration that spreads fragments across a cohort. Codequiry builds a submission similarity graph and applies community detection to surface clusters indicative of coordinated copying. Reviewers receive cluster-level summaries with the strongest exemplars, saving time otherwise spent inspecting marginal pairs.
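
As a rough illustration of cluster-first review, the sketch below builds a similarity graph over submissions and surfaces connected groups above a threshold. The pair scores are placeholders, and the simple connected-components pass stands in for the platform's more involved community detection.

    # Rough sketch: build a submission similarity graph and surface connected
    # clusters above a threshold; pair scores are placeholders.
    import networkx as nx

    pair_scores = {            # (submission_a, submission_b): similarity score
        ("s01", "s07"): 0.81,
        ("s07", "s12"): 0.77,
        ("s01", "s12"): 0.58,
        ("s03", "s04"): 0.22,
        ("s05", "s09"): 0.74,
    }

    THRESHOLD = 0.6
    G = nx.Graph()
    for (a, b), score in pair_scores.items():
        if score >= THRESHOLD:
            G.add_edge(a, b, weight=score)

    clusters = sorted(nx.connected_components(G), key=len, reverse=True)
    for i, cluster in enumerate(clusters, start=1):
        # Reviewers would start with the largest, most tightly connected cluster.
        print(f"cluster {i}: {sorted(cluster)}")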

Observed effect: In cohorts with suspected collaboration, cluster-first review reduced time-to-resolution by 31% on average while increasing the share of high-confidence findings.

Investigator effort and student outcomes

We measured end-to-end investigator effort from queue intake to documented outcome. With Codequiry's consolidated evidence (side-by-side alignment, line-level diffs, trust scores, and source citations), reviewers reached confident conclusions with fewer document hops and less manual alignment. Importantly, clarity of evidence corresponded with lower appeal rates and faster remediation plans for first-time violations.

  • Median reviewer time per resolved case: down 42% compared to baseline workflows.
  • Documents opened per case: down 36%, driven by integrated evidence pages.
  • Appeal rate after notification: down 21%, attributed to transparent, reproducible evidence.

Throughput, cost, and scalability

Performance was assessed on real-world batch sizes (50–5,000 files) with mixed archive structures. Typical cohorts completed within a single class session, enabling same-day triage. Resource utilization scaled linearly within expected bounds, with backpressure mechanisms during peak periods to preserve latency for interactive use.

  • Median time to first evidence: < 10 minutes for cohorts under 1,000 files.
  • Median batch completion: < 30 minutes for cohorts under 5,000 files.
  • Cost-per-submission: remained within instructional budgets due to incremental analysis and smart deduplication.

Exact timings depend on code size, language mix, and public-source lookup volume. Institutions with very large cohorts may benefit from scheduled batch windows.

Error analysis

False positives primarily stemmed from shared scaffolding, assignment-provided helper code, and common idioms in introductory courses. Suppression lists and instructor-provided ignore patterns mitigated most of these. False negatives occurred in cases of concept-level paraphrasing with substantial algorithmic re-implementation. While reviewers often considered these pedagogically concerning, they did not always meet policy thresholds for plagiarism, underscoring the value of human-in-the-loop judgement.

We observed that richly commented student code occasionally triggered spurious matches when identical comments circulated online. Treating natural language comments separately from code structure reduced this effect without harming recall.
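
A minimal sketch of that separation is shown below: comments are collected apart from the structural token stream so each can be scored independently, and a circulating comment alone does not dominate the similarity signal. The splitting shown here uses Python's tokenizer and is illustrative only, not the production preprocessing.

    # Illustrative split of a submission into a comment stream and a code token
    # stream so each can be scored separately.
    import io
    import tokenize

    def split_comments_and_code(source: str):
        comments, code_tokens = [], []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comments.append(tok.string.lstrip("# ").strip())
            elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.STRING, tokenize.OP):
                code_tokens.append(tok.string)
        return comments, code_tokens

    src = "# classic two-pointer sweep, see tutorial\nleft, right = 0, len(xs) - 1\n"
    comments, code_tokens = split_comments_and_code(src)
    print(comments)      # scored against natural-language sources separately
    print(code_tokens)   # scored with structural similarity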

Limitations

  • Datasets are drawn from participating institutions and may not reflect all course designs or grading policies.
  • Results vary with assignment structure; projects with heavy framework code require careful ignore configuration.
  • Cross-language plagiarism (e.g., Python to Java translations) remains challenging for all methods.
  • AI-generated code raises nuanced authorship questions; detection focuses on similarity and provenance rather than intent.

These limitations argue for transparent evidence, instructor context, and student dialogue, rather than automated adjudication.

Ethics and privacy

All participating institutions adhered to local policy and legal frameworks. Submissions were de-identified, access-controlled, and processed under data protection agreements. Evidence pages are designed to support fair process by providing specific, inspectable matches and enabling students to understand the basis of findings. Logs and audit trails are retained per institutional policy to support review and appeal.

Reproducibility

We emphasize reproducibility through versioned detectors, pinned model checkpoints, consistent preprocessing, and configuration snapshots stored with each analysis. Institutions can re-run cohorts with identical settings to verify outcomes. For external researchers, we provide detailed configuration manifests upon request, subject to data sharing agreements that protect student privacy.
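
As an example of what a configuration snapshot can capture, the sketch below hashes the effective settings together with detector and model versions so a cohort can later be re-run and verified. The field names and values are hypothetical, not Codequiry's manifest schema.

    # Hypothetical configuration-snapshot sketch: hash the effective settings so
    # a cohort can be re-run and verified later. Field names are illustrative.
    import hashlib
    import json

    snapshot = {
        "detector_version": "2024.06.1",          # hypothetical version string
        "model_checkpoint": "sha256:3f9a-trunc",  # pinned checkpoint digest (placeholder)
        "preprocessing": {"strip_comments": True, "normalize_identifiers": True},
        "ignore_patterns": ["starter/*", "node_modules/*"],
        "thresholds": {"pairwise": 0.6, "cluster_min_size": 3},
    }

    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    manifest_id = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    print(f"manifest {manifest_id}: {canonical}")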

Acknowledgments

We thank instructors, TAs, and academic integrity officers across our partner institutions for collaboration on assignment curation, rubric interpretation, and label adjudication. Their feedback directly shaped improvements to evidence presentation, clustering defaults, and ignore pattern tooling.

References

  • Schleimer et al. Winnowing: local algorithms for document fingerprinting.
  • Prechelt et al. JPlag: finding plagiarisms among a set of programs.
  • Roy & Cordy. A survey on software clone detection research.
  • Faidhi & Robinson. An empirical approach for detecting program plagiarism.
  • Burrows–Wheeler and suffix array methods for string similarity at scale.

Citations are representative of the literature informing our evaluation. Full bibliographic entries available upon request.

Methodology at a glance

  • Multi-language corpus spanning Python, Java, C/C++, JavaScript/TypeScript, and more
  • Mixture of clean, lightly modified, and deliberately obfuscated plagiarized submissions
  • Cross-cohort collusion scenarios and public-source matches (repos, Q&A, paste sites)
  • Blind manual review baseline for precision/recall comparison

Courtroom-ready evidence · Transparent scoring · Reproducible results

Results overview

Approach | Detected matches | False positives | Turnaround | Collusion signals
Codequiry (AI + similarity) | Highest | Low | < 30 min typical | Yes
Legacy similarity tools | Moderate | Moderate | Hours to days | Limited
Manual review only | Low | Variable | Days or more | No

Note: Aggregated from multiple cohorts and assignment types; specific numbers vary by class design and dataset.

Real-world impact

Research University
  • Time-to-resolution reduced from weeks to days
  • Clearer evidence lowered appeals volume
  • Faculty adoption increased across CS sequence
Bootcamp Network
  • Higher match discovery on public sources
  • Improved hiring outcomes via integrity policy
  • Streamlined intake with batch ZIP processing

See Codequiry on your coursework

Try a free pilot with real assignments and get a customized report for your institution.

Start free