Automated Code Similarity Checks in a CI Lab Pipeline

Why Run Plagiarism Checks Inside Your CI Pipeline?

For the last three years I’ve been helping a large public university (6,000+ CS students per semester) move plagiarism detection out of the end-of-term panic and into the everyday lab workflow. The traditional model — a TA downloads a zip of submissions mid-semester, runs MOSS on the command line, and emails suspected cases — has two problems: it’s too slow to catch iterative copying, and it places the cognitive burden entirely on human reviewers.

Running similarity checks inside a continuous integration pipeline flips that. Every student push to a private GitHub Classroom or GitLab repo triggers a plagiarism scan against all other submissions in the same assignment cohort. Within minutes the instructor sees a ranked list of flagged pairs, complete with side-by-side comparisons. No zip files. No Friday-night MOSS sessions.

Automation doesn’t replace judgment — it surfaces patterns that human reviewers can’t see when buried under hundreds of files. The CI pipeline is the early-warning system; the professor or TA still makes the final call.

This article covers the architecture, tool choices, and honest failure modes of building that system, based on deployments at two R1 universities and one mid-size enterprise internship program.

Architecture of an Automated Similarity Check Pipeline

Any CI-based plagiarism detector boils down to three stages: collection (grab the source files from the student’s repo), comparison (run similarity analysis against the assignment pool), and reporting (deliver results to graders). The key design decision is how you handle the comparison pool — the set of submissions against which each new push is compared.

Pool Strategies

Full cohort scan: Re-run the comparison across every pair of submissions for that assignment whenever a push arrives. Simple, but O(n²) per push: for a 200-student lab this means 200 × 199 / 2 = 19,900 pairwise comparisons each time. Fine for once-a-day batch runs, not for real-time per-push checks.
Fingerprint cache: Pre-compute a hash or token vector for each submission. New submissions are compared only against the cache. Codequiry’s similarity engine does this internally, returning results in seconds for a cohort of 500+.
Incremental update: On each push, compare only the diff against the entire pool. Works well when assignments evolve across multiple commits but requires careful handling of partial submissions.
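To make the fingerprint-cache idea concrete, here is a minimal sketch in Python. It is not Codequiry's algorithm, which the vendor does not expose through this workflow; it assumes a crude regex tokenizer, hashed runs of k tokens as the fingerprint, and Jaccard overlap as the score. The names fingerprint, jaccard, and FingerprintCache are illustrative.

```python
import hashlib
import re


def fingerprint(source: str, k: int = 5) -> frozenset[int]:
    """Hash every run of k consecutive tokens into a set of integers."""
    # Crude tokenizer (an assumption): words and numbers as units,
    # every other non-space character as its own token.
    tokens = re.findall(r"\w+|[^\w\s]", source)
    grams = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return frozenset(
        int.from_bytes(hashlib.sha1(g.encode()).digest()[:8], "big") for g in grams
    )


def jaccard(a: frozenset[int], b: frozenset[int]) -> float:
    """Set overlap in [0, 1]; 1.0 means identical fingerprints."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class FingerprintCache:
    """Holds one fingerprint per submission ID for a single assignment."""

    def __init__(self) -> None:
        self._cache: dict[str, frozenset[int]] = {}

    def check_and_add(self, submission_id: str, source: str) -> list[tuple[str, float]]:
        """Score a new push against every other cached submission, then cache it."""
        new_fp = fingerprint(source)
        scores = [
            (other_id, jaccard(new_fp, fp))
            for other_id, fp in self._cache.items()
            if other_id != submission_id
        ]
        self._cache[submission_id] = new_fp  # a re-push overwrites the old entry
        return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Each push costs one tokenization plus n - 1 set intersections, instead of re-comparing every pair in the pool.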

We settled on the fingerprint-cache approach using Codequiry’s REST API inside a GitHub Actions workflow. The YAML configuration looks like this:

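name: Plagiarism Check
on: [push]  # run on every student push
jobs:
  similarity:
    runs-on: ubuntu-latest
    steps:
      # Fetch the student's repository contents
      - uses: actions/checkout@v4
      - name: Run Codequiry Check
        uses: codequiry/action@v2
        with:
          api-token: ${{ secrets.CODEQUIRY_API_KEY }}  # API key read from GitHub secrets
          class-id: "cs101-lab3"      # course/assignment identifier
          file-glob: "src/**/*.java"  # source files to upload
          threshold: 0.75             # report pairs scoring above 75%
      # Keep the JSON report as a downloadable build artifact
      - name: Upload Report
        uses: actions/upload-artifact@v4
        with:
          name: plagiarism-report
          path: report.json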

This fires on every push, uploads the new source files, and receives a JSON report listing all similar pairs with scores above 75%. The report is stored as a build artifact that instructors can download from the Actions tab or from a custom dashboard we built on top of the API.
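A grading script can then read the report and rank the pairs. The sketch below is only illustrative: the article does not show the report schema, so the field names "pairs", "a", "b", and "score" are assumptions, as is the helper name flagged_pairs.

```python
import json


def flagged_pairs(report_text: str, threshold: float = 0.75) -> list[tuple[str, str, float]]:
    """Return (id_a, id_b, score) tuples scoring above the threshold, highest first."""
    report = json.loads(report_text)
    pairs = [
        (p["a"], p["b"], float(p["score"]))
        for p in report.get("pairs", [])  # assumed field names, not a documented schema
        if float(p["score"]) > threshold
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Inline sample standing in for the contents of report.json
sample = json.dumps({
    "pairs": [
        {"a": "sub-03", "b": "sub-17", "score": 0.78},
        {"a": "sub-17", "b": "sub-42", "score": 0.91},
        {"a": "sub-03", "b": "sub-42", "score": 0.40},
    ]
})

for id_a, id_b, score in flagged_pairs(sample):
    print(f"{id_a} <-> {id_b}: {score:.2f}")
```

In a real workflow the string would come from the downloaded artifact rather than an inline sample.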

Which Similarity Engine Fits a CI Workflow?

Most professors know MOSS and JPlag. Both are excellent for batch processing but were not designed for per-push automation. MOSS is driven by a Perl submission script that hits a rate-limited endpoint. JPlag’s CLI works locally but becomes unwieldy when you have to manage 50+ assignment pools across multiple terms.

For CI pipelines we found that tools with a REST API — Codequiry, and to a lesser extent some custom Sherlock integrations — integrate cleanly. The API lets you create assignment pools, upload submissions individually, and retrieve results in a structured format (JSON/XML) that a grading script can consume.

Tool	REST API Available?	Per-Push Latency (200 students)	False Positive Rate (student data)
MOSS (Stanford)	No (submission script only)	N/A for per-push	~12% at default sensitivity
JPlag (KIT)	No (CLI only)	~4 min batch	~8% at 0.6 threshold
Codequiry	Yes	<30 s	~5% at 0.75 threshold

These false-positive numbers come from a 2023 internal audit across 1,400 real student submissions. We define a false positive as a pair of submissions flagged as similar that an instructor judged to be legitimate independent work (e.g., both using the same textbook function implementation). No automated tool is perfect, but in our experience a CI pipeline with a lower threshold plus human review beats a purely manual process.

Handling Feedback Loops and False Positives

A common objection we hear: “Won’t this overwhelm TAs with false alarms?” Yes — if you configure it badly. The trick is to separate alerting from reporting.

Our pipeline produces two outputs:

A summary badge on the student’s repo showing whether their submission is “clean,” “reviewing,” or “flagged.” This is visible to the student — it nudges them toward honesty without explicitly accusing.
A ranked instructor report showing only pairs that exceed a configurable threshold. We set that threshold to 0.85 for production alerts, and maintain a separate 0.65 threshold for weekly trend reports that instructors browse at their leisure.
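The split between alerting and reporting can be sketched as follows. The 0.85 and 0.65 thresholds are the ones quoted above; the mapping from scores to the three badge states is an assumption made for illustration, since the text does not spell out how a badge is chosen. Pair, route, and badge are hypothetical names.

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 0.85  # production alerts (value from the article)
TREND_THRESHOLD = 0.65  # weekly trend report (value from the article)


@dataclass(frozen=True)
class Pair:
    id_a: str
    id_b: str
    score: float


def route(pairs: list[Pair]) -> tuple[list[Pair], list[Pair]]:
    """Split scored pairs into (immediate alerts, weekly trend entries), highest first."""
    ranked = sorted(pairs, key=lambda p: p.score, reverse=True)
    alerts = [p for p in ranked if p.score >= ALERT_THRESHOLD]
    trends = [p for p in ranked if TREND_THRESHOLD <= p.score < ALERT_THRESHOLD]
    return alerts, trends


def badge(submission_id: str, pairs: list[Pair]) -> str:
    """Assumed mapping: the submission's highest pair score decides the badge."""
    best = max(
        (p.score for p in pairs if submission_id in (p.id_a, p.id_b)),
        default=0.0,
    )
    if best >= ALERT_THRESHOLD:
        return "flagged"
    if best >= TREND_THRESHOLD:
        return "reviewing"
    return "clean"
```

Keeping both thresholds as named constants makes it easy to tune alert volume without touching the weekly report.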

We also added a resubmission allowlist: if a student’s code is flagged and then cleared by an instructor, a cached fingerprint prevents re-flagging unless the student modifies more than 30% of their code. This avoids wasting repeat review cycles on the same file.
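A rough sketch of that suppression rule is below. It simplifies in two labeled ways: it caches the cleared source text rather than a fingerprint, and it estimates "percent modified" as one minus the line-level difflib similarity ratio. Both are assumptions, as are the names ClearedSubmissions and changed_fraction.

```python
import difflib


def changed_fraction(old_source: str, new_source: str) -> float:
    """Estimate how much changed: 1 minus the line-level similarity ratio."""
    matcher = difflib.SequenceMatcher(
        None, old_source.splitlines(), new_source.splitlines(), autojunk=False
    )
    return 1.0 - matcher.ratio()


class ClearedSubmissions:
    """Remembers submissions an instructor has already reviewed and cleared."""

    def __init__(self, max_change: float = 0.30) -> None:
        self._cleared: dict[str, str] = {}
        self._max_change = max_change  # 30% figure taken from the article

    def clear(self, submission_id: str, source: str) -> None:
        """Record the version of the code that the instructor cleared."""
        self._cleared[submission_id] = source

    def should_suppress(self, submission_id: str, new_source: str) -> bool:
        """True if the new push is close enough to the cleared version to skip re-flagging."""
        old = self._cleared.get(submission_id)
        if old is None:
            return False
        return changed_fraction(old, new_source) <= self._max_change
```

A submission that was never cleared is never suppressed, so first-time flags always reach a reviewer.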

Real Results from Two Semesters

In Fall 2023, the CI pipeline ran on 47 assignments across two introductory Java courses. Compared to the previous semester’s manual batch process, we observed:

2.3× more flagged cases — because the pipeline caught mid-semester copying that would have been hidden by final-submission cleanup.
70% reduction in TA hours spent on plagiarism detection (from ~120 hrs/semester to ~36 hrs).
Student self-reporting improved — several students admitted to copying after seeing the “reviewing” badge on their repo because they knew the pipeline would escalate.

The pipeline became a deterrent, not just a detector. One department head told us that the mere existence of per-push scanning reduced overall similarity by 18% compared to the previous year.

Where It Breaks Down

CI-based plagiarism detection isn’t a silver bullet. Three failure modes remain:

Refactoring-based obfuscation: Students who rename variables, split functions, and reorder lines can still fool purely token-based detectors. Catching this takes AST-level or other structure-aware comparison. Codequiry adds a structural similarity pass that helps, but it’s not 100%.
Cross-language copying: If a student translates Python to Java via an LLM, most pipeline detectors miss it. We started experimenting with intermediate-representation fingerprints at a 2024 workshop, but it’s early.
Coordination overhead in large courses: Maintaining assignment pools for 2,000+ students requires cleaning old pools each term. We wrote a cron job that archives pools after final grades are posted — without it, the API costs would balloon.
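To illustrate the AST-level comparison mentioned in the first failure mode, here is a small sketch. It works on Python sources only, because the standard library ships a Python parser; the Java labs described here would need a Java parser instead. It compares sequences of syntax-node types, so identifier names and literal values drop out. It does not handle reordered statements or split functions, and it is not how Codequiry's structural pass works.

```python
import ast
import difflib


def structure(source: str) -> list[str]:
    """Sequence of AST node type names; identifiers and literal values are ignored."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the node-type sequences of two programs."""
    return difflib.SequenceMatcher(None, structure(a), structure(b), autojunk=False).ratio()


original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
renamed = (
    "def accumulate(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

# Renaming every identifier leaves the node-type sequence unchanged.
print(structural_similarity(original, renamed))
```

Pure renaming scores as identical under this measure, which is the property a text-level diff lacks.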

Frequently Asked Questions

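Do I need to run the CI check on every push, or just on final submissions?
Every push is better for deterrence, but you can limit it to the main branch with a branches filter on the push trigger. We use a GitHub Actions trigger on: [push, pull_request] and skip runs started by Dependabot or other bots with a job-level if: condition on github.actor. Path filters match changed files, not the account that pushed, so they can’t do this.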

How do I handle group projects where sharing is legitimate?
Whitelist team membership. Our pipeline allows tagging a repo with a group label; comparisons are run only across groups, not within them.
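The pairing rule is simple enough to sketch. The snippet assumes a mapping from repo name to group label (both hypothetical here) and builds only the pairs that cross group boundaries.

```python
from itertools import combinations


def comparable_pairs(groups: dict[str, str]) -> list[tuple[str, str]]:
    """All repo pairs whose group labels differ; groups maps repo name -> group label."""
    return [
        (a, b)
        for a, b in combinations(sorted(groups), 2)
        if groups[a] != groups[b]  # skip teammates, whose sharing is legitimate
    ]


labels = {"repo-a": "team1", "repo-b": "team1", "repo-c": "team2"}
print(comparable_pairs(labels))
```

Here repo-a and repo-b are never compared with each other, but each is still compared with repo-c.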

What about privacy and FERPA?
We never store student names in the similarity cache — only anonymous submission IDs. The report is accessible only to instructors on the course roster.
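The text does not say how the submission IDs are derived. One common approach, shown here purely as an assumption, is a keyed hash of the student identifier, so the cache holds a stable pseudonym and only someone with the key can regenerate the mapping. The function name and the sample key are hypothetical.

```python
import hashlib
import hmac


def pseudonymous_id(student_identifier: str, assignment: str, secret_key: bytes) -> str:
    """Derive a stable, non-reversible ID for one student on one assignment."""
    message = f"{assignment}:{student_identifier}".encode()
    # Truncated to 16 hex characters for readability in reports.
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()[:16]


key = b"replace-with-a-real-secret"  # placeholder; keep the real key out of the repo
print(pseudonymous_id("alice@example.edu", "cs101-lab3", key))
```

Including the assignment in the hashed message means the same student gets a different ID per assignment, which limits linking across pools.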

Can I use this for code quality scanning too?
Yes, but that’s a separate pipeline stage. For Python assignments we run pylint and bandit in the same workflow after the plagiarism check. The two results are combined into a single grading dashboard.

Automated similarity checks won’t replace the thoughtful conversation with a student who copied. But they will make sure that conversation happens before the final exam, not after. That alone is worth the YAML.