One Community College's Web Code Plagiarism Strategy

The Web Code Problem in CS 101

In spring 2023, the computer science department at a medium-sized community college (let’s call it “Riverton Community College”) faced a growing problem. Instructors for CS 101: Introduction to Programming noticed that more than a third of final projects contained blocks of code that were verbatim matches to Stack Overflow answers or GitHub repositories. The assignments were simple – build a command-line grade calculator, a tic-tac-toe game, or a basic file parser – but students were lifting whole functions instead of learning to write them.

The department had been using MOSS for intra-class similarity checks, but MOSS only compares submissions within the same course. It couldn’t detect code copied from the open web. Turnitin’s code support was limited, and manual Google searches for each suspicious snippet quickly became unsustainable for a course with 240 students across five sections.

“We had students copy-paste a 60-line ‘is_prime’ function that was identical down to the variable names ‘def is_prime(num):’ – exactly the formatting from a 2019 Stack Overflow answer. The student couldn’t explain what a prime number even means.” — department chair, Riverton CC

The department decided to trial an automated web-source plagiarism detection tool during fall 2023, settling on a combination of Codequiry’s web-source matching and their existing MOSS pipeline.

Setting Up the Two-Stage Detection Pipeline

The workflow had two stages. Stage one was a pre-grading web scan: before any grades were assigned, every student submission was checked against Codequiry’s web-source database, which indexes public code from Stack Overflow, GitHub, GeeksforGeeks, and other common tutorial sites. Any match above 80% character-level similarity was flagged for instructor review.

Stage two was a follow-up intra-class similarity check with MOSS. The reasoning: students who didn’t copy from the web might still copy from each other, and catching web copies early in the term left fewer lifted solutions in circulation for classmates to copy later.

# Example of a flagged snippet (student submission, fall 2023):
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

# Both tools flagged this. Codequiry matched it to a Stack Overflow answer
# from June 2022 (same variable name 'i', same spacing); MOSS, which only
# compares submissions within the course, matched it against classmates' work.

The integration was straightforward. Submissions were collected via Canvas, exported as ZIP files, and uploaded to Codequiry’s dashboard. The tool returned a similarity report with direct links to the original web sources. Instructors could review and export a report within the same session.

Calibrating the Threshold

An initial similarity threshold of 50% proved too low: it generated dozens of false positives from common boilerplate such as import statements, function signatures, and standard loops. After two weeks of trial and error, the department settled on an 80% similarity threshold for web sources, combined with a minimum match length of 15 consecutive tokens. This eliminated most boilerplate-only matches while still catching clearly lifted code.
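To illustrate how the two conditions work together, here is a rough sketch using Python's standard difflib. It is not Codequiry's algorithm, which the case study does not describe: the regex tokenizer, the token-level similarity ratio, and the name `should_flag` are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Identifiers, numbers, or any single non-space character (assumed tokenizer).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def tokenize(source):
    """Split source into identifiers, numbers, and single punctuation marks."""
    return TOKEN_RE.findall(source)


def should_flag(submission, candidate, min_similarity=0.80, min_run=15):
    """Flag only if overall similarity AND the longest shared token run are high."""
    a, b = tokenize(submission), tokenize(candidate)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    longest = matcher.find_longest_match(0, len(a), 0, len(b)).size
    return matcher.ratio() >= min_similarity and longest >= min_run
```

The minimum-run condition is what filters out boilerplate: two files consisting only of the same pair of import lines are 100% similar, but their longest shared run is only a handful of tokens, so they are not flagged.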

Results After One Semester

The fall 2023 semester produced clear numbers. Out of 238 student submissions, Codequiry flagged 52 (21.8%) as having a significant web-source match. After manual review, instructors confirmed plagiarism in 38 of those cases (16.0% of total submissions). That is roughly a 40% relative reduction from the previous semester’s estimated copy-paste rate of 27% (based on the department’s earlier manual sampling).

More importantly, the false-positive rate was low: only 7 out of 52 flagged submissions (13.5%) were false alarms, typically from students who had heavily commented a common algorithm or used an instructor-provided starter template. Instructors could usually dismiss these in under 30 seconds.

| Metric | Spring 2023 (manual checks) | Fall 2023 (automated) |
|---|---|---|
| Total submissions | 215 | 238 |
| Confirmed web-source plagiarism | ~58 (est. 27%) | 38 (16.0%) |
| Manual review time per flagged submission | 12–15 min | 2–3 min |
| False-positive rate | N/A (manual only) | 13.5% |

The department also saved roughly 60 hours of manual googling and code-comparison labor over the semester – time instructors redirected toward designing better, more plagiarism-resistant assignments.
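The percentages reported in this section follow directly from the raw counts. The short calculation below reproduces them; the variable names are ours, and the spring figure is the department's estimate rather than a measured value.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


submissions, flagged, confirmed, false_alarms = 238, 52, 38, 7
spring_rate = 27.0  # estimated from manual sampling, per the department

flag_rate = pct(flagged, submissions)              # 21.8
confirmed_rate = pct(confirmed, submissions)       # 16.0
false_positive_rate = pct(false_alarms, flagged)   # 13.5

# Relative drop in the confirmed rate versus the spring estimate.
relative_drop = round(100 * (spring_rate - confirmed_rate) / spring_rate, 1)  # 40.7

# Flags that were neither confirmed nor dismissed as false alarms.
# The write-up does not say how these were resolved.
uncategorized = flagged - confirmed - false_alarms  # 7
```

Note that the 40% figure is a relative reduction (from 27% to 16%), not a drop of 40 percentage points.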

Challenges and Adjustments Along the Way

No deployment is seamless. The biggest challenge was student pushback during the first two weeks. Several students whose code was flagged argued that “everyone uses Stack Overflow for help” and that the tool was unfair. The department responded by updating its syllabus to explicitly define what constitutes acceptable web use: citing a one-line syntax example is fine, but copying a full algorithm without attribution is not. They also added a required “Sources Used” section to each submission header, which further reduced incidents.
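A required header section is easy to check automatically at collection time. The sketch below assumes the section is a comment line containing the words "Sources Used" somewhere in the file's leading comment block; the exact header format Riverton adopted is not stated, so both the format and the helper name `has_sources_section` are illustrative.

```python
def has_sources_section(source, marker="sources used"):
    """True if the leading comment block mentions the marker (case-insensitive)."""
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are allowed inside the header
        if not stripped.startswith("#"):
            return False  # header ended before the marker appeared
        if marker in stripped.lower():
            return True
    return False
```

A check like this only confirms that the section exists; whether the listed sources are complete still needs a human reader.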

Another issue: cross-semester reuse. A small number of students had friends who had taken the course the previous term and shared old working code. Codequiry’s web database didn’t include the previous semester’s submissions, and the MOSS check compared only the current term’s submissions against each other, so this kind of sharing went undetected. The department now includes the previous two years of submissions in its MOSS run, in addition to the web scan.
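The idea of checking current work against an archive can be sketched in a few lines. This is a simplified stand-in, not MOSS itself: it compares sets of 5-token windows using Jaccard overlap, and the window size, the 0.8 threshold, and all function names are assumptions chosen for illustration.

```python
import re


def shingles(source, k=5):
    """Set of k-token windows; whitespace and layout are ignored."""
    tokens = re.findall(r"\w+|\S", source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Share of distinct k-token windows the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


def archive_matches(current, archive, threshold=0.8):
    """Return (student, archived_id, score) for pairs at or above threshold."""
    archived = {key: shingles(text) for key, text in archive.items()}
    hits = []
    for student, text in current.items():
        mine = shingles(text)
        for key, theirs in archived.items():
            score = jaccard(mine, theirs)
            if score >= threshold:
                hits.append((student, key, round(score, 2)))
    return hits
```

Because the comparison works on tokens rather than raw text, re-indenting an old submission does not hide the match, although renaming variables would in this simplified version.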

Finally, the tool occasionally flagged code that students had not taken from an outside source but that matched a repository owned by the instructor or a teaching assistant. An exclusion-list feature (marking certain known GitHub repos as “exempt sources”) solved that in one afternoon.
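For departments post-processing an exported report themselves, the same exemption can be applied as a simple filter. The record layout (a dict with a `source_url` key) and the repository URL below are hypothetical placeholders, not Codequiry's export format.

```python
# Hypothetical placeholder; replace with the course's own repositories.
EXEMPT_PREFIXES = (
    "https://github.com/example-instructor/",
)


def drop_exempt(matches, exempt_prefixes=EXEMPT_PREFIXES):
    """Keep only matches whose source URL is not under an exempt prefix."""
    prefixes = tuple(exempt_prefixes)
    return [m for m in matches if not m["source_url"].startswith(prefixes)]
```

Matching on a URL prefix rather than an exact URL exempts every file in a starter repository at once.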

Recommendations for Other Institutions

Based on Riverton’s experience, three pieces of advice stand out:

  1. Combine web-source detection with intra-class similarity. Web-only detection misses peer-to-peer sharing. Intra-class detection misses web copies. Together, they cover the vast majority of plagiarism vectors.
  2. Set a clear policy early. Students are less likely to copy if they know exactly what is and isn’t allowed, and that automated tools will catch them. Publish the threshold and the attribution requirement in Week 1.
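  3. Budget for false positives. No tool is 100% accurate. Reserve at least 15 minutes per 100 submissions for human review of flagged items, and more if your flag rate resembles Riverton’s: roughly 22 flags per 100 submissions at 2–3 minutes each comes to about 45 to 65 minutes. The time saved from not doing manual searches more than compensates.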
“We don’t pretend the tool catches everything. But it caught enough to change the culture. Students now know they can’t just copy-paste their way through this course.” — instructor, Riverton CC

Frequently Asked Questions

Can web-source detection find code copied from private repositories?
Generally not. Most tools, including Codequiry, scan only publicly accessible code (Stack Overflow, public GitHub repos, etc.), so code in a private repository is matched only if it also appears in an indexed public source. Catching private code sharing requires intra-class comparison.

What languages does it support?
Codequiry supports a wide range of common teaching languages: Python, Java, C++, JavaScript, C#, and more. This case study focused on Python, but the department also used the tool for a Java-based data structures course.

Is web-source detection different from AI-generation detection?
Yes, completely. Web-source detection checks for exact or near-exact copying from known public sources. AI-generation detection looks for statistical signatures of LLM-written code, such as uniform comment spacing, unusually clean variable names, and low perplexity. Some modern platforms combine both signals.

How do I integrate this with my existing LMS?
Codequiry offers a Canvas integration and a general-purpose API. Riverton used a simple export-upload workflow, but the direct integration further reduces overhead.