Can Dev Teams Trust Code Similarity for IP Theft Detection?

Every week, a new story surfaces about a developer who forked a private repository and pushed it to a personal GitHub account, or a contractor who reused proprietary code in a competitor's product. These aren't academic honor code violations. They’re intellectual property disputes that can cost companies millions in legal fees and lost competitive advantage.

Code similarity detection—the same technique universities use to catch students copying assignments—is increasingly being pitched to enterprises as a solution. But the gap between catching a CS101 student who renamed variables in a bubble sort and detecting a senior engineer who ported an entire trading algorithm from C++ to Rust is vast.

Before trusting these tools with decisions that may end up in court, engineering teams need to understand where similarity detection works, where it breaks down, and what they should actually be measuring.

Why String Matching Fails in Enterprise Codebases

The naive approach to code similarity—simple token or string comparison—falls apart almost immediately in professional settings. Consider this real scenario: an ex-employee of a fintech startup joins a competitor and rewrites the company's order matching engine in a different programming language. The logic is identical. The architecture mirrors the original. But the syntax is completely different.

A token-based tool like MOSS (Measure Of Software Similarity) will report near-zero similarity because the lexemes—the actual programming language tokens—have changed entirely. MOSS works well for Java-to-Java comparisons where students are working in the same language and environment. Cross-language reuse renders it blind.
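
To see why, it helps to look at what a token-based comparator actually computes. The sketch below is ours, not MOSS's actual implementation: it normalizes Java source into a token stream where every identifier collapses to ID and every literal to LIT. Renaming variables changes nothing, but the keyword and operator vocabulary is language-specific, which is exactly why a cross-language port shares almost no tokens with the original.

import java.util.*;
import java.util.regex.*;

// A minimal sketch of token-level normalization, the core idea behind
// token-based comparators. Identifiers and literals collapse to generic
// tokens, so renaming changes nothing -- but the vocabulary is tied to
// one language's keywords and operators.
public class TokenNormalizer {

    private static final Set<String> JAVA_KEYWORDS = Set.of(
        "public", "private", "class", "void", "int", "return",
        "if", "else", "for", "while", "new", "final");

    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_][A-Za-z0-9_]*|\\d+|==|!=|<=|>=|[{}()\\[\\];,.=+\\-*/<>]");

    public static List<String> normalize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            if (JAVA_KEYWORDS.contains(t)) tokens.add(t);            // keep keywords
            else if (Character.isDigit(t.charAt(0))) tokens.add("LIT"); // collapse literals
            else if (Character.isLetter(t.charAt(0)) || t.charAt(0) == '_')
                tokens.add("ID");                                    // collapse identifiers
            else tokens.add(t);                                      // keep operators/punctuation
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Renamed variables, identical normalized token streams:
        System.out.println(normalize("int total = a + b;")); // [int, ID, =, ID, +, ID, ;]
        System.out.println(normalize("int sum = x + y;"));   // [int, ID, =, ID, +, ID, ;]
    }
}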

AST-based comparators, which analyze the abstract syntax tree structure, fare better but still struggle when control flow is restructured. If a developer converts a recursive function into an iterative one, or flattens a deeply nested condition into a guard-clause pattern, the AST transforms substantially even though the intent remains identical.
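
A contrived illustration of that failure mode: the two methods below compute the same function, yet their syntax trees share little beyond a return statement.

// Two functionally identical implementations whose ASTs diverge sharply.
public class SumToN {

    // Recursive form: AST is an if-statement guarding a return of a
    // binary expression containing a self-call.
    static long sumRecursive(long n) {
        if (n <= 0) return 0;
        return n + sumRecursive(n - 1);
    }

    // Iterative form: AST is a local variable, a for-loop, and a return.
    static long sumIterative(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) total += i;
        return total;
    }

    public static void main(String[] args) {
        // Identical outputs for identical inputs -- the semantic signal
        // a syntax-level comparator never sees.
        System.out.println(sumRecursive(100) == sumIterative(100)); // true
    }
}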

The fundamental problem: similarity tools measure syntactic proximity, but IP theft is about semantic copying. We're trying to catch a thief by fingerprinting their handwriting when they've learned to type.

What Enterprises Actually Need to Detect

Through our work with enterprise clients at Codequiry, we’ve observed three distinct patterns of IP theft that similarity tools must address:

  1. Wholesale file copying – The most trivial case. A contractor uploads a complete module to a public repo or a competing organization. Standard hash-based deduplication catches this.
  2. Deliberate obfuscation – Variable renaming, whitespace changes, comment removal, and function reordering. Token-based tools with winnowing (like MOSS) handle this reasonably well; a simplified winnowing sketch follows this list.
  3. Semantic porting – Rewriting the same logic in a different language or different paradigm. This is where nearly all existing tools fail, and it's the most dangerous form of theft because it's hardest to prove.
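
To make pattern 2 concrete, here is a deliberately simplified winnowing sketch in the spirit of the Schleimer et al. (2003) scheme that MOSS builds on. Real winnowing also records match positions and breaks ties toward the right; we omit both here.

import java.util.*;

// A simplified winnowing fingerprinter: hash every k-gram of the
// normalized token stream, slide a window of w consecutive hashes,
// and keep the minimum hash from each window. Matching fingerprints
// between two files survive renaming, reordering, and whitespace edits.
public class Winnower {

    // k = k-gram length, w = window size; both are tuning parameters.
    public static Set<Integer> fingerprints(List<String> tokens, int k, int w) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= tokens.size(); i++) {
            hashes.add(String.join(" ", tokens.subList(i, i + k)).hashCode());
        }
        Set<Integer> selected = new HashSet<>();
        for (int i = 0; i + w <= hashes.size(); i++) {
            selected.add(Collections.min(hashes.subList(i, i + w)));
        }
        return selected;
    }

    // Jaccard similarity over fingerprint sets is a common match score.
    public static double similarity(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) return 0.0;
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / union.size();
    }
}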

For category 3, no off-the-shelf code similarity tool provides strong guarantees. What works better is a combination of dependency analysis, data flow tracking, and behavioral profiling. If a new codebase uses the same third-party APIs in the same sequence, processes the same data structures, and produces identical outputs from identical inputs, the probability of independent development approaches zero—even if the code looks nothing alike.
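
A minimal sketch of the behavioral-profiling idea, assuming you can isolate comparable entry points in both codebases; the two functions below are stand-ins for the original and suspect implementations:

import java.util.Random;
import java.util.function.LongUnaryOperator;

// Differential testing: feed two implementations identical randomized
// inputs and measure how often their outputs agree. Sustained exact
// agreement over a large input space is strong evidence of shared logic
// even when the source looks nothing alike.
public class DifferentialProbe {

    public static double agreementRate(LongUnaryOperator original,
                                       LongUnaryOperator suspect,
                                       int trials, long seed) {
        Random rng = new Random(seed); // fixed seed so runs are reproducible
        int agree = 0;
        for (int i = 0; i < trials; i++) {
            long input = rng.nextLong();
            if (original.applyAsLong(input) == suspect.applyAsLong(input)) {
                agree++;
            }
        }
        return (double) agree / trials;
    }

    public static void main(String[] args) {
        // Stand-ins for the original and suspect implementations:
        // superficially different code, identical behavior.
        LongUnaryOperator original = n -> n << 1;  // doubling via left shift
        LongUnaryOperator suspect  = n -> n + n;   // doubling via addition
        System.out.println(agreementRate(original, suspect, 100_000, 42L)); // 1.0
    }
}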

Measuring What Matters: Dependency Graphs and Call Patterns

Instead of comparing source text, consider comparing the structural fingerprint of a codebase. A dependency graph analysis extracts every external call, library import, and service endpoint invocation. Two codebases that implement the same payment processing pipeline will have nearly identical dependency graphs, regardless of whether they're written in Python or Go.
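
One language-independent ingredient of such a fingerprint is the set of external endpoints a codebase calls. String literals naming URL paths tend to survive a rewrite untouched, so their overlap is a useful cross-language signal. The sketch below is illustrative only; the regex and the Jaccard scoring are our assumptions, not any particular tool's design.

import java.util.*;
import java.util.regex.*;

// Extracts URL-like string literals and compares the resulting sets.
// The same endpoint strings show up whether the caller is Python or Go.
public class EndpointFingerprint {

    // Matches quoted string literals containing a URL-ish path.
    private static final Pattern ENDPOINT =
        Pattern.compile("\"(/[\\w\\-./{}]+|https?://[\\w\\-./{}]+)\"");

    public static Set<String> extract(String source) {
        Set<String> endpoints = new TreeSet<>();
        Matcher m = ENDPOINT.matcher(source);
        while (m.find()) endpoints.add(m.group(1));
        return endpoints;
    }

    // Jaccard overlap between two endpoint sets, regardless of language.
    public static double overlap(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // The same endpoint literal, lifted from two different languages:
        String python = "resp = session.post(\"/v2/orders/match\", data=payload)";
        String go     = "resp, err := client.Post(\"/v2/orders/match\", body)";
        System.out.println(overlap(extract(python), extract(go))); // 1.0
    }
}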

We've seen this work in practice. A mid-sized SaaS company recently discovered that a former lead engineer had built the core of their recommendation engine for a competitor. The code was in a different language (Java vs. the original Python) and used different libraries. But the dependency graph—calls to specific API endpoints, the sequence of database queries, the data transformation pipeline order—was a 94% match.

The dependency graph approach is not yet standard in most similarity tools. MOSS ignores it entirely. JPlag looks at structural elements but not at the call-graph level. This is an area where we're seeing significant development, and Codequiry's enterprise tier has begun incorporating call-pattern fingerprints as an additional similarity dimension.

The False Positive Problem in Repo-Scale Analysis

In a university setting, a 40% similarity score between two assignments raises eyebrows. In an enterprise with a monorepo containing millions of lines of code, a 40% similarity score between two modules is often just normal. Common design patterns, idiomatic library usage, and even mandated coding standards produce mechanical similarity that is not evidence of copying.

// Example: two different teams implementing the same API client
// Team A
public class PaymentService {
    private final HttpClient client;
    private final String apiKey;
    
    public PaymentService(HttpClient client, String apiKey) {
        this.client = client;
        this.apiKey = apiKey;
    }
    
    public PaymentResponse process(PaymentRequest request) {
        // ... implementation
    }
}

// Team B (six months later, different department)
public class PayoutHandler {
    private final HttpClient http;
    private final String authToken;
    
    public PayoutHandler(HttpClient http, String token) {
        this.http = http;
        this.authToken = token;
    }
    
    public PaymentResult execute(PaymentRequest req) {
        // ... implementation
    }
}

A token-based tool might flag 60-70% similarity between these classes. The names differ. The method signatures differ. But the structural pattern—a class with an HttpClient field, a constructor injecting dependencies, a single public method taking a request object—is simply the standard Java idiom for an HTTP client wrapper. You'd need to set thresholds carefully to avoid drowning in false positives.

In practice, enterprise teams should tune similarity tools to look for >85% similarity at the file level, and >90% at the module or package level. Lower thresholds produce noise that overwhelms investigation capacity.
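
As a sketch of what that tuning looks like in code, using the numbers above; the Match record and its scope names are hypothetical:

import java.util.List;

// Tiered threshold filtering: a match must clear the threshold for its
// scope (0.85 at file scope, 0.90 at module scope) to reach a reviewer.
public class MatchFilter {

    public record Match(String left, String right, String scope, double score) {}

    static final double FILE_THRESHOLD = 0.85;
    static final double MODULE_THRESHOLD = 0.90;

    public static List<Match> triage(List<Match> matches) {
        return matches.stream()
            .filter(m -> m.score() >= ("file".equals(m.scope())
                                       ? FILE_THRESHOLD : MODULE_THRESHOLD))
            .toList();
    }
}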

License Violation Detection: A Different Beast

One area where code similarity does work well in enterprise settings is detecting open-source license violations. If an engineer copies a GPL-licensed library into a proprietary codebase without respecting the license terms, the source is usually copied verbatim or with minimal modification. Token-based similarity catches this reliably.

The growing threat here isn't deliberate IP theft but careless reuse: unvetted dependencies and copy-paste from Stack Overflow. A developer might paste a snippet into a production codebase without ever checking its terms, and Stack Overflow content is licensed under CC BY-SA, a share-alike license that creates real complications for proprietary software. A similarity scanner that cross-references code against known open-source repositories can flag these snippets.

We've seen enterprise security teams scan their entire codebase quarterly against a corpus of open-source projects, looking for unacknowledged code. The typical hit rate is 2-5% of scanned files containing some amount of unattributed open-source code. Most of it is harmless utility functions, but every now and then a core module turns out to be a direct copy of a GPL library.
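
Mechanically, that quarterly scan is an index lookup. A minimal sketch, reusing the winnowing fingerprints from earlier and assuming a pre-built corpus index; the origin labels are illustrative:

import java.util.*;

// Corpus cross-referencing: pre-index fingerprints of known open-source
// files (e.g. via Winnower.fingerprints above), then look up each of your
// files against the index. Many shared fingerprints from one origin
// suggest verbatim or near-verbatim copying.
public class LicenseScanner {

    // fingerprint -> origin label, e.g. "some-project (GPL-3.0)"
    private final Map<Integer, String> corpusIndex = new HashMap<>();

    public void index(String origin, Set<Integer> fingerprints) {
        for (int fp : fingerprints) corpusIndex.put(fp, origin);
    }

    // Returns each matching origin and how many fingerprints it shares
    // with the scanned file.
    public Map<String, Integer> scan(Set<Integer> fileFingerprints) {
        Map<String, Integer> hits = new HashMap<>();
        for (int fp : fileFingerprints) {
            String origin = corpusIndex.get(fp);
            if (origin != null) hits.merge(origin, 1, Integer::sum);
        }
        return hits;
    }
}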

Practical Recommendations for Engineering Teams

If you're evaluating code similarity tools for enterprise IP protection, here's what we've learned works:

  • Run multiple comparators in parallel. No single tool catches everything. Use a token-based approach for within-language detection (MOSS or Codequiry's standard engine), an AST-based tool for obfuscated code, and a dependency-graph analyzer for cross-language porting. Aggregate results and look for overlap.
  • Build a baseline of normal similarity. Run the tool against your entire codebase and record the distribution of similarity scores. What's "normal" for your organization? A startup using a single framework will have much higher baseline similarity than a large company with diverse tech stacks. Set thresholds based on your data, not vendor defaults (see the percentile sketch after this list).
  • Invest in manual review processes. Every similarity match needs human interpretation. The false positive rate for cross-module comparisons in a monorepo can exceed 90% at the 50% similarity threshold. You need engineers who understand the codebase to examine flagged matches and determine whether they represent theft, coincidence, or standard practice.
  • Combine with behavioral data. Code similarity alone is rarely enough to prove IP theft. Pair it with commit log analysis, access logs, and deployment records. If a developer checked out a repository at 2 AM, committed code that matches another module with 88% similarity, and resigned the next week, you have a much stronger case than similarity alone provides.
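
As promised in the second recommendation, a minimal sketch of data-driven thresholding: collect pairwise scores from your own codebase, then flag only what clears a high percentile of that empirical distribution. The sample scores below are hypothetical.

import java.util.Arrays;

// Derive an alert threshold from your own score distribution instead of
// a vendor default.
public class BaselineThreshold {

    // Returns the score at the given percentile (e.g. 0.99) of the sample.
    public static double percentile(double[] scores, double p) {
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    public static void main(String[] args) {
        // Hypothetical baseline: pairwise scores sampled from a monorepo.
        double[] baseline = {0.12, 0.35, 0.41, 0.38, 0.44, 0.29, 0.52, 0.47};
        double threshold = percentile(baseline, 0.99);
        System.out.printf("flag matches above %.2f%n", threshold);
    }
}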

Where the Industry Is Going

The next generation of code integrity tools is moving toward semantic fingerprints—representations of what code does rather than how it looks. Graph neural networks trained on execution traces can identify functionally equivalent programs even when their source code is completely different. These techniques are still research-stage, but they'll eventually make cross-language IP theft detection practical.

For now, enterprises should not expect a single tool to solve the problem. Code similarity is a useful signal, not a definitive judgment. It works best as part of a layered detection strategy that includes dependency analysis, behavioral monitoring, and manual review.

The real lesson from the academic world is not about tool capabilities. It's about process: similarity detection is a triage mechanism, not a verdict. Universities learned this the hard way after false accusations from over-reliance on MOSS scores. Enterprises should take note before they make the same mistakes with much higher stakes.