The Due Diligence That Unraveled Everything
“The code is clean, modern, and scalable. The team is sharp. The metrics are hitting all our targets.”
Sarah Chen, CTO of ApexLedger, recited the talking points to her co-founder as they prepped for their final Series B pitch. The $20 million round would let them triple their engineering team and launch in three new markets. Their platform—a real-time blockchain transaction layer for institutional finance—was complex, but their internal metrics showed low defect rates and high deployment frequency. They felt bulletproof.
The lead investor, a prestigious Silicon Valley firm, sent over their standard technical due diligence checklist two weeks before the term sheet signing. Item 17 read: “Complete audit of all open-source software dependencies, including full license compliance and provenance documentation.”
“Standard boilerplate,” their VP of Engineering shrugged. “We use a lot of open source. Who doesn’t? We’ll run a scanner and send them the report.”
They ran a basic dependency scanner. It generated a 40-page PDF listing hundreds of libraries. The engineering team skimmed it, noted the common licenses (MIT, Apache 2.0), and attached it to the data room. They missed the warning buried on page 32: “Scan limited to declared package manager dependencies. Does not analyze inlined, copied, or modified source code.”
The investor’s external audit firm didn’t miss it.
The Copied Function That Wasn't MIT Licensed
One week later, Sarah’s phone buzzed with a calendar invite titled “Urgent: Due Diligence Findings.” The lead auditor, a grim-faced lawyer named David, joined the video call with two technical analysts.
“We’ve completed our preliminary scan of your codebase,” David began, his tone neutral. “We used a combination of tools, including FossID and a custom scanner, to perform a full-text and AST-based similarity analysis against known open-source repositories. We’ve identified what we call ‘non-packaged incorporations.’”
“Incorporations?” Sarah asked.
“Code copied directly into your codebase. Not included via a package manager. Some of it is from Stack Overflow, which is generally manageable, though those snippets carry CC BY-SA terms of their own. A significant portion, however, comes from copyleft-licensed projects.”
He shared his screen. On the left was a file from ApexLedger’s core transaction engine, `crypto_utils.py`. On the right was a file from a GitHub repository called `fast-ecc`, a library for elliptic-curve cryptography.
```python
# In ApexLedger's crypto_utils.py
def double_point(x, y, p, a):
    """Double a point on an elliptic curve."""
    if y == 0:
        return None, None
    s = ((3 * x * x + a) * pow(2 * y, -1, p)) % p
    x_r = (s * s - 2 * x) % p
    y_r = (s * (x - x_r) - y) % p
    return x_r, y_r
```

```python
# In the fast-ecc library (GPLv3 licensed)
def _point_double(x1, y1, p, a):
    """Point doubling for short Weierstrass curves."""
    if y1 == 0:
        return (None, None)
    lam = ((3 * x1 * x1 + a) * modular_inverse(2 * y1, p)) % p
    x3 = (lam * lam - 2 * x1) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
```
“The logic is identical,” David pointed out. “Variable names are changed, the docstring is reworded, and you’ve used Python’s built-in `pow` for the modular inverse instead of a helper function. But the structure, the algorithm, and the edge-case handling all match. It’s a direct derivative. The `fast-ecc` library is licensed under GPLv3.”
Sarah felt a cold knot form in her stomach. The GPLv3 is a “copyleft” license: if you distribute software that incorporates GPLv3 code, the combined work must be released under GPLv3 as well, with its source code made available to recipients. ApexLedger’s core transaction engine was their crown jewel, their proprietary secret sauce. Releasing it as open source would vaporize their valuation.
“How much is there?” she asked, her voice tight.
“We’ve found 47 separate code segments, ranging from 15 to 400 lines, copied or heavily derived from GPLv2, GPLv3, and LGPL projects. They’re embedded across your core modules. Our estimate is that 8% of your business-critical logic is built on code with incompatible licenses.”
“You don’t have a dependency problem. You have a provenance problem. Your codebase contains undocumented, unlicensed software DNA.”
The lead technical analyst explained the gap. Standard dependency scanners look at `package.json`, `requirements.txt`, or `pom.xml`. They see `fast-ecc` isn’t listed, so they report no issue. But they are blind to code that was copied and pasted years ago, perhaps by a founding engineer under a deadline, who found a perfect function on GitHub and didn’t think to check the LICENSE file.
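The blind spot is easy to reproduce. The snippet below is a minimal sketch of manifest-only scanning, not the behavior of any particular product; the manifest contents and the `copyleft_packages` deny-list are hypothetical. It parses a `requirements.txt`-style file and checks the declared names, which is why a function pasted into `crypto_utils.py` never shows up.

```python
import re


def declared_packages(requirements_text: str) -> set[str]:
    """Return package names listed in a requirements.txt-style manifest."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Keep only the leading package name, discarding version specifiers.
        match = re.match(r"[A-Za-z0-9._-]+", line)
        if match:
            names.add(match.group(0).lower())
    return names


# Hypothetical manifest: fast-ecc was copied in by hand, never installed.
manifest = """
requests==2.31.0
cryptography>=41.0  # declared dependency
"""

# Hypothetical deny-list of copyleft packages known to the scanner.
copyleft_packages = {"fast-ecc"}

flagged = declared_packages(manifest) & copyleft_packages
print(flagged)  # nothing is flagged: the scan never opens the source files
```

The scan is correct about what it was asked. It was simply never asked to read the code.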
The Scramble and the True Cost
The investment was put on hold indefinitely. The term “material misrepresentation” was used. Sarah’s board was in panic mode.
Her engineering team argued. “It’s just a math function! The algorithm is public domain! You can’t copyright an equation!”
They were right about the math, but wrong about the law. An algorithm cannot be copyrighted, but its expression, the specific code someone wrote, generally is. A clean-room implementation, where engineers rewrite the logic from a specification without ever looking at the original code, is generally considered defensible. A copy-paste with renamed variables is not.
They faced three terrible options:
- Re-license under GPLv3: Business suicide. Their IP would be public.
- Negotiate with copyright holders: For 47 different code segments? From individual developers, universities, and corporations? The process would take years and cost millions in legal fees with no guarantee of success.
- Find and replace every violating segment: A massive, error-prone engineering undertaking that would freeze feature development for months.
They chose door number three. They brought in a specialist firm. The process was brutal.
First, they needed a complete bill of materials for their entire codebase, not just the packages. They used advanced code similarity scanning to build a map. Tools like Codequiry, which are built for plagiarism detection, proved unexpectedly vital for this forensic provenance work. The scanners compared their code against massive databases of open-source projects, using fingerprinting and AST analysis to find matches even after refactoring.
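To see why renamed variables don’t fool this kind of analysis, here is a toy sketch of AST-based comparison using only Python’s standard library. It is an illustration under simplifying assumptions, not how Codequiry, FossID, or any commercial scanner actually works: it flattens each function into a sequence of syntax-node types, drops identifiers and docstrings, and compares the sequences with `difflib`. The `structural_tokens` and `similarity` helpers and the `UNRELATED_SRC` sample are hypothetical names introduced for this example.

```python
import ast
import difflib
import textwrap


def structural_tokens(source: str) -> list[str]:
    """Flatten source into AST node-type tokens, ignoring names and docstrings."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        # Skip bare string expressions (docstrings); they are not logic.
        if (isinstance(node, ast.Expr)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            return
        if isinstance(node, ast.Constant):
            tokens.append(f"Constant:{node.value!r}")  # keep literal values
        else:
            tokens.append(type(node).__name__)  # identifiers are dropped
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(textwrap.dedent(source)))
    return tokens


def similarity(a: str, b: str) -> float:
    """Return a 0..1 structural similarity score between two code snippets."""
    matcher = difflib.SequenceMatcher(
        None, structural_tokens(a), structural_tokens(b), autojunk=False
    )
    return matcher.ratio()


APEX_SRC = '''
def double_point(x, y, p, a):
    """Double a point on an elliptic curve."""
    if y == 0:
        return None, None
    s = ((3 * x * x + a) * pow(2 * y, -1, p)) % p
    x_r = (s * s - 2 * x) % p
    y_r = (s * (x - x_r) - y) % p
    return x_r, y_r
'''

FAST_ECC_SRC = '''
def _point_double(x1, y1, p, a):
    """Point doubling for short Weierstrass curves."""
    if y1 == 0:
        return (None, None)
    lam = ((3 * x1 * x1 + a) * modular_inverse(2 * y1, p)) % p
    x3 = (lam * lam - 2 * x1) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
'''

# Hypothetical unrelated function, included only as a baseline.
UNRELATED_SRC = '''
def greet(name):
    return "hello " + name
'''

print(f"apex vs fast-ecc:  {similarity(APEX_SRC, FAST_ECC_SRC):.2f}")
print(f"apex vs unrelated: {similarity(APEX_SRC, UNRELATED_SRC):.2f}")
```

Because identifiers never enter the token stream, renaming `lam` to `s` changes nothing, and swapping the helper call for `pow` alters only a few tokens. A production scanner would add fingerprint indexing so it can search millions of files instead of comparing pairs one at a time.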
The results were plotted on a massive dashboard. Red dots—license violations—clustered in their cryptographic and data serialization modules.
The rewrite was a lesson in humility. Senior engineers, who thought they were writing pristine code, had to admit they’d taken “inspiration” too far. Junior developers had copied entire utility classes from tutorials that were, unknown to them, snippets from GPL projects.
Six months and $1.8 million in direct costs later, they had a “clean” codebase. They lost their lead investor. They burned through half their runway. Morale was shattered. They launched no new features that year. Two key senior engineers quit, frustrated by the “legal scavenger hunt” that replaced real engineering work.
The Lessons That Every Startup and Enterprise Must Learn
ApexLedger survived. Six months later they closed a smaller down round with investors who appreciated the now-airtight compliance posture. But their story is a cautionary tale.
1. Dependency Scanners Are Not Enough. The modern software supply chain has two layers: the packaged layer (npm, PyPI, Maven) and the inline layer (copied code, snippets, legacy internal libraries). You must scan both. This requires tools that perform deep code similarity analysis, not just package manifest parsing.
2. Provenance is a Feature, Not an Afterthought. Every significant code block should have a traceable origin. Was it written in-house? Generated by an approved tool? Inspired by an Apache 2.0 licensed project? This metadata needs to be tracked, ideally automatically at commit time.
3. Compliance is a Pipeline, Not a Report. You can’t fix this at the end. It must be integrated into the development workflow. New pull requests should be automatically scanned for code similarity against known open-source repositories. The pipeline should flag a match with a restrictive license before the code is merged.
4. The Cultural Fix is Harder Than the Technical One. Engineers optimize for solving problems. “Just copy the code that works” is a powerful instinct. Companies must train engineers on software IP from day one. Make it as fundamental as security. An honor code for developers: know the origin of every line you don’t write.
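The pipeline idea in lesson 3 can be sketched in a few lines. What follows is a minimal illustration, assuming a similarity scanner has already produced a list of matches for a pull request. The `Match` record, the file paths, the `example-retry-lib` project, the 0.8 threshold, and the `RESTRICTED` deny-list are all hypothetical; a real policy would be set with legal counsel. The license strings follow SPDX identifier conventions.

```python
from dataclasses import dataclass

# Assumption: an illustrative deny-list of SPDX license identifiers.
RESTRICTED = {"GPL-2.0-only", "GPL-3.0-only", "LGPL-3.0-only"}


@dataclass(frozen=True)
class Match:
    """One similarity hit reported by a (hypothetical) code scanner."""
    path: str            # file in our repository
    source_project: str  # open-source project the code resembles
    license_id: str      # SPDX identifier of that project's license
    similarity: float    # 0..1 score from the scanner


def blocking_matches(matches: list[Match], threshold: float = 0.8) -> list[Match]:
    """Return matches that should stop a merge: restricted license, high similarity."""
    return [
        m for m in matches
        if m.license_id in RESTRICTED and m.similarity >= threshold
    ]


# Hypothetical scanner output for one pull request.
matches = [
    Match("core/crypto_utils.py", "fast-ecc", "GPL-3.0-only", 0.97),
    Match("api/retry.py", "example-retry-lib", "MIT", 0.91),
]

blockers = blocking_matches(matches)
for m in blockers:
    print(f"BLOCK {m.path}: {m.similarity:.0%} similar to "
          f"{m.source_project} ({m.license_id})")

# In CI, a non-empty blockers list would translate into a failing exit code.
```

The permissively licensed match still belongs in the provenance record, since MIT requires attribution, but it does not need to block the merge.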
Sarah Chen implemented a new rule at ApexLedger. Every Friday, engineers spend one hour reviewing the automated code provenance report. It shows new similarity matches, their licenses, and their locations.
“It’s not about blame,” she says now. “It’s about visibility. That report is our insurance policy. We found out the hard way that in the eyes of investors and acquirers, your code’s legal integrity is just as important as its functional integrity. Maybe more important. A bug can be fixed. A GPL violation in your core can kill the company.”
The startup world is littered with corpses of companies that moved fast and broke things. ApexLedger’s story is a reminder that sometimes, the thing you break isn’t your product—it’s the legal foundation it sits on. And that foundation is made of code, every line of which has an owner.