How Open Source License Auditing Actually Works

Most engineering teams understand that using open-source software comes with obligations. The problem isn't ignorance; it's the scale and complexity of modern dependency trees. A typical Node.js or Python project can pull in hundreds, even thousands, of transitive dependencies, each with its own license. Tracking this by hand is impractical. The solution is a systematic, automated audit.

An open-source license audit isn't a one-time legal exercise. It's an integrated part of the software development lifecycle. When done correctly, it provides a continuous bill of materials for your software, highlighting potential conflicts before they become legal or operational problems. This guide provides a concrete, step-by-step methodology for conducting such an audit.

"License compliance is a build-time problem, not a lawyer-time problem. The earlier you find an incompatible license, the cheaper it is to fix." – Senior Engineer, FinTech Platform

The Foundation: Understanding License Types and Obligations

Before scanning a single line of code, you must know what you're looking for. Open-source licenses generally fall into three broad categories, each with different requirements.

Permissive (MIT, Apache 2.0, BSD): These impose minimal conditions, typically requiring only attribution and preservation of copyright notices. They are generally safe for commercial use.
Weak Copyleft (LGPL, MPL, EPL): These require that modifications to the licensed library itself be released under the same license. They often allow linking with proprietary code without triggering broader copyleft.
Strong Copyleft (GPL, AGPL): These are the most restrictive. Distributing software that incorporates a GPL-licensed component typically requires that the entire combined work be released under the GPL. The AGPL extends this requirement to network use: if users interact with a service that runs modified AGPL code, you may need to offer them the corresponding source code.

The primary goal of an audit is to identify strong copyleft licenses in your codebase and assess whether your usage complies with their terms. A secondary goal is to ensure you're meeting the attribution requirements of permissive licenses.
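
The three categories above can be encoded as a small lookup table so that later steps can reason about them programmatically. The following is a minimal sketch: the table is deliberately incomplete, and the names LicenseCategory, CATEGORY_BY_SPDX_ID, categorize, and needs_review are illustrative assumptions rather than part of any particular tool.

```python
from enum import Enum


class LicenseCategory(Enum):
    PERMISSIVE = "permissive"
    WEAK_COPYLEFT = "weak copyleft"
    STRONG_COPYLEFT = "strong copyleft"
    UNKNOWN = "unknown"


# Illustrative, deliberately incomplete table keyed by SPDX identifier.
CATEGORY_BY_SPDX_ID = {
    "MIT": LicenseCategory.PERMISSIVE,
    "Apache-2.0": LicenseCategory.PERMISSIVE,
    "BSD-3-Clause": LicenseCategory.PERMISSIVE,
    "LGPL-3.0-only": LicenseCategory.WEAK_COPYLEFT,
    "MPL-2.0": LicenseCategory.WEAK_COPYLEFT,
    "EPL-2.0": LicenseCategory.WEAK_COPYLEFT,
    "GPL-3.0-only": LicenseCategory.STRONG_COPYLEFT,
    "AGPL-3.0-only": LicenseCategory.STRONG_COPYLEFT,
}


def categorize(spdx_id: str) -> LicenseCategory:
    """Return the category for an SPDX identifier, or UNKNOWN if unlisted."""
    return CATEGORY_BY_SPDX_ID.get(spdx_id, LicenseCategory.UNKNOWN)


def needs_review(spdx_id: str) -> bool:
    """Strong copyleft and unrecognized licenses go to a human reviewer."""
    return categorize(spdx_id) in {
        LicenseCategory.STRONG_COPYLEFT,
        LicenseCategory.UNKNOWN,
    }


for spdx_id in ("MIT", "MPL-2.0", "AGPL-3.0-only", "Some-Custom-License"):
    print(spdx_id, "->", categorize(spdx_id).value, "| review:", needs_review(spdx_id))
```

Treating unknown identifiers as review-worthy is a design choice in this sketch: it errs on the side of surfacing anything the table does not recognize.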

Step 1: Generating a Software Bill of Materials (SBOM)

The audit begins with a complete inventory. You need a machine-readable list of every direct and transitive open-source component in your project, including their versions and licenses. This is your Software Bill of Materials (SBOM).
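
To make the idea of an inventory concrete, here is a rough sketch that reads an npm lockfile directly. It assumes a package-lock.json with lockfileVersion 2 or 3, where a top-level "packages" map lists every installed package and may carry a "license" field copied from that package's package.json. The function name list_locked_packages and the package names in the sample are hypothetical.

```python
import json


def list_locked_packages(lockfile_text: str) -> list:
    """List name, version, and declared license for each locked package."""
    lock = json.loads(lockfile_text)
    inventory = []
    for path, entry in lock.get("packages", {}).items():
        if path == "":
            continue  # the empty key describes the root project itself
        # Nested installs look like node_modules/a/node_modules/b; keep "b".
        name = path.rsplit("node_modules/", 1)[-1]
        inventory.append({
            "name": name,
            "version": entry.get("version"),
            # The field is optional, so record a marker when it is missing.
            "license": entry.get("license", "UNKNOWN"),
        })
    return inventory


# Hypothetical lockfile content used only for demonstration.
SAMPLE_LOCK = json.dumps({
    "name": "demo-app",
    "lockfileVersion": 3,
    "packages": {
        "": {"name": "demo-app", "version": "1.0.0"},
        "node_modules/example-lib": {"version": "1.2.3", "license": "MIT"},
        "node_modules/example-lib/node_modules/example-parser": {"version": "0.4.0"},
    },
})

for package in list_locked_packages(SAMPLE_LOCK):
    print(package)
```

A declared license in a lockfile is only a starting point; it says nothing about license texts or headers inside the package's source.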

For most language ecosystems, dedicated Software Composition Analysis (SCA) tools automate this. Let's look at a practical example using a Node.js project and a command-line SCA tool like scancode-toolkit or ort (OSS Review Toolkit).

First, ensure you have a lockfile (package-lock.json, yarn.lock) that pins your dependency tree. Then, run a scan:

# Using OSS Review Toolkit (ORT) analyzer
./ort analyze -i /path/to/your/nodejs-project -o /path/to/scan-results

# The analyzer writes an analyzer-result.yml file containing the dependency inventory.
# The ORT scanner can then scan each component's source code for license findings.
./ort scan -i /path/to/scan-results/analyzer-result.yml -o /path/to/scan-results

This process creates a detailed inventory file. In simplified form, the structured YAML or JSON output looks something like this:

dependencies:
  - id: "npm:<package-a>@<version>"
    licenses:
      - license: "MIT"
        source: "package.json"
  - id: "npm:<package-b>@<version>"
    licenses:
      - license: "MIT"
        source: "LICENSE"
  - id: "npm:<package-c>@<version>"
    licenses:
      - license: "GPL-3.0-only"
        source: "declared in source header"
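
Once the inventory exists, a first useful view is "which components fall under which license". The sketch below assumes the simplified structure shown above, already loaded into a Python dictionary; the component ids in SAMPLE_SBOM and the function name components_by_license are hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory in the simplified shape shown above.
SAMPLE_SBOM = {
    "dependencies": [
        {"id": "npm:example-web@1.0.0",
         "licenses": [{"license": "MIT", "source": "package.json"}]},
        {"id": "npm:example-utils@2.3.1",
         "licenses": [{"license": "MIT", "source": "LICENSE"}]},
        {"id": "npm:example-parser@0.9.0",
         "licenses": [{"license": "GPL-3.0-only", "source": "declared in source header"}]},
        {"id": "npm:example-mystery@0.1.0", "licenses": []},
    ]
}


def components_by_license(sbom: dict) -> dict:
    """Group component ids by license identifier."""
    grouped = defaultdict(list)
    for dep in sbom.get("dependencies", []):
        # Components with no finding at all are grouped under NO-LICENSE.
        licenses = dep.get("licenses") or [{"license": "NO-LICENSE"}]
        for info in licenses:
            grouped[info.get("license", "NO-LICENSE")].append(dep["id"])
    return dict(grouped)


for license_id, components in components_by_license(SAMPLE_SBOM).items():
    print(f"{license_id}: {len(components)} component(s)")
```

Components that end up under NO-LICENSE are usually the first candidates for manual review.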

For compiled languages like Java or C++, the process involves scanning both dependency manifests (for Java, pom.xml or build.gradle) and the compiled binaries or JAR files, because licenses declared in manifests can be inaccurate. Commercial tools such as FOSSA or Black Duck offer this kind of binary analysis.
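
As a very rough illustration of looking inside an artifact rather than trusting the manifest, the sketch below lists license-like files bundled in a JAR, which is an ordinary ZIP archive. This is only a first-pass heuristic and not how dedicated binary analysis tools work; the file name pattern, the function names, and the in-memory demo archive are all assumptions made for the example.

```python
import io
import re
import zipfile

# Heuristic pattern for file names that usually carry license information.
LICENSE_FILE_PATTERN = re.compile(
    r"(^|/)(LICEN[SC]E|NOTICE|COPYING)(\.[A-Za-z]+)?$", re.IGNORECASE
)


def find_license_files(jar) -> list:
    """Return archive entries whose names look like license or notice files.

    `jar` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            name for name in archive.namelist() if LICENSE_FILE_PATTERN.search(name)
        )


def build_demo_jar() -> io.BytesIO:
    """Create a tiny in-memory archive so the example runs without a real JAR."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        archive.writestr("META-INF/LICENSE.txt", "placeholder license text\n")
        archive.writestr("META-INF/NOTICE", "placeholder notice\n")
        archive.writestr("com/example/Demo.class", b"")
    buffer.seek(0)
    return buffer


print(find_license_files(build_demo_jar()))
```

The files found this way can then be compared with the license declared in the manifest to spot mismatches.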

Step 2: License Identification and Normalization

License identification is not always straightforward. A component might have a LICENSE file containing full text, an SPDX license identifier in its package.json, or only a vague reference like "This project is licensed under the GPL."

The audit tool must normalize these findings to standard SPDX license identifiers (e.g., GPL-3.0-or-later, Apache-2.0). This is critical for automated policy checking. Manual review is often required for ambiguous cases. Look for these common pitfalls:

Dual Licensing: A library may be offered under "MIT OR GPL-2.0-only". This means you can choose which terms to comply with. Your audit must record which license path you are following.
License Incompatibility: You cannot combine code under certain licenses. The classic example is combining GPL-2.0-only code with Apache-2.0 code; the licenses are incompatible. Your audit must flag these combinations.
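Undetected Licenses: Copied code snippets often arrive without any license notice. Stack Overflow contributions are licensed under Creative Commons Attribution-ShareAlike terms, which carry attribution and share-alike obligations, while a GitHub gist with no stated license defaults to all rights reserved. A comprehensive audit, like those enabled by platforms such as Codequiry, goes beyond declared dependencies to scan the actual source code for unlicensed or mismatched snippets.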
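
The normalization step and the dual-licensing choice can be sketched in a few lines. This is a simplified illustration, not a full SPDX expression parser: it handles only exact identifiers, a small hypothetical alias table, and flat "A OR B" expressions without parentheses, AND, or WITH. Anything it cannot map returns None, which stands for "send to manual review".

```python
import re
from typing import Optional

# Small illustrative subset of SPDX identifiers.
KNOWN_SPDX_IDS = [
    "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC",
    "GPL-2.0-only", "GPL-3.0-only", "GPL-3.0-or-later",
]
_ID_BY_LOWERCASE = {spdx_id.lower(): spdx_id for spdx_id in KNOWN_SPDX_IDS}

# Hypothetical alias table; only unambiguous phrasings belong here.
ALIASES = {
    "mit license": "MIT",
    "the mit license": "MIT",
    "apache license 2.0": "Apache-2.0",
    "apache license, version 2.0": "Apache-2.0",
    "isc license": "ISC",
}


def normalize_license(raw: str) -> Optional[str]:
    """Map a declared license string to an SPDX id, or None if ambiguous."""
    cleaned = " ".join(raw.split()).lower()
    return _ID_BY_LOWERCASE.get(cleaned) or ALIASES.get(cleaned)


def choose_license(expression: str, preferred_order: list) -> Optional[str]:
    """Pick the first preferred license offered by a flat OR expression."""
    options = [normalize_license(part) for part in re.split(r"\s+OR\s+", expression)]
    for candidate in preferred_order:
        if candidate in options:
            return candidate
    return None


print(normalize_license("  MIT License "))
print(normalize_license("This project is licensed under the GPL."))
print(choose_license("MIT OR GPL-2.0-only", ["MIT", "Apache-2.0"]))
```

The vague GPL sentence is intentionally left unmapped, because it does not say which version applies or whether later versions are allowed. The result of choose_license is the value an audit would record as the license path being followed.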

Step 3: Policy Definition and Conflict Detection

With a normalized SBOM, you apply your organization's policy. This is typically a set of rules codified in a configuration file. A simple policy for a proprietary SaaS application might be:

# Illustrative policy file; the actual file name and schema depend on your SCA tool
policy:
  rules:
    - name: "Block Strong Copyleft"
      license: ["GPL-2.0-only", "GPL-2.0-or-later", "GPL-3.0-only", "GPL-3.0-or-later", "AGPL-3.0-only", "AGPL-3.0-or-later"]
      action: "block"
      message: "Strong copyleft licenses require legal review."
    - name: "Review Weak Copyleft"
      license: ["LGPL-2.1-only", "LGPL-2.1-or-later", "LGPL-3.0-only", "LGPL-3.0-or-later", "MPL-2.0"]
      action: "flag"
      message: "Ensure linking compliance for weak copyleft."
    - name: "Allow Permissive"
      license: ["MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"]
      action: "allow"

Run your SBOM against this policy. The tool will generate a report listing violations. A block violation might halt your CI/CD pipeline. A flag violation requires manual review.
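
The policy check itself is conceptually simple. The sketch below applies rules of the shape described above to the simplified inventory structure, both held as Python dictionaries. It is an assumption-laden illustration rather than the behavior of any specific tool: the function names, the sample component ids, and the decision to flag licenses that no rule mentions are all choices made for this example.

```python
# Policy in the same shape as the illustrative policy file, as a Python dict.
POLICY = {
    "policy": {
        "rules": [
            {"name": "Block Strong Copyleft",
             "license": ["GPL-2.0-only", "GPL-3.0-only", "AGPL-3.0-only"],
             "action": "block"},
            {"name": "Review Weak Copyleft",
             "license": ["LGPL-2.1-only", "LGPL-3.0-only", "MPL-2.0"],
             "action": "flag"},
            {"name": "Allow Permissive",
             "license": ["MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"],
             "action": "allow"},
        ]
    }
}

# Hypothetical inventory used only for demonstration.
SBOM = {
    "dependencies": [
        {"id": "npm:example-web@1.0.0", "licenses": [{"license": "MIT"}]},
        {"id": "npm:example-parser@0.9.0", "licenses": [{"license": "GPL-3.0-only"}]},
        {"id": "npm:example-codec@3.1.0", "licenses": [{"license": "MPL-2.0"}]},
        {"id": "npm:example-odd@0.0.1", "licenses": [{"license": "Custom-License"}]},
    ]
}


def evaluate(sbom: dict, policy: dict) -> list:
    """Return one finding per component license that is not explicitly allowed."""
    action_by_license = {}
    for rule in policy["policy"]["rules"]:
        for license_id in rule["license"]:
            action_by_license[license_id] = (rule["action"], rule["name"])

    findings = []
    for dep in sbom.get("dependencies", []):
        for info in dep.get("licenses", []):
            license_id = info.get("license", "NO-LICENSE")
            # Licenses that no rule mentions are flagged rather than silently allowed.
            action, rule_name = action_by_license.get(license_id, ("flag", "Unlisted license"))
            if action != "allow":
                findings.append({"component": dep["id"], "license": license_id,
                                 "action": action, "rule": rule_name})
    return findings


def exit_code(findings: list) -> int:
    """Non-zero when any finding is a block, so a CI job can fail on it."""
    return 1 if any(f["action"] == "block" for f in findings) else 0


results = evaluate(SBOM, POLICY)
for finding in results:
    print(finding["action"].upper(), finding["component"], finding["license"])
print("exit code:", exit_code(results))
```

Returning an exit code instead of calling sys.exit directly keeps the check easy to test; a CI wrapper script can pass the value on.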

Step 4: The Manual Review and Remediation Workflow

Automation finds problems; humans fix them. When a policy violation is flagged, you enter a remediation workflow.

Assess the Context: Is the violating dependency a direct import or a transitive one five levels deep? How is it used? Is it statically linked (which triggers additional obligations under the LGPL) or dynamically linked?
Explore Alternatives: Can you replace the GPL library with a permissively-licensed alternative? For example, replace a GPL JSON parser with one licensed under MIT.
Seek Exceptions: If replacement isn't feasible, you may need a legal exception. This requires documenting the component, its use case, and the business justification. This exception should be recorded in your audit trail.
Fulfill Obligations: For allowed copyleft or permissive licenses, ensure you meet their requirements. This often means updating a NOTICES.txt file in your distribution. Create a process to aggregate required copyright notices and license texts from all dependencies.
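
The exception record mentioned under "Seek Exceptions" is easier to audit when it is structured data rather than an email thread. Here is one possible shape, offered as a sketch: the component, use case, and justification fields come from the workflow above, while approved_by and expires are assumptions added for illustration, as are the class and function names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LicenseException:
    component: str        # component id as it appears in the SBOM
    license_id: str       # SPDX identifier the exception covers
    use_case: str         # how the component is used
    justification: str    # business reason for keeping it
    approved_by: str      # assumption: who signed off
    expires: date         # assumption: exceptions are revisited periodically

    def is_active(self, today: date) -> bool:
        return today <= self.expires


def find_active_exception(component: str, exceptions: list,
                          today: date) -> Optional[LicenseException]:
    """Return the first unexpired exception recorded for a component, if any."""
    for exception in exceptions:
        if exception.component == component and exception.is_active(today):
            return exception
    return None


# Hypothetical audit-trail entry.
AUDIT_TRAIL = [
    LicenseException(
        component="npm:example-parser@0.9.0",
        license_id="GPL-3.0-only",
        use_case="build-time code generation only, not shipped",
        justification="no permissively licensed alternative with the same features",
        approved_by="legal-review",
        expires=date(2030, 1, 1),
    )
]

match = find_active_exception("npm:example-parser@0.9.0", AUDIT_TRAIL, date(2025, 6, 1))
print(match.justification if match else "no active exception")
```

A policy check could consult such records before failing a build, so that an approved exception does not block every pipeline run.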

Here's a Python script snippet that could be part of a build process to generate a basic notices file:

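#!/usr/bin/env python3
"""Generate a basic NOTICES.txt file from a simplified SBOM."""
import json
from pathlib import Path

import yaml  # third-party dependency: PyYAML

# Licenses in this sketch that require attribution in distributions.
# Extend this set to match your own policy.
ATTRIBUTION_LICENSES = {'MIT', 'Apache-2.0', 'BSD-2-Clause', 'BSD-3-Clause', 'ISC'}


def generate_notices(sbom_file_path, output_path):
    """Parse the SBOM and write a NOTICES.txt file.

    Assumes the simplified structure shown earlier: a top-level
    'dependencies' list whose entries have 'id' and 'licenses' keys.
    """
    sbom_path = Path(sbom_file_path)
    with sbom_path.open('r', encoding='utf-8') as f:
        # Pick the parser from the file extension.
        sbom = json.load(f) if sbom_path.suffix == '.json' else yaml.safe_load(f)

    notice_lines = ["THIRD-PARTY SOFTWARE NOTICES\n", "=" * 30 + "\n"]

    for dep in sbom.get('dependencies', []):
        name_ver = dep['id']
        for lic_info in dep.get('licenses', []):
            license_id = lic_info.get('license', 'NO-LICENSE')
            if license_id in ATTRIBUTION_LICENSES:
                notice_lines.append(f"\nComponent: {name_ver}\n")
                notice_lines.append(f"License: {license_id}\n")
                # In a real script, you would fetch the full license text here.
                notice_lines.append("---\n")

    # Create the output directory (e.g. dist/) if it does not exist yet.
    out_path = Path(output_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open('w', encoding='utf-8') as f:
        f.writelines(notice_lines)


if __name__ == '__main__':
    # Example call
    generate_notices('ort-scan-result.yml', 'dist/NOTICES.txt')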

Step 5: Integration and Continuous Compliance

A one-off audit loses value as soon as your dependencies change. The real goal is to integrate license scanning into your development workflow.

Pre-commit Hooks: Use lightweight scanners to check new dependencies added to manifest files before they are committed.
CI/CD Pipeline Gates: Integrate a full SCA scan into your pull request builds or main branch builds. Fail the build on "block" license violations.
Regular Full Scans: Schedule weekly or monthly deep scans of your entire codebase, including scanning for license texts within source files themselves, to catch copied code or updated dependencies.
Attribution Automation: Automate the generation and inclusion of attribution files in every release artifact and container image.
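
The pre-commit idea from the list above can be approximated by comparing the committed package.json with the staged one and reporting dependencies that are new. The sketch below shows only the comparison logic, under the assumption that both versions of the manifest are available as text; how a hook obtains them is left out, and the function names and sample manifests are hypothetical.

```python
import json

# package.json sections that can introduce third-party code.
DEPENDENCY_SECTIONS = ("dependencies", "devDependencies", "optionalDependencies")


def declared_dependencies(manifest_text: str) -> set:
    """Collect every dependency name declared in a package.json document."""
    manifest = json.loads(manifest_text)
    names = set()
    for section in DEPENDENCY_SECTIONS:
        names.update(manifest.get(section, {}))
    return names


def newly_added(old_manifest_text: str, new_manifest_text: str) -> list:
    """Return dependency names present in the new manifest but not the old one."""
    added = declared_dependencies(new_manifest_text) - declared_dependencies(old_manifest_text)
    return sorted(added)


# Hypothetical before/after manifests.
OLD_MANIFEST = json.dumps({"dependencies": {"example-web": "^1.0.0"}})
NEW_MANIFEST = json.dumps({
    "dependencies": {"example-web": "^1.0.0", "example-parser": "^0.9.0"},
    "devDependencies": {"example-linter": "^5.0.0"},
})

for name in newly_added(OLD_MANIFEST, NEW_MANIFEST):
    print("new dependency, needs license check:", name)
```

Because it inspects only direct dependencies, a check like this stays fast enough for a hook; transitive dependencies are left to the full scan in the CI pipeline.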

A robust audit process transforms license compliance from a source of anxiety into a documented, manageable aspect of software development. It provides clarity for your engineers, satisfies legal requirements, and protects the organization from unexpected obligations. The tools and steps exist; the work lies in committing to the process and integrating it into the daily flow of building software.

The final output is not just a report, but a living compliance posture. You have a verifiable SBOM for every release, a clear policy, and an automated system to prevent new violations from being introduced. That is how open-source license auditing actually works.