SPDX license identification using hashes, fingerprints, and semantic similarity.
Supports command-line usage as well as Python module integration.
Install with pip:

    pip install license-analyzer

To install from source (e.g., for development):

    git clone https://github.com/envolution/license-analyzer.git
    cd license-analyzer
    pip install .

Once installed, the CLI tool is available as:

    license-analyzer [OPTIONS] FILE [FILE...]

| Option | Description |
|---|---|
| `--top-n N` | Return top N matches per file. If omitted, returns all matches tied for highest score. |
| `--format {text,json,csv}` | Output format. Default is `text`. |
| `--min-score FLOAT` | Filter out matches with a score below this threshold (default: `0.0`). |
| `--spdx-dir DIR` | Path to SPDX license text files. Defaults to `~/.cache/license-analyzer/spdx/text`. |
| `--cache-dir DIR` | Path to cache directory for license database. |
| `--embedding-model NAME` | SentenceTransformer model (default: `all-MiniLM-L6-v2`). |
| `--update`, `-u` | Force update of SPDX license data from GitHub. |
| `--verbose`, `-v` | Show progress and debug logs. |
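
The way `--top-n` and `--min-score` interact can be illustrated with a small sketch. This is not the tool's actual code: `Match` and `select_matches` are hypothetical names, the sample scores and the `embedding` method label are made up, and applying the score filter before the top-N cut is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Match:
    """Hypothetical stand-in for one result (name, score, method)."""
    name: str
    score: float
    method: str


def select_matches(matches: List[Match],
                   top_n: Optional[int] = None,
                   min_score: float = 0.0) -> List[Match]:
    """Illustrate the documented --top-n / --min-score semantics."""
    # Assumed order: drop low scores first, then rank best-first.
    kept = sorted((m for m in matches if m.score >= min_score),
                  key=lambda m: m.score, reverse=True)
    if not kept:
        return []
    if top_n is None:
        # No N given: keep every match tied for the highest score.
        best = kept[0].score
        return [m for m in kept if m.score == best]
    return kept[:top_n]


# Made-up sample data, for illustration only.
candidates = [
    Match("MIT", 1.0, "sha256"),
    Match("MIT-0", 0.91, "embedding"),
    Match("ISC", 0.72, "embedding"),
]
top_tied = select_matches(candidates)
top_two = select_matches(candidates, top_n=2, min_score=0.8)
```

Here `top_tied` holds only the single best match, while `top_two` holds the two highest-scoring matches that clear the 0.8 threshold.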
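
    # Analyze a single file
    license-analyzer LICENSE
    # Analyze several files
    license-analyzer license1.txt license2.txt
    # JSON output with the top 3 matches
    license-analyzer --format json --top-n 3 LICENSE
    # Refresh the SPDX license data
    license-analyzer --update

You can also use license-analyzer directly in your Python code: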

    from license_analyzer.core import LicenseAnalyzer

    analyzer = LicenseAnalyzer()
    matches = analyzer.analyze_file("LICENSE")
    for match in matches:
        print(match.name, match.score, match.method)

Or, if you want to analyze text (rather than a file):

    # Read the license text, closing the file when done
    with open("LICENSE", encoding="utf-8") as f:
        text = f.read()

    matches = analyzer.analyze_text(text)
    for match in matches:
        print(match.name, match.score, match.method)

Use `top_n=None` to get all tied top-scoring matches:

    matches = analyzer.analyze_text(text, top_n=None)

Example text output:

    Analysis results for: LICENSE
    ------------------------------------------------------------
    MIT score: 1.0000 method: sha256

Example JSON output:

    {
      "LICENSE": [
        {
          "name": "MIT",
          "score": 1.0,
          "method": "sha256"
        }
      ]
    }

Example CSV output:

    file_path,license_name,score,method
    "LICENSE","MIT",1.0,"sha256"

By default, license data is stored under:

    ~/.cache/license-analyzer/spdx

To update the SPDX license texts (from GitHub):

    license-analyzer --update

This refreshes the cached licenses and triggers a database rebuild if needed.
- SHA256 Hash Match
- Canonical Fingerprint Match
- Semantic Embedding Match (via sentence-transformers)
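
To make the first two methods concrete, here is a minimal sketch of a hash-then-fingerprint lookup. It does not reproduce license-analyzer's internals: the normalization rules in `canonical_fingerprint`, the `identify` helper, the `fingerprint` method label, and the tiny in-memory `KNOWN` table are all assumptions for illustration. The embedding tier is left out because it needs a sentence-transformers model.

```python
import hashlib
import re
from typing import Optional, Tuple


def sha256_hex(text: str) -> str:
    """SHA-256 of the exact text; any byte difference changes the hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def canonical_fingerprint(text: str) -> str:
    """Hash of a normalized form (assumed rules: lowercase,
    punctuation dropped, whitespace collapsed)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sha256_hex(" ".join(words))


# Hypothetical one-entry reference set; a real database would hold
# the full SPDX license texts.
REFERENCE = "Permission is hereby granted, free of charge, to any person."
KNOWN = {"Example-License": REFERENCE}

EXACT = {sha256_hex(t): name for name, t in KNOWN.items()}
FINGERPRINTS = {canonical_fingerprint(t): name for name, t in KNOWN.items()}


def identify(text: str) -> Optional[Tuple[str, str]]:
    """Try the cheapest, strictest check first, then the looser one."""
    name = EXACT.get(sha256_hex(text))
    if name is not None:
        return name, "sha256"
    name = FINGERPRINTS.get(canonical_fingerprint(text))
    if name is not None:
        return name, "fingerprint"
    # A semantic-similarity fallback would go here.
    return None


exact = identify(REFERENCE)
reflowed = identify("permission is hereby granted,\n  free of charge, to ANY person")
unknown = identify("All rights reserved.")
```

In this sketch the byte-identical text matches by hash, the re-wrapped and re-cased copy only matches after normalization, and unrelated text matches neither tier.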
SPDX-License-Identifier: Apache-2.0