GPU acceleration for LLaMA inference using NVIDIA 2:4 sparse tensor cores
SparseFlow leverages NVIDIA's sparse tensor cores to accelerate LLaMA-70B inference with:
- 1.2-1.4× speedup on production workloads (batch size ≥ 512)
- No measurable end-to-end accuracy loss, validated across all production shapes
- 308-334 TFLOPS peak throughput on A100
- Drop-in replacement for torch.matmul
Benchmarked on NVIDIA A100 80GB:
| Shape | Batch | Speedup | TFLOPS |
|---|---|---|---|
| LLaMA FFN gate | 2048 | 1.42× | 308.4 |
| LLaMA FFN gate | 512 | 1.34× | 286.8 |
| LLaMA FFN down | 2048 | 1.32× | 310.9 |
| LLaMA attn | 2048 | 1.24× | 308.9 |
| LLaMA attn | 512 | 1.17× | 237.4 |
Best for production workloads with batch size ≥ 512
At 1B tokens/day on A100 GPUs:
- Monthly savings: ~$7,300
- Yearly savings: ~$87,600
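These figures follow from a simple throughput model: a 1.42× speedup means the same traffic can be served with ~30% fewer GPUs. The sketch below illustrates that arithmetic; the fleet size and cost constants are assumptions for illustration, not the output of the ROI tool.

```python
# Simplified ROI model: serving the same load needs `speedup`x fewer GPUs.
# All constants below are illustrative assumptions, not measured values.
speedup = 1.42               # measured kernel speedup
gpu_hourly_cost = 2.50       # $/GPU-hour (assumed on-demand A100 rate)
dense_gpus = 12              # GPUs needed at dense throughput (assumed fleet)

sparse_gpus = dense_gpus / speedup
monthly_savings = (dense_gpus - sparse_gpus) * gpu_hourly_cost * 24 * 30

print(f"~${monthly_savings:,.0f}/month, ~${monthly_savings * 12:,.0f}/year")
```

Actual savings depend on utilization, pricing, and which shapes dominate the workload; the bundled `tools/roi_calculator.py` accepts those as flags.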
Validation:
- 11/11 production shapes passing
- Max elementwise error: 0.031-0.194 (consistent with FP16 precision)
- Validated against FP32 ground truth
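The 2:4 pattern behind these numbers keeps the two largest-magnitude values in every contiguous group of four and zeroes the rest. A minimal sketch in plain Python (`prune_24_row` is illustrative, not part of SparseFlow's API):

```python
# Illustrative 2:4 pruning: in every contiguous group of four values,
# keep the two with the largest magnitude and zero the other two.
def prune_24_row(row):
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.02, -0.4, 0.1]
pruned = prune_24_row(row)
print(pruned)  # [0.9, 0.0, 0.0, -1.2, 0.3, 0.0, -0.4, 0.0]
```

The pruning error per element is bounded by the magnitudes dropped, which is why validation reports a small but nonzero max error rather than bit-exact agreement.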
Usage:

```python
import torch
import sparseflow

# Prune weights to a 2:4 pattern (two nonzeros per group of four)
A_pruned = sparseflow.prune_24(A)

# Convert to PyTorch's semi-structured sparse format
A_sparse = torch.sparse.to_sparse_semi_structured(A_pruned)

# Sparse matmul runs on the sparse tensor cores
C = torch.matmul(A_sparse, B)  # 1.2-1.4× faster than dense
```

Requirements:
- GPU: NVIDIA Ampere or newer (A100, H100, RTX 30/40)
- PyTorch: 2.0+
- CUDA: 11.8+ or 12.x
- Batch size: ≥ 512 for optimal speedup
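Because 2:4 sparse tensor cores require compute capability 8.0 (Ampere) or newer, a guard up front avoids a confusing failure on older GPUs. A minimal sketch; the helper name is an assumption, not part of SparseFlow:

```python
def supports_24_sparsity(capability):
    """Return True if a (major, minor) compute capability has 2:4 sparse tensor cores."""
    return capability >= (8, 0)  # Ampere (8.x), Hopper (9.x), ...

# In practice, query the active GPU before taking the sparse path, e.g.:
#   ok = torch.cuda.is_available() and supports_24_sparsity(torch.cuda.get_device_capability())

print(supports_24_sparsity((8, 0)))  # A100 -> True
print(supports_24_sparsity((7, 5)))  # Turing T4 -> False
```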
Documentation:
- Integration Guide - Production deployment
- Demo Notebook - Interactive demo
- Benchmarks - Performance data
- ROI Calculator - Cost savings
```shell
# Correctness validation
python tests/validate_sparseflow_production.py

# Performance benchmarks
python benchmarks/run_benchmarks.py

# ROI calculation
python tools/roi_calculator.py throughput \
    --tokens-per-day 1000000000 \
    --dense-rps-per-gpu 3.0 \
    --speedup 1.42 \
    --gpu-hourly-cost 2.50
```

✅ Use for:
- LLaMA/Transformer inference
- Batch sizes ≥ 512
- Ampere+ GPUs
- FP16 workloads
❌ Not for:
- Small batches (< 256)
- Training (no gradient support)
- Pre-Ampere GPUs
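These rules can be encoded as a simple dispatch policy so the sparse path is only taken where it pays off. A hypothetical helper, not part of SparseFlow's API:

```python
def choose_backend(batch_size, compute_capability, training=False):
    """Pick the sparse path only in SparseFlow's sweet spot (illustrative policy)."""
    if training:                         # no gradient support
        return "dense"
    if compute_capability < (8, 0):      # pre-Ampere: no sparse tensor cores
        return "dense"
    if batch_size < 512:                 # small batches see little benefit
        return "dense"
    return "sparse"

print(choose_backend(2048, (8, 0)))                  # sparse
print(choose_backend(128, (8, 0)))                   # dense: batch too small
print(choose_backend(2048, (7, 5)))                  # dense: pre-Ampere
print(choose_backend(2048, (9, 0), training=True))   # dense: training
```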
MIT License - see LICENSE
Built by Maple Silicon Inc. for production LLaMA inference at scale.