26 Mar 01:03

dhuangnm

fb8c429

Speculators v0.4.0.1 Latest

Latest

What's Changed

Minor update to allow hotfix for 0.4.0.1. by @dhuangnm in #357
Fix conversion error by @shanjiaz in #359
reconfigure run_vllm by @shanjiaz in #363

Full Changelog: v0.4.0...v0.4.0.1

Contributors

shanjiaz and dhuangnm

Assets 4

04 Mar 19:48

dhuangnm

v0.4.0

d2914dc

Speculators v0.4.0

Speculators v0.4.0 Release Notes

This release expands the Speculators framework with enhanced algorithm flexibility, improved model support, and critical bug fixes. Key additions include response regeneration capabilities for on-policy training, extensible architecture supporting multiple speculative decoding algorithms beyond Eagle3, comprehensive Vision Language model integration, and updated infrastructure dependencies including PyTorch 2.10 and vLLM v0.16.0.

Key Features

Response Regeneration Scripts: New scripts enable on-policy training workflows by regenerating model responses for question-answer and chat-style datasets using vLLM
Extensible Algorithm Framework: Training infrastructure generalized to support multiple speculative decoding algorithms beyond Eagle3 through registry-based architecture
Vision Language Model Support: Complete training pipeline and speculative decoding support for Vision Language models, including MoE variants
PyTorch 2.10 Compatibility: Framework updated to support PyTorch 2.10 for latest performance optimizations
Normalization Layer Fix: Critical bug resolved, improving acceptance rates on gpt-oss-20b from 25% to 40%
Enhanced Evaluation Framework: Configurable base models and speculative decoding parameters with explicit model separation

Detailed Features

Response Regeneration Scripts

The update introduces response regeneration capabilities to facilitate on-policy training workflows. Draft model performance improves substantially when training on responses generated by the target model itself rather than pre-existing dataset responses.

To enable this workflow, new scripts are available in the response_regeneration directory that regenerate model responses for question-answer and chat-style datasets using vLLM. Complete usage documentation and examples are provided in the accompanying README within the same directory.

Extensible Algorithm Framework

The training infrastructure has been generalized to support multiple speculative decoding algorithms beyond Eagle3. This architectural refactoring introduces a registry-based pattern using @register decorators, allowing algorithms to own their training logic through classmethods. The implementation adds a new base_components.py module providing shared transformer definitions for Llama and Qwen3 architectures, while removing algorithm-specific hardcoded values from core training utilities. Users can now specify --speculator-type as the main argument to train.py, with the training script dynamically looking up the corresponding model class in the registry. This design eliminates code duplication and enables new algorithms such as DFlash and FastMTP (both coming soon) to integrate without modifying core training infrastructure. Developer documentation is available here with detailed implementation guidance.

Vision Language Model Support

Comprehensive support for Vision Language models has been integrated across the training pipeline. The framework now includes data generation capabilities, full training support for both standard and Mixture-of-Experts (MoE) architectures, and speculative decoding inference support in vLLM. Users can generate training data and train vision language models directly within Speculators, with MoE variants supported for production deployment with speculative decoding acceleration.

With this functionality, a Qwen3-VL-235B-A22B-Instruct speculator was trained and released.

PyTorch 2.10 Compatibility

The framework has been updated to support PyTorch 2.10, enabling users to leverage the latest performance optimizations and enhanced training capabilities from this release.

vLLM v0.16.0 Integration

Data generation support has been updated to vLLM v0.16.0, ensuring compatibility with the latest vLLM features and performance improvements for hidden-state generation.

Improvements and Fixes

Normalization Layer Correction

A critical bug in the Eagle3 model architecture has been resolved. The original research code applied a final layer normalization before the language model head, but vLLM-based data generation omitted this step, causing training targets to be computed incorrectly. This was particularly impactful on gpt-oss models. The fix introduces a verifier_norm layer in Eagle3DraftModel to apply normalization before the language model head, properly loading the final normalization weights from the verifier model. Results demonstrate substantial performance improvements, with acceptance rates for gpt-oss-20b on math reasoning tasks improving from 25% to 40% even when trained on a small 20k sample ultrachat dataset. The data_format_version parameter has been removed, and a new embed_requires_grad configuration option controls whether embedding layer weights update during training.

Enhanced Evaluation Framework

The evaluation framework has been restructured to support configurable base models and speculative decoding parameters. The command structure now uses separate -b BASE_MODEL -s SPECULATOR_MODEL flags instead of a single model parameter, with added --num-spec-tokens (default: 3) and --method (default: eagle3) parameters for flexible testing. All environment configuration files have been updated with explicit base/speculator model pairs. Users can now test different base models against the same speculator and easily adjust speculation depth to optimize speed/accuracy tradeoffs, while explicit model separation clarifies the architecture.

Exact Sample Length Tracking

The training data pipeline now implements exact sample length tracking. Previously, sequence lengths were estimated by comparing file sizes, which occasionally produced inaccurate results with approximately 10% failure rates for large datasets. The data generation script now collects sample lengths and stores exact sequence lengths in a sample_lengths.json file alongside generated data. The dataloader first attempts to load exact lengths from this file when available, falling back to the original file-size approximation method for backward compatibility with existing datasets.

Full Verifier Vocabulary Support

The t2d and d2t tensor parameters in Eagle3DraftModel are now optional, allowing training with either limited vocabulary mappings or the full verifier vocabulary. When vocabulary mapping paths are not provided, the training script loads the verifier configuration and uses its full vocabulary size.

New Contributors

@momo609 made their first contribution in #232
@Vishnu-sai-teja made their first contribution in #262
@gDINESH13 made their first contribution in #261
@svlandeg made their first contribution in #289
@guan404ming made their first contribution in #291
@VincentG1234 made their first contribution in #317

Full Changelog: v0.3.0...v0.4.0

Contributors

svlandeg, momo609, and 4 other contributors

Assets 4

10 Dec 18:18

dhuangnm

v0.3.0

be6e86e

Speculators v0.3.0

Speculators v0.3.0 Release Notes

This Speculators v0.3.0 release provides end-to-end training support for Eagle3 speculative decoding draft models.

Key new features include:

Offline training data generation support using vLLM
Single- and multi-layer draft model training for MoE and non-MoE models
End-to-end scripts to generate data, train your draft model, and validate performance in vLLM
Examples highlighting training for Llama3, Qwen3, and gpt-oss

Offline Training Data Generation Support

Offline training data generation is now supported through a new hidden-states generator using vLLM. The generator provides support for MoE and non-MoE models. Vision-language support will be added in a future release.
Generated data is saved as individual data_{index}.pt files. Each data point contains input_ids, hidden_states, and loss_mask. Along with the hidden states, a token_freq.pt file is also generated, containing information about token frequencies that is used to build the target-to-draft and draft-to-target vocabulary files required for training. Finally, a data_config.json is produced, containing metadata about the data generation process.

The hidden-states generator includes the following features:

Multiprocess executor for efficient batch inference
Tensor parallelism support
Automatic KV-cache and memory management

The following scripts can be used to enable offline data generation:

data_generation_offline.py: preprocesses data, saves token-frequency distribution, and generates hidden states
build_vocab_mapping.py: builds t2d and d2t tensors

Draft Model Training Support ✨

Full training support is now available for single- and multi-layer Eagle3 draft models for both Mixture of Experts (MoE) and non-MoE target models.

Training support includes:

Updated Eagle3 draft model definitions with all features required for efficient Eagle3 model training
Added logic for Eagle3 algorithm's train-time-testing, now integrated into the Eagle3DraftModel forward method. The forward method now supports dynamic step counts and computes per-step loss and accuracy.
New document-masking support enabling fast, memory-efficient Eagle3 draft model training. This approach exploits sparsity in train-time-test attention masks, providing faster performance and lower memory usage compared to a naive full attention matrix.

The following script can be used for training:

train.py

End-to-End Scripts and Examples

New E2E script for data generation and training speculative draft models

A summary of the new scripts added to run each of the individual steps in the workflow is listed below:

A new end-to-end script has also been added that runs the full workflow mentioned above under a single configuration. The script provides a simplified interface for configuring a full training run that can be launched once. Internally, the script runs each step of the process and ensures data flows correctly from one step to the next.

Training examples have been added for Llama3, Qwen3, and gpt-oss:

Testing and validation

New vLLM benchmarking framework

A new automated evaluation framework that benchmarks Eagle3 speculator models using vLLM and GuideLLM has been added.
Preconfigured evaluation configurations are available for the following models:

Llama-3.1-8B
Llama-3.3-70B
gpt-oss-20B
Qwen3-8B
Qwen3-32B

The framework can be reviewed in the examples/evaluate/eval-guidellm folder.

To run an evaluation:

./run_evaluation.sh -c configs/llama-3.1-8b-eagle3.env

This command automatically handles vLLM server startup, runs GuideLLM benchmarks, extracts acceptance-rate metrics from logs, and cleans up when complete.

The framework supports multiple dataset types, including HuggingFace datasets with colon syntax for specific files (e.g., org/dataset:file.jsonl), local files, and directories. It includes modular bash scripts following best practices, with proper error handling and process management, configurable sampling parameters (temperature, top_p, top_k), and outputs detailed metrics including weighted per-position acceptance rates and conditional acceptance probabilities.

Configuration precedence for the evaluation run is as follows and can be easily changed:

CLI arguments
Config file
Framework defaults

Deprecations

Previously supported training code under research has been removed.

New Contributors

@SwekeR-463 made their first contribution in #212

Full Changelog: v0.2.0...v0.3.0

Contributors

SwekeR-463

Assets 4

03 Nov 15:10

dhuangnm

v0.2.0

02212fa

Speculators v0.2.0

Speculators v0.2.0 Release Notes

This Speculators v0.2.0 release introduces the following new features and enhancements:

Support for Draft Models with Multiple Decoder Layers: Previously, only draft models with a single decoder layer were supported. The Eagle3 converter now sets the num_hidden_layers from the config instead of always assuming one layer.
Added Support for eagle_aux_hidden_state_layer_ids Argument: This new argument allows users to toggle the layer IDs of the hidden state layers that are fetched during inference time. This enables support for converting Llama4 Maverick draft models to the Speculators format and running the converted model in vLLM.

Updates and Deprecations:

Python 3.9 Support Removed: Support for Python 3.9 has been removed and will no longer be provided. Python 3.10+ will be supported going forward.
Default Number of Speculative Tokens Changed: The default number of speculative tokens has been changed from 5 to 3 for all Eagle and Eagle3 models.
Override tie_weights() in Eagle3Speculator: This override prevents vocabulary corruption and supports Transformers 4.54.1.
Updated head_dim Calculation in Eagle3 Converter: The head_dim value is now used from the config if provided; otherwise, it is calculated using the formula hidden_size // num_heads.
Eagle3 Draft Models Retain Original Dtype: All Eagle3 draft models now keep their original dtype after being converted to the Speculators format. Previously, all converted draft models were cast to FP32.
Extended Logic for target_vocab_size: The system defaults to using the "t2d" length, but if not available recursively search the verifier model's config file for vocab_size.
Full End-to-End vLLM Smoke Testing: Extended and added full end-to-end vLLM smoke testing for both converted and unconverted models.

Full Change Log

Update README install commands now that Speculators is live on PyPi by @markurtz in #89
override transformer tie_weights to prevent shape mismatch by @shanjiaz in #74
[Testing][vLLM] Add vLLM Eagle3 Test Cases by @dsikka in #91
Adding .readthedocs.yaml by @aireilly in #92
[Tests][Eagle3] Extend vLLM test cases with conversion step by @dsikka in #93
Model architectures by @anmarques in #90
Fix type annotation override in SpeculatorModel.generate method by @rahul-tuli in #111
Update mkdocs by @aireilly in #115
Update README.md with badges by @dsikka in #108
Update ReadME feature content by @dsikka in #109
Fix broken links by @aireilly in #125
Update README with new models and their links by @eldarkurtic in #135
Fix for Eagle attention arch when head_dim is given in config.json by @eldarkurtic in #134
Fix for draft models always being in fp32 datatype by @eldarkurtic in #136
Fix install command for dev by @eldarkurtic in #137
Fix 'test_download_with_cache_dir' by @dbarbuzzi in #141
Update link checker so that it comments on existing issue by @fynnsu in #129
Prevent forced casting to fp16 dtype by @eldarkurtic in #145
Set default num of spec tokens to 3 by @eldarkurtic in #146
Update speculator config & converter to support hidden states indexing by @shanjiaz in #142
add num_hidden_layers by @shanjiaz in #147
Update CI Testing by @dsikka in #150
added loading util for specific layers by @shanjiaz in #144
Remove PyPI publishing steps from nightly workflow by @dsikka in #151
Refactor e2e tests to support external vLLM by @dbarbuzzi in #153
Remove remaining python 3.9 usages by @fynnsu in #152
Fix a typo in docs by @eldarkurtic in #107
Added loading util tests by @shanjiaz in #155
Extend E2E Tests for EAGLE3 Models by @rahul-tuli in #156
Remove nightly in favour of testing repo by @dsikka in #159
Remove nightly tests badge from README by @fynnsu in #163
add back link-checks by @dhuangnm in #162
Fix dev link checker workflow to comment directly on PRs by @markurtz in #164
Only load Verifier model if attachment_mode is 'full' by @fynnsu in #154
Fix EAGLE3 vLLM tests by disabling torch compile cache by @rahul-tuli in #166
bump up version for last release by @dhuangnm in #167

New Contributors

@aireilly made their first contribution in #92
@anmarques made their first contribution in #90
@eldarkurtic made their first contribution in #135
@dbarbuzzi made their first contribution in #141
@dhuangnm made their first contribution in #162

Full Changelog: v0.1.0...v0.2.0

Contributors

dbarbuzzi, eldarkurtic, and 8 other contributors

Assets 4

08 Aug 01:45

markurtz

v0.1.0

8a49095

Speculators v0.1.0 -- First Public Release

Overview

This first public release publishes the complete initial codebase for Speculators — a unified library for building, evaluating, converting, and serving speculative decoding algorithms for LLMs. It delivers the core framework, CI/CD and developer workflow, model/config implementations (EAGLE v1/HASS/EAGLE‑3), converter CLIs from external research repos, a Hugging Face–compatible model format with vLLM serving support, and prototype training code.

What’s New (Highlights)

Unified, extensible framework for speculator models (build, evaluate, convert, store)
Hugging Face–compatible speculator format with serving support landed in vLLM
Models/configs for EAGLE v1 (HASS-style), HASS, and EAGLE‑3 (multi-layer types)
Checkpoint converter CLIs (Eagle, Eagle‑3) from external research repositories
Prototype training code and scripts (EAGLE‑1-style drafter, HASS) + requirements
Production readiness: CI/CD, tests, style, docs, examples, and benchmarks

Use Cases Enabled

Register and configure new speculator algorithms via a standardized configuration and registry system
Convert external checkpoints (EAGLE/EAGLE‑3/HASS variants) into the Speculators format with CLI tools
Serve Speculators models directly in vLLM for low‑latency inference
Evaluate and benchmark speculators (e.g., with GuideLLM), including quantized verifier swaps
Prototype‑train drafters using provided research code and scripts

Getting Started

Install (Python 3.9–3.13 on Linux or macOS):

pip install git+https://github.com/neuralmagic/speculators.git

Serve with vLLM (requires v1 API):

VLLM_USE_V1=1 vllm serve RedHatAI/Qwen3-8B-speculator.eagle3

Explore examples and research: examples/, research/eagle3/, research/hass/

Compatibility Notes

Python: 3.9–3.13
OS: Linux and macOS
Transformers pinned to avoid mypy regressions (PR #73)
vLLM v1 API required for serving (set VLLM_USE_V1=1)

Full Changelog (v0.1.0)

First public release of Speculators. This release publishes the complete initial codebase and enables the first set of core use cases for speculative decoding with LLMs.

Added

Base configuration and registry system with tests: Speculator, Token Proposal, and Model Speculator configs; EagleSpeculatorConfig for EAGLE v1/HASS; config serialization/loading (PRs #26, #27, #28, #29, #34, #36)
Eagle speculator model and support for multiple transformer layer types (PRs #37, #49)
Eagle‑3 speculator model and Qwen support (PRs #50, #55)
Checkpoint converter CLIs: Eagle and Eagle‑3; standardized converter interface (PRs #39, #53, #72)
vLLM serving documentation and Qwen benchmark assets (PRs #77, #78, #82, #83)
Examples directory and README for getting started (PR #81)
Branding assets (icons, logos, user‑flow diagrams) (PR #87)

Changed

Standardized converter CLI UX and flags (PR #72)
Documentation/readme formatting and content updates (PRs #70, #75, #83, #85)

Fixed

Missing embeddings in converted checkpoints/workflows (PR #65)
CLI flags and norm_before_residual toggle (PRs #57, #58)
Compatibility: pin transformers to resolve mypy/typing regressions (PR #73)

CI/CD and Tooling

GitHub Actions: migrated link checks to lychee and updated workflows (PRs #3, #45)
PR comment behavior refinements (PR #47)

Research and Training

Training code for EAGLE‑1‑style drafter with multi‑step training (PR #35)
HASS/EAGLE‑3 research updates, requirements, and DeepSpeed dependency (PRs #64, #67, #69)

Documentation

vLLM serving instructions, Qwen benchmark results, examples README, and research readmes (PRs #64, #70, #77, #78, #81, #83, #85)

New Contributors

@fynnsu made their first contribution in PR #47
@shanjiaz made their first contribution in PR #53
@MeganEFlynn made their first contribution in PR #55

Thanks also to continuing contributors: @markurtz, @rahul-tuli, @dsikka

Contributors

markurtz, dsikka, and 4 other contributors

Assets 2

Releases: vllm-project/speculators

Speculators v0.4.0.1

What's Changed

Contributors

Uh oh!