DwarfStar x ktransformers

Fork: kt-kernel integration for MoE on CPU/NUMA w/QAT MXFP4 weights

MoE FFN layers are sparse, so for single-user or low-concurrency systems DDR memory can be far more economical than VRAM.

However, concurrency is needed to extract the full potential of multi-channel (DDR) / multi-socket (NUMA) systems.

ktransformers' kt-kernel solves this problem by providing NUMA-aware (tensor parallel) MoE/FFN kernels for x64 CPUs (AMX/AVX512).

This fork integrates kt-kernel into the DwarfStar engine, allowing Intel/AMD systems to run hybrid inference with the routed experts held in CPU/DDR memory.
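Before building, it can help to confirm that the host CPU advertises the instruction sets kt-kernel targets and to see how many NUMA nodes Linux exposes. The following is a minimal sketch, assuming a Linux host with /proc/cpuinfo and /sys/devices/system/node. The helper names (has_flag, count_numa_nodes, report_cpu) are illustrative and are not part of ds4 or kt-kernel. avx512f and amx_tile are the flag names the Linux kernel reports for AVX-512 Foundation and AMX tiles.

```shell
#!/usr/bin/env bash
# has_flag FLAG "FLAGS..." : succeed if FLAG appears as a whole word in FLAGS.
has_flag() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# count_numa_nodes [SYSFS_DIR] : print the number of nodeN directories found.
count_numa_nodes() {
  local dir="${1:-/sys/devices/system/node}" n=0 d
  for d in "$dir"/node[0-9]*; do
    [ -d "$d" ] && n=$((n + 1))
  done
  echo "$n"
}

# report_cpu : print a short summary for the current host (Linux only).
report_cpu() {
  local flags f
  flags="$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"
  for f in avx512f amx_tile; do
    if has_flag "$f" "$flags"; then echo "$f: yes"; else echo "$f: no"; fi
  done
  echo "NUMA nodes: $(count_numa_nodes)"
}
```

Source the snippet and call report_cpu to print the summary. The node count it reports is what the NPS (nodes per socket) BIOS setting ends up exposing to Linux, which is the number the run example later refers to as NUMA nodes.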

Behold:

- The original vendor's QAT safetensors MXFP4 MoE weights, no PTQ
- Efficient TP MoE kernels with explicit NUMA support
- VRAM usage reduced to the embed/attn/shexp tensors, allowing
  - 500K context w/30GB VRAM, or
  - 1M context w/50GB VRAM
- Up to 320tps prefill (100K ctx), 30tps decode (1K ctx)
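The two VRAM figures above imply a rough per-token cost for context. As a back-of-the-envelope sketch only, assuming VRAM grows linearly with context length between the two quoted points (an assumption of this sketch, not something the source states), and treating 500K as 500,000 tokens and 1 GB as 10^9 bytes:

```shell
#!/usr/bin/env bash
# Two (context tokens, VRAM GB) points quoted in the list above.
ctx_a=500000;  vram_a_gb=30
ctx_b=1000000; vram_b_gb=50

# Slope: bytes of VRAM per additional context token (assumes linear growth).
bytes_per_token=$(( (vram_b_gb - vram_a_gb) * 1000000000 / (ctx_b - ctx_a) ))

# Intercept: VRAM implied at zero context under the same linear assumption.
fixed_gb=$(( vram_a_gb - ctx_a * bytes_per_token / 1000000000 ))

# estimate_vram_gb TOKENS : linear estimate in whole GB (hypothetical helper).
estimate_vram_gb() {
  echo $(( fixed_gb + $1 * bytes_per_token / 1000000000 ))
}

echo "bytes per token: $bytes_per_token"   # 40000
echo "fixed overhead:  ${fixed_gb} GB"     # 10 GB
```

The quoted figures are rounded, so the result is an order-of-magnitude guide for sizing a context window against available VRAM, not a measurement.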

The integration is transparent: pass --cpu-moe --kt-weight-path <dir> and the engine routes every routed-MoE matmul through kt-kernel instead of the GPU, while attention, shared experts, and HC mixing remain on the GPU.

Quick start

## Prerequisites
sudo apt install libnuma-dev


## Create a workarea
mkdir ds4_numa
cd ds4_numa


## Acquire and Build: kt-kernel / kt-bridge
git clone -b kt-bridge --single-branch --recursive https://github.com/usrlocalben/ktransformers
mkdir ktransformers/kt-kernel/build
pushd ktransformers/kt-kernel/build
cmake .. -DCMAKE_BUILD_TYPE=Release -DKTRANSFORMERS_CPU_USE_AMX_AVX512=ON
cmake --build . --target kt_bridge -j$(nproc)
popd
# Observe: libkt_bridge.so created in ktransformers/kt-kernel/build/


# Optional: e.g. sudo install ktransformers/kt-kernel/build/libkt_bridge.so /usr/local/lib/


## Acquire and Build: ds4 w/kt-kernel support
git clone -b numa-moe https://github.com/usrlocalben/ds4
cd ds4
make \
  KT_BRIDGE_INC=../ktransformers/kt-kernel/bridge \
  KT_BRIDGE_LIB=../ktransformers/kt-kernel/build/libkt_bridge.so \
  cuda-generic


## Acquire antirez GGUF for embed/attn/output etc.
# As of 2026-05-28 the 2-bit and 4-bit GGUFs are the same wrt. embed/attn/output etc.
# from huggingface.co/antirez/deepseek-v4-gguf
# DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
# (or use the ds4 provided downloader)


## Acquire OEM safetensors for _exp (~150GB)
# huggingface.co/deepseek-ai/DeepSeek-V4-Flash
# uvx --with 'huggingface_hub[cli]' hf download 'deepseek-ai/DeepSeek-V4-Flash' --local-dir /path/to/ds4f/safetensors/


# use LD_LIBRARY_PATH or e.g. copy the lib to /usr/local/lib etc.
export LD_LIBRARY_PATH=/path/to/ds4_numa/ktransformers/kt-kernel/build:$LD_LIBRARY_PATH

# important: ds4 assumes there isn't enough VRAM for 1M context
#            and will switch to managed CUDA memory. avoid this!
export DS4_NO_MANAGED_KV=1

## Run the server
# Example: 2S 9B14(96c x2) NPS4 = 8 NUMA nodes
# 128 threads / 8 nodes = 16 threads per node
# tip: probe/sweep thread count to find optimal perf.
# Important: Use DS4_NO_MANAGED_KV=1 to avoid long-context managed mem policy
./ds4-server \
  --host 0.0.0.0 \
  --port 8888 \
  -m /path/to/antirez/gguf/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2-imatrix.gguf \
  -c 1000000 \
  --cuda --cpu-moe \
  --kt-weight-path /path/to/safetensors/deepseek-ai/DeepSeek-V4-Flash \
  --kt-cpuinfer 128 \
  --kt-threadpool-count 8

## observe: ~300t/s prefill @ 0-100K context length
##            30t/s decode  @ 1K context length
##        nvidia-smi: 49,794MiB VRAM usage w/1M ctx
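The thread settings in the example follow from simple division: 128 worker threads spread over 8 NUMA nodes gives 16 threads per node. Below is a hedged sketch for generating candidate settings to sweep. It assumes that --kt-threadpool-count is set to the number of NUMA nodes, as in the example, and that keeping --kt-cpuinfer a multiple of that count is desirable. The second point is an assumption of this sketch, not a documented requirement. The helper names are hypothetical.

```shell
#!/usr/bin/env bash
# threads_per_node TOTAL NODES : integer number of threads each NUMA node receives.
threads_per_node() {
  echo $(( $1 / $2 ))
}

# sweep_candidates NODES PER_NODE... : print one flag pair per candidate,
# where the total thread count is PER_NODE multiplied by NODES.
sweep_candidates() {
  local nodes="$1" per
  shift
  for per in "$@"; do
    echo "--kt-cpuinfer $(( per * nodes )) --kt-threadpool-count $nodes"
  done
}

threads_per_node 128 8          # prints 16
sweep_candidates 8 8 12 16 24   # totals of 64, 96, 128 and 192 threads
```

Each printed line can be substituted for the last two flags of the ./ds4-server command above. Compare prefill and decode rates between runs to find the best setting for a given machine.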

Legacy GGUF files are still available if you specifically need the older non-imatrix quants:

./download_model.sh q2           # 96/128 GB RAM machines, legacy non-imatrix
./download_model.sh q4           # >= 256 GB RAM machines, legacy non-imatrix
./download_model.sh pro          # 512 GB RAM machines, legacy non-imatrix PRO
