I'm deploying a FastAPI backend using Hugging Face Transformers with the `mistralai/Mistral-7B-Instruct-v0.1` model, quantized to 4-bit using `BitsAndBytesConfig`. I'm running this inside an NVIDIA GPU container (CUDA 12.1, A10G GPU with 22GB VRAM), and I keep hitting this error during model loading:
```
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models.
Please use the model as it is...
```
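For context, here's roughly how model loading fits into the FastAPI app. This is a trimmed-down sketch; the actual `from_pretrained(...)` call is shown further below:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # In the real app, the quantized-model load (the from_pretrained(...)
    # call shown later in this post) runs here, once at startup; it's
    # omitted in this sketch.
    yield


app = FastAPI(lifespan=lifespan)
```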
**What I've Done So Far:**

- I'm not calling `.to(...)` anywhere; I explicitly removed all such lines. ✅
- I'm using `quantization_config=BitsAndBytesConfig(...)` with `load_in_4bit=True` (the full config is sketched right after this list). ✅
- I removed `device_map="auto"`, as per the transformers GitHub issue. ✅
- I'm calling `.cuda()` only once on the model, after `.from_pretrained(...)`, as suggested. ✅
- The model and tokenizer are loaded from the Hugging Face Hub with `HF_TOKEN` properly set. ✅
- The system detects CUDA correctly: `torch.cuda.is_available()` is `True`. ✅
- And last, I cleared the Hugging Face cache (`~/.cache/huggingface`) and re-ran everything. ✅
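For reference, here's roughly how `quant_config` is built. `load_in_4bit=True` is the part I'm certain about; the compute dtype and quant type below are what I believe I'm passing, so treat those two values as approximate:

```python
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantization, as described above
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype (from memory)
    bnb_4bit_quant_type="nf4",             # quant type (from memory)
)
```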
Here’s the relevant part of the code that triggers the error:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map=None,  # explicitly removed device_map="auto"
    token=hf_token,
).cuda()  # this is the only use of `.cuda()`

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
```
Yet I still get the same `ValueError`. What am I missing?
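In case this turns out to be version-dependent, here's a quick snippet I can run inside the container to report the relevant package versions:

```python
from importlib.metadata import version

import torch

# Report the libraries involved in 4-bit loading, plus the torch/CUDA build.
for pkg in ("transformers", "bitsandbytes", "accelerate"):
    print(pkg, version(pkg))
print("torch", torch.__version__, "CUDA build", torch.version.cuda)
```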