I'm deploying a FastAPI backend using Hugging Face Transformers with the `mistralai/Mistral-7B-Instruct-v0.1` model, quantized to 4-bit using `BitsAndBytesConfig`. I'm running this inside an NVIDIA GPU container (CUDA 12.1, A10G GPU with 22GB VRAM), and I keep hitting this error during model loading:
```
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models.
Please use the model as it is...
```
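For context, here's roughly how model loading fits into the FastAPI app. This is a trimmed-down sketch; the actual `from_pretrained(...)` call is shown further below:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # In the real app, the quantized-model load (the from_pretrained(...)
    # call shown later in this post) runs here, once at startup; it's
    # omitted in this sketch.
    yield


app = FastAPI(lifespan=lifespan)
```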
**What I've Done So Far:**

- I'm not calling `.to(...)` anywhere; I explicitly removed all such lines. ✅
- I'm using `quantization_config=BitsAndBytesConfig(...)` with `load_in_4bit=True` (the full config is sketched right after this list). ✅
- I removed `device_map="auto"`, as per the transformers GitHub issue. ✅
- I'm calling `.cuda()` only once on the model, after `.from_pretrained(...)`, as suggested. ✅
- The model and tokenizer are loaded from the Hugging Face Hub with `HF_TOKEN` properly set. ✅
- The system detects CUDA correctly: `torch.cuda.is_available()` is `True`. ✅
- And last, I cleared the Hugging Face cache (`~/.cache/huggingface`) and re-ran everything. ✅
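For reference, here's roughly how `quant_config` is built. `load_in_4bit=True` is the part I'm certain about; the compute dtype and quant type below are what I believe I'm passing, so treat those two values as approximate:

```python
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantization, as described above
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype (from memory)
    bnb_4bit_quant_type="nf4",             # quant type (from memory)
)
```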
Here’s the relevant part of the code that triggers the error:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map=None,  # explicitly removed device_map="auto"
    token=hf_token,
).cuda()  # this is the only use of `.cuda()`

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
```
Yet I still get the same `ValueError`. What am I missing?
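In case this turns out to be version-dependent, here's a quick snippet I can run inside the container to report the relevant package versions:

```python
from importlib.metadata import version

import torch

# Report the libraries involved in 4-bit loading, plus the torch/CUDA build.
for pkg in ("transformers", "bitsandbytes", "accelerate"):
    print(pkg, version(pkg))
print("torch", torch.__version__, "CUDA build", torch.version.cuda)
```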