
All Questions

-1 votes · 1 answer · 2k views

I load a float32 Hugging Face model, cast it to float16, and save it. How can I load it as float16?

I load a huggingface-transformers float32 model, cast it to float16, and save it. How can I load it as float16? Example: # pip install transformers from transformers import ...
Franck Dernoncourt
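
A minimal sketch of one way to do this, assuming PyTorch and a recent transformers release; the model name and save directory are just illustrative placeholders, not from the question:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # example model, assumed for illustration

# Load in float32 (the default), cast to float16, and save.
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model = model.half()  # cast all weights to float16
model.save_pretrained("./model-fp16")

# Reload; torch_dtype=torch.float16 keeps the saved half-precision
# weights instead of upcasting them back to the default float32.
model_fp16 = AutoModelForSequenceClassification.from_pretrained(
    "./model-fp16", torch_dtype=torch.float16
)
print(next(model_fp16.parameters()).dtype)  # torch.float16
```

Passing `torch_dtype="auto"` instead would let `from_pretrained` infer the dtype from the checkpoint itself.
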
1 vote · 1 answer · 948 views

Can language model inference on a CPU save memory by quantizing?

For example, according to https://cocktailpeanut.github.io/dalai/#/ the relevant figures for LLaMA-65B are: Full: The model takes up 432.64GB Quantized: 5.11GB * 8 = 40.88GB The full model won't fit ...
rwallace · 33.7k
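
As a rough back-of-the-envelope check of figures like these, a sketch under the assumption that memory is roughly parameter count times bytes per parameter, ignoring activations, KV cache, and framework overhead:

```python
# Rough memory estimate for a 65B-parameter model at various precisions.
# Assumes memory ≈ parameter_count * bytes_per_parameter; real usage
# varies with runtime overhead, context length, and checkpoint format.
params = 65e9

for label, bytes_per_param in [("float32", 4.0), ("float16", 2.0),
                               ("int8", 1.0), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{label:>8}: ~{gb:,.0f} GB")
# float32: ~260 GB, float16: ~130 GB, int8: ~65 GB, int4: ~33 GB
```

The 4-bit estimate (~33 GB) is in the same ballpark as the quantized figure quoted above (5.11 GB × 8 = 40.88 GB); the quoted "full" figure of 432.64 GB exceeds the raw float32 estimate, presumably because it counts additional on-disk or runtime overhead.
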