“I’m fine-tuning a transformer with batch size 8 and getting CUDA out-of-memory errors. Would gradient checkpointing or mixed precision help?”
Best practices
“I’m fine-tuning a transformer with batch size 8 and getting CUDA out-of-memory errors. Would gradient checkpointing or mixed precision help?”