5 votes
1 answer
259 views

How do I convert a `float` to a `_Float16`, or even initialize a `_Float16`? (And/or print with printf?)

I'm developing a library that uses _Float16s for many of the constants to save space when passing them around. However, in some quick testing, it seems that telling GCC to just "set it to 1" isn't ...
Coarse Rosinflower
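A minimal sketch of the kind of code being asked about, assuming GCC or Clang on a target with native _Float16 support; this is an illustration, not the accepted answer:

    #include <stdio.h>

    int main(void) {
        _Float16 one = 1.0f16;        /* f16 literal suffix (C23 / TS 18661-3) */
        float    f   = 3.14159f;
        _Float16 h   = (_Float16)f;   /* a plain cast performs the narrowing conversion */

        /* printf has no _Float16 conversion specifier; promote to double for printing */
        printf("%f %f\n", (double)one, (double)h);
        return 0;
    }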
1 vote
0 answers
48 views

Flipping a single bit of an IEEE-754 floating-point number mathematically

I'm working on implementing a mathematical approach to bit flipping in IEEE 754 FP16 floating-point numbers without using direct bit manipulation. The goal is to flip a specific bit (particularly in ...
Muhammad Zaky
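For reference, a small Python/NumPy sketch (the language is an assumption, since the excerpt names none) that flips a chosen bit via direct bit manipulation — i.e. the baseline the question wants to replace — which is handy for validating any arithmetic-only approach:

    import numpy as np

    def flip_bit_fp16(x, k):
        # k = 0 is the least significant mantissa bit, k = 15 the sign bit
        bits = np.array([x], dtype=np.float16).view(np.uint16)
        return (bits ^ np.uint16(1 << k)).view(np.float16)[0]

    print(flip_bit_fp16(1.0, 15))  # sign bit flipped: 1.0 -> -1.0
    print(flip_bit_fp16(1.0, 10))  # lowest exponent bit flipped: 1.0 -> 0.5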
1 vote
0 answers
61 views

Does cuBLAS support mixed precision matrix multiplication in the form C[f32] = A[bf16] * B[f32]?

I'm concerned with mixed precision in deep-learning LLMs. The intermediates are mostly F32, while the weights could be any other type like BF16, F16, or even a quantized type such as Q8_0 or Q4_0. It would be very useful if ...
dentry
1 vote
1 answer
372 views

Do all processors supporting AVX2 support F16C?

Is it safe to assume that all machines that support AVX2 also support F16C instructions? I haven't encountered a machine that doesn't, so far. Thanks
Srihari S
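Rather than relying on the implication, a short GCC/Clang sketch can simply check both feature bits at runtime ("avx2" and "f16c" are feature names __builtin_cpu_supports accepts):

    #include <stdio.h>

    int main(void) {
        /* nonzero means the running CPU reports the feature */
        printf("AVX2: %d, F16C: %d\n",
               __builtin_cpu_supports("avx2"),
               __builtin_cpu_supports("f16c"));
        return 0;
    }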
2 votes
1 answer
78 views

float16_t rounding on ARM NEON

I am implementing emulation of ARM float16_t for X64 using SSE; the idea is to have bit-exact values on both platforms. I mostly finished the implementation, except for one thing, I cannot correctly ...
Bogi
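For context, a sketch of the x86 side of such an emulation, assuming the F16C extension (compile with -mf16c): _mm_cvtps_ph with round-to-nearest-even matches AArch64's default rounding for float-to-half conversion, though it does not cover arithmetic performed directly on half values:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t f32_to_f16_bits(float x) {
        __m128  v = _mm_set_ss(x);
        /* round to nearest even, suppress exceptions */
        __m128i h = _mm_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        return (uint16_t)_mm_extract_epi16(h, 0);
    }

    int main(void) {
        printf("0x%04x\n", f32_to_f16_bits(1.0f));  /* expect 0x3c00 */
        return 0;
    }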
1 vote
1 answer
61 views

What makes `print(np.half(500.2))` differ from `print(f"{np.half(500.2)}")`?

I've been learning about floating-point truncation errors recently, but I found that print(np.half(500.2)) and print(f"{np.half(500.2)}") yield different results. Here are the logs I got in ...
Cestimium
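A small reproduction sketch (NumPy assumed; the exact strings printed can vary by NumPy version). The stored value is the nearest float16 to 500.2, which is 500.25; the usual explanation is that str() uses NumPy's float16-aware shortest repr, while the f-string goes through __format__ and formats the value as a regular Python float:

    import numpy as np

    x = np.half(500.2)
    print(x)          # str(x): NumPy's shortest float16 repr, e.g. 500.2
    print(f"{x}")     # format(x, ""): e.g. 500.25
    print(float(x))   # the exact stored value as a double: 500.25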
-2 votes
1 answer
300 views

Why do BF16 models have slower inference on Mac M-series chips compared to F16 models?

I read on https://github.com/huggingface/smollm/tree/main/smol_tools (mirror 1): All models are quantized to 16-bit floating-point (F16) for efficient inference. Training was done on BF16, but in our ...
Franck Dernoncourt
3 votes
2 answers
506 views

How can I convert an integer to CUDA's __half FP16 type, in a constexpr fashion?

I'm the developer of aerobus and I'm facing difficulties with half-precision arithmetic. At some point in the library, I need to convert an IntType to the related FloatType (same bit count) in a constexpr ...
Regis Portalez
0 votes
0 answers
34 views

DCGM_FI_PROF_PIPE_FP16_ACTIVE data collect

When I use dcgm-exporter to collect DCGM_FI_PROF_PIPE_FP16_ACTIVE data, I find the value is as small as 0.001458 while the unit is still %. Is this normal? And this is the program I ...
刘润泽
0 votes
1 answer
86 views

What is the difference, if any, between model.half() and model.to(dtype=torch.float16) in huggingface-transformers?

Example: # pip install transformers from transformers import AutoModelForTokenClassification, AutoTokenizer # Load model model_path = 'huawei-noah/TinyBERT_General_4L_312D' model = ...
Franck Dernoncourt
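A quick comparison sketch (transformers and torch assumed installed): for floating-point parameters the two calls should end up equivalent, .half() being shorthand for casting to float16 and .to(dtype=...) being the more general form:

    import torch
    from transformers import AutoModelForTokenClassification

    model_path = 'huawei-noah/TinyBERT_General_4L_312D'
    m1 = AutoModelForTokenClassification.from_pretrained(model_path).half()
    m2 = AutoModelForTokenClassification.from_pretrained(model_path).to(dtype=torch.float16)

    # both should report only torch.float16 for the parameters
    print({p.dtype for p in m1.parameters()})
    print({p.dtype for p in m2.parameters()})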
-1 votes
1 answer
2k views

I load a float32 Hugging Face model, cast it to float16, and save it. How can I load it as float16?

I load a huggingface-transformers float32 model, cast it to float16, and save it. How can I load it as float16? Example: # pip install transformers from transformers import ...
Franck Dernoncourt
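A minimal sketch (transformers/torch assumed; the model name is reused from the excerpt above as a stand-in): save the float16 weights with save_pretrained, then pass torch_dtype when loading:

    import torch
    from transformers import AutoModelForTokenClassification

    model = AutoModelForTokenClassification.from_pretrained('huawei-noah/TinyBERT_General_4L_312D')
    model = model.half()                      # cast float32 -> float16
    model.save_pretrained('./tinybert-fp16')  # weights on disk are now float16

    reloaded = AutoModelForTokenClassification.from_pretrained(
        './tinybert-fp16', torch_dtype=torch.float16)  # or torch_dtype="auto"
    print(next(reloaded.parameters()).dtype)  # torch.float16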
0 votes
1 answer
489 views

Is there any point in setting `fp16_full_eval=True` if training in `fp16`?

I train a Huggingface model with fp16=True, e.g.: training_args = TrainingArguments( output_dir="./results", evaluation_strategy="epoch", learning_rate=4e-5, ...
Franck Dernoncourt
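A sketch showing the two flags side by side (transformers assumed; flag semantics hedged per the documentation): fp16=True enables mixed-precision training, while fp16_full_eval=True additionally runs evaluation with the model cast to float16, which saves memory but can change metric values slightly:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",  # eval_strategy in newer transformers versions
        learning_rate=4e-5,
        fp16=True,                    # mixed-precision training
        fp16_full_eval=True,          # evaluate with the model cast to fp16 as well
    )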
6 votes
1 answer
953 views

AVX-512 BF16: load bf16 values directly instead of converting from fp32

On CPUs with AVX-512 and BF16 support, you can use the 512-bit vector registers to store 32 16-bit floats. I have found intrinsics to convert FP32 values to BF16 values (for example: ...
Thijs Steel
0 votes
1 answer
236 views

Xcode on Apple Silicon not compiling ARM64 half-precision NEON instructions: Invalid operand for instruction

To date I have had no issue compiling and running complex ARM NEON assembly-language routines in Xcode/Clang, and the Apple M1 supposedly supports ARMv8.4. But when I try to use half precision with ...
user2465201
0 votes
1 answer
107 views

std::floating_point concept in CUDA for all IEEE 754 types

I would like to know if CUDA provides a concept similar to std::floating_point but covering all IEEE 754 types, including e.g. __half. I provide below sample code that tests that __half template ...
Dimitri Lesnoff
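One possible approach, sketched under the assumption that no built-in concept fits: define a custom concept that extends std::floating_point with CUDA's extended floating-point types (requires nvcc with -std=c++20; available types depend on the CUDA version):

    #include <concepts>
    #include <cuda_fp16.h>
    #include <cuda_bf16.h>

    // Accepts the standard floating-point types plus CUDA's __half and __nv_bfloat16
    template <typename T>
    concept extended_floating_point =
        std::floating_point<T> ||
        std::same_as<T, __half> ||
        std::same_as<T, __nv_bfloat16>;

    static_assert(extended_floating_point<float>);
    static_assert(extended_floating_point<__half>);
    static_assert(!extended_floating_point<int>);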
