
All Questions

5 votes
1 answer
259 views

How do I convert a `float` to a `_Float16`, or even initialize a `_Float16`? (And/or print with printf?)

I'm developing a library which uses _Float16s for many of the constants to save space when passing them around. However, just testing, it seems that telling GCC to just "set it to 1" isn't ...
Coarse Rosinflower
1 vote
0 answers
48 views

Flipping a single bit of Floating-points (IEEE-754) mathematically

I'm working on implementing a mathematical approach to bit flipping in IEEE 754 FP16 floating-point numbers without using direct bit manipulation. The goal is to flip a specific bit (particularly in ...
Muhammad Zaky
3 votes
2 answers
506 views

How can I convert an integer to CUDA's __half FP16 type, in a constexpr fashion?

I'm the developer of aerobus and I'm facing difficulties with half-precision arithmetic. At some point in the library, I need to convert an IntType to the related FloatType (same bit count) in a constexpr ...
Regis Portalez
2 votes
2 answers
498 views

How do I print the half-precision / bfloat16 values from a (binary) file?

This is a variant of: How to print float value from binary file in shell? In that question, we wanted to print IEEE 754 single-precision (i.e. 32-bit) floating-point values from a binary file. Now ...
einpoklum
  • 133k
0 votes
0 answers
190 views

Clarification on IEEE 754 rounding to nearest, ties to even

I am working on an IEEE 754 16-bit adder, and I am confused by the round-to-nearest, ties-to-even logic. The first addition that confuses me is 169.8 (0x594E) + -0.06256 (0xAC01). After shifting and ...
Benjamin Owen
0 votes
0 answers
75 views

Precision loss reading from `r16Snorm` texture to `half` variable in Metal

Am I correct in my assumption that reading a value from an .r16SNorm texture into the Metal Shading Language half data type always unavoidably incurs precision loss? It wasn't obvious to me from the start ...
simd
  • 2,029
1 vote
3 answers
4k views

How to convert a float to a half type and the other way around in C

How can I convert a float (float32) to a half (float16) and the other way around in C while accounting for edge cases like NaN, Infinity, etc.? I don't need arithmetic because I just need the types in ...
juffma
  • 169
0 votes
0 answers
101 views

16-bit floating point division (half-precision)?

How can I divide one 16-bit (half-precision) floating-point number by another? I did the sign with an XOR gate and the exponent with a 5-bit subtractor, but couldn't do the mantissa. How can I ...
Arthur
  • 1
0 votes
1 answer
2k views

List of ARM instructions implementing half-precision floating-point arithmetic

Arm Architecture Reference Manual for A-profile architecture (emphasis added): FPHP, bits [27:24] 0b0011 As for 0b0010, and adds support for half-precision floating-point arithmetic. A simple ...
pmor
  • 6,530
2 votes
2 answers
1k views

Double vs Float vs _Float16 (Running Time)

I have a simple question about C. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but the running time is not much faster than single or double-precision ...
YUNBLACK
8 votes
1 answer
2k views

Why does bfloat16 have so many exponent bits?

It's clear why a 16-bit floating-point format has started seeing use for machine learning; it reduces the cost of storage and computation, and neural networks turn out to be surprisingly insensitive ...
rwallace
  • 33.7k
1 vote
2 answers
2k views

Bit shifting a half-float into a float

I have no choice but to read in 2 bytes that make up a half-float. I would like to work with this in the form of a 4-byte float. I've done some research and the only thing I can come up with is bit ...
Justin Barren
5 votes
3 answers
5k views

How to correctly determine at compile time that _Float16 is supported?

I am trying to determine at compile time that _Float16 is supported: #define __STDC_WANT_IEC_60559_TYPES_EXT__ #include <float.h> #ifdef FLT16_MAX _Float16 f16; #endif Invocations: # gcc trunk ...
pmor
  • 6,530
1 vote
1 answer
3k views

Why does converting from np.float16 to np.float32 modify the value?

When converting a number from half- to single-precision floating-point representation I see a change in the numeric value. Here I have 65500 stored as a half-precision float, but upgrading to single precision changes ...
Mikhail
  • 8,058
0 votes
0 answers
253 views

How to Initialise 16-bit Half Floats (GAS for ARM32)?

When writing an ARM assembly program, one can use data type directives to initialise some values. For example, in the example below we are initialising a single float: label: .single 0.0 However, when ...
Werdok
  • 181
