Newest 'half-precision-float+floating-point' Questions

5 votes

1 answer

259 views

How do I convert a `float` to a `_Float16`, or even initialize a `_Float16`? (And/or print with printf?)

I'm developing a library which uses _Float16s for many of the constants to save space when passing them around. However, just testing, it seems that telling GCC to just "set it to 1" isn't ...

Coarse Rosinflower

145

asked Mar 21 at 2:10

1 vote

0 answers

48 views

Flipping a single bit of Floating-points (IEEE-754) mathematically

I'm working on implementing a mathematical approach to bit flipping in IEEE 754 FP16 floating-point numbers without using direct bit manipulation. The goal is to flip a specific bit (particularly in ...

Muhammad Zaky

11

asked Mar 18 at 10:34

3 votes

2 answers

506 views

How can I convert an integer to CUDA's __half FP16 type, in a constexpr fashion?

I'm the developer of aerobus and I'm facing difficulties with half-precision arithmetic. At some point in the library, I need to convert an IntType to the related FloatType (same bit count) in a constexpr ...

Regis Portalez

4,870

asked Sep 14, 2024 at 10:04

2 votes

2 answers

498 views

How do I print the half-precision / bfloat16 values from a (binary) file?

This is a variant of: How to print float value from binary file in shell? In that question, we wanted to print IEEE 754 single-precision (i.e. 32-bit) floating-point values from a binary file. Now ...

einpoklum

133k

asked Feb 1, 2024 at 12:31
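For the question above on printing half-precision and bfloat16 values from a binary file, here is a minimal sketch using only Python's standard `struct` module. It assumes the file holds tightly packed little-endian 16-bit values; the function names `decode_half` and `decode_bfloat16` are hypothetical, and this is not the shell-based approach the question asks about.

```python
import struct

def decode_half(data: bytes) -> list[float]:
    """Decode little-endian IEEE 754 binary16 values (assumed layout)."""
    return [value for (value,) in struct.iter_unpack("<e", data)]

def decode_bfloat16(data: bytes) -> list[float]:
    """Decode little-endian bfloat16 values by widening each one to float32."""
    out = []
    for (bits,) in struct.iter_unpack("<H", data):
        # bfloat16 is the upper 16 bits of a float32, so shift left by 16
        # and reinterpret the resulting 32-bit pattern as a float32.
        out.append(struct.unpack("<f", struct.pack("<I", bits << 16))[0])
    return out

if __name__ == "__main__":
    sample = struct.pack("<3e", 1.0, -2.5, 65504.0)
    print(decode_half(sample))                          # [1.0, -2.5, 65504.0]
    print(decode_bfloat16(bytes.fromhex("803f00c0")))   # [1.0, -2.0]
```

For a real file, the bytes would come from `open(path, "rb").read()`; a big-endian file would need `">e"` and `">H"` instead.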

0 votes

0 answers

190 views

Clarification on IEEE 754 rounding to nearest, ties to even

I am working on an IEEE 754 16-bit adder, and I am confused by the round-to-nearest, ties-to-even logic. The first addition which confuses me is 169.8 (0x594E) + -0.06256 (0xAC01). After shifting and ...

Benjamin Owen

617

asked Dec 11, 2023 at 7:25
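As a software cross-check for an adder like the one in the question above, the following sketch adds two binary16 bit patterns exactly in double precision and then rounds once to binary16. It relies on Python's `struct` format `"e"`, which rounds to nearest with ties to even; the helper names are hypothetical, and the sketch does not handle sums that overflow to infinity (`struct.pack` raises `OverflowError` there).

```python
import struct

def half_to_float(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def float_to_half(value: float) -> int:
    """Round a double to binary16 (nearest, ties to even) and return its bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

def half_add(a_bits: int, b_bits: int) -> int:
    # The sum of two finite binary16 values is exactly representable in a
    # double, so the only rounding happens in float_to_half.
    exact = half_to_float(a_bits) + half_to_float(b_bits)
    return float_to_half(exact)

if __name__ == "__main__":
    # Operands from the question: just below a tie, so the result rounds down.
    print(hex(half_add(0x594E, 0xAC01)))  # 0x594d
    # True ties: 2048 + 1 and 2050 + 1 both land halfway between neighbours.
    print(hex(half_add(0x6800, 0x3C00)))  # 0x6800 (2048, even significand)
    print(hex(half_add(0x6801, 0x3C00)))  # 0x6802 (2052, even significand)
```

The first case is not an exact tie: the exact sum lies slightly below the halfway point, which is why a hardware adder needs a sticky bit in addition to the guard and round bits.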

0 votes

0 answers

75 views

Precision loss reading from `r16Snorm` texture to `half` variable in Metal

Am I correct in my assumption that reading a value from an .r16SNorm texture into the Metal Shading Language half data type always, unavoidably, incurs precision loss? It wasn't obvious to me from the start ...

simd

2,029

asked Sep 29, 2023 at 20:47

1 vote

3 answers

4k views

How to convert a float to a half type and the other way around in C

How can I convert a float (float32) to a half (float16) and the other way around in C while accounting for edge cases like NaN, Infinity, etc.? I don't need arithmetic because I just need the types in ...

juffma

169

asked Jul 30, 2023 at 18:20

0 votes

0 answers

101 views

16-bit floating point division (half-precision)?

How can I divide a 16-bit floating-point number by a 16-bit floating-point number (half-precision)? I did the sign with an XOR gate and the exponent with a 5-bit subtractor, but couldn't do the mantissa. How can I ...

Arthur

1

asked May 23, 2023 at 19:16

0 votes

1 answer

2k views

List of ARM instructions implementing half-precision floating-point arithmetic

Arm Architecture Reference Manual for A-profile architecture (emphasis added): FPHP, bits [27:24] 0b0011 As for 0b0010, and adds support for half-precision floating-point arithmetic. A simple ...

pmor

6,530

asked May 15, 2023 at 15:27

2 votes

2 answers

1k views

Double vs Float vs _Float16 (Running Time)

I have a simple question about C. I am implementing half-precision software using _Float16 in C (my Mac is ARM-based), but the running time is not much faster than single or double-precision ...

YUNBLACK

21

asked Jul 9, 2022 at 16:38

8 votes

1 answer

2k views

Why does bfloat16 have so many exponent bits?

It's clear why a 16-bit floating-point format has started seeing use for machine learning; it reduces the cost of storage and computation, and neural networks turn out to be surprisingly insensitive ...

rwallace

33.7k

asked Jun 2, 2022 at 10:33
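To make the trade-off behind the bfloat16 question above concrete, this sketch computes the largest finite value, smallest normal value, and machine epsilon from the exponent and fraction widths of an IEEE-style binary format. The widths used are the standard ones (binary16: 5 and 10, bfloat16: 8 and 7, float32: 8 and 23); the `Limits` type and `format_limits` function are hypothetical names.

```python
from typing import NamedTuple

class Limits(NamedTuple):
    max_finite: float
    min_normal: float
    epsilon: float

def format_limits(exp_bits: int, frac_bits: int) -> Limits:
    """Limits of an IEEE-style binary format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    # Largest finite value: all-ones fraction at the maximum exponent.
    max_finite = (2.0 - 2.0 ** -frac_bits) * 2.0 ** bias
    # Smallest normal value: zero fraction at the minimum exponent.
    min_normal = 2.0 ** (1 - bias)
    # Gap between 1.0 and the next representable value.
    epsilon = 2.0 ** -frac_bits
    return Limits(max_finite, min_normal, epsilon)

if __name__ == "__main__":
    for name, widths in [("binary16", (5, 10)), ("bfloat16", (8, 7)), ("float32", (8, 23))]:
        print(name, format_limits(*widths))
```

bfloat16 shares float32's exponent width, so its range matches float32's almost exactly, while its epsilon is 2^-7 compared with 2^-10 for binary16.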

1 vote

2 answers

2k views

Bit shifting a half-float into a float

I have no choice but to read in 2 bytes that make up a half-float. I would like to work with this in the form of a 4-byte float. I've done some research and the only thing I can come up with is bit ...

Justin Barren

23

asked Feb 15, 2022 at 1:29
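One way to approach the bit-shifting question above is to build the float32 bit pattern field by field. The sketch below is an illustration under the standard binary16 and float32 layouts, not the asker's code; `half_bits_to_float32_bits` is a hypothetical name, and `struct` is used only to reinterpret the final 32-bit pattern.

```python
import struct

def half_bits_to_float32_bits(h: int) -> int:
    """Widen a binary16 bit pattern to the equivalent float32 bit pattern."""
    sign = (h & 0x8000) << 16
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0x1F:
        # Infinity or NaN: all-ones exponent, fraction moved to the top bits.
        return sign | 0x7F800000 | (frac << 13)
    if exp == 0:
        if frac == 0:
            return sign  # signed zero
        # Subnormal: shift until the implicit leading 1 appears, then drop it.
        shift = 0
        while not (frac & 0x400):
            frac <<= 1
            shift += 1
        frac &= 0x3FF
        exp = 1 - shift
    # Rebias the exponent from 15 to 127 and left-align the fraction.
    return sign | ((exp + 112) << 23) | (frac << 13)

def half_bits_to_float(h: int) -> float:
    """Convert a binary16 bit pattern to a Python float via float32 bits."""
    return struct.unpack("<f", struct.pack("<I", half_bits_to_float32_bits(h)))[0]

if __name__ == "__main__":
    print(half_bits_to_float(0x3C00))  # 1.0
    print(half_bits_to_float(0x7BFF))  # 65504.0
    print(half_bits_to_float(0x0001))  # 5.960464477539063e-08
```

Every binary16 value, including subnormals, is exactly representable in float32, so this widening never rounds.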

5 votes

3 answers

5k views

How to correctly determine at compile time that _Float16 is supported?

I am trying to determine at compile time that _Float16 is supported:

    #define __STDC_WANT_IEC_60559_TYPES_EXT__
    #include <float.h>
    #ifdef FLT16_MAX
    _Float16 f16;
    #endif

Invocations: # gcc trunk ...

pmor

6,530

asked Nov 15, 2021 at 15:52

1 vote

1 answer

3k views

Why does converting from np.float16 to np.float32 modify the value?

When converting a number from half to single floating representation I see a change in the numeric value. Here I have 65500 stored as a half precision float, but upgrading to single precision changes ...

Mikhail

8,058

asked Jul 8, 2021 at 0:08
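The effect described in the question above can be reproduced without NumPy. This sketch uses the standard `struct` module's binary16 format as a stand-in for `np.float16` (an assumption; it is not the NumPy API) to show that 65500 is not representable in half precision, so the stored value is already the nearest neighbour before any widening happens. The function names are hypothetical.

```python
import math
import struct

def round_to_half(value: float) -> float:
    """Round a double to the nearest binary16 value and widen it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def half_spacing(value: float) -> float:
    """Gap between adjacent binary16 values around a normal-range `value`."""
    # frexp returns value = m * 2**e with 0.5 <= m < 1, so the binade is e - 1.
    _, e = math.frexp(abs(value))
    # binary16 has 10 fraction bits, so the spacing is 2**((e - 1) - 10).
    return 2.0 ** (e - 11)

if __name__ == "__main__":
    print(half_spacing(65500.0))   # 32.0
    print(round_to_half(65500.0))  # 65504.0
```

Between 32768 and 65536 the representable half-precision values are 32 apart, and the multiple of 32 nearest to 65500 is 65504. Widening to single precision is exact; it only makes the already-rounded value visible.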

0 votes

0 answers

253 views

How to Initialise 16-bit Half Floats (GAS for ARM32)?

When writing an ARM assembly program, one can use data type directives to initialise some values. For example, the following initialises a single-precision float: `label: .single 0.0`. However, when ...

Werdok

181

asked Dec 18, 2020 at 12:54
