Writing rounding and truncating functions for floating-point numbers
A floating-point number can be expressed like this:
FloatPointNumber = BaseFloatPoint x 2^ExponentFloatPoint
By "bits of the floating-point type" I mean the number of bits that BaseFloatPoint has in the float type
under consideration. These bits store the digits of the number. Both BaseFloatPoint and ExponentFloatPoint are ultimately stored as binary integers.
This question requires a floating-point type with 64 bits, that is, BaseFloatPoint
is 64 bits wide. From here on, "floating-point number", "floating-point type" and "float" all mean
a 64-bit float, one that has a 64-bit BaseFloatPoint.
Let's define a function that returns how many decimal digits after the decimal point a float w can hold (3.322 approximates log2(10), and log(2,x) is the base-2 logarithm of x):
lenfld(w)= floor((63-floor(log(2,max(w,1))))/3.322)
which in APL would be written
lenfld←{⌊(63-⌊2⍟⍵⌈1)÷3.322}
Or better, taking the precision from the external setting ⎕fpc:
lenfld←{⌊3.322÷⍨⎕fpc-1+⌊2⍟1⌈⍵}
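For readers who do not use APL, the same formula can be sketched in Python. This is only an illustration: the parameter name fpc mirrors ⎕fpc above, and the argument is an ordinary Python float (53-bit precision), which is enough to evaluate the formula but is not the 64-bit type this question is about.

```python
import math

def lenfld(w: float, fpc: int = 64) -> int:
    """Decimal digits after the point, following the formula above."""
    # floor(log2(max(w, 1))) is the position of the leading bit of the integer part
    leading_bit = math.floor(math.log2(max(w, 1)))
    # the remaining bits, divided by ~log2(10), give the decimal digits
    return math.floor((fpc - 1 - leading_bit) / 3.322)

print(lenfld(4.35))                   # 18
print(lenfld(0.739))                  # 18
print(lenfld(123456789012345678.5))   # 2
```

These values agree with the Lenfld figures shown in the exercises.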
The question requires writing two functions:
One called "truncdgt", which takes a non-negative integer dg (the number of digits after the decimal point) and a float w,
and returns a float res. If Str is the string obtained by writing w truncated at digit dg,
then converting Str back to a float must always satisfy this formula:
abs(res-(conversion to float)(Str))< 10^-lenfld(w)
One called "roundgt", which takes a non-negative integer dg (the number of digits after the decimal point) and a float w,
and returns a float res. If Str is the string obtained by writing w rounded to digit dg,
then converting Str back to a float
must always satisfy this formula:
abs(res-(conversion to float)(Str))< 10^-lenfld(w)
Simply computing (conversion to float)(String_trunc dg w) seems to be an acceptable truncdgt, and likewise for roundgt, but many other implementations should be possible.
If digit dg of the number to be truncated or rounded lies beyond what the 64 bits can represent for that number,
both functions must return NaN, or something else that is clearly not a
float.
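A minimal, ungolfed sketch of the two functions in Python may make the requirements concrete. It is not a reference implementation and rests on several assumptions: Python's decimal module stands in for the 64-bit float (so it does not reproduce binary rounding error), inputs with dg > lenfld(w) are the ones rejected, and ties round away from zero. The last two are read off the exercises rather than stated as rules. The helper _cut is a hypothetical name.

```python
import math
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP, localcontext

def lenfld(w: Decimal, fpc: int = 64) -> int:
    # floor(log2(max(w, 1))), computed exactly from the integer part
    ilog2 = int(max(w, Decimal(1))).bit_length() - 1
    return math.floor((fpc - 1 - ilog2) / 3.322)

def _cut(dg: int, w: Decimal, mode: str) -> Decimal:
    # Assumption: digit dg is out of reach exactly when dg > lenfld(w)
    if dg > lenfld(w):
        return Decimal("NaN")
    with localcontext() as ctx:
        ctx.prec = 50  # ample room for the digits used here
        return w.quantize(Decimal(1).scaleb(-dg), rounding=mode)

def truncdgt(dg: int, w: Decimal) -> Decimal:
    return _cut(dg, w, ROUND_DOWN)      # drop digits, toward zero

def roundgt(dg: int, w: Decimal) -> Decimal:
    return _cut(dg, w, ROUND_HALF_UP)   # assumption: ties go away from zero

w = Decimal("4.349999999999999999")
print(truncdgt(2, w), roundgt(2, w), truncdgt(19, w))  # 4.34 4.35 NaN
```

Because decimal arithmetic is exact here, the difference between res and the converted Str is always 0 in this sketch; a real 64-bit binary float may differ by one unit in the last place.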
I have an implementation of these two functions in APL on my PC. There,
floating-point numbers can carry a "v" suffix, which is how they are written in the exercises.
This question has the code-golf tag, so answers that solve all the exercises will be ranked by the number
of bytes of code in the answer, and the answer that implements the two functions
"truncdgt" and "roundgt" in the fewest bytes will win.
Each exercise is laid out as follows:
- First line: the input. It starts with the name of the function that prints the result and other information ("tds" means it calls truncdgt, "rds" means it calls roundgt), followed by the input digit dg and the input float w.
- Next line: the float "res" returned by "truncdgt" or "roundgt" (converted to digits by the system, I think),
the truncated or rounded string of the number w at digit dg, which we call "Str", and delta=abs(res-(conversion to float)(Str)), printed as "d".
- Next line: lenfld(w), and whether delta<10^-lenfld(w); 1 means true, 0 means false.
A result of ∅ means the input was rejected and no float was returned.
Exercises:
tds 1 1234567890123456785v
∅
rds 1 1234567890123456785v
∅
tds 0 123456789012345678.5v
res= 123456789012345678.00v Str= 123456789012345678.v d=∣res-⍎Str= 0v
Lenfld= 2v (d<1e¯ 2v )= 1
rds 0 123456789012345678.5v
res= 123456789012345679.00v Str= 123456789012345679.v d=∣res-⍎Str= 0v
Lenfld= 2v (d<1e¯ 2v )= 1
rds 4 0.7390851332v
res= 0.739100000000000000v Str= 0.7391v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 4 0.7390851332v
res= 0.739000000000000000v Str= 0.7390v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 2 2.8078v
res= 2.810000000000000000v Str= 2.81v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 2 2.8078v
res= 2.800000000000000000v Str= 2.80v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 6 2.8078v
res= 2.807800000000000000v Str= 2.807800v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 6 2.8078v
res= 2.807800000000000000v Str= 2.807800v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 2 4.349999999999999999v
res= 4.340000000000000000v Str= 4.34v d=∣res-⍎Str= 4.33680869E¯19v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 2 4.349999999999999999v
res= 4.350000000000000000v Str= 4.35v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 2 ¯4.349999999999999999v
res= ¯4.340000000000000000v Str= ¯4.34v d=∣res-⍎Str= 4.33680869E¯19v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 2 ¯4.349999999999999999v
res= ¯4.350000000000000000v Str= ¯4.35v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 19 4.349999999999999999v
∅
tds 18 4.349999999999999999v
res= 4.349999999999999999v Str= 4.349999999999999999v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 17 4.349999999999999999v
res= 4.349999999999999990v Str= 4.34999999999999999v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
tds 16 4.349999999999999999v
res= 4.349999999999999900v Str= 4.3499999999999999v d=∣res-⍎Str= 4.33680869E¯19v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 19 4.349999999999999999v
∅
rds 18 4.349999999999999999v
res= 4.349999999999999999v Str= 4.349999999999999999v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 17 4.349999999999999999v
res= 4.350000000000000000v Str= 4.35000000000000000v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 16 4.349999999999999999v
res= 4.350000000000000000v Str= 4.3500000000000000v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
The two results below are undefined behavior, because the input literal has more digits than the float can store, so they should be rejected.
tds 2 4.3499999999999999999v
res= 4.350000000000000000v Str= 4.35v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
rds 2 4.3499999999999999999v
res= 4.350000000000000000v Str= 4.35v d=∣res-⍎Str= 0v
Lenfld= 18v (d<1e¯ 18v )= 1
These last two results are undefined behavior because 4.3499999999999999999v has 20 digits in total,
while the float type can only represent numbers with at most 19 significant digits.
4.3499999999999999999v
4.3499999999999999999v
1 2345678901234567890
is 20 digits
whereas, by my reckoning, that number can hold only lenfld(4.349999999999999999v)=18 decimal digits.
Here for that number I have:
30⍕4.349999999999999999v
4.3499999999999999991v
1 2345678901234567890 is 20 digits.
This prints digits that should not be printable, because they are not stored.
So I think that number, as it sits in memory, has enough bits to be >= 4.35.
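This can be checked with exact rational arithmetic. The Python sketch below assumes that a literal is stored as the nearest value with a 64-bit BaseFloatPoint, with ties going to even; round_to_sig64 is a hypothetical helper, not part of any APL system. Under that assumption the 20-digit literal and 4.35 are stored as the same float, which would explain why tds 2 and rds 2 both print 4.35 for it. The stored value itself comes out very slightly below the exact decimal 4.35.

```python
from fractions import Fraction

def round_to_sig64(x: Fraction, bits: int = 64) -> Fraction:
    """Round positive x to the nearest k / 2**shift with a `bits`-bit integer k."""
    # e = floor(log2(x)), computed exactly from the bit lengths
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    shift = bits - 1 - e                 # so that 2**(bits-1) <= x * 2**shift < 2**bits
    k = round(x * Fraction(2) ** shift)  # round() on a Fraction rounds half to even
    return k / Fraction(2) ** shift

stored_20 = round_to_sig64(Fraction("4.3499999999999999999"))  # 20-digit literal
stored_19 = round_to_sig64(Fraction("4.349999999999999999"))   # 19-digit literal
stored_435 = round_to_sig64(Fraction("4.35"))
ulp = Fraction(1, 2 ** 61)  # spacing between neighbouring floats in [4, 8)

print(stored_20 == stored_435)            # True: same stored float
print(stored_435 - stored_19 == 2 * ulp)  # True: the 19-digit literal is distinct
print(float(ulp))                         # about 4.3368e-19
```

The spacing 2^-61 is about 4.3368E-19, the same value that appears as d in some of the exercises, so those differences amount to one unit in the last place.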
Translated into English with Google Translate, with some changes of my own.