1,525 questions
8
votes
1
answer
169
views
Does the MSVC implementation of `signaling_NaN` comply with the latest IEEE floating-point standard?
As far as I can tell, the MSVC implementation of signaling_NaN does not comply with IEEE 754-2019, the latest version of the IEEE floating-point standard.
Unfortunately, I do not have a copy of the ...
0
votes
0
answers
43
views
How to make TypeORM auto-fix all floating-point values according to their DB schema type?
Not sure if that is possible at all...
It is a typical problem: the value in the DB is 4.725, but the UI shows 4.7250000000000005.
And there are a lot of other example values that produce this kind of ...
3
votes
2
answers
123
views
How to trigger exactly *one* SSE exception
I've written a little test program that triggers FPU exceptions through feraiseexcept():
#include <iostream>
#include <cfenv>
using namespace std;
int main()
{
auto test = []( int exc,...
1
vote
3
answers
207
views
Are Math.sqrt(x) and Math.pow(x, 0.5) equivalent?
In ECMAScript, given a non-negative, finite double x, is the following assertion always true?
Math.sqrt(x) === Math.pow(x, 0.5)
I know that both Math.sqrt() and Math.pow() are implementation-...
7
votes
1
answer
131
views
Is it always true that x * y = ((x * y) / y) * y under IEEE 754 semantics?
Given two nonzero, finite, double-precision floating point numbers x and y, is it always true that the equality
x * y == ((x * y) / y) * y
holds under default IEEE 754 semantics?
I've searched ...
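One way to probe the identity empirically is a random search. The sketch below is only an illustration: the helper names `holds` and `find_counterexamples` and the sampling range are assumptions, not part of the question, and an empty result proves nothing about all doubles.

```python
import random

def holds(x: float, y: float) -> bool:
    """Evaluate x * y == ((x * y) / y) * y in double precision."""
    p = x * y
    return p == (p / y) * y

def find_counterexamples(trials: int, seed: int = 0) -> list:
    """Sample nonzero finite pairs and collect those that break the identity."""
    rng = random.Random(seed)
    found = []
    for _ in range(trials):
        # Hypothetical sampling range; extreme exponents are not covered.
        x = rng.uniform(-1e6, 1e6)
        y = rng.uniform(-1e6, 1e6)
        if x == 0.0 or y == 0.0:
            continue
        if not holds(x, y):
            found.append((x, y))
    return found

print(len(find_counterexamples(100_000)))
```

Widening the sampling to values near the overflow and underflow thresholds would be the natural next step, since those are the regions where intermediate results lose the most information.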
1
vote
2
answers
114
views
Java Double Precision - Rounding - %f specifier
Numbers sometimes cannot be expressed exactly when they are represented in double precision or single precision. Of course working with bigdecimal is a solution, I know that.
Let's come to my question:...
2
votes
2
answers
113
views
Floating Point: Why does the implicit 1 change the value of the fractional part?
I was reading about the floating point implementation from the comments of a ziglings.org exercise, and I came across this info about it.
// Floating further:
//
// As an example, Zig's f16 is a IEEE ...
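The role of the implicit leading 1 can be illustrated by decoding the three fields of a binary16 value by hand. This is a minimal Python sketch, not Zig code; `decode_f16` is a hypothetical helper, and it assumes the standard IEEE 754 binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits).

```python
import struct

def decode_f16(bits: int) -> float:
    """Decode a 16-bit pattern as IEEE 754 binary16."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0x1F:
        # All-ones exponent: infinity or NaN.
        return sign * float("inf") if fraction == 0 else float("nan")
    if exponent == 0:
        # Subnormal: no implicit 1, fixed exponent of -14.
        return sign * (fraction / 1024) * 2.0 ** -14
    # Normal: the stored fraction is added to an implicit leading 1.
    return sign * (1 + fraction / 1024) * 2.0 ** (exponent - 15)

# Cross-check against Python's own half-precision decoding.
for pattern in (0x3C00, 0x3E00, 0xC000, 0x0001, 0x7BFF):
    reference = struct.unpack("<e", struct.pack("<H", pattern))[0]
    print(hex(pattern), decode_f16(pattern), reference)
```

The fraction field only ever encodes the part of the significand after the leading 1, which is why the significand of a normal number is `1 + fraction / 1024` rather than `fraction / 1024`.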
11
votes
1
answer
656
views
How to achieve same double to string conversion rounding results in C++ and C#?
I want to convert a double to a string with a given number of decimal places in C++ as well as in C# and I want the results of those conversions to be the same in both languages. Especially C++ ...
3
votes
2
answers
147
views
Convert floating-point value to cyclic range?
I'm not sure if I'm using the right terminology, but occasionally I find myself needing to canonicalize a floating-point value to a range in a cyclic manner. (This can be useful, for instance, for ...
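A minimal Python sketch of such a canonicalization, assuming a half-open target range [lo, hi); the name `wrap_to_range` is hypothetical. It relies on Python's float `%`, whose result takes the sign of the divisor, and guards against the case where rounding pushes the result onto the excluded upper bound.

```python
def wrap_to_range(x: float, lo: float, hi: float) -> float:
    """Map x cyclically into the half-open interval [lo, hi)."""
    width = hi - lo
    if not width > 0:
        raise ValueError("hi must be greater than lo")
    # For a positive divisor, Python's % yields a non-negative result.
    result = lo + (x - lo) % width
    # Rounding can land exactly on hi (for example with tiny negative x).
    if result >= hi:
        result = lo
    return result

print(wrap_to_range(370.0, 0.0, 360.0))
print(wrap_to_range(-10.0, 0.0, 360.0))
print(wrap_to_range(190.0, -180.0, 180.0))
```

Languages whose remainder operator follows the sign of the dividend (C's `fmod`, for instance) would need an extra correction step for negative inputs.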
2
votes
2
answers
182
views
Why does IEEE 754 define 1 ^ NaN as 1, and why do Java and JavaScript violate this?
IEEE 754 defines 1 ^ n as 1, regardless of n. (I'm not paying $106 to confirm this for myself, but this paper cites page 44 from the 2008 standard for this claim.) Most programming languages seem to ...
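For comparison, Python's `math.pow` is documented to return 1.0 for `pow(1.0, x)` and `pow(x, 0.0)` even when `x` is NaN. The short check below only shows that behaviour in Python; it says nothing about Java or JavaScript.

```python
import math

nan = float("nan")

# Both special cases yield 1.0 rather than NaN in Python's math.pow.
print(math.pow(1.0, nan))
print(math.pow(nan, 0.0))

# Any other base with a NaN exponent propagates the NaN.
print(math.isnan(math.pow(2.0, nan)))
```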
-1
votes
1
answer
106
views
Throw an exception when storing a number in a float would round it and lose precision
I need to process a CSV. The users define whether a column holds floats or doubles. The thing is, sometimes they put doubles in a float column, the values then get rounded, and the users only find out ...
5
votes
3
answers
793
views
Why do we need both a round bit and a sticky bit in IEEE 754 floating point implementations?
In my university lecture we just learnt about IEEE 754 arithmetic using the following table:
Guard | Round | Sticky | Result
0 | x | x | Round down (do nothing to significand)
1 | 1 | x | Round up
1 | 0 | 1 | Round up
1 | 0 | 0 | ...
3
votes
1
answer
60
views
IEEE Floating-Point Number Bound for (b-a)+a, where 0<=a<=b
Question
Given two non-negative numbers a and b, where a is less than or equal to b, I want to know whether y, as computed by the following algorithm, is less than or equal to b.
Algorithm:
x = b-a;
y = x+a;
Is y<=b in ...
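The two rounding steps of the algorithm can be inspected exactly with rational arithmetic. The following is a small Python sketch under the assumption of binary64 with round-to-nearest (Python's `float`); `round_trip` is a hypothetical helper, and it reports what happens for one pair rather than proving a bound.

```python
from fractions import Fraction

def round_trip(a: float, b: float) -> dict:
    """Compute x = b - a and y = x + a, with the exact error of each step."""
    x = b - a
    y = x + a
    return {
        "x": x,
        "y": y,
        # Fraction(float) is exact, so these are the true rounding errors.
        "sub_error": Fraction(x) - (Fraction(b) - Fraction(a)),
        "add_error": Fraction(y) - (Fraction(x) + Fraction(a)),
        "y_le_b": y <= b,
    }

print(round_trip(0.1, 0.3))
print(round_trip(1e-17, 1.0))
```

Looking at the sign of `sub_error` and `add_error` for many pairs gives a feel for when the second rounding could overshoot b, which is the situation the question is concerned with.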
9
votes
2
answers
174
views
How many values can be represented in a range when using 64-bit floating point type in the most efficient manner
Given a 64-bit floating point type (ieee-754), and a closed range [R0,R1] assuming both R0 and R1 are within the representable range and are not NaN or +/-inf etc.
How does one calculate the number of ...
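One common technique is to map each double's bit pattern to an integer that preserves numeric order and then subtract. The sketch below assumes IEEE 754 binary64 and treats -0.0 and +0.0 as a single value; `ordered_key` and `count_doubles` are hypothetical names.

```python
import math
import struct

def ordered_key(x: float) -> int:
    """Map a non-NaN double to an integer that preserves numeric order."""
    if math.isnan(x):
        raise ValueError("NaN has no position in the ordering")
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Non-negative doubles already sort by their bit pattern; negative ones
    # sort in reverse, so negate their magnitude bits.
    return bits if bits >= 0 else -(bits & 0x7FFF_FFFF_FFFF_FFFF)

def count_doubles(r0: float, r1: float) -> int:
    """Number of distinct double values in the closed range [r0, r1]."""
    if r0 > r1:
        raise ValueError("r0 must not exceed r1")
    return ordered_key(r1) - ordered_key(r0) + 1

print(count_doubles(1.0, 2.0))
print(count_doubles(-1.0, 1.0))
```

The whole computation is two bit reinterpretations and one subtraction, so it runs in constant time regardless of how many values lie in the range.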
0
votes
3
answers
319
views
Windows on ARM: Math is not trapping
I am using this code to test trapping math on Windows 11, VS2022, amd64 and arm64 systems:
Godbolt
// compile with: cl /O2 /EHa /std:c++20
#include <cfenv>
#include <eh.h>
#include <...
2
votes
1
answer
114
views
Why does this Java float addition example behave like the mantissa is 24 bits long?
Intro:
With Java floats, I noticed that when you add 1.0 to a certain range of tiny negative numbers, it equals 1.0. I decided to investigate this and learned a lot about how floats work in my quest ...
0
votes
1
answer
93
views
Lossy conversion between long double and double
I was caught by surprise that the following code returns false for gcc 13 and clang 18. Why does this happen? Isn't the number 8.1 representable in both formats?
#include <iostream>
#include <...
4
votes
1
answer
262
views
What happens when an integer value needing more than 52 bits of mantissa is stored in the double data type?
#include <stdio.h>
int main() {
double a =92233720368547758071;
printf("value=%lf\n", a);
int i;
char *b = &a;
for (i = 0; i < 8; i++) {
printf("...
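The rounding that occurs when a large integer is converted to a double can be observed directly in Python, whose `float` is a binary64 value and whose int-to-float conversion rounds to nearest. This is only a sketch of the effect, not a translation of the C program; `to_double_and_back` is a hypothetical helper.

```python
def to_double_and_back(n: int) -> int:
    """Convert an integer to binary64 and back to see which value was stored."""
    return int(float(n))

# Up to 2**53 every integer is representable; 2**53 + 1 is not.
print(to_double_and_back(2**53))
print(to_double_and_back(2**53 + 1))

# The literal from the question, and how far the stored value is from it.
big = 92233720368547758071
stored = to_double_and_back(big)
print(stored, stored - big)
```

Above 2**53 the spacing between adjacent doubles exceeds 1, so the stored value is the nearest representable neighbour rather than the integer that was written.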
3
votes
1
answer
132
views
IEEE floating-point rounding in C
I am having trouble understanding a specific IEEE double computation. Take the following C99 program, which runs on a host with IEEE double (8 bytes, 11 bits biased exponent, 52 bits encoded mantissa):...
1
vote
0
answers
74
views
Parsing of floating point numbers with error on truncated precision
I am writing a parser for a LIN Description File (LDF). In an LDF file there may be floats. Currently I have a lexer that produces the following question-relevant tokens:
Number: any character sequence ...
4
votes
1
answer
85
views
How can I bitwise-cast a 32-bit float into an integer without using typed arrays?
My Arithmetic Expression Compiler, if run in a modern browser, can target both FlatAssembler and GNU Assembler. GNU Assembler doesn't support specifying float values in decimal notation, so my ...
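The bit pattern of a 32-bit float can be computed with plain arithmetic, without any byte-level reinterpretation. The following Python sketch illustrates the idea (the same steps could be ported to JavaScript); `float32_bits` is a hypothetical helper, it assumes round-to-nearest-even, and `struct` is used only to cross-check the result.

```python
import math
import struct

def float32_bits(x: float) -> int:
    """Compute the IEEE 754 binary32 bit pattern of x using arithmetic only."""
    if x != x:
        return 0x7FC00000  # one possible quiet NaN pattern
    sign = 0x80000000 if math.copysign(1.0, x) < 0 else 0
    x = abs(x)
    if x == 0.0:
        return sign
    if math.isinf(x):
        return sign | 0x7F800000
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    exponent = e - 1       # x == (2 * m) * 2**exponent with 1 <= 2 * m < 2
    if exponent < -126:
        # Subnormal range: count multiples of the smallest subnormal.
        bits = round(x / 2.0 ** -149)
    else:
        fraction = round((2 * m - 1) * 2**23)
        # Adding lets a rounded-up fraction carry into the exponent field.
        bits = ((exponent + 127) << 23) + fraction
    return sign | min(bits, 0x7F800000)  # overflow becomes infinity

for value in (1.0, -2.5, 0.1, 1e-40):
    reference = struct.unpack("<I", struct.pack("<f", value))[0]
    print(value, hex(float32_bits(value)), hex(reference))
```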
0
votes
0
answers
48
views
How to create a file with IEEE single/double format data values using python?
I have a .dat file that is a binary file with a regular structure. Length of one entry - 20 bytes. Each entry contains:
date/time (8 bytes, IEEE double format 12/30/1899 12:00 am)
humidity value (4 ...
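Writing IEEE single and double values from Python is usually done with the `struct` module. The sketch below is hedged: it assumes little-endian byte order, reads the "12/30/1899" date format as a count of days since that date, and packs only one double and one single because the rest of the 20-byte layout is cut off in the excerpt. The helper names are hypothetical.

```python
import struct
from datetime import datetime

EPOCH = datetime(1899, 12, 30)  # assumed zero point of the date/time field

def days_since_epoch(moment: datetime) -> float:
    """Express a datetime as fractional days since 1899-12-30 00:00."""
    return (moment - EPOCH).total_seconds() / 86400.0

def pack_double(value: float) -> bytes:
    """8-byte IEEE 754 double, little-endian (byte order is an assumption)."""
    return struct.pack("<d", value)

def pack_single(value: float) -> bytes:
    """4-byte IEEE 754 single, little-endian (byte order is an assumption)."""
    return struct.pack("<f", value)

stamp = pack_double(days_since_epoch(datetime(2024, 1, 1, 12, 0)))
reading = pack_single(45.5)
print(len(stamp), len(reading))
```

The resulting `bytes` objects can be written with a file opened in `"wb"` mode; the order and types of the remaining fields would have to follow the actual file specification.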
2
votes
2
answers
128
views
Losing precision when casting float to double, even for values that have precise binary representations
The canonical example used for explaining the binary vs decimal representation of floating points is the value 0.3:
float asFloat = 0.3; // <-- 0.300000012
double asDouble = 0.3; // <-- 0....
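The effect can be reproduced in Python by rounding a double to single precision and widening it again. This is a sketch, not the original program: `to_float32` is a hypothetical helper built on `struct`, standing in for a `float` variable.

```python
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest binary32 value, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

as_float = to_float32(0.3)
as_double = 0.3

# Widening is exact, so the float's own rounding error becomes visible.
print(repr(as_float))
print(as_float == as_double)

# A value that is exactly representable in both formats survives unchanged.
print(to_float32(0.5) == 0.5)
```

The widening conversion itself never changes the value; any difference comes from the earlier rounding to 24 significant bits.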
4
votes
3
answers
339
views
Example of Code with and without strictfp Modifier
I know this question might seem overly familiar to the community, but I swear I've never been able to reproduce the issue related to this question even once throughout my programming journey.
I ...
0
votes
2
answers
78
views
How does floating-point addition work in "np.finfo(np.float64).max + 1"?
How does addition work in floating-point for this case:
In [6]: np.finfo(np.float64).max + 1
Out[6]: 1.7976931348623157e+308
Why is there no overflow raised?
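The behaviour can be examined with plain Python floats, which use the same binary64 format as `np.float64`. This sketch compares the added 1 with the spacing between adjacent doubles at the top of the range; it uses `math.ulp`, available from Python 3.9.

```python
import math
import sys

largest = sys.float_info.max      # same value as np.finfo(np.float64).max
spacing = math.ulp(largest)       # gap between adjacent doubles at this magnitude

# Adding 1 changes the exact sum by far less than half the spacing,
# so the result rounds back to the largest finite double.
print(largest + 1.0 == largest)
print(spacing == 2.0 ** 971)

# Adding a full spacing does exceed the finite range.
print(math.isinf(largest + spacing))
```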