I am getting an unexpected result from the Trunc() function, caused by the limited precision of floating-point numbers, when the computed value lies just below a positive integer:

var
  u1, u2, u3, u4, u5, u6: double;
  result1, result2: integer;

begin
  { initialize the variables of a simple expression whose integer part we want }
  u1 := 4.0;
  u2 := 2.5;
  u3 := 0.05;
  u4 := 0.1;
  { first attempt: the expression itself is the argument of Trunc() }
  result1 := Trunc(((u1 - u2) / 2 - u3) / u4);  // *** unexpected and wrong result = 6
  { second attempt: the intermediate result is stored first, then passed to Trunc() }
  u5 := ((u1 - u2) / 2 - u3) / u4;
  result2 := Trunc(u5);                         // *** right result = 7
  { check: u6 = -4.16333634234434E-16 means that the expression evaluates to 6.99999999... }
  u6 := (((u1 - u2) / 2 - u3) / u4) - 7.0;
end;

How can this issue be properly managed?

Should the variable u5 here be rounded with the function System.Math.RoundTo() with argument ADigit = -15?

  • How to solve this depends on what the actual problem is. You know the answer is 7 so you can hard code it. But I guess that sounds silly, because you want to work with arbitrary values. In which case, I don't actually see that the answer is necessarily wrong, it's just a consequence of how floating point arithmetic works. What is the underlying mathematical problem? Commented Jul 7 at 15:11
  • For me, the unexpected result is the 7. Mechanically, I would have expected u5 to become $401BFFFFFFFFFFFF, too (which would be 1.7499999999999998 * 2^2, or 6.999999999999999), which would correctly truncate to 6 as well. But for some reason Delphi actually makes it $401C000000000000 (which is 1.75 * 2^2, or exactly 7). Are doubles "sanitized" when they're written to a variable? Commented Jul 7 at 21:42
  • @ValerianK. You might be on to something. If you change the variable types to Single, the end result is not exactly 7 when you step through the FPU calculation in the debugger. Commented Jul 7 at 23:22
  • @ValerianK. From what I can tell, ((u1 - u2) / 2 - u3) / u4 is computed at full (extended) precision. So, it's not double that is sanitized, it's extended that rounds to nearest double. Commented Jul 14 at 14:56
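
The explanation in the last comment can be checked with a small console program. This is only a sketch: the program name and the variables AsExtended and AsDouble are made up for illustration, and what it prints depends on the target. On 32-bit Windows, Delphi's Extended is the 80-bit x87 type, whereas on 64-bit Windows Extended is the same as Double, so the two lines may or may not differ.

```pascal
program ExtendedVsDouble;

{$APPTYPE CONSOLE}

var
  u1, u2, u3, u4: Double;
  AsExtended: Extended;  // hypothetical name: keeps whatever precision Extended has on this target
  AsDouble: Double;      // hypothetical name: the assignment rounds the result to the nearest Double

begin
  u1 := 4.0;
  u2 := 2.5;
  u3 := 0.05;
  u4 := 0.1;

  // Same expression, stored at two different precisions.
  AsExtended := ((u1 - u2) / 2 - u3) / u4;
  AsDouble := ((u1 - u2) / 2 - u3) / u4;

  // Trunc() only sees the stored value, so any difference between the two
  // lines comes from the rounding done by the assignment.
  Writeln('Trunc(AsExtended) = ', Trunc(AsExtended));
  Writeln('Trunc(AsDouble)   = ', Trunc(AsDouble));
end.
```

If the two lines differ, the second attempt in the question got 7 because of the rounding on assignment, not because Trunc() behaved differently.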

1 Answer

Floating-point numbers are stored in binary, so many decimal values can only be stored as approximations. For example, a number like 5.15 might be stored as 5.1499999999999. This can lead to unexpected results and is probably what causes the discrepancy here. The best approach in this situation is to store the result of the expression in a separate variable and then apply the Trunc() function to the stored result, as in your second attempt above.

u5 := ((u1 - u2) / 2 - u3) / u4;
result2 := Trunc(u5);
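
Storing the intermediate result appears to work here because the assignment rounds it to exactly 7.0, which, as the comments point out, is not guaranteed for other inputs. A more general option, in the spirit of the RoundTo() idea from the question, is to snap values that lie within a small tolerance of an integer before truncating. The sketch below is one possible way to do that; the helper name TruncNear and the default tolerance of 1E-9 are hypothetical choices, not part of the Delphi RTL, and the tolerance has to be picked to suit the magnitude of the values involved.

```pascal
program TruncNearDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: truncates Value, but first snaps it to the nearest
// integer when it lies within Tolerance of that integer.
function TruncNear(const Value: Double; const Tolerance: Double = 1E-9): Int64;
var
  Nearest: Int64;
begin
  Nearest := Round(Value);
  if Abs(Value - Nearest) <= Tolerance then
    Result := Nearest        // close enough to an integer: treat it as that integer
  else
    Result := Trunc(Value);  // otherwise truncate toward zero as usual
end;

var
  u1, u2, u3, u4: Double;

begin
  u1 := 4.0;
  u2 := 2.5;
  u3 := 0.05;
  u4 := 0.1;

  Writeln(TruncNear(((u1 - u2) / 2 - u3) / u4));  // prints 7
  Writeln(TruncNear(6.5));                        // prints 6: not near an integer
end.
```

An absolute tolerance like this is a deliberate trade-off: values that really are a hair below an integer will also be snapped up, so it only suits cases where such values can be assumed to be rounding noise.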

2 Comments

No, it came from me using a calculator on my phone to calculate and then testing it in Visual Studio to make sure I was correct before I typed it out.
This does not offer anything useful. If we are considering this one equation with one set of inputs, then you may be able to find a particular combination that gives a desired answer. But if that's the limit of your goals, just hard code the answer. What about other inputs? Can you give code that gives some desired answer for all inputs? For a range of different equations?
