Floating-Point Accuracy For Scaling Numbers

Formulas used in computation of special function often contain constant factors that cannot be precisely represented in floating point formats such as the typical “double”. For example, the base-10 log function:

double log10 (double x)
  return log(x)/log(10);

Chances are that you should be using the library version of this function; I am just using this simple function for illustration.

Several things affect how accurate result we will get out of this formula. Most importantly, we need an accurate log function. That is provided by the C library (in glibc, at least). Here, however, I want to talk about the last part, namely the scaling by log(10).

Clearly log(10) is irrational, so it will not have an exact representation as a double. In fact, this constant is just about a worst case for double. Its value is, 2*1.00100…10101|1000001… (where “|” marks the 53-bit cutoff for double), i.e., the value is just a hair above the midpoint between the two nearest representable numbers. If we use that value, we must live with it having a relative error of about 9.4e-17.

But we’re scaling with that number and there are two ways of doing that: dividing by the value directly or multiplying by its inverse 1/log(10). The latter value, when computed with higher accuracy that double allows, has the value (1/4)*1.10111…01110|0011001… which gives us a relative representation error of 2.5e-17, i.e., only about a quarter of the error in the direct case.

double log10 (double x)
  static double l10i = 0.4342944819032518276511289;
  return log(x) * l10i;

In practice this gives log10 several extra correct bits. I noticed this when using test values for Gnu Scientific Library’s complex log10 function for testing Gnumeric’s ditto.

I cannot possibly be the first to look into this, but I don’t recall ever having read about it. I have done the analysis for a set of selected constants and the results are:

For these constants, the direct value is best: pi, EulerGamma, log10(2).

For these constants, the inverse value is best: e, log(2), log(10), sqrt(5), sqrt(pi), sqrt(2pi).

For these constants, it’s a tie: sqrt(2) and sqrt(3). Any integer or half-integer power of two will cause a tie because the mantissa of the power and its inverse will be identical.

Note, that the decision is tied to “double”. For “long double” or “float” the results are going to be different.