44

The sigmoid function is defined as


S(t) = 1 / (1 + e^(-t)) (where ^ is pow)

I found that using the C built-in function exp() to calculate the value of S(t) is slow. Is there any faster algorithm to calculate it?

15 Answers

54

You don't have to use the actual, exact sigmoid function in a neural network algorithm; you can replace it with an approximated version that has similar properties but is faster to compute.

For example, you can use the "fast sigmoid" function

f(x) = x / (1 + abs(x))

Using the first terms of the series expansion for exp(x) won't help too much if the arguments to f(x) are not near zero, and you have the same problem with a series expansion of the sigmoid function if the arguments are "large".

An alternative is to use table lookup. That is, you precalculate the values of the sigmoid function for a given number of data points, and then do fast (linear) interpolation between them if you want.
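For illustration, here is a minimal sketch of the table-lookup approach in C; the table size, input range, and function names are my own choices, not part of the answer:

    #include <math.h>

    #define LUT_SIZE  256
    #define LUT_RANGE 8.0f   /* table covers [-8, 8]; saturate outside */

    static float lut[LUT_SIZE + 1];

    /* Precalculate sigmoid samples once, up front. */
    void sigmoid_lut_init(void)
    {
        for (int i = 0; i <= LUT_SIZE; ++i) {
            float x = -LUT_RANGE + 2.0f * LUT_RANGE * i / LUT_SIZE;
            lut[i] = 1.0f / (1.0f + expf(-x));
        }
    }

    /* Fast linear interpolation between the two nearest samples. */
    float sigmoid_lut(float x)
    {
        if (x <= -LUT_RANGE) return 0.0f;
        if (x >=  LUT_RANGE) return 1.0f;
        float t = (x + LUT_RANGE) * (LUT_SIZE / (2.0f * LUT_RANGE));
        int   i = (int)t;
        float frac = t - i;
        return lut[i] + frac * (lut[i + 1] - lut[i]);
    }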


4 Comments

Shouldn't it be f(x) = 0.5 * (x / (1 + abs(x)) + 1) to approximate the questioner's sigmoid function f(x) = 1 / (1 + exp(-x))?
@Gilfoyle Depends on whether you want the outputs to range from 0 to 1, or from -1 to 1. If you want them to be "signed", Antti's version works fine. If you want them "unsigned" (kind of like a relu but not really), yours does the job. I'm not entirely sure how deep the implications are for neural networks, but oh well :)
@wallabra You sure? Antti's answer for x = -10 for example is negative. As for the function provided by Gilfoyle, I think that works properly.
@eXPRESS What I mean is that Antti's approach has a range which spans both negative and positive values, whereas Gilfoyle's spans only positives. Both are useful in their own scenarios. The original sigmoid goes from 0 to 1, but if this boundary isn't a hard requirement, approximations can go past it, as long as the shape is still sufficiently sigmoidal. For integers between -500 and 500, Gilfoyle's has a mean square error of 0.000117 and Antti's of 0.49; for integers between 0 and 500, Gilfoyle's is still 0.000117, Antti's is 0.00119.
29

It's best to measure on your hardware first. A quick benchmark script shows that on my machine x/(1+|x|) is the fastest, with tanh(x) a close second. The error function erf is pretty fast too.

% gcc -Wall -O2 -lm -o sigmoid-bench{,.c} -std=c99 && ./sigmoid-bench
atan(pi*x/2)*2/pi   24.1 ns
atan(x)             23.0 ns
1/(1+exp(-x))       20.4 ns
1/sqrt(1+x^2)       13.4 ns
erf(sqrt(pi)*x/2)    6.7 ns
tanh(x)              5.5 ns
x/(1+|x|)            5.5 ns

I expect that the results may vary depending on architecture and the compiler used, but erf(x) (since C99), tanh(x) and x/(1.0+fabs(x)) are likely to be the fast performers.
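If you want to reproduce such numbers yourself, here is a minimal sketch of a timing loop in C; the iteration count, input range, and the volatile sink are my own choices to keep the compiler from optimizing the loop away:

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const int N = 10000000;
        volatile float sink = 0.0f;   /* prevents dead-code elimination */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; ++i) {
            float x = (i % 2000 - 1000) * 0.01f;   /* inputs in [-10, 10) */
            sink += x / (1.0f + fabsf(x));
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
                   + (t1.tv_nsec - t0.tv_nsec)) / N;
        printf("x/(1+|x|): %.1f ns/call (sink=%f)\n", ns, sink);
        return 0;
    }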

1 Comment

Also, I believe you meant to say x/sqrt(1+x^2) instead of 1/sqrt(1+x^2).
14

People here are mostly concerned with how fast one function is relative to another, creating micro-benchmarks to see whether f1(x) runs 0.0001 ms faster than f2(x). The big problem is that this is mostly irrelevant, because what matters is how fast your network learns with your activation function while trying to minimize your cost function.

According to current theory, the rectifier function and softplus, f(x) = ln(1 + e^x) (with ^ being pow), allow for faster and more effective training of deep neural architectures on large and complex datasets, compared to the sigmoid function or similar activation functions.
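As a side note, the direct form ln(1 + e^x) overflows in float for large x; a numerically stable way to write it is the following sketch in C (function names are mine):

    #include <math.h>

    float relu(float x)
    {
        return x > 0.0f ? x : 0.0f;
    }

    /* softplus(x) = ln(1 + e^x), rewritten as max(x,0) + log1p(exp(-|x|))
       so that exp() never sees a large positive argument. */
    float softplus(float x)
    {
        return fmaxf(x, 0.0f) + log1pf(expf(-fabsf(x)));
    }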

So I suggest throwing away the micro-optimization and taking a look at which function allows faster learning (also looking at various other cost functions).

1 Comment

Optimizing execution of a trained network (e.g. on a microcontroller without a floating-point unit) and optimizing learning speed (the learning algorithm) are two different problems. If you want to increase the number of neurons on given limited hardware, or reduce energy consumption during execution, you have to optimize beyond computational time/space complexity.
10

To make the NN more flexible, an alpha rate is usually used to change the slope of the graph around 0.

The sigmoid function looks like:

f(x) = 1 / ( 1+exp(-x*alpha))

A nearly equivalent (but faster) function is:

f(x) = 0.5 * (x * alpha / (1 + abs(x*alpha))) + 0.5

You can check the graphs here

When I used the abs-based function, the network became 100+ times faster.
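For comparison, a small sketch of both variants in C (function names are mine):

    #include <math.h>

    /* Exact sigmoid with slope parameter alpha. */
    float sigmoid_exact(float x, float alpha)
    {
        return 1.0f / (1.0f + expf(-x * alpha));
    }

    /* The abs-based approximation from above; no exp() call. */
    float sigmoid_fast(float x, float alpha)
    {
        float ax = x * alpha;
        return 0.5f * (ax / (1.0f + fabsf(ax))) + 0.5f;
    }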

3 Comments

Where's the first round bracket supposed to close in the second equation?
Fixed, see inline.
2nd is an excellent option, flexible and fast: only 1 div, 3 muls, 1 add and an abs which is mostly hardware accelerated these days i.e. does not incur the cost of a conditional. P.S. your link is broken.
9

This answer probably isn't relevant for most cases, but just wanted to throw out there that for CUDA computing I've found x/sqrt(1+x^2) to be the fastest function by far.

For example, done with single precision float intrinsics:

__device__ void fooCudaKernel(/* some arguments */) {
    float foo, sigmoid;
    // some code defining foo
    // foo / sqrt(1 + foo*foo), via fused multiply-add, reciprocal
    // square root, and round-toward-zero multiply intrinsics
    sigmoid = __fmul_rz(rsqrtf(__fmaf_rz(foo, foo, 1)), foo);
}

3 Comments

Good. Though this would only apply if you are calculating neurons as a fully connected matrix, not as a vector for a single row / sparse matrix.
Square root has never been fast, because its implementation is highly iterative.
This could be indeed one of the better options, with the infamous reciprocal square root approximation in mind.
6

You might also use a rough version of the sigmoid (its difference from the original is not greater than 0.2%):

    #include <cmath>

    inline float RoughSigmoid(float value)
    {
        float x = std::abs(value);
        float x2 = x*x;
        float e = 1.0f + x + x2*0.555f + x2*x2*0.143f;
        return 1.0f / (1.0f + (value > 0 ? 1.0f / e : e));
    }

    void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
    {
        float s = slope[0];
        for (size_t i = 0; i < size; ++i)
            dst[i] = RoughSigmoid(src[i] * s);
    }

Optimization of the RoughSigmoid function using SSE:

    #include <xmmintrin.h>

    void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
    {
        size_t alignedSize =  size/4*4;
        __m128 _slope = _mm_set1_ps(*slope);
        __m128 _0 = _mm_set1_ps(-0.0f);
        __m128 _1 = _mm_set1_ps(1.0f);
        __m128 _0555 = _mm_set1_ps(0.555f);
        __m128 _0143 = _mm_set1_ps(0.143f);
        size_t i = 0;
        for (; i < alignedSize; i += 4)
        {
            __m128 _src = _mm_loadu_ps(src + i);
            __m128 x = _mm_andnot_ps(_0, _mm_mul_ps(_src, _slope));
            __m128 x2 = _mm_mul_ps(x, x);
            __m128 x4 = _mm_mul_ps(x2, x2);
            __m128 series = _mm_add_ps(_mm_add_ps(_1, x), _mm_add_ps(_mm_mul_ps(x2, _0555), _mm_mul_ps(x4, _0143)));
            __m128 mask = _mm_cmpgt_ps(_src, _0);
            __m128 exp = _mm_or_ps(_mm_and_ps(_mm_rcp_ps(series), mask), _mm_andnot_ps(mask, series));
            __m128 sigmoid = _mm_rcp_ps(_mm_add_ps(_1, exp));
            _mm_storeu_ps(dst + i, sigmoid);
        }
        for (; i < size; ++i)
            dst[i] = RoughSigmoid(src[i] * slope[0]);
    }

Optimization of the RoughSigmoid function using AVX:

    #include <immintrin.h>

    void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
    {
        size_t alignedSize = size/8*8;
        __m256 _slope = _mm256_set1_ps(*slope);
        __m256 _0 = _mm256_set1_ps(-0.0f);
        __m256 _1 = _mm256_set1_ps(1.0f);
        __m256 _0555 = _mm256_set1_ps(0.555f);
        __m256 _0143 = _mm256_set1_ps(0.143f);
        size_t i = 0;
        for (; i < alignedSize; i += 8)
        {
            __m256 _src = _mm256_loadu_ps(src + i);
            __m256 x = _mm256_andnot_ps(_0, _mm256_mul_ps(_src, _slope));
            __m256 x2 = _mm256_mul_ps(x, x);
            __m256 x4 = _mm256_mul_ps(x2, x2);
            __m256 series = _mm256_add_ps(_mm256_add_ps(_1, x), _mm256_add_ps(_mm256_mul_ps(x2, _0555), _mm256_mul_ps(x4, _0143)));
            __m256 mask = _mm256_cmp_ps(_src, _0, _CMP_GT_OS);
            __m256 exp = _mm256_or_ps(_mm256_and_ps(_mm256_rcp_ps(series), mask), _mm256_andnot_ps(mask, series));
            __m256 sigmoid = _mm256_rcp_ps(_mm256_add_ps(_1, exp));
            _mm256_storeu_ps(dst + i, sigmoid);
        }
        for (; i < size; ++i)
            dst[i] = RoughSigmoid(src[i] * slope[0]);
    }

1 Comment

What is slope here? And what are typical inputs to the function?
6

Code is based on a C# version previously posted by @jenkas, with minor modifications.

The following C++ code provides excellent precision and still outperforms low-precision approximations, by virtue of the fact that it allows compilers to auto-vectorize the compiled code onto SIMD instructions when used in simple loops.

GCC will compile code to SIMD (Arm Neon, or Intel AVX) instructions that perform four sigmoid (or tanh) computations in parallel. Auto-vectorization yields performance that is comparable to even very low-precision optimizations while maintaining essentially full precision. Microsoft and Intel compilers also perform auto-vectorization.

A brief discussion of auto-vectorization, compiler optimizations, and practices that produce optimal performance is provided near the end of this post.

The following functions provide a maximum error of +/- 6.55651e-07 over full range as compared to 1/(1+exp(-v)).

// Returns float approximation of 1/(1+exp(-v))
inline float fast_sigmoid(float v)
{
    constexpr float c1 = 0.03138777F;
    constexpr float c2 = 0.276281267F;
    constexpr float c_log2f = 1.442695022F;
    v *= c_log2f*0.5;
    int intPart = (int)v;
    float x = (v - intPart);
    float xx = x * x;
    float v1 = c_log2f + c2 * xx;
    float v2 = x + xx * c1 * x;
    float v3 = (v2 + v1);
    *((int*)&v3) += intPart << 24;
    float v4 = v2 - v1;
    float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
    return res;
}

// Returns float approximation tanh(v)
inline float fast_tanh(float v)
{
    const float c1 = 0.03138777F;
    const float c2 = 0.276281267F;
    const float c_log2f = 1.442695022F;
    v *= c_log2f;
    int intPart = (int)v;
    float x = (v - intPart);
    float xx = x * x;
    float v1 = c_log2f + c2 * xx;
    float v2 = x + xx * c1 * x;
    float v3 = (v2 + v1);
    *((int*)&v3) += intPart << 24;
    float v4 = v2 - v1;
    float res = (v3+v4) / (v3 - v4); 
    return res;
}

Benchmark results on Raspberry PI 4 (AARCH64):

-- Sigmoid benchmark --------
fast_sigmoid(x)     5.63 ns
fast_tanh(x)        5.89 ns
Vectorized fast_sigmoid(out,in,count) using Neon intrinsics
                    5.79 ns
atan(pi/2 * x)/(pi/2)  27.29 ns
atan(x)            24.13 ns
1/(1+exp(-x))      14.92 ns
1/sqrt(1+x^2)       4.26 ns
erf(sqrt(pi)/2 * x) 20.62 ns
tanh(x)            20.64 ns
x/(1+|x|)           8.93 ns

x (measures loop overhead)   1.62 ns
x*x (for reference)   1.62 ns
1/(1+x) (for reference)   2.64 ns

Raspberry Pi 4, aarch64, Arm Cortex-A72. GCC 10.2.1

In the benchmark, GCC vectorizes the fast_sigmoid call into ARM Neon instructions allowing four values to be calculated in parallel.

For optimal performance, you should ensure that input vectors are aligned on 64-byte boundaries. AVX and Neon instructions both allow for unaligned access, but do so with a mild performance penalty.

In addition, you should inform the compiler that input vectors do not alias, using the restrict qualifier. restrict is defined in the C99 standard but is not standard C++. Fortunately, all major C++ compilers (Intel, Microsoft, GCC, Clang) implement it as the __restrict (or __restrict__) extension. Without alias guarantees, compilers will generate a small code preamble that tests for aliasing at runtime, and execute a slow code path if aliasing is detected.

To enable vectorization, GCC requires either the -ftree-vectorize option, or -O3 (which includes -ftree-vectorize).

Loops are vectorized as long as there are no operations that prevent vectorization. A call to a math library function (exp, sin, cos, etc.) will prevent loop vectorization, as will if statements within the loop. However, loop bodies can be fairly substantial. For example, in my LSTM implementation, one of the loops contains operations on four separate vector components (more operations in the loop provide more opportunity for interleaved instruction scheduling).

The __restrict qualifier in the following sample informs the compiler that no part of the input and output vectors overlaps, allowing the compiler to omit the aliasing check:

void vec_sigmoid(
   int length,
   float* __restrict output,
   const float* __restrict input,
   const float* __restrict bias)
{
   for (int i = 0; i < length; ++i)
   {
       output[i] = fast_sigmoid(input[i]) + bias[i];
   }
}

Code is a C++ port of @jenkas' C# code posted earlier, adjusted to return 1/(1+exp(-x)) instead of 1/(1+exp(-2*x)) which is what the original code calculates.

1 Comment

Now I know that the function I wrote a few years ago is not so bad. Thank you 😁
2

You can use a simple but effective method using two formulas:

if x < 0 then f(x) = 0.5 / (1 + x^2)
if x >= 0 then f(x) = 1 - 0.5 / (1 + x^2)

This will look like this:

Two graphs for a sigmoid {Blue: (0.5/(1+(x^2))), Yellow: (-0.5/(1+(x^2)))+1}
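A sketch of the two branches in C; note that both halves share the same 0.5/(1+x^2) term (the function name is mine):

    float sigmoid_piecewise(float x)
    {
        float y = 0.5f / (1.0f + x * x);
        return x < 0.0f ? y : 1.0f - y;
    }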

Comments

2

Try this .NET Core 5+ implementation

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe float FastSigmoid(float v)
    {
        const float c1 = 0.03138777F;
        const float c2 = 0.276281267F;
        const float c_log2f = 1.442695022F;
        v *= c_log2f;
        int intPart = (int)v;
        float x = (v - intPart);
        float xx = x * x;
        float v1 = c_log2f + c2 * xx;
        float v2 = x + xx * c1 * x;
        float v3 = (v2 + v1);
        *((int*)&v3) += intPart << 24;
        float v4 = v2 - v1;
        float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
        return res;
    }

3 Comments

The tanh approximation has excellent precision; but I can't figure out what function the unmodified FastSigmoid is supposed to be approximating. Whatever it is, it's NOT 1/(1+exp(-x)). Any clues to the provenance of this code fragment would also be much appreciated. (I'm looking for a fast 1/(1+exp(-x)) approximation for use in a SIMD-accelerated LSTM implementation.)
@RobinDavies I think it is 1/(1+exp(-2x))
C++ port, precision and performance analysis of this code posted later in the thread. This is an excellent choice.
1

Using Eureqa to search for approximations to sigmoid I found 1/(1 + 0.3678749025^x) approximates it. It's pretty close, just gets rid of one operation with the negation of x.
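In C that is just the following sketch (the constant is close to 1/e ≈ 0.3678794, so this is nearly the exact sigmoid rewritten with pow; the function name is mine):

    #include <math.h>

    float sigmoid_pow(float x)
    {
        return 1.0f / (1.0f + powf(0.3678749025f, x));
    }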

Some of the other functions shown here are interesting, but is the power operation really that slow? I tested it, and it actually ran faster than addition, though that could just be a fluke. If so, it should be just as fast as, or faster than, all the others.

EDIT: 0.5 + 0.5*tanh(0.5*x) also works, as does the less accurate 0.5 + 0.5*tanh(n). And you could just drop the constants if you don't care about keeping the output in the range [0,1] like the sigmoid. But this assumes that tanh is faster.
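Note that 0.5 + 0.5*tanh(0.5*x) is not merely close to the sigmoid; it is algebraically identical to 1/(1+exp(-x)), so a sketch like the following (in C; the name is mine) is exact up to the accuracy of tanhf itself:

    #include <math.h>

    float sigmoid_via_tanh(float x)
    {
        return 0.5f + 0.5f * tanhf(0.5f * x);
    }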

2 Comments

The power term is generally slow to execute, yes; hence this approximation doesn't avoid that aspect of the original question, since pow() will often be implemented in CPU circuitry as an adjustment to an exp() evaluation.
Your performance upper bound depends on the value of x (number of multiplies)... not good.
1

The tanh function may be optimized in some languages, making it faster than a custom-defined x/(1+abs(x)); such is the case in Julia.

Comments

0

You can also use this:

    y = x / (2 * ((x<0.0)*-x + (x>=0.0)*x) + 2) + 0.5;
    y' = y*(1-y);

This acts like a sigmoid when you take y' = y*(1-y) as its derivative, which is, let's say, more rounded; its actual derivative, 1/(2*(1 + abs(x))^2), behaves more like the fast sigmoid's.
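Since (x<0.0)*-x + (x>=0.0)*x is just abs(x), the formula reduces to 0.5*x/(1 + |x|) + 0.5; a branch-free sketch in C (the name is mine):

    #include <math.h>

    float sigmoid_approx(float x)
    {
        return x / (2.0f * fabsf(x) + 2.0f) + 0.5f;
    }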

1 Comment

Conditionals are slow.
0

This is the cubic (3rd-order) Hermite interpolation function used in computer graphics, such as GLSL smoothstep(). It is fast to compute, symmetric, and passes through 0.0 and 1.0:

f(t) = 3t^2 - 2t^3 or expanded, f(t) = 3 * t * t - 2 * t * t * t

One can get steeper curves by incrementing the values given there to the quartic, quintic, sextic orders and so on -- e.g.

f(t) = 5t^4 - 4t^5

...but while these curves may be useful to some, and pass through 0 and 1, they are asymmetric.

A parameterised version of the OP's reference sigmoid, which can provide steeper, symmetric slopes that one can adjust via parameter a being greater than 1, is:

f(ax) = 1 / (1 + e^(-ax))

Ah, but the use of exp()... that's the very problem this question tries to solve.

Adjustable AND better from a cost perspective, we have the Normalized Tunable Sigmoid Function:

f(x, k) = (x − k * x) / (k − 2k * abs(x) + 1)
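A direct C transcription of that formula as a sketch (by my reading, x is expected in [-1, 1] and k in (-1, 1); the function name is mine):

    #include <math.h>

    /* Normalized tunable sigmoid: maps [-1, 1] onto [-1, 1]; k toward -1
       gives a steeper curve around 0, k toward 1 a flatter one. */
    float tunable_sigmoid(float x, float k)
    {
        return (x - k * x) / (k - 2.0f * k * fabsf(x) + 1.0f);
    }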

Comments

0

My shot at this is based on y = exp(-abs(x * alpha)) (blue curve) and Y = copysign(1 - y, x) for the red curve (the sign of x is copied to the first argument).

One can select alpha to tune either for best accuracy or for best speed. For the exp, one can use any of the fastExp implementations, some of which use just zero or one multiplications and reinterpret-cast floating point values to integers or vice versa.
Selecting e.g. alpha = 1/log2(e) converts exp(x) to 2^x, for which there are very fast algorithms, both well known and lesser known.

Based on the lesser known algorithms (https://stackoverflow.com/a/78509807/1716339), a fast algorithm without divisions or square roots would be the following; the wrapper function, includes, and variable declarations are my additions to make the fragment compile as C++20:

#include <bit>
#include <cmath>
#include <cstdint>

// Returns the red curve Y = copysign(1 - 2^(-|x|), x), in [-1, 1].
float fast_sigmoid_like(float x)
{
    float y = 639.f - std::fabs(x);
    // Bit trick: shifting the bits of (639 - |x|) left by 9
    // reinterprets them as an approximation of 2^(-|x|).
    y = std::bit_cast<float>(std::bit_cast<uint32_t>(y) << 9);
    uint32_t sign = std::bit_cast<uint32_t>(x) & 0x80000000u;
    return std::bit_cast<float>(std::bit_cast<uint32_t>(1.0f - y) | sign);
}

The sign copying can also be done with a single blend, for a total of 4 arithmetic/logical operations.

Possibly the range of x needs to be clamped (with one more arithmetic operation) to

y = 639.f - std::fmin(std::fabs(x), 87.0f);

2 Comments

What is Y = (y.*( ... or more specifically, what is .*? Is that a typo, or are you using some particular language that others here don't know? Perhaps worth clarifying.
Right, .* is from Matlab/Octave; it allows a vector to be multiplied element-wise with another vector.
-3

I don't think you can do better than the built-in exp(), but if you want another approach, you can use a series expansion. WolframAlpha can compute it for you.

1 Comment

If you're wondering about the downvotes, the approximation does not seem to converge well over an acceptable range.
