44

The sigmoid function is defined as


S(t) = 1 / (1 + e^(-t)) (where ^ is pow)

I found that using the C built-in function exp() to calculate the value of S(t) is slow. Is there any faster algorithm to calculate it?

15 Answers

54

You don't have to use the actual, exact sigmoid function in a neural network algorithm; you can replace it with an approximated version that has similar properties but is faster to compute.

For example, you can use the "fast sigmoid" function

f(x) = x / (1 + abs(x))

Using the first terms of the series expansion for exp(x) won't help too much if the arguments to f(x) are not near zero, and you have the same problem with a series expansion of the sigmoid function if the arguments are "large".

An alternative is to use table lookup. That is, you precalculate the values of the sigmoid function for a given number of data points, and then do fast (linear) interpolation between them if you want.
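For illustration, here is a minimal sketch of the table-lookup approach in C; the table size, input range, and function names are my own choices, not part of the answer:

    #include <math.h>

    #define LUT_SIZE  256
    #define LUT_RANGE 8.0f   /* table covers [-8, 8]; saturate outside */

    static float lut[LUT_SIZE + 1];

    /* Precalculate sigmoid samples once, up front. */
    void sigmoid_lut_init(void)
    {
        for (int i = 0; i <= LUT_SIZE; ++i) {
            float x = -LUT_RANGE + 2.0f * LUT_RANGE * i / LUT_SIZE;
            lut[i] = 1.0f / (1.0f + expf(-x));
        }
    }

    /* Fast linear interpolation between the two nearest samples. */
    float sigmoid_lut(float x)
    {
        if (x <= -LUT_RANGE) return 0.0f;
        if (x >=  LUT_RANGE) return 1.0f;
        float t = (x + LUT_RANGE) * (LUT_SIZE / (2.0f * LUT_RANGE));
        int   i = (int)t;
        float frac = t - i;
        return lut[i] + frac * (lut[i + 1] - lut[i]);
    }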


4 Comments

Shouldn't it be f(x) = 0.5 * (x / (1 + abs(x)) + 1) to approximate the questioner's sigmoid function f(x) = 1 / (1 + exp(-x))?
@Gilfoyle Depends on whether you want the outputs to range from 0 to 1, or from -1 to 1. If you want them to be "signed", Antti's version works fine. If you want them "unsigned" (kind of like a relu but not really), yours does the job. I'm not entirely sure how deep the implications are for neural networks, but oh well :)
@wallabra You sure? Antti's answer for x = -10 for example is negative. As for the function provided by Gilfoyle, I think that works properly.
@eXPRESS What I mean is that Antti's approach has a range which spans both negative and positive values, whereas Gilfoyle's spans only positives. Both are useful in their own scenarios. The original sigmoid goes from 0 to 1, but if this boundary isn't a hard requirement, approximations can go past it, as long as the shape is still sufficiently sigmoidal. For integers between -500 and 500, Gilfoyle's has a mean square error of 0.000117 and Antti's of 0.49; for integers between 0 and 500, Gilfoyle's is still 0.000117, Antti's is 0.00119.
29

It's best to measure on your hardware first. A quick benchmark script shows that on my machine x/(1+|x|) is the fastest, with tanh(x) a close second. The error function erf is pretty fast too.

% gcc -Wall -O2 -lm -o sigmoid-bench{,.c} -std=c99 && ./sigmoid-bench
atan(pi*x/2)*2/pi   24.1 ns
atan(x)             23.0 ns
1/(1+exp(-x))       20.4 ns
1/sqrt(1+x^2)       13.4 ns
erf(sqrt(pi)*x/2)    6.7 ns
tanh(x)              5.5 ns
x/(1+|x|)            5.5 ns

I expect that the results may vary depending on architecture and the compiler used, but erf(x) (since C99), tanh(x) and x/(1.0+fabs(x)) are likely to be the fast performers.
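If you want to reproduce such numbers yourself, here is a minimal sketch of a timing loop in C; the iteration count, input range, and the volatile sink are my own choices to keep the compiler from optimizing the loop away:

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const int N = 10000000;
        volatile float sink = 0.0f;   /* prevents dead-code elimination */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; ++i) {
            float x = (i % 2000 - 1000) * 0.01f;   /* inputs in [-10, 10) */
            sink += x / (1.0f + fabsf(x));
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9
                   + (t1.tv_nsec - t0.tv_nsec)) / N;
        printf("x/(1+|x|): %.1f ns/call (sink=%f)\n", ns, sink);
        return 0;
    }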

1 Comment

Also, I believe you meant to say x/sqrt(1+x^2) instead of 1/sqrt(1+x^2).
14

People here are mostly concerned with how fast one function is relative to another, creating micro-benchmarks to see whether f1(x) runs 0.0001 ms faster than f2(x). The big problem is that this is mostly irrelevant, because what matters is how fast your network learns with your activation function while trying to minimize your cost function.

According to current theory, the rectifier function and softplus, f(x) = ln(1 + e^x) (with ^ being pow), allow for faster and more effective training of deep neural architectures on large and complex datasets, compared to the sigmoid function or similar activation functions.
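As a side note, the direct form ln(1 + e^x) overflows in float for large x; a numerically stable way to write it is the following sketch in C (function names are mine):

    #include <math.h>

    float relu(float x)
    {
        return x > 0.0f ? x : 0.0f;
    }

    /* softplus(x) = ln(1 + e^x), rewritten as max(x,0) + log1p(exp(-|x|))
       so that exp() never sees a large positive argument. */
    float softplus(float x)
    {
        return fmaxf(x, 0.0f) + log1pf(expf(-fabsf(x)));
    }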

So I suggest throwing away the micro-optimization and taking a look at which function allows faster learning (also looking at various other cost functions).

1 Comment

Optimizing execution of a trained network (e.g. on a microcontroller without a floating-point unit) and optimizing learning speed (the learning algorithm) are two different problems. If you want to increase the number of neurons on given limited hardware, or reduce energy consumption during execution, you have to optimize beyond computational time/space complexity.
10

To make the NN more flexible, an alpha rate is usually used to change the slope of the graph around 0.

The sigmoid function looks like:

f(x) = 1 / ( 1+exp(-x*alpha))

A nearly equivalent (but faster) function is:

f(x) = 0.5 * (x * alpha / (1 + abs(x*alpha))) + 0.5

You can check the graphs here

When I used the abs-based function, the network became 100+ times faster.
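For comparison, a small sketch of both variants in C (function names are mine):

    #include <math.h>

    /* Exact sigmoid with slope parameter alpha. */
    float sigmoid_exact(float x, float alpha)
    {
        return 1.0f / (1.0f + expf(-x * alpha));
    }

    /* The abs-based approximation from above; no exp() call. */
    float sigmoid_fast(float x, float alpha)
    {
        float ax = x * alpha;
        return 0.5f * (ax / (1.0f + fabsf(ax))) + 0.5f;
    }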

3 Comments

Where's the first round bracket supposed to close in the second equation?
Fixed, see inline.
2nd is an excellent option, flexible and fast: only 1 div, 3 muls, 1 add and an abs which is mostly hardware accelerated these days i.e. does not incur the cost of a conditional. P.S. your link is broken.
9

This answer probably isn't relevant for most cases, but just wanted to throw out there that for CUDA computing I've found x/sqrt(1+x^2) to be the fastest function by far.

For example, done with single precision float intrinsics:

__device__ void fooCudaKernel(/* some arguments */) {
    float foo, sigmoid;
    // some code defining foo
    // foo / sqrt(1 + foo*foo), via fused multiply-add, reciprocal
    // square root, and round-toward-zero multiply intrinsics
    sigmoid = __fmul_rz(rsqrtf(__fmaf_rz(foo, foo, 1)), foo);
}

3 Comments

Good. Though this would only apply if you are calculating neurons as a fully connected matrix, not as a vector for a single row / sparse matrix.
Square root has never been fast, because its implementation is highly iterative.
This could be indeed one of the better options, with the infamous reciprocal square root approximation in mind.
6

You might also use a rough version of the sigmoid (its difference from the original is not greater than 0.2%):

    #include <cmath>

    inline float RoughSigmoid(float value)
    {
        float x = std::abs(value);
        float x2 = x*x;
        float e = 1.0f + x + x2*0.555f + x2*x2*0.143f;
        return 1.0f / (1.0f + (value > 0 ? 1.0f / e : e));
    }

    void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
    {
        float s = slope[0];
        for (size_t i = 0; i < size; ++i)
            dst[i] = RoughSigmoid(src[i] * s);
    }

Optimization of the RoughSigmoid function using SSE:

    #include <xmmintrin.h>

    void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
    {
        size_t alignedSize =  size/4*4;
        __m128 _slope = _mm_set1_ps(*slope);
        __m128 _0 = _mm_set1_ps(-0.0f);
        __m128 _1 = _mm_set1_ps(1.0f);
        __m128 _0555 = _mm_set1_ps(0.555f);
        __m128 _0143 = _mm_set1_ps(0.143f);
        size_t i = 0;
        for (; i < alignedSize; i += 4)
        {
            __m128 _src = _mm_loadu_ps(src + i);
            __m128 x = _mm_andnot_ps(_0, _mm_mul_ps(_src, _slope));
            __m128 x2 = _mm_mul_ps(x, x);
            __m128 x4 = _mm_mul_ps(x2, x2);
            __m128 series = _mm_add_ps(_mm_add_ps(_1, x), _mm_add_ps(_mm_mul_ps(x2, _0555), _mm_mul_ps(x4, _0143)));
            __m128 mask = _mm_cmpgt_ps(_src, _0);
            __m128 exp = _mm_or_ps(_mm_and_ps(_mm_rcp_ps(series), mask), _mm_andnot_ps(mask, series));
            __m128 sigmoid = _mm_rcp_ps(_mm_add_ps(_1, exp));
            _mm_storeu_ps(dst + i, sigmoid);
        }
        for (; i < size; ++i)
            dst[i] = RoughSigmoid(src[i] * slope[0]);
    }

Optimization of the RoughSigmoid function using AVX:

    #include <immintrin.h>

    void RoughSigmoid(const float * src, size_t size, const float * slope, float * dst)
    {
        size_t alignedSize = size/8*8;
        __m256 _slope = _mm256_set1_ps(*slope);
        __m256 _0 = _mm256_set1_ps(-0.0f);
        __m256 _1 = _mm256_set1_ps(1.0f);
        __m256 _0555 = _mm256_set1_ps(0.555f);
        __m256 _0143 = _mm256_set1_ps(0.143f);
        size_t i = 0;
        for (; i < alignedSize; i += 8)
        {
            __m256 _src = _mm256_loadu_ps(src + i);
            __m256 x = _mm256_andnot_ps(_0, _mm256_mul_ps(_src, _slope));
            __m256 x2 = _mm256_mul_ps(x, x);
            __m256 x4 = _mm256_mul_ps(x2, x2);
            __m256 series = _mm256_add_ps(_mm256_add_ps(_1, x), _mm256_add_ps(_mm256_mul_ps(x2, _0555), _mm256_mul_ps(x4, _0143)));
            __m256 mask = _mm256_cmp_ps(_src, _0, _CMP_GT_OS);
            __m256 exp = _mm256_or_ps(_mm256_and_ps(_mm256_rcp_ps(series), mask), _mm256_andnot_ps(mask, series));
            __m256 sigmoid = _mm256_rcp_ps(_mm256_add_ps(_1, exp));
            _mm256_storeu_ps(dst + i, sigmoid);
        }
        for (; i < size; ++i)
            dst[i] = RoughSigmoid(src[i] * slope[0]);
    }

1 Comment

What is slope here? And what are typical inputs to the function?
6

Code is based on a C# version previously posted by @jenkas, with minor modifications.

The following C++ code provides excellent precision and still outperforms low-precision approximations, by virtue of the fact that it allows compilers to auto-vectorize the compiled code onto SIMD instructions when used in simple loops.

GCC will compile code to SIMD (Arm Neon, or Intel AVX) instructions that perform four sigmoid (or tanh) computations in parallel. Auto-vectorization yields performance that is comparable to even very low-precision optimizations while maintaining essentially full precision. Microsoft and Intel compilers also perform auto-vectorization.

A brief discussion of auto-vectorization, compiler optimizations, and practices that produce optimal performance is provided near the end of this post.

The following functions provide a maximum error of +/- 6.55651e-07 over full range as compared to 1/(1+exp(-v)).

// Returns float approximation of 1/(1+exp(-v))
inline float fast_sigmoid(float v)
{
    constexpr float c1 = 0.03138777F;
    constexpr float c2 = 0.276281267F;
    constexpr float c_log2f = 1.442695022F;
    v *= c_log2f*0.5;
    int intPart = (int)v;
    float x = (v - intPart);
    float xx = x * x;
    float v1 = c_log2f + c2 * xx;
    float v2 = x + xx * c1 * x;
    float v3 = (v2 + v1);
    *((int*)&v3) += intPart << 24;
    float v4 = v2 - v1;
    float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
    return res;
}

// Returns float approximation tanh(v)
inline float fast_tanh(float v)
{
    const float c1 = 0.03138777F;
    const float c2 = 0.276281267F;
    const float c_log2f = 1.442695022F;
    v *= c_log2f;
    int intPart = (int)v;
    float x = (v - intPart);
    float xx = x * x;
    float v1 = c_log2f + c2 * xx;
    float v2 = x + xx * c1 * x;
    float v3 = (v2 + v1);
    *((int*)&v3) += intPart << 24;
    float v4 = v2 - v1;
    float res = (v3+v4) / (v3 - v4); 
    return res;
}

Benchmark results on Raspberry PI 4 (AARCH64):

-- Sigmoid benchmark --------
fast_sigmoid(x)     5.63 ns
fast_tanh(x)        5.89 ns
Vectorized fast_sigmoid(out,in,count) using Neon intrinsics
                    5.79 ns
atan(pi/2 * x)/(pi/2)  27.29 ns
atan(x)            24.13 ns
1/(1+exp(-x))      14.92 ns
1/sqrt(1+x^2)       4.26 ns
erf(sqrt(pi)/2 * x) 20.62 ns
tanh(x)            20.64 ns
x/(1+|x|)           8.93 ns

x (measures loop overhead)   1.62 ns
x*x (for reference)   1.62 ns
1/(1+x) (for reference)   2.64 ns

Raspberry Pi 4, aarch64, Arm Cortex-A72. GCC 10.2.1

In the benchmark, GCC vectorizes the fast_sigmoid call into ARM Neon instructions allowing four values to be calculated in parallel.

For optimal performance, you should ensure that input vectors are aligned on 64-byte boundaries. AVX and Neon instructions both allow for unaligned access, but do so with a mild performance penalty.

In addition, you should inform the compiler that input vectors do not alias, using the restrict qualifier. restrict is defined in the C99 standard but is not standard C++. Fortunately, all major C++ compilers (Intel, Microsoft, GCC, Clang) implement it as the __restrict (or __restrict__) extension. Without alias guarantees, compilers will generate a small code preamble that tests for aliasing at runtime, and execute a slow code path if aliasing is detected.

To enable vectorization, GCC requires either the -ftree-vectorize option, or -O3 (which includes -ftree-vectorize).

Loops are vectorized as long as there are no operations that prevent vectorization. A call to a math library function (exp, sin, cos, etc.) will prevent loop vectorization, as will if statements within the loop. However, loop bodies can be fairly substantial. For example, in my LSTM implementation, one of the loops contains operations on four separate vector components (more operations in the loop provide more opportunity for interleaved instruction scheduling).

The __restrict qualifier in the following sample informs the compiler that no part of the input and output vectors overlaps, allowing the compiler to omit the aliasing check:

void vec_sigmoid(
   int length,
   float* __restrict output,
   const float* __restrict input,
   const float* __restrict bias)
{
   for (int i = 0; i < length; ++i)
   {
       output[i] = fast_sigmoid(input[i]) + bias[i];
   }
}

Code is a C++ port of @jenkas' C# code posted earlier, adjusted to return 1/(1+exp(-x)) instead of 1/(1+exp(-2*x)) which is what the original code calculates.

1 Comment

Now I know that the function I wrote a few years ago is not so bad. Thank you 😁
2

You can use a simple but effective method using two formulas:

if x < 0 then f(x) = 0.5 / (1 + x^2)
if x >= 0 then f(x) = 1 - 0.5 / (1 + x^2)

This will look like this:

Two graphs for a sigmoid {Blue: (0.5/(1+(x^2))), Yellow: (-0.5/(1+(x^2)))+1}
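A sketch of the two branches in C; note that both halves share the same 0.5/(1+x^2) term (the function name is mine):

    float sigmoid_piecewise(float x)
    {
        float y = 0.5f / (1.0f + x * x);
        return x < 0.0f ? y : 1.0f - y;
    }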

Comments

2

Try this .NET Core 5+ implementation

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe float FastSigmoid(float v)
    {
        const float c1 = 0.03138777F;
        const float c2 = 0.276281267F;
        const float c_log2f = 1.442695022F;
        v *= c_log2f;
        int intPart = (int)v;
        float x = (v - intPart);
        float xx = x * x;
        float v1 = c_log2f + c2 * xx;
        float v2 = x + xx * c1 * x;
        float v3 = (v2 + v1);
        *((int*)&v3) += intPart << 24;
        float v4 = v2 - v1;
        float res = v3 / (v3 - v4); //for tanh change to (v3 + v4)/ (v3 - v4)
        return res;
    }

3 Comments

The tanh approximation has excellent precision; but I can't figure out what function the unmodified FastSigmoid is supposed to be approximating. Whatever it is, it's NOT 1/(1+exp(-x)). Any clues to the provenance of this code fragment would also be much appreciated. (I'm looking for a fast 1/(1+exp(-x)) approximation for use in a SIMD-accelerated LSTM implementation.)
@RobinDavies I think it is 1/(1+exp(-2x))
C++ port, precision and performance analysis of this code posted later in the thread. This is an excellent choice.
1

Using Eureqa to search for approximations to sigmoid I found 1/(1 + 0.3678749025^x) approximates it. It's pretty close, just gets rid of one operation with the negation of x.
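In C that is just the following sketch (the constant is close to 1/e ≈ 0.3678794, so this is nearly the exact sigmoid rewritten with pow; the function name is mine):

    #include <math.h>

    float sigmoid_pow(float x)
    {
        return 1.0f / (1.0f + powf(0.3678749025f, x));
    }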

Some of the other functions shown here are interesting, but is the power operation really that slow? I tested it, and it actually ran faster than addition, though that could just be a fluke. If so, it should be just as fast as, or faster than, all the others.

EDIT: 0.5 + 0.5*tanh(0.5*x) also works, as does the less accurate 0.5 + 0.5*tanh(n). And you could just drop the constants if you don't care about keeping the output in the range [0,1] like the sigmoid. But this assumes that tanh is faster.
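Note that 0.5 + 0.5*tanh(0.5*x) is not merely close to the sigmoid; it is algebraically identical to 1/(1+exp(-x)), so a sketch like the following (in C; the name is mine) is exact up to the accuracy of tanhf itself:

    #include <math.h>

    float sigmoid_via_tanh(float x)
    {
        return 0.5f + 0.5f * tanhf(0.5f * x);
    }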

2 Comments

The power term is generally slow to execute, yes; hence this approximation doesn't avoid that aspect of the original question, since pow() will often be implemented in CPU circuitry as an adjustment to an exp() evaluation.
Your performance upper bound depends on the value of x (number of multiplies)... not good.
1

The tanh function may be optimized in some languages, making it faster than a custom-defined x/(1+abs(x)); such is the case in Julia.

Comments

0

You can also use this:

    y = x / (2 * ((x<0.0)*-x + (x>=0.0)*x) + 2) + 0.5;
    y' = y*(1-y);

This acts like a sigmoid when you take y' = y*(1-y) as its derivative, which is, let's say, more rounded; its actual derivative, 1/(2*(1 + abs(x))^2), behaves more like the fast sigmoid's.
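Since (x<0.0)*-x + (x>=0.0)*x is just abs(x), the formula reduces to 0.5*x/(1 + |x|) + 0.5; a branch-free sketch in C (the name is mine):

    #include <math.h>

    float sigmoid_approx(float x)
    {
        return x / (2.0f * fabsf(x) + 2.0f) + 0.5f;
    }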

1 Comment

Conditionals are slow.
0

This is the cubic (3rd-order) Hermite interpolation function used in computer graphics, such as GLSL smoothstep(). It is fast to compute, symmetric, and passes through 0.0 and 1.0:

f(t) = 3t^2 - 2t^3 or expanded, f(t) = 3 * t * t - 2 * t * t * t

One can get steeper curves by incrementing the values given there to the quartic, quintic, sextic orders and so on -- e.g.

f(t) = 5t^4 - 4t^5

...but while these curves may be useful to some, and pass through 0 and 1, they are asymmetric.

A parameterised version of the OP's reference sigmoid, which can provide steeper, symmetric slopes that one can adjust via parameter a being greater than 1, is:

f(ax) = 1 / (1 + e^(-ax))

Ah, but the use of exp()... that's the very problem this question tries to solve.

Adjustable AND better from a cost perspective, we have the Normalized Tunable Sigmoid Function:

f(x, k) = (x − k * x) / (k − 2k * abs(x) + 1)
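A direct C transcription of that formula as a sketch (by my reading, x is expected in [-1, 1] and k in (-1, 1); the function name is mine):

    #include <math.h>

    /* Normalized tunable sigmoid: maps [-1, 1] onto [-1, 1]; k toward -1
       gives a steeper curve around 0, k toward 1 a flatter one. */
    float tunable_sigmoid(float x, float k)
    {
        return (x - k * x) / (k - 2.0f * k * fabsf(x) + 1.0f);
    }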

Comments

0

My shot at this is based on y = exp(-abs(x * alpha)) (blue curve) and Y = copysign(1 - y, x) for the red curve (the sign of x is copied to the first argument).

One can select alpha to tune either for best accuracy or for best speed. For the exp, one can use any of the fastExp implementations, some of which use just zero or one multiplications and reinterpret-cast floating point values to integers or vice versa.
Selecting e.g. alpha = 1/log2(e) converts exp(x) to 2^x, for which there are very fast algorithms, both well known and lesser known.

Based on the lesser known algorithms (https://stackoverflow.com/a/78509807/1716339), a fast algorithm without divisions or square roots would be the following; the wrapper function, includes, and variable declarations are my additions to make the fragment compile as C++20:

#include <bit>
#include <cmath>
#include <cstdint>

// Returns the red curve Y = copysign(1 - 2^(-|x|), x), in [-1, 1].
float fast_sigmoid_like(float x)
{
    float y = 639.f - std::fabs(x);
    // Bit trick: shifting the bits of (639 - |x|) left by 9
    // reinterprets them as an approximation of 2^(-|x|).
    y = std::bit_cast<float>(std::bit_cast<uint32_t>(y) << 9);
    uint32_t sign = std::bit_cast<uint32_t>(x) & 0x80000000u;
    return std::bit_cast<float>(std::bit_cast<uint32_t>(1.0f - y) | sign);
}

The sign copying can also be done with a single blend, for a total of 4 arithmetic/logical operations.

Possibly the range of x needs to be clamped (with one more arithmetic operation) to

y = 639.f - std::fmin(std::fabs(x), 87.0f);

2 Comments

What is Y = (y.*( ... or more specifically, what is .*? Is that a typo, or are you using some particular language that others here don't know? Perhaps worth clarifying.
Right, .* is from Matlab/Octave; it allows a vector to be multiplied element-wise with another vector.
-3

I don't think you can do better than the built-in exp(), but if you want another approach, you can use a series expansion. WolframAlpha can compute it for you.

1 Comment

If you're wondering about the downvotes, the approximation does not seem to converge well over an acceptable range.
