
This is a simple test

import numpy as np
data = np.array([-1, 0, 1])
print(data.std())

>> 0.816496580928

I don't understand how this result was generated. Obviously:

( ((-1)^2 + 0^2 + 1^2)/(3-1) )^0.5 = 1

and in MATLAB, std([-1,0,1]) gives me 1. Could you help me understand how numpy.std() works?

  • Dividing by N-1 gives the sample variance, but NumPy computes the population variance. Commented Jun 5, 2014 at 18:55
  • Giving this an upvote because the difference between population and sample standard deviation is seldom paid attention to until results fail to match - picking one, and knowing why you're using it, will both help prevent this problem and also force you to think usefully about your problem a bit more. (All said from unpleasant experience.) Commented Jun 5, 2014 at 19:14

3 Answers


The crux of this problem is that you need to divide by N (3), not N-1 (2). As Iarsmans pointed out, numpy will use the population variance, not the sample variance.

So the real answer is sqrt(2/3) which is exactly that: 0.8164965...

If you deliberately want a value other than the default of 0 for the degrees of freedom, pass the keyword argument ddof with a value other than 0:

np.std(data, ddof=1)

... but doing so here would reintroduce your original problem as numpy will divide by N - ddof.
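The N vs. N-1 difference can be checked by hand; a minimal sketch comparing the manual computation against NumPy's result:

```python
import numpy as np

data = np.array([-1, 0, 1])
dev2 = (data - data.mean()) ** 2  # squared deviations: [1, 0, 1]

# Population std (NumPy's default, ddof=0): divide by N
print(np.sqrt(dev2.sum() / len(data)))  # sqrt(2/3) ~= 0.8165
print(np.std(data))                     # identical

# Sample std (MATLAB's default): divide by N - 1
print(np.sqrt(dev2.sum() / (len(data) - 1)))  # 1.0
print(np.std(data, ddof=1))                   # identical
```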


6 Comments

sorry, the 2 is just a typo. I think np.std() is just the universal (population) std. If it is a sample std, it should divide by N-1. Is there a function for the sample std?
@MacSanhe Ah, then that makes more sense how you could make that mistake!
@MacSanhe Edited with details to address your concern.
This appears to be incorrect. The numpy docs indicate it uses an uncorrected sample standard deviation by default, with ddof=0. ddof=1 will enable population variance (makes it less biased towards the sample mean). Or am I missing something?
Let me check to see if things have changed since I wrote this answer.

It is worth reading the help page for the function/method before suggesting it is incorrect. The method does exactly what the docstring says it should: it divides by 3 here because, by default, ddof is zero:

In [3]: numpy.std?

String form: <function std at 0x104222398>
File:        /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/fromnumeric.py
Definition:  numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False)
Docstring:
Compute the standard deviation along the specified axis.

...

ddof : int, optional
    Means Delta Degrees of Freedom.  The divisor used in calculations
    is ``N - ddof``, where ``N`` represents the number of elements.
    By default `ddof` is zero.
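The `N - ddof` divisor described in the docstring can be verified directly against a manual computation (a sketch, not part of the docstring):

```python
import numpy as np

data = np.array([-1, 0, 1])
N = len(data)
ss = ((data - data.mean()) ** 2).sum()  # sum of squared deviations = 2.0

# For each ddof, np.std divides ss by (N - ddof) before the square root
for ddof in (0, 1, 2):
    manual = np.sqrt(ss / (N - ddof))
    assert np.isclose(np.std(data, ddof=ddof), manual)
    print(ddof, manual)
```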



When getting into NumPy from Matlab, you'll probably want to keep the docs for both handy. They're similar but often differ in small but important details. Basically, they calculate the standard deviation differently. I would strongly recommend checking the documentation for anything you use that calculates standard deviation, whether a pocket calculator or a programming language, since the default is not (sorry!) standardized.

Numpy STD: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html

Matlab STD: http://www.mathworks.com/help/matlab/ref/std.html

The NumPy docs for std are a bit opaque, IMHO, especially considering that NumPy docs are generally fairly clear. If you read far enough: "The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population." (In English: the default is the population std dev; set ddof=1 for the sample std dev.)

OTOH, the Matlab docs make clear the difference that's tripping you up:

There are two common textbook definitions for the standard deviation s of a data vector X. [equations omitted] n is the number of elements in the sample. The two forms of the equation differ only in n – 1 versus n in the divisor.

So, by default, MATLAB calculates the sample standard deviation (N-1 in the divisor, so bigger, to compensate for the fact that this is a sample) and NumPy calculates the population standard deviation (N in the divisor). You use the ddof parameter to switch to the sample standard deviation, or any other denominator you want (which goes beyond my statistics knowledge).
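So, to reproduce MATLAB's std([-1,0,1]) == 1 in NumPy, pass ddof=1 (a minimal sketch):

```python
import numpy as np

data = [-1, 0, 1]
print(np.std(data))          # 0.8164965... (population std, NumPy's default)
print(np.std(data, ddof=1))  # 1.0 (sample std, what MATLAB's std returns by default)
```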

Lastly, it doesn't help on this problem, but you'll probably find this helpful at some point. Link

2 Comments

Out of curiosity, when would I ever need to use a value of ddof such that ddof ∉ {0, 1}?
I have no idea, I've only ever used those two. Maybe a question for stats.stackexchange.com
