
I am comparing the means of two groups using an independent two-sample t-test in R. Initially, I had the following samples:

Group A: n = 15, mean = 52.3, sd = 4.8
Group B: n = 15, mean = 48.1, sd = 5.2

The t-test gave me: t = 2.27, p = 0.031

After collecting more data, the updated samples became:

Group A: n = 30, mean = 52.6, sd = 5.0
Group B: n = 30, mean = 49.0, sd = 5.4

Now the t-test result is: t = 1.81, p = 0.075

I expected that adding more observations would make the result more statistically significant, but instead the p-value increased and the result is no longer significant at α = 0.05.

Why does this happen even though the sample size doubled and the difference in means is still similar? Is this due to changes in variance, effect size, or assumptions of the t-test? How should this be interpreted in practice?

  • Were samples drawn from distributions with these parameters, or were these the distribution parameters of the observed samples? Doubling the sample size should reduce the standard error by about 30%, and the minor decrease in mean difference and increase in SD don't seem to be enough to offset that.
  • Recalculating the statistics from the descriptive summaries gives roughly $t=2.67$ (Welch) for the full set; are you sure you didn't do the second calculation on just the second half?
  • As noted by @PBulls, there's likely a mistake in your calculation. In any case, a p-value of 0.03 is not that different from 0.07; neither is very strong evidence against the null, so why use a threshold of 0.05, or any threshold for that matter? Also, increasing your sample size is no guarantee of a smaller p-value: the null may simply be true, or you may have been "unlucky" with the additional observations. Either way, check the calculation of your second result, because the p-value you got is probably incorrect.
  • @J-J-J, re your second sentence: The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant.
  • If the added observations are very compatible with the null hypothesis, the p-value will increase, and rightly so: the additional observations don't provide you with additional evidence that the null hypothesis is false.

3 Answers


You seem to have made a miscalculation in your t-values

If you apply

$$t = \frac{x_1-x_2}{\sqrt{sd_1^2/n+sd_2^2/n}}$$

then your second result seems not to have used $n=30$, but instead $n=15$.
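
A quick check in R makes the slip visible; this is just the formula above applied to the posted $n=30$ summaries:

d <- 52.6 - 49.0                  # mean difference, 3.6
d / sqrt(5.0^2/30 + 5.4^2/30)     # with n = 30: about 2.68
d / sqrt(5.0^2/15 + 5.4^2/15)     # with n = 15: about 1.89, near the reported 1.81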


That aside

I expected that adding more observations would make the result more statistically significant, but instead the p-value increased and the result is no longer significant at α = 0.05.

P-values do not necessarily decrease with more observations. That tendency holds only when there truly is an effect, and even then the p-value fluctuates from sample to sample and can also rise.
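
A small simulation illustrates this; the population values here are hypothetical, chosen loosely to resemble the question:

set.seed(1)
p15 <- p30 <- numeric(2000)
for (i in 1:2000) {
  a <- rnorm(30, 52, 5)                          # group A, true effect present
  b <- rnorm(30, 49, 5)                          # group B
  p15[i] <- t.test(a[1:15], b[1:15])$p.value     # test after 15 per group
  p30[i] <- t.test(a, b)$p.value                 # test after 30 per group
}
mean(p30 > p15)   # fraction of runs where doubling n *raised* the p-value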

What primarily changes is the accuracy of the estimate that you make.


Confidence intervals can illustrate this very well.

  • With n=15, the estimated mean difference A-B is 4.2, with a 95% confidence interval of $[0.45 , 7.94]$.

  • With n=30, the estimated mean difference A-B is 3.6, with a 95% confidence interval of $[0.91 , 6.29]$.

(So your second result would still have been significant once $n$ is corrected in the formula; but suppose instead we had observed an effect of 2.6, with interval $[-0.09 , 5.29]$.)


So more observations make a result more accurate: typically the confidence interval, i.e. the error of the estimate, shrinks. As you see in the intervals above, the interval got smaller.

But at the same time the estimate can shift.

Significance (in null hypothesis testing) is not about whether your result is accurate, but about whether 'zero effect' lies inside that range of accuracy.


I used these calculations in R to get the intervals:

# 95% CI for the mean difference with n = 15 per group (df = 28)
(52.3-48.1)+
(sqrt(4.8^2+5.2^2)/sqrt(15))*
qt(0.025,28)*c(1,-1)
# 0.4571466 7.9428534

# 95% CI for the mean difference with n = 30 per group (df = 58)
(52.6-49.0)+
(sqrt(5.0^2+5.4^2)/sqrt(30))*
qt(0.025,58)*c(1,-1)
# 0.9104385 6.2895615
  • Does the fact that the experiment appears to have had an optional stopping point have any implications for the interpretation of the outcome(s)? It seems to me that it makes any claim of 'significance' in the Neyman–Pearson framework quite problematic, even if the optional stopping point does not affect the evidential meaning of the p-value.

The t-test boils down to comparing the separation between groups (i.e., the difference of the means) to the spread within groups (the standard deviations).

In this particular case, adding more data

  • brought the means closer together (52.3 - 48.1 = 4.2, but 52.6 - 49.0 = 3.6)
  • increased the standard deviations (4.8 -> 5.0, 5.2 -> 5.4)

Thus, the resulting p-value is a bit larger because the groups are "harder" to distinguish. How should you interpret this?

P-values often decrease with larger samples because that data allows more precise estimates of the mean and standard deviation. That didn't quite happen here. If the sample sizes and differences were larger, I might wonder whether your initial data only came from part of the population, but I think this is likely just random sampling error. The P-values themselves are random variables and are thus affected by that sampling. You could try to convince yourself of this by running t-tests on sub-samples of 15 observations from the larger dataset. You'll find some that are close to 0.03, some that are close to 0.07, and some that are larger and smaller.
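
Since the raw data aren't posted, here is a sketch of that sub-sampling exercise using simulated groups that merely match the n = 30 summaries (hypothetical stand-ins for the real observations):

set.seed(42)
a <- rnorm(30, mean = 52.6, sd = 5.0)   # hypothetical group A
b <- rnorm(30, mean = 49.0, sd = 5.4)   # hypothetical group B

# p-values from t-tests on repeated sub-samples of 15 per group
p <- replicate(1000, t.test(sample(a, 15), sample(b, 15))$p.value)
summary(p)   # the sub-sample p-values vary widely from run to run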

It is also important to be aware that "The Difference Between Significant and Not Significant is not Itself Statistically Significant", as the paper of the same title points out. Intuitively, imagine I told you that one number was between 1 and 3 ("different from zero") and another was between 0 and 5 (not). Could you reliably say which is likely to be larger? Nope. Thus, I would not read much of anything into the difference between 0.03 and 0.07.

  • The idea looks good, but I think you should be examining the standard errors instead of the standard deviations. The SEs fell from $4.8/\sqrt{15}\approx 1.24$ and $5.2/\sqrt{15}\approx 1.34$ to $5.0/\sqrt{30}\approx 0.91$ and $5.4/\sqrt{30}\approx 0.99$. That more than compensates for the change in mean difference, indicating there is some error in the reported p-values.
  • Despite the sample standard deviations increasing, we still get more precise estimates of the population means due to the additional sample size: the standard error of the mean goes down even though the standard deviation of the sample goes up. We gain more in reduced SE than we lose in the shrinking mean difference.

The simple answer is that you made a computation error.
I used a Welch t-test, because one should always use a Welch t-test.
For the sample of 15, I obtain $p=0.029$, and a CI of the difference of means of $[0.451, 7.949]$.
For the sample of 30, I obtain $p=0.01$, and a CI of the difference of means of $[0.909, 6.291]$.
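
These numbers can be reproduced from the summary statistics alone. A minimal sketch of the Welch computation in R (my own helper function, not the calculator used here):

welch <- function(m1, s1, n1, m2, s2, n2) {
  se <- sqrt(s1^2/n1 + s2^2/n2)                                # standard error of the difference
  t  <- (m1 - m2) / se                                         # t statistic
  df <- se^4 / ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))   # Welch-Satterthwaite df
  c(t = t, df = df, p = 2 * pt(-abs(t), df))
}
welch(52.3, 4.8, 15, 48.1, 5.2, 15)   # t ≈ 2.30, df ≈ 27.8, p ≈ 0.029
welch(52.6, 5.0, 30, 49.0, 5.4, 30)   # t ≈ 2.68, df ≈ 57.7, p ≈ 0.010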

So as you see, the p-value went down, the CI moved further away from $0$, and the CI narrowed. So there is nothing to wonder about...

Note that in your case, the p-values are virtually the same for a Student t-test, as you have balanced sample sizes and very similar variances. So the error is somewhere else in the data entry, and there must be more than one.
E.g., for the sample size of 30, you report $t=1.81$, when it should be $t\approx 2.68$; but you also report $p=0.075$ for $t=1.81$, which is itself inconsistent: with the 28 degrees of freedom an $n=15$ calculation would use, $t=1.81$ gives $p\approx 0.081$. I am not sure which software, tables, calculator, etc. you are using, so it is difficult to dig further.
If you want to check for yourself, I can recommend this calculator for the Student t-test, or this calculator for the Welch t-test. They are both quite didactic, show the various aspects of the test, the formulas, etc.

Now, should one always expect a lower p-value when increasing the sample size? Usually yes, but not always: it depends on how the difference of means and the standard deviations change. In your case those changes are small, so a lower p-value is exactly what you should have expected, and indeed, properly computed, that is what the data say.

  • +1. But a stronger argument can be developed by using a test that reproduces at least one of the posted p-values.
  • Re the edit: it is not the case that Welch's t-test will necessarily yield the same p-value as a Student t-test when the two group sample sizes are the same. Equality is assured only when that is the case and the group variances are equal. In the examples in the question the difference is practically unnoticeable because the variances are so close and the sample sizes are (for t-tests, anyway) fairly large. Check it out in R: g <- function(n, m, s) s * scale(rnorm(n)) + m; lapply(c(TRUE, FALSE), \(t) t.test(g(15, 52.3, 4.8), g(15, 48.1, 5.2), var.equal = t)$p.value)
  • (I was surprised, upon running variations of this calculation, at how difficult it is to make the two versions of the t-test give anything but very slight differences in their p-values.) One of the groups has to be tiny (n of 5 or less, typically) for real differences to show up.
  • @whuber, I did try. But there are several errors (e.g. the t statistics are wrong, but then the p-value derivation from the t statistic is also wrong), and so the number of possible permutations becomes very large, very fast.
  • @whuber, when the sample sizes are balanced, the p-values of the two tests will be very close, even with unequal variances, because the Behrens–Fisher problem does not rear its ugly head. My only point in the answer was to say that the issue was not Welch vs. Student; indeed, one gets virtually the same result either way.
