Does a uniform distribution of many p-values give statistical evidence that H0 is true?
A single statistical test can give evidence that the null hypothesis (H0) is false and thus that the alternative hypothesis (H1) is true. But it cannot be used to show that H0 is true, because failing to reject H0 does not mean that H0 is true.
But let's assume you can run the statistical test many times because you have many datasets, all independent of each other. All datasets are the result of the same process, and you want to make some statement (H0/H1) about the process itself; you aren't interested in the results of each single test. You then collect all the resulting p-values and see from a histogram that they are clearly uniformly distributed.
My reasoning now is that this can only happen if H0 is true; otherwise the p-values would be distributed differently. Is this therefore enough evidence to conclude that H0 is true? Or am I missing something essential here? It took me a lot of willpower to write "conclude that H0 is true", which just sounds horribly wrong in my head.
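To make my setup concrete, here is a minimal sketch of the situation (illustrative only; the one-sample t-test, the effect size and the sample sizes are assumptions for the sketch, not my actual analysis): many independent datasets, one test per dataset, and a histogram of the resulting p-values, which is flat when H0 holds and piles up near zero when it does not.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
n_datasets, n_obs = 1000, 50

# One p-value per dataset from a one-sample t-test of "mean = 0"
p_h0 = [stats.ttest_1samp(rng.normal(0.0, 1, n_obs), 0)[1] for _ in range(n_datasets)]  # H0 true
p_h1 = [stats.ttest_1samp(rng.normal(0.3, 1, n_obs), 0)[1] for _ in range(n_datasets)]  # H0 false

plt.hist(p_h0, bins=20, alpha=0.5, label='H0 true (roughly uniform)')
plt.hist(p_h1, bins=20, alpha=0.5, label='H0 false (skewed toward 0)')
plt.xlabel('p-value')
plt.legend()
plt.show()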
hypothesis-testing p-value combining-p-values
You might be interested in my answer to a different question stats.stackexchange.com/questions/171742/… which has some comments about the hypotheses here. – mdewey
H0 is false by its definition. – Joshua
On a side note, the reason why I have so many tests (and haven't just combined all the data into a single one) is that my data is spatially distributed around the globe and I wanted to see whether there are spatial patterns in the p-values (there aren't, but if there were, it would mean that either independence is violated or that H0/H1 is true in different parts of the globe). I haven't included this in the question text because I wanted to keep it general. – Leander Moesinger
4 Answers
Accepted answer (score 20) – Aksakal
I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple: how do you know that the distribution of the p-values is uniform? You would have to run a test for uniformity, which will return its own p-value, and you end up with the same kind of inference question you were trying to avoid, just one step removed. Instead of looking at the p-value of the original $H_0$, you now look at the p-value of another hypothesis $H_0'$ about the uniformity of the distribution of the original p-values.
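As a minimal sketch of that regress (my own illustration; the Kolmogorov-Smirnov test and the simulated p-values are stand-ins), checking whether the collected p-values are uniform just hands you yet another p-value to interpret:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pvals = rng.uniform(size=100)            # stand-in for 100 collected p-values

# Test H0': "these p-values are Uniform(0, 1)" -- the result is itself a p-value
stat, p_uniformity = stats.kstest(pvals, 'uniform')
print(p_uniformity)                      # we are back to deciding what this p-value means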
UPDATE
Here's the demonstration. I generate 100 samples of 100 observations each from a Gaussian and from a Poisson distribution, then obtain 100 p-values from a normality test of each sample. The premise of the question is that if the p-values come from a uniform distribution, this proves that the null hypothesis is correct, which is a stronger statement than the usual "fail to reject" in statistical inference. The trouble is that "the p-values are uniform" is itself a hypothesis, which you somehow have to test.
In the picture (first row) below I show the histograms of p-values from a normality test for the Gaussian and the Poisson samples, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.
The second row shows one sample from each distribution. The samples are relatively small, so you can't use many bins. Actually, this particular Gaussian sample doesn't look particularly Gaussian on the histogram.
In the third row, I show the combined samples of 10,000 observations for each distribution on a histogram. Here you can use more bins, and the shapes are more obvious.
Finally, I run the same normality test on the combined samples: it rejects normality for the Poisson data while failing to reject for the Gaussian data. The p-values are: [0.45348631] [0.]
This is not a proof, of course, but a demonstration of the idea that you are better off running the same test on the combined sample than trying to analyze the distribution of p-values from subsamples.
Here's Python code:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

def pvs(x):
    """Jarque-Bera normality-test p-value for each column of x."""
    pn = x.shape[1]
    pvals = np.zeros(pn)
    for i in range(pn):
        pvals[i] = stats.jarque_bera(x[:, i])[1]
    return pvals

n = 100        # observations per sample
pn = 100       # number of samples
mu, sigma = 1, 2

np.random.seed(0)
x = np.random.normal(mu, sigma, size=(n, pn))   # 100 Gaussian samples
x2 = np.random.poisson(15, size=(n, pn))        # 100 Poisson samples

# One normality-test p-value per sample
pvals = pvs(x)
pvals2 = pvs(x2)

# Pool each set of samples into a single sample of 10,000 observations and test once
x_f = x.reshape((n * pn, 1))
x2_f = x2.reshape((n * pn, 1))
pvals_f = pvs(x_f)
pvals2_f = pvs(x2_f)
print(pvals_f, pvals2_f)

plt.figure(figsize=(9, 9))
plt.subplot(3, 2, 1)
plt.hist(pvals)
plt.gca().set_title('True Normal')
plt.gca().set_xlabel('p-value')
plt.subplot(3, 2, 2)
plt.hist(pvals2)
plt.gca().set_title('Poisson')
plt.gca().set_xlabel('p-value')
plt.subplot(3, 2, 3)
plt.hist(x[:, 0])
plt.gca().set_title('Small sample')
plt.gca().set_xlabel('x')
plt.subplot(3, 2, 4)
plt.hist(x2[:, 0])
plt.gca().set_title('Small sample')
plt.gca().set_xlabel('x')
plt.subplot(3, 2, 5)
plt.hist(x_f[:, 0], 100)
plt.gca().set_title('Full sample')
plt.gca().set_xlabel('x')
plt.subplot(3, 2, 6)
plt.hist(x2_f[:, 0], 100)
plt.gca().set_title('Full sample')
plt.gca().set_xlabel('x')
plt.show()
@LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose you have a sample with 100 observations and get a p-value; then you get 99 additional samples and end up with 100 p-values. Instead, you could just run one sample of 10,000 observations and get one p-value, and it'll be more convincing. – Aksakal
@LeanderMoesinger, it's likely to be not small. – Aksakal
Your answer does not address the question; he didn't ask about proof but about evidence. – Carlos Cinelli
@CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this evidence unless he shows the values come from a uniform distribution? That's what I'm talking about. – Aksakal
@Aksakal this is about mathematics; an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument. – Carlos Cinelli
Answer (score 19) – Matthew Gunn
Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (e.g., standard errors typically decrease as $\sqrt{n}$ increases). Also, partitioning the data and fitting a model repeatedly can indeed be useful.
But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"
That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.
In more modern mathematical terms, a finite set of observations cannot logically entail $\forall a \in A \, [\, a \in B \,]$ if $A$ is not a finite set. Two notable examples, as discussed by Magee and Passmore:
For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.
For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.
If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.
An (incomplete) listing of ways forward:
Karl Popper and falsificationism
In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.
Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.
Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.
Bayesian, subjective probability
Let's assume we're interested in a parameter $\theta$.
To the frequentist statistician, the parameter $\theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's The Foundations of Statistics), you can model your own uncertainty over $\theta$ using the tools of probability. To the subjective Bayesian, $\theta$ is a random variable and you have some prior $P(\theta)$. You can then talk about the subjective probability $P(\theta \mid X)$ of different values of $\theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.
This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in the sense of corresponding to reality. A tricky question for any Bayesian interpretation is where the priors come from. Also, what if the model is misspecified?
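As a minimal sketch of what this viewpoint buys you (an illustration with assumptions I am adding: a point null $\theta = 0.5$ for a binomial proportion, a flat prior on $\theta$ under the alternative, and prior odds of 1), the Bayesian machinery returns an actual posterior probability that $H_0$ is true, conditional on the model:
from scipy import stats

n, k = 100, 52                      # hypothetical data: 52 successes in 100 trials
prior_h0 = 0.5                      # prior probability of the point null theta = 0.5

# Marginal likelihood under H0 (theta fixed at 0.5) ...
m0 = stats.binom.pmf(k, n, 0.5)
# ... and under H1 (theta ~ Uniform(0, 1)): integrating the binomial
# likelihood over a flat prior gives 1 / (n + 1)
m1 = 1.0 / (n + 1)

posterior_h0 = prior_h0 * m0 / (prior_h0 * m0 + (1 - prior_h0) * m1)
print(posterior_h0)                 # P(H0 | data) under these modeling assumptions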
George E. P. Box
A famous aphorism of George E. P. Box is that "all models are wrong, but some are useful."
Newton's law may not be true in the strong sense, but it is immensely useful for many problems. Box's view is quite important in the modern big-data context, where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.
Additional comments
There's a world of difference in statistics between estimating a parameter $\theta \approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.
Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.
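On the combining-p-values angle, here is a small sketch (my addition; the simulated p-values are placeholders for the collected ones): Fisher's method pools the individual p-values into a single combined test of the shared null, which is usually more informative than eyeballing their histogram.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pvals = rng.uniform(size=100)            # placeholder for the collected p-values

# Fisher's method: -2 * sum(log p) ~ chi-squared with 2k degrees of freedom under H0
stat, p_combined = stats.combine_pvalues(pvals, method='fisher')
print(stat, p_combined)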
How far you can go beyond narrow statistical interpretations is a difficult question.
This has been an interesting read and gave me some nice things to think about! I wish I could accept multiple answers. – Leander Moesinger
Quite an explanation. My prof once summarized Kuhn in the spirit of Popper: 'Science progresses from funeral to funeral.' – skrubber
Kuhn and others famously misinterpret Popper when claiming his observations don't match how science is done. This is known as naive falsificationism, and it's not what Popper (later) put forward. It's a straw man. – Konrad Rudolph
It's answers like this that keep me visiting StackExchange sites. – Trilarion
Answer (score 4)
In a sense you are right (see the p-curve), with some small caveats:
- You need the test to have some power under the alternative. Illustration of the potential problem: generating a p-value from a uniform distribution on 0 to 1 and rejecting when $p \leq \alpha$ is an (admittedly pretty useless) level-$\alpha$ test for any null hypothesis, but you will get a uniform distribution of p-values whether $H_0$ is true or not (see the sketch below).
- You can only really show that you are quite close to $H_0$ being true (i.e. under the true parameter values the distribution of p-values might be close to uniform even if $H_0$ is false).
With realistic applications, you tend to get additional issues. These mostly arise because no single person/lab/study group can usually do all the necessary studies. As a result, one tends to look at studies from lots of groups, at which point you have increased concerns about underreporting, selective reporting of significant/surprising findings, p-hacking, multiple testing/multiple-testing corrections, and so on (if you had done all relevant experiments yourself, at least you'd know).
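A minimal sketch of the power caveat mentioned above (my addition; the one-sample t-test, effect size, and sample sizes are illustrative assumptions): when $H_0$ is in fact false, a test with power piles its p-values near zero, while a powerless test still produces a uniform histogram, so a uniform p-value histogram only supports $H_0$ if the test could have detected a departure.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

rng = np.random.default_rng(3)
n_datasets, n_obs, effect = 2000, 50, 0.5       # H0 ("mean = 0") is false in both cases

# A real test (one-sample t-test) versus a "useless" test that ignores the data
p_real = [stats.ttest_1samp(rng.normal(effect, 1, n_obs), 0)[1] for _ in range(n_datasets)]
p_useless = rng.uniform(size=n_datasets)        # valid level-alpha test with zero power

plt.hist(p_real, bins=20, alpha=0.5, label='test with power, H0 false')
plt.hist(p_useless, bins=20, alpha=0.5, label='powerless test, H0 false')
plt.xlabel('p-value')
plt.legend()
plt.show()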
Answer (score -2)
Null hypothesis (H0): Gravity causes everything in the universe to fall toward Earth's surface.
Alternative hypothesis (H1): Nothing ever falls.
I performed one million experiments with dozens of household objects and fail to reject H0 with $p < 0.01$ every time. Is H0 true?
Do you think Galileo did one million trials? None of this stuff is necessary in the physical sciences. Establishing the laws of nature by applying the scientific method does not reduce to statistical inference. – Aksakal
-1 This is scientifically, statistically, and historically inaccurate. The Greeks once believed that it was affinity that drew objects to the Earth. Not bad, but it doesn't explain 3+ body systems well. Hypotheses should be complementary. Lastly, stating a possibly known bias as $H_0$ and showing that experiments continue to lead to the same incorrect conclusion doesn't make the conclusion correct. E.g. "women earn less than men because they are less driven"; sample all women's salaries, and $H_0$ is true! – AdamO
@AdamO that is exactly my point. – usul
@AdamO, in Western countries women earn less when they work less, for a variety of reasons including their own choice, disincentives of all kinds, and hostile work environments in some places. When they work the same, they earn about the same, e.g. see Medicare nurse salaries, where women are the great majority: medscape.com/slideshow/…. They all earn the same $37 when working hourly. Totally off-topic, of course. – Aksakal
If your null hypothesis is "Gravity causes everything in the universe to fall toward Earth's surface", isn't the alternative hypothesis "There is at least one thing in the universe that does not fall toward the Earth's surface", and not "Nothing ever falls"? – Eff
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
20
down vote
accepted
I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.
UPDATE
Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.
In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.
The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.
In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.
Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]
This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.
Here's Python code:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
def pvs(x):
pn = x.shape[1]
pvals = np.zeros(pn)
for i in range(pn):
pvals[i] = stats.jarque_bera(x[:,i])[1]
return pvals
n = 100
pn = 100
mu, sigma = 1, 2
np.random.seed(0)
x = np.random.normal(mu, sigma, size=(n,pn))
x2 = np.random.poisson(15, size=(n,pn))
print(x[1,1])
pvals = pvs(x)
pvals2 = pvs(x2)
x_f = x.reshape((n*pn,1))
pvals_f = pvs(x_f)
x2_f = x2.reshape((n*pn,1))
pvals2_f = pvs(x2_f)
print(pvals_f,pvals2_f)
print(x_f.shape,x_f[:,0])
#print(pvals)
plt.figure(figsize=(9,9))
plt.subplot(3,2,1)
plt.hist(pvals)
plt.gca().set_title('True Normal')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,2)
plt.hist(pvals2)
plt.gca().set_title('Poisson')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,3)
plt.hist(x[:,0])
plt.gca().set_title('a small sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,4)
plt.hist(x2[:,0])
plt.gca().set_title('a small Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,5)
plt.hist(x_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,6)
plt.hist(x2_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.show()
2
@LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
– Aksakal
yesterday
1
@LeanderMoesinger, it's likely to be not small
– Aksakal
yesterday
1
Your answer does not address the question, he didn’t ask about proof but about evidence.
– Carlos Cinelli
yesterday
1
@CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
– Aksakal
yesterday
2
@Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
– Carlos Cinelli
yesterday
|
show 5 more comments
up vote
20
down vote
accepted
I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.
UPDATE
Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.
In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.
The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.
In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.
Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]
This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.
Here's Python code:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
def pvs(x):
pn = x.shape[1]
pvals = np.zeros(pn)
for i in range(pn):
pvals[i] = stats.jarque_bera(x[:,i])[1]
return pvals
n = 100
pn = 100
mu, sigma = 1, 2
np.random.seed(0)
x = np.random.normal(mu, sigma, size=(n,pn))
x2 = np.random.poisson(15, size=(n,pn))
print(x[1,1])
pvals = pvs(x)
pvals2 = pvs(x2)
x_f = x.reshape((n*pn,1))
pvals_f = pvs(x_f)
x2_f = x2.reshape((n*pn,1))
pvals2_f = pvs(x2_f)
print(pvals_f,pvals2_f)
print(x_f.shape,x_f[:,0])
#print(pvals)
plt.figure(figsize=(9,9))
plt.subplot(3,2,1)
plt.hist(pvals)
plt.gca().set_title('True Normal')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,2)
plt.hist(pvals2)
plt.gca().set_title('Poisson')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,3)
plt.hist(x[:,0])
plt.gca().set_title('a small sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,4)
plt.hist(x2[:,0])
plt.gca().set_title('a small Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,5)
plt.hist(x_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,6)
plt.hist(x2_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.show()
2
@LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
– Aksakal
yesterday
1
@LeanderMoesinger, it's likely to be not small
– Aksakal
yesterday
1
Your answer does not address the question, he didn’t ask about proof but about evidence.
– Carlos Cinelli
yesterday
1
@CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
– Aksakal
yesterday
2
@Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
– Carlos Cinelli
yesterday
|
show 5 more comments
up vote
20
down vote
accepted
up vote
20
down vote
accepted
I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.
UPDATE
Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.
In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.
The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.
In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.
Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]
This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.
Here's Python code:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
def pvs(x):
pn = x.shape[1]
pvals = np.zeros(pn)
for i in range(pn):
pvals[i] = stats.jarque_bera(x[:,i])[1]
return pvals
n = 100
pn = 100
mu, sigma = 1, 2
np.random.seed(0)
x = np.random.normal(mu, sigma, size=(n,pn))
x2 = np.random.poisson(15, size=(n,pn))
print(x[1,1])
pvals = pvs(x)
pvals2 = pvs(x2)
x_f = x.reshape((n*pn,1))
pvals_f = pvs(x_f)
x2_f = x2.reshape((n*pn,1))
pvals2_f = pvs(x2_f)
print(pvals_f,pvals2_f)
print(x_f.shape,x_f[:,0])
#print(pvals)
plt.figure(figsize=(9,9))
plt.subplot(3,2,1)
plt.hist(pvals)
plt.gca().set_title('True Normal')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,2)
plt.hist(pvals2)
plt.gca().set_title('Poisson')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,3)
plt.hist(x[:,0])
plt.gca().set_title('a small sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,4)
plt.hist(x2[:,0])
plt.gca().set_title('a small Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,5)
plt.hist(x_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,6)
plt.hist(x2_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.show()
I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.
UPDATE
Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.
In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.
The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.
In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.
Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]
This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.
Here's Python code:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
def pvs(x):
pn = x.shape[1]
pvals = np.zeros(pn)
for i in range(pn):
pvals[i] = stats.jarque_bera(x[:,i])[1]
return pvals
n = 100
pn = 100
mu, sigma = 1, 2
np.random.seed(0)
x = np.random.normal(mu, sigma, size=(n,pn))
x2 = np.random.poisson(15, size=(n,pn))
print(x[1,1])
pvals = pvs(x)
pvals2 = pvs(x2)
x_f = x.reshape((n*pn,1))
pvals_f = pvs(x_f)
x2_f = x2.reshape((n*pn,1))
pvals2_f = pvs(x2_f)
print(pvals_f,pvals2_f)
print(x_f.shape,x_f[:,0])
#print(pvals)
plt.figure(figsize=(9,9))
plt.subplot(3,2,1)
plt.hist(pvals)
plt.gca().set_title('True Normal')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,2)
plt.hist(pvals2)
plt.gca().set_title('Poisson')
plt.gca().set_ylabel('p-value')
plt.subplot(3,2,3)
plt.hist(x[:,0])
plt.gca().set_title('a small sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,4)
plt.hist(x2[:,0])
plt.gca().set_title('a small Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,5)
plt.hist(x_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.subplot(3,2,6)
plt.hist(x2_f[:,0],100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')
plt.show()
edited 4 hours ago
answered yesterday
Aksakal
37.5k447109
37.5k447109
2
@LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
– Aksakal
yesterday
1
@LeanderMoesinger, it's likely to be not small
– Aksakal
yesterday
1
Your answer does not address the question, he didn’t ask about proof but about evidence.
– Carlos Cinelli
yesterday
1
@CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
– Aksakal
yesterday
2
@Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
– Carlos Cinelli
yesterday
|
show 5 more comments
2
@LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
– Aksakal
yesterday
1
@LeanderMoesinger, it's likely to be not small
– Aksakal
yesterday
1
Your answer does not address the question, he didn’t ask about proof but about evidence.
– Carlos Cinelli
yesterday
1
@CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
– Aksakal
yesterday
2
@Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
– Carlos Cinelli
yesterday
2
2
@LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
– Aksakal
yesterday
@LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
– Aksakal
yesterday
1
1
@LeanderMoesinger, it's likely to be not small
– Aksakal
yesterday
@LeanderMoesinger, it's likely to be not small
– Aksakal
yesterday
1
1
Your answer does not address the question, he didn’t ask about proof but about evidence.
– Carlos Cinelli
yesterday
Your answer does not address the question, he didn’t ask about proof but about evidence.
– Carlos Cinelli
yesterday
1
1
@CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
– Aksakal
yesterday
@CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
– Aksakal
yesterday
2
2
@Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
– Carlos Cinelli
yesterday
@Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
– Carlos Cinelli
yesterday
|
show 5 more comments
up vote
19
down vote
Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (eg. typically standard errors decrease as $sqrtn$ increases). Also, your partitioning the data and fitting a model repeatedly can indeed be useful.
But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"
That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.
In more modern math, a finite set of observations cannot logically entail $forall_a in A left[ a in B right]$ if A is not a finite set. Two notable examples as discussed by Magee and Passermore:
For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.
For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.
If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.
An (incomplete) listing of ways forward:
Karl Popper and falsificationism
In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.
Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.
Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.
Bayesian, subjective probability
Let's assume we're interested in a parameter $theta$.
To the frequentist statistician, parameter $theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's Foundation of Statistics), you can model your own uncertainty over $theta$ using the tools of probability. To the subjective Bayesian, $theta$ is a random variable and you have some prior $P(theta)$. You can then talk about the subjective probability $P(theta mid X)$ of different values of $theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.
This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in terms of correspondence to reality. A tricky question for any Bayesian interpretation is where do priors come from? Also, what if the model is misspecified?
George P. Box
A famous aphorism of George E.P. Box is that "all models are false, but some are useful."
Newton's law may not be true in the strong sense, but it immensely useful for many problems. Box's view is quite important in the modern big data context where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.
Additional comments
There's a world of difference in statistics between estimating a parameter $theta approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.
Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.
How far you can go beyond narrow statistical interpretations is a difficult question.
This has been an interesting read and gave some nice things to think about! I wish i could accept multiple answers.
– Leander Moesinger
yesterday
Quite an explanation. My prof once summarized Kuhn in the spirit of Popper: 'Science progresses from funeral to funeral'
– skrubber
yesterday
Kuhn etc famously misinterpret Popper when claiming his observations don't match how science is done. This is known as native falsificationism, and it's not what Popper (later) put forward. It's a straw man.
– Konrad Rudolph
yesterday
1
It's answers like this I keep visiting StackExchange sites.
– Trilarion
6 hours ago
add a comment |
up vote
19
down vote
Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (eg. typically standard errors decrease as $sqrtn$ increases). Also, your partitioning the data and fitting a model repeatedly can indeed be useful.
But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"
That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.
In more modern math, a finite set of observations cannot logically entail $forall_a in A left[ a in B right]$ if A is not a finite set. Two notable examples as discussed by Magee and Passermore:
For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.
For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.
If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.
An (incomplete) listing of ways forward:
Karl Popper and falsificationism
In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.
Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.
Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.
Bayesian, subjective probability
Let's assume we're interested in a parameter $theta$.
To the frequentist statistician, parameter $theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's Foundation of Statistics), you can model your own uncertainty over $theta$ using the tools of probability. To the subjective Bayesian, $theta$ is a random variable and you have some prior $P(theta)$. You can then talk about the subjective probability $P(theta mid X)$ of different values of $theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.
This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in terms of correspondence to reality. A tricky question for any Bayesian interpretation is where do priors come from? Also, what if the model is misspecified?
George P. Box
A famous aphorism of George E.P. Box is that "all models are false, but some are useful."
Newton's law may not be true in the strong sense, but it immensely useful for many problems. Box's view is quite important in the modern big data context where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.
Additional comments
There's a world of difference in statistics between estimating a parameter $theta approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.
Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.
How far you can go beyond narrow statistical interpretations is a difficult question.
This has been an interesting read and gave some nice things to think about! I wish i could accept multiple answers.
– Leander Moesinger
yesterday
Quite an explanation. My prof once summarized Kuhn in the spirit of Popper: 'Science progresses from funeral to funeral'
– skrubber
yesterday
Kuhn etc famously misinterpret Popper when claiming his observations don't match how science is done. This is known as native falsificationism, and it's not what Popper (later) put forward. It's a straw man.
– Konrad Rudolph
yesterday
1
It's answers like this I keep visiting StackExchange sites.
– Trilarion
6 hours ago
add a comment |
up vote
19
down vote
up vote
19
down vote
Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (eg. typically standard errors decrease as $sqrtn$ increases). Also, your partitioning the data and fitting a model repeatedly can indeed be useful.
But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"
That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.
In more modern math, a finite set of observations cannot logically entail $forall_a in A left[ a in B right]$ if A is not a finite set. Two notable examples as discussed by Magee and Passermore:
For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.
For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.
If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.
An (incomplete) listing of ways forward:
Karl Popper and falsificationism
In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.
Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.
Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.
Bayesian, subjective probability
Let's assume we're interested in a parameter $theta$.
To the frequentist statistician, parameter $theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's Foundation of Statistics), you can model your own uncertainty over $theta$ using the tools of probability. To the subjective Bayesian, $theta$ is a random variable and you have some prior $P(theta)$. You can then talk about the subjective probability $P(theta mid X)$ of different values of $theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.
This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in terms of correspondence to reality. A tricky question for any Bayesian interpretation is where do priors come from? Also, what if the model is misspecified?
George P. Box
A famous aphorism of George E.P. Box is that "all models are false, but some are useful."
Newton's law may not be true in the strong sense, but it immensely useful for many problems. Box's view is quite important in the modern big data context where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.
Additional comments
There's a world of difference in statistics between estimating a parameter $theta approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.
Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.
How far you can go beyond narrow statistical interpretations is a difficult question.
Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (eg. typically standard errors decrease as $sqrtn$ increases). Also, your partitioning the data and fitting a model repeatedly can indeed be useful.
But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"
That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.
In more modern math, a finite set of observations cannot logically entail $forall_a in A left[ a in B right]$ if A is not a finite set. Two notable examples as discussed by Magee and Passermore:
For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.
For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.
If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.
An (incomplete) listing of ways forward:
Karl Popper and falsificationism
In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.
Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.
Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.
Bayesian, subjective probability
Let's assume we're interested in a parameter $theta$.
To the frequentist statistician, parameter $theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's Foundation of Statistics), you can model your own uncertainty over $theta$ using the tools of probability. To the subjective Bayesian, $theta$ is a random variable and you have some prior $P(theta)$. You can then talk about the subjective probability $P(theta mid X)$ of different values of $theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.
This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in terms of correspondence to reality. A tricky question for any Bayesian interpretation is where do priors come from? Also, what if the model is misspecified?
George P. Box
A famous aphorism of George E.P. Box is that "all models are false, but some are useful."
Newton's law may not be true in the strong sense, but it immensely useful for many problems. Box's view is quite important in the modern big data context where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.
Additional comments
There's a world of difference in statistics between estimating a parameter $theta approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.
Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.
How far you can go beyond narrow statistical interpretations is a difficult question.
edited yesterday
answered yesterday
Matthew Gunn
This has been an interesting read and gave some nice things to think about! I wish i could accept multiple answers.
– Leander Moesinger
yesterday
Quite an explanation. My prof once summarized Kuhn in the spirit of Popper: 'Science progresses from funeral to funeral'
– skrubber
yesterday
Kuhn et al. famously misinterpret Popper when claiming his observations don't match how science is done. This is known as naive falsificationism, and it's not what Popper (later) put forward. It's a straw man.
– Konrad Rudolph
yesterday
1
It's answers like this I keep visiting StackExchange sites.
– Trilarion
6 hours ago
up vote
4
down vote
In a sense you are right (see the p-curve) with some small caveats:
- You need the test to have some power under the alternative. Illustration of the potential problem: generating a p-value from a uniform distribution on $[0, 1]$ and rejecting when $p \leq \alpha$ is an (admittedly pretty useless) level-$\alpha$ test for any null hypothesis, but you will get a uniform distribution of p-values whether $H_0$ is true or not.
- You can only really show that you are quite close to $H_0$ being true (i.e. under the true parameter values the distribution of p-values might be close to uniform, even if $H_0$ is false). Both caveats are illustrated in the simulation sketch below.
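A minimal simulation sketch of both caveats (my own illustration, not part of the original answer), assuming one-sample t-tests of $H_0\colon \mu = 0$ on normal data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_test = 2000, 50

def pvals(true_mean):
    # p-values from many independent one-sample t-tests of H0: mu = 0
    data = rng.normal(loc=true_mean, scale=1.0, size=(n_tests, n_per_test))
    return stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue

p_true_null = pvals(0.0)                 # H0 exactly true: uniform p-values
p_near_null = pvals(0.02)                # H0 barely false: nearly uniform (caveat 2)
p_false_null = pvals(0.5)                # H0 clearly false and the test has power: p-values pile up near 0
p_powerless = rng.uniform(size=n_tests)  # powerless "test": uniform whether or not H0 holds (caveat 1)

for name, p in [("true H0", p_true_null), ("near H0", p_near_null),
                ("false H0", p_false_null), ("powerless test", p_powerless)]:
    print(f"{name:14s} share of p < 0.05: {np.mean(p < 0.05):.3f}")

A histogram of the "near H0" p-values would look essentially flat even though $H_0$ is false, which is why a flat histogram can only show that you are close to $H_0$.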
With realistic applications, you tend to get additional issues. These mostly arise because no one person/lab/study group can usually do all the necessary studies. As a result, one tends to look at studies from lots of groups, at which point you have increased concerns (if you had done all relevant experiments yourself, at least you'd know) about underreporting, selective reporting of significant/surprising findings, p-hacking, multiple testing and multiple-testing corrections, and so on.
edited yesterday
answered yesterday
Björn
up vote
-2
down vote
Null hypothesis (H0): Gravity causes everything in the universe to fall toward Earth's surface.
Alternate hypothesis (H1): Nothing ever falls.
Perform 1 million experiments with dozens of household objects and fail to reject H0 at $p < 0.01$ every time. Is H0 true?
answered yesterday
usul
2
Do you think Galileo did one million trials? None of this stuff is necessary in physical sciences. Establishing the laws of nature by applying scientific method does not reduce into statistical inference.
– Aksakal
yesterday
1
-1 This is scientifically, statistically, and historically inaccurate. Greeks once believed that it was affinity that drew objects to the Earth. Not bad, but doesn't explain the 3+ body system problems well. Hypotheses should be complementary. Lastly stating a possibly known bias as H_0 and showing experiments continue to lead to the same incorrect conclusion doesn't make the conclusion correct. e.g. women earn less than men b/c they are less driven, sample all women's salaries, H_0 is true!
– AdamO
yesterday
@AdamO that is exactly my point.
– usul
yesterday
@AdamO, in the Western countries women earn less when they work less for a variety of reasons including their own choice, disincentives of all kinds and hostile work environment in some places. When they work the same, they earn about the same, e.g. see medicare nurse salaries where women are the great majority: medscape.com/slideshow/…. They all earn the same $37 when working hourly. Totally off-topic, of course.
– Aksakal
yesterday
2
If your null hypothesis is "Gravity causes everything in the universe to fall toward Earth's surface", isn't the alternative hypothesis "There is at least one thing in the universe that does not fall toward the Earth's surface" and not "Nothing ever falls"?
– Eff
15 hours ago