Does a uniform distribution of many p-values give statistical evidence that H0 is true?

A single statistical test can give evidence that the null hypothesis (H0) is false and therefore that the alternative hypothesis (H1) is true. But it cannot be used to show that H0 is true, because failure to reject H0 does not mean that H0 is true.



But let's assume you can run the statistical test many times because you have many datasets, all independent of each other. All datasets are the result of the same process, and you want to make a statement (H0/H1) about the process itself rather than about each single test. You then collect all the resulting p-values and see from a histogram that they are clearly uniformly distributed.



My reasoning is that this can only happen if H0 is true; otherwise the p-values would be distributed differently. Is this therefore enough evidence to conclude that H0 is true? Or am I missing something essential here? It took me a lot of willpower to write "conclude that H0 is true", which just sounds horribly wrong in my head.
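
To make the setup concrete, here is a minimal sketch of the described procedure (my illustration, not part of the original question; the one-sample t-test and Gaussian data are arbitrary choices): many independent datasets generated with H0 true, one test per dataset, and a histogram of the resulting p-values.

import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
# 1000 independent datasets, each generated with H0 true (the mean really is 0)
pvals = [stats.ttest_1samp(rng.normal(0, 1, size=50), 0).pvalue
         for _ in range(1000)]

plt.hist(pvals, bins=20)  # roughly flat: p-values are Uniform(0, 1) under H0
plt.xlabel('p-value')
plt.ylabel('count')
plt.show()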










hypothesis-testing p-value combining-p-values

asked yesterday by Leander Moesinger, edited 9 hours ago by mdewey



















  • You might be interested in my answer to a different question stats.stackexchange.com/questions/171742/… which has some comments about the hypotheses here. – mdewey, yesterday

  • H0 is false by its definition. – Joshua, yesterday

  • On a side note, the reason why I have so many tests (and haven't just combined all the data into a single one) is that my data is spatially distributed around the globe and I wanted to see whether there are spatial patterns in the p-values (there aren't, but if there were, it would mean that either independence is violated or that H0/H1 is true in different parts of the globe). I haven't included this in the question text because I wanted to keep it general. – Leander Moesinger, yesterday

4 Answers

Answer 1 (accepted, score 20) – Aksakal, answered yesterday, edited 4 hours ago

I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity, which will return its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at the p-value of the original $H_0$, you now look at the p-value of another $H'_0$ about the uniformity of the distribution of the original p-values.
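
To see the regress concretely, here is a minimal sketch (my addition, not from the original answer): treating a batch of collected p-values as data and testing them for uniformity, e.g. with a Kolmogorov-Smirnov test, just produces another p-value.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for 100 p-values collected from the original tests (here drawn
# from Uniform(0, 1), i.e., as if H0 were true in every dataset)
pvals = rng.uniform(size=100)

# H'_0: the collected p-values follow Uniform(0, 1)
stat, p_uniformity = stats.kstest(pvals, 'uniform')
print(p_uniformity)  # a p-value about p-values: the same inference problem, one level up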



UPDATE



Here's the demonstration. I generate 100 samples of 100 observations each from a Gaussian and from a Poisson distribution, then obtain 100 p-values from a normality test on each sample. The premise of the question is that if the p-values come from a uniform distribution, then this proves that the null hypothesis is correct, which is a stronger statement than the usual "fails to reject" in statistical inference. The trouble is that "the p-values are from a uniform distribution" is a hypothesis itself, which you have to somehow test.



In the picture below (first row) I'm showing the histograms of p-values from a normality test for the Gaussian and Poisson samples, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.



The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have many bins. Actually, this particular Gaussian sample doesn't look very Gaussian at all on the histogram.



In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.



Finally, I run the same normality test on the combined samples: it rejects normality for the Poisson data while failing to reject for the Gaussian. The p-values are: [0.45348631] [0.]



[Figure: p-value histograms (top row), one small sample each (middle row), and the combined 10,000-observation samples (bottom row) for the Gaussian and Poisson cases]



This is not a proof, of course, but a demonstration of the idea that you had better run the same test on the combined sample instead of trying to analyze the distribution of p-values from subsamples.



Here's Python code:



import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

def pvs(x):
    # Jarque-Bera normality test on each column; return the p-values
    pn = x.shape[1]
    pvals = np.zeros(pn)
    for i in range(pn):
        pvals[i] = stats.jarque_bera(x[:, i])[1]
    return pvals

n = 100   # observations per sample
pn = 100  # number of samples
mu, sigma = 1, 2
np.random.seed(0)
x = np.random.normal(mu, sigma, size=(n, pn))
x2 = np.random.poisson(15, size=(n, pn))

pvals = pvs(x)    # 100 p-values from the Gaussian samples
pvals2 = pvs(x2)  # 100 p-values from the Poisson samples

# Combine each set of samples into a single 10,000-observation sample
x_f = x.reshape((n * pn, 1))
pvals_f = pvs(x_f)
x2_f = x2.reshape((n * pn, 1))
pvals2_f = pvs(x2_f)
print(pvals_f, pvals2_f)

plt.figure(figsize=(9, 9))
plt.subplot(3, 2, 1)
plt.hist(pvals)
plt.gca().set_title('True Normal')
plt.gca().set_ylabel('p-value')

plt.subplot(3, 2, 2)
plt.hist(pvals2)
plt.gca().set_title('Poisson')
plt.gca().set_ylabel('p-value')

plt.subplot(3, 2, 3)
plt.hist(x[:, 0])
plt.gca().set_title('a small sample')
plt.gca().set_ylabel('x')

plt.subplot(3, 2, 4)
plt.hist(x2[:, 0])
plt.gca().set_title('a small sample')
plt.gca().set_ylabel('x')

plt.subplot(3, 2, 5)
plt.hist(x_f[:, 0], 100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')

plt.subplot(3, 2, 6)
plt.hist(x2_f[:, 0], 100)
plt.gca().set_title('Full Sample')
plt.gca().set_ylabel('x')

plt.show()























  • @LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose you have a sample with 100 observations and get a p-value; then you get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000-observation sample and get one p-value, and it'll be more convincing. – Aksakal, yesterday

  • @LeanderMoesinger, it's likely to be not small – Aksakal, yesterday

  • Your answer does not address the question; he didn't ask about proof but about evidence. – Carlos Cinelli, yesterday

  • @CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this evidence unless he proves the values are from a uniform distribution? That's what I'm talking about. – Aksakal, yesterday

  • @Aksakal this is about mathematics; an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument. – Carlos Cinelli, yesterday


















Answer 2 (score 19)













Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (e.g., standard errors typically shrink like $1/\sqrt{n}$). Also, partitioning the data and fitting a model repeatedly can indeed be useful.



But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"



That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.



In more modern math, a finite set of observations cannot logically entail $\forall a \in A \, [a \in B]$ if $A$ is not a finite set. Two notable examples as discussed by Magee and Passermore:



  • For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.


  • For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.


If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.



An (incomplete) listing of ways forward:




Karl Popper and falsificationism



In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.



Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.



Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.



Bayesian, subjective probability



Let's assume we're interested in a parameter $\theta$.



To the frequentist statistician, the parameter $\theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's The Foundations of Statistics), you can model your own uncertainty over $\theta$ using the tools of probability. To the subjective Bayesian, $\theta$ is a random variable and you have some prior $P(\theta)$. You can then talk about the subjective probability $P(\theta \mid X)$ of different values of $\theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.



This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in terms of correspondence to reality. A tricky question for any Bayesian interpretation is where do priors come from? Also, what if the model is misspecified?
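
As a minimal illustration of this updating (my sketch, not from the answer; the coin example and numbers are hypothetical), a conjugate Beta prior over a Bernoulli parameter $\theta$ gives the posterior $P(\theta \mid X)$ in closed form:

from scipy import stats

# Beta(1, 1) prior: uniform subjective belief over theta in [0, 1]
a, b = 1.0, 1.0
heads, tails = 7, 3                            # hypothetical data X
posterior = stats.beta(a + heads, b + tails)   # P(theta | X) by conjugacy
print(posterior.mean())                        # posterior mean of theta
print(posterior.interval(0.95))                # a 95% credible interval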



George E. P. Box



A famous aphorism of George E.P. Box is that "all models are false, but some are useful."



Newton's law may not be true in the strong sense, but it is immensely useful for many problems. Box's view is quite important in the modern big-data context, where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.



Additional comments



There's a world of difference in statistics between estimating a parameter $\theta \approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.



Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.



How far you can go beyond narrow statistical interpretations is a difficult question.




























  • This has been an interesting read and gave me some nice things to think about! I wish I could accept multiple answers. – Leander Moesinger, yesterday

  • Quite an explanation. My prof once summarized Kuhn in the spirit of Popper: 'Science progresses from funeral to funeral.' – skrubber, yesterday

  • Kuhn etc. famously misinterpret Popper when claiming his observations don't match how science is done. This is known as naive falsificationism, and it's not what Popper (later) put forward. It's a straw man. – Konrad Rudolph, yesterday

  • It's answers like this that keep me visiting StackExchange sites. – Trilarion, 6 hours ago

















Answer 3 (score 4)













In a sense you are right (see the p-curve), with some small caveats:



  1. You need the test to have some power under the alternative. Illustration of the potential problem: generating a p-value from a uniform distribution on 0 to 1 and rejecting when $p \leq \alpha$ is an (admittedly pretty useless) level-$\alpha$ test for any null hypothesis, but you will get a uniform distribution of p-values whether $H_0$ is true or not (see the sketch after this list).

  2. You can only really show that you are quite close to $H_0$ being true (i.e., under the true parameter values the distribution might be close to uniform even if $H_0$ is false).
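
The following minimal sketch (my addition, under the assumptions of point 1) shows such a powerless test: it never looks at the data, so its p-values stay uniform under both $H_0$ and $H_1$.

import numpy as np

rng = np.random.default_rng(0)

def useless_test(data):
    # Ignores the data entirely: a valid level-alpha test, but with zero power
    return rng.uniform()

# p-values with H0 true (mean 0) and with H0 false (mean 0.5)
p_null = [useless_test(rng.normal(0.0, 1, size=50)) for _ in range(1000)]
p_alt = [useless_test(rng.normal(0.5, 1, size=50)) for _ in range(1000)]
# Both sets are uniform on [0, 1]: a flat p-value histogram alone cannot
# certify H0 unless the test has power against the alternative.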

With realistic applications you tend to get additional issues. These arise mostly because no single person/lab/study group can usually do all the necessary studies. As a result, one tends to look at studies from lots of groups, at which point the concerns increase (if you had done all relevant experiments yourself, at least you'd know): underreporting, selective reporting of significant/surprising findings, p-hacking, multiple testing/multiple-testing corrections, and so on.



































Answer 4 (score -2) – usul (new contributor)













    Null hypothesis (H0): Gravity causes everything in the universe to fall toward Earth's surface.



    Alternate hypothesis (H1): Nothing ever falls.



    Performed 1 million experiments with dozens of household objects; failed to reject H0 with $p < 0.01$ every time. Is H0 true?



















    • Do you think Galileo did one million trials? None of this stuff is necessary in the physical sciences. Establishing the laws of nature by applying the scientific method does not reduce to statistical inference. – Aksakal, yesterday

    • -1 This is scientifically, statistically, and historically inaccurate. The Greeks once believed that it was affinity that drew objects to the Earth. Not bad, but it doesn't explain 3+ body system problems well. Hypotheses should be complementary. Lastly, stating a possibly known bias as H_0 and showing that experiments continue to lead to the same incorrect conclusion doesn't make the conclusion correct. E.g., women earn less than men b/c they are less driven; sample all women's salaries; H_0 is true! – AdamO, yesterday

    • @AdamO that is exactly my point. – usul, yesterday

    • @AdamO, in the Western countries women earn less when they work less, for a variety of reasons including their own choice, disincentives of all kinds, and hostile work environments in some places. When they work the same, they earn about the same, e.g. see medicare nurse salaries where women are the great majority: medscape.com/slideshow/…. They all earn the same $37 when working hourly. Totally off-topic, of course. – Aksakal, yesterday

    • If your null hypothesis is "Gravity causes everything in the universe to fall toward Earth's surface", isn't the alternative hypothesis "There is at least one thing in the universe that does not fall toward the Earth's surface" and not "Nothing ever falls"? – Eff, 15 hours ago
    Your Answer





    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "65"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f376772%2fdoes-a-uniform-distribution-of-many-p-values-give-statistical-evidence-that-h0-i%23new-answer', 'question_page');

    );

    Post as a guest






























    4 Answers
    4






    active

    oldest

    votes








    4 Answers
    4






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    20
    down vote



    accepted










    I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.



    UPDATE



    Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.



    In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.



    The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.



    In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.



    Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]



    enter image description here



    This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.



    Here's Python code:



    import numpy as np
    from scipy import stats
    from matplotlib import pyplot as plt

    def pvs(x):
    pn = x.shape[1]
    pvals = np.zeros(pn)
    for i in range(pn):
    pvals[i] = stats.jarque_bera(x[:,i])[1]
    return pvals

    n = 100
    pn = 100
    mu, sigma = 1, 2
    np.random.seed(0)
    x = np.random.normal(mu, sigma, size=(n,pn))
    x2 = np.random.poisson(15, size=(n,pn))
    print(x[1,1])

    pvals = pvs(x)
    pvals2 = pvs(x2)

    x_f = x.reshape((n*pn,1))
    pvals_f = pvs(x_f)

    x2_f = x2.reshape((n*pn,1))
    pvals2_f = pvs(x2_f)
    print(pvals_f,pvals2_f)

    print(x_f.shape,x_f[:,0])


    #print(pvals)
    plt.figure(figsize=(9,9))
    plt.subplot(3,2,1)
    plt.hist(pvals)
    plt.gca().set_title('True Normal')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,2)
    plt.hist(pvals2)
    plt.gca().set_title('Poisson')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,3)
    plt.hist(x[:,0])
    plt.gca().set_title('a small sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,4)
    plt.hist(x2[:,0])
    plt.gca().set_title('a small Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,5)
    plt.hist(x_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,6)
    plt.hist(x2_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.show()





    share|cite|improve this answer


















    • 2




      @LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
      – Aksakal
      yesterday






    • 1




      @LeanderMoesinger, it's likely to be not small
      – Aksakal
      yesterday






    • 1




      Your answer does not address the question, he didn’t ask about proof but about evidence.
      – Carlos Cinelli
      yesterday







    • 1




      @CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
      – Aksakal
      yesterday







    • 2




      @Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
      – Carlos Cinelli
      yesterday















    up vote
    20
    down vote



    accepted










    I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.



    UPDATE



    Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.



    In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.



    The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.



    In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.



    Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]



    enter image description here



    This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.



    Here's Python code:



    import numpy as np
    from scipy import stats
    from matplotlib import pyplot as plt

    def pvs(x):
    pn = x.shape[1]
    pvals = np.zeros(pn)
    for i in range(pn):
    pvals[i] = stats.jarque_bera(x[:,i])[1]
    return pvals

    n = 100
    pn = 100
    mu, sigma = 1, 2
    np.random.seed(0)
    x = np.random.normal(mu, sigma, size=(n,pn))
    x2 = np.random.poisson(15, size=(n,pn))
    print(x[1,1])

    pvals = pvs(x)
    pvals2 = pvs(x2)

    x_f = x.reshape((n*pn,1))
    pvals_f = pvs(x_f)

    x2_f = x2.reshape((n*pn,1))
    pvals2_f = pvs(x2_f)
    print(pvals_f,pvals2_f)

    print(x_f.shape,x_f[:,0])


    #print(pvals)
    plt.figure(figsize=(9,9))
    plt.subplot(3,2,1)
    plt.hist(pvals)
    plt.gca().set_title('True Normal')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,2)
    plt.hist(pvals2)
    plt.gca().set_title('Poisson')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,3)
    plt.hist(x[:,0])
    plt.gca().set_title('a small sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,4)
    plt.hist(x2[:,0])
    plt.gca().set_title('a small Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,5)
    plt.hist(x_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,6)
    plt.hist(x2_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.show()





    share|cite|improve this answer


















    • 2




      @LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
      – Aksakal
      yesterday






    • 1




      @LeanderMoesinger, it's likely to be not small
      – Aksakal
      yesterday






    • 1




      Your answer does not address the question, he didn’t ask about proof but about evidence.
      – Carlos Cinelli
      yesterday







    • 1




      @CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
      – Aksakal
      yesterday







    • 2




      @Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
      – Carlos Cinelli
      yesterday













    up vote
    20
    down vote



    accepted







    up vote
    20
    down vote



    accepted






    I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.



    UPDATE



    Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.



    In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.



    The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.



    In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.



    Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]



    enter image description here



    This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.



    Here's Python code:



    import numpy as np
    from scipy import stats
    from matplotlib import pyplot as plt

    def pvs(x):
    pn = x.shape[1]
    pvals = np.zeros(pn)
    for i in range(pn):
    pvals[i] = stats.jarque_bera(x[:,i])[1]
    return pvals

    n = 100
    pn = 100
    mu, sigma = 1, 2
    np.random.seed(0)
    x = np.random.normal(mu, sigma, size=(n,pn))
    x2 = np.random.poisson(15, size=(n,pn))
    print(x[1,1])

    pvals = pvs(x)
    pvals2 = pvs(x2)

    x_f = x.reshape((n*pn,1))
    pvals_f = pvs(x_f)

    x2_f = x2.reshape((n*pn,1))
    pvals2_f = pvs(x2_f)
    print(pvals_f,pvals2_f)

    print(x_f.shape,x_f[:,0])


    #print(pvals)
    plt.figure(figsize=(9,9))
    plt.subplot(3,2,1)
    plt.hist(pvals)
    plt.gca().set_title('True Normal')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,2)
    plt.hist(pvals2)
    plt.gca().set_title('Poisson')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,3)
    plt.hist(x[:,0])
    plt.gca().set_title('a small sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,4)
    plt.hist(x2[:,0])
    plt.gca().set_title('a small Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,5)
    plt.hist(x_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,6)
    plt.hist(x2_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.show()





    share|cite|improve this answer














    I like your question, but unfortunately my answer is NO, it doesn't prove $H_0$. The reason is very simple. How would do you know that the distribution of p-values is uniform? You would probably have to run a test for uniformity which will return you its own p-value, and you end up with the same kind of inference question that you were trying to avoid, only one step farther. Instead of looking at p-value of the original $H_0$, now you look at a p-value of another $H'_0$ about the uniformity of distribution of original p-values.



    UPDATE



    Here's the demonstration. I generate 100 samples of 100 observations from Gaussian and Poisson distribution, then obtain 100 p-values for normality test of each sample. So, the premise of the question is that if the p-values are from uniform distribution, then it proves that the null hypothesis is correct, which is a stronger statement than a usual "fails to reject" in statistical inference. The trouble is that "the p-values are from uniform" is a hypothesis itself, which you have to somehow test.



    In the picture (first row) below I'm showing the histograms of p-values from a normality test for the Guassian and Poisson sample, and you can see that it's hard to say whether one is more uniform than the other. That was my main point.



    The second row shows one of the samples from each distribution. The samples are relatively small, so you can't have too many bins indeed. Actually, this particular Gaussian sample doesn't look that much Gaussian at all on the histogram.



    In the third row, I'm showing the combined samples of 10,000 observations for each distribution on a histogram. Here, you can have more bins, and the shapes are more obvious.



    Finally, I run the same normality test and get p-values for the combined samples and it rejects normality for Poisson, while failing to reject for Gaussian. The p-values are: [0.45348631] [0.]



    enter image description here



    This is not a proof, of course, but the demonstration of the idea that you better run the same test on the combined sample, instead of trying to analyze the distribution of p-values from subsamples.



    Here's Python code:



    import numpy as np
    from scipy import stats
    from matplotlib import pyplot as plt

    def pvs(x):
    pn = x.shape[1]
    pvals = np.zeros(pn)
    for i in range(pn):
    pvals[i] = stats.jarque_bera(x[:,i])[1]
    return pvals

    n = 100
    pn = 100
    mu, sigma = 1, 2
    np.random.seed(0)
    x = np.random.normal(mu, sigma, size=(n,pn))
    x2 = np.random.poisson(15, size=(n,pn))
    print(x[1,1])

    pvals = pvs(x)
    pvals2 = pvs(x2)

    x_f = x.reshape((n*pn,1))
    pvals_f = pvs(x_f)

    x2_f = x2.reshape((n*pn,1))
    pvals2_f = pvs(x2_f)
    print(pvals_f,pvals2_f)

    print(x_f.shape,x_f[:,0])


    #print(pvals)
    plt.figure(figsize=(9,9))
    plt.subplot(3,2,1)
    plt.hist(pvals)
    plt.gca().set_title('True Normal')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,2)
    plt.hist(pvals2)
    plt.gca().set_title('Poisson')
    plt.gca().set_ylabel('p-value')

    plt.subplot(3,2,3)
    plt.hist(x[:,0])
    plt.gca().set_title('a small sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,4)
    plt.hist(x2[:,0])
    plt.gca().set_title('a small Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,5)
    plt.hist(x_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.subplot(3,2,6)
    plt.hist(x2_f[:,0],100)
    plt.gca().set_title('Full Sample')
    plt.gca().set_ylabel('x')

    plt.show()






    share|cite|improve this answer














    share|cite|improve this answer



    share|cite|improve this answer








    edited 4 hours ago

























    answered yesterday









    Aksakal

    37.5k447109




    37.5k447109







    • 2




      @LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
      – Aksakal
      yesterday






    • 1




      @LeanderMoesinger, it's likely to be not small
      – Aksakal
      yesterday






    • 1




      Your answer does not address the question, he didn’t ask about proof but about evidence.
      – Carlos Cinelli
      yesterday







    • 1




      @CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
      – Aksakal
      yesterday







    • 2




      @Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
      – Carlos Cinelli
      yesterday













    • 2




      @LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
      – Aksakal
      yesterday






    • 1




      @LeanderMoesinger, it's likely to be not small
      – Aksakal
      yesterday






    • 1




      Your answer does not address the question, he didn’t ask about proof but about evidence.
      – Carlos Cinelli
      yesterday







    • 1




      @CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
      – Aksakal
      yesterday







    • 2




      @Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
      – Carlos Cinelli
      yesterday








    2




    2




    @LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
    – Aksakal
    yesterday




    @LeanderMoesinger you're going to make a stronger point by collecting all your tests into one. Suppose, you have a sample with 100 observations, and get p-value; then get 99 additional samples and end up with 100 p-values. Instead, you could just run one 10,000 observations sample and get on p-value, but it'll be more convincing.
    – Aksakal
    yesterday




    1




    1




    @LeanderMoesinger, it's likely to be not small
    – Aksakal
    yesterday




    @LeanderMoesinger, it's likely to be not small
    – Aksakal
    yesterday




    1




    1




    Your answer does not address the question, he didn’t ask about proof but about evidence.
    – Carlos Cinelli
    yesterday





    Your answer does not address the question, he didn’t ask about proof but about evidence.
    – Carlos Cinelli
    yesterday





    1




    1




    @CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
    – Aksakal
    yesterday





    @CarlosCinelli, he'll have a bunch of p-values, which he would claim are uniform. How is this an evidence unless he proves the values are from uniform? That's what I'm talking about.
    – Aksakal
    yesterday





    2




    2




    @Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
    – Carlos Cinelli
    yesterday





    @Aksakal this is about mathematics, an observed event (like a sequence of p-values) may not constitute evidence of something, but the reason does not logically follow from your argument.
    – Carlos Cinelli
    yesterday













    up vote
    19
    down vote













    Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (eg. typically standard errors decrease as $sqrtn$ increases). Also, your partitioning the data and fitting a model repeatedly can indeed be useful.



    But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"



    That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.



    In more modern math, a finite set of observations cannot logically entail $forall_a in A left[ a in B right]$ if A is not a finite set. Two notable examples as discussed by Magee and Passermore:



    • For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.


    • For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.


    If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.



    An (incomplete) listing of ways forward:




    Karl Popper and falsificationism



    In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.



    Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.



    Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.



    Bayesian, subjective probability



    Let's assume we're interested in a parameter $theta$.



    To the frequentist statistician, parameter $theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's Foundation of Statistics), you can model your own uncertainty over $theta$ using the tools of probability. To the subjective Bayesian, $theta$ is a random variable and you have some prior $P(theta)$. You can then talk about the subjective probability $P(theta mid X)$ of different values of $theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.



    This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in terms of correspondence to reality. A tricky question for any Bayesian interpretation is where do priors come from? Also, what if the model is misspecified?



    George P. Box



    A famous aphorism of George E.P. Box is that "all models are false, but some are useful."



    Newton's law may not be true in the strong sense, but it immensely useful for many problems. Box's view is quite important in the modern big data context where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.



    Additional comments



    There's a world of difference in statistics between estimating a parameter $theta approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.



    Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.



    How far you can go beyond narrow statistical interpretations is a difficult question.






    share|cite|improve this answer






















    • This has been an interesting read and gave some nice things to think about! I wish i could accept multiple answers.
      – Leander Moesinger
      yesterday











    • Quite an explanation. My prof once summarized Kuhn in the spirit of Popper: 'Science progresses from funeral to funeral'
      – skrubber
      yesterday










    • Kuhn etc famously misinterpret Popper when claiming his observations don't match how science is done. This is known as native falsificationism, and it's not what Popper (later) put forward. It's a straw man.
      – Konrad Rudolph
      yesterday






    • 1




      It's answers like this I keep visiting StackExchange sites.
      – Trilarion
      6 hours ago














    up vote
    19
    down vote













    Your series of experiments can be viewed as a single experiment with far more data, and as we know, more data is advantageous (eg. typically standard errors decrease as $sqrtn$ increases). Also, your partitioning the data and fitting a model repeatedly can indeed be useful.



    But you ask, "Is this ... enough evidence to conclude that H0 is true?" Perhaps a rephrasing is, "If I obtain more and more data consistent with $H_0$ being true, can I ever conclude that $H_0$ is true?"



    That question is deeply related to 18th century philosopher David Hume's problem of induction. If all observed instances of A have been B, can we say that the next instance of A will be B? Hume famously said no, that we cannot logically deduce that "all A are B" even from voluminous data.



    In more modern math, a finite set of observations cannot logically entail $forall_a in A left[ a in B right]$ if A is not a finite set. Two notable examples as discussed by Magee and Passermore:



    • For centuries, every swan observed by Europeans was white. Then Europeans discovered Australia and saw black swans.


    • For centuries, Newton's law of gravity agreed with observation and was thought correct. It was overturned though by Einstein's theory of general relativity.


    If Hume's conclusion is correct, proving $H_0$ true in a strict sense is unachievable. That we cannot make statements with certitude though is not equivalent to saying we know nothing at all. Experimental science and statistics have been successful in helping us understand and navigate the world.



    An (incomplete) listing of ways forward:




    Karl Popper and falsificationism



    In Popper's view, no scientific law is ever proven true. We only have scientific laws not yet proven false.



    Popper argued that science proceeds forward by guessing hypotheses and subjecting them to rigorous scrutiny. It proceeds forward through deduction (observation proving theories false), not induction (repeated observation proving theories true). Much of frequentist statistics was constructed consistent with this philosophy.



    Popper's view has been immensely influential, but as Kuhn and others have argued, it does not quite conform to the empirically observed practice of successful science.



    Bayesian, subjective probability



    Let's assume we're interested in a parameter $theta$.



    To the frequentist statistician, parameter $theta$ is a scalar value, a number. If you instead take a subjective Bayesian viewpoint (such as in Leonard Jimmie Savage's Foundation of Statistics), you can model your own uncertainty over $theta$ using the tools of probability. To the subjective Bayesian, $theta$ is a random variable and you have some prior $P(theta)$. You can then talk about the subjective probability $P(theta mid X)$ of different values of $theta$ given the data $X$. How you behave in various situations has some correspondence to these subjective probabilities.



    This is a logical way to model your own subjective beliefs, but it's not a magic way to produce probabilities that are true in terms of correspondence to reality. A tricky question for any Bayesian interpretation is where do priors come from? Also, what if the model is misspecified?



    George P. Box



    A famous aphorism of George E.P. Box is that "all models are false, but some are useful."



    Newton's law may not be true in the strong sense, but it immensely useful for many problems. Box's view is quite important in the modern big data context where studies are so overpowered that you can reject basically any meaningful proposition. Strictly true versus false is a bad question.



    Additional comments



    There's a world of difference in statistics between estimating a parameter $theta approx 0$ with a small standard error versus with a large standard error! Don't walk away thinking that because certitude is impossible, passing rigorous scrutiny is irrelevant.



    Perhaps also of interest, statistically analyzing the results of multiple studies is called meta-analysis.



    How far you can go beyond narrow statistical interpretations is a difficult question.






    share|cite|improve this answer






















    • This has been an interesting read and gave some nice things to think about! I wish i could accept multiple answers.
      – Leander Moesinger
      yesterday











    • Quite an explanation. My prof once summarized Kuhn in the spirit of Popper: 'Science progresses from funeral to funeral'
      – skrubber
      yesterday










    • Kuhn etc famously misinterpret Popper when claiming his observations don't match how science is done. This is known as native falsificationism, and it's not what Popper (later) put forward. It's a straw man.
      – Konrad Rudolph
      yesterday






    • It's answers like this that keep me visiting StackExchange sites.
      – Trilarion
      6 hours ago
    up vote
    4
    down vote













    In a sense you are right (see the p-curve) with some small caveats:



    1. You need the test to have some power under the alternative. Illustration of the potential problem: drawing a p-value from a uniform distribution on $[0, 1]$ and rejecting when $p \leq \alpha$ is an (admittedly pretty useless) level-$\alpha$ test for any null hypothesis, but you will get a uniform distribution of p-values whether $H_0$ is true or not (see the simulation sketch after this list).

    2. You can only really show that you are quite close to $H_0$ being true (i.e., under the true parameter values the distribution of p-values might be close to uniform even if $H_0$ is false).

    With realistic applications, you tend to get additional issues. These mostly arise because no one person/lab/study group can usually do all the necessary studies. As a result, one tends to look at studies from lots of groups, at which point you have increased concerns (if you had done all relevant experiments yourself, at least you'd know) about underreporting, selective reporting of significant/surprising findings, p-hacking, multiple testing/multiple-testing corrections, and so on.
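
    To make caveat 1 concrete, here is a minimal simulation sketch, assuming Python with NumPy/SciPy (the sample size, effect size, and number of tests are arbitrary illustrative choices): a one-sample t-test produces uniform p-values when $H_0$ is true and p-values piled up near 0 when it is false, while a powerless "test" that ignores the data is uniform in both cases.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_obs = 10_000, 30

# Case 1: H0 true (true mean is 0) -- p-values should be uniform on [0, 1].
p_h0 = [stats.ttest_1samp(rng.normal(0.0, 1.0, n_obs), 0.0).pvalue
        for _ in range(n_tests)]

# Case 2: H0 false (true mean is 0.4) -- p-values pile up near 0.
p_h1 = [stats.ttest_1samp(rng.normal(0.4, 1.0, n_obs), 0.0).pvalue
        for _ in range(n_tests)]

# Case 3: a powerless "test" that ignores the data -- uniform even under H1.
p_useless = rng.uniform(0.0, 1.0, n_tests)

for name, p in [("H0 true", p_h0), ("H0 false", p_h1), ("useless test", p_useless)]:
    print(f"{name:>12}: fraction of p <= 0.05 = {np.mean(np.asarray(p) <= 0.05):.3f}")
```

    Under $H_0$ (and for the powerless test) roughly 5% of p-values fall below 0.05, while the powered test under the alternative rejects far more often; a histogram of each set of p-values makes the same point visually.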

    – Björn, answered yesterday

            up vote
            -2
            down vote













            Null hypothesis (H0): Gravity causes everything in the universe to fall toward Earth's surface.



            Alternative hypothesis (H1): Nothing ever falls.



            Perform 1 million experiments with dozens of household objects and fail to reject H0 with $p < 0.01$ every time. Is H0 true?

            – usul (new contributor), answered yesterday

            • Do you think Galileo did one million trials? None of this stuff is necessary in the physical sciences. Establishing the laws of nature by applying the scientific method does not reduce to statistical inference.
              – Aksakal
              yesterday







            • -1 This is scientifically, statistically, and historically inaccurate. The Greeks once believed that it was affinity that drew objects to the Earth. Not bad, but it doesn't explain 3+ body problems well. Hypotheses should be complementary. Lastly, stating a possibly known bias as H_0 and showing that experiments continue to lead to the same incorrect conclusion doesn't make the conclusion correct. E.g., "women earn less than men because they are less driven", sample all women's salaries, H_0 is true!
              – AdamO
              yesterday











            • @AdamO that is exactly my point.
              – usul
              yesterday










            • @AdamO, in Western countries women earn less when they work less, for a variety of reasons including their own choice, disincentives of all kinds, and hostile work environments in some places. When they work the same, they earn about the same; e.g., see Medicare nurse salaries, where women are the great majority: medscape.com/slideshow/…. They all earn the same $37 when working hourly. Totally off-topic, of course.
              – Aksakal
              yesterday







            • If your null hypothesis is "Gravity causes everything in the universe to fall toward Earth's surface", isn't the alternative hypothesis "There is at least one thing in the universe that does not fall toward the Earth's surface", and not "Nothing ever falls"?
              – Eff
              15 hours ago