Interpreting P values

A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis.

Although P values are convenient and popular summaries of experimental results, we can be led astray if we consider them as our only metric (ref. 1). Even in the ideal case of a rigorously designed randomized study fit to a predetermined model, P values still need to be supplemented with other information to avoid misinterpretation.

A P value is a probability statement about the observed sample in the context of a hypothesis, not about the hypotheses being tested. For example, suppose we wish to know whether disease affects the level of a biomarker. The P value of a comparison of the mean biomarker levels in healthy versus diseased samples would be the probability that a difference in means at least as large as the one observed can be generated from random samples if the disease does not affect the mean biomarker level. It is not the probability that the mean biomarker levels in the two populations are equal; they either are or are not equal.

This relationship between P values and inference about hypotheses is a critical point: the interpretation of statistical analysis depends on it. It is one of the key themes in the American Statistical Association's statement on statistical significance and P values (ref. 2), published to mitigate widespread misuse and misinterpretation of P values. The relationship is discussed in several of the 18 short commentaries that accompany the statement, from which three main ideas for using, interpreting and reporting P values emerge: the use of more stringent P value cutoffs supported by Bayesian analysis, the use of the observed P value to estimate the false discovery rate (FDR), and the combination of P values and effect sizes to create more informative confidence intervals. The first two of these ideas are currently most useful as guidelines for assessing how strongly the data support the null versus the alternative hypothesis, whereas the third can be used to assess how strongly the data support parameter values in the confidence interval. Like P values, however, these methods will be biased toward the alternative hypothesis when used with a P value selected from the most significant of multiple tests or models (ref. 1).

To illustrate these three ideas, let's expand on the biomarker example above with the null hypothesis that disease does not influence the biomarker level. For samples, we'll use n = 10 individuals randomly chosen from each of the unaffected and affected populations, both assumed to be normally distributed with σ² = 1. At this sample size, a two-sample t-test has 80% power to reject the null at significance level α = 0.05 when the effect size is 1.32 (Fig. 1a). Suppose that we observe a difference in sample means of 1.2 and that our samples have a pooled s.d. of sp = 1.1. These give us t = 1.2/(sp√(2/n)) = 2.44 with d.f. = 2(n − 1) = 18 and a P value of 0.025.
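As a sanity check, these numbers can be reproduced in a few lines of Python; the following is a minimal sketch assuming scipy, with the sample values taken from the text:

```python
import numpy as np
from scipy import stats

n = 10        # individuals per group
diff = 1.2    # observed difference in sample means
sp = 1.1      # pooled standard deviation

# t statistic and two-sided P value for the two-sample t-test
se = sp * np.sqrt(2 / n)          # standard error of the difference
t = diff / se                     # 2.44
df = 2 * (n - 1)                  # 18
p = 2 * stats.t.sf(abs(t), df)    # ~0.025
print(f"t = {t:.2f}, d.f. = {df}, P = {p:.3f}")

# Power at alpha = 0.05 for effect size d = 1.32: under the alternative, the
# statistic follows a noncentral t distribution with noncentrality d * sqrt(n/2)
alpha, d = 0.05, 1.32
nc = d * np.sqrt(n / 2)
tcrit = stats.t.ppf(1 - alpha / 2, df)
power = stats.nct.sf(tcrit, df, nc) + stats.nct.cdf(-tcrit, df, nc)
print(f"power = {power:.2f}")     # ~0.80
```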

Figure 1: (a) Power drops at more stringent P value cutoffs α. The curve is based on a two-sample t-test with n = 10 and an effect size of 1.32. (b) The Benjamin and Berger bound calibrates the P value to probability statements about the hypothesis. At P = 0.05, the bound suggests that our alternative hypothesis is at most 2.5 times more likely than the null (black dashed line). Also shown are the conventional Bayesian cutoff B̄ = 20 (blue dashed line; P = 0.0032) and B̄ = 14 (orange dashed line; P = 0.005), suggested by Johnson in ref. 2. (c) Use of the more stringent Benjamin and Berger bounds in b reduces the power of the test, because testing is now performed at α < 0.05. For B̄ = 14 (orange dashed line; α = 0.005), the power is only 43%. The blue and orange dashed lines show the same bounds as in b. In all panels, black dotted lines help the reader locate values discussed in the text.


Once a P value has been computed, it is useful to assess the strength of evidence for the truth or falsehood of the null hypothesis. Here we can look to Bayesian analysis for ways to make this connection (ref. 3): decisions about statistical significance can be based on the Bayes factor, B, which is the ratio of the average likelihoods under the alternative and null hypotheses. However, Bayesian analysis adds an element of subjectivity because it requires the specification of a prior distribution for the model parameters under both hypotheses.

Benjamin and Berger, in their discussion in ref. 2, note that the P value can be used to compute an upper bound, B̄, for the Bayes factor. The bound does not require the specification of a prior and holds for many reasonable choices of priors. For example, B̄ = 10 means that the alternative hypothesis is at most ten times more likely to be true than the null.

Because it quantifies the extent to which the alternative hypothesis is more likely, the Bayes factor can be used for significance testing. Decision boundaries for the Bayes factor are less prescriptive than those for P values, with descriptors such as “anecdotal,” “substantial,” “strong” and “decisive” often used for cutoff values. The exact terms and corresponding values vary across the literature, and their interpretation requires active consideration on the part of the researcher (ref. 4). A Bayes factor of 20 or more is generally considered to be strong evidence for the alternative hypothesis.

The Benjamin and Berger bound is given by B̄ = −1/(e P ln(P)) for a given P value (ref. 5) (Fig. 1b). For example, when we reject the null at P = α = 0.05, the alternative hypothesis is at most B̄ = 2.5 times more likely than the null! This significance boundary is considered by many Bayesians to represent extremely weak to nonexistent evidence against the null hypothesis.
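The bound is a one-line function. Here is a minimal sketch (plain Python; the function name is ours) evaluating it at the cutoffs discussed in the text:

```python
import numpy as np

def bayes_factor_bound(p):
    """Benjamin and Berger upper bound on the Bayes factor: -1 / (e * p * ln p)."""
    return -1.0 / (np.e * p * np.log(p))

print(round(bayes_factor_bound(0.05), 1))    # 2.5 : rejection at the usual cutoff
print(round(bayes_factor_bound(0.025), 1))   # ~3.99: our biomarker example
print(round(bayes_factor_bound(0.005), 1))   # ~14 : Johnson's suggested cutoff
print(round(bayes_factor_bound(0.0032), 1))  # ~20 : conventionally 'strong' evidence
```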

For our biomarker example, we found P = 0.025 and thus conclude that the alternative hypothesis that disease affects the biomarker level is at most B̄ = 3.9 times more likely than the null. If we insist on B̄ > 20, which corresponds to 'strong' evidence for the alternative, we need P < 0.0032 (Fig. 1b). Johnson, in a discussion in ref. 2, suggests testing at P < α = 0.005 (B̄ > 14) for statistical significance (Fig. 1b). Notice that the computation of B̄ does not use the power of the test. If we compute power using the same effect size of 1.32 but reject the null at α = 0.005 (B̄ > 14), the power is only 43% (Fig. 1c). To achieve 80% power at this cutoff, we would need a sample size of n = 18.
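One way to reproduce these power numbers is the power calculator in the statsmodels package; this is a sketch under the assumptions above (effect size 1.32, two-sided test):

```python
from statsmodels.stats.power import TTestIndPower

ttest_power = TTestIndPower()

# Power of the n = 10 per group test at the stricter cutoff alpha = 0.005
print(ttest_power.power(effect_size=1.32, nobs1=10, alpha=0.005))   # ~0.43

# Sample size per group needed to restore 80% power at alpha = 0.005
print(ttest_power.solve_power(effect_size=1.32, alpha=0.005,
                              power=0.8))                           # ~17.8 -> n = 18
```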

Altman (this author), in a discussion in ref. 2, proposes supplementing P values with an estimate of the FDR, using plug-in values to account for both the power of the test and the prior evidence in favor of the null hypothesis. In high-throughput multiple-testing problems, the FDR is the expected proportion of rejected null hypotheses that are false rejections. If a proportion π0 of the tests are truly null and we reject at P < α, we expect a fraction απ0 of all tests to be false rejections. Given that a proportion 1 − π0 of the tests are non-null, with power β we expect a fraction β(1 − π0) of all tests to be true rejections. A reasonable estimate of the FDR is therefore the ratio of expected false rejections to all expected rejections, eFDR = απ0/(απ0 + β(1 − π0)).

For low-throughput testing, Altman uses the heuristic that π0 is the probability that the null hypothesis is true based on prior evidence. She suggests using π0 = 0.5 or 0.75 for the primary or secondary hypotheses of a research proposal, respectively, and π0 = 0.95 for hypotheses formulated after exploration of the data (post hoc tests) (Fig. 2a). In the high-throughput scenario, π0 can be estimated from the data, but for low-throughput experiments Altman uses the Bayesian argument that π0 should be based on the prior odds that the investigator would be willing to put on the truth of the null hypothesis. She then replaces α with the observed P value, and β with the planned power of the study.
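As a minimal sketch (plain Python; the helper name efdr is ours), plugging the observed P value in for α and using Altman's suggested π0 values:

```python
def efdr(alpha, beta, pi0):
    """Estimated FDR: expected false rejections over all expected rejections."""
    return alpha * pi0 / (alpha * pi0 + beta * (1 - pi0))

p_obs, power = 0.025, 0.8   # observed P value (in place of alpha) and planned power
for label, pi0 in [("primary", 0.5), ("secondary", 0.75), ("post hoc", 0.95)]:
    print(f"{label}: eFDR = {efdr(p_obs, power, pi0):.2f}")
# primary: 0.03, secondary: 0.09, post hoc: 0.37
```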

Figure 2: (a) The relationship between the estimated FDR (eFDR) and the proportion of tests expected to be null, π0, when testing at α = 0.05. Dashed lines indicate Altman's proposals (ref. 2) for π0. (b) The profile of P values for our biomarker example (n = 10, sp = 1.1). The dashed line at P = 0.05 cuts the curve at the boundaries of the 95% confidence interval (0.17, 2.23), shown as an error bar. (c) P value percentiles (shown by contour lines) and the 95% range (gray shading) expected from a two-sample t-test as effect size is increased. At each effect size d, data were simulated from 100,000 normally distributed sample pairs (n = 10 per sample) with means 0 and d, respectively, and σ² = 1. The fraction of P values smaller than α is the power of the test; for example, 80% (blue contour) are smaller than 0.05 for d = 1.32 (blue dashed line). When d = 0, P values are uniformly distributed.


For our example, using P = 0.025 and 80% power gives eFDR = 0.03, 0.09 and 0.37 for primary, secondary and post hoc tests, respectively (Fig. 2a). In other words, for a primary hypothesis in our study, we estimate that only 3% of the tests in which we reject the null at this level of P are false discoveries, but if we tested only after exploring the data, we would expect 37% of the discoveries to be false.

Altman's 'rule-of-thumb' values for π0 are arbitrary. A simple way to avoid this arbitrariness is to determine the value of π0 required to achieve a given eFDR. For example, to achieve eFDR = 0.05 for our example with 80% power, we require π0 ≤ 0.62, which amounts to fairly strong prior evidence for the alternative hypothesis. For our biomarker example, this might be reasonable if studies in other labs or biological arguments suggest that this biomarker is associated with disease status, but it is unreasonable if multiple models were fitted or if this is the most significant of multiple biomarkers tested with little biological guidance.
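The required π0 follows from solving the eFDR formula for π0. A sketch with a hypothetical helper (the inversion is ours, by simple algebra), using the same plug-in values:

```python
def pi0_for_efdr(alpha, beta, target_efdr):
    """Largest pi0 for which the estimated FDR stays at or below the target,
    obtained by solving eFDR = alpha*pi0 / (alpha*pi0 + beta*(1 - pi0)) for pi0."""
    return target_efdr * beta / (alpha * (1 - target_efdr) + target_efdr * beta)

print(round(pi0_for_efdr(0.025, 0.8, 0.05), 3))  # ~0.627: the pi0 <= 0.62 requirement
```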

Many investigators and journals advocate supplementing P values with confidence intervals, which provide the range of effect sizes compatible with the observations. Mayo, in a discussion in ref. 2, suggests considering the P value for a range of hypotheses. We demonstrate this approach in Figure 2b, which shows the P value for testing the observed difference against each hypothesized difference in mean biomarker level. The 95% confidence interval, (0.17, 2.23) for this example, is the range of hypothesized differences that are not significantly different at α = 0.05 from the observed difference of 1.2.
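This P value profile is easy to trace numerically; here is a sketch assuming scipy, with profile_p a hypothetical helper and the numbers taken from the example:

```python
import numpy as np
from scipy import stats

n, sp, diff = 10, 1.1, 1.2
se = sp * np.sqrt(2 / n)    # standard error of the difference in means
df = 2 * (n - 1)

def profile_p(delta0):
    """Two-sided P value for testing a hypothesized difference delta0."""
    t = (diff - delta0) / se
    return 2 * stats.t.sf(abs(t), df)

# The 95% CI is the set of delta0 values with profile_p(delta0) >= 0.05
tcrit = stats.t.ppf(0.975, df)
lo, hi = diff - tcrit * se, diff + tcrit * se
print(f"95% CI: ({lo:.2f}, {hi:.2f})")             # ~(0.17, 2.23)
print(f"P at lower boundary: {profile_p(lo):.3f}")  # 0.050 by construction
```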

As a final comment, we stress that P values are random variables: random draws of data will yield a distribution for the P value (ref. 1). When the data are continuous and the null hypothesis is true, the P value is uniformly distributed on (0,1), with a mean of 0.5 and an s.d. of 1/√12 ≈ 0.29 (ref. 1). This means that under the null the P value is highly variable from sample to sample, and this variability is not a function of the sample size or the power of the study. When the alternative hypothesis is true, the variability decreases as the power increases, but the P value is still random. We show this in Figure 2c, in which we simulate 100,000 sample pairs for each mean biomarker level.
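A scaled-down simulation in the spirit of Figure 2c (assuming numpy and scipy; the seed and layout are ours) makes the point:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 10, 100_000

for d in (0.0, 1.32):   # the null, and the effect size giving 80% power
    x = rng.normal(0.0, 1.0, size=(reps, n))
    y = rng.normal(d, 1.0, size=(reps, n))
    p = stats.ttest_ind(x, y, axis=1).pvalue
    print(f"d = {d}: s.d. of P = {p.std():.2f}, "
          f"fraction with P < 0.05 = {(p < 0.05).mean():.2f}")
# d = 0.0 : s.d. ~0.29 and ~5% of P values below 0.05 (uniform distribution)
# d = 1.32: ~80% below 0.05 (the power), but P still varies from sample to sample
```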

P values can provide a useful assessment of whether data observed in an experiment are compatible with a null hypothesis. However, the proper use of P values requires that they be properly computed (with appropriate attention to the sampling design), reported only for analyses for which the analysis pipeline was specified ahead of time, and appropriately adjusted for multiple testing when present. Interpretation of P values can be greatly assisted by accompanying heuristics, such as those based on the Bayes factor or the FDR, which translate the P value into a more intuitive quantity. Finally, variability of the P value from different samples points to the need to bring many sources of evidence to the table before drawing scientific conclusions.
