By: Khalil Zlaoui

# ...

Scientific studies often begin with a hypothesis about the way the world works. The hope is that by analyzing data we can find empirical evidence to uncover the truth, by either supporting or refuting our original hypothesis.

For decades, the scientific community has relied on the *p-value* as the sole indicator of that truth. But a significant *p-value* is not proof of strong evidence, and in fact, it was never intended to be used as such. Its misuse and misinterpretation have led to serious problems, including an interdisciplinary replication crisis that we describe in a previous Said&Dunn post.

Benjamin et al. (2017) recently proposed changing the default *p-value* threshold for claims of new discoveries from 0.05 to 0.005, a proposal met with some controversy. Some journals, like *Basic and Applied Social Psychology*, have gone so far as to ban the use of *p-values*.

Is tightening the *p-value* threshold a good idea? Should more journals ban *p-values* altogether? After wading into the waters around the promises and pitfalls of *p-values*, here are four main conclusions we have drawn.

**1) *P-values* should be interpreted with caution**

In 2016, the American Statistical Association (ASA) argued that while the *p-value* is a useful measure, it has been misused and misinterpreted in weighing evidence. The ASA board therefore released six principles to guard against common misconceptions of *p-values*:

- ***P-values* can indicate how incompatible the data are with a specified statistical model.** As models are constructed under a set of assumptions, a small *p-value* indicates that the data are incompatible with the null hypothesis, as long as these assumptions hold.

- ***P-values* do not measure the probability that the studied hypothesis is true.** A common misuse of *p-values* is to turn them into statements about the truth of the null hypothesis. They also do not indicate the probability that the data were produced by random chance alone.

- **Scientific conclusions and business or policy decisions should not be based only on whether a *p-value* passes a specific threshold.** Conclusions based solely on *p-values* can pose a threat to public health. In addition to model design and estimation, factors to be considered in decision-making include study design and measurement quality.

- **Proper inference requires full reporting and transparency.** Conducting several tests of association in order to identify a significant *p-value* leads to spurious results.

- **A *p-value*, or statistical significance, does not measure the size of an effect or the importance of a result.** A smaller *p-value* is not an indicator of a larger effect.

- **By itself, a *p-value* does not provide a good measure of evidence regarding a model or a hypothesis.** A *p-value* near 0.05 is only weak evidence against the null.
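The warning about running several tests until one is significant can be made concrete with a little arithmetic. The sketch below is an illustration rather than anything from the ASA statement: it assumes the tests are independent and that every null hypothesis is true, and the function name `prob_false_positive` is made up for this example.

```python
def prob_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests of true
    null hypotheses produces p < alpha (assumption: tests are independent)."""
    # Each test stays non-significant with probability (1 - alpha),
    # so all of them stay non-significant with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 20):
    print(f"{k:>2} tests -> {prob_false_positive(k):.2f}")
```

Under these assumptions, 20 tests with no real effect anywhere still give roughly a 64% chance of at least one "significant" result at the 0.05 level.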

**2) There are alternatives to *p-values***

An interesting alternative to *p-values* is the Bayesian approach, a method of statistical inference based on Bayes' theorem that incorporates a subjective “prior” belief about the hypothesis.

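From Bayes' theorem:

$$
\mathrm{POST}_{H_1} = \mathrm{PRIOR}_{H_1} \times \mathrm{BF},
\qquad
\mathrm{BF} = \frac{\text{sampling density of data under } H_1}{\text{sampling density of data under } H_0}
$$

where $\mathrm{POST}_{H_1}$ is the posterior odds in favor of $H_1$ (**the alternative hypothesis**), $\mathrm{PRIOR}_{H_1}$ is the prior odds in favor of $H_1$, and $\mathrm{BF}$ is the Bayes factor.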
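To see how the pieces fit together, here is a minimal sketch that multiplies prior odds by a Bayes factor and converts the resulting posterior odds into a probability. The function name and the numbers plugged in (prior odds of 1:10, a Bayes factor of 3) are hypothetical choices for illustration only.

```python
def posterior_probability(prior_odds: float, bayes_factor: float) -> float:
    """Posterior probability of H1 from prior odds and a Bayes factor."""
    posterior_odds = prior_odds * bayes_factor  # POST = PRIOR x BF
    # Odds of a:b correspond to a probability of a / (a + b).
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical inputs: prior odds of 1:10 for H1, Bayes factor of 3.
print(f"{posterior_probability(1 / 10, 3.0):.2f}")
```

With those inputs the posterior odds are 0.3, which is a posterior probability of about 0.23 for the alternative: the data shifted belief toward H1, but the null remains more likely.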

Interestingly, Bayes Factors can be related to *p-values*. The correspondence between *p-values* in the frequentist world (meaning the statistical inference framework that is most commonly used) and Bayes Factors in the Bayesian world can reshape the debate about *p-values* and help us reconsider how strongly they support rejecting the null.

Under some reasonable assumptions, a *p-value* of 0.05 in the frequentist world corresponds to Bayes Factors in favor of the alternative hypothesis ranging from 2.5 to 3.4. In the Bayesian world, this is considered weak evidence against the null. Based on the correspondence between *p-values* and Bayes Factors, Benjamin et al. (2017) proposed redefining statistical significance at 0.005.
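The post does not spell out which assumptions sit behind these ranges. As a hedged sketch, two well-known upper bounds on the Bayes factor reproduce numbers of this size: the Sellke, Bayarri and Berger bound, 1 / (-e · p · ln p), and a likelihood-ratio bound for a two-sided normal test in which the alternative puts equal weight on an effect in either direction. Treating these as the bounds behind the quoted figures is an assumption, and the function names are invented for this example.

```python
import math
from statistics import NormalDist


def sellke_bound(p: float) -> float:
    """Upper bound 1 / (-e * p * ln p) on the Bayes factor for H1.

    The formula only applies for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("p must lie strictly between 0 and 1/e")
    return -1.0 / (math.e * p * math.log(p))


def two_sided_likelihood_ratio_bound(p: float) -> float:
    """Upper bound 0.5 * exp(z^2 / 2) for a two-sided normal test,
    assuming the alternative splits its weight evenly between +z and -z."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)  # two-sided critical value
    return 0.5 * math.exp(z * z / 2.0)


for p in (0.05, 0.005):
    low, high = sellke_bound(p), two_sided_likelihood_ratio_bound(p)
    print(f"p = {p}: Bayes factor bounds of about {low:.1f} to {high:.1f}")
```

Because both quantities are upper bounds, the evidence actually carried by a given *p-value* can be weaker than these figures suggest.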

**3) Tightening the *p-value* threshold doesn’t fully solve the replication crisis and could lead to other problems**

A two-sided *p-value* of 0.005 corresponds to Bayes Factors in favor of the alternative ranging from 14 to 26, which in Bayesian terms corresponds to substantial to strong evidence. Benjamin et al.’s proposal to change the *p-value* threshold was made to help address the replication crisis, in which too few studies have been able to replicate the findings of the original study. But moving to this more stringent threshold comes at a price: the need for larger samples and possibly unacceptable false negative rates. Is such a trade-off worth it?
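The sample-size cost can be roughed out with the standard normal-approximation power formula, in which the required sample size is proportional to (z for the significance level + z for the power) squared. The sketch below assumes a two-sided z-test, 80% power, and a fixed true effect size; these are illustrative assumptions, and `relative_sample_size` is a name made up for this example.

```python
from statistics import NormalDist


def relative_sample_size(alpha: float, power: float = 0.80) -> float:
    """Quantity proportional to the sample size a two-sided z-test needs
    to detect a fixed effect at level `alpha` with the given power."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_power = nd.inv_cdf(power)              # quantile matching the target power
    return (z_alpha + z_power) ** 2


ratio = relative_sample_size(0.005) / relative_sample_size(0.05)
print(f"Sample size at 0.005 relative to 0.05: {ratio:.2f}x")
```

Under these assumptions, keeping 80% power at the 0.005 threshold takes a sample roughly 70% larger than at 0.05. Holding the sample size fixed instead shows up as a higher false negative rate.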

It’s important to keep in mind that *p-values* are not the only cause of the lack of reproducibility in science. While *p-values* might be an important contributor, there are other real issues affecting replication, including selection effects, trends towards multiple testing, hunting for significance (or p-hacking), and violated statistical assumptions. Tightening the *p-value* threshold from 0.05 to 0.005 will not necessarily address these issues.

**4) But there are alternatives we should all be using**

Some statisticians have argued in favor of estimation (putting emphasis on the parameter to estimate an effect) over testing (putting emphasis on rejecting or accepting a hypothesis based on *p-values*). If the interest lies in testing an effect, researchers could instead rely on confidence, credible, or prediction intervals.
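As a rough illustration of what reporting an estimate looks like in practice, the sketch below computes a difference in means with a 95% confidence interval. It uses a normal approximation (a t-based interval would be somewhat wider for samples this small), and the two groups of measurements are invented purely for the example.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_difference_ci(treated, control, level=0.95):
    """Difference in means with a normal-approximation confidence interval.

    Returns (difference, lower bound, upper bound).
    """
    diff = mean(treated) - mean(control)
    # Standard error of the difference between two independent sample means.
    se = math.sqrt(stdev(treated) ** 2 / len(treated)
                   + stdev(control) ** 2 / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff, diff - z * se, diff + z * se


# Hypothetical measurements, made up for illustration only.
treated = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7]
control = [4.8, 4.6, 5.0, 5.2, 4.7, 4.9]
diff, low, high = mean_difference_ci(treated, control)
print(f"Estimated effect: {diff:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Unlike a bare *p-value*, the interval shows both the size of the estimated effect and how precisely it has been measured.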

So in the end, we think that rather than ditch the *p-value* altogether, we should shift our focus from *p-values* to study design, effect sizes, and confidence intervals, which we hope can help us better understand the evidence supporting our hypotheses and ultimately uncover the truth about the way the world works.