Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars


Deborah G. Mayo, Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge University Press, 2018, 486pp., $29.99 (pbk), ISBN 9781107664647.

Reviewed by Prasanta S. Bandyopadhyay, Montana State University

2019.05.25


In this book, Deborah G. Mayo (who has the rare distinction of making an impact on some of the most influential statisticians of our time) delves into issues in philosophy of statistics, philosophy of science, and scientific methodology more thoroughly than in her previous writings. Her reconstruction of the history of statistics, seamless weaving of the issues in the foundations of statistics with the development of twentieth-century philosophy of science, and clear presentation that makes the content accessible to a non-specialist audience constitute a remarkable achievement. Mayo has a unique philosophical perspective which she uses in her study of philosophy of science and current statistical practice.[1]

I regard this as one of the most important philosophy of science books written in the last 25 years. However, as Mayo herself says, nobody should be immune to critical assessment. This review is written in that spirit; in it I will analyze some of the shortcomings of the book.

* * * * * * * * *

I will begin with three issues on which Mayo focuses:

  1. Conflict about the foundation of statistical inference: Probabilism or Long-run Performance?
  2. Crisis in science: Which method is adequately general/flexible to be applicable to most problems?
  3. Replication crisis: Is scientific research reproducible?

Mayo holds that these issues are connected. Failure to recognize that connection leads to problems in statistical inference.

Probabilism, as Mayo describes it, is about accepting reasoned belief when certainty is not available. Error-statistics is concerned with understanding and controlling the probability of errors; this is a long-run performance criterion. Mayo is instead concerned with "probativeness" in the analysis of a "particular statistical inference" (p. 14). She draws her inspiration concerning probativeness from severe testing and calls those who follow this kind of philosophy the "severe testers" (p. 9). Severe testing is the central idea of the book.

Severe testers avoid the question in (i) by arguing that the forms of probabilism associated with subjective Bayesianism (Kadane, Schervish, and Seidenfeld 1999) and likelihoodism (Royall 1997) are not necessary, and that long-run performance (Neyman and Pearson 1967) is not sufficient, to provide a foundation for statistical inference. According to severe testers, both probabilist and long-run-performance-inspired accounts make the mistake of evaluating one statistical theory by the perspectives and standards of another. What should be done instead, according to the severe tester, is to take refuge in a meta-standard and evaluate each theory from that meta-theoretical standpoint; philosophy provides the higher ground from which to evaluate the contending statistical theories. In contrast to the foundations offered by both probabilism and long-run performance accounts, severe testers advocate probativism, which counts no claim as warranted unless a fair amount of investigation has been carried out to probe the ways in which that claim could be wrong.

Severe testers think their method is general enough to capture an intuitively appealing requirement on any plausible account of evidence: if a test would have been unable to find flaws in H even if H were incorrect, then the mere agreement of H with data x0 provides poor evidence for H. This, on the severe tester's account, is a minimal requirement on any account of evidence, and it is how they address (ii).

Next consider (iii). According to the severe tester's diagnosis, the replication crisis arises from selective reporting: cherry-picking the data x, hunting for significance where it is absent, multiple testing, and the like. Severe testers think their account alone can handle the replication crisis satisfactorily. That leaves them with the burden of showing that other accounts, such as probabilism and long-run performance, are incapable of handling the crisis, or are inadequate compared to the severe tester's account. One way probabilists (such as subjective Bayesians) seem to block the problematic inferences behind the replication crisis is by assigning a high subjective prior probability to the null hypothesis, resulting in a high posterior probability for it. Severe testers grant that this procedure can block such inferences. However, they insist that it cannot show what researchers did wrong in producing the crisis in the first place. The nub of their criticism is that Bayesians do not provide a convincing resolution of the replication crisis because they do not explain where the researchers made their mistake.

Severe testers like Mayo treat likelihoodists as close relatives of Bayesians, so it is not surprising that she is equally critical of the likelihoodists' treatment of evidence, which is crucial to resolving the replication crisis. The law of likelihood states that the data support the first hypothesis less than the second when Pr(D|H1) < Pr(D|H2). The likelihoodist's goal is to compare the strength of evidence for two models given the data. According to the likelihoodist, a false sense of the p-value's objectivity has generated the replication crisis, since the crisis is bound up with the manipulation of p-values. For a given statistical model, the p-value is the probability, computed under the null hypothesis, that the statistical summary would be greater than or equal to the value actually observed. There is no room for such a maneuver within the likelihood account, which rests on the law of likelihood, in which one hypothesis is compared with another relative to the data; since the p-value plays no role in the likelihood framework, the replication crisis does not arise there in the same way. The severe tester, however, contends that the very idea of the likelihoodist's comparative account of evidence is problematic. Mayo, citing George Barnard, argues that for a likelihoodist, "there always is such a rival hypothesis viz., things just had to turn out the way they actually did" (Barnard 1972, 129). It is correct that even in the presence of two competing models there is the possibility of a saturated model, with as many parameters as there are data points. In this sense, Barnard's claim is correct. However, this holds for any statistical paradigm. So it is unclear whether Barnard's comment exposes a shortcoming peculiar to likelihoodism.
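To make the two quantities just mentioned concrete, here is a minimal Python sketch of my own (not an example from the book), using hypothetical binomial data: it computes a likelihood ratio for two point hypotheses about a success probability, and a one-sided p-value under the null.

```python
from scipy import stats

# Hypothetical data: 62 successes in 100 trials.
n, k = 100, 62

# Two simple (point) hypotheses about the success probability.
p_H1, p_H2 = 0.5, 0.6

# Law of likelihood: the data support H2 over H1 iff Pr(D|H2) > Pr(D|H1).
lik_H1 = stats.binom.pmf(k, n, p_H1)
lik_H2 = stats.binom.pmf(k, n, p_H2)
print(f"Likelihood ratio Pr(D|H2)/Pr(D|H1) = {lik_H2 / lik_H1:.2f}")

# p-value: probability, under the null H1 (p = 0.5), of a test statistic
# at least as extreme as the one observed (one-sided).
p_value = stats.binom.sf(k - 1, n, p_H1)   # Pr(X >= k | p = 0.5)
print(f"One-sided p-value under H1 = {p_value:.4f}")
```

The likelihood ratio compares the two hypotheses directly and never mentions a tail area, which is exactly why the likelihoodist sees no role for the p-value.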

However, it is worth noting that the likelihood ratio, LR (the ratio of two likelihoods), is an inadequate measure of evidence when there are estimated parameters: the more parameters are estimated, the more biased the LR becomes. This is the essence of Hirotugu Akaike's contribution (Akaike 1973, Forster and Sober 1994).
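A small simulation of my own (a hedged illustration, not Akaike's derivation) makes the point vivid: when parameters are fit by maximum likelihood, a model with extra parameters always attains at least as high a maximized likelihood, even when the extra structure is spurious, and Akaike's criterion penalizes the parameter count to correct for this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the true regression function is a straight line.
n = 50
x = np.linspace(0, 1, n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

def gaussian_loglik_and_aic(degree):
    """Maximized Gaussian log-likelihood and AIC for a polynomial fit of given degree."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    sigma2_hat = np.mean(resid ** 2)                    # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
    k = degree + 2                                      # polynomial coefficients plus variance
    return loglik, 2 * k - 2 * loglik

for d in (1, 5):
    ll, aic = gaussian_loglik_and_aic(d)
    print(f"degree {d}: max log-likelihood = {ll:.2f}, AIC = {aic:.2f}")

# The degree-5 fit always has the higher maximized likelihood (so a raw
# likelihood ratio favors it), but AIC's penalty typically favors the
# simpler model that matches the data-generating process.
```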

The problem with the long-run performance-based frequency approach, according to Mayo, is that it is easy to support a false hypothesis with these methods through selective reporting. The severe tester thinks both Fisher's and Neyman and Pearson's methods leave the door open to cherry-picking, significance seeking, and multiple testing, thus generating the possibility of a replication crisis. Both methods leave room for supporting a preferred claim even when it is not warranted by the evidence. This leads severe testers like Mayo to abandon long-run performance as a sufficient condition for statistical inference; for them it is merely a necessary condition.

The severe test is central to Mayo's account, where an adequately stringent notion of a test rests on combining the weak and strong severity principles. The weak severity principle contributes two key features to error-statistics. First, it tells us that for a severely tested hypothesis, the probability that the test procedure would have passed the hypothesis were it false is low. Second, it tells us that the probability that the data will agree with the alternative is very low. The positive part of the severity principle, in turn, says that the data provide good evidence for the hypothesis when the hypothesis passes a severe test with the data in question.
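Readers sometimes find it helpful to see the requirement schematically. The following is my own gloss, not a quotation from the book; T names the test, x0 the data, and H the hypothesis under scrutiny.

```latex
% Schematic gloss on the severity requirement (reviewer's rendering):
% data $x_0$ provide good evidence for $H$ only if
\begin{itemize}
  \item[(S-1)] $x_0$ accords with $H$ (test $T$ passes $H$ with $x_0$); and
  \item[(S-2)] with high probability, $T$ would have yielded a result according
        less well with $H$ than $x_0$ does, were $H$ false; that is,
        $\Pr(\text{$T$ fits $H$ as well as $x_0$ does} \mid H \text{ is false})$ is low.
\end{itemize}
```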

There are at least five interrelated strategies severe testers can implement to prevent any future replication crisis. The first is to control error probabilities and apply procedures consistent with the severe tester's philosophy. Mayo writes, "erroneous interpretations of data are often called its error probabilities" (p. 9). One core idea of this strategy, as characterized by Birnbaum (1970), is to develop "techniques for systematically appraising and bounding the probabilities . . . of seriously misleading interpretation of data [which helps bound error probabilities]" (p. 55). Strictly speaking, the error-statistical tool kits, which buttress the severe tester's account to some extent, also include confidence intervals, randomization, significance testing, and their ilk. The second way to avoid the replication problem is for the statistical test to satisfy the minimal requirement on an account of evidence stated earlier. The third is to fix the p-value while increasing the sample size n, with n large; this tightens the cut-off, so that the observed data x will lie closer to the null hypothesis than to the alternatives. The fourth is to disallow optional stopping rules (e.g., flipping a coin until you obtain 20 heads) in favor of a plan for stopping the experiment that is fixed before the investigator sees the data; a simulation illustrating the worry appears below. The fifth and final way to avoid the replication problem is to acknowledge that long-run frequency-based performance is only necessary, not sufficient, for statistical inference. These five strategies are central to the severe tester's account encompassing the severity principle. The severe tester's claim is that applying them should minimize the chance of another replication crisis.
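The worry about optional stopping can be illustrated with a short simulation of my own (a hedged sketch, not an example from the book): if an experimenter tests after every new batch of observations and stops as soon as p < 0.05, the actual probability of rejecting a true null climbs well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rejects_true_null(max_batches=20, batch_size=10, alpha=0.05, peeking=True):
    """Simulate one experiment under a true null (mean 0) and report whether it rejects."""
    data = np.empty(0)
    for _ in range(max_batches):
        data = np.append(data, rng.normal(0.0, 1.0, batch_size))
        _, p = stats.ttest_1samp(data, 0.0)
        if peeking and p < alpha:
            return True          # stop early and declare significance
    return p < alpha             # fixed-n analysis: only the final test counts

n_sims = 2000
peek_rate = np.mean([rejects_true_null(peeking=True) for _ in range(n_sims)])
fixed_rate = np.mean([rejects_true_null(peeking=False) for _ in range(n_sims)])
print(f"Type I error with optional stopping: {peek_rate:.3f}")
print(f"Type I error with pre-specified n:   {fixed_rate:.3f}")
```

The peeking procedure rejects a true null far more often than the pre-planned analysis, which is exactly the error-probability inflation the severe tester wants to control.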

* * * * * * * * *

The book consists of six Excursions (Parts) and sixteen Tours (Chapters), and includes visits to "three leading museums of statistical sciences" of great eminence. I take the three major accounts on display to be classical statistics, subjective Bayesianism, and severe testing. Excursion 1, "How to Tell What's True about Statistical Inference," consists of two chapters: "Beyond Probabilism and Performance" and "Error Probing Tools versus Logics of Evidence." In chapter 1, Mayo argues that we need to go beyond the foundational debate between subjective Bayesians and likelihoodists on the one hand and frequentists on the other, because both statistical paradigms have shortcomings. Surprisingly, she assumes likelihoodists are not frequentists. Chapter 2 is devoted to the law of likelihood (LL) and to why, from the severe tester's perspective, the likelihood principle (LP) needs to be abandoned. The LP says that the likelihood function captures all of the evidential information in the data bearing on a model.

Excursion 2, "Taboos of Induction and Falsification," consists of chapter 3, "Induction and Confirmation," and chapter 4, "Falsification, Pseudoscience, Induction." Excursion 3, "Statistical Tests and Scientific Inference," consists of chapter 5, "Ingenious and Severe Tests," chapter 6, "It's the Methods, Stupid," and chapter 7, "Capability and Severity: Deeper Concepts." In those chapters, Mayo discusses the 1919 eclipse test and its connection to severe tests, error-probabilities, p-values, and Neyman's performance philosophy.

Excursion 4, "Objectivity and Auditing," consists of: Chapter 8, "The Myth of 'the Myth of Objectivity,'" Chapter 9, "Rejection Fallacies: Who's Exaggerating What?," Chapter 10, "Auditing: Biasing Selection Effects and Randomization," and Chapter 11, "More Auditing: Objectivity and Model Checking." Chapter 12 revisits Box's illustrious slogan that "all models are false." Agreeing with Box that models are generally false, since they involve idealizations, the severe tester nonetheless holds that we are still able to "find out true things about the world or correctly solve problems" with the help of false models (p. 297). Excursion 5 opens with Chapter 13, "Power: Pre-data and Post-data," which introduces the pre-data/post-data distinction in testing a model and contends that, even though severity has some connection with the power of a test, it is different. Chapter 14 goes into much more detail about power and shows how power can be useful in making inferences. Chapter 15 deconstructs the debate between Neyman's performance philosophy and Fisher's fiducial probability. Excursion 6 consists of two chapters, 16 and 17, in which the severe tester explores how to repair the foundations of statistical inference fractured by the division between the subjective Bayesian emphasis on probabilism and the frequentist's performance-driven statistical paradigm. At the end of chapter 17, the book concludes with seven misconceptions about statistical inference, broadly construed, and leaves the reader a homework assignment: to investigate how a severe tester would address each.

* * * * * * * * *

A continuing feature of the severe tester's account is that it wants to build up the testing of a global theory from piecemeal testing (Mayo 1996, p. 190). This book maintains that emphasis. Mayo writes, "our interest in science is often just the particular statistical inference" (p. 14). Contrasting her approach with the logical empiricists' tradition of "rational reconstruction," she writes that she is interested in "moving from the bottom up, as it were" (p. 85). However, I argue that this bottom-up account, when tied to making an overall inference, falls prey to the charge of violating the probability conjunction rule. The general statement of the rule is that the probability of an entailed proposition (e.g., just S1) cannot be lower than the probability of the entailing proposition (e.g., S1 & S2).
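In symbols (a standard statement of the rule, not a quotation from the book):

```latex
% Probability conjunction rule: an entailed proposition is at least as probable
% as the proposition that entails it.
\Pr(S_1 \wedge S_2) \le \Pr(S_1),
\qquad\text{and more generally}\qquad
\Pr\Bigl(\bigwedge_{i=1}^{n} S_i\Bigr) \le \min_{i} \Pr(S_i).
```

So a global theory, understood as the conjunction of its local hypotheses, can be no more probable than any one of them; this is the "drag-down" effect discussed next.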

Popper and the severe tester share the idea that theories need to be subjected to severe testing for the growth of objective knowledge. However, where Popper goes bigger and tests a global theory, the error-statistician goes smaller and builds up the severe testing of a global theory from the severe testing of its local hypotheses (Mayo 1996, Mayo and Spanos 2010). But this is not possible (Bandyopadhyay, Brittan, and Taper 2016). In the book under review, the severe tester recognizes the problem that stems from severely testing a global theory, which is baptized the "drag-down" effect of a mother theory's having many local hypotheses (p. 15). Mayo does not discuss a theory as such, but rather "an overall inference." It is hard to know what the difference would be between making a reliable inference to a global theory and just making an overall inference; for consistency, I will stick to her usage, i.e., the overall inference. To address this shortcoming, the severe tester proposes the idea of the "lift-off" of a global theory based on an argument from coincidence (pp. 15-16). Instead of making a reliable overall inference depend on the severe testing of its local hypotheses, thereby forcing the test to be probability-based in order to quantify the uncertainty of an inference, the severe tester insists on a qualitative test of the overall inference, in which different independent tests and data converge on a severity assessment of that inference. The severe tester's rejoinder is that, in this way, the overall inference becomes more reliable and precise than its piecemeal parts. Thus the severity assessment of the overall inference seems to be maintained while escaping the drag-down effect due to the passing of several local hypotheses. But under which condition(s) should a specific meta-theoretical lift-off (e.g., a qualitative convergent agreement of several data pointing to the reliability of an overall inference) be prized over the drag-down effect of the several local hypotheses that contribute to the severity assessment of that overall inference? The severe tester does not tell us.

This leads me to the severe tester's rejection of both probabilism and frequency-based long-run performance, especially the latter. It is understandable why Mayo finds fault with probabilists, since severe testers are no friends of Bayesians, who take probability theory to be the only logic of uncertainty; the position is consistent with the severe tester's account proposed in Mayo's last two influential books (1996 and 2010). What is surprising is that her account rejects the long-run performance view as sufficient and takes frequency-based probability to be merely necessary for statistical inference. The interpretation of probability, according to severe testers, is not important; what is crucial is how probability ought to be used in inference (p. 13). The startling feature of this claim is that the book is poised to provide a meta-theoretical higher ground, borrowed from philosophy, where the interpretation of probability should be crucial. Yet severe testers like Mayo disregard the interpretation of probability in providing a philosophical underpinning for the approach. Presumably, their justification is that in the practice of science, where the severe tester's account plays its key roles, the interpretation of probability is of secondary importance. This response leaves us with a residual problem, as the book deliberately overlooks the significance of the interpretation of probability, which is clearly dear to the hearts of both philosophically-minded statisticians and scientifically-minded philosophers. To remind ourselves just how important the interpretation of probability is in statistical inference, consider Kyburg and Teng's statement:

the house of statistics is divided according to the interpretation of probability that is adopted. We need to settle on an interpretation of probability in order to settle on a global approach to statistical inference. . . . Given the approach to probability that we settle on, we shall see that each of these approaches to statistics can find its useful place in our inferential armament. (Henry Kyburg, Jr. and Choh Man Teng 2001, p. 196; emphasis added)

Mayo (1996, pp. 121-22) tries to capture the intuitions of the improbability of a phenomenon's occurring if the alternative is true and the high probability of that phenomenon's occurring under the correct hypothesis. She insists that it is not the frequency of an occurrence that plays the crucial role in an error-statistical inference. Consider Hacking's 1983 example. We see the same type of structure in a cell or bacterium using two different types of microscope, an electron microscope and an optical microscope. Hacking's argument is that what we see through the two types of microscope must be real, not an artifact of the instruments: it would be very unlikely that we were observing two different types of thing through two types of microscope supported by two independent branches of physics. Mayo wants to capture this intuition of observing the same type of organism or substance under two types of microscope. In her parlance, we can infer that we very likely see the same thing; otherwise it would be too improbable that such a phenomenon could occur in this scenario. She argues that we have no reason to fall back on how many times occurrences of relevantly similar types took place in the past; knowing their frequency is irrelevant to making a reliable error-statistical inference about a specific type of cell or bacterium. Here her account of severity becomes important: passing a test severely with such and such an outcome provides good reason for the correctness of the hypothesis. Mayo is leery of applying the frequency of a class to find the probability of a single event. When people, including myself, criticize her for ignoring frequency-based prior probabilities in making inferences, she responds that we commit the "fallacy of instantiating probabilities."

In this book, the severe tester contends that long-run frequency-based performance is only necessary. But in the above example, if it is irrelevant for a severe tester to know how many times occurrences of relevantly similar types of events have taken place, then how could a long-run-frequency-based requirement even be necessary? No rationale is provided in the book. If the response is that the severe tester's view on the frequency-based interpretation of probability has changed over the years, that would be a legitimate response. It nonetheless makes the reader wonder whether an account of severity can still explain why the same microbes are observed through both types of microscope. If long-run frequency-based performance is required as a necessary condition on any kind of inference, her account of severity no longer licenses our inference that we observe the same microbes via the two types of process.

Philosophers tend to distinguish different questions so that one is not confounded with another. Two such questions are "the belief question" (what I prefer to call "the confirmation question") and "the evidence question," a distinction based on an insight first suggested by Royall. It goes like this. Suppose Mary's pap smear comes back positive, and let "H" stand for the hypothesis that Mary is suffering from cervical cancer. Given the pap-smear result, one could ask the confirmation question: "What should an agent believe about H, and to what degree?" This contrasts with the evidence question: "What do the data say about the evidence for H against its alternative?" The two questions are statistical-paradigm independent: whether one is a Bayesian or not, one can admit the distinction. One needs a specific probabilistic measure to address each of them separately. According to Royall, Bayesians address the confirmation question, whereas likelihoodists address the evidence question. Severe testers, however, choose to overlook this key epistemological difference. They instead collapse the distinction and lump together likelihoodists and those who work on confirmation theory under the heading "Probabilism" (p. 82). Bayesian credal probability and Royallian likelihoodism share some structural similarities, but they are miles apart philosophically. Hence it is an oversight to put them in the same basket.
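A small worked example, with hypothetical screening numbers of my own (not figures from the book or from Royall), shows how the two questions can come apart in a case like Mary's.

```python
# Hypothetical numbers for a screening test (illustration only):
prior = 0.01          # Pr(H): prevalence of the disease in the relevant population
sensitivity = 0.80    # Pr(+ | H)
false_pos = 0.05      # Pr(+ | not-H)

# Evidence question (likelihoodist): how strongly does a positive result
# favor H over not-H?  Answer: the likelihood ratio.
lr = sensitivity / false_pos
print(f"Likelihood ratio Pr(+|H)/Pr(+|~H) = {lr:.1f}")

# Confirmation question (Bayesian): what should the agent's degree of
# belief in H be, given the positive result?  Answer: the posterior.
posterior = sensitivity * prior / (sensitivity * prior + false_pos * (1 - prior))
print(f"Posterior Pr(H|+) = {posterior:.3f}")

# With these numbers the evidence favors H by a factor of 16, yet the
# posterior probability of H stays below 0.14: the two questions get
# different answers from different probabilistic measures.
```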

Overlooking varieties of the likelihood account is tied to the same failure to draw a distinction where one is needed. The severe tester's motivation for lumping together Bayesians and likelihoodists like Royall is that both respect the likelihood principle, whereas error-statisticians reject it. Interestingly, evidentialism, a generalized version of the likelihood account, also rejects the likelihood principle (Taper and Lele 2011), a point the current book neither discusses nor mentions.

I have criticized the severe tester's account on four grounds: (i) it provides no rationale for ignoring the probability conjunction rule, and then tries to skirt the problem by recommending a lift-off via a convergent argument, without any meta-theoretical justification that this lift-off is warranted; (ii) it provides no compelling reason for deliberately disregarding the interpretation of probability, even though that interpretation is deeply philosophical and Mayo wants to provide a philosophical grounding for her arguments; (iii) it revises, in the current book, a key argument concerning severity proposed in an earlier work (1996), so that long-run frequency becomes necessary for statistical inference, without offering any good reason for this major shift in position; and (iv) it fails to respect the confirmation-evidence distinction, lumping Bayesians and likelihoodists together under one banner when each is clearly different from the other.

* * * * * * * * *

Philosophy is a deeply critical discipline in which hundreds of schools of thought contend without always yielding hundreds of flowers. One hopes this situation is encouraging rather than disturbing. Mayo's book has several merits. It brings a great deal of current statistical research to the fore and subjects it to philosophical analysis. It represents a continuing, cutting-edge dialogue between philosophers and statisticians from different backgrounds, one that enriches both philosophy and statistics. Clearly, Mayo is at the forefront of this movement. The critical comments I have made in this review indicate the seriousness with which I take her position; they should not be taken to undermine either the compelling nature or the importance of her book. The book both clearly states and defends the severe tester's position she passionately advocates, and it has advanced the debate between several statistical paradigms in a fascinating way. It deserves a wide audience among those interested in the philosophy of statistics and the testing of scientific hypotheses. I congratulate Mayo; her book is a must-read for anyone interested in ongoing research on the philosophy of statistics.

ACKNOWLEDGEMENTS

I would like to thank Kevin Beiser, Gordon Brittan, Jr., Andrew Gelman, Mark Greenwood, Nolan Grunska, Deborah Mayo, Adam Morton, and Bradley Snow. I am especially indebted to Mark L. Taper for numerous conversations. I am also thankful to the late Gary Gutting who invited me to write this review.

REFERENCES

Akaike, H. (1973). "Information Theory as an Extension of the Maximum Likelihood Principle." In B. N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 267-281.

Bandyopadhyay, P. S. (2019). Bayes Matters: Science, Objectivity, and Inference. (Book-length manuscript.)

Bandyopadhyay, P. S., and Brittan, G., Jr. (2006). "Acceptance, Evidence, and Severity." Synthese 148, 259-293.

Bandyopadhyay, P. S., Brittan, G., Jr., and Taper, M. L. (2016). Belief, Evidence, and Uncertainty: Problems of Epistemic Inference. Springer-Verlag.

Bandyopadhyay, P. S., and Forster, M. R. (eds.) (2011). Handbook of Philosophy of Statistics. Elsevier.

Cox, D., and Mayo, D. (2010). "Objectivity and Conditionality in Frequentist Inference." In Mayo and Spanos (2010), 276-304.

Forster, M., and Sober, E. (1994). "How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions." British Journal for the Philosophy of Science 45, 1-35.

Gelman, A., and Shalizi, C. R. (2013). "Philosophy and the Practice of Bayesian Statistics." British Journal of Mathematical and Statistical Psychology 66, 8-38.

Kadane, J., Schervish, M., and Seidenfeld, T. (1999). Rethinking the Foundations of Statistics. Cambridge University Press.

Kyburg, H., Jr., and Teng, C. M. (2001). Uncertain Inference. Cambridge University Press.

Mayo, D. (1996). Error and the Growth of Experimental Knowledge. University of Chicago Press.

Mayo, D., and Spanos, A. (2010). Error and Inference. Cambridge University Press.

Neyman, J., and Pearson, E. (1967). Joint Statistical Papers of J. Neyman and E. S. Pearson. University of California Press.

Royall, R. (1997). Statistical Evidence. Chapman and Hall, New York.

Taper, M. L., and Lele, S. (2011). "Evidence, Evidence Functions, and Error Probabilities." In Bandyopadhyay and Forster (eds.) (2011), 513-532.


[1] I owe this point to Andrew Gelman in an email communication.