Reliable Reasoning: Induction and Statistical Learning Theory

Reviewed by Kevin Kelly, and Conor Mayo-Wilson, Carnegie Mellon University

Harman and Kulkarni's Reliable Reasoning is a welcome attempt to relate machine learning to the philosophy of induction at an introductory level suitable for undergraduates or for professional philosophers and scientists who desire a painless introduction to the subject.  In clear, helpful figures and engagingly informal prose, the slender volume summarizes the main results and concepts of statistical learning theory, a particular statistical framework developed by V. N. Vapnik, A. J. Chervonenkis and others for machine learning and other applications (Vapnik and Chervonenkis 1974, Vapnik 1998, 1999, 2000).  Harman and Kulkarni also explain how statistical learning theory might shed light on philosophical topics like the problem of induction and the role of simplicity in theory choice.  The book is mainly an informal précis of Vapnik (2000), with some important differences.

The introduction begins with the philosophical problem of induction, characterized as the problem of determining the sense, if any, in which inductive inference is reliable.  It ends with a useful summary of coming attractions.  In between is a discussion of network models of "reflective equilibrium".  We would have preferred to see a more systematic comparison of the Bayesian and classical approaches to model selection, since the authors adopt the classical viewpoint without argument.

Chapter two narrows the focus from inductive inference in general to a very particular sort of statistical problem called pattern recognition.  Think of the problem of getting a computer to determine whether an arbitrary scrawl is intended as the letter Q (e.g., for automated reading of written addresses at the post office).  Each scanned, written character amounts to a pattern of gray-scale pixels on a square, two-dimensional, pressure-sensitive array.  Enumerating the pixels in the square array in some arbitrary order, one can view each written character as a finite sequence specifying the grey scale value (from 0 to 1) of each successive pixel, such as (.5, .7, … , .3).  There is a further fact of the matter whether the scrawl was produced with the intention to write Q, so the relevant facts are summarized by the sequence (Q, .5, .7, … , .3) or by the sequence (not-Q, .5, .7, … , .3).  For brevity, call such sequences instances.  A classification rule sees only the pixel tones (.5, .7, … , .3) and returns "yes" or "no" depending on whether it views the given pixels as an attempt to produce Q.  A mis-classification implies a cost or loss, which for now can be assumed to be zero for a correct guess and one for an incorrect guess.[1]

Statistical learning theory assumes that there is an unknown probability distribution p over instances so that, for example, p(Q, .5, .7, … , .3) denotes the chance that an attempt at producing Q resulted in pixel values (.5., .7, … , .3).  It is assumed that, furthermore, p does not change through time or depend on the outcomes of earlier trials.  This assumption is summarized by saying that letter-generation trials are independent and identically distributed (IID for short).  Each classification rule has a well-defined but unknown expected loss (relative to p) called the risk of the rule.  Accuracy or reliability of a classification rule is understood in terms of risk and the aim of statistical learning theory is to infer classification rules that minimize risk.  That is not the same as finding true rules or theories, an important point discussed in our closing comments.

The authors next turn their attention to the problem of learning classification rules with low risk.  To a learner of a pattern classification rule, the underlying probability distribution p and, hence, the rule of least risk are unknown.  Attempts at letter production are randomly "sampled" (according to p) with the intended letter truthfully specified by a teacher.  Using the sample, the learner can estimate the risk of a given rule by calculating the average number of sampled characters the rule mis-classifies.  Then, given a collection C of possible rules to choose among, the learner can choose a rule in C that minimizes this estimated or empirical risk.  The standard terminology for this rule-learning technique is empirical risk minimization (ERM); Harman and Kulkarni call it "enumerative induction" or "inductive learning".

The next topic is VC dimension, a pivotal concept in statistical learning theory.  Say that a set C of possible classification rules shatters a given sample just in case for each subset S of the sample there is a rule c in C such that S is the set of all items in the sample for which c returns "yes".  The VC dimension of a set C of classification rules is defined as the size of the largest sample shattered by C.  Classes with larger VC dimension "accommodate" more possible samples, suggesting some connection between VC dimension and simplicity, or explanation.  A major result of statistical learning theory is the following, two-part theorem.  First, if the VC dimension of C is finite, then it is possible to calculate, as a function of VC dimension, a sample size at which it is as probable as desired that ERM returns a rule whose true risk is as close as desired to the risk of the best rule in C.  Second, if the VC dimension of C is infinite, then such sample size bounds do not exist.  The chapter closes with a comparison of VC dimension with Popper's notion of "degrees of falsifiability", a topic discussed also by Vapnik (2000, p. 47).

At the outset of the third and main chapter, Harman and Kulkarni acknowledge the importance of the above theorem, but they also emphasize a limitation.  Say that a Bayes rule is a classification rule of least risk relative to the true distribution p.  In the absence of prior knowledge about p, the only way to guarantee a priori that collection C contains rules arbitrarily close in risk to the risk of a Bayes rule is to allow C to have infinite VC dimension.  In that case, the second part of the above theorem applies, so ERM can no longer be guaranteed a priori to achieve a good approximation to a Bayes rule with high chance, no matter how large the sample.  When C has infinite VC dimension (e.g., polynomial classification rules), the authors prefer a version of Vapnik's structural risk minimization (SRM) strategy rather than ERM.  The SRM method splits C into nested subsets C0, C1, … , Cn, … of finite, but ascending VC dimension (e.g., Ci consists of rules of polynomial degree no greater than i).  Given a sample and an arbitrary rule c in C, one can, in principle, find the empirical risk of c and the first class Ci containing c.  The authors define the SRM score of c to be an arbitrary increasing function of i and of empirical risk.  Then one chooses, for predictive purposes, a rule c in C whose SRM score is least, in light of the sample (p. 61).

The relevant reliability property of SRM, according to Harman and Kulkarni, is universal consistency, where "consistency" is intended in the statistical sense of convergence to the truth in probability rather than in the philosophical sense of logical compatibility.  Loosely speaking, a method is universally consistent if the following condition holds:

As more and more data are obtained, the expected error of rules selected by the method approaches, in the limit, the expected error of the best rule, the Bayes rule (p. 58).

Harman and Kulkarni usefully distinguish universal consistency from a stronger property called uniform consistency.  A method is uniformly consistent if one can calculate a priori how large a sample must be for the method to produce a rule whose risk is as close to that of a Bayes rule with a specified chance.  The authors observe that:

[u]niversal consistency does not imply uniform [consistency].  There may be no bound on the amount of data needed in order to ensure that (with probability approaching 1) the expected error of the rules endorsed by the method will be within epsilon of the expected error of the Bayes rule (p. 58).

Nonetheless:

Universal consistency is clearly a desirable characteristic of a method … Although this does not guarantee a rate of convergence, it can be shown that no method provides such a guarantee (p. 58).

For brevity, we will henceforth call a method (merely) consistent if it is universally consistent in the above sense.

The main philosophical applications of the book occur at the end of chapter three.

First, the authors interpret Goodman's notion of "projectability" (1983) in terms of both ERM and SRM, the relevant point being whether projectability is a restriction on possible hypotheses (ERM case) or a bias based on VC dimension, simplicity, or some other notion (SRM case).  Second, they observe that VC dimension is distinct from Popper's degrees of falsifiability and argue that VC dimension better accords with scientific method.  The book's most ambitious application is an attempted refutation of "demonic skepticism", according to which SRM is claimed to provide a "reason to think" that we are not brains in vats (p. 74).  We discuss the argument below.

In the fourth and final chapter, the authors sketch, very briefly, neural networks, support vector machines (SVMs), and transduction.  SVMs can be understood, roughly, as separating positive and negative examples with a thick "hyper-slab" rather than a thin hyper-plane in order to reduce the VC dimension in high-dimensional classification problems.  A problem in which positive and negative instances have no separating hyper-plane can sometimes be transformed into a higher-dimensional problem in which they do.  That increases the VC dimension of the set of separating hyper-planes, but the VC dimension can sometimes be reduced once again by adding thickness to the hyper-slabs.

We close with a couple of concerns.  Our first concern is that the authors do not present the standard formulation of SRM theory, limiting its usefulness as an informal introduction to the subject.  Recall that the sole reliability property attributed to SRM by the authors is (mere) consistency, which is a large sample convergence property.  On the contrary, Vapnik views SRM primarily as a small sample method (2000, p. 94).  Furthermore, although the authors portray SRM as a technique to employ when VC dimension is infinite, Vapnik recommends using SRM even if the VC dimension of C is finite, if the available sample size is too small to guarantee a tight bound on expected loss (ibid.).  The core of Vapnik's theory is the following theorem (Vapnik 2000, p. 93), which the authors do not mention.  Let h be the VC dimension of rule collection C, let n be a fixed sample size and let r be arbitrarily close to 1.  Then:

assuming IID sampling and bounded, non-negative loss, there is a function f(n, h, r), such that for each rule c in C and for each probability distribution p, the chance is at least r that the true risk of c in p is no greater than:

f(n, h, r) + the empirical risk of c at sample size n in p.

Like the authors, Vapnik defines SRM to start out with a (nearly) arbitrary choice of nested rule sets C0, C1, … , Cn, … of finite VC dimension.  But according to Vapnik, SRM then locates, on a given sample, the collection Ci for which the preceding sum is minimal (2000, p. 95).  Then SRM derives its prediction from some rule in Ci with minimal empirical risk.  Thus, by definition:

[t]he SRM principle takes both factors [empirical risk and VC dimension] into account, by choosing the subset [Ci] … for which minimizing the empirical risk  yields the best bound on actual risk (Vapnik 2000, p. 95, our emphasis).

That, rather than mere consistency, is the (short-run) reliability property of SRM intended by Vapnik (1998, 1999, 2000).  In order to control risk in the short run, SRM balances estimated risk or fit against VC dimension or capacity for accommodating arbitrary samples.  That risk minimization involves a balance between fit and capacity is a

familiar theme in statistical estimation theory and SRM may be viewed as a generalization of risk minimization methods such as the Akaike Information Criterion (AIC) (Vapnik 2000, p. 116).[2]

Vapnik also proves a convergence theorem for a variant of SRM in which the class Ci is chosen as an a priori function of sample size, rather than in light of the sample, itself (Vapnik 2000, p. 97).  The result, however, concerns asymptotic convergence rate rather than mere convergence, as the authors report.  Thus, Reliable Reasoning presents neither of the reliability properties of SRM emphasized by Vapnik, himself, both of which are stronger than mere consistency.

Our second concern is that the authors extend SRM methodology from the task of predicting the next datum to that of inferring true rules or theories, which SRM was never intended to do.  Indeed, given knowledge of the class Ci to which the true hypothesis belongs, the logic of risk minimization typically urges selection of a false hypothesis from an earlier class with lower VC dimension:

[e]ven if one knows the actual degree of the … polynomial, one often has to choose a smaller degree for the approximation, depending on the available number of observations (Vapnik 2000, p. 116).

That is hardly a shortcoming in a theory of accurate prediction, which is what Vapnik explicitly intended to provide.  But it is a serious shortcoming when the aim is to explain or to justify inferences to true theories.  The mismatch between available means and desired ends is most apparent when Harman and Kulkarni use SRM theory to argue that:

in fact we may have reason to think that we are not [brains-in-vats] (p. 74).

The argument is not that SRM predicts more reliably using scientific hypotheses than brain-in-a-vat hypotheses, for the scientific hypotheses are predictively identical to the brain-in-a-vat rules that simulate them (as the authors, themselves, recognize).  Nor does the authors' version of SRM reliably infer that we are not, in fact, brains-in-vats, since it dogmatically concludes that we are not brains-in-vats no matter what is observed.  In fact, the argument hinges upon extraneous (and implausible) side-constraints on SRM that have no basis in its reliability either as a predictor or as a technique for inferring true theories.  For example, the authors assume that all brain-in-vat hypotheses have exactly the same simplicity even though the scientific hypotheses they simulate have infinitely various degrees of simplicity.

It is hardly a compelling objection to SRM theory that it fails to refute demonic skepticism.  A more practical concern is that SRM is not even a reliable predictor when the predictions concern counterfactual outcomes of potential acts or policies.  SRM is capable of inferring an accurate regression rule for extrapolating lung cancer incidence from nicotine-stained clothes, but that rule would not accurately predict the incidence of lung cancer after the implementation of a government laundry policy.  That is because risk minimization is defined with respect to an underlying sampling distribution and enactment of a policy can alter that distribution, undermining Vapnik's bound on predictive risk with respect to the unaltered distribution.

A more promising approach for dealing with counterfactual predictions is to infer causal theories from statistical data (Spirtes et al. 2000, Pearl 2000) and to use these theories to infer correct predictions (e.g., that tooth polish does not cure cancer).  The proposed methods for causal inference all incorporate some bias toward causal theories with fewer causes, which is reminiscent of Ockham's razor or inference to the best explanation.  In some standard applications there exist consistent methods for causal discovery, but no uniformly consistent such methods (Robins et al. 1999).  However, consistency, alone, does not imply anything at all about short-run inferences, since alternative methods drawing alternative conclusions in the short run would also converge to the truth in the limit (Salmon 1967).  It remains, therefore, to explain how a fixed bias toward simplicity could be more reliable than alternative biases, even though the causal truth might be quite complex.

One such approach, not mentioned by the authors, is based upon efficient or straightest possible convergence to the truth, where straightness is a matter of minimizing the number of times the method retracts earlier conclusions as well as the times (sample sizes) at which these retractions occur, an extension of ideas proposed by Hilary Putnam (1956).  It can be shown that, given that convergence to the truth is feasible at all, the efficiently reliable causal discovery strategies are exactly the strategies that follow a version of Ockham's razor at each stage of inquiry (Kelly 2007).  That does not rule out brains-in-vats (no reliability argument could), but it does explain, at least, how Ockham's razor leads one to true causal predictions better than competing strategies, as Ockham's razor is necessary rather than merely sufficient for efficient convergence when theoretical differences imply some (arbitrarily small) observational difference.  This new and more promising approach illustrates that SRM theory is not the only way to implement the authors' reliabilist program (p. 2) and, indeed, is not the best way to do so, if reliable inference to true theories is to be explained (e.g., inference to the best explanation).

References

Forster, M. and E. Sober.  "How to Tell When Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions," The British Journal for the Philosophy of Science 45.  1994: 1-35.

Goodman, N.  Fact, Fiction, and Forecast, fourth edition.  Cambridge: Harvard University Press, 1983.

Harman, G.  "The Inference to the Best Explanation," Philosophical Review 74.  1965: 88-95.

Kelly, K.  The Logic of Reliable Inquiry.  Oxford: Oxford University Press, 1996.

Kelly, K.  "Ockham's Razor, Empirical Complexity, and Truth-Finding Efficiency," Theoretical Computer Science 383.  2007: 270-289.

Pearl, J.  Causality: Models, Reasoning, and Inference.  Cambridge: Cambridge University Press, 2000.

Putnam, H.  "Trial and Error Predicates and a Solution to a Problem of Mostowski," Journal of Symbolic Logic 30.  1965: 49-57.

Robins, J., R. Scheines, P. Spirtes, and L. Wasserman.  "Uniform Consistency in Causal Inference," Biometrika 90.  1999: 491-515.

Salmon, W.  The Logic of Scientific Inference.  Pittsburgh: University of Pittsburgh Press, 1967.

Spirtes, P., C. Glymour, and R. Scheines.  Causation, Prediction, and Search.  MIT Press, 1993.

Vapnik, V.  The Nature of Statistical Learning Theory.  New York: Springer, 2000.

Vapnik, V.  Statistical Learning Theory.  New York: Wiley, 1998.

Vapnik, V.  "An Overview of Statistical Learning Theory," IEEE Transactions on Neural Networks 10.  1999: 988-999.

Vapnik, V. and A. Chervonenkis.  Theory of Pattern Recognition (in Russian).  Moscow: Nauka, 1974.

[1] The authors also explain how to extend the theory to continuous loss functions.  A more precise discussion may be found in Vapnik (2000).

[2] Forster and Sober (1994) have already invoked risk minimization (and AIC in particular) as a potential explanation of Ockham's razor in statistical estimation problems, so it is regrettable that the connection is not drawn in Reliable Reasoning.