Statistical Modeling: The Two Cultures
by Leo Breiman, 2001
Quick summary and comments
August 6, 2021
This paper, which appeared in the journal Statistical Science, is another one aimed at statisticians, and it rings an alarm bell: statistics isn’t meeting the challenges arising from modern data. “Modern” at this point is 2001, well into the computer era of data collection and analysis. It’s a particularly interesting sequel to de Leeuw’s 1994 paper because it also addresses the role of (statistical) models.
The abstract tells much of the story:
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
The paper starts with a simple diagram, in which inputs x go into a box labeled nature and responses y come out:
y <--- nature <--- x
The vector x holds the independent variables, and y is a vector of dependent (or response) variables. Nature is a black box that creates an association between the two. Nothing controversial about this so far.
There are two goals in analyzing the data:
- Prediction: To be able to predict what the responses are going to be to future input variables.
- Information: To extract some information about how nature is associating response variables to the input variables.
The Data Modeling Culture (98% of all statisticians) assumes a stochastic data model (i.e., a probability model) for what goes on inside the black box. Model validation is a question of goodness of fit, which is a hypothesis testing problem.
The Algorithmic Modeling Culture (2% of statisticians plus lots of people in other fields) takes the optimal recovery tack: find a function that operates on x to predict y. Model validation is all about predictive accuracy.
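To make the contrast between the two kinds of validation concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration and none of it comes from Breiman's paper: the `nature` function, the straight-line data model, the in-sample R^2 used as a crude stand-in for a formal goodness-of-fit test, and the k-nearest-neighbor predictor standing in for an algorithmic model.

```python
import math
import random

random.seed(0)

def nature(x):
    """Hypothetical black box: a nonlinear signal plus Gaussian noise."""
    return math.sin(x) + random.gauss(0.0, 0.1)

# Draw 300 (x, y) pairs and keep the last 100 as a held-out test set.
xs = [random.uniform(0.0, 2.0 * math.pi) for _ in range(300)]
ys = [nature(x) for x in xs]
train_x, train_y = xs[:200], ys[:200]
test_x, test_y = xs[200:], ys[200:]

def mse(pred, truth):
    """Mean squared prediction error."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# Data modeling: assume y = a + b*x + noise and estimate a, b by least squares.
n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)
a = mean_y - b * mean_x

def line(x):
    return a + b * x

# In-sample R^2: a crude goodness-of-fit summary for the assumed model.
ss_res = sum((y - line(x)) ** 2 for x, y in zip(train_x, train_y))
ss_tot = sum((y - mean_y) ** 2 for y in train_y)
r_squared = 1.0 - ss_res / ss_tot

# Algorithmic modeling: no model for the box, only a predictor f(x).
def knn(x0, k=5):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

# Validation by predictive accuracy on data not used for fitting.
line_test_mse = mse([line(x) for x in test_x], test_y)
knn_test_mse = mse([knn(x) for x in test_x], test_y)

print(f"line: in-sample R^2 = {r_squared:.2f}, held-out MSE = {line_test_mse:.3f}")
print(f"kNN:  held-out MSE = {knn_test_mse:.3f}")
```

The straight line is validated by asking how well the assumed model fits the data it was estimated from, while the nearest-neighbor predictor is validated only by its error on held-out points. The example is rigged in the predictor's favor because the assumed data model is deliberately wrong.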
Breiman goes on to recount some personal experiences as a consultant and the culture shock upon returning to academia:
I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.
In Section 5, The Use of Data Models:
Statisticians in applied research consider data modeling as the template for statistical analysis: Faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:
- The conclusions are about the model’s mechanism, and not about nature’s mechanism.
- If the model is a poor emulation of nature the conclusions may be wrong.
Ok. I take issue with a couple of things here. Why is it so bad that a statistician should invent a model after examining the data? Lots of scientists formulate models based on data. Is there something particularly offensive about formulating a probability model? The only potentially offensive thing I see here is the idea that the probability model is parametric. What about non-parametric models? Are they offensive too? Also, the statements that the “conclusions are about the model’s mechanism and not about nature’s” and that “if the model is a poor emulation of nature, the conclusions may be wrong” are true of any model, not just probabilistic ones. Does that mean climate scientists shouldn’t use climate models?
Perhaps it’s unfair of me to come along 20 years after the publication of Breiman’s paper, after these ideas have become more mainstream, and complain that much of this is obvious or presented in an oversimplified way that encourages outrage on the part of like-minded people. It’s like complaining that old episodes of I Love Lucy are trite because subsequent sitcoms modeled themselves on I Love Lucy. I Love Lucy isn’t trite: it’s the original.
This section finishes with more discussion of problems with data modeling in practice. It is followed by a section titled The Limitations of Data Models.
Section 7 is about algorithmic modeling. There is a nice, if brief, history in Section 7.1. In Section 7.2, Breiman discusses Theory in Algorithmic Modeling:
The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least partly unknowable. What is observed is a set of x’s that go in and a subsequent set of y’s that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.
The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their “strength” as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.
The emphasis on the last sentence is mine because to me this seems huge, and yet no further remark is made about it. This is a GIGANTIC assumption that can only be made if one doesn’t care at all about what Breiman earlier called “information”. In other words, if all you care about is predictability (and if relationships aren’t changing between the time you build your algorithm and the time you apply it), then you can get away with this i.i.d. assumption and it frees you from the pesky problem of having to understand what your model is actually telling you.
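The parenthetical about relationships changing can be illustrated with a small sketch. It uses an invented linear process whose slope changes after the predictor is built. The `sample` function, the slopes, and the noise level are assumptions chosen for illustration, not anything taken from the paper.

```python
import random

random.seed(1)

def sample(slope, n):
    """Hypothetical process: y = slope * x + noise, with x uniform on [0, 1]."""
    data = []
    for _ in range(n):
        x = random.uniform(0.0, 1.0)
        data.append((x, slope * x + random.gauss(0.0, 0.1)))
    return data

def fit_slope(data):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

def mse(slope_hat, data):
    """Mean squared error of the predictor f(x) = slope_hat * x."""
    return sum((y - slope_hat * x) ** 2 for x, y in data) / len(data)

# Build the predictor while the relationship has slope 2.
train = sample(2.0, 200)
slope_hat = fit_slope(train)

# New data from the same distribution: the i.i.d. assumption holds.
iid_mse = mse(slope_hat, sample(2.0, 200))

# New data after the relationship has changed: the assumption fails.
shifted_mse = mse(slope_hat, sample(1.0, 200))

print(f"estimated slope:           {slope_hat:.2f}")
print(f"MSE, same relationship:    {iid_mse:.3f}")
print(f"MSE, changed relationship: {shifted_mse:.3f}")
```

A held-out test drawn from the same distribution as the training data reports the first error. It says nothing about the second, which is the one that matters when the underlying process is not stationary.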
In the next section he discusses the problem in terms of 1) the multiplicity of “good” models, 2) the conflict between simplicity and (predictive) accuracy, and 3) the curse or blessing of dimensionality. He discusses ensemble predictors as a way of balancing some of these opposing forces.
In Section 11 Breiman closes the loop:
The dilemma posed in the last section is that the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable. But this dilemma can be resolved by realizing that the wrong question is being asked. … Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of statistical analysis is. The point of the model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model.
The goal is not interpretability, but accurate information. (Emphasis is mine, again.)
The remainder of this penultimate section gives examples showing that Random Forests (a true success story) is an algorithmic model capable of providing much useful information as well as good predictions. Point taken.
My takeaway from this as it applies to CLIVAR and climate science: I am unconvinced that the algorithmic culture/approach is better in any general or specific sense. But there is a caveat that I will explain in the next paragraph. Before I go there though I will say that, in my view, the problem with Breiman’s argument is the squishiness of the word “information”. He never says what this actually means. The closest he comes is to say something about “how nature is associating the response variables to the input variables”. What does “associating” mean? How do you measure that? Covariance and correlation come to mind. So does mutual information. But these are all properties of a joint distribution, which is to say a data model. I do not think prediction error alone is sufficient to provide understanding in the sense we seek in climate science. We are interested in discovering physical mechanisms (or aspects thereof) from data. In the end, if that’s quantified through a complex predictive function, that’s a model too, and just because it predicts well under i.i.d. assumptions doesn’t mean it reflects reality, because i.i.d. is a fiction in the physical world.
Now for the caveat. I think predictability and probability model fit are related. Surprise. They are different aspects of the same thing. I’ve been hanging around with Houman Owhadi at Caltech. In various papers he has shown that this enterprise can be viewed from at least four different perspectives including Gaussian Processes (a.k.a. data model) and optimal recovery (a.k.a. algorithmic model). There is deep mathematics connecting them that I don’t pretend to understand. But, in my opinion, it does seem to resolve the dispute in a useful way by focusing attention not on which one is better, but on how different interpretations provide different insights.
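As a footnote to the earlier remark that covariance, correlation, and mutual information are all properties of a joint distribution, here is a minimal sketch that computes two of them from nothing but a joint probability table. The two 2x2 tables and the function names are made up for illustration.

```python
import math

def marginals(joint):
    """Row and column marginals of a joint probability table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def correlation(joint, x_vals, y_vals):
    """Pearson correlation of X and Y under the joint distribution."""
    px, py = marginals(joint)
    ex = sum(p * x for p, x in zip(px, x_vals))
    ey = sum(p * y for p, y in zip(py, y_vals))
    exy = sum(
        joint[i][j] * x_vals[i] * y_vals[j]
        for i in range(len(x_vals))
        for j in range(len(y_vals))
    )
    var_x = sum(p * (x - ex) ** 2 for p, x in zip(px, x_vals))
    var_y = sum(p * (y - ey) ** 2 for p, y in zip(py, y_vals))
    return (exy - ex * ey) / math.sqrt(var_x * var_y)

def mutual_information(joint):
    """Mutual information in nats: sum of p(x,y) * log(p(x,y) / (p(x) p(y)))."""
    px, py = marginals(joint)
    return sum(
        p * math.log(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

values = [0.0, 1.0]

# Hypothetical joint distributions over two binary variables.
dependent = [[0.4, 0.1], [0.1, 0.4]]        # X and Y tend to agree
independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y)

for name, table in [("dependent", dependent), ("independent", independent)]:
    rho = correlation(table, values, values)
    mi = mutual_information(table)
    print(f"{name}: correlation = {rho:.3f}, mutual information = {mi:.4f} nats")
```

Both functions take the joint table as their only statistical input, which is the point: to talk about these measures of association at all, one has to posit or estimate a joint distribution.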