I have recently found myself involved in many discussions centered on the question of how we might develop a satisfactory science of modern machine learning. It certainly seems likely that such a science should be possible, as every important human feat of engineering has eventually admitted an explanatory science. Given such a science, engineering artifacts are revealed as points in a space of viable constructions: with an understanding of mechanics and materials science, we see that the Taj Mahal and the Colosseum are two points in the space of stable freestanding structures made of stone, and we can describe the boundaries of this set. It is of course one of the great scientific questions of our time whether current foundation models and their inevitable successors may be located within such a scientific framework.

While there is widespread support for work developing the “science of deep learning,” there is little consensus as to what this science will look like or what constitutes meaningful progress. Much of this disarray is of course inevitable and healthy: deep learning is complex, complementary approaches will be necessary, and we do not know what, exactly, we are looking for. Much of the confusion, however, is evitable and unhelpful: even when searching for an unknown object, it helps to search methodically. This essay is a discussion of the method of search.

It appears to me that the present disorder lies mostly downstream from confusion about some basic questions: what is science, and how do you do it? Actually, we don’t care about science merely because it is Science, but rather because it is a technique we may use, so what we are really asking is: how do you make useful sense of something mysterious in the world, and when the mystery is great, how do you go about making progress anyway? While these questions are basic, they are by no means easy. I would like to share some thoughts on these questions informed by my experience and my own process of trial and error.

Many great thinkers of the last century have offered insightful discussions of the scientific mindset and process, and I strongly recommend sitting down with Popper’s notion of falsifiability, Kuhn’s depiction of the scientific process, Feynman’s joyful empiricism, and Yudkowsky’s techniques for clear thinking. All have greatly shaped my own views, and I have little to say about the general process of science that one of them has not already said better. I would like to contribute just one idea concerning the so-called scientific method.

We learn in grade school that the process of science follows a defined sequence of steps: observation, hypothesis, experiment (with at least three trials), analysis of data, and acceptance or rejection of the hypothesis. However, any practicing scientist knows that this is not really how science works. The anatomy of a scientific project usually bears little resemblance to this tidy storybook picture.1 Useful projects take such a great diversity of forms that it can seem like anything at all goes. One might then fairly wonder: are there, after all, any truly essential steps, or is one approach as good as another? This is an important question for any field that hopes to make material progress towards understanding a great mystery.

It seems to me that, yes, there are two essential steps to the scientific method. Step A is to figure something out. Step B is to check and make sure you’re not wrong. You can do these steps in any order, but you have to do them both.


Figure 1: The scientific method.



All the good science I know does both of these steps. If you only do the first step — figuring something out but not adequately checking that you’re right — you’re doing philosophy, or theology, or another speculative practice. Such speculation can be beautiful and useful, but at the end of the day, it can rarely be built upon. If you only do the second step — performing an empirical check but not figuring anything out — you’re usually doing engineering.2

There are no rules whatsoever as to how you do the first step. You are allowed to figure things out via educated guess, long experience, mathematical derivation, meditation, or divine inspiration. This is often where the ingenuity of the theorist enters into play. There is only one rule with the second step: you have to do a good job checking that you’re not wrong, ideally good enough to convince other people. This, too, can be done in many different ways — direct experiment, elimination of alternatives, checking new predictions — but it is absolutely essential that it is done adequately. It can be very difficult to do this well, or even to figure out how to do it! This is the nerve-wracking step where one makes sure that one isn’t fooling oneself. This is usually where the ingenuity of the experimentalist comes into play.3

Most of the poor science of which I am aware fails in one of these two steps. It is, of course, very common for a study to fail to adequately demonstrate its central claims: this is understandable, as true, interesting facts are quite difficult to find! On the other side, uninteresting, hypertechnical experiments (which have unfortunately become the norm in certain crowded areas of physics) perform their empirical checks just fine, but when the lab equipment is back in its boxes, it is unclear what general fact has been figured out. (Yep, our Rube Goldberg machine of diffraction gratings, modulators, cavities, and heavy atoms works as predicted! Anyone remember what it was all for?) I also see studies that claim to figure something out about a system of interest (e.g., deep learning), but which fail to state a specific enough conjecture to go and test it, which is a partial failure in both steps. (More on that later.)

It is worth clarifying that a scientist does not have to do both steps of the scientific method in every scientific paper. An individual contribution may be entirely the proposal of a new idea, as in Darwin’s book or most of the seminal theoretical physics papers of the early 1900s, or it may consist entirely of measurements, as in Tycho Brahe’s meticulous observations of the planets or careful modern measurements of fundamental physical constants. It is quite legitimate for a contribution to disprove an existing belief without offering a replacement. The important thing is that a scientist doing one or the other recognizes that they are one half of a conversation with the other half: the person proposing the ideas expects other people to go and test them, and so tries to make it easy for them to do so, and the people checking the ideas know what the implications are if the experiment comes out one way or the other. Every scientific contribution should understand itself as part of a project that does do both steps of the scientific method.

Why are both of these steps necessary for the progress of science? If a line of research does not purport to have figured out any general fact, it is unclear what has been learned or how to build on it. On the other hand, if it does not adequately check its claims, then until it does, it will carry a shadow of fundamental doubt that prevents others from productively building on it. Furthermore, and often more significantly in practice, the act of checking usually contains within it the act of application, as the most convincing way to check a claim is often to operationalize it to do something of interest. Science is an edifice that builds on itself. It usually consists of so many layers that each piece of brickwork must be quite solid to support future building, and each brick must be crafted with future bricks in mind.

If you really believe me that the essential scientific method has only two steps, you might ask: what’s all this other mumbo jumbo about preregistered hypotheses, repeated trials, cognitive biases, and whatnot? There are a few things going on. First off, while these two steps are simple, actually doing them is hard, so we have some established techniques that sometimes make them easier. Some of these help with the figuring out, but that’s mostly a dark art.4 Most of these are techniques for the checking of our knowledge: preregistered hypotheses, multiple trials, the inclusion of error bars, blind review, and most of the other rituals of science are ways to do step B. None is essential, but all are useful ways to check our answer and avoid fooling ourselves.

Secondly, some of these techniques are field-specific. For example, in psychology, social science, and nutrition, it’s notoriously tempting to choose one’s hypothesis after seeing the data. In these fields, the space of hypotheses is usually large and the resolving power of evidence is usually weak, so choosing the most-supported claim post hoc often amounts to p-hacking, and preregistering hypotheses makes it likelier that one’s conclusions are not totally wrong. In (non-string-theory) physics, the situation is usually the opposite — hypotheses are few and evidence is abundant — so preregistration isn’t necessary.
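To make concrete why post-hoc hypothesis choice goes wrong, here is a minimal simulation sketch of my own (the numbers and setup are illustrative, not drawn from any real study): when a study can pick among many candidate hypotheses after seeing purely null data, the best-looking one is routinely “significant.”

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_subjects = 40      # samples per group
n_hypotheses = 20    # candidate outcomes one could test post hoc
n_studies = 1000     # simulated studies, all with NO true effect

false_positive_studies = 0
for _ in range(n_studies):
    # Null data: treatment and control are identically distributed
    # for every one of the candidate outcome measures.
    treatment = rng.normal(size=(n_hypotheses, n_subjects))
    control = rng.normal(size=(n_hypotheses, n_subjects))
    pvals = stats.ttest_ind(treatment, control, axis=1).pvalue
    # Post-hoc selection: report whichever outcome happens to look best.
    if pvals.min() < 0.05:
        false_positive_studies += 1

# With 20 independent looks at null data, roughly 1 - 0.95**20 ≈ 64% of
# studies can report a "significant" finding.
print(f"null studies reporting an effect: {false_positive_studies / n_studies:.0%}")
```

Preregistration amounts to deleting the `min()` step: one commits to a single row of the array before looking at it, and the false-positive rate falls back to the nominal 5%.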

These steps apply even in a nascent field that knows very little. While no method is sufficient to guarantee useful progress in the search, failure to do both steps all but guarantees no progress will be made. By way of analogy, when searching a large unfamiliar house for a desired object, one’s chances of success are greatly improved by a methodical search that progressively expands an explored volume. It is possible to learn that it’s not in this cabinet, say, but only if one first identifies the cabinet as a useful unit of exploration, then does a sufficiently thorough search that the cabinet does not need to be revisited in the future. Failures are progress, but only when they are sufficiently clear and careful so as to reassure other searchers that the space of possible truths is meaningfully reduced.

In summary, the scientific method consists of two steps: you must figure something out, and you must adequately check that you are not mistaken. It seems to me that these are the two things we must demand of any useful scientific project.

How to recognize the scientific method in the science of deep learning

Deep learning presents a large number of important mysteries. What exactly we believe these mysteries are has evolved over time and will continue to do so, but all present agree that the mysteries are there. Certainly it seems that the practice of deep learning involves far more arbitrary choices and yields far more surprising results than it ought to if we knew better. There is thus probably much ground to gain. Because the success of deep learning is an empirical phenomenon and we wish to explain it, this is very much a scientific question, and we will be wise to consciously use the methods of science to structure our search.

We are gradually learning to do genuine science in the study of deep learning, and the rewards have been proportionate. However, a great deal of research effort ostensibly in service of our understanding of deep learning is expended in directions which are quite far from science and which consequently make little real progress. This is not a personal vendetta of mine: I have found this to be the consensus of virtually every researcher in the field with whom I have discussed the subject. In fact, this group includes many researchers who have described their own work to me as being of this ineffectual type, usually with a palpable air of despondence! (By contrast, the deep learning researchers I know who have caught the “science bug” tend to be energized and optimistic.) We as a field are due for a serious discussion of our methods and search strategy, and we can be optimistic that effectual methods are quite achievable.

How can we recognize the scientific method in the study of deep learning? We should look for work which (a) purports to figure out something particular and clear about deep learning, and (b) reports simple experiments that compellingly support it. The things figured out can really take any form so long as they are clearly stated and seem useful. They may be empirical (“A causes B”; “C phenomenon reliably happens”), mathematical (equations, limits, new mathematical objects), or even metascientific (e.g., “D is a useful proxy model for deep learning”). Qualitative claims are fine, but quantitative claims are best, because they may be verified with great confidence and may usually be applied in a large number of cases. Qualitative claims are rarely verified reliably (and often fold under later scrutiny) because of the sheer number of possible causes in a system as complex as deep learning. In assessing the progress of deep learning, we should count the number of interesting, easily verifiable quantitative claims one can make about deep learning systems, and individual researchers should seek to add to this count. This is how we will mark our progress in our search.

So, how does most “science of deep learning” work do on this minimal rubric of the scientific method? By way of illustration, let me run through some of the major research trends in the last five years of deep learning theory. I will start with some failings before ending with some victories.

On formality and rigor to the detriment of insight. It is an item of little controversy (at least when discussed off the record) that a great number of deep learning theory papers are impressively rigorous and mathematically complex, but ultimately shed little to no light on the mystery originally motivating the endeavor. Papers of this sort tend to share certain features: asymptotic notation obscures large hidden constants; theorem statements require significant parsing in order to extract the essence of the result; the problem setup introduces complexity for the sake of a more impressive result rather than simplifying the setup for clarity and insight; few or no experiments are reported, and certainly none with nonlinear networks. Any seasoned deep learning theorist has read numerous such papers.

It seems to me that this pattern is the result of mistaking the scientific study of deep learning for a discipline of mathematics, which then requires formality and rigor. It emphatically is not: we are faced here with great and glaring empirical mysteries, and experiments are cheap and easy.5 It seems virtually guaranteed that we will first understand deep learning through quick-and-dirty nonrigorous arguments which may later be formalized, as a path through the woods is first blazed and only later paved over in asphalt. Formality and rigor are a hindrance if they make it harder to understand the real nature of what you have figured out. As the saying goes, all theorems are true, but only some are interesting, and taking stock of the present state of our knowledge, it is far better to have an interesting nonrigorous result than an uninteresting theorem.

Papers of this mathematical sort rarely include experiments: sometimes there are experiments tacked on at the end, but they are usually an afterthought. If the study of deep learning were mathematics, this would be understandable, as a proven theorem requires no empirical demonstration. Because the study of deep learning is a science, however, neglecting experiments is utter folly. Unless the proven theorem totally resolves an important question in a completely realistic setting, experiments can extend a result’s scope of applicability, show its limits, check assumptions, or simply make it easier to understand the stated claim. If our goal is to understand deep learning, it sure seems wise to check empirically whether whatever you suppose you have figured out applies to deep learning! If it doesn’t, or the fit is worse than you expected, this merits explanation. If it does, then the contribution is all the greater.

The overly mathematical nature of much deep learning theory research is natural and understandable given the field’s history: most workers come from theoretical computer science, statistics, or mathematics. Nonetheless, the game has changed, and we should require less rigor, more insight, and more empirics from contributions.

Progress in the study of the dynamics of neural network training. It seems to me that much of the lasting progress of the last five years — the stuff that’s really stuck around and been built on — is essentially all of a particular type which marries theory and empirics. This strain of research is characterized by several trends:

  • It demands satisfying equations and quantitative predictions, and it is willing to study very simple cases and dumb-seeming quantities in order to make this happen.
  • While the equations do not usually come from experiments, they are easily verified by experiments.
  • Assumptions are checked empirically. Assumptions that are both good and useful are celebrated and kept around.
  • It is humble: it studies only what it can describe well, and does not make premature claims about downstream topics.

Almost all of the work in this vein describes the dynamics of neural network learning rather than the performance, which is what I mean by humility. Some touchstone topics in this vein include the theories of deep linear networks, the NNGP and neural tangent kernel, the maximal update parameterization, the edge of stability, the generalization of kernel ridge regression, and the study of hyperparameter scaling relationships (see, e.g., here). All these ideas presented new mathematical quantities derived from model training which closely follow simple equations. All have proven solid enough to build further understanding on top of.

This variety of deep learning theory does not yet have an accepted name. We should give it one. Because of its similarities to the physical sciences, the “physics of deep learning” is a candidate term, though this carries historical baggage, has already been used to describe many things including other research directions, and risks alienating the mathematicians and statisticians who have contributed to this productive type of work and will do so in the future. “Analytical interpretability” may be a good term for this, conveying both the high bar of analytical results and the promise of interpreting the training process of deep learning. Plus, it abbreviates to “AI.”6 Since this line of work includes virtually all extant examples of quantitatively predictive theories for deep learning, I like to think of it as “experiment-dots-on-theory-curves” or “dots-on-curves” theory.
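To illustrate what “dots on curves” means in its very simplest form, here is a toy sketch of my own (not taken from any of the works above): for full-batch gradient descent on noiseless linear regression, theory predicts that the error along each eigendirection of the data covariance decays geometrically, and the measured dots land exactly on those curves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "experiment": full-batch gradient descent on noiseless linear regression.
n, d, lr, steps = 200, 5, 0.1, 50
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

# Theory: in the eigenbasis of the covariance H = X^T X / n, each error
# component is multiplied by (1 - lr * lambda_i) per step, a pure geometric decay.
H = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(H)

w = np.zeros(d)
for t in range(steps):
    dots = eigvecs.T @ (w - w_star)                          # experiment: measured error components
    curve = (1 - lr * eigvals) ** t * (eigvecs.T @ (-w_star))  # theory: predicted components at step t
    assert np.allclose(dots, curve), "the dots left the theory curve"
    w -= lr * (X.T @ (X @ w - y)) / n                        # one gradient step on the mean squared error

print("every experimental dot lies on its theory curve")
```

The touchstone results listed above play the same game with far more interesting quantities, but the shape of the argument is the same: derive a curve, then check that the measured dots fall on it.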

What about empirical science? The above discussion is centered largely on deep learning theory. The science of deep learning is a rather greater endeavor, and has had some successes. Mechanistic interpretability cannot do what the physics of deep learning could (and vice versa; both are necessary), but it has been quite admirable as a scientific endeavor: it has made good use of the scientific method to coordinate a large-scale search over a difficult space. Deep learning theory should learn from it.

On the even more empirical side, phenomena like adversarial examples and the lottery-ticket hypothesis were excellent empirical observations, though much of the followup work on these topics makes less use of the scientific method and has accreted into less lasting knowledge. The observation of scaling laws in neural network performance is perhaps the one extant example of a robust and important equation extracted purely from neural network empirics. This was an excellent observation, and it remains unexplained.
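For readers who have not seen how such an equation is pulled out of empirics, here is a minimal sketch with synthetic stand-in data (the constants are invented for illustration, not measured values): the observation is that (model size, loss) points fall on a straight line in log-log coordinates, so a two-parameter power law describes them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for measured (model size, loss) points; a real scaling-law
# study would use actual training runs. The constants are illustrative only.
N = np.logspace(6, 9, 12)                        # model sizes from 1e6 to 1e9 parameters
loss = 300.0 * N ** -0.3                         # made-up power law L(N) = a * N^(-b)
loss *= np.exp(0.02 * rng.normal(size=N.size))   # small multiplicative measurement noise

# The empirical regularity: the dots form a straight line in log-log
# coordinates, so fitting that line recovers the whole curve.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"fitted scaling law: L(N) ≈ {a:.0f} * N^(-{b:.3f})")
```

Why the dots should fall on such a line in the first place is exactly the unexplained part.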

Most “observations” in deep learning are of the form “X method works for Y task.” Much fruitful dialog could be had between deep learning scientists and practitioners if the practitioners were more proactive in aggregating interesting phenomena and handing them to the scientists, and likewise if the scientists were more proactive in asking for them. Of course, most practitioners are too laser-focused on building AGI to care about theory, so I am dubious this will happen.

“Hail Maries.” Lastly, there have been a few ideas proffered of a type that I’ll call “hail Maries” after the long-bomb football pass. These ideas try to take a big step forward all at once with a very good guess: they tend to be summarizable by statements of the form “hey, what if deep learning is actually just X?” A good example of this is the much-embattled “information bottleneck” theory. Even though the IB was conclusively disproven soon after its proposal, I strongly applaud the bold, testable hypothesis and honest attempt to figure something out. Attempts to jump ahead in the story like this are likely to be wrong, but they are very much permitted in science. Much of the development of quantum mechanics consisted of bold, unprecedented guesses! Remember that there are no rules as to how one must figure something out: intuition-guided guesswork is quite allowed. In our field, there are few ideas and much energy available to test them, so I would like to see more bold guesses of this type. We should expect to see more such leaps as time goes on, and some of them will turn out to be right.

Conclusions

What now? We are making steady progress towards a theory of deep learning, a theory that we presumably hope to bend to the benefit of humankind. It has been over a decade since AlexNet, and we have tried much. Most of this has failed, but some of it has succeeded. It is a good time now to step back, notice the patterns in our successes, reassess our strategies, and seriously refocus our effort. Let’s get moving.


  1. To list some deviations I have seen firsthand: sometimes the hypothesis changes dramatically, or only becomes clear at the end, or is absent altogether. Sometimes a single trial suffices, and sometimes one needs millions. Sometimes the hypothesis and conclusions are obvious and the data gathering is the whole project. Sometimes the conclusion has little to do with the initial aims of the project. I have seen very few scientific projects follow the script of the “science fair scientific method,” and these few usually turned out poorly! For example, I’ve rarely seen an interesting hypothesis confirmed by experiment when the scientists weren’t already damn near sure it was going to be true. 

  2. Of course, the creator of any engineered artifact can rightly claim to have “figured out” that such an artifact is possible. Sometimes this is quite an interesting discovery! The boundaries between engineering and science are not clear, and we do not need them to be. 

  3. If I were to add a third step, it would be convincing other people. The community is the ultimate judge of whether you have figured something out and whether your experiments show that it is not wrong. A colleague points out that this is similar to Dorothy Sayers’s third step of the creative process: sharing your work with the world and thereby having an effect on other people. To me, peer review feels like an important part of the scientific process, but secondary to the scientific method – even alone on a desert island, you could do science as I describe it here – and in any case, it’s not like anyone is making important scientific progress and not sharing it, so I feel comfortable omitting it.

  4. Yudkowsky attempts to make this mysterious process more mechanical in some of the Sequences, but it’s still quite difficult to come up with hypotheses. 

  5. It is a very interesting question, perhaps worthy of discussion elsewhere, what the merits of rigor and formality are in an exploratory endeavor. Most obviously, a proven theorem is always correct and will not fail unexpectedly, so all else equal, a theorem is preferable to an equivalent nonrigorous claim. When, though, does the extra solidity justify the price in labor? It seems to me that rigor and formality are most useful when the class of objects one wishes to describe is very large and of unknown character, and bizarre or pathological cases are prevalent and important. For example, the space of groups is very large and diverse, and so without axioms to work from, we are lost. Similarly, real analysis requires formality because it turns out the set of all univariate functions on the reals is far stranger than expected, and we cannot rely on our intuitions. On the other hand, when one already has an intuitive feel for the set of objects one wishes to characterize, the guardrails of formality are not so necessary. It is for this reason that you need very little formal math to do physics. This is very much the case in which we find ourselves with deep learning. 

  6. Hat tip to Alex Atanasov for coining the term.