Research


Understanding deep learning will be one of the grand scientific challenges of the early 21st century. I want to develop fundamental theory to explain how neural networks learn. I'm a physicist by training and approach this as a physics problem.

This is a big project: in total it will require at least some theory of the data, a picture of the dynamics of neural net optimization, and a picture of how the model pulls patterns out of the data during this optimization process. We've seen a lot of progress in the past decade, and I'm optimistic.

This page gives a sketch of my major research threads, with each paper annotated to explain how it fits into its larger line of work and/or how I suggest you read it. The best articulation of my research philosophy is currently here.

A lot of my papers have grown out of helping people doing (mostly) empirical ML research come up with good minimal theoretical toy models that explain effects they're seeing [1,2,3,4,5]. Some of these ideas emerged from quick initial conversations! If you've got a curious empirical phenomenon you're trying to explain, please reach out.


Developing + using the theory of kernel regression

Kernel ridge regression (KRR) is a primitive learning algorithm that looks a lot like deep learning in some ways (e.g., as a model of overparameterized optimization). It's strictly simpler, and any question you can't answer for KRR, you can't answer for deep learning, so you should ask your questions about KRR first to get your thinking straight. I've done a lot of that. KRR is the only learning algorithm for which we have a simple, general, and verifiable theory of generalization. I've helped build this hammer and hit a lot of nails with it.
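
For reference, the algorithm itself fits in one line: given training data $(x_i, y_i)_{i=1}^n$, a kernel $k$, and a ridge parameter $\lambda \ge 0$, KRR predicts

$$\hat f(x) = k(x, X)\,(K + \lambda I_n)^{-1} y, \qquad K_{ij} = k(x_i, x_j),$$

where $k(x, X)$ is the row vector with entries $k(x, x_i)$. Most of the results in this thread are, at bottom, statements about the test risk of this estimator.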


Predicting kernel regression learning curves from only raw data statistics (2025) [arXiv]
Dhruva Karkada*, Joey Turnbull*, Yuxi Liu, James B. Simon

We study KRR with rotation-invariant kernels. We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function $f_*$.

This is possible because it turns out the kernel's eigenfunctions are approximately Hermite polynomials (and approximately insensitive to the actual choice of rotation-invariant kernel). We capture this idea mathematically as the Hermite eigenstructure ansatz (HEA), prove it in informative limiting cases for Gaussian data, and validate it empirically on several real image datasets.
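
To give the rough shape of the ansatz (my own schematic gloss, not the paper's precise statement): writing the data covariance in its eigenbasis with eigenvalues $\sigma_1 \ge \sigma_2 \ge \cdots$ and whitened coordinates $z_j = x_j / \sqrt{\sigma_j}$, the claim is that the kernel eigenfunctions are approximately the products of Hermite polynomials

$$\phi_{\mathbf{k}}(x) \approx \prod_j \mathrm{he}_{k_j}(z_j), \qquad \mathbf{k} = (k_1, k_2, \ldots),$$

nearly independent of which rotation-invariant kernel you picked, with the corresponding eigenvalues (and hence the order in which the modes are learned) set by the covariance spectrum.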

We did this project as a stepping stone towards a theory of MLPs. We needed a rudimentary theory of the structure of the data, and now we have one. As we show in the paper, even feature-learning MLPs seem to learn Hermite monomials in an order predicted by the HEA! This paper worked far better than any of us expected, which, when doing science, is a good sign you're on to something.

This was my first paper functioning as a PI. It's also my best work -- my coauthors did a remarkable job.
More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory (ICLR 2024) [arXiv]
James B. Simon, Dhruva Karkada, Nikhil Ghosh, Mikhail Belkin

Our main contribution here is a theory of the generalization of random feature models. It's a simple eigenframework which generalizes the one for KRR. With it, we conclude that random feature regression performs better with more parameters, more data, and less regularization, putting theoretical backing behind these empirical observations from modern ML. This gives a fairly solid mathematical picture to replace classical intuitions about the risks of overparameterization and overfitting. The eigenframework here can probably be used to answer a lot of other questions.
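
For concreteness, the model class here (in its standard form; the notation is mine) is regression over a fixed random nonlinear featurization,

$$f(x) = \sum_{i=1}^{k} a_i\, \phi(w_i \cdot x),$$

where the feature weights $w_i$ are drawn at random and frozen and only the readout coefficients $a_i$ are fit, typically by ridge regression. Conditioned on the sampled features this is just linear regression, which is what lets the KRR eigenframework be extended to it, with the number of features $k$ entering as a new knob alongside the sample size $n$ and the ridge parameter.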
An agnostic view on the cost of overfitting in (kernel) ridge regression (ICLR 2024) [arXiv]
Lijia Zhou, James B. Simon, Gal Vardi, Nathan Srebro

In this paper, we work backwards from the KRR eigenframework, which is omniscient in the sense that it assumes knowledge of the target function, to an agnostic theory which does not. It's similar in goal to this blogpost.
Benign, tempered, or catastrophic: toward a refined taxonomy of overfitting (NeurIPS 2022) [arXiv]
Neil Mallinar*, James B. Simon*, Amirhesam Abedsoltan, Parthe Pandit, Misha Belkin, Preetum Nakkiran

Classical wisdom holds that overparameterization is harmful. Neural nets defy this wisdom, generalizing well despite their overparameterization and interpolation of the training data. What gives? How can we understand this discrepancy?

Recent landmark papers have explored the concept of benign overfitting -- a phenomenon in which certain models can interpolate noisy data without harming generalization -- suggesting that neural nets may fit benignly. Here we put this idea to the empirical test, giving a new characterization of neural network overfitting and noise sensitivity. We find that neural networks trained to interpolation do not overfit benignly, but neither do they exhibit the catastrophic overfitting foretold by classical wisdom: instead, they usually lie in a third, intermediate regime we call tempered overfitting. I found that we can understand these three regimes of overfitting analytically for kernel regression (a toy model for neural networks), and I proved a simple "trichotomy theorem" relating a kernel's eigenspectrum to its overfitting behavior.
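
The taxonomy itself is quick to state (roughly, and in my words rather than the paper's precise definitions). Writing $\mathcal{E}(n)$ for the test risk of the interpolating model trained on $n$ noisy samples and $\mathcal{E}^*$ for the Bayes-optimal risk,

$$\text{benign: } \lim_{n \to \infty} \mathcal{E}(n) = \mathcal{E}^*, \qquad \text{tempered: } \mathcal{E}^* < \lim_{n \to \infty} \mathcal{E}(n) < \infty, \qquad \text{catastrophic: } \lim_{n \to \infty} \mathcal{E}(n) = \infty.$$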
On kernel regression with data-dependent kernels (2022) [arXiv]
James B. Simon

A short note connecting the optimal adaptation of a kernel to its data with a Bayesian view of target functions. It hasn't proven useful since, but maybe someday it will.
The eigenlearning framework: A conservation law perspective on kernel regression and wide neural networks (TMLR 2023) [arXiv][blog]
James B. Simon, Madeline Dickens, Dhruva Karkada, Michael R. DeWeese

This is the paper that started this whole sequence for me. Maddie and I managed to derive a simple, physicsy eigenframework for the generalization of KRR. It turned out others had beaten us to it, but I was captivated by the beauty of the equations, so we published this paper giving what I think is still the simplest and most accessible version of the theory, plus a handful of applications, plus a surprising connection to quantum physics. If you're not familiar with these ideas and want to study generalization in machine learning, I strongly recommend you read this paper or one of the related ones we cite.
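
For the curious, the core of the framework fits in a few lines (stated here from memory, so trust the paper for precise conditions and notation). Decompose the target in the kernel's eigenbasis as $f_* = \sum_i v_i \phi_i$, with kernel eigenvalues $\lambda_i$, sample size $n$, and ridge $\delta$. Let the effective regularization $\kappa \ge 0$ solve

$$\sum_i \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa} = n.$$

Each eigenmode is then learned to degree $\mathcal{L}_i = \frac{\lambda_i}{\lambda_i + \kappa}$, and the expected test risk is

$$\mathcal{E} = \frac{n}{n - \sum_i \mathcal{L}_i^2} \left[ \sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \sigma^2 \right],$$

with $\sigma^2$ the noise level. The conservation law of the title is the fact that the total learnability $\sum_i \mathcal{L}_i$ can never exceed $n$: a kernel can extract at most $n$ units' worth of eigenmodes from $n$ samples.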

Stepwise learning dynamics

In certain regimes, neural nets show staircase-like loss curves over the course of optimization, with plateaus punctuated by sudden drops. These dynamics are called "stepwise," "saddle-to-saddle," or "incremental" learning, and they have the flavor of greedy low-rank optimization. The dynamics in these cases are often easier to understand than in more traditional regimes, so it's worth figuring out what's going on and looking for insights that generalize.


Closed-form training dynamics reveal learned features and linear structure in word2vec-like models (NeurIPS 2025) [arXiv][blog]
Dhruva Karkada, James B. Simon, Yasaman Bahri, Michael R. DeWeese

We show that word2vec exhibits stepwise learning, and even better, the early steps correspond to interpretable semantic concepts in the vocabulary. This is a promising bridge for connecting fundamental theory to mechanistic interpretability.
Alternating gradient flows: a theory of feature learning in two-layer neural networks (NeurIPS 2025) [arXiv]
Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane

We give an explicit algorithmic description of feature learning in some shallow neural networks trained from small initialization. In this picture, the model alternates between alignment phases, in which dormant neurons align with the current residual gradient, and growth phases, in which a newly grown neuron joins the community, which collectively settles into a new locally optimal configuration.
Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape (2025) [arXiv]
Ioannis Bantzis, James B. Simon, Arthur Jacot

Do general deep neural networks show stepwise learning dynamics from small initialization? Empirically, it's surprisingly hard to tell. Theoretically, it's also surprisingly hard to tell. In this paper, we take the first step by studying the first step: a deep ReLU net provably has at least one plateau-and-drop step, and we characterize this first step, showing that the growth direction is rank-one in almost all weight matrices.
On the stepwise nature of self-supervised learning (ICML 2023) [arXiv][blog]
James B. Simon, Maksis Knutins, Liu Ziyin, Daniel Geisz, Abraham J. Fetterman, Joshua Albrecht

This is the paper that started this sequence for me. We noticed that self-supervised models sometimes exhibited stepwise loss curves, and I developed a kernel theory that explains why. The theory is exact in the neural tangent kernel regime. The theory and supporting empirics suggest that we can understand self-supervised models as building their representations a few directions at a time, working in a low-rank subspace at any given moment.

Scaling dynamics and hyperparameters of feature learning

A central tenet of the developing science of deep learning is that you should study your hyperparameters, because if you don't control for them, they'll confound whatever conclusion you're attempting to draw. In these papers, we study hyperparameters related to feature learning. These scaling relationships help lay down foundations for future science.


The optimization landscape of SGD across the feature learning strength (ICLR 2025) [arXiv]
Alexander Atanasov*, Alexandru Meterez*, James B. Simon*, Cengiz Pehlevan

We disentangle the network richness parameter $\gamma$ from the learning rate $\eta$, identifying scaling relationships and giving a phase diagram. We define and characterize the ultra-rich regime, which lives at $\gamma \gg 1$, beyond the rich regime. Training in the ultra-rich regime often shows stepwise learning and is conceptually related to training from small initialization. It seems to be mathematically simpler than the rich regime but no less performant (at least given long enough to train), and we came away from this thinking it's a useful subject for future theoretical research.
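
For orientation, here is one common convention for the richness parameter (the paper's exact setup may differ in details): the network output is scaled as

$$f_\gamma(x) = \frac{1}{\gamma}\, g(x; \theta),$$

so that fitting the data at large $\gamma$ forces the internal weights $\theta$ to move a lot (rich, feature-learning behavior), while $\gamma \to 0$ recovers the lazy/kernel regime in which features barely move. The learning rate $\eta$ must then be rescaled along with $\gamma$ to keep training stable, and the paper maps out how optimization behaves across the resulting $(\gamma, \eta)$ plane.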
A spectral condition for feature learning (2023) [arXiv]
Greg Yang*, James B. Simon*, Jeremy Bernstein*

We give a simple scaling treatment of feature learning in wide networks in terms of the spectral norm of weight matrices. If you want to understand the $\mu$-parameterization, this is probably the easiest place to start.
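
The condition itself is short enough to quote (from memory; see the paper for the precise setup): for a weight matrix with fan-in $n_{\ell-1}$ and fan-out $n_\ell$, both the weights and their updates should have spectral norms scaling as

$$\lVert W_\ell \rVert_{\mathrm{op}} = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right), \qquad \lVert \Delta W_\ell \rVert_{\mathrm{op}} = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right).$$

Maintaining these scalings at every layer as widths grow is what yields feature learning in the large-width limit, and $\mu$P can be read off as a parameterization that enforces them.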

My thesis


Discretized Theories of Deep Learning at Large Width: an Eigenthesis (2024) [pdf]
James Simon

The introduction here is probably a good resource for new students in DL theory to read to get some bearings. The rest is my papers stapled together with some personal history thrown in.

Other machine learning theory

Odds and ends.


FACT: the Features at Convergence Theorem (2025) [arXiv]
Enric Boix-Adserà*, Neil Mallinar*, James B. Simon, Mikhail Belkin

I quite like the neural feature ansatz, but it's not derivable from first principles, and I was (and remain) convinced it's not quite the right object to study. This paper gives an alternative which does provably come from a first-principles calculation and serves basically the same purposes.
You can just put up a poster at ICML and nobody will stop you (ICML 2023) [viral tweet]
Jimmy Neuron, Chet G. P. Tee, Ada Grahd

Did you know you can just do things?
Reverse engineering the neural tangent kernel (ICML 2022) [arXiv][blog]
James B. Simon, Sajant Anand, Michael R. DeWeese

This was a cute idea with some surprisingly nice math. It's where I first used Hermite polynomials, which would later return. It was pretty, but I no longer believe in it.
SGD with a constant large learning rate can converge to local maxima (ICLR 2022) [arXiv]
Liu Ziyin, Botao Li, James B. Simon, Masahito Ueda

A cute observation about pathological cases of SGD loss landscapes.
Critical point-finding methods reveal gradient-flat regions of deep network losses (Neural Computation 2021) [arXiv]
Charles G. Frye, James Simon, Neha S. Wadia, Andrew Ligeralde, Michael R. DeWeese, Kristofer E. Bouchard

This was my first ML theory paper!

My past life as a physicist

My academic roots are in physics, and that still informs how I see the world and the development of human knowledge.


Fast noise-resistant control of donor nuclear spin qubits in silicon (PRB 2020) [arXiv]
James Simon, F. A. Calderon-Vargas, Edwin Barnes, Sophia E. Economou

Interleaved electro-optic dual comb generation to expand bandwidth and scan rate for molecular spectroscopy and dynamics studies near 1.6 µm (Optics Express 2021) [arXiv]
Jasper R. Stroud, James B. Simon, Gerd A. Wagner, David F. Plusquellic

Experimental optics work done during a summer internship at NIST in 2017.
Simplified Josephson-junction fabrication process for reproducibly high-performance superconducting qubits (APL 2021) [arXiv]
Amr Osman, James Simon, Andreas Bengtsson, Sandoko Kosen, Philip Krantz, Daniel Perez, Marco Scigliuzzo, Per Delsing, Jonas Bylander, Anita Fadavi Roudsari

In the spring and summer of 2019 I worked in the lab of Prof. Per Delsing developing nanofabrication methods for Josephson junctions, ubiquitous components in superconducting circuitry. My main project was a study of how junctions age in the months after fabrication, but my biggest contribution was elsewhere: Anita Fadavi, Amr Osman, and I developed a junction design that's faster to fabricate by one lithography step, potentially saving several days of work.