This is me about to do a 100' rope swing off El Capitan in Yosemite. Actually, I guess this was taken just after swinging, but the rest is right. Well, actually, the swing was only 30', but everything else is definitely the truth. Okay, fine, it wasn't El Capitan, it was just a rock in Delaware, but I swear the rest is real. Fine, I didn't actually swing, I just held a rope, but the idea is the same. Alright, alright, I'll come clean: the original story was true.

Jamie Simon

I'm a scientist aiming to understand the training dynamics of deep neural networks. I currently run a small lab within the Redwood Center and am supported by Imbue, where I am a research fellow. I recently finished my PhD, advised by Mike DeWeese. In my free time, I like running, puzzles, spending time in forests, and balancing things.

A lot of my papers have come about from helping people doing (mostly) empirical ML research come up with good minimal theoretical toy models that explain effects they're seeing [1,2,3,4,5]. Some of these ideas came about from quick initial conversations! If you've got a curious empirical phenomenon you're trying to explain, feel free to reach out :)

If you'd like to get emailed when I post new blogposts, you can sign up here.

Research

More is better: when infinite overparameterization is optimal and overfitting is obligatory

Explaining mysteries of modern ML with the toy model of RF regression

We give a theory of the generalization of random feature models. We conclude they perform better with more parameters, more data, and less regularization, putting theoretical backing to these observations in modern ML. This gives a fairly solid mathematical picture to replace classical intuitions about the risks of overparameterization and overfitting. Our picture of the generalization of RF models takes the form of some nice closed-form equations we think can be used to answer lots of other questions.

ICLR '24 [arXiv]

A spectral condition for feature learning

A simple picture of feature learning in wide nets

We give a simple scaling treatment of feature learning in wide networks in terms of the spectral norm of weight matrices. If you want to understand the "mu-parameterization," this is probably the easiest place to start ca. 2024.

[arXiv]

On the stepwise nature of self-supervised learning

A theory of the training dynamics of contrastive learning

We give a theory of the training dynamics of contrastive self-supervised learning and validate it empirically. We find that representations are learned one dimension at a time in a stepwise fashion -- that is, the rank of the model's final representations increases by one at each step. Our theory is derived for linearized models, but we clearly see our stepwise phenomenon even for ResNets trained on image data. Large-scale ML mostly uses unsupervised and self-supervised training these days, and we describe a behavior we think is fairly generic in SSL.

ICML '23 [arXiv]

You can just put up a poster at ICML and nobody will stop you

Exposing vulnerabilities in the conference review system

A perennial problem in machine learning research has been how to most efficiently have a poster at ICML. In this work, we show that one can bypass OpenReview via a simple "FedEx trick," similar to yet entirely different from the kernel trick in machine learning. (I'm putting this here as an easter egg to see if people actually read these. If you see this and saw + were amused by the original poster, feel free to drop me a line :))

ICML '23 [viral tweet]

The eigenlearning framework: a conservation law perspective on kernel regression and wide neural networks

We give a simple picture of the generalization of kernel ridge regression in terms of task eigenstructure and use it to solve some theoretical problems.

TMLR [arXiv] [code] [blog]

Reverse engineering the neural tangent kernel

A first-principles method for the design of fully-connected architectures

Much of our understanding of artificial neural networks stems from the fact that, in the infinite-width limit, they turn out to be equivalent to a class of simple models called kernel regression. Given a wide network architecture, it's well-known how to find the equivalent kernel method, allowing us to study popular models in the infinite-width limit. In work with Sajant Anand, we invert this mapping for fully-connected nets (FCNs), allowing one to start from a desired rotation-invariant kernel and design a network (i.e. choose an activation function) to achieve it. Remarkably, achieving any such kernel requires only one hidden layer, raising questions about conventional wisdom on the benefits of depth. This allows surprising experiments, like designing a 1HL FCN that trains and generalizes like a deep ReLU FCN. This ability to design nets with desired kernels is a step towards deriving good net architectures from first principles, a longtime dream of the field of machine learning.

ICML '22 [arXiv] [code] [blog]

Benign, tempered or catastrophic: a taxonomy of overfitting

How bad is neural network overfitting?

Classical wisdom holds that overparameterization is harmful. Neural nets defy this wisdom, generalizing well despite their overparameterization and interpolation of the training data. How can we understand this discrepancy? Recent landmark papers have explored the concept of benign overfitting -- a phenomenon in which certain models can interpolate noisy data without harming generalization -- suggesting that neural nets may fit benignly. In this work with Neil Mallinar, Preetum Nakkiran, and others, we put this idea to the empirical test, giving a new characterization of neural network overfitting and noise sensitivity. We find that neural networks trained to interpolation do not overfit benignly, but neither do they exhibit the catastrophic overfitting foretold by classical wisdom: instead, they usually lie in a third, intermediate regime we call tempered overfitting. I found that we can understand these three regimes of overfitting analytically for kernel regression (a toy model for neural networks), and I proved a simple "trichotomy theorem" relating a kernel's eigenspectrum to its overfitting behavior.

NeurIPS '22 [arXiv]

Critical point-finding methods reveal gradient-flat regions of deep network losses

Exposing flaws in widely-used critical-point-finding methods

Despite how common and useful neural networks are, there are still basic mysteries about how they work, many related to properties of their loss surfaces. In this project, led by Charles Frye, we tested Newton methods (common tools for optimization and exploring function structure) on loss surfaces. We found that, as opposed to finding critical points as designed, in practice Newton methods almost always converged to a different, spurious class of points which we described. Giving simple visualizable examples to illustrate the problem, we showed that some major studies using Newton methods on loss surfaces probably misinterpreted their results. Our paper is here.

(2021) [Neural Computation] [arXiv] [code]

Simplified Josephson-junction fabrication process for reproducibly high-performance superconducting qubits

A faster method to make Josephson junctions

In the spring and summer of 2019 I worked in the lab of Prof. Per Delsing developing nanofabrication methods for Josephson junctions, ubiquitous components in superconducting circuitry. My main project was a study of how junctions age in the months after fabrication, but my biggest contribution was elsewhere: Anita Fadavi, Amr Osman and I developed a junction design that is faster to fabricate by one lithography step, or potentially several days of work.

(2021) [Applied Physics Letters]

Fast noise-resistant control of donor nuclear spin qubits in silicon

Better control schemes for spin qubits

Qubits decohere and lose their quantum information when uncontrollably coupled to their environment. Nuclear spin qubits in silicon are extremely weakly coupled to their environment, giving them long coherence times (up to minutes), but that same weak coupling makes quickly controlling them difficult. Advised by Prof. Sophia Economou, I came up with schemes for driving nuclear spin qubits that give fast, noise-resistant arbitrary single-qubit gates. The most important gate is a long sweep that effectively turns uncertainty in electric field (charge noise) into uncertainty in time, which can be accounted for by corrective gates. We also show two-qubit gates.

(2020) [PRB] [arXiv]

Puzzles

While a senior in undergrad, I started a puzzlehunt called the VT Hunt with Bennett Witcher. It became a university tradition, with the 2019-22 VT Hunts each drawing 1000-2000 participants and raising money for charities, and I've stayed involved as a mentor. I've also helped concoct six other puzzle events starting in high school. A few of my favorite puzzles I've made are below. They're roughly ordered from easiest to hardest, so you can pick where to start.

COASTERS

FLAGS

FOREST SON

ADDERS MULTIPLYING

90 SHADES OF BLACK

NAME A MORE ICONIC SET OF COUPLETS

TOPOLOGY

MINING

TRANSIT

MAELSTROM

Blog

Science (research)

On the scientific method and its application to the science of deep learning (July 2025, 23 min read)

Backsolving classical generalization bounds from the modern kernel regression eigenframework (April 2025, 4 min read)

One kernel, many eigensystems (April 2025, 6 min read)

A complete characterization of the expressivity of shallow, bias-free ReLU networks (April 2025, 9 min read)

The eigensystem of the Gaussian kernel w.r.t. a Gaussian measure (March 2025, 7 min read)

The optimal low-rank solution for linear regression (November 2024, 1 min read)

Infinite-width autoencoders are cursed (October 2024, 12 min read)

It's hard to turn a low-rank matrix into a high-rank matrix (October 2024, 8 min read)

Insights into GPT-2's positional encodings (August 2024, 11 min read)

Experiments in self-assembly (July 2024, 7 min read)

Using the Laplacian to take a local average of a function (June 2024, 5 min read)

An eigenframework for the generalization of 1NN (June 2024, 6 min read)

Let's solve more learning rules (June 2024, 6 min read)

Creating and erasing AI watermarks (March 2024, 9 min read)

Reflections on introductory neuroscience reading (March 2024, 22 min read)

Reverse engineering the NTK (August 2022, 8 min read)

A theory of generalization for wide neural nets (October 2021, 7 min read)

The principle of least power dissipation (September 2020, 8 min read)

Multiplicative neural networks (August 2020, 16 min read)

Potpourri

The time I caught an egg in my mouth (July 2024, 1 min read)

Gravitree update: June 2024 (June 2024, 10 min read)

The sixth lake (June 2024, 2 min read)

Geometric patterns in croplands (April 2024, 11 min read)

Favorite quotes from Letters to a Young Poet (April 2024, 4 min read)

How many babies were born during the 2024 eclipse? (April 2024, 5 min read)

Roadtripping to North Dakota (April 2024, 9 min read)

Why I talk to strangers (March 2024, 2 min read)

Einstein vs. Bohr rap battle (January 2022, 5 min watch)

Newton vs. Leibniz rap battle (January 2022, 5 min watch)

Messing with the postal service (July 2020, 4 min read)

Common ground (July 2020, 5 min read)

The expected cost of breaking quarantine (May 2020, 12 min read)

Science (fun)

Understanding fractals from iterated maps (July 2024, 11 min read)

Household microscopy (June 2024, 2 min read)

Can a chemical reaction measure the size of its container? (October 2022, 15 min read)

Simulating cells fighting to the death (September 2022, 4 min read)

Time-reversed random walks (September 2022, 16 min read)

The gravitree (October 2021, 6 min read)

Could you propel a spacecraft using sports projectiles? (November 2020, 10 min read)

How would an upside-down candle burn? (August 2020, 4 min read)

What would happen if you made a planet out of fish? (June 2020, 11 min read)

How hard do you have to hit a chicken to cook it? (June 2020, 1 min read)