The Information Bottleneck: Naomi on Understanding Training

Podcast notes · The Information Bottleneck · March 2026

Overview

Naomi is a Kempner Research Fellow at Harvard and incoming assistant professor at Boston University. Her research sits at an unusual intersection: she is not trying to make training work better, she is trying to understand why it works at all. This episode of The Information Bottleneck, hosted by Ravid and Allen, covers grokking and phase transitions, what the loss curve hides, the axes of scaling and whether they are interchangeable, data quality, multilingual representations, interpretability and sparse autoencoders, the case for language as the dominant modality, world models, and the problem of non-determinism in training runs. What follows are structured notes organized by topic.

Why bother understanding training?

The motivation Naomi leads with is biological. If you want to understand an organism, you cannot ignore its evolutionary history. That history explains vestigial traits, spandrels (structures that arise as side effects of other adaptations), and behaviors that were important early in development but are disadvantageous later. Discarding that history and treating the final organism as a complete description leads to systematically wrong conclusions.

The same logic applies to language models. A head that shows a striking activation pattern at the end of training might be a vestigial scaffold that mattered early and has since become irrelevant, or it might be a side effect of some other optimization pressure, not a meaningful functional unit. Neurons that are highly selective for a specific class tend to predict worse overall model performance, yet they are important early in training as scaffolding. Treating the final weights as the whole story, without asking how the model got there, produces interpretations you cannot trust.

The Jennifer Aniston neuron story illustrates the point from the other direction.
A biological neuron that appeared to fire specifically for images of Jennifer Aniston turned out, on closer inspection, to be far messier: inconsistent, not uniquely selective, affected by context. The original finding looked clean because it was observed narrowly. Training dynamics research is a commitment to not making that mistake.

Grokking: delayed generalization and what it tells us

Grokking is the phenomenon where a model first achieves perfect training accuracy (memorization) and then, with continued training on the same data, suddenly generalizes. This is the opposite of the classical picture in which generalization and memorization are in tension and overfitting is the thing to watch for.

The canonical demonstration is modular arithmetic. Below a certain data threshold the model memorizes without generalizing. Above it the model generalizes quickly. At the threshold the model is precarious: some runs generalize in the expected window, others take much longer, and a few generalize immediately because the initialization happened to be favorable. This sensitivity to initial conditions near the threshold is characteristic of a phase transition.

What makes grokking theoretically interesting is that the transition from memorization to generalization is a transition between modeling strategies. Generalization in this context means the model has discovered an internal structure that correctly captures the rule, not just the training examples. Naomi frames this as a case of the model finding a new way to compress and represent regularity. The same transition logic appears in other settings: a model might start predicting like an n-gram language model and then abruptly develop genuine syntactic structure. These are transitions between qualitatively different internal organizations, not just quantitative improvements.
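To make the memorization-versus-generalization distinction concrete, here is a minimal sketch of a modular-arithmetic setup. It assumes modular addition with modulus 97 and a 30% training split; these specifics, and every function name, are illustrative choices rather than details from the episode. No network is trained here. The two hand-written predictors simply stand in for the two strategies a model could land on.

```python
import random

P = 97  # modulus; an illustrative choice, not taken from the episode


def modular_addition_split(p=P, train_fraction=0.3, seed=0):
    """Enumerate every (a, b, (a + b) mod p) triple and split it at random."""
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]


def accuracy(predict, examples):
    """Fraction of examples where predict(a, b) equals the true label."""
    return sum(predict(a, b) == c for a, b, c in examples) / len(examples)


train, test = modular_addition_split()
lookup = {(a, b): c for a, b, c in train}


def memorizer(a, b):
    # Pure memorization: only pairs seen in training have an answer.
    return lookup.get((a, b))


def rule(a, b):
    # The generalizing solution: the actual arithmetic rule.
    return (a + b) % P


print(accuracy(memorizer, train), accuracy(memorizer, test))  # 1.0 0.0
print(accuracy(rule, train), accuracy(rule, test))            # 1.0 1.0
```

Both predictors are perfect on the training split, so training accuracy alone cannot tell them apart; only the held-out pairs can. In an actual grokking experiment one would train a network on `train` while sweeping `train_fraction` and `seed`, which is where the threshold behavior described above would be expected to show up.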
Phase transitions and what the loss curve hides

Discussing joint work with Ravid on syntactic attention structure in masked language models, Naomi describes finding two consecutive large drops in the loss during training, not one. The first corresponds to a reorganization of internal structure (heads beginning to specialize for syntactic roles like subject-verb agreement). The second, which depends on the first, corresponds to the emergence of complex grammatical competencies, including sensitivity to syntactic scope (knowing, for example, when "anymore" is grammatically licensed in English).

Naomi's inference is that two observed phase transitions in the loss imply at least three, because there is always an initial edge-of-stability transition where the model finds its basin before anything else happens. And if there are three visible ones, there are probably many more invisible ones: individual concepts that get learned at specific moments in specific locations, each too small to register as a visible kink in the population-level loss. The smooth loss curve almost certainly contains a large number of small phase transitions that average out into apparent continuity. The population-level loss is not a reliable window into what is actually happening inside the model during training.

What train-test curves no longer tell us

The classical framework of train/validation/test curves and early stopping was designed to detect overfitting to the training distribution. Grokking, double descent, and related phenomena have not exactly invalidated this framework, but they have narrowed the domain where it is the right lens. Modern models trained on large diverse datasets tend to generalize well to similar held-out data. The framework becomes relevant again only when you care about specific subsets: edge cases, out-of-distribution inputs, particular benchmarks. This is where the interesting failure modes now live.
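As a rough illustration of how an aggregate held-out score can hide a failing subset, the sketch below breaks accuracy down by subset. The subset names and counts are made-up toy numbers, not results from the episode or from any real model.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """records: iterable of (subset_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subset, is_correct in records:
        totals[subset] += 1
        correct[subset] += int(is_correct)
    per_subset = {name: correct[name] / totals[name] for name in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_subset


# Hypothetical evaluation log: a large "typical" subset and a small edge-case subset.
records = (
    [("typical", True)] * 970
    + [("typical", False)] * 10
    + [("letter_counting", True)] * 4
    + [("letter_counting", False)] * 16
)

overall, per_subset = accuracy_by_subset(records)
print(f"overall: {overall:.3f}")
for name, acc in per_subset.items():
    print(f"{name}: {acc:.3f}")
```

With these toy numbers the overall accuracy stays above 97% even though the small subset is right only one time in five, which is why the breakdown matters more than the headline figure.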
A model that gets the number of Rs in "strawberry" right might get the number of Rs in "blueberry" wrong, not because of classical overfitting but because of how tokenization interacts with the model's internal counting procedure. Tokenization is a productive entry point here, Naomi notes, because it breaks the assumption that what is hard for humans is hard for models. As soon as tokenization enters the picture, a lot of apparently strange model behavior becomes less mysterious, and the analogy between training and human learning starts to dissolve.

The axes of scaling: are they interchangeable?

There are at least three distinct things people mean when they say "scale": scale data, scale model size, or scale training time (more passes over the same data). Chinchilla scaling laws formalize a tradeoff between the first two, but the relationship is not one of simple substitutability. Multipass training, where you run more steps without adding new data, is not the same as having more data.

The more subtle point is that not all data is equal even within a fixed quantity. The GPT-2 paper's main contribution, Naomi argues, was not the architecture or the scale but the removal of data that was causing the model to make specific errors. What matters is the support the data provides for a particular generalization behavior, not the count. Models are more likely to learn hierarchical syntactic rules when the training corpus contains structures that are easier to model correctly using that hierarchy. Center-embedded sentences (structures that effectively require tracking nested dependencies) provide that kind of support for learning genuine syntactic competence. More data with the right structural properties is not the same as more data.

Data quality as an empirical question

Asked to define data quality, Naomi is honest that there is no clean principled answer. It is often recognized after the fact rather than predicted in advance.
The best example she offers is code: intuition might say that including large amounts of code in a language model's training data would hurt natural language quality, since code is not language. In practice the opposite is observed. Code is unusually efficient at teaching long-range dependencies and structured reasoning, and removing it tends to degrade performance on complex language tasks that require those capabilities. The intuition was wrong, and only empirical testing revealed it.

This is one of the central arguments for building a rigorous empirical science of training. Many of our intuitions about what models "need to eat" turn out to be incorrect, and the correct answers are not derivable from first principles at our current level of theoretical understanding.

Multilingual models and the interlingua question

Multilingual language models appear to develop something like an internal language-independent representation, sometimes called an interlingua. In English-dominant models like LLaMA, representations of text in Bulgarian or other languages look, in the model's internal space, as if the text had been translated into English and then re-rendered in Bulgarian, though not in a way that produces obviously translated-sounding output. The model is not literally translating through English, but the geometry of the internal representations is heavily shaped by English dominance in training.

The data composition question for low-resource languages is genuinely open. One direction supported by recent work at Harvard is that you need enough diversity of both languages and tasks to give the model reason to develop representations that compose correctly across the two. The geometry needs to allow "Bulgarian" and "answer this biology question" to combine without the two interfering. How much data of what kind achieves that is an active research question.
The more exotic behaviors seen in models like DeepSeek, which appear to mix English and Chinese during chain-of-thought reasoning, raise the question of whether different languages afford different expressive capacities for certain kinds of reasoning. That is speculative, but the observation that models switch languages mid-thought suggests the internal computation is not language-neutral even when outputs appear fluent.

Interpretability: what it can and cannot show

Naomi's position on interpretability is principled rather than fashionable. She believes understanding the tools you use is intrinsically valuable, independent of immediate practical payoff. The analogy is aerodynamics: we spent decades understanding the aerodynamics of airplanes not only because it made airplanes better, but because understanding the tools you build is what doing science means. She also thinks interpretability has genuine practical potential, and discusses that separately.

On sparse autoencoders specifically, her critique is direct. Sparse autoencoders are topic models trained on internal model representations rather than on raw text. Topic models are a classic clustering approach, and they are very good at extracting interpretable structure from language data because language data has lots of interpretable surface structure. The concern is that sparse autoencoders, being overcomplete (more dictionary entries than input dimensions), can easily find "features" that reflect structure in the input data rather than structure in the model's processing. Different random seeds produce substantially different features. If you could recover similar features from a simple two-layer model, you have not learned much about how a deep transformer processes its inputs.

More promising, in her view, are approaches that try to approximate computation rather than just cluster representations.
Anthropic's work on transcoders, which attempt to model the transformation a layer applies rather than the state it produces, is a step toward the right target. Even that is not causal, and she does not think interpretability must be causal to be useful. If you can identify signatures of the algorithm a model is using and use those signatures to predict out-of-distribution behavior, that is valuable even if the signature is not the mechanism that implements the algorithm. Showing that two models with identical in-distribution outputs use different internal rules, and that this predicts how they diverge on edge cases, is exactly the kind of result that could make interpretability practically useful.

Language as the privileged modality

Naomi describes herself as a language modality chauvinist. The core claim is that natural language is specifically designed, through millennia of co-evolution between speakers, to efficiently communicate everything humans care about. The things we care about, we talk about. Images are not cooperative in this sense: photons hit your retina without any intention to convey the most relevant information. Language is cooperative by design.

One consequence is that image classifiers trained on datasets like ImageNet are already multimodal in a hidden sense: ImageNet is built on WordNet, an NLP ontology, and the categories it distinguishes are the ones humans care enough about to encode in language. The alignment between image and language representations that people find surprising is partly explained by the fact that image classifiers already embed a linguistic ontology from the start.

The exceptions she acknowledges are code (unusually efficient for teaching long-range dependencies, as discussed) and one other text modality she was recently persuaded matters, which goes unnamed.
DNA is discussed as a candidate and she is skeptical: if there were things in DNA we cared about enough to reason about verbally, we would talk about them and they would appear in language data. DNA as a modality for training a better language-output model seems unlikely to her, though she grants it might be useful for domain-specific biological modeling.

The vision question is more contested. Ravid pushes back that humans are expert perceivers of the visual world, and Naomi grants this partially, but notes that generating and rendering visual scenes requires extensive training in humans (drawing is a skill) while speaking is essentially universal. The visual cortex is 500 million years old and the Broca's area analog is roughly 2 million years old, yet human civilization is built much more heavily on language than on vision. The evolutionary age of a capacity does not straightforwardly predict how much leverage it gives for building intelligence.

World models

Naomi finds the world models debate more frustrating than productive. The core disagreement in the field is often definitional: people are drawing different conclusions partly because they are using "world model" to mean different things. She prefers to operationalize the question. Instead of asking whether a model "has" a world model, ask whether its internal representations predict its behavior on out-of-distribution inputs. That is measurable. Whether the model "understands" physics or "really knows" the world is not.

On whether predictive modeling (the dominant paradigm: predict the next token, predict the next frame) is the right foundation: she thinks prediction is central to cognition and necessarily part of any world model, because so much of how humans and animals navigate the world relies on predictive internal models.
But the debate about whether language prediction alone is sufficient, or whether grounding in sensorimotor experience is necessary, is not one she thinks can currently be resolved empirically. The question of whether current models have consciousness faces the same problem: we lack the conceptual tools to answer it, not just the empirical tools.

Non-determinism and what we can actually conclude from a training run

The closing topic is one Naomi says she wants people to think about much more carefully. When we train a model and observe a behavior, we are observing one set of weights, produced by one training run, with one random seed. We are not observing the outcome of a training process in any generalizable sense. Nobody runs 50 full-scale training runs of a frontier model. The variance across runs is essentially unknown at scale.

This matters because post-training increasingly involves reinforcement learning, which introduces additional sources of degeneracy. Models can converge to multiple qualitatively different solutions. A behavior that appears attributable to a data choice or architectural choice might instead be a random coincidence of initialization. The only way to have confidence that a behavior is a reliable consequence of your method, rather than a property of a specific lucky initialization, is to see it replicate across seeds and data orderings. The field rarely has that evidence for the behaviors that matter most.

The em dash example is a concrete illustration of a behavior that does appear reliably across models and seeds: many frontier models generate em dashes at unusually high rates. Because this replicates so consistently, you can reasonably investigate why. The answer is tokenization: em dashes without surrounding spaces, as used in American convention, save a token relative to the British spaced convention, and models under pressure to be efficient in token use tend to favor them.
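The replication argument can be sketched as a small harness that repeats a run across seeds and reports how often a behavior appears. Everything here is hypothetical: `train_and_measure` is a toy stand-in that draws a random score instead of training anything, and the threshold and distribution parameters are arbitrary illustrative values.

```python
import random
import statistics


def train_and_measure(seed: int) -> float:
    """Hypothetical stand-in for one training run.

    A real version would train a model with this seed and return a score
    for the behavior of interest. Here it just draws a noisy number.
    """
    rng = random.Random(seed)
    return rng.gauss(0.6, 0.15)


def replication_summary(seeds, threshold=0.5):
    """Run once per seed and summarize how consistently the behavior appears."""
    scores = [train_and_measure(seed) for seed in seeds]
    hits = sum(score > threshold for score in scores)
    return {
        "n_runs": len(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation is undefined for a single run.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else float("nan"),
        "replication_rate": hits / len(scores),
    }


single = replication_summary([0])
many = replication_summary(range(20))
print(single)
print(many)
```

The single-seed summary has no defined spread at all, which is the situation the notes describe for frontier-scale runs: one observation says nothing about variance. Only the multi-seed summary supports a claim about how reliably a method produces the behavior.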
When a behavior repeats across many different training setups, you can start attributing it. When it appears in one model, you cannot.

Recruiting

Naomi is starting her lab and is actively recruiting students, including at Boston University. The group there has recently hired several people she describes as her favorite scientists working on language model training dynamics, and she is excited about the intellectual environment. If these questions interest you, her work is a good entry point.