Panel Discussion: World Modeling Workshop 2026

Panel notes · World Modeling Workshop 2026 · Yann LeCun, Yoshua Bengio, Shirley Ho, Shuran Song, Alessandro Lazaric · Moderator: Randall Balestriero

Overview

The closing panel of the World Modeling Workshop brought together Yann LeCun, Yoshua Bengio, Shirley Ho, Shuran Song, and Alessandro Lazaric, moderated by Randall Balestriero. The discussion covered the input-space versus latent-space debate, hierarchical representations and where supervision for them comes from, planning in learned latent spaces, error accumulation in neural simulators, how to distinguish controllable from uncontrollable factors when learning from actionless video, causality and agency, and the relative merits of world models versus policies in terms of sample efficiency and generalization. The session also addressed what "world model" should mean as a concept and how current LLM-based systems relate to it.

Input Space vs. Latent Space Reconstruction

The session opened with the moderator posing the central debate of the workshop: should world models perform reconstruction in input space or make predictions in a latent space?

Yoshua opened by noting that while humans reason in abstract spaces, each problem calls for a different abstract space, and a neural network's latent space may not carve each domain at its joints. Yann's response introduced his core argument for the session: world models should be hierarchical. Low levels make short-term predictions with high detail; higher levels make longer-range predictions with fewer details. The prediction horizon at which a level is trained, not the choice of architecture, defines its level of abstraction. The representation appropriate for describing what is happening in a room is at the level of psychology and sociology, not particle physics. Beyond hierarchical structure, Yann emphasized a configuring module that sets up the world model for the current situation. The cognitive evidence for something like this is that humans can solve only one conscious task requiring a world model at a time, implying a single configurable engine rather than many specialized ones.

Shirley raised a challenge: where does the supervision for learning hierarchies come from? The only supervision available is the data itself, and different viewpoints (wrist camera versus third-person camera on a robot) may produce different hierarchical structures. She argued that video is the most natural unifying modality for diverse dynamical systems, and that an appropriate tokenization of video would let unified representations emerge without requiring hierarchy to be pre-specified. Yann pushed back on the concern that specifying the abstraction level by fixing the prediction horizon amounts to hardcoding the hierarchy. The analogous objection to fixed pooling in convolutional networks was once considered fatal and turned out to be irrelevant with sufficient data. Yoshua responded that dismissing the concern empirically (pooling-free nets were tried and failed) is not the same as answering it theoretically, and that genuinely capable agents will need systems capable of rich on-the-spot reconfiguration.

A subthread on Yoshua's scientist-AI proposal followed. Shirley asked whether that proposal amounts to a large language model trained on unbiased data. Yoshua resisted the framing: the goal is not to reproduce observed text but to understand why it was produced. An AI that hears "the earth is flat" should not increase its probability of asserting that claim; it should model the causal structure underlying why people say such things. An LLM reproducing its training data is not what you want for a system taking consequential decisions. The architecture might resemble a transformer trained with SGD, but the objective, the representation, and the relationship to the data are fundamentally different.

Planning in Latent Space and JEPA Training

Even with a well-trained JEPA-style world model, the latent space appropriate for dynamical prediction is not necessarily the space appropriate for gradient-based planning. The two objectives favor different geometric structures. Yann's proposal is a separate head on top of the prediction encoder: a smaller-dimensional embedding optimized for planning, with better smoothness and flatness properties than the full prediction space. There is no reason to require these spaces to coincide.
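As a rough illustration of what gradient-based planning in a small, smooth space involves, the sketch below plans an action sequence by gradient descent through a toy one-dimensional linear latent model. Everything in it is an assumption made for illustration: the dynamics `z_next = a * z + b * u`, the function names `rollout` and `plan_by_gradient`, and the hand-derived gradient stand in for a learned planning head and automatic differentiation. None of it comes from the panel.

```python
def rollout(z0, actions, a=0.9, b=0.5):
    """Roll a toy latent model z_next = a * z + b * u forward (hypothetical dynamics)."""
    z = z0
    for u in actions:
        z = a * z + b * u
    return z


def plan_by_gradient(z0, goal, horizon=5, steps=200, lr=0.1, a=0.9, b=0.5):
    """Minimize (z_T - goal)^2 over the action sequence by plain gradient descent."""
    actions = [0.0] * horizon
    for _ in range(steps):
        err = rollout(z0, actions, a, b) - goal
        # For this linear model, d z_T / d u_t = b * a**(horizon - 1 - t).
        grads = [2.0 * err * b * a ** (horizon - 1 - t) for t in range(horizon)]
        actions = [u - lr * g for u, g in zip(actions, grads)]
    return actions


plan = plan_by_gradient(z0=0.0, goal=1.0)
final_latent = rollout(0.0, plan)  # should land very close to the goal
```

The cost in this toy problem is quadratic in the actions and has no bad local minima, which is the kind of smoothness a dedicated planning embedding is meant to provide. A full prediction space offers no such guarantee.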

Error Accumulation in Neural Simulators

An audience question asked about the practical feasibility of training agents inside a neural world model, given that prediction errors accumulate over long rollouts. Yann's answer returned to hierarchy: with hierarchical planning, the number of steps at each level is small because each step corresponds to a coarser action, and error accumulation is correspondingly slower. Planning to take a flight back to New York does not require simulating every moment of muscle control; the high-level plan has two steps, each internally decomposable when needed. An audience member pressed the point that for neural simulation in the Dreamer sense, where a policy is trained entirely inside the world model, hierarchical planning is not available because local fidelity is required at the level of individual actions. Yann's response was that training a policy is just a way to accelerate the inference of planning, and that with a world model you can solve new problems zero-shot without training a policy at all.
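The arithmetic behind the hierarchy argument can be sketched with a standard worst-case recursion: if each model call adds at most `per_step_error` and the dynamics amplify existing error by at most a factor `lipschitz`, the error bound follows `e_next = lipschitz * e + per_step_error`. The numbers below are illustrative assumptions, not figures from the panel, and the high-level model is deliberately given a ten times larger per-step error so the comparison is not rigged in its favor.

```python
def rollout_error_bound(n_steps, per_step_error, lipschitz):
    """Worst-case error after n_steps model calls: e_next = lipschitz * e + per_step_error."""
    err = 0.0
    for _ in range(n_steps):
        err = lipschitz * err + per_step_error
    return err


# Hypothetical numbers: 100 fine-grained steps versus 2 coarse steps.
low_level = rollout_error_bound(n_steps=100, per_step_error=0.01, lipschitz=1.05)
high_level = rollout_error_bound(n_steps=2, per_step_error=0.1, lipschitz=1.05)
```

Under these assumptions the bound for the flat rollout is roughly 26, while the bound for the two-step plan is about 0.2. The sketch does not address the audience objection about Dreamer-style training, where the policy needs fidelity at every low-level step.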

Yoshua added that uncertainty estimation is necessary and neglected. A world model making only point predictions will be exploited by any optimizer: the optimizer will find plans that look good in the model but rely on model errors. Representing epistemic uncertainty explicitly, including uncertainty from finite training time and out-of-distribution states, allows the system to take conservative decisions and flag when it is operating outside its reliable region. This connects directly to reward hacking in reinforcement learning from human feedback: accumulated model error produces a mirage that the optimizer willingly exploits.
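One common proxy for epistemic uncertainty, assumed here purely for illustration and not something Yoshua specified, is disagreement within an ensemble of models. The sketch below compares a planner that trusts a single point prediction with one that penalizes plans on which ensemble members disagree. The plan names, the predicted returns, and the helper `pessimistic_score` are all hypothetical.

```python
from statistics import mean, pstdev


def pessimistic_score(predictions, k=1.0):
    """Mean predicted return minus k times the ensemble disagreement."""
    return mean(predictions) - k * pstdev(predictions)


# Hypothetical predicted returns from three ensemble members for each plan.
ensemble_predictions = {
    "familiar_plan": [1.0, 1.1, 0.9],   # members agree: in-distribution
    "exploit_plan": [5.0, -2.0, 0.5],   # members disagree: likely model error
}

# A planner trusting only the first model picks the plan that model overrates.
point_choice = max(ensemble_predictions, key=lambda p: ensemble_predictions[p][0])

# A planner penalizing disagreement falls back to the plan the models agree on.
cautious_choice = max(
    ensemble_predictions, key=lambda p: pessimistic_score(ensemble_predictions[p])
)
```

The same disagreement signal could be used to flag that the system is outside its reliable region instead of silently choosing a plan.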

Shirley connected this to text as a supervision signal for hierarchy. Text descriptions naturally operate at different temporal granularities, providing implicit supervision at multiple levels of abstraction without requiring a manually specified horizon. Conditioning on "walk to the other side of the room" versus "walk forward one meter" naturally controls the prediction horizon, letting a model learn hierarchical structure from data. Yann noted a limit to this argument: animals without language produce hierarchical world models. A house cat has functional planning capacities that far exceed any current AI world model and does so without text. Language is a useful engineering shortcut for tasks humans care enough to describe verbally, but it is not a fundamental ingredient.

Abstract Representations and Hierarchies

Alessandro raised a structural point about what hierarchy actually requires. When you plan at a high level (go to the airport), the plan is only coherent if you have a reliable expectation that lower levels will handle the contingencies. The higher level depends on the lower level being stable and predictable. This reliability constraint is not automatically satisfied by learning hierarchical representations from data; it requires a guarantee about what the lower levels can be trusted to do, which is a control problem distinct from the representation learning problem. The challenge is not just what the primitive action at each level is, but how to learn what the core components of each level should be. Yann's answer was that ideally this should emerge from training, analogous to how convolutional networks develop hierarchical feature detectors through pooling and subsampling without explicit supervision on the structure of each level.

The moderator asked whether hierarchy needs to be explicitly enforced through separate loss terms at each level, or whether a single final objective suffices. Yann's position was that every stage needs its own prediction error and regularizer, and that the right granularity of a hierarchical stage is an open question. Alessandro added that hierarchical levels are not necessarily refinements of each other. The representation relevant for fine motor control of a limb may be entirely different from the representation relevant for navigating a room, not a coarsening of it. The levels may be complementary rather than nested, each selecting different aspects of the state depending on the task at hand.

The discussion of where hierarchy should live led Yoshua to express a view he described as a correction of a long-held mistake. Thinking of hierarchy as captured by layers in a neural network was a natural starting point for deep learning but is probably wrong. Human abstraction is not fixed by architecture: we construct new levels of description as we learn, and acquiring a new conceptual level does not require retraining everything from scratch. Abstraction should be understood as something dynamic and indirect, not hardcoded in network depth.

Causality, Agency, and Interventions

An audience question addressed causality and interventions. Yoshua clarified that his proposed framework uses a non-agentic core predictor: a model that makes predictions without being biased by any particular goal. Once this predictor exists, agentic systems can be built on top of it by asking what action leads to a desired outcome. The concern is not agency itself but hidden agency, where the AI has learned implicit goals that were not specified. Information-theoretic exploration, selecting actions to maximize information gain about the state of the world, solves one version of action selection without injecting hidden goals. This approach is used in experimental design and scientific discovery, and Yoshua noted it as theoretically solved even if underimplemented in current systems.
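Information-gain action selection can be made concrete in a small discrete setting. The sketch below computes the expected reduction in entropy over a set of hypotheses for each candidate action and picks the most informative one. The two hypotheses, the two candidate actions, and their likelihood tables are hypothetical values chosen for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """likelihood[h][o] = P(outcome o | hypothesis h) for one candidate action."""
    n_hyp, n_out = len(prior), len(likelihood[0])
    gain = entropy(prior)
    for o in range(n_out):
        p_o = sum(prior[h] * likelihood[h][o] for h in range(n_hyp))
        if p_o == 0:
            continue
        posterior = [prior[h] * likelihood[h][o] / p_o for h in range(n_hyp)]
        gain -= p_o * entropy(posterior)  # subtract expected posterior entropy
    return gain


prior = [0.5, 0.5]  # two equally plausible hypotheses about the world
actions = {
    "decisive_experiment": [[1.0, 0.0], [0.0, 1.0]],  # outcome reveals the hypothesis
    "useless_experiment": [[0.5, 0.5], [0.5, 0.5]],   # outcome is independent of it
}
gains = {name: expected_information_gain(prior, lik) for name, lik in actions.items()}
best_action = max(gains, key=gains.get)
```

No goal or reward appears anywhere in the selection rule, which is the property Yoshua highlighted: the action is chosen only for what it reveals about the state of the world.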

On how JEPA learns causal connections, Yann was direct: it does not, in any principled causal sense. It is fed state-action-next-state tuples, which is a causal model in structure, but the system does not perform causal inference. When you infer latent actions from passive data, you may be inferring actions that do not exist. Humans make this mistake too: half of four-year-olds believe wind is caused by the motion of leaves in trees. The existence of religion is evidence that humans are not calibrated causal inference machines. Yann did not take this as a reason to avoid building causal components into AI, but as a caution against assuming causal reasoning emerges for free.

Yoshua pushed back on the limits Yann was implying for interventional reasoning. Humans can reason about interventions they could never perform and counterfactuals about events that never happened. You can imagine what would happen if someone moved the sun, even though you have never experienced that and could never cause it. A model limited to predicting from observed state-action-next-state triples cannot represent such counterfactuals. Causal models with explicit interventional semantics can. Yann's response was that a world model with good predictions can absolutely predict what would have happened under different actions, within limits. The question of whether "within limits" is acceptable depends on what you need the counterfactuals for, and judges ask exactly this kind of question routinely.

Latent Actions and Controllability

Shirley asked how to distinguish controllable from uncontrollable factors when learning latent actions from videos without explicit action labels. Between two observed frames, both agent-caused events and uncontrolled environment dynamics have occurred, and there is no obvious way to separate them from passive observation alone. Yann was direct: we do not know how to do this reliably. It is not even clear in principle whether unexplained variance between frames reflects a variable the agent can influence, an unobserved latent variable in a deterministic world, or genuine stochasticity. Humans also confuse cause and agency regularly, and it would be a significant advance to have AI systems that can figure out the difference through a good causal model. Shirley's practical conclusion was that explicit action labels in the data allow a clean factored model, learning the policy and the dynamics separately, and if you have the data you should use it.

Yoshua added that even with active data, the partition is partial. There are things you do not control but still need to model. In autonomous driving, the rough partition is your own vehicle (controllable), other vehicles (not controllable but must be modeled), and road debris (not controllable, not worth modeling). Passive data cannot reliably establish this partition, and getting it wrong has costs in both directions: over-modeling irrelevant variables and under-modeling relevant ones.

Latent Predictive Loss and Abstraction Levels

An audience member asked whether a single latent predictive loss at the top of the hierarchy is sufficient for learning useful abstraction, or whether auxiliary losses at intermediate layers are necessary. Empirically, adding losses at multiple layers has not produced large gains over a single global latent predictive loss. Yann's response was candid: there has not been enough work on hierarchical models done correctly to have a clear empirical answer. A confounding factor is that current benchmarks are almost entirely semantic tasks, which reward high-level representations. If the benchmark required visual servoing or fine-grained motor control, the distribution of useful representations would look different. His expectation is that every level in the hierarchy will eventually need its own prediction error and regularizer, and that completely generic information maximization objectives are not sufficient. Some bias toward the types of features that should be learned will be necessary.

Yoshua reiterated his view that treating hierarchy as something captured by the layers of a neural network has been a mistake throughout his career: abstraction is dynamic, built up as we learn, and not hardcoded in network depth. His position is that latent variables are the right framework, since the whole point of latent representations is to avoid requiring explicit labels for what the hidden units should represent. Shirley noted that brain activity data from transparent organisms like zebrafish is being collected at increasing scale, which would provide more direct hierarchical supervision. Yann and Yoshua agreed that even if that data became available at human scale, the more fundamental point is that humans construct abstractions without such labels, and that is what a good learning system should be able to do.

The moderator raised a question about what abstraction should be measured with respect to. Information about the input cannot increase as you go through the layers of a network, so raw information content is not the right quantity. What matters is what is useful for downstream tasks, whether that is linear probing, gradient-based planning, or something else. Yoshua's response was that useful abstraction does not require a downstream task to be defined in advance. Science finds compact representations that explain observed data, uncovering regularities and laws, before any particular experiment is designed. Before microscopes existed, a whole set of abstractions was invisible because the data revealing the need for them did not exist. The right abstractions emerge from the data, not from the task.

Imitation Learning, Reinforcement Learning, and World Models

An audience member asked Yann whether imitation learning and reinforcement learning remain valuable given a world model. On imitation learning: many intelligent animals never engage in it, because they never meet their parents or conspecifics. An octopus lives a year or two, matures alone, and develops sophisticated behavior without social learning. Imitation is useful for social animals like humans, but it is not a prerequisite for intelligence. On model-free reinforcement learning: a car-driving agent trained without a world model must drive off a cliff thousands of times before learning the cliff is dangerous, and when it encounters a different cliff it must learn again. This approach has been tried, companies were built around it, and it failed. Reinforcement learning is best understood as the cherry on top: a fine-tuning mechanism operating on top of a world model, not the engine of learning itself.

One role reinforcement learning does play well is learning a cost function. A large part of what RL from human feedback currently does is learn a reward model, which is just a learned cost function you can then backpropagate through during world-model-based planning. The concern the audience member had about real-time planning efficiency (MPC with the cross-entropy method requires thousands of concurrent simulations, which is not reactive) was addressed by the same hierarchy argument: with hierarchical planning, you plan two or three high-level actions rather than thousands of low-level ones. Familiar tasks become system one, handled by a cached policy without planning. Gradient-based planning is also an active direction that would be more efficient than sampling-based methods, with known difficulties around local minima that constitute an active research agenda.
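For readers unfamiliar with the planner the audience member had in mind, here is a minimal sketch of the cross-entropy method used for MPC on a toy problem. The dynamics `x_next = x + u`, the cost, the sample counts, and the function names are assumptions for illustration. A real system would roll out a learned world model instead of this one-line update.

```python
import random
from statistics import mean, pstdev


def rollout_cost(x0, actions, goal):
    """Toy dynamics x_next = x + u; cost is summed squared distance to the goal."""
    x, cost = x0, 0.0
    for u in actions:
        x = x + u
        cost += (x - goal) ** 2
    return cost


def cem_plan(x0, goal, horizon=3, samples=200, elites=20, iterations=15, seed=0):
    """Cross-entropy method over action sequences; returns the refined mean plan."""
    rng = random.Random(seed)
    mu = [0.0] * horizon
    sigma = [1.0] * horizon
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [rng.gauss(mu[t], sigma[t]) for t in range(horizon)]
            for _ in range(samples)
        ]
        # Keep the lowest-cost candidates and refit the Gaussian to them.
        candidates.sort(key=lambda acts: rollout_cost(x0, acts, goal))
        best = candidates[:elites]
        mu = [mean(c[t] for c in best) for t in range(horizon)]
        sigma = [pstdev([c[t] for c in best]) for t in range(horizon)]
    return mu


plan = cem_plan(x0=0.0, goal=1.0)
first_action = plan[0]  # MPC executes only this action, then replans
```

Even this three-step toy plan costs 200 × 15 = 3,000 simulated rollouts per decision, which is the reactivity concern raised in the question. Shortening the horizon through hierarchy, or replacing sampling with gradients, attacks that cost from two different sides.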

World Models vs. Policies: Generalization and Sample Efficiency

Alessandro challenged the claim that world models generalize better than policies. A policy trained on sufficient data also generalizes to new cliffs; a world model trained on limited data still fails when planning takes it out of distribution. Yann's fundamental response was system two: you cannot train a single system one (policy) that covers every situation, but you can train a world model, and once you have it you can do runtime optimization for situations the cached policy was never trained on. A policy is task-specific, requiring separate training data for each task. The best current robotic systems require large amounts of imitation data for each task they perform, which is why, in Yann's view, every robotics company betting that this per-task approach will scale lacks a clear path to making its robots broadly useful.

Yann's mathematical argument was that the number of bits needed to represent an optimal policy for all possible goals is exponentially larger than the number of bits needed to represent the world model. Any function that maps a goal and state to the best action encodes an enormous amount of task-specific information. A world model just encodes how the world transitions. Given the same training budget, the world model is a much better bet. Yoshua sharpened the argument: the inference process for planning produces its output by optimization, which is computationally unlimited in principle. You cannot reduce every computational problem to a fixed-depth forward pass through a policy network, but you can reduce it to an optimization problem over a world model.
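One way to get a feel for the counting argument is to compare naive lookup-table sizes in a tabular setting. The sketch below assumes a deterministic world with `n_states` states and `n_actions` actions, and takes a "goal" to be any subset of states marked as desirable, so there are `2 ** n_states` possible goals. This is one possible reading, assumed here for illustration: the panel did not spell out the counting, and a table size is an upper bound on description length, not a minimal encoding.

```python
import math


def world_model_bits(n_states, n_actions):
    """Table storing one next state for every (state, action) pair."""
    return n_states * n_actions * math.log2(n_states)


def universal_policy_bits(n_states, n_actions):
    """Table storing one action for every (goal, state) pair.

    Assumption: a goal is any subset of desirable states, giving 2**n_states goals.
    """
    n_goals = 2 ** n_states
    return n_goals * n_states * math.log2(n_actions)


wm_bits = world_model_bits(n_states=16, n_actions=4)
policy_bits = universal_policy_bits(n_states=16, n_actions=4)
```

With 16 states and 4 actions the transition table needs 256 bits while the goal-conditioned policy table needs 2,097,152 bits, and the gap widens exponentially with the number of states. Alessandro's statistical objection is untouched by this count, since it concerns how well either table can be learned from limited data.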

Alessandro's counterargument was statistical: regardless of how compact the world model is in principle, the optimization process during planning will generate queries that go out of the world model's training distribution, and error accumulates. This is the same distributional shift problem that policies face. Yoshua's response was that uncertainty estimation is precisely the solution, and hierarchy is a complementary one. The discussion converged on the single-task versus multi-task distinction as the crux: Rich Sutton's earlier arguments for model-free methods were framed for single-task settings, where learning an optimal policy can indeed be simpler than learning full dynamics. The current regime, where many tasks share a common world, changes the calculation entirely.

What Is a World Model? Definitions and Scope

An audience member asked how to cleanly define "world model" across diverse systems such as AlphaFold, AlphaGo, and current coding agents. Yann offered a formal characterization: a world model is a system that, given a state and an action, predicts the next state. Code execution fits naturally into this frame. The state of a program is entirely determined by the values of its variables and the call stack; instructions are actions; a world model of code predicts the program state resulting from executing an instruction. Work along these lines (Code World Models, from FAIR Paris) has shown this framing is tractable. For planning, you would ideally want a JEPA-style model operating over abstract state representations rather than raw tokens: instead of storing a million-element array in context, you represent it as "sorted" or "unsorted" and reason at that level.
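The state, action, next-state framing of code can be sketched with a toy interpreter. Here `step` plays the role of the world model, although it executes instructions exactly where a learned model would only approximate, and `abstract_view` shows the "sorted" versus "unsorted" level of description. The three-instruction language and both function names are hypothetical and are not taken from the Code World Models work.

```python
def step(state, instruction):
    """Return the next program state for a (state, instruction) pair."""
    op, target, *args = instruction
    new_state = dict(state)  # leave the input state untouched
    if op == "set":
        new_state[target] = args[0]
    elif op == "add":
        new_state[target] = state[target] + args[0]
    elif op == "sort":
        new_state[target] = sorted(state[target])
    else:
        raise ValueError(f"unknown instruction: {op}")
    return new_state


def abstract_view(state):
    """Describe each list as 'sorted' or 'unsorted' instead of storing its elements."""
    view = {}
    for name, value in state.items():
        if isinstance(value, list):
            view[name] = "sorted" if value == sorted(value) else "unsorted"
        else:
            view[name] = value
    return view


s0 = {"xs": [3, 1, 2], "n": 0}
s1 = step(s0, ("sort", "xs"))
s2 = step(s1, ("add", "n", 1))
```

A planner working on `abstract_view(s0)` only needs to know that `sort` turns "unsorted" into "sorted", regardless of whether the list has three elements or a million.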

Whether current frontier LLMs have an implicit world model of code was contested. Yann's view was probably not, at least not in a form usable for planning, because there is no explicit distinction between state and action encoded in the architecture. Yoshua's view was more permissive: natural language lets you express conditional reasoning ("if I do this, what will the output be"), so the models have been trained on enormous amounts of action-consequence structure and likely encode some of it. Current code generation systems do use a form of search: generate many candidate programs, evaluate them against an execution oracle, and select. This is a primitive form of tree exploration, analogous to MCTS in game playing, and one that computers are particularly good at even if humans are not. Yann noted that code and mathematics are two domains where reasoning in the symbol space is genuinely productive, and frontier labs appear to be betting that this can be extended to ML research itself as a path to unlocking everything else, including robotics. Whether that path works is an open question.

Shirley offered a functional definition to close the definitional discussion: a world model is whatever allows an agent to predict future observations conditioned on past observations and actions. By this definition, narrow and general models both qualify. The more useful question is what role the model plays. Online reinforcement learning for long-horizon agents is extremely expensive, whether in physical robot interactions or multi-hour coding tasks. A world model serves as an intermediary between the agent and the real world, letting the agent simulate outcomes without paying the full cost of interaction. Beyond predicting outcomes, a world model can predict the time cost of actions (an underexplored dimension in the standard MDP formulation) and thus allow agents to make smarter decisions about which actions are worth attempting. The case for world models is ultimately a case for making long-horizon agentic behavior tractable rather than prohibitively expensive.