Brainstorm Session: Generalizing Backprop

Lab discussion with Yoshua Bengio and students

Overview

A lab brainstorm session led by Yoshua Bengio, exploring how to extend the principles of backpropagation to settings it currently cannot handle: discrete decisions, sparse credit assignment over long sequences, and non-smooth transformations. The session covers the limits of gradient-based credit assignment, comparisons with REINFORCE and biological learning, the problem of localizing credit to sparse relevant events, compartmentalization through abstract models, the role of intermediate rewards and learned credit machinery, the challenge of discrete actions and entropy injection, and the surprising smoothness of loss surfaces in deep networks.

1. Introduction: Generalizing Backpropagation

Backpropagation has been the engine behind essentially all of modern deep learning, but it applies only to smooth, differentiable computations, or to cases where the non-differentiable parts can be safely ignored (as with the kink of a rectifier at zero). The more general question is whether there is a principled credit assignment mechanism that handles discrete decisions, hard nonlinearities, and truly discontinuous transformations. The session opened with Yoshua noting that the success of backprop may itself be part of the problem: the field has naturally gravitated toward models that work well with backprop and away from models that do not. Conditional Gaussian mixtures, for example, are hard to train with backprop and consequently underexplored. If backprop were less effective, the incentive to find alternatives would be stronger.

2. The Curse of Backprop’s Success and Noisy Gradients

Optimizers like AdaGrad and RMSprop, which often outperform plain gradient descent, are themselves a signal that the raw gradient is not the best available update direction. Mini-batches introduce noise into the gradient estimate at every step. Even in domains where backprop works well, what is actually being used is a noisy estimate of the gradient of a training objective that itself only approximates the true objective. This does not mean backprop is bad; it means that its dominance may be partly a self-reinforcing artifact of the research choices it enables rather than evidence that it is uniquely correct. Models that do not support backprop well are less explored, not because they have been tried and found wanting, but because the field has not developed the tools to train them.
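As a minimal sketch of why the raw gradient is rarely applied as-is, the snippet below compares a plain gradient step with an RMSprop-style step on a badly scaled gradient. The function names, learning rate, decay constant, and the example gradient are illustrative assumptions, not values from the session.

```python
import math

def sgd_step(grad, lr=0.01):
    """Plain gradient descent: the step is the raw gradient scaled by lr."""
    return [-lr * g for g in grad]

def rmsprop_step(grad, state, lr=0.01, rho=0.9, eps=1e-8):
    """RMSprop-style step: divide each coordinate by a running RMS of its gradient.

    `state` holds the running average of squared gradients, one entry per
    coordinate, and is updated in place.
    """
    step = []
    for i, g in enumerate(grad):
        state[i] = rho * state[i] + (1.0 - rho) * g * g
        step.append(-lr * g / (math.sqrt(state[i]) + eps))
    return step

# Assumed example: two coordinates whose gradients differ by four orders of magnitude.
grad = [100.0, 0.01]
plain = sgd_step(grad)
adaptive = rmsprop_step(grad, state=[0.0, 0.0])
```

On this first step the plain update moves the first coordinate about ten thousand times further than the second, while the rescaled update moves both by nearly the same amount. The optimizer keeps the sign of each gradient coordinate but discards most of its magnitude.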

3. Credit Assignment versus Biological Plausibility

REINFORCE is the dominant alternative to backprop for credit assignment across discrete or stochastic computations. A recurring suspicion in the discussion was that REINFORCE does not scale: as the number of units or discrete decisions grows, the variance of gradient estimates explodes. Backprop has something REINFORCE lacks, and characterizing precisely what that something is would be a useful contribution.
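The scaling concern can be illustrated with a toy experiment. The sketch below assumes independent Bernoulli units and a reward equal to the number of active units; both choices, along with all names and constants, are hypothetical and chosen only to make the variance easy to see.

```python
import random

def reinforce_grad_samples(n_units, n_samples=4000, p=0.5, seed=0):
    """Single-sample REINFORCE estimates of d E[f(b)] / d logit_1.

    Toy assumption: b has n_units independent Bernoulli(p) bits and the
    reward is f(b) = sum(b). The estimator is f(b) * (b_1 - p), that is,
    the reward times the derivative of log P(b) w.r.t. the first logit.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        b = [1.0 if rng.random() < p else 0.0 for _ in range(n_units)]
        samples.append(sum(b) * (b[0] - p))
    return samples

def mean_and_variance(xs):
    """Empirical mean and (biased) variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

# Map from number of units to (mean estimate, variance of the estimate).
results = {n: mean_and_variance(reinforce_grad_samples(n)) for n in (4, 16, 64)}
```

In this setup the true gradient for the first unit is p(1 - p) = 0.25 regardless of how many other units exist, but the variance of the single-sample estimate grows roughly quadratically with the number of units, because the reward carries the noise of every other unit's decision.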

Memory was identified as one key difference. Backprop requires storing all activations from the forward pass, with memory requirements scaling with sequence length. A brain using something like backprop would need to retain all neural activations indefinitely, which is implausible. One proposed resolution is to store key checkpoints and reconstruct intermediate states when needed rather than storing everything directly. This connects to the well-documented phenomenon of reverse replay in rats: after navigating a maze, firing patterns associated with each step replay in reverse temporal order, as if the brain were reconstructing the trajectory for the purposes of credit assignment.
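The checkpoint idea can be sketched on a scalar recurrence. The example below assumes a toy chain h_{t+1} = tanh(w * h_t) with loss equal to the final state; the function names and the checkpoint spacing `k` are illustrative, and this is a sketch of the general technique rather than anything proposed in the session.

```python
import math

def step(w, h):
    return math.tanh(w * h)

def grad_full(w, h0, T):
    """Backprop through T steps while storing every activation."""
    hs = [h0]
    for _ in range(T):
        hs.append(step(w, hs[-1]))
    g, grad_w = 1.0, 0.0              # g is dL/dh_{t+1}, with L = h_T
    for t in reversed(range(T)):
        local = 1.0 - hs[t + 1] ** 2  # derivative of tanh at step t
        grad_w += g * local * hs[t]
        g *= local * w
    return grad_w, len(hs)            # gradient, number of stored states

def grad_checkpointed(w, h0, T, k):
    """Same gradient, storing only every k-th state and recomputing the rest."""
    checkpoints = {0: h0}
    h = h0
    for t in range(1, T + 1):
        h = step(w, h)
        if t % k == 0 and t < T:
            checkpoints[t] = h
    g, grad_w, peak, end = 1.0, 0.0, 0, T
    for start in sorted(checkpoints, reverse=True):
        # Reconstruct the states between this checkpoint and the next one.
        seg = [checkpoints[start]]
        for _ in range(end - start):
            seg.append(step(w, seg[-1]))
        peak = max(peak, len(checkpoints) + len(seg))
        for i in reversed(range(end - start)):
            local = 1.0 - seg[i + 1] ** 2
            grad_w += g * local * seg[i]
            g *= local * w
        end = start
    return grad_w, peak               # gradient, peak number of stored states
```

For example, `grad_full(0.9, 0.5, 100)` and `grad_checkpointed(0.9, 0.5, 100, 10)` return the same gradient, but the second holds at most 21 states at a time instead of 101, at the cost of running the forward computation a second time.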

4. The Memory Problem: The Study Example

A separate challenge is not just the volume of stored activations but the sparsity of relevant events. If a student makes a single decision early in the semester, not to study, and receives a failing grade months later, the relevant event for credit assignment is buried among thousands of irrelevant ones. Any mechanism that distributes credit softly across all past events faces a dilution problem: if the number of past events is large, the credit assigned to any one relevant event becomes negligible. Current memory networks and attention-based mechanisms require soft attention over many possible contributing events, which becomes intractable at the scale of months of life. There must be some prior or structural constraint that localizes credit assignment to the handful of events that actually matter.
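A back-of-the-envelope sketch of the dilution effect, assuming a softmax over candidate events in which exactly one event scores higher than the rest. The scores and event counts are made up for illustration.

```python
import math

def attention_on_relevant(n_events, relevant_score=5.0, other_score=0.0):
    """Softmax weight placed on one relevant event among n_events candidates.

    Toy assumption: exactly one event has a higher score and all others
    share the same lower score.
    """
    top = math.exp(relevant_score)
    rest = (n_events - 1) * math.exp(other_score)
    return top / (top + rest)

# Weight on the relevant event as the number of remembered events grows.
weights = {n: attention_on_relevant(n) for n in (10, 1_000, 1_000_000)}
```

With a fixed score gap, the weight on the relevant event is above 0.9 among ten candidates and falls below one in a thousand among a million. The relevant event keeps its share of credit only if its score advantage grows with the logarithm of the number of candidates.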

5. Compartmentalization and Backpropagating Through Abstract Models

One resolution is abstraction. Rather than assigning credit through every second of a three-month period, you backpropagate through a simplified model: studying leads to a good grade; not studying leads to a bad grade. The abstract model compresses a long sequence into a small number of steps, and credit assignment through those steps is tractable. This is the intuition behind actor-critic methods in reinforcement learning: the critic is a learned proxy for how the world responds to actions, and backpropagation happens through the proxy, not through the real world.
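A minimal sketch of credit assignment through a learned proxy, under strong simplifying assumptions: a one-dimensional action, a noisy black-box reward that can be queried but not differentiated, and a quadratic critic fitted by least squares. All names, the reward function, and the constants are hypothetical.

```python
import random

def world_reward(action, rng):
    """Black box standing in for the real world: queryable, not differentiable."""
    return -(action - 2.0) ** 2 + rng.gauss(0.0, 0.1)

def fit_quadratic_critic(actions, rewards):
    """Least-squares fit of reward ~ c0 + c1*a + c2*a^2 via the normal equations."""
    feats = [[1.0, a, a * a] for a in actions]
    # Augmented matrix [X^T X | X^T y].
    m = [[sum(f[i] * f[j] for f in feats) for j in range(3)]
         + [sum(f[i] * r for f, r in zip(feats, rewards))] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def improve_action(critic, action=0.0, lr=0.1, steps=200):
    """Gradient ascent on the critic's prediction, not on the world itself."""
    _, c1, c2 = critic
    for _ in range(steps):
        action += lr * (c1 + 2.0 * c2 * action)  # d critic / d action
    return action

rng = random.Random(0)
actions = [rng.uniform(-1.0, 5.0) for _ in range(200)]
rewards = [world_reward(a, rng) for a in actions]
critic = fit_quadratic_critic(actions, rewards)
best_action = improve_action(critic)
```

The action is improved using only the derivative of the fitted critic. The world is consulted to collect (action, reward) pairs but never to supply a gradient, which is the division of labor the paragraph above describes.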

A clarification that recurred in the discussion: backprop and gradient descent are distinct. Backprop is the credit assignment mechanism that computes how much each parameter contributed to the loss. Gradient descent uses those contributions to update the parameters. The goal of generalizing backprop is to generalize the credit assignment step, not merely to replace the update rule.
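The separation can be made concrete with a two-parameter toy model. In the sketch below, assumed for illustration only, `backprop` computes the credit signal and two different update rules consume the same signal.

```python
import math

def loss(w1, w2, x, target):
    return (w2 * math.tanh(w1 * x) - target) ** 2

def backprop(w1, w2, x, target):
    """Credit assignment: return dL/dw1 and dL/dw2 via the chain rule."""
    h = math.tanh(w1 * x)
    y = w2 * h
    dy = 2.0 * (y - target)        # dL/dy
    dw2 = dy * h                   # dL/dw2
    dh = dy * w2                   # dL/dh
    dw1 = dh * (1.0 - h * h) * x   # dL/dw1, through the tanh
    return dw1, dw2

def gradient_descent_update(params, grads, lr=0.1):
    """One possible consumer of the credit signal."""
    return tuple(p - lr * g for p, g in zip(params, grads))

def sign_update(params, grads, lr=0.01):
    """A different update rule consuming exactly the same credit signal."""
    return tuple(p - lr * math.copysign(1.0, g) for p, g in zip(params, grads))

params = (0.5, -0.3)
grads = backprop(*params, x=1.5, target=1.0)
after_gd = gradient_descent_update(params, grads)
after_sign = sign_update(params, grads)
```

Swapping `gradient_descent_update` for `sign_update` changes the optimizer without touching the credit assignment. Generalizing backprop, in the sense discussed here, means replacing the `backprop` function rather than either update rule.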

6. Intermediate Rewards and Learning Predictive Models

Most courses do not rely on a single final exam, and that design reflects a genuine insight about credit assignment. Intermediate feedback breaks a long-horizon credit assignment problem into manageable pieces. The analog in machine learning is the challenge of deriving principled intermediate rewards from a sparse final signal.

One approach is to learn credit assignment machinery in the reverse direction. Encoder-decoder architectures, variational autoencoders, and Helmholtz machines are all, at some level, learning an inverse mapping: a function that propagates credit backward through computation, indicating what earlier layers should have produced to get a better outcome. These are not just compression techniques but credit assignment devices.

A separate observation is that humans appear to maintain multiple internal models simultaneously and switch between them depending on the problem at hand. A system can learn a collection of models relevant to different domains and select among them as needed. If a good model of the world can be learned from observations of actions and states, then backpropagation through that proxy becomes available even for settings where backpropagating through the real world is impossible.

7. Dealing with Discrete Actions and Entropy

If actions are continuous, you can backpropagate through a learned reward model directly. If actions are discrete, this fails. The straight-through estimator handles the problem by approximation: treat the discrete sample as if it were continuous for the purpose of computing gradients through the distribution parameters. This produces a useful signal in practice but lacks clean theoretical justification.
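To make the approximation concrete, the sketch below compares the exact gradient with the expected straight-through estimate for a single binary unit. It assumes one common variant of the estimator (the downstream derivative multiplied by the sigmoid derivative) and a made-up quadratic loss. Expectations are computed in closed form rather than by sampling.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f(b):
    """Toy downstream loss, differentiable in b even though b is sampled as 0 or 1."""
    return (b - 0.8) ** 2

def df_db(b):
    return 2.0 * (b - 0.8)

def exact_grad(theta):
    """True gradient of E[f(b)] w.r.t. theta, where b ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (f(1.0) - f(0.0))

def expected_straight_through_grad(theta):
    """Expectation of the straight-through estimate.

    For each sampled b, the estimator backpropagates df/db as if b were the
    continuous value sigmoid(theta), giving df_db(b) * sigmoid'(theta).
    """
    p = sigmoid(theta)
    return p * (1.0 - p) * (p * df_db(1.0) + (1.0 - p) * df_db(0.0))
```

For this particular loss the two quantities coincide at `theta = 0`, but at `theta = 2` the exact gradient is negative while the expected straight-through estimate is positive. The estimator can give a usable direction in some regimes and a wrong one in others, which is one way to read the remark that it lacks clean theoretical justification.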

A more principled treatment involves treating actions as samples from a distribution and computing gradients with respect to the distribution parameters. This naturally introduces entropy: any model that maintains a distribution over actions has implicit pressure toward exploration, because concentrated distributions provide weaker gradient signal than diffuse ones. The entropy connection appears across several architectures, including VAEs and wake-sleep models. Adding noise and entropy is not simply a hyperparameter choice; it is nearly a necessity for any system that must explore.
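One way to see the link between concentration and weak gradient signal is to take the simplest case, a single binary action parameterized by a logit. This is an assumed setting for illustration, not a model from the session.

```python
import math

def bernoulli_entropy(p):
    """Entropy in nats of a Bernoulli(p) action distribution."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def logit_sensitivity(p):
    """dp/dlogit for p = sigmoid(logit); also the variance of the score (b - p)."""
    return p * (1.0 - p)

# (probability, entropy, sensitivity) for increasingly concentrated policies.
table = [(p, bernoulli_entropy(p), logit_sensitivity(p))
         for p in (0.5, 0.9, 0.99, 0.999)]
```

Both columns shrink together as the distribution concentrates: at p = 0.999 the sensitivity of the action probability to its logit is about 0.001, compared with 0.25 at p = 0.5. A nearly deterministic policy therefore receives very little signal about the action it is not taking, which is the pressure toward keeping some entropy.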

On a broader point, the brain has almost certainly learned a predictive model of the world, because there are events for which no direct training signal exists. You have never experienced a fatal car accident, yet you can reason about how to avoid one. That reasoning must go through a model of consequences rather than through direct experience, which is the basic argument for model-based reinforcement learning as a necessity, not merely an engineering preference.

8. Real-world Complexity and the Smoothness of Cost Functions

An empirical observation that surfaced toward the end of the session: visualizations of the loss surface of deep rectifier networks, projected into two or three dimensions, look surprisingly smooth. Each rectifier introduces a piecewise-linear kink, and the composition of many such kinks should in principle produce an irregular landscape. In practice, the landscape appears nearly smooth.

Several explanations were proposed. Averaging over many training examples smooths out individual kinks. The visualizations are projections of a very high-dimensional surface, and the kinks may be present but aligned orthogonally to the projection directions, making them invisible in the projected view. The near-linearity of neural networks in input space, which also explains why adversarial examples can be found with a single gradient step, may extend to parameter space as well. Separately, the loss along the straight line from initialization to a trained minimum has been observed to decrease smoothly, with no upward excursions. This is consistent with a landscape that is genuinely less irregular than the piecewise structure of individual rectifiers would suggest.
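The first explanation, averaging over examples, can be checked on a one-parameter toy problem. The sketch below assumes each example contributes a single rectifier whose kink sits at a different location; the kink positions and grid resolution are arbitrary choices.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def max_slope_jump(fn, lo=-1.0, hi=1.0, n_grid=2000):
    """Largest change in finite-difference slope between adjacent grid cells."""
    h = (hi - lo) / n_grid
    values = [fn(lo + i * h) for i in range(n_grid + 1)]
    slopes = [(values[i + 1] - values[i]) / h for i in range(n_grid)]
    return max(abs(slopes[i + 1] - slopes[i]) for i in range(n_grid - 1))

def single_example_loss(w):
    """One rectifier with its kink at w = 0.0137 (arbitrary location)."""
    return relu(w - 0.0137)

# Hypothetical dataset: 100 examples whose kinks are spread over (-1, 1).
kinks = [-0.99 + 0.02 * i for i in range(100)]

def averaged_loss(w):
    return sum(relu(w - c) for c in kinks) / len(kinks)

jump_single = max_slope_jump(single_example_loss)
jump_averaged = max_slope_jump(averaged_loss)
```

The single-example curve has one abrupt slope change of order one, while the averaged curve has a hundred slope changes of order one hundredth each. The kinks do not disappear, but each shrinks in proportion to the number of examples, so the averaged surface looks smooth at any plotting resolution.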