Who Samples the Predictors?
May 2026
In modern models, generation is synonymous with sampling. To sample from a distribution $p$, there are two options:
- Measure a physical process which, due to unaccounted variations or fundamental uncertainty, generates measurements distributed according to $p$. Radioactive decay, for example, is fundamentally uncertain: the waiting time between decay events follows an exponential distribution.
- Sample a different distribution $q$ and use a function to map samples from $q$ into samples from $p$. Such a function is guaranteed to exist for any pair of continuous distributions $p$ and $q$: compose the CDF of $q$ with the inverse CDF of $p$.
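To make option two concrete, here is a minimal sketch of the classic conversion function, inverse-transform sampling: a uniform sample plays the role of $q$, and the inverse CDF of the exponential distribution (the example from option one) imposes the structure of $p$. The rate value is made up for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible

def sample_exponential(rate: float) -> float:
    """Inverse-transform sampling: map a uniform sample into an
    exponential one via the inverse CDF F^{-1}(u) = -ln(1 - u) / rate."""
    u = random.random()                 # randomness: a sample from q = Uniform(0, 1)
    return -math.log(1.0 - u) / rate    # structure: the function imposes p

samples = [sample_exponential(rate=2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))      # should be close to 1 / rate = 0.5
```

Note that the computer's pseudo-random generator stands in for a physical source of randomness here; only the structure-imposing function changes when we target a different $p$.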
A sample requires two things: randomness and structure. In option one, the physical process provides both. In option two, the process behind $q$ provides the randomness and the external function imposes the structure. We only have two options because there are only two ways to combine randomness and structure.
For most interesting distributions, option one is synonymous with labor. Say we want to sample from the distribution of Monet's paintings. The physical process that generates such samples is Claude Monet himself. The only way to sample this process is to give Monet food, wait for him to execute, and observe his outputs. Unfortunately, the Monet process stopped working almost 100 years ago. Fortunately, we are starting to find functions that can map simple samples into things like Monet's paintings.
With such functions, we can mass-produce the results of rare, slow, or expensive physical processes. Not only Monet's paintings, but also code, math, music, and movies. Everything can be viewed as a sample from some distribution. Anything that is a sample can be generated by converting from one distribution into another. In this essay, we will see how the framework of sampling and conversion can be used to connect discriminative and generative models.
Discriminative Models
The first successes in AI were discriminative models. Discriminative models are based on splitting individual data points into predictors and targets. The predictors are typically denoted by $x_i$ and the targets by $y_i$. It is worth emphasizing that $x_i$ and $y_i$ are both part of the same observation $d_i$. This separation into predictors and targets is an arbitrary decision imposed on the observations.
Usually, the separation reflects some goal we are trying to achieve. For example, if an observation $d_i$ consists of an audio recording of a speech and its text transcript, we can use the audio to predict the text (speech recognition), or we can use the text to predict the audio (text-to-speech). It is the same observation, just a different direction of discrimination.
The Same Equation
Recent models are referred to as generative. How are generative models different from discriminative ones? We can write both in the same general form:
$y_i = f_{\theta}(x_i) + \epsilon$
where $x_i$ is our predictor, $y_i$ is our target, $f_{\theta}$ is a function parametrized by $\theta$, and $\epsilon$ is some random noise. How are discriminative and generative models different if the equation describing them is the same? The difference lies in the source of the predictor. Generative models use computer hardware to sample predictors. Discriminative models require measuring external processes.
Both convert samples from one distribution into another, but the input distributions are different. Typically, discriminative inputs are sampled from complex real-world distributions, like the sound of a person's voice or the pixels in an image. Generative inputs are computer-generated random numbers. Usually, this is described as “discriminative models don't model the data distribution $P(X)$, only the conditional distribution $P(Y \mid X)$.” While accurate, I have never found this description satisfactory. We are always generating an output based on an input. Focusing on sampling and conversion emphasizes the fundamental similarity between discriminative and generative models as maps between distributions, not the arbitrary placement of the conditioning bar.
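As a toy illustration (with a made-up linear stand-in for $f_{\theta}$, not any real trained model), the equation is literally the same in code; only the line that produces the predictor changes:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(x: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a learned function; a trained
    # network would go here.
    return 2.0 * x + 1.0

# Discriminative: the predictor is a measurement of an external
# process (a fixed array standing in for, say, audio features).
x_measured = np.array([0.3, -1.2, 0.8])
y_discriminative = f_theta(x_measured) + rng.normal(scale=0.1, size=3)

# Generative: the predictor is sampled internally by the computer.
z_sampled = rng.standard_normal(3)
y_generative = f_theta(z_sampled) + rng.normal(scale=0.1, size=3)
```

In both branches the last line is $y_i = f_{\theta}(\cdot) + \epsilon$; the only difference is whether the argument was measured or sampled.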
The Pairing Problem
Why do we want models that internally sample inputs? Let's return to Monet. An observation of Monet's water lilies consists of millions of pixels. Like any observation, it can be viewed as a sample from a million-dimensional $p_{\text{data}}$. For a painting, there is no natural separation into “predictor pixels” and “target pixels.” How do we generate new samples in this scenario? We can still use Monet's pixels as targets, but we have to sample the predictor ourselves. To illustrate, let's sample a made-up predictor $z_i$ from some simple distribution $P_Z$ and apply the same equation:
$y_i = f_{\theta}(z_i) + \epsilon.$
Unfortunately, a simple random pairing between sampled predictors $z_i$ and observations $y_i$ will not work. A random pairing, by definition, makes the predictors independent of the observations, and this independence destroys generalization. Independence means there is no pattern for $f_{\theta}$ to pick up. Without a pattern, $f_{\theta}$ has no choice but to memorize a lookup table. If $f_{\theta}$ lacks the capacity for a full lookup table, the loss minimizer collapses to some average over the observations. Either way, there is no opportunity for generalization because there is no structure in the data.
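A tiny numerical sketch of this collapse, with made-up numbers: pair independently sampled predictors with the targets and fit a model without lookup-table capacity (here a least-squares line), and the fit degenerates to the target mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Targets: observations from some data distribution.
y = rng.normal(loc=3.0, scale=1.0, size=10_000)

# Predictors: sampled independently, then randomly paired with the targets.
z = rng.standard_normal(10_000)

# Fit a linear model y ~ a*z + b by least squares.
a, b = np.polyfit(z, y, deg=1)
print(a, b)  # a is near 0: z carries no information about y, so the
             # fit collapses to the constant b, an average over the targets
```

The slope vanishes because there is no structure linking $z_i$ to $y_i$; the best the restricted model can do is predict the mean of the observations.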
In discriminative models, the right pairing is provided when we condition within an observation. For generative models, creating the right pairings is a fundamental challenge. VAEs use an encoder to connect the predictors to targets during training. Normalizing flows constrain the hypothesis space to bijections and thereby implicitly force a structured pairing. Diffusion and flow models use a slightly noised version of the target as the predictor in the pair. GANs allow the pairings to be random, but suffer from training instability and mode collapse as a consequence. Autoregressive approaches entirely avoid sampling predictors during training and instead pair targets with previous parts of the same observation. They treat generation as a series of discriminative tasks.
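As one concrete example of a structured pairing, here is a minimal sketch of the diffusion-style idea mentioned above, using a hypothetical make_diffusion_pair helper: the predictor is a noised copy of the target, so a learnable pattern exists by construction (real diffusion models use a schedule of noise levels; a single level is shown for simplicity).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_diffusion_pair(y: np.ndarray, noise_level: float):
    """Diffusion-style pairing: the predictor is a noised copy of the
    target, so predictor and target are correlated by design."""
    z = y + noise_level * rng.standard_normal(y.shape)
    return z, y

y = rng.normal(size=1000)           # stand-in for real observations
z, y = make_diffusion_pair(y, noise_level=0.1)
print(np.corrcoef(z, y)[0, 1])      # close to 1: a pattern to learn
```

Contrast this with the random pairing above: there the predictor-target correlation was zero by construction; here it is nearly one.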
Conclusion
Generation can be viewed as sampling. To sample, we need randomness and structure. Machine learning methods let us find functions that provide interesting structure to random samples. Discriminative models force us to sample predictors externally. Generative models do all the sampling internally. They both map samples from one distribution into another. Internal sampling requires a structured pairing between predictors and targets. This complicates generative models compared to discriminative ones.
The purpose of this piece was to anchor discriminative and generative models to the same probabilistic fundamentals. I find it to be an interesting perspective that is often overlooked. In the next blog post, we will see how much the pairing can impact the difficulty of the learning task by exploring a simple $k$-NN model.
Edvin T. Berhane