
Information Theory

From Shannon's foundational insights to the mathematics powering modern machine learning

1. The Origins of Information Theory

Since its creation in the mid-20th century, information theory has served as a foundational mathematical framework for modern science and engineering, transforming how societies transmit and process information at scale. From error-correcting codes that allow noisy signals to be received without corruption, to computer storage that relies on compression to minimize overhead, information theory is deeply embedded in our daily lives.

The story begins with Claude Shannon, author of the seminal paper "A Mathematical Theory of Communication" and known as the "Father of Information Theory". Shannon graduated from the University of Michigan in 1936 with bachelor's degrees in electrical engineering and mathematics. Though he found his way to MIT as a research assistant after graduation, it was his time at AT&T's Bell Labs that built much of his early-career momentum. Working alongside pioneers such as Nobel Laureate John Bardeen (co-inventor of the point-contact transistor) and Russell Shoemaker Ohl (inventor of the p-n junction used in solar cells), Shannon was in the presence of some of America's greatest scientists and engineers.

Information, as Shannon often said, "is the resolution of uncertainty". Information theory, as he outlined it in his work, offered a first clean separation between the contents of a message and how that message is sent. From this, two mathematical formulations were born: message construction ignoring noise and message recovery in the presence of noise.1 Put in familiar terms: encoding and decoding. For the first time, there was a two-part, methodical approach to the transmission and reception of information that was both backed by rigorous mathematical principles and general enough to apply to other fields.

To put this into perspective, Claude Shannon was far ahead of his time. Around the time his seminal paper was published, Phil Porter had proposed a redesign of cell tower placement and the Western Electric Type 500 telephone had become available in the United States. In short, the amount of information being distributed by electronic devices was a small fraction of what it is today, yet the fundamental principles of information theory still hold.

2. Bits, Bytes, and Everything in Between

As alluded to in the previous section, information is all around us: from the computers that use wired and wireless connections to deliver data both internally and externally, to the cellphones and GPS systems that let us navigate, and the bits and bytes of everything in between.

This ubiquity raises a foundational question: What do we mean by information? And to go into further detail, what do we mean by a bit?

2.1 Information Gain

A bit, in the information-theoretic sense, is the fundamental unit of information: it resolves uncertainty. For example, suppose we have a classical fair coin capable of producing two outcomes: heads or tails. The probability of observing heads is \(p_{\text{heads}} = 0.5\). Therefore, by using the equation shown below,

$$\boxed{-\log_2(p)} \tag{1}$$

we see that one bit of information is gained. As an outcome becomes more and more predictable (i.e., \(p\) increases), the information gained by observing it decreases monotonically.
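To make Equation (1) concrete, here is a minimal Python sketch; the probabilities chosen below are arbitrary illustrations, not values from the text.

```python
import math

# Self-information, -log2(p), for a few outcome probabilities (Eq. 1)
for p in (0.5, 0.25, 0.9):
    print(f"p = {p}: {-math.log2(p):.3f} bits")
# p = 0.5  -> 1.000 bits (fair coin flip)
# p = 0.25 -> 2.000 bits (one of four equally likely outcomes)
# p = 0.9  -> 0.152 bits (a nearly certain outcome tells us little)
```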

2.2 Information Transfer

In many cases, the information that is received is not the information that was sent. In the case of electronics, this means that bits are flipped in transit. The "in between" connecting the transmitter to the receiver (wires, free space, and so on) is known as a channel. While the channel is essential to the transfer of information, it is often also the source of its corruption. From noise to interference, the channel threatens our ability to recover the transmitted information.

Let \(\boldsymbol{X}\) represent a message transmitted over a linear, time-invariant channel \(\boldsymbol{H}\) that experiences some Additive White Gaussian Noise (AWGN), \(\boldsymbol{n}\).2 If \(\boldsymbol{Y}\) is the received message, then the communication model can be expressed mathematically as:

$$\boxed{\boldsymbol{Y} = \boldsymbol{H}\boldsymbol{X} + \boldsymbol{n}} \tag{2}$$

In many cases, when the channel introduces no distortion beyond the additive noise (i.e., \(\boldsymbol{H}\) is the identity), the model simplifies to:

$$\boxed{\boldsymbol{Y} = \boldsymbol{X} + \boldsymbol{n}} \tag{3}$$
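As a rough illustration of Equation (3), the sketch below pushes a short bit sequence through a simulated AWGN channel. The bit-to-symbol mapping and the noise level are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed mapping: bits {0, 1} -> symbols {-1, +1} (a BPSK-style convention)
bits = np.array([1, 0, 1])
x = 2 * bits - 1                          # transmitted signal X
n = rng.normal(scale=0.8, size=x.shape)   # AWGN term n
y = x + n                                 # Eq. (3): Y = X + n

decisions = (y > 0).astype(int)           # hard-decision detection at the receiver
print("sent:    ", bits)
print("decided: ", decisions)             # may differ from `bits` if the noise is large
```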

2.3 Information Corruption

As previously mentioned, bit flips can lead to corrupted information. In the AWGN model of Equation (3), the noise is introduced additively; at the bit level, its effect can be summarized as an error pattern. For example, if a message \(\mathbf{X}\) is represented as 101 and the noise flips the bit in the 2nd position, the output can be represented as:

$$\underbrace{\texttt{111}}_{\text{Y}} = \underbrace{\texttt{101}}_{\text{X}} \oplus \underbrace{\texttt{010}}_{\text{n}}$$

Here, \(\oplus\) denotes bitwise XOR: after hard-decision detection, the noise's effect reduces to XOR with a binary error pattern (the binary symmetric channel view), even though the underlying AWGN is continuous-valued. In a subsequent post, we will go into more detail on various kinds of channels and their effects on information transfer.
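The bit-level view above can be reproduced in a few lines; the values mirror the hypothetical 101 example.

```python
# Bit-level (binary symmetric channel) view of the corrupted transmission
X = 0b101
n = 0b010                  # error pattern: only the 2nd bit is flipped
Y = X ^ n                  # XOR applies the bit flips
print(format(Y, "03b"))    # prints "111"
```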

3. Entropy, Random Variables, Mutual Information

3.1 Random Variables

Definition (Random Variable)

A random variable is a rule that assigns a numerical value to the outcome of a random experiment.

Formally, let \(\Omega\) be the set of all possible outcomes of the experiment, and let \(\mathbb{P}\) be a probability model on events (subsets of \(\Omega\)) that we are allowed to talk about. A random variable is a function

$$X:\Omega \to \mathbb{R}$$

such that events of the form "\(X\) falls in a set \(A\)" have well-defined probabilities:

$$\{\,\omega \in \Omega : X(\omega) \in A\,\} \text{ is an event for every reasonable } A \subseteq \mathbb{R}.$$

In particular, it is enough (and standard) to require that for every threshold \(t \in \mathbb{R}\),

$$\{\,\omega \in \Omega : X(\omega) \le t\,\} \text{ is an event,}$$

so that \(\mathbb{P}(X \le t)\) is defined.

Thus a random variable is not itself "random"; the randomness comes from the underlying random outcome \(\omega \in \Omega\). In statistics, observed data are typically modeled as realizations (samples) of one or more random variables.

3.2 Entropy

Shannon introduced the entropy function in 1948 as a measure of the uncertainty of a random variable \(X\). Intuitively, the uncertainty of \(X\) is the number of bits of information that we do not know about \(X\), or in other words, the number of bits needed to describe \(X\).

Let \(X\) be a discrete random variable with probability mass function \(p_X(a) \coloneqq \mathrm{P}(X=a)\). Note that \(p_X(X)\) is itself a random variable, since any function of a random variable is again a random variable. Then Shannon's entropy function can be written as:

$$\mathrm{H}(X) = - \sum_a \mathrm{P}(X = a) \log_2 \mathrm{P}(X = a) = - \mathbb{E}_X [\log \mathrm{P}(X)] = \mathbb{E}_X \log \frac{1}{\mathrm{P}(X)}$$

where for a discrete random variable \(Y\), \(\mathbb{E}[f(Y)] = \sum_{a} f(a)\,\mathbb{P}(Y=a)\).

We note that the entropy depends only on the probability distribution of the random variable (r.v.), not on the actual values that the r.v. can attain. Moreover, it is symmetric: permuting which values map to which probabilities leaves the entropy unchanged, reflecting the intuition that we are still measuring the "uncertainty" of the r.v. \(X\).

Example: Coin Flip

Consider a coin flip with \(\mathrm{P}(\text{Heads}) = p\) and \(\mathrm{P}(\text{Tails}) = 1-p\).

Case 1: Fair coin (\(p = 1/2\)):

$$\begin{aligned} \mathrm{H}(X) &= -\left(\frac{1}{2}\log_2 \frac{1}{2} + \frac{1}{2}\log_2 \frac{1}{2}\right) \\ &= -\left(\frac{1}{2}(-1) + \frac{1}{2}(-1)\right) = 1 \text{ bit} \end{aligned}$$

This is maximum uncertainty: we need exactly 1 bit to encode the outcome.

Case 2: Biased coin (\(p = 0.9\)):

$$\begin{aligned} \mathrm{H}(X) &= -(0.9\log_2 0.9 + 0.1\log_2 0.1) \\ &\approx -(0.9 \cdot (-0.152) + 0.1 \cdot (-3.322)) \\ &\approx 0.469 \text{ bits} \end{aligned}$$

Lower uncertainty! The outcome is more predictable, so less information is conveyed.

Case 3: Deterministic (\(p = 1\)):

$$\mathrm{H}(X) = -(1 \cdot \log_2 1 + 0 \cdot \log_2 0) = 0 \text{ bits}$$

(We use the convention that \(0\log 0 = 0\).) No uncertainty at all, aligning with the deterministic setting.
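The three cases above can be checked with a short sketch (a minimal implementation of the entropy formula, using the \(0 \log 0 = 0\) convention):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a Bernoulli(p) random variable."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]          # convention: 0 log 0 = 0
    return -np.sum(probs * np.log2(probs))

for p in (0.5, 0.9, 1.0):
    print(f"p = {p}: H = {entropy_bits(p):.3f} bits")
# p = 0.5 -> 1.000, p = 0.9 -> 0.469, p = 1.0 -> 0.000
```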

Interestingly, one can show that Shannon's entropy \(\mathrm{H}\) is the unique measure of the uncertainty of a r.v. satisfying the following properties:

(1) Continuity & Symmetry. For each \(n \ge 2\), the function \(\mathrm{H}(p_1,\dots,p_n)\) is continuous on the probability "simplex"

$$\Delta_n = \left\{(p_1,\dots,p_n): p_i \ge 0,\ \sum_{i=1}^n p_i = 1\right\}$$

and is symmetric: for any permutation \(\pi\) of \(\{1,\dots,n\}\),

$$\mathrm{H}(p_1,\dots,p_n) = \mathrm{H}(p_{\pi(1)},\dots,p_{\pi(n)}).$$

(2) Normalization / Choice of Units.

$$\mathrm{H}\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) = 1.$$

(Equivalently, measuring in bits fixes the scale.)

(3) Maximality at Uniform / Monotonicity in the Number of Outcomes. For each \(n \ge 2\), \(\mathrm{H}\) is maximized on \(\Delta_n\) by the uniform distribution:

$$\mathrm{H}(p_1,\dots,p_n) \le \mathrm{H}\!\left(\tfrac{1}{n},\dots,\tfrac{1}{n}\right).$$

Moreover the uniform entropies increase with \(n\):

$$\mathrm{H}\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) \le \mathrm{H}\!\left(\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3}\right) \le \mathrm{H}\!\left(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4}\right) \le \cdots$$

(4) Grouping Axiom / Recursivity. If the outcomes are partitioned into groups \(G_1,\dots,G_m\), define the group probabilities \(p_i \coloneqq \sum_{a \in G_i} p(a)\) and the conditional distributions \(p(a \mid i) \coloneqq p(a)/p_i\) for \(a \in G_i\) (when \(p_i > 0\)). Then

$$\mathrm{H}(p) = \mathrm{H}(p_1,\dots,p_m) + \sum_{i=1}^m p_i\,\mathrm{H}\bigl(p(\cdot \mid i)\bigr).$$
Example: Grouping Axiom with Dice

Consider rolling a fair six-sided die. We can group outcomes as:

  • Group \(G_1\) = "Low" = \(\{1,2,3\}\) with probability \(p_1 = 1/2\)
  • Group \(G_2\) = "High" = \(\{4,5,6\}\) with probability \(p_2 = 1/2\)

Full entropy:

$$\mathrm{H}(X) = \log_2 6 \approx 2.585 \text{ bits}$$

Using grouping axiom:

$$\begin{aligned} \mathrm{H}(X) &= \mathrm{H}(p_1, p_2) + p_1 \cdot \mathrm{H}(X \mid G_1) + p_2 \cdot \mathrm{H}(X \mid G_2) \\ &= \mathrm{H}(1/2, 1/2) + \tfrac{1}{2}\log_2 3 + \tfrac{1}{2}\log_2 3 \\ &= 1 + \log_2 3 \approx 2.585 \text{ bits} \quad \checkmark \end{aligned}$$

Interpretation: First we learn which group (1 bit), then which element within that group (\(\log_2 3\) bits on average).

Remark. The grouping axiom states that if outcomes are first "coarsened" into groups and then resolved within a group, the total uncertainty decomposes as (uncertainty of the group) + \(\mathbb{E}\)[(uncertainty within the group)].
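A quick numerical check of the grouping axiom for the die example (a minimal sketch using the same "low"/"high" split):

```python
import numpy as np

def H(probs):
    """Shannon entropy (bits) of a discrete distribution, ignoring zero entries."""
    p = np.array([q for q in probs if q > 0])
    return -np.sum(p * np.log2(p))

full = H([1/6] * 6)                                     # entropy of the fair die
grouped = H([1/2, 1/2]) + 0.5 * H([1/3] * 3) + 0.5 * H([1/3] * 3)
print(full, grouped)                                    # both ≈ 2.585 bits
```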

Joint Entropy. For r.v.'s \(X, Y\),

$$\mathrm{H}(X,Y) = -\mathbb{E}_{X,Y} \log \mathrm{P}(X,Y).$$

Conditional Entropy. For r.v.'s \(X, Y\),

$$\begin{aligned} \mathrm{H}(X \mid Y) &= \sum_{b} \mathrm{P}(Y = b) \cdot \mathrm{H}(X \mid Y = b) \\ &= -\sum_{a,b} \mathrm{P}(X = a, Y = b) \log \mathrm{P}(X = a \mid Y = b) \\ &= -\mathbb{E}_{X,Y} \log \mathrm{P}(X \mid Y) \end{aligned}$$

(the uncertainty that remained in \(X\) after seeing \(Y\)).

3.3 Mutual Information

The mutual information \(\mathrm{I}(X;Y)\) is the information that \(X\) gives about \(Y\). We have \(\mathrm{I}(X;Y) = \mathrm{H}(Y) - \mathrm{H}(Y \mid X) = \mathrm{I}(Y;X)\). Note that Mutual Information (MI) is symmetric: the information that \(Y\) gives about \(X\) equals the information that \(X\) gives about \(Y\).

Alternative Characterizations of Mutual Information

The mutual information can be expressed in several equivalent ways:

$$\begin{aligned} \mathrm{I}(X;Y) &= \mathrm{H}(X) - \mathrm{H}(X \mid Y) \\ &= \mathrm{H}(Y) - \mathrm{H}(Y \mid X) \\ &= \mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X,Y) \\ &= \sum_{a,b} \mathrm{P}(X\!=\!a, Y\!=\!b) \log_2 \frac{\mathrm{P}(X\!=\!a, Y\!=\!b)}{\mathrm{P}(X\!=\!a)\,\mathrm{P}(Y\!=\!b)} \end{aligned}$$

The last form shows that MI measures how much the joint distribution differs from independence.
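The last form also gives a direct recipe for computing MI from a joint distribution. The sketch below uses a hypothetical joint distribution of two binary variables; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(X = a, Y = b) of two binary variables
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x = P_xy.sum(axis=1)   # marginal of X
P_y = P_xy.sum(axis=0)   # marginal of Y

# I(X;Y) = sum_{a,b} P(a,b) log2 [ P(a,b) / (P(a) P(b)) ]
mi = sum(
    P_xy[a, b] * np.log2(P_xy[a, b] / (P_x[a] * P_y[b]))
    for a in range(2) for b in range(2) if P_xy[a, b] > 0
)
print(f"I(X;Y) = {mi:.3f} bits")   # ≈ 0.278 bits
```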

Connection to the Noisy Channel

In Shannon's communication model, mutual information \(\mathrm{I}(X;Y)\) measures how much information survives transmission through a noisy channel:

  • \(X\) = message sent (input to channel)
  • \(Y\) = message received (output of channel)
  • \(\mathrm{H}(X)\) = total information in the message
  • \(\mathrm{H}(X \mid Y)\) = information lost to noise
  • \(\mathrm{I}(X;Y) = \mathrm{H}(X) - \mathrm{H}(X \mid Y)\) = information successfully transmitted

The channel capacity is defined as:

$$C = \max_{p_X} \mathrm{I}(X;Y)$$

Shannon's noisy channel coding theorem states that reliable communication is possible at any rate \(R < C\), but impossible for \(R > C\).
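As one concrete instance (a standard textbook result, not derived in this post), the binary symmetric channel with crossover probability \(p\) has capacity \(C = 1 - \mathrm{H}(p, 1-p)\), attained by a uniform input distribution. A quick sketch:

```python
import numpy as np

def h2(p):
    """Binary entropy function, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Capacity of a binary symmetric channel with crossover probability p
for p in (0.0, 0.11, 0.5):
    print(f"p = {p}: C = {1 - h2(p):.3f} bits/use")
# p = 0.0  -> 1.000 (noiseless channel)
# p = 0.11 -> 0.500 (roughly half a bit survives per use)
# p = 0.5  -> 0.000 (output carries no information about the input)
```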

3.4 Key Inequalities and Properties

Fundamental Inequalities

1. Non-negativity of Entropy:

$$\mathrm{H}(X) \ge 0$$

with equality iff \(X\) is deterministic (one outcome has probability 1).

2. Subadditivity of Joint Entropy:

$$\mathrm{H}(X,Y) \le \mathrm{H}(X) + \mathrm{H}(Y)$$

with equality iff \(X\) and \(Y\) are independent. Joint uncertainty cannot exceed the sum of individual uncertainties.

3. Information Cannot Increase Entropy:

$$\mathrm{H}(X \mid Y) \le \mathrm{H}(X)$$

with equality iff \(X\) and \(Y\) are independent. Observing \(Y\) cannot increase uncertainty about \(X\).

4. Non-negativity of Mutual Information:

$$\mathrm{I}(X;Y) \ge 0$$

with equality iff \(X\) and \(Y\) are independent. Variables always share non-negative information.

5. Chain Rule for Entropy:

$$\mathrm{H}(X,Y) = \mathrm{H}(X) + \mathrm{H}(Y \mid X) = \mathrm{H}(Y) + \mathrm{H}(X \mid Y)$$

6. Data Processing Inequality: If \(X \to Y \to Z\) forms a Markov chain (i.e., \(Z\) depends on \(X\) only through \(Y\)), then:

$$\mathrm{I}(X;Z) \le \mathrm{I}(X;Y)$$

Processing cannot create information. This is fundamental to understanding information loss in channels!
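Several of these identities can be verified numerically. The sketch below reuses the hypothetical joint distribution from the mutual-information example in Section 3.3 and checks the chain rule, subadditivity, and the fact that conditioning does not increase entropy:

```python
import numpy as np

def H(probs):
    """Shannon entropy (bits) of a discrete distribution, ignoring zero entries."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x, P_y = P_xy.sum(axis=1), P_xy.sum(axis=0)

# H(Y|X) computed from the conditional distributions P(Y | X = a)
H_y_given_x = sum(P_x[a] * H(P_xy[a] / P_x[a]) for a in range(2))

print(np.isclose(H(P_xy), H(P_x) + H_y_given_x))   # chain rule: True
print(H(P_xy) <= H(P_x) + H(P_y))                  # subadditivity: True
print(H_y_given_x <= H(P_y))                       # conditioning cannot increase entropy: True
```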

4. Unexpected Connections

The beauty of information theory lies not only in its clean mathematical formulation or its rich history, but in its reach. The entropy function defined earlier also appears in an entirely different branch of science: statistical mechanics. Far from a coincidence, this reflects a deep structural connection between information and physics, which we explore in this section.

4.1 Thermodynamic Entropy: A Familiar Face

In 1872, Ludwig Boltzmann introduced a quantity to describe the disorder of a physical system in thermal equilibrium. Consider a system that can occupy a set of microstates \(\{1, 2, \ldots, W\}\), each with probability \(p_i\). In its modern (Gibbs) form, this entropy is defined as:

$$\boxed{S = -k_B \sum_{i=1}^{W} p_i \ln p_i} \tag{4}$$

where \(k_B \approx 1.381 \times 10^{-23}\) J/K is the Boltzmann constant. Compare this directly with Shannon's entropy from Section 3:

$$\mathrm{H}(X) = -\sum_a p(a) \log_2 p(a)$$

The two expressions are identical up to a change of logarithmic base (\(\ln\) vs. \(\log_2\)) and a multiplicative constant (\(k_B\)). The units differ (Joules per Kelvin for Boltzmann, bits for Shannon) but the underlying mathematical structure is the same. Both measure the uncertainty about the state of a system.

4.2 The Maximum Entropy Principle

The connection between Shannon and Boltzmann entropy was made rigorous by Edwin T. Jaynes in two landmark papers published in 1957. Jaynes posed the following question: given partial knowledge about a physical system (say, its average energy), what probability distribution over microstates should we assign? His answer was to choose the distribution that maximizes Shannon entropy subject to the known constraints.

Formally, suppose we know that the expected energy of the system satisfies \(\langle E \rangle = \sum_i p_i E_i = U\), where \(E_i\) is the energy of microstate \(i\) and \(U\) is a known constant. We seek the distribution \(\{p_i\}\) that maximizes:

$$\mathrm{H}(p) = -\sum_i p_i \ln p_i$$

subject to the constraints \(\sum_i p_i = 1\) and \(\sum_i p_i E_i = U\). Introducing Lagrange multipliers \(\lambda_0\) and \(\beta\) for these constraints, we form the Lagrangian:

$$\mathcal{L} = -\sum_i p_i \ln p_i - \lambda_0 \left(\sum_i p_i - 1\right) - \beta \left(\sum_i p_i E_i - U\right)$$

Setting \(\frac{\partial \mathcal{L}}{\partial p_i} = 0\) for each \(i\):

$$-\ln p_i - 1 - \lambda_0 - \beta E_i = 0 \quad \implies \quad p_i = e^{-(1+\lambda_0)}\,e^{-\beta E_i}$$

Enforcing the normalization constraint \(\sum_i p_i = 1\) yields the Boltzmann distribution:

$$\boxed{p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}} \tag{5}$$

where \(Z\) is the partition function and \(\beta = \frac{1}{k_B T}\) is the inverse temperature.
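Equation (5) can be sanity-checked numerically. The sketch below maximizes the entropy of a hypothetical three-level system under the normalization and mean-energy constraints and compares the result to the closed-form Boltzmann distribution; the energies and \(\beta\) are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

E = np.array([0.0, 1.0, 2.0])   # energies of a hypothetical three-level system
beta = 1.2                       # inverse temperature (arbitrary)

# Closed-form Boltzmann distribution from Eq. (5)
Z = np.sum(np.exp(-beta * E))
p_star = np.exp(-beta * E) / Z
U = np.sum(p_star * E)           # the mean energy the constraint will enforce

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)   # guard against log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # normalization
    {"type": "eq", "fun": lambda p: np.sum(p * E) - U},  # mean energy
]
result = minimize(neg_entropy, np.ones(3) / 3,
                  bounds=[(0.0, 1.0)] * 3, constraints=constraints)

print("numerical maxent:", np.round(result.x, 4))
print("Boltzmann (5):   ", np.round(p_star, 4))   # the two should agree
```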

Connection: Maximum Entropy = Thermal Equilibrium

The Boltzmann distribution is not an empirical approximation. It is the unique distribution that maximizes Shannon entropy subject to a constraint on average energy. In other words, a system in thermal equilibrium is one that is maximally uncertain about its microstate, given what we know about its macroscopic energy. This is Jaynes' key insight: the laws of thermodynamics are not merely physical laws; rather, they are consequences of optimal inference under uncertainty.

Example: Two-Level System

Consider a system with two microstates: a ground state with energy \(E_0 = 0\) and an excited state with energy \(E_1 = \varepsilon > 0\). The partition function is:

$$Z = 1 + e^{-\beta \varepsilon}$$

and the Boltzmann probabilities are:

$$p_0 = \frac{1}{1 + e^{-\beta \varepsilon}}, \qquad p_1 = \frac{e^{-\beta \varepsilon}}{1 + e^{-\beta \varepsilon}}$$

High temperature (\(\beta \to 0\), i.e., \(T \to \infty\)): Both states become equally likely, \(p_0 \approx p_1 \approx 1/2\), and the entropy approaches its maximum of \(\ln 2\) nats = 1 bit. Maximum uncertainty.

Low temperature (\(\beta \to \infty\), i.e., \(T \to 0\)): The system collapses to the ground state, \(p_0 \to 1\), and the entropy approaches 0. No uncertainty (Third Law of Thermodynamics).

Note that this two-level system is formally identical to the biased coin from Section 3.2, with \(p = p_0\) determined by the temperature.

4.3 The Free Energy and KL Divergence

The connection deepens further when we consider the Helmholtz free energy, defined in thermodynamics as \(F = U - TS\), where \(U\) is the internal energy, \(T\) is the temperature, and \(S\) is the entropy. In the language of probability distributions, for any distribution \(q\) over microstates, we can write:

$$F_q = \sum_i q_i E_i - \frac{1}{\beta}\,\mathrm{H}(q) = \frac{1}{\beta}\left(\sum_i q_i \ln q_i + \beta \sum_i q_i E_i\right)$$

To interpret this expression in information-theoretic terms, we introduce the Kullback–Leibler divergence between two distributions \(p\) and \(q\):

Definition (KL Divergence)

The Kullback–Leibler divergence from distribution \(q\) to distribution \(p\) is

$$D_{\mathrm{KL}}(p \| q) = \sum_{a} p(a) \log \frac{p(a)}{q(a)} = \mathbb{E}_p \left[\log \frac{p(X)}{q(X)}\right].$$

It measures the "extra cost" (in bits) incurred by using \(q\) to encode data that is actually drawn from \(p\). Note that \(D_{\mathrm{KL}}(p \| q) \ge 0\) with equality if and only if \(p = q\).

With this definition in hand, and after some algebra, the free energy can be rewritten as:

$$\boxed{F_q = F^* + \frac{1}{\beta}\,D_{\mathrm{KL}}(q \| p^*)} \tag{6}$$

where \(p^*\) is the Boltzmann distribution from Equation (5) and \(F^* = -\frac{1}{\beta}\ln Z\) is the equilibrium free energy.
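Equation (6) is easy to verify numerically for a small system. In the sketch below, the two-level energies, \(\beta\), and the non-equilibrium distribution \(q\) are arbitrary values chosen for illustration.

```python
import numpy as np

E = np.array([0.0, 1.0])   # hypothetical two-level system
beta = 2.0

# Equilibrium distribution p* and equilibrium free energy F* = -(1/beta) ln Z
Z = np.sum(np.exp(-beta * E))
p_star = np.exp(-beta * E) / Z
F_star = -np.log(Z) / beta

q = np.array([0.4, 0.6])   # an arbitrary non-equilibrium distribution

# Left-hand side: F_q = <E>_q - (1/beta) H(q)
F_q = np.sum(q * E) + np.sum(q * np.log(q)) / beta

# Right-hand side: F* + (1/beta) D_KL(q || p*)
kl = np.sum(q * np.log(q / p_star))
print(F_q, F_star + kl / beta)   # the two values agree
```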

Free Energy Minimization = KL Divergence Minimization

Equation (6) reveals that since \(F^*\) is a constant and \(D_{\mathrm{KL}}(q \| p^*) \ge 0\):

  1. The free energy \(F_q\) is minimized when \(q = p^*\) (the Boltzmann distribution).
  2. The excess free energy above equilibrium is exactly the KL divergence from the current distribution to the equilibrium distribution, scaled by temperature.
  3. A physical system relaxing to thermal equilibrium is performing KL divergence minimization—incidentally the same optimization that a neural network performs when trained with cross-entropy loss (Section 5.1)!

To me this is a most remarkable insight. When a gas equilibrates in a box, when a crystal cools to its ground state, when a neural network minimizes its loss function—in each case, the underlying mathematical operation is the same: minimizing the KL divergence between a current distribution and a target distribution. The physics looks different, the applications are different, but the information-theoretic structure is identical.

5. Applications in Machine Learning

5.1 Cross-Entropy Loss: The Hidden Information-Theoretic Engine

If you have ever trained a neural network for classification, you have almost certainly used the cross-entropy loss. It is the default in nearly every deep learning framework, from PyTorch to TensorFlow, and it underpins everything from image classifiers to large language models. This loss function is in fact a direct consequence of information theory!

Recall from Section 3 that the entropy of a discrete random variable \(X\) with distribution \(p\) is given by \(\mathrm{H}(X) = -\sum_a p(a) \log_2 p(a)\). Now suppose we have a true distribution \(p\) over class labels and a predicted distribution \(q\) produced by our model (e.g., via a softmax output). The cross-entropy between \(p\) and \(q\) is defined as:

$$\boxed{\mathrm{H}(p, q) = -\sum_{a} p(a) \log q(a)} \tag{7}$$

At first glance, this looks like a minor modification of Shannon's entropy. But its significance becomes clear when we decompose it. We can write:

$$\mathrm{H}(p, q) = \mathrm{H}(p) + D_{\mathrm{KL}}(p \| q)$$

Here \(D_{\mathrm{KL}}(p \| q)\) is the Kullback–Leibler divergence defined in Section 4.3; it measures the "extra cost" (in bits) incurred by using \(q\) to encode data that is actually drawn from \(p\).

Since \(\mathrm{H}(p)\) is a constant determined entirely by the data and does not depend on our model's parameters, minimizing the cross-entropy \(\mathrm{H}(p, q)\) over the model parameters is mathematically equivalent to minimizing the KL divergence \(D_{\mathrm{KL}}(p \| q)\). In other words, every time you minimize a CrossEntropyLoss in your training loop, you are solving an information-theoretic optimization problem: finding the predicted distribution \(q\) that is closest to the true distribution \(p\) in the sense of KL divergence.
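The decomposition \(\mathrm{H}(p, q) = \mathrm{H}(p) + D_{\mathrm{KL}}(p \| q)\) can be checked directly. The sketch below uses a hypothetical one-hot label and softmax output; the numbers are made up for illustration.

```python
import numpy as np

p = np.array([1.0, 0.0, 0.0])   # true distribution (one-hot label)
q = np.array([0.7, 0.2, 0.1])   # model's predicted distribution (assumed softmax output)

eps = 1e-12                      # guard against log(0)
cross_entropy = -np.sum(p * np.log(q + eps))
entropy_p = -np.sum(p * np.log(p + eps))
kl = np.sum(p * np.log((p + eps) / (q + eps)))

# H(p, q) = H(p) + D_KL(p || q)
print(cross_entropy, entropy_p + kl)   # both ≈ 0.357 nats
```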

Example: Binary Classification

Consider binary classification where the true label is \(y \in \{0,1\}\) and the model outputs \(\hat{y} = \sigma(z)\) via a sigmoid function. The binary cross-entropy loss for a single sample is:

$$\mathcal{L} = -\left[y \log \hat{y} + (1-y)\log(1-\hat{y})\right]$$

This is precisely Equation (7) applied to the Bernoulli distributions \(p = (y, 1-y)\) and \(q = (\hat{y}, 1-\hat{y})\). When \(y = 1\) and \(\hat{y} = 0.9\):

$$\mathcal{L} = -\ln(0.9) \approx 0.105 \text{ nats} \;(\approx 0.152 \text{ bits})$$

The model is penalized proportional to the information-theoretic "surprise" of assigning only 90% probability to the correct class.

This connection also explains why cross-entropy loss tends to work better than, say, mean squared error for classification: it directly optimizes the information-theoretic distance between distributions, rather than treating class probabilities as regression targets.

5.2 Contrastive Learning: Mutual Information in Disguise

In the last several years, one of the most impactful developments in machine learning has been self-supervised learning: the idea that a model can learn useful representations from unlabeled data by solving a cleverly designed pretext task. Among the most successful approaches is contrastive learning, which has powered systems like SimCLR and CLIP. What is perhaps most surprising is that contrastive learning, which on the surface looks like a clever engineering trick, turns out to be a method for estimating and maximizing mutual information.

The key insight was formalized by van den Oord, Li, and Vinyals in their 2018 paper on Contrastive Predictive Coding. Consider two random variables \(X_1\) and \(X_2\) that represent two "views" of the same underlying data—for example, two different crops of the same image. A contrastive learning system produces embeddings \(z_1 = f(X_1)\) and \(z_2 = f(X_2)\), and is trained to distinguish the true pair \((z_1, z_2)\) from \(N-1\) "negative" pairs \((z_1, z_j^{-})\) drawn from unrelated samples.

The resulting loss function, known as InfoNCE, takes the form:

$$\boxed{\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\bigl(\mathrm{sim}(z_1, z_2)/\tau\bigr)}{\displaystyle\sum_{j=1}^{N} \exp\!\bigl(\mathrm{sim}(z_1, z_j)/\tau\bigr)}} \tag{8}$$

where \(\mathrm{sim}(\cdot,\cdot)\) denotes cosine similarity, \(\tau > 0\) is a temperature parameter, and the sum in the denominator runs over the one positive and \(N-1\) negative pairs.

InfoNCE as a Lower Bound on Mutual Information

It can be shown that the InfoNCE loss provides a lower bound on the mutual information between the two views:

$$\mathrm{I}(X_1; X_2) \ge \log(N) - \mathcal{L}_{\mathrm{InfoNCE}}$$

where \(N\) is the total number of samples (one positive + \(N-1\) negatives) in each batch.

This means that minimizing InfoNCE is equivalent to maximizing a lower bound on mutual information. The bound becomes tighter as \(N\) increases, which explains a well-known empirical observation: contrastive learning methods perform significantly better with larger batch sizes.
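Below is a minimal sketch of Equation (8) and the bound above for a single anchor, using random vectors as stand-in embeddings. The dimensions, temperature, and "second view" construction are assumptions made purely for illustration, not the setup of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(z1, z2, negatives, tau=0.1):
    """InfoNCE loss (Eq. 8) for one anchor z1, its positive z2, and N-1 negatives."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(cos(z1, z2) / tau)
    neg = np.sum([np.exp(cos(z1, zj) / tau) for zj in negatives])
    return -np.log(pos / (pos + neg))

d, num_neg = 16, 7
z1 = rng.normal(size=d)
z2 = z1 + 0.1 * rng.normal(size=d)       # a correlated "second view" of the same sample
negs = rng.normal(size=(num_neg, d))     # embeddings of unrelated samples

loss = info_nce(z1, z2, negs)
N = num_neg + 1
print("InfoNCE loss (nats):      ", round(float(loss), 3))
print("MI lower bound, log N - L:", round(float(np.log(N) - loss), 3))
```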

This connection reveals that models like CLIP (which learns to align images and text in a shared embedding space) are fundamentally performing mutual information maximization between visual and linguistic representations. When CLIP learns that a photo of a dog and the caption "a golden retriever sitting in a park" should map to nearby points in embedding space, it is maximizing the information that the image representation carries about the text representation, and vice versa. The temperature parameter \(\tau\) is not merely a training hyperparameter; it controls the sharpness of the distribution used to estimate mutual information, directly affecting the quality of the bound.

Footnotes
  1. Noise (used here in the stochastic sense) refers to the corruption of information.
  2. AWGN is a class of noise with a flat (uniform) power spectral density across frequency and a Gaussian amplitude distribution.