Rademacher's theorem

This is a set of notes I prepared for a seminar. The talk abstract was as follows:

Every set is nearly a finite union of intervals, every function is nearly continuous, every convergent sequence of functions is nearly uniformly convergent, every pancake is nearly a waffle - but is every function nearly differentiable? Come see how Lipschitz continuity can make the world a better place.

1. Rademacher’s Theorem

Recall that an outer measure on $X$ is a nonnegative extended-real-valued function $\mu$ on $\mathscr{P}(X)$ that vanishes on the empty set and is monotone and countably subadditive. Following Carathéodory’s criterion, we shall call a set $A \subseteq X$ measurable if

$$\mu(E) = \mu(E \cap A) + \mu(E \cap A^c)$$

for all $E \subseteq X$. The outer measure of a countable disjoint union of measurable sets is the sum of the outer measures of the individual sets. Henceforth, we shall drop the word “outer” and call any such $\mu$ a measure.

A measure $\mu$ on a topological space $X$ is Borel regular if all Borel sets are measurable and every subset of $X$ is contained in a Borel set of the same measure. The standard measure on $\mathbb{R}^n$ is the $n$-dimensional Lebesgue measure $\mathscr{L}^n$, which is the unique translation-invariant Borel regular measure on $\mathbb{R}^n$ with the normalization $\mathscr{L}^n([0,1]^n) = 1$.

There are a number of theorems that establish the $\mathscr{L}^n$-almost everywhere differentiability of functions $f:\mathbb{R}^n \to \mathbb{R}^m$ under suitable regularity hypotheses. These theorems are typically proved using techniques from geometric measure theory, which studies the geometric properties of subsets of Euclidean spaces and measures on them. Today’s talk will focus on Rademacher’s theorem, which establishes the almost-everywhere differentiability of Lipschitz functions:

Definition. A function $f:\mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz if there exists a nonnegative real number $M$ such that

$$|f(x)-f(y)| \leq M|x-y|$$

for all $x,y \in \mathbb{R}^n$. The Lipschitz constant $\operatorname{Lip} f$ is the infimum of all such $M$.
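To get a feel for the definition, here is a minimal Python sketch that computes the largest difference quotient of a function over a finite sample of points. This only gives a lower bound for $\operatorname{Lip} f$. The test function (the Euclidean norm on $\mathbb{R}^2$, whose Lipschitz constant is $1$) and the sample grid are illustrative choices, not part of the notes.

```python
import itertools
import math

def lipschitz_lower_bound(f, points):
    """Largest quotient |f(x) - f(y)| / |x - y| over distinct sample points."""
    best = 0.0
    for x, y in itertools.combinations(points, 2):
        d = math.dist(x, y)
        if d > 0:
            best = max(best, abs(f(x) - f(y)) / d)
    return best

def euclidean_norm(x):
    return math.hypot(*x)

# Illustrative sample: a uniform grid in [-1, 1]^2 (an assumption of this sketch).
grid = [(i / 4, j / 4) for i in range(-4, 5) for j in range(-4, 5)]
estimate = lipschitz_lower_bound(euclidean_norm, grid)
print(estimate)
```

Since the sample contains the origin, the quotient between $0$ and any other grid point equals $1$, so the estimate should agree with $\operatorname{Lip} f = 1$ up to rounding.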

Theorem 1 (Rademacher, 1919). If $f:\mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz, then $f$ is differentiable $\mathscr{L}^n$-almost everywhere on $\mathbb{R}^n$.

2. Bounded Variations

The one-dimensional Rademacher’s theorem is a special case of differentiation theory on $\mathbb{R}$, which is a standard real-analysis topic. Let us briefly review the relevant facts.

Recall that a curve $\gamma$ in $\mathbb{R}^2$ with a continuous parametrization $z(t) = (x(t),y(t))$ on $[a,b]$ is rectifiable if there exists a nonnegative real number $M$ such that

$$\sum_{j=1}^N |z(t_j) - z(t_{j-1})| \leq M$$

for every partition $a = t_0 < \cdots < t_N = b$. The length of a rectifiable curve $\gamma$ is the supremum of all such sums, or equivalently the infimum of all such $M$. It is well known that $\gamma$ is rectifiable if and only if $x(t)$ and $y(t)$ are of bounded variation, which we define as follows:

Definition. $F:[a,b] \to \mathbb{R}$ is of bounded variation if there exists a nonnegative real number $M$ such that

$$\sum_{j=1}^N |F(t_j) - F(t_{j-1})| \leq M$$

for every partition $a = t_0 < \cdots < t_N = b$. The total variation $T_F([a,b])$ of $F$ is the supremum of all such sums, or equivalently the infimum of all such $M$.
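As a sanity check on the definition, the following Python sketch evaluates the variation sums of $F(t) = t^2$ on $[-1,1]$ over uniform partitions. The choice of $F$ and of uniform partitions is an assumption made for illustration. Since $F$ is monotone on $[-1,0]$ and on $[0,1]$, its total variation is $2$, and any partition containing $0$ already attains it.

```python
def variation_sum(F, partition):
    """Sum of |F(t_j) - F(t_{j-1})| over consecutive partition points."""
    return sum(abs(F(t1) - F(t0)) for t0, t1 in zip(partition, partition[1:]))

def uniform_partition(a, b, n):
    """The n + 1 equally spaced points a = t_0 < ... < t_n = b."""
    return [a + (b - a) * j / n for j in range(n + 1)]

def F(t):
    return t * t

# A coarse partition that misses the turning point 0, and a fine one that contains it.
coarse = variation_sum(F, uniform_partition(-1.0, 1.0, 3))
fine = variation_sum(F, uniform_partition(-1.0, 1.0, 1000))
print(coarse, fine)
```

The coarse partition $-1 < -1/3 < 1/3 < 1$ gives $16/9$, which illustrates that individual sums only bound the total variation from below.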

As it turns out, functions of bounded variation are differentiable almost everywhere:

Lemma 2. If $F:[a,b] \to \mathbb{R}$ is of bounded variation, then $F$ is differentiable $\mathscr{L}^1$-almost everywhere on $[a,b]$.

For a proof, see [SS05] Chapter 3, Theorem 3.4 or [Fol99] Theorem 3.27. Rudin, of course, leaves it as an exercise: see [Rud87] Exercise 13 in Chapter 7 if you’d like to try it yourself.

To connect the above lemma to Lipschitz functions, we need the notion of bounded variation on $\mathbb{R}$:

Definition. $F:\mathbb{R} \to \mathbb{R}$ is of bounded variation if there exists a nonnegative real number $M$ such that

$$T_F([a,b]) \leq M$$

for every interval $[a,b] \subseteq \mathbb{R}$.

A simple modification of the proof yields the following:

Corollary 3. If $F:\mathbb{R} \to \mathbb{R}$ is of bounded variation, then $F$ is differentiable $\mathscr{L}^1$-almost everywhere on $\mathbb{R}$.

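A Lipschitz function $f:\mathbb{R} \to \mathbb{R}$ need not be of bounded variation on all of $\mathbb{R}$ (consider $f(x) = x$), but it is of bounded variation on every compact interval, since $T_f([a,b]) \leq (\operatorname{Lip} f)(b-a)$. Applying Lemma 2 on $[-k,k]$ for each $k \in \mathbb{N}$ and noting that a countable union of null sets is null, we now conclude: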

Corollary 4 (One-dimensional Rademacher’s theorem). If $f:\mathbb{R} \to \mathbb{R}$ is Lipschitz, then $f$ is differentiable $\mathscr{L}^1$-almost everywhere on $\mathbb{R}$.

3. Hausdorff Measures

We shall also need a way of assigning $m$-dimensional measures on subsets of $\mathbb{R}^n$, when $m<n$. To this end, we recall that the diameter of a set $E \subseteq \mathbb{R}^n$ is

$$\operatorname{diam}(E) = \sup_{x,y \in E} |x-y|.$$

Recall also that the distance between two subsets $E$ and $F$ of $\mathbb{R}^n$ is

$$\operatorname{dist}(E,F) = \inf_{\substack{x \in E \\ y \in F}} |x-y|.$$
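For finite sets, both quantities can be computed by brute force. The sketch below does this in Python for two small point sets in $\mathbb{R}^2$; the sets are made up for illustration, and the diameter is taken to be the largest pairwise distance.

```python
import itertools
import math

def diam(E):
    """Diameter of a finite point set: the largest pairwise distance."""
    return max((math.dist(x, y) for x, y in itertools.product(E, E)), default=0.0)

def dist(E, F):
    """Distance between two finite nonempty point sets: the smallest cross distance."""
    return min(math.dist(x, y) for x in E for y in F)

# Illustrative point sets (assumptions of this sketch).
E = [(0.0, 0.0), (3.0, 4.0)]
F = [(6.0, 8.0), (10.0, 0.0)]
print(diam(E), dist(E, F))
```

Here the closest pair across the two sets is $(3,4)$ and $(6,8)$, at distance $5$, which also happens to be the diameter of $E$.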

Before we give a definition of the Hausdorff measure, we review a particularly constructive way of defining the Lebesgue measure. The crucial ingredient is the following

Lemma 5 (Whitney decomposition theorem). Every open set $O$ in $\mathbb{R}^n$ can be decomposed into a countable union of cubes whose interiors are disjoint. Furthermore, we can choose the cubes $(Q_j)_{j=1}^\infty$ satisfying

$$\operatorname{diam}(Q_j) \leq \operatorname{dist}(Q_j,O^c) \leq 4\operatorname{diam}(Q_j).$$

The proof of the first statement is easy and can be found in [SS05], Chapter 1, Theorem 1.4. The full proof can be found in [Ste70], Chapter VI §1.2.

We now use the Whitney decomposition theorem to construct the Lebesgue measure. Here we denote by $|Q_j|$ the standard $n$-dimensional volume of the cube $Q_j$.

Definition (The Lebesgue measure). The $n$-dimensional Lebesgue measure $\mathscr{L}^n$ on $\mathbb{R}^n$ is defined for every subset $E \subseteq \mathbb{R}^n$ by

$$ \mathscr{L}^n(E) = \inf_{O \supseteq E} \inf_{O = \bigcup Q_j} \sum_{j=1}^\infty |Q_j| $$

where the first infimum is taken over all open supersets of $E$ and the second infimum over all Whitney decompositions of $O$.
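To make the construction concrete in the simplest case, here is a Python sketch of a one-dimensional decomposition of the open interval $(0,1)$ into closed dyadic intervals that shrink towards the endpoints. The specific family of intervals is an illustrative choice, and only finitely many levels are generated. The sketch checks the Whitney-type inequality $\operatorname{diam}(Q_j) \leq \operatorname{dist}(Q_j,O^c) \leq 4\operatorname{diam}(Q_j)$ and sums the lengths.

```python
from fractions import Fraction

def whitney_intervals(levels):
    """Closed dyadic intervals [lo, hi] inside (0, 1), shrinking towards 0 and 1."""
    intervals = []
    for k in range(1, levels + 1):
        lo, hi = Fraction(1, 2 ** (k + 1)), Fraction(1, 2 ** k)
        intervals.append((lo, hi))          # piece approaching the endpoint 0
        intervals.append((1 - hi, 1 - lo))  # mirror piece approaching the endpoint 1
    return intervals

def dist_to_complement(lo, hi):
    """Distance from [lo, hi] to the complement of (0, 1)."""
    return min(lo, 1 - hi)

pieces = whitney_intervals(20)
total_length = sum(hi - lo for lo, hi in pieces)
print(float(total_length))
```

With $K$ levels the lengths add up to $1 - 2^{-K}$, which tends to $\mathscr{L}^1((0,1)) = 1$ as $K \to \infty$.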

In words, we approximate the measure of each set by the measures of its open supersets, each of which is in turn computed from its Whitney decompositions. That the Lebesgue measure is Borel regular is a consequence of the following

Lemma 6 (Metric outer measures are Borel regular). A measure $\mu$ on $\mathbb{R}^n$ is Borel regular if and only if $$ \mu(E \cup F) = \mu(E) + \mu(F) $$ for all subsets $E$ and $F$ of $\mathbb{R}^n$ with $\operatorname{dist}(E,F) > 0$.

The proof of the “only if” part is easy. For a proof of the “if” part, see [SS05], Chapter 6, Theorem 1.2 or [Fol99], Proposition 11.16.

We now return to the task of assigning $m$-dimensional measures on the subsets of $\mathbb{R}^n$. The idea, due to Hausdorff, is to approximate the measure by $m$-dimensional balls:

Definition. Fix $m \in \mathbb{N}$, and let $\omega_m$ be the $m$-dimensional Lebesgue measure of the closed unit ball in $\mathbb{R}^m$. The $m$-dimensional Hausdorff measure $\mathscr{H}^m$ is defined for every subset $E \subseteq \mathbb{R}^n$ by

$$ \mathscr{H}^m(E) = \lim_{\delta \to 0} \inf_{\substack{E \subseteq \bigcup S_j \\ \operatorname{diam}(S_j) \leq \delta}} \sum_{j=1}^\infty \omega_m \left(\frac{\operatorname{diam}(S_j)}{2} \right)^m, $$

where the infimum is taken over all countable covers $(S_j)_{j=1}^\infty$ of $E$ by sets of diameter at most $\delta$.

Here we have chosen the normalization that will give us the identity $\mathscr{H}^n = \mathscr{L}^n$. Note that $\mathscr{H}^0$ is the counting measure. The one-dimensional Hausdorff measure $\mathscr{H}^1$ of a rectifiable curve is its length. In general, the $m$-dimensional Hausdorff measure of an $m$-manifold coincides with its surface measure, but we shall not pursue the idea in this talk.
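The role of the normalizing constant $\omega_m$ can be illustrated numerically. The Python sketch below uses the standard formula $\omega_m = \pi^{m/2}/\Gamma(m/2+1)$ and evaluates the sum in the definition for one particular cover of the unit segment $[0,1] \times \lbrace 0 \rbrace \subseteq \mathbb{R}^2$ by $N$ subsegments of length $1/N$. The cover and the value of $N$ are assumptions of the sketch, and a single cover only gives an upper bound for the infimum.

```python
import math

def omega(m):
    """Lebesgue measure of the unit ball in R^m: pi^(m/2) / Gamma(m/2 + 1)."""
    return math.pi ** (m / 2) / math.gamma(m / 2 + 1)

def hausdorff_sum(diameters, m):
    """Sum of omega_m * (diam(S_j) / 2)^m over a cover with the given diameters."""
    return sum(omega(m) * (d / 2) ** m for d in diameters)

# Cover the unit segment by N subsegments, each of diameter 1/N.
N = 1000
cover = [1.0 / N] * N
length_estimate = hausdorff_sum(cover, 1)  # 1-dimensional size of this cover
area_estimate = hausdorff_sum(cover, 2)    # 2-dimensional size of the same cover
print(length_estimate, area_estimate)
```

The one-dimensional sum equals the length of the segment for every $N$, while the two-dimensional sum is of order $1/N$, in line with the expectation that a segment has $\mathscr{H}^1$-measure $1$ and $\mathscr{H}^2$-measure $0$.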

4. Proof of Rademacher’s Theorem

We are now ready to present a proof of Rademacher’s theorem. The proof is taken from pages 101-102 of [Mat95].

Let $f:\mathbb{R}^n \to \mathbb{R}^m$ be a Lipschitz function. Since the differentiability of $f$ is equivalent to the differentiability of the coordinate functions $f_1,\ldots,f_m$, we may assume without loss of generality that $m=1$. Recall that the directional derivative of $f$ at $x \in \mathbb{R}^n$ in the direction of $u \in \mathbb{S}^{n-1}$ is

$$D_u f(x) = \lim_{t \to 0} \frac{f(x+tu)-f(x)}{t},$$

provided that the limit exists. We contend that $D_u f(x)$ exists for $\mathscr{L}^n$-almost every $x \in \mathbb{R}^n$.

To prove the assertion, we let $B_u$ denote the collection of points $x \in \mathbb{R}^n$ at which $D_u f(x)$ does not exist and observe that

$$B_u = \left\lbrace x \in \mathbb{R}^n : \limsup_{t \to 0} \frac{f(x+tu)-f(x)}{t} - \liminf_{t \to 0} \frac{f(x+tu)-f(x)}{t} > 0 \right\rbrace.$$

Since the difference quotient is continuous at $t \neq 0$, the limit superior and the limit inferior thereof are measurable, whereby $B_u$ is measurable. Applying the one-dimensional Rademacher’s theorem to $t \mapsto f(x+tu)$, we find that

$$\mathscr{H}^1(B_u \cap \lbrace x + tu : t \in \mathbb{R}\rbrace) = 0$$

for each $x \in \mathbb{R}^n$. We now let $\chi_{B_u}$ be the characteristic function of $B_u$ and write

$$\mathscr{L}^n(B_u) = \int_{\mathbb{R}^n} \chi_{B_u}(x) \, dx.$$

By the Fubini-Tonelli theorem, we may integrate along each line parallel to the unit vector $u$ and then integrate over the set of all such lines to conclude that $\mathscr{L}^n(B_u) = 0$.

Recall that the gradient of $f$ at $x \in \mathbb{R}^n$ is given by the row vector

$$\nabla f(x) = \begin{bmatrix} D_1 f(x) & \cdots & D_n f(x) \end{bmatrix},$$

where $D_j f(x) = D_{e_j} f(x)$ are the partial derivatives in the standard basis $\lbrace e_1,\ldots, e_n \rbrace$ for $\mathbb{R}^n$. We claim that, for each $u \in \mathbb{S}^{n-1}$,

$$D_u f(x) = \nabla f(x) \cdot u$$

at $\mathscr{L}^n$-almost every $x \in \mathbb{R}^n$.

To this end, we fix $\varphi \in \mathscr{C}^\infty_c(\mathbb{R}^n)$ and observe that

$$\begin{align*} \int_{\mathbb{R}^n} \frac{f(x+hu)-f(x)}{h} \varphi(x) \, dx &= \frac{1}{h} \left( \int_{\mathbb{R}^n} f(x+hu) \varphi(x) \, dx - \int_{\mathbb{R}^n} f(x)\varphi(x) \, dx \right) \\ &= \frac{1}{h} \left( \int_{\mathbb{R}^n} f(x) \varphi(x-hu) \, dx - \int_{\mathbb{R}^n} f(x)\varphi(x) \, dx \right) \\ &= -\int_{\mathbb{R}^n} \frac{\varphi(x)-\varphi(x-hu)}{h} f(x) \, dx \end{align*}$$

for each $h \neq 0$. By the Lipschitz condition, we have the bound

$$\left| \frac{f(x+hu)-f(x)}{h} \varphi(x) \right| \leq \frac{\operatorname{Lip} f |hu|}{|h|} |\varphi(x)| \leq \operatorname{Lip} f |\varphi (x)|,$$

whence by the dominated convergence theorem

$$\begin{align*} \lim_{h \to 0} \int_{\mathbb{R}^n} \frac{f(x+hu)-f(x)}{h} \varphi(x) \, dx &= \int_{\mathbb{R}^n} \lim_{h \to 0} \frac{f(x+hu)-f(x)}{h} \varphi(x) \, dx \\ &= \int_{\mathbb{R}^n} D_u f(x) \varphi(x) \, dx. \end{align*}$$

On the other hand, the change of variables above and a second application of the dominated convergence theorem (the difference quotients of $\varphi$ are uniformly bounded and supported in a fixed compact set, on which $f$ is bounded) yield

$$\begin{align*} \lim_{h \to 0} \int_{\mathbb{R}^n} \frac{f(x+hu)-f(x)}{h} \varphi(x) \, dx &= -\lim_{h \to 0} \int_{\mathbb{R}^n} \frac{\varphi(x)-\varphi(x-hu)}{h} f(x) \, dx \\ &= -\int_{\mathbb{R}^n} \lim_{h \to 0} \frac{\varphi(x)-\varphi(x-hu)}{h} f(x) \, dx \\ &= -\int_{\mathbb{R}^n} f(x) D_u \varphi(x) \, dx. \end{align*}$$
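The two limits together give the integration-by-parts identity $\int D_u f \, \varphi = -\int f \, D_u \varphi$. As a rough numerical check in dimension one, the Python sketch below compares both sides for $f(x) = |x|$ and a shifted bump function $\varphi$. The bump, its center $0.3$, and the midpoint rule with $20000$ cells are all illustrative assumptions, not taken from the notes.

```python
import math

def bump(y):
    """Smooth bump exp(-1 / (1 - y^2)) supported in [-1, 1]."""
    return math.exp(-1.0 / (1.0 - y * y)) if abs(y) < 1.0 else 0.0

def bump_prime(y):
    """Derivative of the bump, obtained from the chain rule."""
    return bump(y) * (-2.0 * y / (1.0 - y * y) ** 2) if abs(y) < 1.0 else 0.0

def midpoint_integral(g, a, b, n=20000):
    """Midpoint Riemann sum of g over [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))

center = 0.3  # shift the bump so that its support straddles the kink of |x| at 0

def phi(x):
    return bump(x - center)

def dphi(x):
    return bump_prime(x - center)

def sign(x):
    """Derivative of |x| away from the origin."""
    return (x > 0) - (x < 0)

lhs = midpoint_integral(lambda x: sign(x) * phi(x), -0.7, 1.3)
rhs = -midpoint_integral(lambda x: abs(x) * dphi(x), -0.7, 1.3)
print(lhs, rhs)
```

The two printed values should agree up to quadrature error, even though $f$ fails to be differentiable at the origin, since a single point does not affect the integral.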

Since $\varphi(x)$ is $\mathscr{C}^1$ everywhere, we have

$$D_u\varphi(x) = \nabla \varphi(x) \cdot u$$

for all $x \in \mathbb{R}^n$. It thus follows that

$$\begin{align*} \int_{\mathbb{R}^n} D_u f(x) \varphi(x) \, dx &= -\int_{\mathbb{R}^n} f(x) D_u \varphi(x) \, dx \\ &= -\int_{\mathbb{R}^n} f(x) \left(\nabla \varphi(x) \cdot u \right) \, dx \\ &= -\int_{\mathbb{R}^n} f(x) \left(\sum_{j=1}^n D_j \varphi(x) u_j \right) \, dx \\ &= - \sum_{j=1}^n u_j \int_{\mathbb{R}^n} f(x) D_j \varphi(x) \, dx \\ &= \sum_{j=1}^n u_j \int_{\mathbb{R}^n} D_j f(x) \varphi(x) \, dx \\ &= \int_{\mathbb{R}^n} \left( \sum_{j=1}^n D_j f(x) u_j \right) \varphi(x) \, dx \\ &= \int_{\mathbb{R}^n} \nabla f(x) \cdot u \varphi(x) \, dx. \end{align*}$$

Since the identity holds for all $\varphi \in \mathscr{C}^\infty_c$, we conclude that

$$D_uf(x) = \nabla f(x) \cdot u$$

for $\mathscr{L}^n$-almost every $x \in \mathbb{R}^n$.
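The identity $D_u f(x) = \nabla f(x) \cdot u$ can be observed numerically. The following Python sketch uses the Lipschitz function $f(x) = |x_1| + 2|x_2|$ on $\mathbb{R}^2$, which is an illustrative choice, and approximates derivatives by difference quotients with a fixed small step. It also shows a point of the exceptional set: at the origin the one-sided quotients in the direction $e_1$ disagree, so $D_{e_1} f(0)$ does not exist.

```python
import math

def f(x):
    """A Lipschitz function that is not differentiable on the coordinate axes."""
    return abs(x[0]) + 2 * abs(x[1])

def difference_quotient(f, x, u, t):
    """(f(x + t u) - f(x)) / t for a point x, direction u, and step t != 0."""
    shifted = [xi + t * ui for xi, ui in zip(x, u)]
    return (f(shifted) - f(x)) / t

def gradient(f, x, t=1e-6):
    """Difference quotients along the standard basis vectors."""
    n = len(x)
    basis = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    return [difference_quotient(f, x, e, t) for e in basis]

x = [1.0, -1.0]                     # a point off the axes
u = [math.cos(0.7), math.sin(0.7)]  # an arbitrary unit vector
grad = gradient(f, x)
lhs = difference_quotient(f, x, u, 1e-6)
rhs = sum(g * ui for g, ui in zip(grad, u))

# One-sided quotients at the origin in the direction e_1.
right = difference_quotient(f, [0.0, 0.0], [1.0, 0.0], 1e-6)
left = difference_quotient(f, [0.0, 0.0], [1.0, 0.0], -1e-6)
print(lhs, rhs, right, left)
```

The set where this $f$ fails to be differentiable is the union of the two coordinate axes, which has $\mathscr{L}^2$-measure zero.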

We now recall that $f$ is differentiable at $x \in \mathbb{R}^n$ if there exists a matrix $Df(x)$ such that

$$\lim_{|v| \to 0} \frac{|f(x+v) - f(x) - Df(x) \cdot v|}{|v|} = 0.$$

We are now left with the task of establishing the $\mathscr{L}^n$-almost everywhere differentiability of $f$. Of course, we will have

$$Df(x) = \nabla f(x)$$

whenever the left-hand side exists.

Since $\mathbb{S}^{n-1}$ is compact, it is separable, and so we can find a countable dense subset $\lbrace u_1,u_2,\ldots \rbrace$ of $\mathbb{S}^{n-1}$. For each $j$, we let $A_j$ be the collection of $x \in \mathbb{R}^n$ at which $\nabla f(x)$ and $D_{u_j} f(x)$ exist and $D_{u_j} f(x) = \nabla f(x) \cdot u_j$. By the two claims established above, $\mathscr{L}^n(\mathbb{R}^n \smallsetminus A_j) = 0$ for each $j$. We set

$$A = \bigcap_{j=1}^\infty A_j$$

and observe that

$$\begin{align*} \mathscr{L}^n(\mathbb{R}^n \smallsetminus A) &= \mathscr{L}^n \left( \mathbb{R}^n \smallsetminus \bigcap_{j=1}^\infty A_j \right) \\ &= \mathscr{L}^n \left( \mathbb{R}^n \cap \left( \bigcap_{j=1}^\infty A_j \right)^c \right) \\ &= \mathscr{L}^n \left( \mathbb{R}^n \cap \bigcup_{j=1}^\infty A_j^c \right) \\ &\leq \mathscr{L}^n \left( \bigcup_{j=1}^\infty A_j^c \right) \\ &\leq \sum_{j=1}^\infty \mathscr{L}^n(A_j^c) \\ &= 0. \end{align*}$$

Our final claim is that $f$ is differentiable on $A$. To show this, we set, for every $x \in A$, $u \in \mathbb{S}^{n-1}$, and $h > 0$,

$$Q(x,u,h) = \frac{f(x+hu)-f(x)}{h} - \nabla f(x) \cdot u.$$

Given a fixed $x_0 \in A$, it suffices to show that

$$\lim_{h \to 0} Q(x_0,u,h) = 0$$

uniformly in $u \in \mathbb{S}^{n-1}$. We first note that, for all $u,u' \in \mathbb{S}^{n-1}$ and $h>0$,

$$\begin{align*} |Q(x_0,u,h)-Q(x_0,u',h)| &= \left|\frac{f(x_0+hu)-f(x_0+hu')}{h} - \nabla f(x_0) \cdot (u-u') \right| \\ &\leq \frac{(\operatorname{Lip} f) |hu-hu'|}{|h|} + |\nabla f(x_0)| |u-u'| \\ &\leq 2 (\operatorname{Lip} f) |u-u'|. \end{align*}$$

The last inequality uses the bound $|\nabla f(x_0)| \leq \operatorname{Lip} f$, which holds because $|\nabla f(x_0) \cdot u_j| = |D_{u_j} f(x_0)| \leq \operatorname{Lip} f$ for every $j$ and the $u_j$ are dense in $\mathbb{S}^{n-1}$.

Fix $\varepsilon>0$. Since $\lbrace u_1,u_2,\ldots \rbrace$ is dense in $\mathbb{S}^{n-1}$, the open $\varepsilon$-balls centered at the points $u_j$ cover $\mathbb{S}^{n-1}$. Since $\mathbb{S}^{n-1}$ is compact, finitely many of these balls suffice, and so we can find $N \in \mathbb{N}$ such that the $\varepsilon$-balls centered at $u_1,\ldots,u_N$ cover $\mathbb{S}^{n-1}$. By definition of $A$, we have $Q(x_0,u_j,h) \to 0$ as $h \to 0$ for every $j$. Therefore, we can find $\delta>0$ such that $|Q(x_0,u_j,h)| < \varepsilon$ for $0 < h < \delta$ and $1 \leq j \leq N$.

Now, for each $u \in \mathbb{S}^{n-1}$ and $0 < h < \delta$, we can find $1 \leq j \leq N$ such that $|u-u_j| < \varepsilon$. We then have

$$\begin{align*} |Q(x_0,u,h)| &\leq |Q(x_0,u,h)-Q(x_0,u_j,h)| + |Q(x_0,u_j,h)| \\ &< 2 (\operatorname{Lip} f) |u-u_j| + \varepsilon \\ &< (2 \operatorname{Lip} f + 1) \varepsilon, \end{align*}$$

which proves the claim. This completes the proof of Rademacher’s theorem.
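The uniform convergence in the last step can be watched numerically. In the Python sketch below, $f$ is the Euclidean norm on $\mathbb{R}^2$ (Lipschitz with constant $1$), the base point is $x_0 = (1,0)$, and the supremum over $\mathbb{S}^1$ is replaced by a maximum over $360$ equally spaced directions. All of these are illustrative assumptions rather than part of the proof.

```python
import math

def f(x):
    """The Euclidean norm on R^2, a Lipschitz function with Lipschitz constant 1."""
    return math.hypot(x[0], x[1])

x0 = (1.0, 0.0)
grad = (1.0, 0.0)  # gradient of the norm at x0, namely x0 / |x0|

def Q(u, h):
    """Difference quotient of f at x0 in direction u minus the linear part grad . u."""
    moved = (x0[0] + h * u[0], x0[1] + h * u[1])
    return (f(moved) - f(x0)) / h - (grad[0] * u[0] + grad[1] * u[1])

def sup_Q(h, samples=360):
    """Maximum of |Q(u, h)| over equally spaced unit vectors u on the circle."""
    angles = (2 * math.pi * k / samples for k in range(samples))
    return max(abs(Q((math.cos(a), math.sin(a)), h)) for a in angles)

steps = (1e-1, 1e-2, 1e-3)
sups = [sup_Q(h) for h in steps]
print(sups)
```

For this example the maximum over directions is roughly $h/2$, so it shrinks with $h$ at the same rate in every direction, which is what uniform convergence on the sphere amounts to.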