Functional Analysis

Linear Operators

A neural network weight matrix is a linear operator $\mathbb{R}^n \to \mathbb{R}^m$. Its singular values (the square roots of the eigenvalues of $W^*W$) govern how fast gradients grow or shrink as they pass through the layer (exploding/vanishing), whether normalization is needed, and how stable fine-tuning will be. The spectral norm $\|W\|_2 = \sigma_{\max}$ is exactly the quantity that spectral normalization (Miyato et al., 2018) controls in GAN discriminators.

  • **Spectral normalization (Miyato 2018)**: rescaling each weight matrix so that $\|W\|_2 \leq 1$ enforces a Lipschitz bound on GAN discriminators - a direct application of the bounded operator norm
  • **LoRA and PCA**: rank-$r$ approximation works when singular values decay quickly, the finite-dimensional shadow of a compact operator; nuclear norm minimization (the Netflix Prize setup) is a convex surrogate for rank minimization
  • **Self-attention**: for fixed attention weights, $\text{softmax}(QK^T/\sqrt{d})V$ acts linearly on the value sequence; the spectral properties of $QK^T$ shape the attention pattern

Prerequisites

  • Hilbert Spaces

Bounded Operators

For a weight matrix, the operator norm defined below is the largest singular value: $\|W\|_2 = \sigma_{\max}(W)$. Spectral normalization rescales $W$ to control precisely this quantity, so the resulting Lipschitz bound on the layer is a theorem, not a heuristic.

In finite dimensions every linear operator is automatically continuous. In infinite-dimensional spaces the story changes: an operator can be linear and discontinuous at the same time. Control is needed - **boundedness**.

A linear operator $T: X \to Y$ is called bounded if $\exists C > 0$ such that $\|Tx\| \leq C\|x\|$ for all $x \in X$. Operator norm: $\|T\| = \sup_{x \neq 0} \frac{\|Tx\|}{\|x\|} = \sup_{\|x\| = 1} \|Tx\|$
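For a matrix with Euclidean norms, the supremum in the definition can be estimated numerically. The sketch below is a minimal illustration in plain Python lists (no external libraries): it runs power iteration on $W^TW$, the same basic idea spectral normalization implementations rely on. The helper names `matvec`, `transpose`, `norm`, and `spectral_norm` are illustrative, not taken from any library.

```python
import math
import random

def matvec(W, x):
    """Return W @ x for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))

def spectral_norm(W, iters=200, seed=0):
    """Estimate ||W||_2 = sigma_max by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = transpose(W)
    x = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        y = matvec(Wt, matvec(W, x))
        size = norm(y)
        if size == 0.0:          # W maps the iterate to zero
            return 0.0
        x = [v / size for v in y]
    return norm(matvec(W, x))    # ||Wx|| for a unit vector x

# A map R^2 -> R^3 whose singular values are 3 and 1
W = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = spectral_norm(W)

# Sampled ratios ||Wx|| / ||x|| never exceed the supremum
rng = random.Random(1)
ratios = []
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    ratios.append(norm(matvec(W, x)) / norm(x))

print(sigma, max(ratios))
```

Random sampling only bounds the norm from below; power iteration converges to $\sigma_{\max}$ itself whenever the starting vector has a component along the top singular direction.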

Key fact: for linear operators, boundedness and continuity are one and the same. Continuity at a single point forces continuity everywhere - linearity propagates any failure at zero across the entire space.
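Both directions of this equivalence follow from a short standard argument, spelled out here for completeness. If $T$ is bounded with constant $C$, then for all $x, y \in X$

$$\|Tx - Ty\| = \|T(x - y)\| \leq C\|x - y\|,$$

so $T$ is Lipschitz and hence continuous. Conversely, if $T$ is continuous at $0$, there is a $\delta > 0$ such that $\|Tz\| \leq 1$ whenever $\|z\| \leq \delta$. For any $x \neq 0$, applying this to $z = \delta x / \|x\|$ gives

$$\|Tx\| = \frac{\|x\|}{\delta}\left\|T\left(\frac{\delta x}{\|x\|}\right)\right\| \leq \frac{1}{\delta}\|x\|,$$

so $T$ is bounded with $C = 1/\delta$.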

Do unbounded linear operators exist? Yes. The differentiation operator $d/dx$, defined on the subspace $C^1[0,1] \subset C[0,1]$ with the sup norm, is the standard example: the functions $f_n(x) = \sin(nx)/n$ satisfy $\|f_n\| \leq 1/n \to 0$, yet $\|f_n'\| = \sup_x |\cos(nx)| = 1$. The inputs tend to zero while the outputs keep norm 1, so no constant $C$ can satisfy $\|f'\| \leq C\|f\|$. Such operators are central to PDEs and quantum mechanics, but they require separate, careful treatment.
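The growth of the ratio $\|f_n'\| / \|f_n\|$ can be checked numerically. The sketch below approximates the sup norm on a uniform grid over $[0,1]$ (an assumption made for illustration; the grid slightly underestimates the true supremum) and uses the exact derivative $\cos(nx)$. The helper name `sup_norm` is illustrative.

```python
import math

def sup_norm(f, grid):
    """Approximate the sup norm of f on [0, 1] by its maximum over a grid."""
    return max(abs(f(x)) for x in grid)

grid = [i / 10000 for i in range(10001)]   # uniform grid on [0, 1]

ratios = {}
for n in (2, 10, 100, 1000):
    f = lambda x, n=n: math.sin(n * x) / n   # f_n
    df = lambda x, n=n: math.cos(n * x)      # exact derivative f_n'
    ratios[n] = sup_norm(df, grid) / sup_norm(f, grid)
    print(n, sup_norm(f, grid), sup_norm(df, grid), ratios[n])
```

The ratio grows roughly like $n$, so no single constant $C$ bounds $\|f'\|$ by $C\|f\|$ on all of $C^1[0,1]$.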


Compact Operators

PCA is a compact operator in action. The covariance operator of data with finite variance is self-adjoint, positive, and compact; its nonzero eigenvalues (the variances along the principal components) are isolated and decay to zero. Projection onto the top-$r$ principal components has finite rank and therefore maps bounded sets to precompact ones. LoRA parametrizes a fine-tuned weight matrix as $W = W_0 + AB$ with $\text{rank}(AB) \leq r \ll n$, on the hypothesis that the update needed for adaptation is well approximated by a low-rank perturbation.

A linear operator $T: X \to Y$ is called compact if for every bounded set $B \subset X$ the closure $\overline{T(B)}$ is compact in $Y$. Equivalently: from every bounded sequence $\{x_n\}$ one can extract a subsequence on which $\{Tx_n\}$ converges.

The classic example is the integral operator $(Tf)(x) = \int_a^b K(x,t)f(t)\,dt$ with a continuous kernel $K$. The Arzelà-Ascoli theorem guarantees compactness on $C[a,b]$: the kernel smooths things out, collapsing an infinite-dimensional ball of inputs into a precompact family of outputs. Self-attention with finite context can be viewed as a discrete analogue: for fixed attention weights, each output is a kernel-weighted sum over positions instead of an integral.

Why are compact operators so valuable? Their spectral theory mirrors the finite-dimensional case: nonzero eigenvalues are isolated, have finite multiplicity, and can only accumulate at zero. This is the Riesz-Schauder theorem - a bridge between linear algebra and infinite-dimensional analysis. Nuclear norm minimization in matrix completion (the Netflix Prize setup) works in the same spirit: the nuclear norm $\|W\|_* = \sum_i \sigma_i$ is a convex surrogate for rank, so minimizing it promotes low-rank solutions with rapidly decaying singular values.

The identity operator $I$ on an infinite-dimensional space is NOT compact: the sequence of standard basis vectors $\{e_n\}$ in $\ell^2$ is bounded, yet $\|e_n - e_m\| = \sqrt{2}$ for all $n \neq m$ - no subsequence converges. This is the fundamental difference between infinite and finite dimensions.
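A worked contrast makes the difference concrete. The diagonal operator $D$ below is an illustrative example, not taken from the text above. Define $D$ on $\ell^2$ by $De_n = e_n/n$ and its finite-rank truncations by

$$D_r x = \sum_{n \leq r} \frac{x_n}{n}\, e_n .$$

Then

$$\|D - D_r\| = \sup_{n > r} \frac{1}{n} = \frac{1}{r+1} \to 0,$$

so $D$ is a limit of finite-rank operators in the operator norm and is therefore compact; its eigenvalues $1/n$ accumulate only at zero. For the identity no such approximation exists: any finite-rank operator $F$ on an infinite-dimensional space has a nonzero kernel, and for a unit vector $x$ with $Fx = 0$

$$\|(I - F)x\| = \|x\| = 1, \quad \text{hence} \quad \|I - F\| \geq 1 .$$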


Spectrum and Adjoint Operator

Spectral normalization, introduced in SN-GAN and later used in SAGAN and BigGAN, controls $\|W\|_2 = \sigma_{\max}(W)$ - the largest singular value. And $\sigma_{\max}(W) = \sqrt{\lambda_{\max}(W^*W)}$ - the square root of the largest eigenvalue of the operator $W^*W$. Spectral theory and adjoint operators appear directly in production code for generative models.
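For a real $2 \times 2$ matrix the identity $\sigma_{\max}(W) = \sqrt{\lambda_{\max}(W^TW)}$ can be evaluated in closed form, since the eigenvalues of the symmetric matrix $W^TW$ are the roots of a quadratic. The sketch below does this and then divides $W$ by $\sigma_{\max}$, the rescaling step at the heart of spectral normalization. The function name `sigma_max_2x2` and the example matrix are illustrative choices.

```python
import math

def sigma_max_2x2(W):
    """Largest singular value of a real 2x2 matrix via eigenvalues of W^T W."""
    (a, b), (c, d) = W
    # Entries of the symmetric Gram matrix G = W^T W
    g11 = a * a + c * c
    g12 = a * b + c * d
    g22 = b * b + d * d
    trace = g11 + g22
    det = g11 * g22 - g12 * g12
    # Larger root of lambda^2 - trace * lambda + det = 0
    disc = max(trace * trace - 4.0 * det, 0.0)   # clamp rounding noise
    lam_max = (trace + math.sqrt(disc)) / 2.0
    return math.sqrt(lam_max)

W = [[3.0, 4.0], [0.0, 0.0]]                 # rank one, sigma_max = 5
s = sigma_max_2x2(W)
W_sn = [[w / s for w in row] for row in W]   # rescaled weights
s_after = sigma_max_2x2(W_sn)
print(s, s_after)
```

After rescaling, the layer $x \mapsto W_{\text{sn}}x$ has operator norm 1, so it cannot stretch any input.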

The spectrum of a bounded operator $T$ is

$\sigma(T) = \{\lambda \in \mathbb{C} : \text{the operator } (T - \lambda I) \text{ has no bounded inverse}\}$

It splits into three parts:

  • Point spectrum $\sigma_p(T)$: $Tx = \lambda x$ has a nonzero solution (eigenvalues)
  • Continuous spectrum $\sigma_c(T)$: $T - \lambda I$ is injective with dense image, but $(T-\lambda I)^{-1}$ is unbounded
  • Residual spectrum $\sigma_r(T)$: $T - \lambda I$ is injective, but its image is not dense

Why "Spectrum"

Hilbert introduced the term around 1906, in his work on integral equations. Two decades later quantum mechanics gave it a literal reading: the eigenvalues of the self-adjoint Hamiltonian operator are the energy levels of an atom, and the differences between them give the lines of its emission spectrum, which the Bohr model of 1913 had first explained for hydrogen. The mathematical abstraction turned out to be an exact model of the physics.

For a bounded operator $T: H \to H$, the adjoint $T^*: H \to H$ is defined by $\langle Tx, y \rangle = \langle x, T^*y \rangle$ for all $x, y \in H$.

Operator types and their spectra:

  • Self-adjoint ($T^* = T$): $\sigma(T) \subset \mathbb{R}$
  • Unitary ($T^*T = TT^* = I$): $\sigma(T) \subset \{|\lambda| = 1\}$
  • Normal ($T^*T = TT^*$): diagonalizable by the spectral theorem
  • Positive ($T = T^*$, $\langle Tx,x \rangle \geq 0$): $\sigma(T) \subset [0, +\infty)$
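For a real matrix the adjoint is simply the transpose, and the defining identity can be verified directly. The sketch below uses small integer examples so that the comparison is exact. It also checks that $W^TW$ is positive in the sense above, because $\langle W^TWx, x \rangle = \|Wx\|^2$. The helper names are illustrative.

```python
def matvec(W, x):
    """Return W @ x for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

W = [[1, 2, 3], [4, 5, 6]]     # a map R^3 -> R^2
Wt = transpose(W)              # its adjoint, a map R^2 -> R^3
x = [1, 0, -1]
y = [2, 1]

lhs = dot(matvec(W, x), y)     # <Wx, y>, inner product in R^2
rhs = dot(x, matvec(Wt, y))    # <x, W^T y>, inner product in R^3

# <W^T W x, x> equals ||Wx||^2, so W^T W is a positive operator
quad = dot(matvec(Wt, matvec(W, x)), x)
sq_norm = dot(matvec(W, x), matvec(W, x))
print(lhs, rhs, quad, sq_norm)
```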

The spectral theorem for compact self-adjoint operators: $H$ has an orthonormal basis $\{e_n\}$ of eigenvectors of $T$, and $T = \sum_n \lambda_n \langle \cdot, e_n \rangle e_n$. This is infinite-dimensional diagonalization. PCA is its finite-rank truncation: take the top $r$ eigenvectors of the covariance operator and project. LoRA applies the same low-rank idea to weight updates: $\Delta W = AB$, $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{r \times n}$, where the rank $r$ is a hyperparameter; the faster the singular values of the required update decay, the smaller the $r$ that suffices.
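The factorization $\Delta W = AB$ can be sketched in a few lines. The example below uses small integer matrices with hypothetical values (they are not trained weights) to show two things: the adapted layer can be applied as $W_0x + A(Bx)$ without ever forming $\Delta W$, and the factors need $r(m+n)$ parameters instead of $mn$.

```python
def matvec(M, x):
    """Return M @ x for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in M]

def matmul(A, B):
    """Return A @ B for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

m, n, r = 4, 3, 1
W0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]   # frozen weights, m x n
A = [[1], [2], [0], [-1]]                            # trainable factor, m x r
B = [[1, 0, 2]]                                      # trainable factor, r x n

delta = matmul(A, B)                                 # rank at most r
W = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W0, delta)]

x = [1, 2, 3]
full = matvec(W, x)                                  # (W0 + AB) x
low_rank = [u + v for u, v in zip(matvec(W0, x), matvec(A, matvec(B, x)))]

params_full = m * n          # parameters in a dense update
params_lora = r * (m + n)    # parameters in the two factors
print(full, low_rank, params_full, params_lora)
```

The saving is modest at this toy size but grows quickly: for $m = n = 4096$ and $r = 8$, the factors hold $65{,}536$ parameters against roughly $16.8$ million for a dense update.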

Common misconception: the spectrum of an operator is the same as its set of eigenvalues.

In fact, spectrum = point + continuous + residual. Eigenvalues are only the point part.

The right shift $S$ in $\ell^2$ has no eigenvalues (the equation $Se = \lambda e$ has no nonzero solutions), yet $\sigma(S)$ is the entire closed unit disk. In finite dimensions the spectrum always coincides with the eigenvalues - which is precisely why matrices build false intuition.
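Both claims about the right shift can be verified by a short standard computation. Write $S(x_1, x_2, \dots) = (0, x_1, x_2, \dots)$. Comparing components in $Sx = \lambda x$ gives

$$0 = \lambda x_1, \qquad x_k = \lambda x_{k+1} \quad (k \geq 1).$$

If $\lambda = 0$, the second relation forces every $x_k = 0$. If $\lambda \neq 0$, then $x_1 = 0$ and inductively $x_{k+1} = x_k/\lambda = 0$. Either way $x = 0$, so $\sigma_p(S) = \emptyset$.

The adjoint is the left shift $S^*(x_1, x_2, \dots) = (x_2, x_3, \dots)$. For $|\lambda| < 1$ the vector $v_\lambda = (1, \lambda, \lambda^2, \dots)$ lies in $\ell^2$ and satisfies $S^*v_\lambda = \lambda v_\lambda$, so for every $x$

$$\langle (S - \bar{\lambda} I)x, v_\lambda \rangle = \langle x, (S^* - \lambda I)v_\lambda \rangle = 0 .$$

The image of $S - \bar{\lambda} I$ is orthogonal to $v_\lambda$ and hence not dense, so every point of the open unit disk belongs to the residual spectrum. Since the spectrum is closed and contained in $\{|\lambda| \leq \|S\| = 1\}$, it is exactly the closed unit disk.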


Key Takeaways

  • **Boundedness = continuity** for linear operators; $\|W\|_2 = \sigma_{\max}$ for matrices with Euclidean norms - the quantity spectral normalization controls
  • **Compact operators** - "almost finite-dimensional": the image of the unit ball is precompact; nonzero eigenvalues are isolated and decay to zero - the foundation of PCA and the intuition behind LoRA
  • **Spectrum** $\sigma(T)$ can be strictly larger than the eigenvalue set: in infinite dimensions it may include continuous and residual components, which are impossible for matrices
  • **Adjoint** $T^*$: $\langle Tx, y \rangle = \langle x, T^*y \rangle$; compact self-adjoint operators are diagonalizable with real spectrum - the infinite-dimensional version of matrix diagonalization

Related Topics

Operator theory is the core of functional analysis, with direct paths into ML:

  • Hilbert Spaces — Inner product constructs the adjoint via the Riesz theorem
  • Normed Spaces — B(X,Y) is itself a normed space with the operator norm
  • Spectral Theory — Full spectral theorem for normal operators

Questions for Reflection

  • Spectral normalization divides each weight matrix $W$ by $\sigma_{\max}(W)$ after every training step. Why does this guarantee the Lipschitz condition across the entire network?
  • The right shift operator in $\ell^2$ has no eigenvalues, yet its spectrum is the entire closed unit disk. How is this possible? Which points belong to the continuous spectrum?
  • LoRA replaces $\Delta W$ with a rank-$r$ matrix. Why does this work? What property of real neural network weights does it exploit?

Related Lessons

  • fa-02 — Hilbert geometry is the foundation for adjoint operators
  • fa-04 — Hahn-Banach builds on the theory of bounded functionals
  • fa-06 — Full spectral theory continues the discussion of operator spectrum
  • fa-11 — Direct application of compact operators to kernel methods and PCA
  • la-02-dot-product — A matrix in R^n is the finite-dimensional prototype of a linear operator
  • la-13-linear-maps