Scientific Computing

NumPy and Linear Algebra

Цели урока

Understand ndarray as a homogeneous contiguous-memory array and why it beats Python lists
Apply broadcasting correctly and predict the shape of the result
Use np.linalg for solving systems, SVD, eigenvalues; know when NOT to use inv()
Replace Python loops with vectorized operations and understand the speedup through BLAS/LAPACK

Предварительные знания

Introduction to Scientific Computing

The ECMWF climate model processes 100 TB of data every day for weather forecasting. NumPy is the foundation of PyTorch, TensorFlow, and pandas. scipy.linalg runs Fortran LAPACK under the hood. CERN's Large Hadron Collider: 15 petabytes per year through the ROOT framework. A 10,000 x 10,000 matrix? One line of NumPy - seconds. Three nested Python loops - hours.

**PyTorch and TensorFlow** - Tensor = ndarray + autograd; all of deep learning is built on NumPy broadcasting and matmul semantics
**ECMWF weather forecast** - 100 TB daily, NumPy/SciPy for post-processing; 10-day forecast accuracy doubled over 20 years
**CERN ROOT framework** - 15 PB/year; Python bindings via NumPy arrays for particle physics analysis
**NumPy SVD in recommendations** - Netflix Prize 2009 was won through matrix factorization; np.linalg.svd in 3 lines
**Scipy LAPACK** - the same Fortran code used in aerospace engineering and CFD simulations since the 1980s, now in pip install scipy

Travis Oliphant and the birth of NumPy

In 2005 Travis Oliphant merged two competing projects - Numeric (1995, Jim Hugunin) and Numarray (2001, STScI) - into a single package called NumPy. Under the hood NumPy calls into BLAS (Basic Linear Algebra Subprograms, 1979) and LAPACK (1992, Anderson et al.) - Fortran libraries that are over forty years old. That is why np.dot runs hundreds of times faster than a Python loop: SIMD instructions and cache-optimized GEMM from the 80s. In 2020 the paper "Array programming with NumPy" in Nature officially recognized NumPy as the foundation of modern data science.

ndarray - the Core of NumPy

John von Neumann and Herman Goldstine, 1947 - first numerical computing

In 1947, John von Neumann and Herman Goldstine ran a matrix inversion on ENIAC - the first serious numerical computation on a modern computer. ENIAC had no float64, no broadcasting, no vectorization. NumPy (2005) carries the same principle: a homogeneous array of numbers in memory. But orders of magnitude faster. PyTorch, TensorFlow, pandas - all of them are ndarray under the hood.

The ECMWF climate model processes 100 TB of data every day for weather forecasting. NumPy is the foundation of PyTorch, TensorFlow, and pandas. CERN's Large Hadron Collider produces 15 petabytes per year through the ROOT framework, with most analysis running in NumPy/SciPy. Python lists cannot handle this. **ndarray** - an N-dimensional array of a fixed type in a contiguous memory block - delivers speeds hundreds of times faster than lists.

**shape** - a tuple of dimensions (rows, columns, ...). **dtype** - the element type (float64, int32, complex128). All elements share one type - this enables compact storage and direct access without Python's per-object overhead. A PyTorch Tensor and a NumPy ndarray share the same memory via `torch.from_numpy()` - zero-copy interop.

Slicing in NumPy returns a **view** - not a copy, but a window into the same memory block. Modifying the view modifies the original. For an independent copy: `M[0, :].copy()`. This is not a bug - it is an architectural decision for zero-copy ML pipelines.

Creation	Function	Example
Zeros	np.zeros(shape)	np.zeros((3, 3))
Ones	np.ones(shape)	np.ones((2, 5))
Identity matrix	np.eye(n)	np.eye(4)
Random	np.random.randn(shape)	np.random.randn(3, 3)
Range	np.linspace(a, b, n)	np.linspace(0, 1, 50)

What is the result of running: row = M[0, :]; row[0] = 42?

Broadcasting

Normalizing 1000 samples across 5 features. Subtract the mean, divide by the standard deviation. Naive approach: loop over 1000 rows. NumPy approach: one line. A matrix of shape (1000, 5) minus a vector of shape (5,) - this is **broadcasting**: NumPy automatically stretches the vector to match the matrix without creating a copy in memory. Zero overhead.

**Broadcasting rules:** 1. Dimensions are compared right to left. 2. Two dimensions are compatible if they are equal OR one of them is 1. 3. Missing dimensions are treated as 1. If the rules are not satisfied, an error is raised.

Broadcasting is not just convenience. NumPy does not create an enlarged copy in memory - the operation executes on the fly. PyTorch uses the same mechanism: per-sample loss, batch normalization statistics, gradient accumulation - all through broadcasting. Mastering broadcasting means mastering half of PyTorch.

What is the result of np.ones((3, 4)) + np.ones((3, 1))?

numpy.linalg

scipy.linalg runs Fortran LAPACK under the hood. The same LAPACK written in the 1970s-80s, optimized for every CPU architecture. NumPy calls the same library. Solving systems, eigenvalues, SVD - none of this is Python. It is decades of numerical analysis in compiled code.

**np.linalg.solve(A, b)** - O(n^3), but much faster and more accurate than np.linalg.inv(A) @ b. Never compute the matrix inverse to solve a linear system - it is numerically unstable. solve() uses LU decomposition with partial pivoting: the standard in scientific computing since the 1960s.

Function	What it does	Complexity
np.linalg.solve(A, b)	Solves Ax=b	O(n^3)
np.linalg.eig(A)	Eigenvalues and eigenvectors	O(n^3)
np.linalg.svd(A)	Singular Value Decomposition	O(min(m,n)mn)
np.linalg.inv(A)	Matrix inverse (avoid!)	O(n^3)
np.linalg.det(A)	Determinant	O(n^3)
np.linalg.norm(x)	Vector or matrix norm	O(n)

**SVD** (Singular Value Decomposition) is one of the most versatile tools in ML. PCA = SVD on centered data. Netflix recommender systems (collaborative filtering) - matrix factorization via SVD. Image compression (JPEG-like) - truncated SVD. Moore-Penrose pseudoinverse - through SVD. One function, four killer applications.

Why is np.linalg.solve(A, b) preferable to np.linalg.inv(A) @ b?

Vectorization

PyTorch trains ResNet-50 on ImageNet in 76 hours on a single V100 GPU. The same training in pure Python with loops would take years. The difference is not the GPU. The difference is **vectorization**: replacing Python loops with operations on entire arrays. NumPy executes them in compiled C code with SIMD instructions (SSE4, AVX-512). Typical speedup: 50-300x, up to 500x for matrix multiplication.

**Why vectorization is so fast:** 1. No per-iteration Python interpreter overhead. 2. Data in an ndarray is stored contiguously - the CPU loads it into L1/L2 cache in a single operation. 3. AVX-512 SIMD instructions process 8 float64 values per CPU clock cycle.

Loop pattern	Vectorized equivalent	Speedup
sum(a[i]*b[i])	np.dot(a, b)	~300x
for + if > threshold	a[a > threshold]	~100x
triple loop C=A*B	A @ B	~500x
for + math.sin(x)	np.sin(arr)	~100x

Rule of thumb: if there is a `for` loop in NumPy code, there is almost always a vectorized alternative. The exception: recurrences where step N depends on step N-1. For those cases, use Numba (@jit) or Cython. In ML codebases, such cases are rare.

Under the hood, NumPy calls **BLAS** and **LAPACK** - libraries optimized over decades for Intel MKL, OpenBLAS, and ATLAS. Writing `np.dot(a, b)` passes pointers to a C function. All the work happens in Fortran/C assembly.

Python is too slow for scientific computing

A Python loop is indeed slow, but NumPy/SciPy call optimized C/Fortran libraries (BLAS, LAPACK). Vectorized Python code runs at speeds comparable to C.

NumPy is a thin Python wrapper around compiled code. np.dot(a, b) passes pointers to a C function that handles everything with SIMD. Python accounts for less than 1% of execution time.

Which operation does NOT lend itself to direct vectorization in NumPy?

Key ideas

**ndarray** - homogeneous fixed-type array in contiguous memory; the foundation of PyTorch, TensorFlow, pandas
**Broadcasting**: compare dimensions right to left, compatible if equal or one equals 1; zero-copy in memory
**np.linalg.solve** via LU decomposition is numerically more stable than inv(); SVD = PCA, recommendations, pseudoinverse
**Vectorization** delivers 100-500x over Python loops: SIMD AVX-512, BLAS/LAPACK under the hood
**View vs copy**: slice returns a view - modifying it changes the original; use .copy() for independence
**Vectorization exception**: recurrences x[i] = f(x[i-1]) require Numba or Cython

Вопросы для размышления

Why does NumPy require all elements of an array to share the same type? What advantages does this bring?
In what situations can broadcasting produce unexpected results? Construct an example.
If vectorization is so fast, why don't all Python programs use NumPy?

Связанные уроки

sci-01 — Introduction to scientific computing - context for NumPy
sci-03 — np.linalg.solve is the entry point; next come LU, Cholesky, iterative methods
la-21-lu — LU decomposition underlies np.linalg.solve
la-15-svd — NumPy SVD is an instance of the full theoretical SVD
ml-03-math-foundations — NumPy is the foundation of the math layer across the entire ML stack
calc-17-multivariable — Broadcasting generalizes operations over multi-dimensional arrays like multivariable calculus
la-04-matrix-ops

Scientific Computing

NumPy and Linear Algebra

Цели урока

Understand ndarray as a homogeneous contiguous-memory array and why it beats Python lists
Apply broadcasting correctly and predict the shape of the result
Use np.linalg for solving systems, SVD, eigenvalues; know when NOT to use inv()
Replace Python loops with vectorized operations and understand the speedup through BLAS/LAPACK

Предварительные знания

Introduction to Scientific Computing

**PyTorch and TensorFlow** - Tensor = ndarray + autograd; all of deep learning is built on NumPy broadcasting and matmul semantics
**ECMWF weather forecast** - 100 TB daily, NumPy/SciPy for post-processing; 10-day forecast accuracy doubled over 20 years
**CERN ROOT framework** - 15 PB/year; Python bindings via NumPy arrays for particle physics analysis
**NumPy SVD in recommendations** - Netflix Prize 2009 was won through matrix factorization; np.linalg.svd in 3 lines
**Scipy LAPACK** - the same Fortran code used in aerospace engineering and CFD simulations since the 1980s, now in pip install scipy

Travis Oliphant and the birth of NumPy

ndarray - the Core of NumPy

John von Neumann and Herman Goldstine, 1947 - first numerical computing

Creation	Function	Example
Zeros	np.zeros(shape)	np.zeros((3, 3))
Ones	np.ones(shape)	np.ones((2, 5))
Identity matrix	np.eye(n)	np.eye(4)
Random	np.random.randn(shape)	np.random.randn(3, 3)
Range	np.linspace(a, b, n)	np.linspace(0, 1, 50)

What is the result of running: row = M[0, :]; row[0] = 42?

Broadcasting

What is the result of np.ones((3, 4)) + np.ones((3, 1))?

numpy.linalg

Function	What it does	Complexity
np.linalg.solve(A, b)	Solves Ax=b	O(n^3)
np.linalg.eig(A)	Eigenvalues and eigenvectors	O(n^3)
np.linalg.svd(A)	Singular Value Decomposition	O(min(m,n)mn)
np.linalg.inv(A)	Matrix inverse (avoid!)	O(n^3)
np.linalg.det(A)	Determinant	O(n^3)
np.linalg.norm(x)	Vector or matrix norm	O(n)

Why is np.linalg.solve(A, b) preferable to np.linalg.inv(A) @ b?

Vectorization

Loop pattern	Vectorized equivalent	Speedup
sum(a[i]*b[i])	np.dot(a, b)	~300x
for + if > threshold	a[a > threshold]	~100x
triple loop C=A*B	A @ B	~500x
for + math.sin(x)	np.sin(arr)	~100x

Python is too slow for scientific computing

A Python loop is indeed slow, but NumPy/SciPy call optimized C/Fortran libraries (BLAS, LAPACK). Vectorized Python code runs at speeds comparable to C.

NumPy is a thin Python wrapper around compiled code. np.dot(a, b) passes pointers to a C function that handles everything with SIMD. Python accounts for less than 1% of execution time.

Which operation does NOT lend itself to direct vectorization in NumPy?

Key ideas

**ndarray** - homogeneous fixed-type array in contiguous memory; the foundation of PyTorch, TensorFlow, pandas
**Broadcasting**: compare dimensions right to left, compatible if equal or one equals 1; zero-copy in memory
**np.linalg.solve** via LU decomposition is numerically more stable than inv(); SVD = PCA, recommendations, pseudoinverse
**Vectorization** delivers 100-500x over Python loops: SIMD AVX-512, BLAS/LAPACK under the hood
**View vs copy**: slice returns a view - modifying it changes the original; use .copy() for independence
**Vectorization exception**: recurrences x[i] = f(x[i-1]) require Numba or Cython

Вопросы для размышления

Why does NumPy require all elements of an array to share the same type? What advantages does this bring?
In what situations can broadcasting produce unexpected results? Construct an example.
If vectorization is so fast, why don't all Python programs use NumPy?

Связанные уроки

sci-01 — Introduction to scientific computing - context for NumPy
sci-03 — np.linalg.solve is the entry point; next come LU, Cholesky, iterative methods
la-21-lu — LU decomposition underlies np.linalg.solve
la-15-svd — NumPy SVD is an instance of the full theoretical SVD
ml-03-math-foundations — NumPy is the foundation of the math layer across the entire ML stack
calc-17-multivariable — Broadcasting generalizes operations over multi-dimensional arrays like multivariable calculus
la-04-matrix-ops

NumPy and Linear Algebra

Цели урока

Предварительные знания

Travis Oliphant and the birth of NumPy

ndarray - the Core of NumPy

John von Neumann and Herman Goldstine, 1947 - first numerical computing

Broadcasting

numpy.linalg

Vectorization

Key ideas

Related topics

Вопросы для размышления

Связанные уроки

NumPy and Linear Algebra

Цели урока

Предварительные знания

Travis Oliphant and the birth of NumPy

ndarray - the Core of NumPy

John von Neumann and Herman Goldstine, 1947 - first numerical computing

Broadcasting

numpy.linalg

Vectorization

Key ideas

Related topics

Вопросы для размышления

Связанные уроки