Scientific Computing

NumPy and Linear Algebra

Цели урока

  • Understand ndarray as a homogeneous contiguous-memory array and why it beats Python lists
  • Apply broadcasting correctly and predict the shape of the result
  • Use np.linalg for solving systems, SVD, eigenvalues; know when NOT to use inv()
  • Replace Python loops with vectorized operations and understand the speedup through BLAS/LAPACK

Предварительные знания

  • Introduction to Scientific Computing

The ECMWF climate model processes 100 TB of data every day for weather forecasting. NumPy is the foundation of PyTorch, TensorFlow, and pandas. scipy.linalg runs Fortran LAPACK under the hood. CERN's Large Hadron Collider: 15 petabytes per year through the ROOT framework. A 10,000 x 10,000 matrix? One line of NumPy - seconds. Three nested Python loops - hours.

  • **PyTorch and TensorFlow** - Tensor = ndarray + autograd; all of deep learning is built on NumPy broadcasting and matmul semantics
  • **ECMWF weather forecast** - 100 TB daily, NumPy/SciPy for post-processing; 10-day forecast accuracy doubled over 20 years
  • **CERN ROOT framework** - 15 PB/year; Python bindings via NumPy arrays for particle physics analysis
  • **NumPy SVD in recommendations** - Netflix Prize 2009 was won through matrix factorization; np.linalg.svd in 3 lines
  • **Scipy LAPACK** - the same Fortran code used in aerospace engineering and CFD simulations since the 1980s, now in pip install scipy

Travis Oliphant and the birth of NumPy

In 2005 Travis Oliphant merged two competing projects - Numeric (1995, Jim Hugunin) and Numarray (2001, STScI) - into a single package called NumPy. Under the hood NumPy calls into BLAS (Basic Linear Algebra Subprograms, 1979) and LAPACK (1992, Anderson et al.) - Fortran libraries that are over forty years old. That is why np.dot runs hundreds of times faster than a Python loop: SIMD instructions and cache-optimized GEMM from the 80s. In 2020 the paper "Array programming with NumPy" in Nature officially recognized NumPy as the foundation of modern data science.

ndarray - the Core of NumPy

John von Neumann and Herman Goldstine, 1947 - first numerical computing

In 1947, John von Neumann and Herman Goldstine ran a matrix inversion on ENIAC - the first serious numerical computation on a modern computer. ENIAC had no float64, no broadcasting, no vectorization. NumPy (2005) carries the same principle: a homogeneous array of numbers in memory. But orders of magnitude faster. PyTorch, TensorFlow, pandas - all of them are ndarray under the hood.

The ECMWF climate model processes 100 TB of data every day for weather forecasting. NumPy is the foundation of PyTorch, TensorFlow, and pandas. CERN's Large Hadron Collider produces 15 petabytes per year through the ROOT framework, with most analysis running in NumPy/SciPy. Python lists cannot handle this. **ndarray** - an N-dimensional array of a fixed type in a contiguous memory block - delivers speeds hundreds of times faster than lists.

**shape** - a tuple of dimensions (rows, columns, ...). **dtype** - the element type (float64, int32, complex128). All elements share one type - this enables compact storage and direct access without Python's per-object overhead. A PyTorch Tensor and a NumPy ndarray share the same memory via `torch.from_numpy()` - zero-copy interop.

Slicing in NumPy returns a **view** - not a copy, but a window into the same memory block. Modifying the view modifies the original. For an independent copy: `M[0, :].copy()`. This is not a bug - it is an architectural decision for zero-copy ML pipelines.

CreationFunctionExample
Zerosnp.zeros(shape)np.zeros((3, 3))
Onesnp.ones(shape)np.ones((2, 5))
Identity matrixnp.eye(n)np.eye(4)
Randomnp.random.randn(shape)np.random.randn(3, 3)
Rangenp.linspace(a, b, n)np.linspace(0, 1, 50)

What is the result of running: row = M[0, :]; row[0] = 42?

Broadcasting

Normalizing 1000 samples across 5 features. Subtract the mean, divide by the standard deviation. Naive approach: loop over 1000 rows. NumPy approach: one line. A matrix of shape (1000, 5) minus a vector of shape (5,) - this is **broadcasting**: NumPy automatically stretches the vector to match the matrix without creating a copy in memory. Zero overhead.

**Broadcasting rules:** 1. Dimensions are compared right to left. 2. Two dimensions are compatible if they are equal OR one of them is 1. 3. Missing dimensions are treated as 1. If the rules are not satisfied, an error is raised.

Broadcasting is not just convenience. NumPy does not create an enlarged copy in memory - the operation executes on the fly. PyTorch uses the same mechanism: per-sample loss, batch normalization statistics, gradient accumulation - all through broadcasting. Mastering broadcasting means mastering half of PyTorch.

What is the result of np.ones((3, 4)) + np.ones((3, 1))?

numpy.linalg

scipy.linalg runs Fortran LAPACK under the hood. The same LAPACK written in the 1970s-80s, optimized for every CPU architecture. NumPy calls the same library. Solving systems, eigenvalues, SVD - none of this is Python. It is decades of numerical analysis in compiled code.

**np.linalg.solve(A, b)** - O(n^3), but much faster and more accurate than np.linalg.inv(A) @ b. Never compute the matrix inverse to solve a linear system - it is numerically unstable. solve() uses LU decomposition with partial pivoting: the standard in scientific computing since the 1960s.

FunctionWhat it doesComplexity
np.linalg.solve(A, b)Solves Ax=bO(n^3)
np.linalg.eig(A)Eigenvalues and eigenvectorsO(n^3)
np.linalg.svd(A)Singular Value DecompositionO(min(m,n)*m*n)
np.linalg.inv(A)Matrix inverse (avoid!)O(n^3)
np.linalg.det(A)DeterminantO(n^3)
np.linalg.norm(x)Vector or matrix normO(n)

**SVD** (Singular Value Decomposition) is one of the most versatile tools in ML. PCA = SVD on centered data. Netflix recommender systems (collaborative filtering) - matrix factorization via SVD. Image compression (JPEG-like) - truncated SVD. Moore-Penrose pseudoinverse - through SVD. One function, four killer applications.

Why is np.linalg.solve(A, b) preferable to np.linalg.inv(A) @ b?

Vectorization

PyTorch trains ResNet-50 on ImageNet in 76 hours on a single V100 GPU. The same training in pure Python with loops would take years. The difference is not the GPU. The difference is **vectorization**: replacing Python loops with operations on entire arrays. NumPy executes them in compiled C code with SIMD instructions (SSE4, AVX-512). Typical speedup: 50-300x, up to 500x for matrix multiplication.

**Why vectorization is so fast:** 1. No per-iteration Python interpreter overhead. 2. Data in an ndarray is stored contiguously - the CPU loads it into L1/L2 cache in a single operation. 3. AVX-512 SIMD instructions process 8 float64 values per CPU clock cycle.

Loop patternVectorized equivalentSpeedup
sum(a[i]*b[i])np.dot(a, b)~300x
for + if > thresholda[a > threshold]~100x
triple loop C=A*BA @ B~500x
for + math.sin(x)np.sin(arr)~100x

Rule of thumb: if there is a `for` loop in NumPy code, there is almost always a vectorized alternative. The exception: recurrences where step N depends on step N-1. For those cases, use Numba (@jit) or Cython. In ML codebases, such cases are rare.

Under the hood, NumPy calls **BLAS** and **LAPACK** - libraries optimized over decades for Intel MKL, OpenBLAS, and ATLAS. Writing `np.dot(a, b)` passes pointers to a C function. All the work happens in Fortran/C assembly.

Python is too slow for scientific computing

A Python loop is indeed slow, but NumPy/SciPy call optimized C/Fortran libraries (BLAS, LAPACK). Vectorized Python code runs at speeds comparable to C.

NumPy is a thin Python wrapper around compiled code. np.dot(a, b) passes pointers to a C function that handles everything with SIMD. Python accounts for less than 1% of execution time.

Which operation does NOT lend itself to direct vectorization in NumPy?

Key ideas

  • **ndarray** - homogeneous fixed-type array in contiguous memory; the foundation of PyTorch, TensorFlow, pandas
  • **Broadcasting**: compare dimensions right to left, compatible if equal or one equals 1; zero-copy in memory
  • **np.linalg.solve** via LU decomposition is numerically more stable than inv(); SVD = PCA, recommendations, pseudoinverse
  • **Vectorization** delivers 100-500x over Python loops: SIMD AVX-512, BLAS/LAPACK under the hood
  • **View vs copy**: slice returns a view - modifying it changes the original; use .copy() for independence
  • **Vectorization exception**: recurrences x[i] = f(x[i-1]) require Numba or Cython

Related topics

NumPy is the foundation of the entire Python scientific stack:

  • Systems of Linear Equations — np.linalg.solve is the entry point; next come LU, Cholesky, and iterative methods
  • Introduction to Scientific Computing — NumPy is the primary tool for implementing models and simulations
  • LU Decomposition — The mathematics underlying np.linalg.solve

Вопросы для размышления

  • Why does NumPy require all elements of an array to share the same type? What advantages does this bring?
  • In what situations can broadcasting produce unexpected results? Construct an example.
  • If vectorization is so fast, why don't all Python programs use NumPy?

Связанные уроки

  • sci-01 — Introduction to scientific computing - context for NumPy
  • sci-03 — np.linalg.solve is the entry point; next come LU, Cholesky, iterative methods
  • la-21-lu — LU decomposition underlies np.linalg.solve
  • la-15-svd — NumPy SVD is an instance of the full theoretical SVD
  • ml-03-math-foundations — NumPy is the foundation of the math layer across the entire ML stack
  • calc-17-multivariable — Broadcasting generalizes operations over multi-dimensional arrays like multivariable calculus
  • la-04-matrix-ops
NumPy and Linear Algebra

0

1

Sign In