Scientific Computing at Scale

In 2020 a genomic analysis ran on a million AWS vCPUs for 12 minutes, after which the cluster was torn down. Total cost: $5000. An equivalent physical HPC cluster runs about $25 million and amortizes over 5 years. This is more than a cost saving: compute became an operating expense instead of a capital asset. The price is management complexity: reproducibility, workflows, and data at petabyte scale.

**Fugaku (Japan, 2020)**: 442 PFLOPS across 158,976 ARM nodes - the largest simulation of COVID aerosol dispersion in a week of compute.
**Nextflow + nf-core**: 80% of bioinformatics pipelines are built on this stack; 100+ ready-made workflows from the global community, executable on any backend.
**CERN Data Tier**: 1 PB/s from the LHC filtered by triggers down to 1 GB/s, stored across 11 Tier-1 national centers, accessible via Globus with a DOI per dataset.

Cloud HPC

In 2020, Fugaku set a record: 442 PFLOPS across 158,976 ARM nodes. That same year, an AWS ParallelCluster Numerate job ran a genomic analysis on 1 million vCPUs in 12 minutes, then terminated the instances. Total cost: $5000. An on-premises cluster with equivalent peak compute costs about $25 million and is amortized over 5 years. Cloud HPC shifted the economic model: compute became an operating expense rather than a capital expenditure. The price for that flexibility is complexity: getting MPI to work over virtualized networks, surviving spot interruptions, and avoiding the runaway test cluster that bankrupts the lab over a weekend.
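The comparison above is simple arithmetic, sketched here using only the two figures from the text. It deliberately ignores power, staff, storage, and data egress, so treat the result as a rough order of magnitude rather than a procurement model.

```python
def break_even_runs(cluster_capex_usd: float, cost_per_run_usd: float) -> float:
    """Number of cloud burst runs whose total cost equals the cluster price."""
    return cluster_capex_usd / cost_per_run_usd


CLUSTER_CAPEX_USD = 25_000_000  # on-premises cluster, figure from the text
AMORTIZATION_YEARS = 5          # amortization period, figure from the text
COST_PER_RUN_USD = 5_000        # one million-vCPU burst, figure from the text

total_runs = break_even_runs(CLUSTER_CAPEX_USD, COST_PER_RUN_USD)
runs_per_year = total_runs / AMORTIZATION_YEARS

print(f"break-even: {total_runs:.0f} runs total, {runs_per_year:.0f} per year")
```

Under these assumptions a lab would need on the order of a thousand such bursts per year, roughly three per day, before owning the hardware wins on purchase price alone.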

AWS EFA (Elastic Fabric Adapter) provides RDMA-like semantics over Ethernet with ~15 us latency: far better than TCP, though still above the ~1-2 us typical of InfiniBand. Slurm remains the dominant scheduler in the cloud as well (via AWS ParallelCluster, Azure CycleCloud). Spot/preemptible instances cut cost by 60-90% but can be reclaimed at short notice, so long jobs need checkpoint-restart; otherwise a job dies halfway and loses all its progress.
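The checkpoint-restart requirement can be illustrated with a minimal sketch. Everything here is hypothetical: the toy "simulation" just accumulates a number, the `interrupt_at` parameter stands in for a spot reclaim, and the checkpoint is a local JSON file, whereas a real MD code would use its own restart files on shared or object storage.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "energy": 0.0}


def run(path, total_steps, checkpoint_every, interrupt_at=None):
    """Advance a toy simulation; interrupt_at simulates a spot reclaim."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            return state  # instance reclaimed; work since last checkpoint is lost
        state["step"] += 1
        state["energy"] += 0.5  # stand-in for one real time step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run("ckpt.json", 100, 10, interrupt_at=37)` and then `run("ckpt.json", 100, 10)` resumes from step 30, so only the seven steps since the last checkpoint are recomputed. The checkpoint interval trades write overhead against lost work per interruption.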

Cloud networking is not InfiniBand. Allreduce on 1024 nodes over 100 Gbps Ethernet with EFA shows ~3x slowdown vs a physical IB cluster. For bandwidth-bound codes this is critical; for compute-bound workloads (MD, FEM) it is acceptable.

A team is running a 96-hour MD simulation on 256 spot instances. What checkpoint-restart strategy is correct?

Reproducibility

In 2016, Nature surveyed 1576 scientists: more than 70% had failed to reproduce other people's experiments, and more than half had failed to reproduce their own. In computational science the causes are concrete: different library versions, different float summation orders, different random seeds, and different CPU architectures all produce different numerical results. A real case: the PLINK statistics package returned different p-values for the same genomic data on Intel vs AMD because of different SVD implementations. Reproducibility is not idealism: ACM introduced Artifacts Available / Artifacts Evaluated / Results Reproduced badges, and Nature, Science, and PNAS now require code and data with publication.

Levels of reproducibility (Claerbout, ACM): Reproducible - the original data and code yield the same results, ideally when run by a different team. Replicable - an independent team reaches the same conclusion with new data or an independent implementation of the same method. Floating-point summation is order-dependent, and the order changes with the number of MPI processes. Mitigations: Kahan summation, fixed PRNG seeds, version pinning via containers.
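The order dependence of summation is easy to demonstrate. In the sketch below, `partitioned_sum` is a hypothetical stand-in for an MPI reduction: it splits the data into per-rank partial sums and then adds the partials. `kahan_sum` is the standard compensated summation algorithm. The input is contrived so that the effect is visible.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point summation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: carries the rounding error of each addition."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # apply the error left over so far
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total


def partitioned_sum(values, ranks):
    """Emulate a reduction: one partial sum per rank, then sum the partials."""
    chunk = -(-len(values) // ranks)  # ceiling division
    partials = [naive_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return naive_sum(partials)


# One large value followed by many values too small to change it one at a time.
values = [1.0] + [1e-16] * 100_000

print(partitioned_sum(values, 1))  # small terms are absorbed one by one
print(partitioned_sum(values, 2))  # second rank accumulates them first
print(kahan_sum(values), math.fsum(values))
```

With one "rank" every small term is rounded away, while with two the second rank's partial sum survives, so the totals differ. Kahan summation lands close to the correctly rounded `math.fsum` result regardless of the large leading term.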

A Conda environment with pinned package versions and channels, plus a Singularity/Apptainer container archived under a Zenodo DOI, is the current de facto standard. Nix/Guix aim for bit-identical builds, but adoption remains low.

An MD simulation returns different results from the same input on 64 vs 128 nodes. What is the most likely cause?

Scientific Workflows

An RNA-Seq genomic pipeline has 12 steps from FASTQ files to a gene expression table. Each step is a bioinformatics tool with dozens of parameters. Running it by hand through bash scripts breaks at step 5, loses intermediate results, and does not scale. Workflow managers (Nextflow, Snakemake, Cromwell) solve this: a declarative DAG of tasks with automatic parallelization, retry, and resume. Nextflow became a de facto standard in bioinformatics within two years, and the entire nf-core collection is written in it. The key idea is workflow as code, versioned in Git, executable on any backend (local, SLURM, AWS Batch, Kubernetes).

A task DAG: each task is a function with declared inputs and outputs. The workflow manager builds the dependency graph and parallelizes independent tasks. Resume: if step 7 of 12 fails, the rerun begins at step 7, not at the beginning. WDL (Workflow Description Language) originated at the Broad Institute; CWL (Common Workflow Language) is a community-maintained open standard.
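The DAG-plus-resume idea can be sketched with the standard-library `graphlib` module. This is not how Nextflow or Snakemake are implemented: the task names, the `done` set, and the simulated out-of-memory failure are all hypothetical, and real managers decide what to skip from recorded inputs and outputs rather than an in-memory set.

```python
from graphlib import TopologicalSorter


def run_workflow(dag, actions, done):
    """Run tasks in dependency order, skipping those already in `done`."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        if task in done:
            continue          # resume: finished work is not repeated
        actions[task]()       # may raise; earlier tasks stay recorded in `done`
        done.add(task)
        executed.append(task)
    return executed


# Each task maps to the set of tasks it depends on.
dag = {"qc": set(), "trim": {"qc"}, "align": {"trim"}, "count": {"align"}}

attempts = {"align": 0}


def ok():
    pass


def flaky_align():
    """Fails on the first attempt to mimic an out-of-memory kill."""
    attempts["align"] += 1
    if attempts["align"] == 1:
        raise MemoryError("simulated OOM")


actions = {"qc": ok, "trim": ok, "align": flaky_align, "count": ok}
done = set()

try:
    run_workflow(dag, actions, done)
except MemoryError:
    pass  # first run stops at the alignment step

second_run = run_workflow(dag, actions, done)
print(second_run)
```

The second call picks up at the failed step and runs only the remaining tasks.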

Nextflow + nf-core ships community-maintained pipelines: rnaseq, sarek (germline and somatic variant calling), atacseq, and more. There are 100+ pipelines, each tested on 10+ clusters, which addresses the problem of pipeline portability across labs.

A Snakemake/Nextflow pipeline failed at the alignment step due to OOM on one of 200 samples. What happens on rerun?

Scientific Data Management

The LHC at CERN generates 1 PB per second. Storing all of it is impossible - triggers select 1 GB/s of interesting events. The Square Kilometre Array will produce 1 TB/s once online. The CESM2 climate model at 0.25 degree resolution: 50 TB per simulation. Data management is not a disk question - it is an architecture question: where to store, how to index billions of files, how to provide FAIR access (Findable, Accessible, Interoperable, Reusable), how to ensure usability in 30 years when formats are obsolete. The standard is object storage (S3, Ceph) plus a metadata catalog (Globus, iRODS, DataCite) plus DOIs for persistent identification.

FAIR principles: every dataset gets a persistent identifier (DOI), machine-readable metadata, and open formats (NetCDF, HDF5, Parquet, Zarr). Cost-latency tradeoff: hot tier (SSD, $200/TB/year), warm (HDD, $30/TB/year), cold (S3 Glacier Deep Archive, about $1/TB/month or $12/TB/year, up to 12 hours to access). Zarr is the cloud-native format for scientific data: chunked storage in which each chunk can be read independently.
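Why chunking matters for partial reads comes down to index arithmetic, sketched below. This is not the Zarr API; the array shape (480 monthly steps on a 0.25 degree grid of 720 x 1440 cells), the chunk shape, and the region are hypothetical values chosen for illustration.

```python
def chunks_for_slice(start, stop, chunk_size):
    """Indices of the chunks along one axis that overlap [start, stop)."""
    if stop <= start:
        return range(0)
    return range(start // chunk_size, (stop - 1) // chunk_size + 1)


def chunks_for_region(region, chunk_shape):
    """Count the chunks that a multidimensional [start, stop) region touches."""
    count = 1
    for (start, stop), size in zip(region, chunk_shape):
        count *= len(chunks_for_slice(start, stop, size))
    return count


# Hypothetical layout: (time, lat, lon) with chunks of 12 x 90 x 90 cells.
ARRAY_SHAPE = (480, 720, 1440)
CHUNK_SHAPE = (12, 90, 90)

whole_array = tuple((0, n) for n in ARRAY_SHAPE)
region = ((0, 12), (500, 640), (0, 160))  # one year, a regional lat/lon box

total_chunks = chunks_for_region(whole_array, CHUNK_SHAPE)
needed_chunks = chunks_for_region(region, CHUNK_SHAPE)
print(f"{needed_chunks} of {total_chunks} chunks need to be fetched")
```

A reader that understands the chunk grid requests only the objects overlapping the region, which is why chunk shape should follow the dominant access pattern (time series at a point versus maps at one time).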

Misconception: reproducibility = identical numerical results on any hardware.

Correction: reproducibility = the same scientific conclusion under controlled variation. Bit-identity is unattainable in the general case; the right criterion is that results stay within expected numerical uncertainty.

Float operations are not associative, GPUs and CPUs produce different roundoff, and different BLAS implementations (MKL vs OpenBLAS) differ in the last bits. The goal is scientific reproducibility (robust conclusions), not bit-identity (often impossible).
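In practice, "within expected numerical uncertainty" usually becomes a tolerance check rather than an equality check. A minimal sketch using `math.isclose` follows; the helper name `agree_within` and the tolerance of 1e-12 are assumptions for illustration, and a real study would derive the tolerance from the error analysis of the method.

```python
import math


def agree_within(a, b, rel_tol, abs_tol=0.0):
    """True when two result vectors agree element-wise within tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


# Two runs that differ only in the last bits, as after a BLAS or hardware change.
run_a = [0.1 + 0.2, 1.0e6]
run_b = [0.3, 1.0e6 + 1.0e-9]

print(run_a == run_b)                            # bitwise comparison
print(agree_within(run_a, run_b, rel_tol=1e-12))  # tolerance comparison
```

The bitwise comparison reports a mismatch even though the two runs agree to about fifteen significant digits, while the tolerance check accepts them and still rejects a genuine discrepancy such as 0.3 versus 0.31.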

A climatologist needs one region (Europe, Januaries 1980-2020) from a 50 TB global CESM2 dataset on S3. Which storage format allows loading only the needed slice?

Key Ideas

**Cloud HPC** provides elasticity and spot pricing, but requires checkpoint-restart and acceptance that virtualized networks are ~3x slower than InfiniBand for bandwidth-bound codes.
**Reproducibility** does not mean bit-identity: float addition is not associative, and changing MPI rank counts reorders summation. The practical goal is robust scientific conclusions via containers + lock files + fixed seeds.
**Workflow managers** (Nextflow, Snakemake) turn pipelines into declarative DAGs: automatic parallelization, resume after failure, retry with resource scaling.
**FAIR data management**: Zarr/HDF5 for multidimensional gridded data, object storage with cold tiers, DOIs per dataset, metadata in standard schemas (DataCite).

Questions for Reflection

If spot instances save 80% but risk interruption, at what ratio of job duration to instance MTBF does checkpoint-restart stop being worth the overhead?
Containers freeze code and dependencies, but not raw inputs from medical devices or physical detectors. How should systematic drift in source data be controlled?
FAIR demands open data, yet genomic and medical datasets are legally protected. How should open science and privacy regulations (GDPR, HIPAA) be reconciled?

Related Lessons

sci-12 — Foundations for scaling scientific computing
par-14 — Spark/Dask for scientific data is analogous to MapReduce
opt-14 — Distributed optimization and HPC solve the same problems
dl-12 — Distributed training on supercomputers is scientific computing territory
alg-01-big-o — Complexity analysis is critical for HPC algorithms
par-01