Deep Learning
Frameworks: PyTorch vs TensorFlow
2019. Google openly concedes PyTorch is winning in research and ships TensorFlow 2.0 with a completely redesigned API. Tesla Autopilot migrates from TF to PyTorch. Hugging Face builds its entire hub on PyTorch. The outcome looks decided. But TFLite runs on 3 billion mobile devices, and TF Serving anchors production at Google - nobody is moving that.
- **Meta (Facebook):** PyTorch powers the recommendation system handling one trillion inferences per day
- **Google:** TensorFlow runs in Search, Gmail, Google Photos, YouTube - billions of requests daily
- **Tesla Autopilot:** started on TensorFlow, migrated to PyTorch - an instructive industry migration story
Soumith Chintala and the fight for the dynamic graph
In 2016 Soumith Chintala at FAIR wrote PyTorch in a few weeks, pulling ideas from Torch (a Lua library) and Chainer (a Japanese framework with define-by-run). The whole FAIR team agreed: researchers do not want Sessions and placeholders - they want to write Python. The intuition was right. Within two years PyTorch took the top spot in academic publications. The classic ML pattern: the most ergonomic tool wins, not the most capable one.
Prerequisites
PyTorch: define-by-run
**January 2017. Facebook AI Research publicly releases PyTorch** - a framework that flipped the script on what neural network code should look like. Instead of first declaring a static computation graph through a separate graph-building API and only then executing it (as TensorFlow 1.x demanded), PyTorch let neural networks be written as plain Python. The approach is called **define-by-run**: the computational graph is built on the fly during every forward pass.
**PyTorch's philosophy** is 'Python first'. Standard Python constructs (if, for, print) work inside the model. Breakpoints, pdb, intermediate value prints - all of it just works. For researchers this was a revolution after the 'black box' of TensorFlow 1.x. By 2023 over 80% of papers at NeurIPS and ICML ship in PyTorch.
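To make define-by-run concrete, here is a minimal sketch in plain Python. It is a toy scalar autograd, not PyTorch's actual implementation, and the names `Value` and `forward` are hypothetical. Each arithmetic operation records its parents as it executes, so an ordinary `if` decides what the graph looks like on every call.

```python
# Toy define-by-run autograd (illustrative only, not how PyTorch is implemented).

class Value:
    """A scalar that remembers which values it was computed from."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # nodes this value was computed from
        self.grad_fns = grad_fns    # local derivative rule for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Collect the recorded graph in topological order, then walk it in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, grad_fn in zip(node.parents, node.grad_fns):
                parent.grad += grad_fn(node.grad)


def forward(x, w):
    # Plain Python control flow: the graph differs depending on the data.
    y = x * w
    if y.data > 0:
        y = y * y
    return y


x, w = Value(3.0), Value(2.0)
out = forward(x, w)          # takes the squaring branch: (3 * 2) ** 2 = 36.0
out.backward()
print(out.data, x.grad, w.grad)   # 36.0 24.0 36.0
```

With a negative input such as `Value(-3.0)` the squaring branch is skipped and a different, smaller graph is recorded. No separate graph declaration step is involved in either case.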
**PyTorch ecosystem:** torchvision (computer vision), torchaudio (audio), torchtext (NLP), PyTorch Lightning (high-level wrapper), Hugging Face Transformers (pretrained models). Meta runs PyTorch in a recommendation system serving one trillion inferences per day.
**model.train() and model.eval()** - never skip the switch! Train mode: dropout randomly zeros neurons, batch norm uses current batch statistics. Eval mode: dropout off, batch norm uses accumulated statistics. Skip the switch and validation results turn flaky.
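The effect of the mode switch can be sketched with a toy dropout layer in plain Python. `ToyDropout` is a hypothetical class that mimics only the dropout half of the story (batch norm is left out) and assumes inverted dropout, where surviving activations are scaled by `1 / (1 - p)` during training.

```python
import random


class ToyDropout:
    """Toy stand-in for a dropout layer with a train/eval switch."""

    def __init__(self, p=0.5, seed=0):
        self.p = p                      # probability of zeroing an activation
        self.training = True            # layers start in train mode
        self.rng = random.Random(seed)

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self

    def __call__(self, xs):
        if not self.training:
            # Eval mode: dropout is off, inputs pass through unchanged.
            return list(xs)
        keep = 1.0 - self.p
        # Train mode: zero each value with probability p, rescale the rest.
        return [x / keep if self.rng.random() < keep else 0.0 for x in xs]


layer = ToyDropout(p=0.5)
xs = [1.0, 2.0, 3.0, 4.0]
print(layer.train()(xs))   # random zeros, survivors doubled
print(layer.eval()(xs))    # always [1.0, 2.0, 3.0, 4.0]
```

Validating while the layer is still in train mode means every evaluation pass sees a different random mask, which is where the flakiness comes from.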
What does the define-by-run approach in PyTorch mean?
TensorFlow: from graph to Keras
**TensorFlow shipped in November 2015 out of Google Brain.** Version 1.x took the opposite philosophy from PyTorch: first describe the whole computational graph through a graph-building API (placeholders, ops, variables), then run it inside a Session. This approach - **define-and-run** - opened the door to whole-graph optimization but made debugging brutal, because errors surface at run time, far from the Python line that created the faulty node.
**TensorFlow 2.0 (2019) tore up the playbook.** Google admitted PyTorch's ergonomics had won and shipped three big changes: eager execution by default, Keras as the primary high-level API, and a cleaned-up API surface with redundant and deprecated endpoints removed. Tesla Autopilot migrated from TF to PyTorch - a cautionary tale for the industry.
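The define-and-run style can be sketched in a few lines of plain Python. This is a toy, not the TensorFlow 1.x API: `Node`, `placeholder`, and `run` are hypothetical names standing in for graph ops, placeholders, and `Session.run`. Building the expression computes nothing. Values appear only when the finished graph is executed with concrete inputs.

```python
# Toy define-and-run graph (illustrative only, not the TensorFlow 1.x API).

class Node:
    """A symbolic operation: it describes a computation, it does not perform it."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def placeholder(name):
    # A named slot that is filled in only at run time.
    return Node("placeholder", name=name)


def run(node, feed):
    """Evaluate a finished graph with concrete values, like a tiny Session."""
    if node.op == "placeholder":
        return feed[node.name]
    a, b = (run(inp, feed) for inp in node.inputs)
    return a + b if node.op == "add" else a * b


# Step 1: define the graph. No arithmetic happens here.
x = placeholder("x")
y = placeholder("y")
z = x * y + x
print(type(z).__name__)            # Node

# Step 2: run it with real data.
print(run(z, {"x": 3, "y": 4}))    # 15
```

Printing `z` after step 1 shows a graph node rather than a number, which is exactly why `print`-style debugging did not work in this model.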
| Component | Purpose | PyTorch equivalent |
|---|---|---|
| TensorFlow Core | Low-level tensor operations | torch |
| Keras | High-level API for models | torch.nn + Lightning |
| TFLite | Deploy on mobile devices | PyTorch Mobile / ExecuTorch |
| TF.js | Run in the browser | ONNX Runtime Web (formerly ONNX.js) |
| TF Serving | Production inference server | TorchServe |
| TFX | ML pipeline (from data to deploy) | MLflow + Kubeflow |
**TensorFlow's edge: the production ecosystem.** TFLite runs on billions of mobile devices. TF.js runs in the browser without a server. TF Serving handles millions of requests per second at Google. Search, Gmail, Google Photos, YouTube - all on TensorFlow.
**Do not mix up TF 1.x and TF 2.x** - effectively two different frameworks. Most complaints about TensorFlow target version 1.x. TF 2 with Keras is a modern, convenient tool. Check the version when reading old tutorials.
What was the main change introduced by TensorFlow 2.0?
Eager Execution: compute immediately
**Eager execution** is the mode where operations run immediately, like normal Python. Write `a + b` - get the result on the spot, not a description of some future computation. PyTorch worked this way from day one. TensorFlow flipped to it as the default in version 2.0.
**Eager execution's killer feature: debugging.** Drop print() anywhere in the model to see real values. Set breakpoints in pdb. Standard Python profiling tools just work. Critical for research, where models are experimental and crawling with bugs.
| Property | Eager Execution | Graph Mode |
|---|---|---|
| Computation | Immediate | Deferred (compile -> run) |
| Debugging | print, pdb, breakpoints | Harder - needs special tools |
| Python control flow | if/for work natively | Need tf.cond / tf.while_loop (TF1) |
| Speed | Baseline | Optimized (operator fusion, etc.) |
| Usage | Research, prototyping | Production, deployment |
**Rule of thumb:** stick with eager execution during development and debugging. Flip to graph mode (torch.compile, tf.function) once the model is production-ready. Most researchers never leave eager mode - optimization only matters at scale.
Why is eager execution more convenient for debugging neural networks?
Graph Mode: optimization for production
**Eager execution is convenient but slower.** Every operation hits the Python interpreter, allocates intermediate tensors, fires commands at the GPU one at a time. **Graph mode** ingests the full computational graph at once and optimizes it: fuses operations (operator fusion), strips redundant computations, plans memory.
**What does the compiler actually do?** Operator fusion - merging multiple operations into one (Linear + ReLU = one GPU kernel instead of two). Memory planning - recycling memory from tensors no longer needed. Constant folding - pre-computing static expressions. None of these optimizations are reachable in eager mode because the framework only sees one operation at a time.
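Operator fusion can be illustrated with a toy graph-rewrite pass in plain Python. This is a sketch under simplifying assumptions: "kernels" are ordinary Python functions over lists, `fuse` and `execute` are hypothetical names, and a real compiler works on a much richer intermediate representation. The point it shows is that the rewrite needs to see two adjacent operations at once, which is only possible when the whole graph is available.

```python
# Toy operator fusion (illustrative only): Linear + ReLU become one "kernel".

KERNELS = {
    "linear": lambda xs, w, b: [w * x + b for x in xs],
    "relu": lambda xs, w, b: [max(0.0, x) for x in xs],
    # Fused kernel: one pass over the data, no intermediate list.
    "linear_relu": lambda xs, w, b: [max(0.0, w * x + b) for x in xs],
}


def fuse(ops):
    """Rewrite every adjacent ('linear', 'relu') pair into 'linear_relu'."""
    fused, i = [], 0
    while i < len(ops):
        if ops[i] == "linear" and i + 1 < len(ops) and ops[i + 1] == "relu":
            fused.append("linear_relu")
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


def execute(ops, xs, w, b):
    """Run the ops in order and count how many kernels were launched."""
    launches = 0
    for op in ops:
        xs = KERNELS[op](xs, w, b)
        launches += 1
    return xs, launches


graph = ["linear", "relu", "linear", "relu"]
data = [-2.0, -0.5, 1.0, 3.0]

print(execute(graph, data, 2.0, 1.0))         # ([1.0, 1.0, 7.0, 15.0], 4)
print(execute(fuse(graph), data, 2.0, 1.0))   # ([1.0, 1.0, 7.0, 15.0], 2)
```

Both runs produce the same values, but the fused graph launches half as many kernels and never materializes the intermediate result of `linear`.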
**torch.compile() is not a free speedup.** The first call is slower due to compilation (sometimes minutes). For small models the compile overhead may never pay off. Dynamic tensor shapes (varying batch length) can trigger recompilation. Start without compile - add it once speed becomes the bottleneck.
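Why varying shapes can trigger recompilation is easier to see with a toy model of a compile cache. The sketch below is plain Python and a deliberate simplification: `ToyCompiled` is a hypothetical class that keys its cache on the exact input shape and assumes a non-empty 2D batch, whereas real compilers use richer guard conditions and can sometimes generalize over dynamic dimensions.

```python
# Toy compile cache keyed by input shape (illustrative only).

class ToyCompiled:
    """Wraps a function and 'compiles' it once per distinct input shape."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}
        self.compilations = 0

    def __call__(self, batch):
        shape = (len(batch), len(batch[0]))   # (rows, columns) of a 2D batch
        if shape not in self.cache:
            # Stand-in for the expensive step: a new shape means a new compile.
            self.compilations += 1
            self.cache[shape] = self.fn
        return self.cache[shape](batch)


row_sums = ToyCompiled(lambda batch: [sum(row) for row in batch])

row_sums([[1, 2, 3], [4, 5, 6]])               # first call: compile for (2, 3)
row_sums([[7, 8, 9], [1, 1, 1]])               # same shape: cache hit
row_sums([[1, 2, 3], [4, 5, 6], [7, 8, 9]])    # new batch size: compile again
print(row_sums.compilations)                   # 2
```

If every batch arrives with a different length, the cache never pays off and the compile cost is paid again and again, which is the failure mode the paragraph above warns about.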
**Practical advice for 2026:** new project? Start with PyTorch. Mobile deployment? Consider ONNX or ExecuTorch. Browser? TensorFlow.js or ONNX Runtime Web. Peak inference speed on NVIDIA? TensorRT. The 'framework wars' are over - pick what fits the task.
**Myth:** PyTorch is for research, TensorFlow is for production. Each framework only suits its own niche.
**Reality:** This claim is outdated. PyTorch 2.0 with torch.compile(), TorchServe, and ExecuTorch has closed the production gap. TensorFlow 2.x with Keras and eager execution became more convenient for research. Both frameworks can be used for the full cycle.
Historically (2016-2019), PyTorch was more convenient for experiments while TensorFlow had a more mature production ecosystem. But since 2020 both frameworks have been actively borrowing the best ideas from each other. Meta and Google use their frameworks for both scenarios.
What does torch.compile() do under the hood?
Key Ideas
- **PyTorch** - define-by-run, Pythonic API, the standard for research (80%+ papers). Code reads like plain Python
- **TensorFlow** - from static graph (TF1) to eager execution and Keras (TF2). Strong production ecosystem (TFLite, TF.js, TF Serving)
- **Eager execution** - operations execute immediately. Convenient for debugging and research. Default in both frameworks
- **Graph mode** (torch.compile, tf.function) - optimizes computations: operator fusion, memory planning. Needed for production
Related Topics
The choice of framework is a core decision affecting the entire pipeline:
- Backpropagation — PyTorch autograd and TF GradientTape - different implementations of the same algorithm
- CNN Architectures — Subsequent lessons build convolutional networks using PyTorch
- MLOps Pipeline — Model deployment depends on the framework: TorchServe vs TF Serving vs ONNX
Questions for Reflection
- Why did PyTorch win in research despite TensorFlow coming out earlier and having Google's backing?
- torch.compile() and tf.function() convert eager code into an optimized graph. Why is eager mode needed at all - why not always compile?
- Starting a new ML project today - which framework to choose and why?
Related Lessons
- dl-02 — Backpropagation is the algorithm PyTorch autograd and TF GradientTape implement differently
- dl-04 — CNN architectures are built with PyTorch in subsequent lessons
- ml-09-gradient-descent — Adam/SGD optimizers are concrete gradient descent implementations in these frameworks
- ml-45-mlops-pipeline — Model deployment depends on framework: TorchServe vs TF Serving vs ONNX
- dl-01 — Computational graph concept is introduced in the first DL lesson
- ml-28-optimizers — Optimizers explain which framework API to choose
- ml-25-neural-networks