Python for Data Science

In 2008, DJ Patil and Jeff Hammerbacher, then leading the data teams at LinkedIn and Facebook, coined the job title 'data scientist'. In 2012, Harvard Business Review called data scientist 'the sexiest job of the 21st century'. In about 15 years the field went from a niche specialty to an industry with 11M open positions. Nearly everyone in it starts with the same stack: NumPy, Pandas, Matplotlib, Jupyter.

**Netflix** - Pandas for EDA of viewing patterns across 260M+ users. Prototyping happens in Jupyter, production runs in Python services.
**NASA** - Hubble and James Webb telescope data is processed with NumPy and Matplotlib. The Event Horizon Telescope collaboration's first black hole image (2019) also relied on NumPy.
**Kaggle** - 87% of participants use Python; the stack is the same: pandas + numpy + seaborn.
**OpenAI** - all preprocessing of GPT training data runs through NumPy-compatible pipelines.
**Spotify** - Jupyter notebooks for analyzing 100B+ listening events in A/B tests.

John Tukey and the birth of modern data analysis

In 1962, mathematician John Tukey published 'The Future of Data Analysis' - the paper that first argued for data analysis as a discipline in its own right. Tukey went on to introduce EDA (Exploratory Data Analysis) and the box plot, and co-developed the Fast Fourier Transform algorithm with James Cooley. His philosophy: 'Far better an approximate answer to the right question than an exact answer to the wrong question.' When Travis Oliphant created NumPy in 2005, he built on the tradition Tukey had laid down four decades earlier.

Prerequisites

What is Data Science

NumPy: vectorization instead of loops

Python is one of the slowest popular languages. A pure-Python loop over a list of a million numbers can run up to 100x slower than the C equivalent. That sounds like a death sentence for data work. But in 2005, Travis Oliphant created **NumPy** - and everything changed. The secret: array operations execute not in the Python interpreter but in compiled, optimized C code that can use SIMD processor instructions.

The core NumPy data structure is the **ndarray**. Unlike a Python list, all elements in an ndarray share one type (int32, float64), are stored contiguously in memory, and are processed by **vectorized operations** with no explicit loop. PyTorch follows the same model: conceptually, a tensor is an ndarray with autograd and GPU support. Every `model(x)` call internally triggers hundreds of vectorized operations in the same style, executed by PyTorch's own backend rather than by NumPy.

**Why the gap?** A Python list stores pointers to objects scattered in memory. For each element Python checks the type, unpacks the value, performs the operation, and repacks. NumPy stores raw numbers contiguously and calls C code that processes the entire block in one pass.
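The difference can be measured directly. The sketch below squares a million numbers twice, once with a Python list comprehension and once with a single vectorized NumPy expression. The variable names are illustrative, and the timings depend on the machine, so no specific speedup is claimed here.

```python
import time

import numpy as np

n = 1_000_000
values = list(range(n))               # Python list: pointers to int objects
arr = np.arange(n, dtype=np.float64)  # ndarray: one contiguous block of float64

# Pure-Python version: the interpreter handles every element individually.
start = time.perf_counter()
squares_list = [v * v for v in values]
loop_seconds = time.perf_counter() - start

# Vectorized version: one expression, the loop runs inside compiled code.
start = time.perf_counter()
squares_arr = arr * arr
vector_seconds = time.perf_counter() - start

print(f"list comprehension: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```

Both versions compute the same values; only the place where the loop runs differs.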

**Slicing creates a view, not a copy.** With basic slicing, `b = a[1:4]` is a reference to the same memory region, so modifying `b` modifies `a`. For an independent copy: `b = a[1:4].copy()`. Indexing with a list or a boolean mask, by contrast, always returns a copy. The view optimization saves memory but catches beginners off-guard.
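A minimal sketch of the view-versus-copy behavior, using a small illustrative array. `np.shares_memory` is used here only to make the memory relationship visible.

```python
import numpy as np

a = np.arange(6)            # array([0, 1, 2, 3, 4, 5])

b = a[1:4]                  # basic slice: a view onto a's memory
b[0] = 99                   # writes through to a
print(a)                    # a[1] is now 99

c = a[1:4].copy()           # explicit copy: independent memory
c[0] = -1                   # a is not affected
print(a)

d = a[[1, 2, 3]]            # indexing with a list returns a copy, not a view

print(np.shares_memory(a, b))  # view shares memory with a
print(np.shares_memory(a, c))  # copy does not
print(np.shares_memory(a, d))  # list-indexed result does not
```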

Given `m = np.arange(20).reshape(4, 5)`. What does `m[1:3, ::2]` return?

Pandas: DataFrame as a first-class citizen

2008. Wes McKinney is working at AQR Capital Management and is tired of clunky tools for financial data tables. He writes **Pandas** - a library that turned Python into a tool rivaling Excel and R. Today: 27 million downloads per week. On Kaggle, where 87% of participants use Python, a typical notebook opens with `pd.read_csv()`.

Two structures: **Series** - a one-dimensional array with an index; **DataFrame** - a two-dimensional table. Unlike a NumPy ndarray, DataFrame columns can have different types: int, float, string, datetime, bool. SQL-style operations - groupby, merge, filter - work out of the box.
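As a small illustration of mixed column types, the sketch below builds a DataFrame from made-up data (the names and values are hypothetical) and pulls one column out as a Series.

```python
import pandas as pd

# Each column has its own dtype; the values are invented for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [34, 28, 45],
    "salary": [5200.0, 4100.5, 6100.0],
    "hired": pd.to_datetime(["2020-01-15", "2021-06-01", "2018-09-30"]),
    "remote": [True, False, True],
})

ages = df["age"]   # selecting one column returns a Series with the same index

print(df.dtypes)   # one dtype per column
print(type(ages))
```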

| Operation | SQL | Pandas |
|---|---|---|
| Filter | `WHERE age > 30` | `df[df['age'] > 30]` |
| Sort | `ORDER BY salary DESC` | `df.sort_values('salary', ascending=False)` |
| Group | `GROUP BY dept` | `df.groupby('dept')` |
| Join | `JOIN ON id` | `df1.merge(df2, on='id')` |
| Limit | `LIMIT 10` | `df.head(10)` |
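The operations in the table can be tried on two tiny, made-up frames. The column names and values are assumptions for illustration. Note that `groupby` on its own only defines the groups; like SQL's `GROUP BY`, it needs an aggregate such as `.mean()` to produce a result.

```python
import pandas as pd

# Hypothetical sample data.
employees = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "dept": ["eng", "eng", "ops", "ops"],
    "age": [34, 28, 45, 31],
    "salary": [120, 95, 80, 70],
})
offices = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
})

over_30 = employees[employees["age"] > 30]                 # WHERE age > 30
by_salary = employees.sort_values("salary", ascending=False)  # ORDER BY salary DESC
avg_by_dept = employees.groupby("dept")["salary"].mean()   # GROUP BY dept + AVG(salary)
joined = employees.merge(offices, on="id")                 # JOIN ON id
top2 = by_salary.head(2)                                   # LIMIT 2

# Filter first, then aggregate: average salary per dept for people over 30.
avg_over_30 = over_30.groupby("dept")["salary"].mean()

print(avg_by_dept)
print(avg_over_30)
```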

Find the average order value per city, but only for orders above 50. Which code is correct?

Matplotlib and Seaborn: data that speaks

2003. Neuroscientist John Hunter is tired of MATLAB and creates **Matplotlib** to analyze brain signals in Python. Today countless scientific papers, Kaggle notebooks, and company reports start with `import matplotlib.pyplot as plt`. **Seaborn** is the statistical layer on top: correlation matrices, distributions, regressions in a single line.

Two Matplotlib APIs: **pyplot** (`plt.plot()`) - quick to type, but it relies on an implicit "current figure" and gets awkward as plots grow. **Figure/Axes OOP API** (`fig, ax = plt.subplots()`) - full, explicit control. Professional work generally prefers the Axes API: it stays predictable with multiple subplots and custom styling.
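A minimal sketch of the Axes API with two subplots, using synthetic sine and cosine data. The `Agg` backend line is an assumption that makes the script run without a display; in a notebook it can be dropped so the figure renders inline.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One Figure, two Axes: every call names the subplot it acts on.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

ax_left.plot(x, np.sin(x), label="sin(x)")
ax_left.set_title("Sine")
ax_left.set_xlabel("x")
ax_left.legend()

ax_right.plot(x, np.cos(x), color="tab:orange", label="cos(x)")
ax_right.set_title("Cosine")
ax_right.set_xlabel("x")
ax_right.legend()

fig.tight_layout()
# fig.savefig("axes_api_demo.png")  # uncomment to write the figure to disk
```

Because each call is made on a specific `ax`, adding a third subplot or restyling one panel does not affect the others.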

| Goal | Chart type | Library |
|---|---|---|
| Trend over time | Line plot | matplotlib |
| Category comparison | Bar plot | matplotlib / seaborn |
| Two numeric variables | Scatter plot | matplotlib / seaborn |
| Distribution + outliers | Box plot / Violin plot | seaborn |
| Correlations | Heatmap | seaborn |
| Multi-dimensional patterns | Pair plot | seaborn |

**Chart title = insight, not description.** Not 'Revenue by month' but 'Revenue grew 83% over the year, anomalous dip in March'. The reader should not need to interpret - the chart speaks for itself.

**Pie charts do not work for comparison.** The human eye compares angles poorly. If two slices are close (33% vs 28%), the difference is nearly invisible in a pie, while bars on a common baseline make it obvious. Pie charts are acceptable only for 2-3 categories with an obvious difference.

Show the relationship between income and age for 5,000 clients and highlight outliers. Which chart is best?

Jupyter: the data scientist's lab

2014. Fernando Perez spins the language-agnostic parts of IPython off into **Project Jupyter** - the name comes from Julia + Python + R. The idea: a single environment where code, text, formulas, and visualizations live together. Today GitHub hosts more than 10 million notebooks, and Kaggle's notebook environment is built on Jupyter. But there is a boundary that must not be crossed: Jupyter is an exploration environment, not production.

**When Jupyter is the right choice:** EDA, prototyping, presenting results, teaching. **When Jupyter is the wrong choice:** production ETL pipelines, microservices, cron jobs, team development (merge conflicts in JSON-format notebooks are a nightmare). Netflix, Airbnb, Spotify all use Jupyter for research but deploy clean Python through CI/CD.

**Hidden state is Jupyter's main trap.** Cells can execute in any order, and the kernel remembers everything that has ever run. A variable defined in a cell that was later edited or deleted is still live in memory, so cells that depend on it keep working. Everything seems fine - until the next restart. Rule: periodically run **Kernel -> Restart & Run All** to confirm the notebook executes top-to-bottom.
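The trap can be imitated outside Jupyter. The sketch below is a simplified simulation, not how a notebook is actually run: a plain dictionary stands in for the kernel's namespace, and each `exec` call stands in for running one cell. The cell contents are hypothetical.

```python
# A shared namespace plays the role of the running kernel.
kernel = {}

# Session history: three cells were run, then cell 1 was deleted from the notebook.
exec("threshold = 50", kernel)                               # cell 1 (later deleted)
exec("data = [10, 60, 90]", kernel)                          # cell 2
exec("kept = [x for x in data if x > threshold]", kernel)    # cell 3

print(kernel["kept"])  # works: `threshold` is still in the kernel's memory

# "Restart & Run All": a fresh namespace, and only the cells that still exist.
fresh_kernel = {}
exec("data = [10, 60, 90]", fresh_kernel)                    # cell 2
try:
    exec("kept = [x for x in data if x > threshold]", fresh_kernel)  # cell 3
    restart_error = None
except NameError as exc:
    restart_error = exc  # `threshold` was never defined in this session

print(repr(restart_error))
```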

**Best workflow:** explore in Jupyter, move stable code into `.py` files and import it back: `from src.preprocessing import clean_data`. Jupyter interactivity plus testable, version-controlled Python files.
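A sketch of what such an extracted module might contain. The body of `clean_data` is an assumption made for illustration (the lesson names the function but does not define it), and the test is a hypothetical example of what becomes possible once the code lives in a `.py` file.

```python
# Hypothetical contents of src/preprocessing.py
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without missing values or duplicate rows."""
    return df.dropna().drop_duplicates().reset_index(drop=True)


# Hypothetical contents of tests/test_preprocessing.py
def test_clean_data_removes_missing_and_duplicates():
    raw = pd.DataFrame({
        "city": ["Oslo", "Oslo", None],
        "amount": [10.0, 10.0, 5.0],
    })
    cleaned = clean_data(raw)
    assert len(cleaned) == 1               # one duplicate and one incomplete row dropped
    assert cleaned.loc[0, "city"] == "Oslo"
    assert len(raw) == 3                   # the input frame is left untouched


test_clean_data_removes_missing_and_duplicates()
```

In the notebook, the function is then imported with `from src.preprocessing import clean_data`, while the test runs in CI alongside the rest of the codebase.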

**Myth:** Jupyter is fine for production code. If the model works in a notebook, it can be deployed as-is.

**Reality:** Jupyter is for exploration and EDA. Production code lives in .py files with tests, logging, and CI/CD.

**Why:** Notebooks have hidden state, work poorly with git (the JSON format produces unreadable diffs), and lack proper error handling. Netflix, Airbnb, and Spotify use Jupyter for research but deploy via CI/CD pipelines.

A team built a model in a Jupyter Notebook. What is the right path to production deployment?

Key Takeaways

**NumPy** - the foundation: typed ndarray + vectorized ops = C speed with Python convenience. Broadcasting eliminates loops
**Pandas** - tables (DataFrame) with .loc/.iloc, groupby, and method chaining. SQL-style operations in Python, 27M downloads per week
**Matplotlib + Seaborn** - Axes API for control. One chart, one insight. Title = conclusion, not label
**Jupyter** - a lab for exploration, not production. Restart & Run All is a mandatory ritual
**The stack together** - NumPy + Pandas + Matplotlib + Jupyter covers most of the data scientist's everyday toolkit, from Netflix to Kaggle
**History** - Tukey 1962 laid the EDA philosophy, Oliphant 2005 gave speed via NumPy, McKinney 2008 embodied it in Pandas

Questions for reflection

A CSV with 10 million rows. The loop `for _, row in df.iterrows()` runs for 40 minutes. How would you speed it up using knowledge from this lesson?
A colleague sent a Jupyter Notebook with 80 cells. Running Restart & Run All fails on cell 45. What most likely went wrong?
When should Pandas be used over plain SQL? What data size threshold matters?

Related lessons

ds-01 — Data pipeline and project lifecycle from the first lesson
ds-03 — NumPy and Pandas are the foundation for descriptive statistics
ml-04-data-preprocessing — Pandas transforms and NumPy ops are the core of ML preprocessing
ml-01-intro — Scikit-learn is built on NumPy; Pandas is the path to model training
prob-01-intro — NumPy random ops are practical probability in action
stat-31-eda — Pandas + Seaborn are the EDA toolkit in statistical analysis