Data Science
Python for Data Science
DJ Patil and Jeff Hammerbacher coined the term 'data scientist' in 2008 while leading data teams at LinkedIn and Facebook. In 2012, Harvard Business Review called data scientist 'the sexiest job of the 21st century'. In about 15 years the field went from a mathematician's hobby to an industry with a reported 11M open positions. Nearly every career in it starts with the same stack: NumPy, Pandas, Matplotlib, Jupyter.
- **Netflix** uses Pandas for EDA of viewing patterns across 260M+ users. Prototyping in Jupyter, production in Python services
- **NASA** processes Hubble and James Webb telescope data with NumPy and Matplotlib. The first black hole image (2019), produced by the Event Horizon Telescope collaboration, was also processed with NumPy-based software
- **Kaggle** - 87% of participants use Python; the stack is the same: pandas + numpy + seaborn
- **OpenAI** - all preprocessing of GPT training data runs through NumPy-compatible pipelines
- **Spotify** - Jupyter notebooks for analyzing 100B+ listening events in A/B tests
John Tukey and the birth of modern data analysis
In 1962, mathematician John Tukey published 'The Future of Data Analysis' - a paper often credited with first framing data analysis as a discipline in its own right. Tukey went on to introduce EDA (Exploratory Data Analysis) and the box plot, and co-developed the Fast Fourier Transform algorithm with James Cooley. His philosophy: 'Far better an approximate answer to the right question than an exact answer to the wrong question.' When Travis Oliphant created NumPy in 2005, he built on the tradition Tukey had laid down 40 years earlier.
Prerequisites
NumPy: vectorization instead of loops
Python is one of the slowest popular languages. A pure-Python loop over a list of a million numbers can run on the order of 100x slower than the C equivalent. That sounds like a death sentence for data work. But in 2005, Travis Oliphant created **NumPy** - and everything changed. The secret: operations execute not in the Python interpreter but in optimized, compiled C code that can use SIMD processor instructions.
The core NumPy data structure is the **ndarray**. Unlike a Python list, all elements in an ndarray share one type (int32, float64), are normally stored contiguously in memory, and are processed by **vectorized operations** with no explicit loop. PyTorch follows the same model: a tensor is conceptually an ndarray with autograd and GPU support, although PyTorch has its own implementation and does not run on NumPy. Every `model(x)` call internally triggers many such vectorized operations.
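The properties described above can be inspected directly. The following is a minimal sketch; the `prices` values and the 1.2 multiplier are made up for illustration.

```python
import numpy as np

# A Python list may mix types; an ndarray holds exactly one dtype.
prices = np.array([19.99, 5.49, 102.00], dtype=np.float64)

print(prices.dtype)                   # the single element type of the array
print(prices.itemsize)                # bytes used by each element
print(prices.flags["C_CONTIGUOUS"])   # whether the data is one contiguous block

# Vectorized operation: the multiplication is applied to every element
# in compiled code, with no Python-level loop.
with_tax = prices * 1.2
print(with_tax)
```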
**Why the gap?** A Python list stores pointers to objects scattered in memory. For each element Python checks the type, unpacks the value, performs the operation, and repacks. NumPy stores raw numbers contiguously and calls C code that processes the entire block in one pass.
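One way to see the gap on your own machine is to time both approaches on the same task. This is a rough sketch, not a rigorous benchmark: the task (sum of squares) is chosen for illustration, and the measured ratio depends on hardware, Python version, and NumPy build.

```python
import time

import numpy as np

n = 1_000_000
values_list = list(range(n))
values_arr = np.arange(n, dtype=np.int64)

# Pure-Python loop: one interpreter step per element.
start = time.perf_counter()
loop_total = 0
for v in values_list:
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# Vectorized: the multiply and the sum both run in compiled code.
start = time.perf_counter()
vector_total = int(np.sum(values_arr * values_arr))
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```

Both versions compute the same total; only the time spent differs.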
**Slicing creates a view, not a copy.** With basic slicing, `b = a[1:4]` refers to the same memory region as `a`, so modifying `b` modifies `a`. For an independent copy: `b = a[1:4].copy()`. (Indexing with a list or boolean mask, by contrast, returns a copy.) This optimization saves memory but catches beginners off-guard.
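A short sketch of the view-versus-copy behavior, using `np.shares_memory` to check whether two arrays overlap in memory. The array contents are arbitrary.

```python
import numpy as np

a = np.arange(6)           # [0 1 2 3 4 5]

b = a[1:4]                 # basic slice: a view onto a's memory
b[0] = 99                  # writes through, so a[1] becomes 99

c = a[1:4].copy()          # explicit copy: independent memory
c[0] = -1                  # a is not affected

d = a[[1, 2, 3]]           # indexing with a list of positions returns a copy

shares_view = np.shares_memory(a, b)
shares_copy = np.shares_memory(a, c)
shares_fancy = np.shares_memory(a, d)

print(a)
print(shares_view, shares_copy, shares_fancy)
```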
Given `m = np.arange(20).reshape(4, 5)`. What does `m[1:3, ::2]` return?
Pandas: DataFrame as a first-class citizen
2008. Wes McKinney is working at AQR Capital Management and is tired of clunky financial data tables. He writes **Pandas** in his spare time - a library that turned Python into a tool rivaling Excel and R. Today: 27 million downloads per week. On Kaggle, where 87% of participants use Python, a typical notebook starts with `pd.read_csv()`.
Two structures: **Series** - a one-dimensional array with an index; **DataFrame** - a two-dimensional table. Unlike a NumPy ndarray, DataFrame columns can have different types: int, float, string, datetime, bool. SQL-style operations - groupby, merge, filter - work out of the box.
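A minimal sketch of the two structures. The names, ages, and salaries are invented for illustration.

```python
import pandas as pd

# Series: one-dimensional values plus an index of labels.
ages = pd.Series([34, 28, 45], index=["ann", "bob", "cy"], name="age")

# DataFrame: a table in which each column has its own dtype.
df = pd.DataFrame(
    {
        "name": ["ann", "bob", "cy"],
        "age": [34, 28, 45],
        "salary": [5200.0, 4100.5, 6100.0],
        "hired": pd.to_datetime(["2019-03-01", "2021-07-15", "2015-11-30"]),
        "remote": [True, False, True],
    }
)

print(ages["bob"])   # label-based lookup on the Series index
print(df.dtypes)     # one dtype per column
```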
| Operation | SQL | Pandas |
|---|---|---|
| Filter | WHERE age > 30 | df[df['age'] > 30] |
| Sort | ORDER BY salary DESC | df.sort_values('salary', ascending=False) |
| Group | GROUP BY dept | df.groupby('dept') |
| Join | JOIN ON id | df1.merge(df2, on='id') |
| Limit | LIMIT 10 | df.head(10) |
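The rows of the table can be tried on a small frame. This is a sketch with hypothetical `employees` and `offices` tables; the column names follow the ones used in the table above, and the `city` column is an added assumption.

```python
import pandas as pd

employees = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "dept": ["eng", "eng", "ops", "ops"],
        "age": [41, 29, 35, 52],
        "salary": [120, 95, 70, 88],
    }
)
offices = pd.DataFrame(
    {"id": [1, 2, 3, 4], "city": ["Oslo", "Oslo", "Riga", "Riga"]}
)

# WHERE age > 30
over_30 = employees[employees["age"] > 30]

# ORDER BY salary DESC
by_salary = employees.sort_values("salary", ascending=False)

# SELECT dept, AVG(salary) ... GROUP BY dept
avg_salary = employees.groupby("dept")["salary"].mean()

# JOIN ... ON id (merge performs an inner join by default)
joined = employees.merge(offices, on="id")

# LIMIT 2, applied to the sorted frame
top_2 = by_salary.head(2)

print(avg_salary)
```

Note that `groupby` alone only defines the groups; as in SQL, a result appears once an aggregation such as `.mean()` is applied.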
Find the average order value per city, but only for orders above 50. Which code is correct?
Matplotlib and Seaborn: data that speaks
2003. Neuroscientist John Hunter is tired of MATLAB and creates **Matplotlib** to analyze brain signals in Python. Today countless research papers, Kaggle notebooks, and company reports start with `import matplotlib.pyplot as plt`. **Seaborn** is the statistical layer on top: correlation matrices, distributions, regressions in a single line.
Two Matplotlib APIs: **pyplot** (`plt.plot()`) - fast but limited, because it acts on an implicit "current" figure. **Figure/Axes OOP API** (`fig, ax = plt.subplots()`) - full control. Professional work typically uses the Axes API: it stays predictable with multiple subplots and custom styling.
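A sketch of the Axes API with two subplots. The monthly numbers are made up, and the `Agg` backend is selected only so the script also runs without a display; in a notebook that line is unnecessary.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is required
import matplotlib.pyplot as plt

months = list(range(1, 13))
revenue = [10, 11, 7, 12, 13, 14, 15, 15, 16, 17, 18, 18]               # made-up values
orders = [100, 105, 80, 110, 118, 121, 130, 131, 140, 150, 155, 160]    # made-up values

# One Figure, two Axes; each Axes object is configured explicitly.
fig, (ax_rev, ax_ord) = plt.subplots(1, 2, figsize=(10, 4))

ax_rev.plot(months, revenue, marker="o")
ax_rev.set_title("Revenue")
ax_rev.set_xlabel("Month")
ax_rev.set_ylabel("Revenue (arbitrary units)")

ax_ord.bar(months, orders)
ax_ord.set_title("Orders")
ax_ord.set_xlabel("Month")

fig.tight_layout()

# Render to an in-memory PNG; pass a file path instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Because every call names the Axes it acts on, adding a third subplot does not change which chart a given line of code affects.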
| Goal | Chart type | Library |
|---|---|---|
| Trend over time | Line plot | matplotlib |
| Category comparison | Bar plot | matplotlib / seaborn |
| Two numeric variables | Scatter plot | matplotlib / seaborn |
| Distribution + outliers | Box plot / Violin plot | seaborn |
| Correlations | Heatmap | seaborn |
| Multi-dimensional patterns | Pair plot | seaborn |
**Chart title = insight, not description.** Not 'Revenue by month' but 'Revenue grew 83% over the year, anomalous dip in March'. The reader should not need to interpret - the chart speaks for itself.
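One way to keep an insight title honest is to compute it from the data rather than type it by hand. The sketch below uses invented monthly revenue figures and a simple heuristic (largest month-over-month drop) to name the dip; both are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is required
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
revenue = [100, 108, 70, 115, 124, 131, 140, 148, 155, 166, 175, 183]  # made-up values

# Growth from the first to the last month, in percent.
growth_pct = (revenue[-1] - revenue[0]) / revenue[0] * 100

# Month with the largest drop relative to the previous month.
dip_index = min(range(1, len(revenue)), key=lambda i: revenue[i] - revenue[i - 1])
dip_month = months[dip_index]

fig, ax = plt.subplots()
ax.plot(months, revenue, marker="o")
ax.set_ylabel("Revenue (arbitrary units)")
ax.set_title(f"Revenue grew {growth_pct:.0f}% over the year, with a dip in {dip_month}")
```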
**Pie charts do not work for comparison.** The human eye compares angles poorly. If two slices are close (33% vs 28%), the difference is nearly invisible in a pie, while bar lengths on a common baseline make it obvious. Pie charts are acceptable only for 2-3 categories with an obvious difference.
Show the relationship between income and age for 5,000 clients and highlight outliers. Which chart is best?
Jupyter: the data scientist's lab
2014. Fernando Perez spins the language-agnostic notebook out of IPython into **Project Jupyter** - named for Julia + Python + R. (IPython itself lives on as the Python kernel.) The idea: a single environment where code, text, formulas, and visualizations live together. Today GitHub hosts more than 10 million notebooks, and Kaggle's notebook environment is built on Jupyter. But there is a boundary that must not be crossed: Jupyter is an exploration environment, not production.
**When Jupyter is the right choice:** EDA, prototyping, presenting results, teaching. **When Jupyter is the wrong choice:** production ETL pipelines, microservices, cron jobs, team development (merge conflicts in JSON-format notebooks are a nightmare). Netflix, Airbnb, Spotify all use Jupyter for research but deploy clean Python through CI/CD.
**Hidden state is Jupyter's main trap.** Cells can execute in any order, and the kernel remembers everything that has run. A variable defined in cell 3 stays live even after that cell is edited or deleted, so cell 5 keeps working. Everything seems fine - until the next restart. Rule: periodically run **Kernel -> Restart & Run All** to confirm the notebook executes top-to-bottom.
**Best workflow:** explore in Jupyter, move stable code into `.py` files and import it back: `from src.preprocessing import clean_data`. Jupyter interactivity plus testable, version-controlled Python files.
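As a sketch of what the extracted module might look like: the `src/preprocessing.py` path and the `clean_data` name come from the import above, but the body and the test are hypothetical and only illustrate the pattern of a small, pure, testable function.

```python
# src/preprocessing.py (hypothetical contents)
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: normalized column names, no duplicate or empty rows."""
    cleaned = df.copy()  # never mutate the caller's frame
    cleaned.columns = [
        str(col).strip().lower().replace(" ", "_") for col in cleaned.columns
    ]
    cleaned = cleaned.drop_duplicates().dropna(how="all")
    return cleaned.reset_index(drop=True)


# tests/test_preprocessing.py (hypothetical; runnable with pytest or called directly)
def test_clean_data_normalizes_and_deduplicates():
    raw = pd.DataFrame(
        {" Order ID ": [1, 1, 2], "City": ["Oslo", "Oslo", "Riga"]}
    )
    result = clean_data(raw)
    assert list(result.columns) == ["order_id", "city"]
    assert len(result) == 2
    assert list(raw.columns) == [" Order ID ", "City"]  # input left untouched
```

In the notebook, `from src.preprocessing import clean_data` then brings in a function whose behavior no longer depends on which cells happened to run.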
**Myth:** Jupyter is fine for production code. If the model works in a notebook, it can be deployed as-is.
**Reality:** Jupyter is for exploration and EDA. Production code lives in .py files with tests, logging, and CI/CD.
**Why:** Notebooks have hidden state, are poor with git (JSON format produces unreadable diffs), and lack proper error handling. Netflix, Airbnb, and Spotify use Jupyter for research but deploy via CI/CD pipelines.
A team built a model in a Jupyter Notebook. What is the right path to production deployment?
Key Takeaways
- **NumPy** - the foundation: typed ndarray + vectorized ops = C speed with Python convenience. Broadcasting eliminates loops
- **Pandas** - tables (DataFrame) with .loc/.iloc, groupby, and method chaining. SQL-style operations in Python, 27M downloads per week
- **Matplotlib + Seaborn** - Axes API for control. One chart, one insight. Title = conclusion, not label
- **Jupyter** - a lab for exploration, not production. Restart & Run All is a mandatory ritual
- **The stack together** - NumPy + Pandas + Matplotlib + Jupyter covers 90% of the data scientist's toolkit from Netflix to Kaggle
- **History** - Tukey 1962 laid the EDA philosophy, Oliphant 2005 gave speed via NumPy, McKinney 2008 embodied the philosophy in Pandas
Related Topics
Python tools are the foundation for all subsequent Data Science topics:
- What is Data Science — NumPy, Pandas, Matplotlib implement every stage of the data pipeline from the first lesson
- Descriptive Statistics — NumPy and Pandas provide mean, median, std out of the box
- Machine Learning: Introduction — Scikit-learn is built on NumPy. Pandas is the path to model training
Questions for Reflection
- A CSV with 10 million rows. The loop `for _, row in df.iterrows()` runs for 40 minutes. How would you speed it up using knowledge from this lesson?
- A colleague sent a Jupyter Notebook with 80 cells. Running Restart & Run All fails on cell 45. What most likely went wrong?
- When should Pandas be used over plain SQL? What data size threshold matters?
Related Lessons
- ds-01 — Data pipeline and project lifecycle from the first lesson
- ds-03 — NumPy and Pandas are the foundation for descriptive statistics
- ml-04-data-preprocessing — Pandas transforms and NumPy ops are the core of ML preprocessing
- ml-01-intro — Scikit-learn is built on NumPy; Pandas is the path to model training
- prob-01-intro — NumPy random ops are practical probability in action
- stat-31-eda — Pandas + Seaborn are the EDA toolkit in statistical analysis