Data Science

Advanced Visualization

2010. The New York Times publishes 'Mapping America' - an interactive visualization in which each dot on the US map stands for a group of residents, colored by race and ethnicity from census data. Data on some 300 million people, smooth zoom, views by race and income. Millions of readers explored that visualization; a plain table of 50 state rows would have reached far fewer. The difference is in the tool: a static matplotlib figure gives a fixed picture that readers cannot explore at their own pace. Interactive visualization turns data into an experience.

  • **Uber kepler.gl**: WebGL visualization of millions of trips on maps, open-sourced; used by city planners to analyze transportation patterns
  • **Bloomberg Terminal**: interactive financial dashboards; hundreds of thousands of subscribers get real-time charts with hover tooltips and drill-down
  • **Cytoscape for bioinformatics**: building and analyzing graphs of protein-protein interactions, metabolic pathways, and regulatory networks - science depends on network visualization

Plotly: interactive charts

In the previous lesson the storytelling narrative had to let a CFO make a decision in 60 seconds. But what if the audience is a data engineer who wants to hover over points for tooltips, filter cohorts, and zoom into a date range? A static matplotlib figure cannot do that. **Plotly** is an interactive charting library built on JavaScript (plotly.js) with bindings for Python, R, and Julia. Every chart is described as JSON and rendered in the browser, as SVG by default and via WebGL for large datasets. Plotly Express is a high-level API in the style of seaborn: one line of code produces a hoverable, zoomable, filterable chart ready to embed in Jupyter, Streamlit, or Dash.

Plotly architecture: figure = data (traces) + layout (axes, titles, legend) + frames (for animation). A trace is one data layer (scatter, bar, choropleth, and so on); a figure can contain any number of traces. The renderer is picked by context: notebook (inline HTML), browser (opens a local HTML file), static (kaleido for PNG/SVG in a pipeline). For large data, use WebGL traces (scattergl, parcoords), which stay responsive at hundreds of thousands to about a million points, where SVG traces become sluggish. Alternatives used in production: Bokeh (a similar browser-based approach, with a server for reactive apps), Altair (declarative grammar of graphics).
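The data + layout + frames structure can be written out by hand as a plain dictionary. The sketch below uses only the standard library and made-up values; it is an illustration of the figure format, not a substitute for the Plotly API.

```python
import json

# A figure is a JSON-serializable structure with three top-level keys.
figure = {
    "data": [  # traces: one entry per data layer
        {"type": "scatter", "mode": "lines", "name": "revenue",
         "x": [1, 2, 3, 4], "y": [10, 12, 9, 15]},
        {"type": "bar", "name": "orders",
         "x": [1, 2, 3, 4], "y": [3, 4, 2, 5]},
    ],
    "layout": {  # everything that is not data: titles, axes, legend
        "title": {"text": "Revenue and orders (illustrative values)"},
        "xaxis": {"title": {"text": "week"}},
        "yaxis": {"title": {"text": "value"}},
    },
    "frames": [],  # animation frames; empty for a non-animated chart
}

figure_json = json.dumps(figure)
trace_types = [trace["type"] for trace in figure["data"]]
print(trace_types)
print(len(figure_json), "bytes of JSON")
```

With Plotly installed, a dictionary of this shape can be passed to `plotly.graph_objects.Figure(figure)` to get a renderable figure object.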

The price of interactivity is HTML size. One Plotly chart with 10K points is roughly 500KB of JSON in HTML; a dashboard of 10 charts is 5MB. For email or PDF embeds, use static export (fig.write_image('chart.png', engine='kaleido')). Production dashboards are better served by Dash or Streamlit with server-side data fetching - otherwise the frontend drowns in data.
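The size arithmetic above can be checked with a rough measurement. The sketch below serializes a hypothetical scatter trace of random floats and reports the JSON size; `trace_json_bytes` is an illustrative helper, and real charts will differ depending on number precision, hover text, and extra attributes.

```python
import json
import random


def trace_json_bytes(n_points: int, seed: int = 0) -> int:
    """Size in bytes of one scatter trace with n_points random (x, y) pairs."""
    rng = random.Random(seed)
    trace = {
        "type": "scatter",
        "mode": "markers",
        "x": [rng.random() for _ in range(n_points)],
        "y": [rng.random() for _ in range(n_points)],
    }
    return len(json.dumps(trace).encode("utf-8"))


size_1k = trace_json_bytes(1_000)
size_10k = trace_json_bytes(10_000)
print(f"1K points:  {size_1k / 1024:.0f} KiB")
print(f"10K points: {size_10k / 1024:.0f} KiB")
print(f"10 charts of 10K points: {10 * size_10k / 1024 / 1024:.1f} MiB")
```

The payload grows linearly with the number of points, which is why aggregating on the server before sending data to the browser pays off quickly.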

A team builds a dashboard with metrics across 50 products and 365 days of history. Which tool combination fits best?

D3.js: visualization without limits

Plotly covers the standard chart types with a gentle learning curve. But when a custom visual is needed that no library offers - a Sankey diagram with custom animation, a force-directed graph of 5000 nodes with physics, or an interactive map of molecular interactions - there is one choice: **D3.js** (Data-Driven Documents, Mike Bostock, 2011). D3 is not a library of ready-made charts but a low-level toolkit for manipulating SVG/Canvas via data binding. The New York Times, FiveThirtyEight, and The Pudding build their signature visuals on D3, because every story has its own shape.

D3 philosophy: data binding (selectAll('.bar').data(arr) links DOM elements to data), scales (d3.scaleLinear() converts values to coordinates), axes, force simulation (physics for graphs: repulsion + links + centering), transitions (animated state changes). The learning curve is steep: about 50 lines for a first bar chart versus 1 line in Plotly. The payoff is near-unlimited flexibility. Modern frontends often use React wrappers such as react-d3 or Observable Plot (a newer high-level API from Bostock's team) to reduce boilerplate.
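The two core ideas, scales and data binding, are small enough to sketch outside the browser. The Python below is an analogy, not D3's API: `scale_linear` is a hypothetical helper that mimics what a linear scale does, and the list comprehension plays the role of binding one mark to each datum.

```python
from typing import Callable, Tuple


def scale_linear(domain: Tuple[float, float],
                 range_: Tuple[float, float]) -> Callable[[float], float]:
    """Return a function mapping values from `domain` to `range_` linearly."""
    d0, d1 = domain
    r0, r1 = range_
    if d0 == d1:
        raise ValueError("domain must span a non-zero interval")

    def scale(value: float) -> float:
        t = (value - d0) / (d1 - d0)  # position within the domain, 0..1
        return r0 + t * (r1 - r0)     # same position within the range

    return scale


values = [5, 20, 12, 40]  # illustrative data
chart_height = 200        # pixels

# SVG y coordinates grow downward, so the output range is flipped.
y = scale_linear((0, max(values)), (chart_height, 0))

# "Data binding": one bar description per data item.
bars = [{"value": v, "y": y(v), "height": chart_height - y(v)} for v in values]
for bar in bars:
    print(bar)
```

In D3 the same mapping drives attributes of SVG elements, and re-running the binding with new data is what makes transitions possible.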

A startup builds an interactive story-driven visualization for an investor pitch - an animated map of disease spread through a contact chain with chronology. Which tool fits?

Geospatial visualization

When data has latitude/longitude coordinates - sales by region, taxi trips, migration, weather sensors - a plain scatter plot is not enough. A **map projection** is needed: converting the 3D sphere into a 2D plane. Common choices: Mercator (classic navigation, preserves angles but distorts area, as with Greenland), Albers Equal Area (for statistics - preserves area, suited to regions and countries), Mollweide (equal-area projection for whole-world maps); compromise projections such as Robinson trade off shape against area. Standard formats: GeoJSON (geometry + properties in JSON), Shapefile (binary, ESRI), TopoJSON (compact GeoJSON extension that encodes topology). For production, PostGIS (geospatial extension for PostgreSQL) and tile servers (Mapbox, OpenStreetMap) are the norm.
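As an illustration of the "geometry + properties" structure, here is a minimal GeoJSON FeatureCollection built as a Python dictionary. The coordinates, names, and values are hypothetical; the one detail worth remembering is that GeoJSON positions are ordered longitude first, then latitude.

```python
import json

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            # A point: [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [-73.9857, 40.7484]},
            "properties": {"name": "pickup_1", "trips": 120},
        },
        {
            "type": "Feature",
            # A polygon: list of rings; each ring is closed (first == last).
            "geometry": {
                "type": "Polygon",
                "coordinates": [[
                    [-74.0, 40.7], [-73.9, 40.7], [-73.9, 40.8],
                    [-74.0, 40.8], [-74.0, 40.7],
                ]],
            },
            "properties": {"name": "zone_a", "sales": 5400},
        },
    ],
}

geojson_text = json.dumps(feature_collection)
geometry_types = [f["geometry"]["type"] for f in feature_collection["features"]]
outer_ring = feature_collection["features"][1]["geometry"]["coordinates"][0]
print(geometry_types, len(geojson_text), "bytes")
```

The `properties` object is where the metric for a choropleth or tooltip lives; mapping libraries join on it when coloring regions.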

Geo-visualization types: (1) **Choropleth** - regions filled by value (examples: COVID maps, election maps); suits normalized metrics such as per-capita rates, not absolute numbers (otherwise large states dominate); (2) **Heatmap** - point density on a grid (crime distribution, taxi pickups); (3) **Hexbin** - points aggregated into hexagonal cells (reduces dependence on arbitrary administrative borders, one source of the Modifiable Areal Unit Problem, although cell size still affects the picture); (4) **Flow map** - lines between source and destination (migration, flight routes). Tools: Plotly choropleth, Folium (Python wrapper over Leaflet), kepler.gl (Uber's WebGL tool for large datasets), deck.gl.
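The point about normalizing a choropleth metric is easy to see with numbers. The sketch below uses three hypothetical regions with invented figures and compares the ranking by absolute sales with the ranking by sales per capita.

```python
# Hypothetical regions; all numbers are invented for illustration.
regions = {
    "Region A": {"sales": 9_000_000, "population": 30_000_000},
    "Region B": {"sales": 1_200_000, "population": 1_000_000},
    "Region C": {"sales": 4_000_000, "population": 8_000_000},
}

# Normalized metric: sales per resident.
per_capita = {name: r["sales"] / r["population"] for name, r in regions.items()}

rank_absolute = sorted(regions, key=lambda n: regions[n]["sales"], reverse=True)
rank_per_capita = sorted(per_capita, key=per_capita.get, reverse=True)

print("by absolute sales:", rank_absolute)
print("by sales per capita:", rank_per_capita)
```

The most populous region leads on the absolute map and comes last on the per-capita map, so the two choropleths would tell opposite stories.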

Web Mercator is the default for most web maps (Google Maps, OpenStreetMap tiles), but it inflates area toward the poles. Greenland looks about as large as Africa, even though its area is roughly 14 times smaller. For global statistical maps (COVID, GDP per capita) use equal-area projections (Eckert IV, Mollweide); Albers Equal Area suits single countries or continents. Otherwise the visual bias exaggerates the role of northern countries.
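The distortion can be quantified. For the spherical Mercator projection, $x = R\lambda$ and $y = R\ln\tan(\pi/4 + \varphi/2)$, and the local area inflation at latitude $\varphi$ is $1/\cos^2\varphi$. The sketch below evaluates these formulas; the sample latitudes are illustrative, and a single latitude is only a rough proxy for a whole landmass.

```python
import math


def mercator_xy(lon_deg: float, lat_deg: float, radius: float = 1.0):
    """Spherical Mercator: project longitude/latitude in degrees to x, y."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y


def mercator_area_inflation(lat_deg: float) -> float:
    """Factor by which Mercator enlarges small areas at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


for lat in (0, 45, 60, 72):
    print(f"latitude {lat:>2} deg: area inflated x{mercator_area_inflation(lat):.1f}")
```

Areas near the equator are drawn almost true to size, while at about 72 degrees north, a latitude that crosses central Greenland, they are inflated roughly tenfold.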

An analyst builds a choropleth map of the US showing absolute sales by state. What is wrong?

Network graphs: links as the primary object

Sometimes the primary data is not values per object but **relationships between objects**: a repost network in social media, a graph of transactions between bank accounts, a web hyperlink graph, the neural network of the brain. A graph = nodes + edges (directed or undirected, optionally weighted). Visualizing a graph requires solving a fundamental problem: **layout** - where to place nodes in 2D space so that the link structure is readable. There is no single correct solution; force-directed algorithms are the usual choice (Fruchterman-Reingold, ForceAtlas2): nodes 'repel' each other, edges 'pull' connected nodes together, and the system searches for a minimum-energy state.
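The repel-and-pull idea fits in a few dozen lines. The following is a simplified sketch in the spirit of Fruchterman-Reingold, not a reference implementation: the iteration count, starting temperature, and cooling rate are arbitrary assumptions, and there is no bounding frame. The toy graph is two triangles joined by one edge.

```python
import math
import random


def force_layout(nodes, edges, iterations=200, seed=42):
    """Place nodes in 2D: all pairs repel, connected pairs attract."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = math.sqrt(4.0 / len(nodes))  # preferred spacing for a 2x2 area
    temperature = 0.2                # maximum step per iteration
    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes: the O(N^2) part.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force
        # Attraction along edges.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.98  # cool down so the layout settles
    return {n: (p[0], p[1]) for n, p in pos.items()}


nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"), ("c", "d")]
pos = force_layout(nodes, edges)

edge_set = {frozenset(e) for e in edges}
edge_lengths, other_lengths = [], []
for i, u in enumerate(nodes):
    for v in nodes[i + 1:]:
        d = math.dist(pos[u], pos[v])
        (edge_lengths if frozenset((u, v)) in edge_set else other_lengths).append(d)
mean_edge = sum(edge_lengths) / len(edge_lengths)
mean_non_edge = sum(other_lengths) / len(other_lengths)
print(f"mean edge length {mean_edge:.2f}, mean non-edge distance {mean_non_edge:.2f}")
```

Connected nodes end up closer together than unconnected ones, which is exactly the property that makes clusters visible in a force-directed drawing.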

Graph metrics for singling out important nodes: degree centrality (number of direct links), betweenness centrality (how many shortest paths pass through a node - brokers/bridges), closeness centrality (inverse of the average distance to all other nodes - how 'central' a node is), PageRank (recursive weight from important neighbors, the basis of Google search ranking). Community detection algorithms: Louvain, Label Propagation; k-core decomposition extracts densely connected cores. Tools: NetworkX (Python, widely used in research), graph-tool (faster, C++ core), Gephi (GUI for exploration), Cytoscape (biology), vis.js/d3-force (web visualization).
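To make the difference between degree and betweenness concrete, here is a standard-library sketch on a toy graph: two triangles connected through a single broker node `x`. Betweenness is computed with Brandes' algorithm for unweighted, undirected graphs; in practice a library such as NetworkX would normally be used instead.

```python
from collections import deque


def degree_centrality(adj):
    """Share of other nodes each node is directly linked to."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness (Brandes) for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
    "x": ["c", "d"],
    "d": ["x", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
degree = degree_centrality(adj)
betweenness = betweenness_centrality(adj)
for node in adj:
    print(node, round(degree[node], 2), betweenness[node])
```

Node `x` has the fewest direct links in the graph yet the highest betweenness, because every path between the two triangles runs through it.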

Misconception: a large graph (>10K nodes) can be visualized the same way as a small one - just enlarge the canvas.

In practice: large graphs require aggregation (community detection -> meta-graph), filtering (subgraph by centrality), or WebGL rendering (sigma.js, deck.gl); a direct force layout turns into a 'ball of yarn' and loses meaning.

Force-directed layout scales as O(N^2) in naive implementations; even Barnes-Hut O(N log N) gives an unreadable tangle past 1000 nodes. Human visual bandwidth is limited; real insight comes through slice-and-dice: which community, who is in it, what external connections exist.
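The aggregation step (community detection -> meta-graph) can be sketched as follows. The community labels are assumed to come from an earlier detection step such as Louvain; here they are assigned by hand, and `build_meta_graph` is a hypothetical helper name.

```python
from collections import Counter


def build_meta_graph(edges, community):
    """Collapse a graph into one node per community.

    Returns (sizes, weights): members per community, and the number of
    original edges between each pair of different communities.
    """
    sizes = Counter(community.values())
    weights = Counter()
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu != cv:  # edges inside a community disappear in the meta-graph
            weights[tuple(sorted((cu, cv)))] += 1
    return dict(sizes), dict(weights)


# Toy input: 8 nodes, 10 edges, 3 hand-assigned communities.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"),
         ("c", "d"), ("a", "e"), ("f", "g"), ("g", "h")]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1, "g": 2, "h": 2}

sizes, weights = build_meta_graph(edges, community)
print("community sizes:", sizes)
print("meta-edges:", weights)
```

Drawing the meta-graph with node size proportional to `sizes` and edge width proportional to `weights` gives an overview that stays readable, and any single community can then be opened as its own small graph.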

A team discovered a fraud network of 50000 accounts with suspicious transfers. The goal is to find coordinating nodes. Which metric helps?

Key Ideas

  • **Plotly** is the default pick for interactive charts with hover, zoom, filter; standard types covered by a high-level API
  • **D3.js** is a low-level toolkit for unique story-driven visuals; steep learning curve but unlimited flexibility
  • **Geospatial data** requires choosing a projection (Mercator vs Equal Area), normalizing the metric (per capita), and picking the right map type (choropleth vs heatmap vs flow map)
  • **Network graphs** are visualized via force-directed layout; key insights come through centrality metrics (betweenness for brokers, PageRank for authorities) and community detection (Louvain)

Related Topics

Mapping America from the opening is impossible on the static tools of the previous lesson - complex data requires complex interactive tools. Advanced visualization works alongside other skills:

  • Data storytelling — Interactive visualization without a narrative is a data dump; advanced tools work when there is a structured message (setup/conflict/resolution)
  • Big data and ETL — WebGL visualizations on millions of points need backend aggregation; raw data cannot be pushed to the browser - the pipeline aggregates first
  • Dashboards and BI — Plotly/D3 are the foundation of production dashboards (Dash, Streamlit, Superset); BI tools add a data layer + permissions + scheduling

Questions for Reflection

  • NYT Mapping America visualizes census data on some 300M people interactively - a significant resource investment. When is interactive visualization justified by business outcomes, and when is a static PNG enough?
  • D3.js requires months to master for non-standard cases; Plotly handles 80% of tasks in hours. Should an analytics team invest in D3 expertise, or stick with ready-made tools?
  • A geo-map with Mercator projection visually exaggerates northern countries (Greenland looks like Africa). What data-storytelling decisions have you seen that manipulate perception through visual form?

Related Lessons

  • ds-13 — Storytelling - the foundation before picking a tool
  • ds-15 — NLP adds text data to the visual pipeline
  • alg-12-bfs — Graph visualization is built via BFS/DFS traversal
  • ds-12-service-discovery — Network topology - a special case of network graph
  • stat-08-correlation