Generating our first plot

Objectives

Be able to create simple plots with Vega-Altair and tweak them
Know how to look for help
Know that other tools exist
We will build up this notebook (spoiler alert!)

Instructor note

25 min talking/type-along

[this lesson is adapted from https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/]

Repeatability/reproducibility

From Claus O. Wilke: “Fundamentals of Data Visualization”:

One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed.

Try to minimize manual post-processing. This could bite you when you need to regenerate 50 figures one day before a submission deadline, or regenerate a set of figures after the person who created them has left the group.
No single language or library is perfect for everything.
Within Python, many libraries exist:
- Vega-Altair: declarative visualization, statistics built in
- Matplotlib: probably the most standard and most widely used
- Seaborn: high-level interface to Matplotlib, statistical functions built in
- Plotly: interactive graphs
- Bokeh: also good for interactivity
- plotnine: an implementation of the grammar of graphics in Python, based on ggplot2
- ggplot: R users will feel more at home
- PyNGL: used in the weather forecast community
- K3D: Jupyter Notebook extension for 3D visualization
- …
Two main families of libraries: procedural (e.g. Matplotlib) and declarative (e.g. Vega-Altair).

Why are we starting with Vega-Altair?

Concise and powerful
Allows us to focus on the data visualization part and get started without too much Python knowledge
The way it combines visual channels with data columns can feel intuitive
Interfaces very nicely with pandas
Easy to change figures
Good documentation
Open source
Makes it easy to save figures in a number of formats
Easy to save interactive visualizations to be used in websites

Loading and plotting a dataset

In this lesson we will work with one of the Gapminder datasets.

Let us read and plot the data together; afterwards we will explain what is happening and improve the figure step by step. First we read and inspect the data:

# import necessary libraries
import altair as alt
import pandas as pd

# read the data
url_prefix = "https://raw.githubusercontent.com/plotly/datasets/master/"
data = pd.read_csv(url_prefix + "gapminder_with_codes.csv")

# print overview of the dataset
data
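
Beyond displaying the whole table, a few pandas calls help to see which columns are available for plotting. The sketch below uses a small stand-in DataFrame with made-up placeholder values so that it runs without network access; the column names follow those used in this lesson, and the "country" column is an assumption about the dataset.

```python
import pandas as pd

# Placeholder rows for illustration only (not real Gapminder data)
data = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],  # assumed column name
        "continent": ["Asia", "Europe", "Asia", "Africa"],
        "year": [2002, 2007, 2007, 2007],
        "lifeExp": [60.0, 78.5, 70.2, 55.1],
        "gdpPercap": [1200.0, 30000.0, 5000.0, 900.0],
        "pop": [1_000_000, 5_000_000, 20_000_000, 3_000_000],
    }
)

print(data.head())  # first rows of the table
print(list(data.columns))  # column names that can be used in encode()
print(data.dtypes)  # data type of each column
print(sorted(data["year"].unique().tolist()))  # years present in the data
```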

With very few lines we can get the first plot:

alt.Chart(data).mark_point().encode(
    x="gdpPercap",
    y="lifeExp",
)

First raw plot with all countries and all years.

Observe how we connect (encode) visual channels to data columns:

x-coordinate with “gdpPercap”
y-coordinate with “lifeExp”

The following code would have the same effect but the above version might be easier to read:

alt.Chart(data).mark_point().encode(x="gdpPercap", y="lifeExp")

Let us pause and explain the code

alt is a shorthand for altair, which we imported at the top of the notebook
Chart() is defined inside altair (strictly speaking it is a class) and takes the data as an argument
mark_point() is a method of the chart that draws one point per data row, which produces a scatter plot
encode() is a method that maps data columns to visual channels

Filtering data directly in Vega-Altair 

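In Vega-Altair we can chain method calls. Let us add two more: transform_filter() keeps only the rows for the year 2007, and interactive() lets us pan and zoom the chart: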

alt.Chart(data).mark_point().encode(
    x="gdpPercap",
    y="lifeExp",
).transform_filter(alt.datum.year == 2007).interactive()

Using color as additional channel

A very neat feature of Vega-Altair is that it is easy to add and modify visual channels. Let us try to add one more so that we do something with the “continent” data column:

alt.Chart(data).mark_point().encode(
    x="gdpPercap",
    y="lifeExp",
    color="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Using different colors for different continents.

Changing to log scale

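For this dataset we get better insight when switching the x-axis from linear to log scale. We changed two lines: x and y are now wrapped in alt.X() and alt.Y(), and the scale is set with the “method syntax” .scale(type="log"). Altair also accepts an equivalent “attribute syntax”, in which the scale is passed as a keyword argument.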

alt.Chart(data).mark_point().encode(
    x=alt.X("gdpPercap").scale(type="log"),
    y=alt.Y("lifeExp"),
    color="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Improving axis titles

alt.Chart(data).mark_point().encode(
    x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
    y=alt.Y("lifeExp").title("Life expectancy (years)"),
    color="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Faceted charts

To see what faceted charts are and how easy they are to create, add the line row="continent":

alt.Chart(data).mark_point().encode(
    x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
    y=alt.Y("lifeExp").title("Life expectancy (years)"),
    color="continent",
    row="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Guess what happens when you change row="continent" to column="continent"?

Changing from points to circles

Let us switch from mark_point() to mark_circle() and add one more visual channel, mapping the size of each circle to the population of a country:

alt.Chart(data).mark_circle().encode(
    x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
    y=alt.Y("lifeExp").title("Life expectancy (years)"),
    color="continent",
    size="pop",
).transform_filter(alt.datum.year == 2007).interactive()

Circle sizes are proportional to population sizes.

Where to go from here?

In a few steps and a few lines of code we have achieved a lot!

These plots are perhaps not publication quality yet, but we will learn how to customize and improve them in Customizing plots.

Keypoints

Avoid manual post-processing; try to script all steps.
Browse a number of example galleries to help you choose the library that best fits your work and style.
Figures for presentation slides and figures for manuscripts have different requirements. More about that later.