Generating our first plot

Objectives

  • Be able to create simple plots with Vega-Altair and tweak them

  • Know how to look for help

  • Know that other tools exist

  • We will build up this notebook (spoiler alert!)

Instructor note

  • 25 min talking/type-along

[this lesson is adapted from https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/]

Repeatability/reproducibility

From Claus O. Wilke: “Fundamentals of Data Visualization”:

One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed.

  • Try to minimize manual post-processing. This could bite you when you need to regenerate 50 figures one day before the submission deadline, or regenerate a set of figures after the person who created them has left the group.

  • There is no single perfect language and no single perfect library for everything.

  • Within Python, many libraries exist:

    • Vega-Altair: declarative visualization, statistics built in

    • Matplotlib: probably the most standard and most widely used

    • Seaborn: high-level interface to Matplotlib, statistical functions built in

    • Plotly: interactive graphs

    • Bokeh: also good for interactivity

    • plotnine: implementation of a grammar of graphics in Python, based on ggplot2

    • ggplot: R users will be more at home

    • PyNGL: used in the weather forecast community

    • K3D: Jupyter Notebook extension for 3D visualization

  • Two main families of libraries: procedural (e.g. Matplotlib) and declarative.

Why are we starting with Vega-Altair?

  • Concise and powerful

  • Allows us to focus on the data visualization part and get started without too much Python knowledge

  • The way it combines visual channels with data columns can feel intuitive

  • Interfaces very nicely with pandas

  • Easy to change figures

  • Good documentation

  • Open source

  • Makes it easy to save figures in a number of formats

  • Easy to save interactive visualizations to be used in websites

Loading and plotting a dataset

In this lesson we will work with one of the Gapminder datasets.

Let us read and plot the data together; then we will explain what is happening and improve the figure step by step. First we read and inspect the data:

# import necessary libraries
import altair as alt
import pandas as pd

# read the data
url_prefix = "https://raw.githubusercontent.com/plotly/datasets/master/"
data = pd.read_csv(url_prefix + "gapminder_with_codes.csv")

# print overview of the dataset
data
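Beyond printing the whole table, a few pandas calls give a quick overview of what we loaded. The sketch below uses a small stand-in table so that it runs without a network connection; its values are made up, and the "country" column name is an assumption. The other column names are the ones used in this lesson.

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the values are made up for illustration.
data = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "continent": ["Asia", "Asia", "Europe", "Europe"],
        "year": [2002, 2007, 2002, 2007],
        "lifeExp": [60.0, 62.5, 75.0, 76.5],
        "pop": [1_000_000, 1_100_000, 5_000_000, 5_050_000],
        "gdpPercap": [900.0, 1100.0, 25000.0, 28000.0],
    }
)

first_rows = data.head()               # the first rows of the table
column_types = data.dtypes             # column names and their data types
years = sorted(data["year"].unique())  # which years are present

print(first_rows)
print(column_types)
print(years)
```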

With very few lines we can get the first plot:

alt.Chart(data).mark_point().encode(
    x="gdpPercap",
    y="lifeExp",
)

First raw plot with all countries and all years.

Observe how we connect (encode) visual channels to data columns:

  • x-coordinate with “gdpPercap”

  • y-coordinate with “lifeExp”

The following code would have the same effect but the above version might be easier to read:

alt.Chart(data).mark_point().encode(x="gdpPercap", y="lifeExp")

Let us pause and explain the code

  • alt is a shorthand for altair, which we imported at the top of the notebook

  • Chart() creates a chart object defined in altair and takes the data as its argument

  • mark_point() is a method that draws each row as a point, which gives a scatter plot

  • encode() is a method that maps data columns to visual channels

Filtering data directly in Vega-Altair

In Vega-Altair we can chain method calls. Let us add two more: transform_filter() to keep only one year and interactive() to enable zooming and panning:

alt.Chart(data).mark_point().encode(
    x="gdpPercap",
    y="lifeExp",
).transform_filter(alt.datum.year == 2007).interactive()

Now we only keep the year 2007.

Using color as additional channel

A very neat feature of Vega-Altair is that it is easy to add and modify visual channels. Let us add one more and map the “continent” data column to color:

alt.Chart(data).mark_point().encode(
    x="gdpPercap",
    y="lifeExp",
    color="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Using different colors for different continents.

Changing to log scale

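For this dataset we get better insight when switching the x-axis from linear to log scale. Altair also has an “attribute syntax”, where the same option is passed as a keyword argument to alt.X(). Here we changed two lines: we wrap the columns in alt.X() and alt.Y() and chain the log scale on with the “method syntax”: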

alt.Chart(data).mark_point().encode(
    x=alt.X("gdpPercap").scale(type="log"),
    y=alt.Y("lifeExp"),
    color="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Changing the x axis to log scale.

Improving axis titles

alt.Chart(data).mark_point().encode(
    x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
    y=alt.Y("lifeExp").title("Life expectancy (years)"),
    color="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Improving the axis titles.

Faceted charts

To see what faceted charts are and how easy they are to create, add the row="continent" line:

alt.Chart(data).mark_point().encode(
    x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
    y=alt.Y("lifeExp").title("Life expectancy (years)"),
    color="continent",
    row="continent",
).transform_filter(alt.datum.year == 2007).interactive()

Guess what happens when you change row="continent" to column="continent"?

Changing from points to circles

Let us add one more visual channel, mapping the size of each circle to the population of the country:

alt.Chart(data).mark_circle().encode(
    x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
    y=alt.Y("lifeExp").title("Life expectancy (years)"),
    color="continent",
    size="pop",
).transform_filter(alt.datum.year == 2007).interactive()

Circle sizes are proportional to population sizes.


Where to go from here?

In a few steps and a few lines of code we have achieved a lot!

These plots are perhaps not publication quality yet, but we will learn how to customize and improve them in Customizing plots.

Keypoints

  • Avoid manual post-processing; try to script all steps.

  • Browse a number of example galleries to help you choose the library that best fits your work/style.

  • Figures for presentation slides and figures for manuscripts have different requirements. More about that later.