Generating our first plot
Objectives
Be able to create simple plots with Vega-Altair and tweak them
Know how to look for help
Know that other tools exist
We will build up this notebook (spoiler alert!)
Instructor note
25 min talking/type-along
[this lesson is adapted from https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/]
Repeatability/reproducibility
From Claus O. Wilke: “Fundamentals of Data Visualization”:
One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed.
Try to minimize manual post-processing. This could bite you when you need to regenerate 50 figures one day before the submission deadline, or regenerate a set of figures after the person who created them has left the group.
There is no single perfect language and no single perfect library for everything.
Within Python, many libraries exist:
Vega-Altair: declarative visualization, statistics built in
Matplotlib: probably the most standard and most widely used
Seaborn: high-level interface to Matplotlib, statistical functions built in
Plotly: interactive graphs
Bokeh: also good for interactivity
plotnine: implementation of a grammar of graphics in Python, it is based on ggplot2
ggplot: R users will be more at home
PyNGL: used in the weather forecast community
K3D: Jupyter Notebook extension for 3D visualization
…
Two main families of libraries: procedural (e.g. Matplotlib) and declarative.
Why are we starting with Vega-Altair?
Concise and powerful
Allows us to focus on the data visualization part and get started without too much Python knowledge
The way it combines visual channels with data columns can feel intuitive
Interfaces very nicely with pandas
Easy to change figures
Good documentation
Open source
Makes it easy to save figures in a number of formats
Easy to save interactive visualizations to be used in websites
Loading and plotting a dataset
In this lesson we will work with one of the Gapminder datasets.
Let us read and plot the data together, then explain what is happening and improve the figure step by step. First we read and inspect the data:
# import necessary libraries
import altair as alt
import pandas as pd
# read the data
url_prefix = "https://raw.githubusercontent.com/plotly/datasets/master/"
data = pd.read_csv(url_prefix + "gapminder_with_codes.csv")
# print overview of the dataset
data
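Beyond displaying the whole table, a few pandas calls give a quick overview of what we loaded. The following is a minimal sketch that does not need a network connection: it uses a tiny hand-made table (named sample here, a hypothetical stand-in) with the column names that the plots in this lesson rely on. The values are made up for illustration and are not real Gapminder data.

```python
import pandas as pd

# Toy stand-in for the Gapminder table: made-up values, for illustration only
sample = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "continent": ["Asia", "Europe", "Asia", "Africa"],
        "year": [2002, 2007, 2007, 2007],
        "lifeExp": [60.0, 80.0, 70.0, 55.0],
        "pop": [1_000_000, 5_000_000, 20_000_000, 3_000_000],
        "gdpPercap": [1000.0, 30000.0, 5000.0, 800.0],
    }
)

print(sample.head())            # first rows of the table
print(sample.dtypes)            # data type of each column
print(sample["year"].unique())  # which years are present
```

The same calls can be applied to the data variable from the lesson.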
With very few lines we can get the first plot:
alt.Chart(data).mark_point().encode(
x="gdpPercap",
y="lifeExp",
)
Observe how we connect (encode) visual channels to data columns:
x-coordinate with “gdpPercap”
y-coordinate with “lifeExp”
The following code would have the same effect but the above version might be easier to read:
alt.Chart(data).mark_point().encode(x="gdpPercap", y="lifeExp")
Let us pause and explain the code
alt is a short-hand for altair which we imported on top of the notebook
Chart() is a function defined inside altair which takes the data as argument
mark_point() is a function that produces scatter plots
encode() is a function which encodes data columns to visual channels
Filtering data directly in Vega-Altair
In Vega-Altair we can chain functions. Let us add two more:
alt.Chart(data).mark_point().encode(
x="gdpPercap",
y="lifeExp",
).transform_filter(alt.datum.year == 2007).interactive()
Using color as additional channel
A very neat feature of Vega-Altair is that it is easy to add and modify visual channels. Let us try to add one more so that we do something with the “continent” data column:
alt.Chart(data).mark_point().encode(
x="gdpPercap",
y="lifeExp",
color="continent",
).transform_filter(alt.datum.year == 2007).interactive()
Changing to log scale
For this dataset we get better insight when switching the x-axis from linear to log scale (we changed two lines to show both the “method syntax” and the “attribute syntax”):
alt.Chart(data).mark_point().encode(
x=alt.X("gdpPercap").scale(type="log"),
y=alt.Y("lifeExp"),
color="continent",
).transform_filter(alt.datum.year == 2007).interactive()
Improving axis titles
alt.Chart(data).mark_point().encode(
x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
y=alt.Y("lifeExp").title("Life expectancy (years)"),
color="continent",
).transform_filter(alt.datum.year == 2007).interactive()
Faceted charts
To see what faceted charts are and how easy it is to create them, add the line row="continent":
alt.Chart(data).mark_point().encode(
x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
y=alt.Y("lifeExp").title("Life expectancy (years)"),
color="continent",
row="continent",
).transform_filter(alt.datum.year == 2007).interactive()
Guess what happens when you change row="continent" to column="continent"?
Changing from points to circles
Let us add one more visual channel, mapping the size of each circle to the population of a country:
alt.Chart(data).mark_circle().encode(
x=alt.X("gdpPercap").scale(type="log").title("GDP per capita (PPP dollars)"),
y=alt.Y("lifeExp").title("Life expectancy (years)"),
color="continent",
size="pop",
).transform_filter(alt.datum.year == 2007).interactive()
Where to go from here?
In a few steps and a few lines of code we have achieved a lot!
These plots are perhaps not publication quality yet, but we will learn how to customize and improve them in Customizing plots.
Keypoints
Avoid manual post-processing, try to script all steps.
Browse a number of example galleries to help you choose the library that best fits your work/style.
Figures for presentation slides and figures for manuscripts have different requirements. More about that later.