Plotting with Vega-Altair
Objectives
Be able to create simple plots with Vega-Altair and tweak them
Know how to look for help
Know how to tweak example plots from a gallery for your own purpose
We will build up this notebook (spoiler alert!)
Repeatability/reproducibility
From Claus O. Wilke: “Fundamentals of Data Visualization”:
One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed.
Try to minimize manual post-processing. This could bite you when you need to regenerate 50 figures one day before the submission deadline or regenerate a set of figures after the person who created them left the group.
There is no single perfect language and no single perfect library for everything.
Within Python, many libraries exist:
Vega-Altair: declarative visualization, statistics built in
Matplotlib: probably the most standard and most widely used
Seaborn: high-level interface to Matplotlib, statistical functions built in
Plotly: interactive graphs
Bokeh: also good for interactivity
plotnine: implementation of a grammar of graphics in Python, based on ggplot2
ggplot: R users will be more at home
PyNGL: used in the weather forecast community
K3D: Jupyter Notebook extension for 3D visualization
Mayavi: 3D scientific data visualization and plotting in Python
…
Two main families of libraries: procedural (e.g. Matplotlib) and declarative (e.g. Vega-Altair).
Why are we starting with Vega-Altair?
Concise and powerful
“Simple, friendly and consistent API” allows us to focus on the data visualization part and get started without too much Python knowledge
The way it combines visual channels with data columns can feel intuitive
Interfaces very nicely with Pandas
Easy to change figures
Good documentation
Open source
Makes it easy to save figures in a number of formats (svg, png, html)
Easy to save interactive visualizations to be used in websites
Reading data into a dataframe
From the previous section, let’s load the data in our Jupyter notebook and fix the dates.
import pandas as pd
url_prefix = "https://raw.githubusercontent.com/coderefinery/data-visualization-python/main/data/"
data_tromso = pd.read_csv(url_prefix + "tromso-monthly.csv")
data_oslo = pd.read_csv(url_prefix + "oslo-monthly.csv")
data_monthly = pd.concat([data_tromso, data_oslo], axis=0)
# convert the mm.yyyy strings to a proper date format
data_monthly["date"] = pd.to_datetime(list(data_monthly["date"]), format="%m.%Y")
# let us print the combined result
data_monthly
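To see what the concatenation and the date conversion do without downloading anything, here is a minimal sketch on two tiny inline tables. The numbers are invented for illustration and the real files contain more columns. The ignore_index=True argument is an addition that is not in the code above: it gives the combined table a fresh 0..n-1 index instead of repeating the row labels of the two inputs.

```python
import pandas as pd

# Hypothetical values, only for illustration.
tromso = pd.DataFrame(
    {
        "date": ["01.2022", "02.2022"],
        "name": ["Tromsø", "Tromsø"],
        "precipitation": [100.0, 80.0],
    }
)
oslo = pd.DataFrame(
    {
        "date": ["01.2022", "02.2022"],
        "name": ["Oslo", "Oslo"],
        "precipitation": [55.0, 40.0],
    }
)

# Stack the two tables on top of each other (axis=0 means "add rows").
combined = pd.concat([tromso, oslo], axis=0, ignore_index=True)

# "%m.%Y" tells pandas that the strings look like month.year, e.g. "01.2022".
combined["date"] = pd.to_datetime(combined["date"], format="%m.%Y")

print(combined.dtypes)
print(combined)
```

After the conversion the date column holds real timestamps instead of strings, which is what allows a plotting library to treat the axis as time.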
Plotting the data
Now let’s plot the data. We will start with a plot that is not optimal and then we will explore and improve a bit as we go:
import altair as alt
alt.Chart(data_monthly).mark_bar().encode(
    x="date",
    y="precipitation",
    color="name",
)
Monthly precipitation for the cities Oslo and Tromsø over the course of a year.
Let us pause and explain the code
alt is a short-hand for altair which we imported on top of the notebook
Chart() is a function defined inside altair which takes the data as argument
mark_bar() is a function that produces bar charts
encode() is a function which encodes data columns to visual channels
Observe how we connect (encode) visual channels to data columns:
x-coordinate with “date”
y-coordinate with “precipitation”
color with “name” (name of weather station; city)
We can improve the plot by telling Vega-Altair that the x-axis is temporal (T) and that we would like to see the year and month (yearmonth):
alt.Chart(data_monthly).mark_bar().encode(
    x="yearmonth(date):T",
    y="precipitation",
    color="name",
)
Apart from T (temporal), there are other encoding data types:
Q (quantitative)
O (ordinal)
N (nominal)
T (temporal)
G (geojson)
Monthly precipitation for the cities Oslo and Tromsø over the course of a year.
Let us improve the plot with another one-line change:
alt.Chart(data_monthly).mark_bar().encode(
    x="yearmonth(date):T",
    y="precipitation",
    color="name",
    column="name",
)
Monthly precipitation for the cities Oslo and Tromsø over the course of a year with both cities plotted side by side.
With another one-line change we can make the bar chart grouped, with the bars of the two cities placed next to each other instead of stacked on top of each other:
alt.Chart(data_monthly).mark_bar().encode(
    x="yearmonth(date):T",
    y="precipitation",
    color="name",
    xOffset="name",
)
Monthly precipitation for the cities Oslo and Tromsø over the course of a year plotted as grouped bar chart.
This is not publication-quality yet but a really good start!
Let us try one more example where we can nicely see how Vega-Altair is able to map visual channels to data columns:
alt.Chart(data_monthly).mark_area(opacity=0.5).encode(
    x="yearmonth(date):T",
    y="max temperature",
    y2="min temperature",
    color="name",
)
Monthly temperature ranges for two cities in Norway.
What other marks and other visual channels exist?
Themes
In Vega-Altair you can change the theme and select from a long list of themes. At the top of your notebook try to add:
alt.themes.enable('dark')
Then re-run all cells. Later you can try some other themes such as:
fivethirtyeight
latimes
urbaninstitute
You can even define your own themes!
Discover the Vega-Altair gallery of examples
Try to rerun some examples from the Gallery of examples. Which one did you choose? Were you able to reproduce the figures? Did you try alt.themes.enable('dark')?
Note: to make the demo data available, you will first need to run the command !pip install vega_datasets in a cell.
Keypoints
Browse a number of example galleries to help you choose the library that best fits your work/style.
Minimize manual post-processing and try to script all steps.
CSV (comma-separated values) files are often a good format to store the data that we wish to plot.
Read the data into a Pandas dataframe and then plot it with Vega-Altair where you connect data columns to visual channels.