Generating our first plot

Objectives

  • Be able to create simple plots with Matplotlib and tweak them

  • Be able to adapt gallery examples (more about this tomorrow)

  • Know how to look for help

  • Know that other tools exist

Instructor note

  • 20 min talking/type-along

  • 15 min exercise

[this lesson is adapted from https://aaltoscicomp.github.io/python-for-scicomp/data-visualization/]

Repeatability/reproducibility

From Claus O. Wilke: “Fundamentals of Data Visualization”:

One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, no manual post-processing needed.

  • No manual post-processing. This will bite you when you need to regenerate 50 figures one day before the submission deadline, or regenerate a set of figures after the person who created them has left the group.

  • There is no single perfect language and no single perfect library for everything.

  • Within Python, many libraries exist:

    • Matplotlib: probably the most standard and most widely used

    • Seaborn: high-level interface to Matplotlib, statistical functions built in

    • Altair: declarative visualization (R users will be more at home), statistics built in

    • Plotly: interactive graphs

    • Bokeh: also good for interactivity

    • plotnine: implementation of a grammar of graphics in Python, based on ggplot2

    • ggplot: R users will be more at home

    • PyNGL: used in the weather forecast community

    • K3D: Jupyter notebook extension for 3D visualization

  • Two main families of libraries: procedural (e.g. Matplotlib) and declarative (using grammar of graphics).

Why are we starting with Matplotlib?

  • Matplotlib is perhaps the most “standard” Python plotting library.

  • Many libraries build on top of Matplotlib.

  • MATLAB users will feel familiar.

  • Even if you choose to use another library (see the list above), chances are high that you will need to adapt somebody else's Matplotlib plot.

  • Libraries that are built on top of Matplotlib may need knowledge of Matplotlib for custom adjustments.

However, it is a relatively low-level interface for drawing (in terms of abstractions, not in terms of quality) and does not provide statistical functions. Some figures require typing and tweaking many lines of code.

Many other visualization libraries exist, each with its own strengths. Which one to use is also a matter of personal preference. Later we will also try other libraries.

Getting started with Matplotlib in the Jupyter notebook

Let us create our first plot:

# this line tells Jupyter to display matplotlib figures in the notebook
%matplotlib inline

import matplotlib.pyplot as plt

# this is dataset 1 from
# https://en.wikipedia.org/wiki/Anscombe%27s_quartet
data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

fig, ax = plt.subplots()

ax.scatter(x=data_x, y=data_y, c="#E69F00")

ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")

# uncomment the next line if you would like to save the figure to disk
# fig.savefig("my-first-plot.png")

[Figure: the result of our first plot.]

When running a Matplotlib script on a remote server without a “display” (e.g. compute cluster), you may need to select the non-interactive "Agg" backend before importing pyplot:

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# ... rest of the script
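
As a minimal sketch of what such a script could look like, the example below reuses dataset 1 from above, selects the Agg backend, and writes the figure to disk instead of showing it. The `dpi` value and the shortened axis labels are arbitrary choices for illustration, not recommendations from this lesson.

```python
import matplotlib

# select the non-interactive Agg backend before pyplot is imported
matplotlib.use("Agg")

import matplotlib.pyplot as plt

# dataset 1 from Anscombe's quartet, as in the first plot
data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

fig, ax = plt.subplots()
ax.scatter(x=data_x, y=data_y, c="#E69F00")
ax.set_xlabel("x")
ax.set_ylabel("y")

# without a display no window can open, so we save the figure to a file
# (dpi=150 is an arbitrary example value)
fig.savefig("my-first-plot.png", dpi=150)

# release the figure once it has been written
plt.close(fig)
```

Because nothing in the script depends on a window or a notebook, it can be rerun unattended whenever the data changes.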

Exercises

Exercise Matplotlib-1: extend the previous example (15 min)

  • Extend the previous plot by also plotting this set of values but this time using a different color (#56B4E9):

    # this is dataset 2
    data2_y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
    
  • Then add a third series in another color (#009E73) that plots the second dataset, scaled by 2.0.

    # here we multiply all elements of data2_y by 2.0
    data2_y_scaled = [y*2.0 for y in data2_y]
    
  • Try to add a legend to the plot with ax.legend(), and search the web for clues on how to add labels to each dataset.

  • At the end it should look like this one:

    [Figure: result of the exercise]
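
One possible solution is sketched below; try the exercise yourself before reading it. It assumes that the new datasets are plotted against the same `data_x` values as dataset 1, and the legend texts ("set 1" and so on) are made-up names for illustration. The `label` keyword of `ax.scatter` is what `ax.legend()` picks up.

```python
import matplotlib.pyplot as plt

data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# this is dataset 2 (assumed to share the x values of dataset 1)
data2_y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# here we multiply all elements of data2_y by 2.0
data2_y_scaled = [y * 2.0 for y in data2_y]

fig, ax = plt.subplots()

# each call adds one series; "label" is the text shown in the legend
ax.scatter(x=data_x, y=data_y, c="#E69F00", label="set 1")
ax.scatter(x=data_x, y=data2_y, c="#56B4E9", label="set 2")
ax.scatter(x=data_x, y=data2_y_scaled, c="#009E73", label="set 2, scaled")

ax.set_xlabel("we should label the x axis")
ax.set_ylabel("we should label the y axis")
ax.set_title("some title")

# build the legend from the labels given above
ax.legend()
```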

Matplotlib has two different interfaces

When plotting with Matplotlib, it is useful to know that there are two approaches, even though the reasons for this dual approach are outside the scope of this lesson.

  • The more modern option is an object-oriented interface (the fig and ax objects can be configured separately and passed around to functions):

    import matplotlib.pyplot as plt
    
    # this is dataset 1 from
    # https://en.wikipedia.org/wiki/Anscombe%27s_quartet
    data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
    data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
    
    fig, ax = plt.subplots()
    
    ax.scatter(x=data_x, y=data_y, c="#E69F00")
    
    ax.set_xlabel("we should label the x axis")
    ax.set_ylabel("we should label the y axis")
    ax.set_title("some title")
    
  • The more traditional option mimics MATLAB plotting and uses the pyplot interface (plt carries the global settings):

    import matplotlib.pyplot as plt
    
    # this is dataset 1 from
    # https://en.wikipedia.org/wiki/Anscombe%27s_quartet
    data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
    data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
    
    plt.scatter(x=data_x, y=data_y, c="#E69F00")
    
    plt.xlabel("we should label the x axis")
    plt.ylabel("we should label the y axis")
    plt.title("some title")
    

When searching for help on the internet, you will find both approaches, and they can also be mixed. Although the pyplot interface looks more compact, we recommend learning and using the object-oriented interface.
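
To illustrate how the two approaches can be mixed, here is a small sketch. It relies on pyplot keeping track of a "current" figure and axes, which `plt.gcf()` and `plt.gca()` return. The title and label strings are placeholders.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# right after plt.subplots(), the objects we just created
# are also pyplot's "current" figure and axes
print(plt.gcf() is fig)  # True
print(plt.gca() is ax)   # True

# pyplot call: acts on the current axes, which here is ax
plt.title("set via pyplot")

# object-oriented call: acts on ax explicitly
ax.set_xlabel("set via the axes object")

# both settings ended up on the same axes object
print(ax.get_title())
print(ax.get_xlabel())
```

Mixing works as long as the current axes is the one you mean. With several figures or panels that is easy to get wrong, which is one reason to address `ax` explicitly.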

Why do we emphasize this?

One day you may want to write functions which wrap around Matplotlib function calls. You can then pass fig and ax into these functions, and there is less risk that adjusting one figure also changes settings for unrelated figures created in other functions.

When using the pyplot interface, each call acts on global state held by the plt module (the current figure and axes). This is acceptable for linear scripts but may yield surprising results once you introduce functions to enhance or abstract Matplotlib calls.
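
The sketch below shows the kind of wrapper function this refers to. The helper name `scatter_with_labels` and its parameters are hypothetical and not part of Matplotlib; the point is only that the function draws into whichever axes it is given and touches nothing else.

```python
import matplotlib.pyplot as plt


def scatter_with_labels(ax, x, y, color, title):
    """Hypothetical helper: draw one dataset into the axes it is given."""
    ax.scatter(x=x, y=y, c=color)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_title(title)


data_x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
data_y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
data2_y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# one figure with two panels side by side
fig, (ax_left, ax_right) = plt.subplots(ncols=2, figsize=(8, 4))

# each call only modifies the axes object that is passed in
scatter_with_labels(ax_left, data_x, data_y, "#E69F00", "dataset 1")
scatter_with_labels(ax_right, data_x, data2_y, "#56B4E9", "dataset 2")
```

Since the target axes is an explicit argument, the same helper can be reused for any panel of any figure without relying on which axes pyplot currently considers active.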


Keypoints

  • Avoid manual post-processing, script everything.

  • Browse a number of example galleries to help you choose the library that best fits your work/style.

  • Think about color-vision deficiencies when choosing colors. Use existing solutions for this problem.

  • Figures for presentation slides and figures for manuscripts have different requirements. More about that later.