An iterative solution
Before we start
We don’t have to follow this line by line, but it’s important to study this example well before demonstrating it.
Emphasize that the example is in Python, but that we will try to see “through” the code, focus on the bigger picture, and imagine other languages in its place.
We collect ideas and feedback in the collaborative document while coding, and the instructor tries to react to them without going down a rabbit hole.
We recommend going through this together: the instructor(s) demonstrate(s) and learners can comment, suggest, and ask questions. Either we are all in the same video room or everybody is watching via stream. In other words, for this lesson, learners are not in separate breakout rooms.
Checklist
Start with notebook
Generalize from 1 figure to 3 figures
Abstract code into functions
From functions with side-effects towards stateless functions
Move from notebook to script
Initialize git
Add requirements.txt
Add test
Add command line interface
Split into multiple files/modules
Our initial version
We imagine that we assemble a working script from various StackOverflow recommendations and arrive at:
import pandas as pd
from matplotlib import pyplot as plt
num_measurements = 25
# read data from file
data = pd.read_csv("temperatures.csv", nrows=num_measurements)
temperatures = data["Air temperature (degC)"]
# compute statistics
mean = sum(temperatures) / num_measurements
# plot results
plt.plot(temperatures, "r-")
plt.axhline(y=mean, color="b", linestyle="--")
plt.show()
plt.savefig("25.png")
plt.clf()
We test it out in a notebook.
We add axis labels
The placement is not ideal, but it works. Later it will bite us (only the first plot will have labels) and we will improve it:
import pandas as pd
from matplotlib import pyplot as plt
plt.xlabel("measurements")
plt.ylabel("air temperature (deg C)")
num_measurements = 25
# read data from file
data = pd.read_csv("temperatures.csv", nrows=num_measurements)
temperatures = data["Air temperature (degC)"]
# compute statistics
mean = sum(temperatures) / num_measurements
# plot results
plt.plot(temperatures, "r-")
plt.axhline(y=mean, color="b", linestyle="--")
plt.show()
plt.savefig("25.png")
plt.clf()
Once we get this working for 25 measurements, our task changes to also plot the first 100 and the first 500 measurements in two additional plots.
Plotting also 100 and 500 measurements
The next idea is perhaps to duplicate the code.
Then we use a for-loop to iterate over [25, 100, 500]:
import pandas as pd
from matplotlib import pyplot as plt
plt.xlabel("measurements")
plt.ylabel("air temperature (deg C)")
for num_measurements in [25, 100, 500]:
    # read data from file
    data = pd.read_csv("temperatures.csv", nrows=num_measurements)
    temperatures = data["Air temperature (degC)"]
    # compute statistics
    mean = sum(temperatures) / num_measurements
    # plot results
    plt.plot(temperatures, "r-")
    plt.axhline(y=mean, color="b", linestyle="--")
    plt.show()
    plt.savefig(f"{num_measurements}.png")
    plt.clf()
Abstracting the plotting part into a function
import pandas as pd
from matplotlib import pyplot as plt
plt.xlabel("measurements")
plt.ylabel("air temperature (deg C)")
def plot_temperatures(temperatures):
    plt.plot(temperatures, "r-")
    plt.axhline(y=mean, color="b", linestyle="--")
    plt.show()
    plt.savefig(f"{num_measurements}.png")
    plt.clf()
for num_measurements in [25, 100, 500]:
    # read data from file
    data = pd.read_csv("temperatures.csv", nrows=num_measurements)
    temperatures = data["Air temperature (degC)"]
    # compute statistics
    mean = sum(temperatures) / num_measurements
    # plot results
    # plt.plot(temperatures, 'r-')
    # plt.axhline(y=mean, color='b', linestyle='--')
    # plt.show()
    # plt.savefig(f'{num_measurements}.png')
    # plt.clf()
    plot_temperatures(temperatures)
Discuss what we expect before running it (some will expect this not to work because variables seem undefined).
Then try it out (it actually works).
Discuss problems with this solution (what if we copy-paste the function to a different file?).
The point of this step: abstracting code into functions can be really good for reusability, but merely creating a function does not make it reusable. In this case the function depends on variables defined outside of it (mean and num_measurements), so it is neither self-contained nor free of side-effects.
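A minimal sketch of this hidden dependency, stripped of the plotting so that it runs without any data file. The function name report and the sample values are hypothetical; only the pattern (a function reading mean from the surrounding scope) is taken from the example above:

```python
def report(values):
    # `mean` is neither a parameter nor a local variable, so Python looks it
    # up in the surrounding scope at the moment the function is called.
    return f"{len(values)} values, mean {mean}"


values = [1.0, 2.0, 3.0]

# First call: `mean` has not been defined anywhere yet.
try:
    report(values)
    outcome = "worked"
except NameError:
    outcome = "NameError"

# Second call: the same function works once `mean` exists outside of it.
mean = sum(values) / len(values)
message = report(values)

print(outcome)
print(message)
```

This is roughly what would happen after copy-pasting plot_temperatures into a file that does not define mean and num_measurements.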
Small improvements
Abstracting into more functions.
Notice how the comments got redundant:
import pandas as pd
from matplotlib import pyplot as plt
def plot_data(data, xlabel, ylabel):
    plt.plot(data, "r-")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.axhline(y=mean, color="b", linestyle="--")
    plt.show()
    plt.savefig(f"{num_measurements}.png")
    plt.clf()
def compute_statistics(data):
    mean = sum(data) / num_measurements
    return mean
def read_data(file_name, column):
    data = pd.read_csv(file_name, nrows=num_measurements)
    return data[column]
for num_measurements in [25, 100, 500]:
    temperatures = read_data(
        file_name="temperatures.csv", column="Air temperature (degC)"
    )
    mean = compute_statistics(temperatures)
    plot_data(
        data=temperatures, xlabel="measurements", ylabel="air temperature (deg C)"
    )
Discuss what would happen if we copy-paste the functions to another project (these functions are stateful/time-dependent).
Emphasize how stateful functions and the order of execution in Jupyter notebooks can produce unexpected results, and explain why we recommend rerunning all cells before sharing a notebook.
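The notebook effect can be sketched without a notebook. The data values below are made up for illustration; compute_statistics is the stateful version from above, and the two assignments to num_measurements stand in for cells that were run in a different order:

```python
data = [2.0, 4.0, 6.0, 8.0]


def compute_statistics(data):
    # Divides by the global num_measurements instead of len(data).
    mean = sum(data) / num_measurements
    return mean


# "Cell" 1: the global matches the data, so the result is the true mean.
num_measurements = 4
first = compute_statistics(data)

# "Cell" 2: another cell changed the global; the same call with the same
# data now silently returns a different (wrong) value.
num_measurements = 2
second = compute_statistics(data)

print(first, second)
```

Nothing fails here, which is what makes this kind of problem hard to notice: the same call on the same data simply returns two different numbers.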
Towards functions without side-effects
Improve towards more stateless functions:
import pandas as pd
from matplotlib import pyplot as plt
def plot_data(data, mean, xlabel, ylabel, file_name):
    plt.plot(data, "r-")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.axhline(y=mean, color="b", linestyle="--")
    plt.show()
    plt.savefig(file_name)
    plt.clf()
def compute_mean(data):
    mean = sum(data) / len(data)
    return mean
def read_data(file_name, nrows, column):
    data = pd.read_csv(file_name, nrows=nrows)
    return data[column]
for num_measurements in [25, 100, 500]:
    temperatures = read_data(
        file_name="temperatures.csv",
        nrows=num_measurements,
        column="Air temperature (degC)",
    )
    mean = compute_mean(temperatures)
    plot_data(
        data=temperatures,
        mean=mean,
        xlabel="measurements",
        ylabel="air temperature (deg C)",
        file_name=f"{num_measurements}.png",
    )
These functions can now be copy-pasted to a different notebook or project and they will still work.
Move from notebook to script
Adding unit tests is often the moment when a notebook is no longer the right fit.
But before we add tests:
“File” -> “Save and Export Notebook As …” -> “Executable Script”
Run git init and commit the working version.
Add requirements.txt and explain why it will be useful to have later.
As we continue from here, create commits after meaningful changes and later also share the repository with learners. This nicely connects to other lessons of the workshop.
Unit tests
Design code for testing.
Move the main scope code into a main function.
Discuss where to add a test and add a test to the statistics function:
import pandas as pd
from matplotlib import pyplot as plt
import pytest
def plot_data(data, mean, xlabel, ylabel, file_name):
    plt.plot(data, "r-")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.axhline(y=mean, color="b", linestyle="--")
    # plt.show()
    plt.savefig(file_name)
    plt.clf()
def compute_mean(data):
    mean = sum(data) / len(data)
    return mean
def test_compute_mean():
    result = compute_mean([1.0, 2.0, 3.0, 4.0])
    assert result == pytest.approx(2.5)
def read_data(file_name, nrows, column):
    data = pd.read_csv(file_name, nrows=nrows)
    return data[column]
def main():
    for num_measurements in [25, 100, 500]:
        temperatures = read_data(
            file_name="temperatures.csv",
            nrows=num_measurements,
            column="Air temperature (degC)",
        )
        mean = compute_mean(temperatures)
        plot_data(
            data=temperatures,
            mean=mean,
            xlabel="measurements",
            ylabel="air temperature (deg C)",
            file_name=f"{num_measurements}.png",
        )
if __name__ == "__main__":
    main()
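The test compares with pytest.approx rather than ==. A small sketch of why, using math.isclose from the standard library as a stand-in for pytest.approx so that the snippet has no dependencies; the input values are chosen for illustration:

```python
import math


def compute_mean(data):
    mean = sum(data) / len(data)
    return mean


result = compute_mean([0.1, 0.2])

# Floating-point rounding: 0.1 + 0.2 is not exactly 0.3, so the mean is not
# exactly 0.15 either, and an exact comparison fails.
exact_match = result == 0.15

# A comparison with a tolerance succeeds.
close_match = math.isclose(result, 0.15)

print(exact_match, close_match)
```

This can be a good moment to discuss that tests on floating-point results usually need a tolerance.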
Command-line interface
Add a CLI for the input data file, the number of measurements, and the output file name.
The example here uses click, but it could equally well be optparse, argparse, docopt, or Typer.
Discuss the motivations for adding a CLI:
We are able to modify the behavior without changing the code
We can run many such scripts as part of a workflow
import pandas as pd
from matplotlib import pyplot as plt
import pytest
import click
def plot_data(data, mean, xlabel, ylabel, file_name):
    plt.plot(data, "r-")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.axhline(y=mean, color="b", linestyle="--")
    plt.savefig(file_name)
    plt.clf()
def compute_mean(data):
    mean = sum(data) / len(data)
    return mean
def test_compute_mean():
    result = compute_mean([1.0, 2.0, 3.0, 4.0])
    assert result == pytest.approx(2.5)
def read_data(file_name, nrows, column):
    data = pd.read_csv(file_name, nrows=nrows)
    return data[column]
@click.command()
@click.option(
    "--num-measurements", required=True, type=int, help="Number of measurements."
)
@click.option("--in-file", required=True, help="File name where we read from.")
@click.option("--out-file", required=True, help="File name where we write to.")
def main(num_measurements, in_file, out_file):
    temperatures = read_data(
        file_name=in_file,
        nrows=num_measurements,
        column="Air temperature (degC)",
    )
    mean = compute_mean(temperatures)
    plot_data(
        data=temperatures,
        mean=mean,
        xlabel="measurements",
        ylabel="air temperature (deg C)",
        file_name=out_file,
    )
if __name__ == "__main__":
    main()
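For comparison, a sketch of the same three options with argparse from the standard library, one of the alternatives mentioned above. The function name parse_arguments and the description text are hypothetical, and the plotting code is left out; only the option names and help texts are taken from the click version:

```python
import argparse


def parse_arguments(argv=None):
    # argv=None makes argparse read the real command line (sys.argv[1:]).
    parser = argparse.ArgumentParser(
        description="Plot temperatures and their mean."
    )
    parser.add_argument(
        "--num-measurements", required=True, type=int, help="Number of measurements."
    )
    parser.add_argument(
        "--in-file", required=True, help="File name where we read from."
    )
    parser.add_argument(
        "--out-file", required=True, help="File name where we write to."
    )
    return parser.parse_args(argv)


# Passing an explicit list instead of using the real command line.
args = parse_arguments(
    ["--num-measurements", "25", "--in-file", "temperatures.csv", "--out-file", "25.png"]
)
print(args.num_measurements, args.in_file, args.out_file)
```

Passing a list to parse_args instead of reading the real command line also makes the argument handling easy to exercise in a test.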
Split long script into modules
Discuss how you would move some functions out and organize them into separate modules which can be imported into other projects. For instance, compute_mean can be moved to statistics.py.
Discuss naming.
Discuss interface design.
Summarize in the collaborative document
Now return to the initial questions in the collaborative document and discuss questions and comments. If there is time left, there are additional questions and exercises.
It is easier and more fun to teach this as a pair: one person types while the other watches the questions and comments and relays them to the co-instructor.