Convert a Jupyter Notebook to a Python script

For this walk-through we start with a Jupyter notebook that is based on the Nearest Neighbor Classification example from the scikit-learn toolkit. The notebook can be found on GitHub. In it, we

  • load the Iris dataset from scikit-learn datasets,

  • preprocess the data and save the preprocessed version to disk,

  • learn an Iris subspecies classifier from a subset of the data, and

  • plot the classifier’s boundary decisions on the complete data set.

The first step is to convert the notebook into a Python script. This is straightforward and can be done in Jupyter by going to:

"File" -> "Save and Export Notebook as..." -> "Executable Script"

The result of this conversion can be found on GitHub.
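
The same conversion can also be done from the command line with nbconvert; here we assume the notebook file is called nearest_neighbors.ipynb (substitute your actual file name):

# Writes nearest_neighbors.py next to the notebook
jupyter nbconvert --to script nearest_neighbors.ipynb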

Split into a pre-processing and an execution script

Our code has two distinct parts: a pre-processing part, and a model generation and plotting part. The pre-processing needs to run exactly once and should not be repeated for every experiment: if we want to compare results, the training/test split has to be identical for all methods. We therefore split the code into two files, preprocess.py and train_and_plot.py.

from pathlib import Path
import pickle

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# ## Preprocess data
#
# Load Iris flower data set from scikit-learn datasets.
#
# Extract two features
#
# - `SepalLengthCm`
# - `SepalWidthCm`
#
# out of the available four
#
# - `SepalLengthCm`
# - `SepalWidthCm`
# - `PetalLengthCm`
# - `PetalWidthCm`
#
# and map the class labels
#
# - `Iris-setosa`
# - `Iris-versicolor`
# - `Iris-virginica`
#
# to integers 0, 1, and 2.
#
# Divide the data randomly to train and test sets.
#
# Save the preprocessed data to disk.

# Load data from sklearn datasets
iris = load_iris(as_frame=True)

# Extract two features
features = ["sepal length (cm)", "sepal width (cm)"]
X = iris.data[features]

# Class labels
classes = iris.target_names
y = iris.target

# Divide randomly to train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Save to disk (create the folder, including parents, if it does not exist yet)
Path("data/preprocessed").mkdir(parents=True, exist_ok=True)
with open("data/preprocessed/Iris.pkl", "wb") as f:
    pickle.dump([X, X_train, X_test, y, y_train, y_test, features, classes], f)

We only include the imports that are necessary and make sure that the data/preprocessed folder exists when we run the code. This allows us to run the pre-processing once and, in all further steps, reuse the already pre-processed data, avoiding unnecessary compute time if we, for example, only want to change the metrics.
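
Since the preprocessed file is the single source of truth for the train/test split, a simple way to avoid re-running the pre-processing by accident is to only run it when its output is missing. A minimal sketch in the shell (the path follows the script above):

# Run the pre-processing only if its output does not exist yet
test -f data/preprocessed/Iris.pkl || python preprocess.py

The second script, train_and_plot.py, then loads this file and does the model fitting and plotting: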

from pathlib import Path
import pickle

import matplotlib.pyplot as plt

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ## Fit pipelines and plot decision boundaries
#
# Loop over the `n_neighbors` parameter
#
# - Fit a standard scaler + knn classifier pipeline
# - Plot decision boundaries and save the image to disk

# Load preprocessed data from disk
with open("data/preprocessed/Iris.pkl", "rb") as f:
    data = pickle.load(f)
    X, X_train, X_test, y, y_train, y_test, features, classes = data

# Parameters
# Metrics: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics
n_neighbors_list = [1, 2, 4, 8, 16, 32, 64]
metrics = ["euclidean", "manhattan", "l1", "haversine", "cosine"]

# Make sure the output folder exists
Path("results/").mkdir(parents=True, exist_ok=True)

# Loop over n_neighbors and metrics
for n_neighbors in n_neighbors_list:
    for metric in metrics:
        # Fit
        clf = Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("knn", KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)),
            ]
        )
        clf.fit(X_train, y_train)

        # Plot
        disp = DecisionBoundaryDisplay.from_estimator(
            clf,
            X_test,
            response_method="predict",
            plot_method="pcolormesh",
            xlabel=features[0],
            ylabel=features[1],
            shading="auto",
            alpha=0.5,
        )
        scatter = disp.ax_.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors="k")
        disp.ax_.legend(
            scatter.legend_elements()[0],
            classes,
            loc="lower left",
            title="Classes",
        )
        _ = disp.ax_.set_title(
            f"3-Class classification\n(k={n_neighbors!r}, metric={metric!r})"
        )
        # Save the image to disk before calling plt.show(); showing first can
        # leave an empty current figure, and show() is a no-op on headless nodes
        plt.savefig(f"results/n_neighbors={n_neighbors}___metric={metric}.png")
        plt.show()
        plt.close()

For the training and plotting we again clean up the imports, create the results folder up front, and save each figure before showing it; otherwise the code is unchanged.
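
At this point the whole workflow can already be run locally, for example (assuming both scripts live in the current directory):

python preprocess.py       # run once
python train_and_plot.py   # re-run whenever parameters change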

Update code to run on a cluster

To run the code on a cluster we need two steps. First, we create an environment in which the code can run; how you go about this depends on the cluster, but most clusters allow the use of containers, which is why we use a container in this example. Second, we execute our code on the cluster through the scheduler.

Build a container for dependencies

We assume that your cluster has support for Singularity. We provide both a Singularity and a Docker definition file.

# You might need to activate singularity depending on your cluster
singularity build python3_10 docker://harbor.cs.aalto.fi/aaltorse-public/coderefinery/parallel-workflow:latest

This command builds the Singularity container from the Docker image we provide. Containers are discussed in more detail in our Container Lecture.
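
Before submitting anything, it can be worth checking that the container provides the packages our scripts need. A quick sanity check could look like this (assuming the image ships scikit-learn and matplotlib, which our code requires):

# Print the scikit-learn version available inside the container
singularity exec python3_10 python -c "import sklearn, matplotlib; print(sklearn.__version__)"

If this prints a version number, the container is usable for our two scripts.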

Create a Slurm script to run the code

We will need a Slurm script to submit our job to the cluster queue. The script we will be using is the following:

#!/bin/bash
#SBATCH --job-name=long_job
#SBATCH --time=01:00:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=1

# We assume that singularity runs out of the box on your cluster; if not, you will
# have to add a command here that makes the singularity command available

# Example commands: replace with your actual commands; the container name matches
# the image built above
singularity exec python3_10 python preprocess.py
singularity exec python3_10 python train_and_plot.py
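
Save the script, for example as run_workflow.sh (the file name is our choice), and submit it to the queue:

# Submit the job and check its status in the queue
sbatch run_workflow.sh
squeue -u $USER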

With this in place, we have code that runs on the cluster. However, it will run all the different metrics and neighbourhood sizes one after another.