Convert a Jupyter Notebook to a Python script

For this walk-through we start with a Jupyter notebook that is based on the Nearest Neighbor Classification example from the scikit-learn toolkit. The notebook can be found on GitHub. In it, we

  • load the Iris dataset from scikit-learn datasets,

  • preprocess the data and save the preprocessed version to disk,

  • learn an Iris subspecies classifier from a subset of the data, and

  • plot the classifier’s boundary decisions on the complete data set.

The first step is to convert the notebook into a Python script. This is straightforward and can be done in Jupyter by going to:

"File" -> "Save and Export Notebook as..." -> "Executable Script"

The result of this conversion can be found on GitHub.
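
The same conversion can also be done from the command line with nbconvert; here we assume the notebook file is called nearest_neighbors.ipynb (substitute your actual file name):

# Writes nearest_neighbors.py next to the notebook
jupyter nbconvert --to script nearest_neighbors.ipynb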

Split into a pre-processing and an execution script

Our code has two distinct parts: a pre-processing part, and a model generation and plotting part. The pre-processing needs to run exactly once and should not be repeated for every experiment: if we want to compare results, the training/test split has to be identical for all methods. We therefore split the code into two files, preprocess.py and train_and_plot.py.

from pathlib import Path
import pickle

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# ## Preprocess data
#
# Load Iris flower data set from scikit-learn datasets.
#
# Extract two features
#
# - `SepalLengthCm`
# - `SepalWidthCm`
#
# out of the available four
#
# - `SepalLengthCm`
# - `SepalWidthCm`
# - `PetalLengthCm`
# - `PetalWidthCm`
#
# and map the class labels
#
# - `Iris-setosa`
# - `Iris-versicolor`
# - `Iris-virginica`
#
# to integers 0, 1, and 2.
#
# Divide the data randomly to train and test sets.
#
# Save the preprocessed data to disk.

# Load data from sklearn datasets
iris = load_iris(as_frame=True)

# Extract two features
features = ["sepal length (cm)", "sepal width (cm)"]
X = iris.data[features]

# Class labels
classes = iris.target_names
y = iris.target

# Divide randomly to train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Save to disk (create the folder, including parents, if it does not exist yet)
Path("data/preprocessed").mkdir(parents=True, exist_ok=True)
with open("data/preprocessed/Iris.pkl", "wb") as f:
    pickle.dump([X, X_train, X_test, y, y_train, y_test, features, classes], f)

We only include the imports that are necessary and make sure that the data/preprocessed folder exists when we run the code. This allows us to run the pre-processing once and, in all further steps, reuse the already pre-processed data, avoiding unnecessary compute time if we, for example, only want to change the metrics.
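
Since the preprocessed file is the single source of truth for the train/test split, a simple way to avoid re-running the pre-processing by accident is to only run it when its output is missing. A minimal sketch in the shell (the path follows the script above):

# Run the pre-processing only if its output does not exist yet
test -f data/preprocessed/Iris.pkl || python preprocess.py

The second script, train_and_plot.py, then loads this file and does the model fitting and plotting: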

from pathlib import Path
import pickle

import matplotlib.pyplot as plt

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ## Fit pipelines and plot decision boundaries
#
# Loop over the `n_neighbors` parameter
#
# - Fit a standard scaler + knn classifier pipeline
# - Plot decision boundaries and save the image to disk

# Load preprocessed data from disk
with open("data/preprocessed/Iris.pkl", "rb") as f:
    data = pickle.load(f)
    X, X_train, X_test, y, y_train, y_test, features, classes = data

# Parameters
# Metrics: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics
n_neighbors_list = [1, 2, 4, 8, 16, 32, 64]
metrics = ["euclidean", "manhattan", "l1", "haversine", "cosine"]

# Make sure the output folder exists
Path("results/").mkdir(parents=True, exist_ok=True)

# Loop over n_neighbors and metrics
for n_neighbors in n_neighbors_list:
    for metric in metrics:
        # Fit
        clf = Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("knn", KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)),
            ]
        )
        clf.fit(X_train, y_train)

        # Plot
        disp = DecisionBoundaryDisplay.from_estimator(
            clf,
            X_test,
            response_method="predict",
            plot_method="pcolormesh",
            xlabel=features[0],
            ylabel=features[1],
            shading="auto",
            alpha=0.5,
        )
        scatter = disp.ax_.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors="k")
        disp.ax_.legend(
            scatter.legend_elements()[0],
            classes,
            loc="lower left",
            title="Classes",
        )
        _ = disp.ax_.set_title(
            f"3-Class classification\n(k={n_neighbors!r}, metric={metric!r})"
        )
        # Save the image to disk before calling plt.show(); showing first can
        # leave an empty current figure, and show() is a no-op on headless nodes
        plt.savefig(f"results/n_neighbors={n_neighbors}___metric={metric}.png")
        plt.show()
        plt.close()

For the training and plotting we again clean up the imports, create the results folder up front, and save each figure before showing it; otherwise the code is unchanged.
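
At this point the whole workflow can already be run locally, for example (assuming both scripts live in the current directory):

python preprocess.py       # run once
python train_and_plot.py   # re-run whenever parameters change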

Update code to run on a cluster

To run the code on a cluster we need two steps. First, we create an environment in which the code can run; how you go about this depends on the cluster, but most clusters allow the use of containers, which is why we use a container in this example. Second, we execute our code on the cluster through the scheduler.

Build a container for dependencies

We assume that your cluster has support for Singularity. We provide both a Singularity and a Docker definition file.

# You might need to activate singularity depending on your cluster
singularity build python3_10 docker://harbor.cs.aalto.fi/aaltorse-public/coderefinery/parallel-workflow:latest

This command builds the Singularity container from the Docker image we provide. Containers are discussed in more detail in our Container Lecture.
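
Before submitting anything, it can be worth checking that the container provides the packages our scripts need. A quick sanity check could look like this (assuming the image ships scikit-learn and matplotlib, which our code requires):

# Print the scikit-learn version available inside the container
singularity exec python3_10 python -c "import sklearn, matplotlib; print(sklearn.__version__)"

If this prints a version number, the container is usable for our two scripts.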

Create a Slurm script to run the code

We will need a Slurm script to submit our job to the cluster queue. The script we will be using is the following:

#!/bin/bash
#SBATCH --job-name=long_job
#SBATCH --time=01:00:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=1

# We assume that singularity runs out of the box on your cluster; if not, you will
# have to add a command here that makes the singularity command available

# Example commands: replace with your actual commands; the container name matches
# the image built above
singularity exec python3_10 python preprocess.py
singularity exec python3_10 python train_and_plot.py
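
Save the script, for example as run_workflow.sh (the file name is our choice), and submit it to the queue:

# Submit the job and check its status in the queue
sbatch run_workflow.sh
squeue -u $USER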

With this in place, we have code that runs on the cluster. However, it will run all the different metrics and neighbourhood sizes one after another.