List of exercises

Full list

This is a list of all exercises and solutions in this lesson, mainly as a reference for helpers and instructors. This list is automatically generated from all of the other pages in the lesson. Any single teaching event will probably cover only a subset of these, depending on their interests.

Parallelize using scripting

In parallelization/parallelize_using_script.md:

Parallel-1: Add the metrics as a parameter to the submission

Let’s assume that we noticed, that running all metrics still took too long for our purpose, and we also want to parallelize that part. You will need to update the train_and_plot.py script, the submission script and the sbatch script.

In parallelization/parallelize_using_script.md:

Solution: Parallel-1

We will need to remove the metrics loop from the train_and_plot.py script as follows and add a parser for it as follows:

from pathlib import Path
import pickle
import argparse
import matplotlib.pyplot as plt

from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ## Fit pipelines and plot decision boundaries
#
# Loop over the `n_neighbors` parameter
#
# - Fit a standard scaler + knn classifier pipeline
# - Plot decision boundaries and save the image to disk

# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
    "--n_neighbors",
    type=int,
    help="The number of neighbors to use for calculation.",
)
parser.add_argument(
    "--metric",
    type=str,
    help="The metric to use",
)
args = parser.parse_args()
n_neighbors = args.n_neighbors
metric = args.metric

# Load preprocessed data from disk
with open("data/preprocessed/Iris.pkl", "rb") as f:
    data = pickle.load(f)
    X, X_train, X_test, y, y_train, y_test, features, classes = data


# Parameters
# Metrics: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics

# Loop over n_neighbors
# Fit
clf = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)),
    ]
)
clf.fit(X_train, y_train)

# Plot
disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X_test,
    response_method="predict",
    plot_method="pcolormesh",
    xlabel=features[0],
    ylabel=features[1],
    shading="auto",
    alpha=0.5,
)
scatter = disp.ax_.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors="k")
disp.ax_.legend(
    scatter.legend_elements()[0],
    classes,
    loc="lower left",
    title="Classes",
)
_ = disp.ax_.set_title(
    f"3-Class classification\n(k={n_neighbors!r}, metric={metric!r})"
)
plt.show()

# Save image to disk
Path("results/").mkdir(parents=True, exist_ok=True)
plt.savefig(f"results/n_neighbors={n_neighbors}___metric={metric}.png")

Then, we need to update the submission slurm script, adding a further parameter to it:

#!/bin/bash
#SBATCH --job-name=long_job

# We assume, that singularity runs out of the box on your cluster, if not, you will have to
# add a command here that makes the singularity command available on your cluster

singularity exec python_container python train_and_plot.py --n_neighbors $1 --metrics $2

And finally, we need to update the submission python script to also use the metrics values:

import subprocess

neighbors = [1, 2, 4, 8, 16, 32, 64]
metrics = ["euclidean", "manhattan", "l1", "haversine", "cosine"]
for i in neighbors:
    for metric in metrics:
        result = subprocess.run(["sbatch", "submission.sh", f"{i}", f"{metric}"])

Parallelize using Slurm Array jobs

In parallelization/array_jobs.md:

Parallel-2: Create a slurm script and run it.

Let’s assume, that we want to build the same job, that we had with the script submission but we want to use an array job instead of a submission script.

To do this, we need to update the slurm script to use an array instead of an input argument.

Create this submission script and run it on your cluster

In parallelization/array_jobs.md:

Solution: Parallel-2

Here is a script that should be able to run on your cluster:

#!/bin/bash
#SBATCH --job-name=long_job
#SBATCH --array=[1,2,4,8,16,32,64]
#SBATCH --output=output_%A_%a.txt
#SBATCH --error=error_%A_%a.txt

singularity exec python_container python train_and_plot.py --n_neighbors $SLURM_ARRAY_TASK_ID

Other materials

There is an optional exercise with Snakemake from the CodeRefinery lessons materials.