Exercises

Virtual environments

It’s a good idea to run the exercises in a virtual environment. This way you avoid conflicts between the dependencies for different exercises and your project.

We recommend using conda on Triton. Note that you need to run source activate and not conda activate.

    module load miniconda
    conda create -n ENV_NAME python pip
    source activate ENV_NAME
    pip install -r requirements.txt

If the module is named Miniconda3 on your cluster, load that instead. The other steps are the same, and you still need to run source activate and not conda activate.

    module load Miniconda3
    conda create -n ENV_NAME python pip
    source activate ENV_NAME
    pip install -r requirements.txt

On CSC clusters, you can use pip-containerize to create a container for the virtual environment.

    module purge
    module load tykky
    mkdir MyEnv
    pip-containerize new --prefix MyEnv requirements.txt
    export PATH="$PWD/MyEnv/bin:$PATH"

Exercise 1.1

Try to reproduce the results from How to choose the number of cores by timing a series of runs using the example code on your cluster.

Exercise 1.2 (optional)

Apply the methodology from How to choose the number of cores by timing a series of runs to your own code to find the optimal number of cores to use.
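One way to organize the timing runs for Exercises 1.1 and 1.2 is sketched below. It is a minimal sketch, not necessarily the procedure used in the referenced section: the helper names time_command and scaling_table are hypothetical, and the placeholder command must be replaced with your own program and with whatever mechanism it uses to select the number of cores.

```python
import subprocess
import sys
import time


def time_command(command):
    """Return the wall-clock time in seconds of one run of `command`."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def scaling_table(timings):
    """Compute speedup and parallel efficiency from measured timings.

    `timings` maps number of cores -> wall-clock seconds. The run with the
    smallest core count is used as the baseline.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        efficiency = speedup * base_cores / cores
        rows.append((cores, timings[cores], speedup, efficiency))
    return rows


if __name__ == "__main__":
    timings = {}
    for cores in (1, 2, 4):
        # Placeholder workload (assumption): replace with your own program
        # and pass `cores` to it in the way your program expects.
        command = [sys.executable, "-c", "pass"]
        timings[cores] = time_command(command)
    for cores, seconds, speedup, efficiency in scaling_table(timings):
        print(f"{cores:>4} cores  {seconds:8.3f} s  "
              f"speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

An efficiency that drops well below 1 as cores are added suggests the extra cores are no longer paying for themselves. Repeating each measurement a few times helps to separate real scaling from noise.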

Exercise 1.3

Try to reproduce the results from Measuring and choosing the right amount of memory using the example code on your cluster.

Exercise 1.4 (optional)

Apply the methodology from Measuring and choosing the right amount of memory to your own code to find how much memory it uses.
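If you want a quick estimate outside the scheduler, the following minimal sketch runs a command and reports the peak resident memory of child processes. It assumes a Unix-like system (the resource module is not available on Windows), and the function name peak_child_memory_mib is hypothetical. The referenced section may use different tools, so treat this as a cross-check rather than a replacement.

```python
import resource
import subprocess
import sys


def peak_child_memory_mib(command):
    """Run `command` and return the peak resident set size in MiB.

    The value is the maximum over all child processes this Python process
    has waited for so far, so run one measurement per Python process.
    """
    subprocess.run(command, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kibibytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss / divisor


if __name__ == "__main__":
    # Placeholder workload (assumption): replace with your own program.
    command = [sys.executable, "-c", "data = bytearray(200_000_000)"]
    print(f"peak memory: {peak_child_memory_mib(command):.0f} MiB")
```

When requesting memory for a job, it is usually sensible to add some margin on top of the measured peak, since memory use can vary between inputs.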

Exercise 2.1

This code runs a parameter search with a fast simulation step. The function simulate runs on GPUs and is very fast. How would you improve the I/O performance of this code?

    for parameter in parameters:
        for datafile in datafiles:
            with open(datafile) as f:
                data = f.read()

            result = simulate(data, parameter)

            with open('results.json', 'a') as f:
                f.write(json.dumps(result))

Exercise 2.2

Find an example machine learning training script in https://github.com/coderefinery/CIFAR100_example. The data used is small enough to run on most systems, and the workflow is not especially problematic from an I/O perspective.

You are preparing to use the workflow for a significantly larger dataset. What should you take into account?
Count the number of file operations in a single epoch.
Study the code and find where the file operations actually happen. Can you improve the workflow to reduce load on the disk? Does this improve performance?
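A rough way to count file opens from inside Python is sketched below. The context manager count_opens is a hypothetical helper, not part of the example repository. It only sees calls that go through the built-in open while it is active, so file access done in compiled library code is not counted; a system-level tracer gives a more complete picture.

```python
import builtins
from collections import Counter
from contextlib import contextmanager


@contextmanager
def count_opens():
    """Count calls to the built-in open(), keyed by file name."""
    counts = Counter()
    original_open = builtins.open

    def counting_open(file, *args, **kwargs):
        counts[str(file)] += 1
        return original_open(file, *args, **kwargs)

    builtins.open = counting_open
    try:
        yield counts
    finally:
        # Always restore the real open(), even if the wrapped code fails.
        builtins.open = original_open


if __name__ == "__main__":
    with count_opens() as counts:
        # Placeholder (assumption): call one epoch of your own training
        # loop here instead of opening this script.
        with open(__file__) as f:
            f.read()
    print(f"{sum(counts.values())} open() calls on {len(counts)} files")
```

If the same file shows up many times per epoch, that file is a natural candidate for reading once and keeping in memory.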

Exercise 2.3

Use dd to generate a large file. Try this on your local machine and on an HPC system. Which is faster? When would the HPC system be slower than a desktop?
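To repeat the same measurement from a script, the sketch below times a large sequential write and reports the throughput. It is an illustration alongside dd, not a substitute for it: the function name write_throughput and the file name are hypothetical, and the call to os.fsync is there so that the time includes flushing data to the storage system rather than only to the page cache.

```python
import os
import time


def write_throughput(path, total_bytes, block_size=1024 * 1024):
    """Write `total_bytes` zero bytes to `path` and return MB/s."""
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        # Push buffered data to the storage system before stopping the clock.
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical file name and size: adjust both for your system.
    path = "throughput_test.bin"
    rate = write_throughput(path, 256 * 1024 * 1024)
    print(f"sequential write: {rate:.0f} MB/s")
    os.remove(path)
```

Trying different values of block_size, or writing many small files instead of one large file, is a simple way to see when a shared file system behaves differently from a local disk.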

Exercise 2.4: Meteorological data processing

Find an example meteorological data processing pipeline at https://github.com/coderefinery/meteorological-data-processing-exercise.

Read the instructions in the readme and generate example data.
Research commonly used data types for this type of data.
When does the process load data? How would you study possible I/O issues?
Can you improve data handling in the code?

Exercise X: Bring your own code and issues

Study I/O patterns in your own code. How much time does your code spend waiting for I/O? Is this a significant portion of the time? How can you improve this?
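A simple starting point is to wrap the I/O and compute parts of your code in timers and compare the totals. The sketch below assumes you can edit the code and mark those sections by hand. The class SectionTimer is a hypothetical helper, and the sleep calls stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulate wall-clock time per named section of a program."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        """Return each section's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


if __name__ == "__main__":
    timer = SectionTimer()
    for _ in range(3):
        with timer.section("io"):
            time.sleep(0.01)   # placeholder for reading or writing data
        with timer.section("compute"):
            time.sleep(0.02)   # placeholder for the actual computation
    for name, share in timer.fractions().items():
        print(f"{name:>8}: {share:6.1%}")
```

If the I/O share is small, optimizing it will not change the total run time much. If it dominates, the I/O patterns from the earlier exercises are a good place to look for improvements.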