Exercises
Virtual environments
It’s a good idea to run the exercises in a virtual environment. This way you avoid conflicts between the dependencies of different exercises and of your own project.
We recommend using conda on Triton. Note that you need to run source activate and not conda activate.
module load miniconda
conda create -n ENV_NAME python pip
source activate ENV_NAME
pip install -r requirements.txt
On CSC clusters, you can use pip-containerize to create a container for the virtual environment.
module purge
module load tykky
mkdir MyEnv
pip-containerize new --prefix MyEnv requirements.txt
export PATH="$PWD/MyEnv/bin:$PATH"
Exercise 1.1
Try to reproduce the results from How to choose the number of cores by timing a series of runs using the example code on your cluster.
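A minimal sketch of such a timing sweep on a Slurm cluster, where pi.py is a hypothetical stand-in for the example code and the core counts are just an illustration:
# submit the same job with an increasing number of cores
for ncores in 1 2 4 8 16; do
    sbatch --cpus-per-task=$ncores --wrap "srun python pi.py"
done
# once the jobs finish, compare the elapsed times
sacct --format=JobID,AllocCPUS,Elapsed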
Exercise 1.2 (optional)
Apply the methodology from How to choose the number of cores by timing a series of runs to your own code to find the optimal number of cores to use.
Exercise 1.3
Try to reproduce the results from Measuring and choosing the right amount of memory using the example code on your cluster.
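Once a job has finished, Slurm accounting reports its peak memory use. A sketch, where JOBID is a placeholder for your job's ID:
# MaxRSS is the peak resident memory of the job's tasks
sacct -j JOBID --format=JobID,MaxRSS,ReqMem,Elapsed,State
# many clusters also provide a human-readable summary tool
seff JOBID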
Exercise 1.4 (optional)
Apply the methodology from Measuring and choosing the right amount of memory to your own code to find how much memory it uses.
Exercise 2.1
This code runs a parameter search with a fast simulation step. The function simulate runs on GPUs and is very fast. How would you improve the I/O performance of this code?
import json

for parameter in parameters:
    for datafile in datafiles:
        with open(datafile) as f:
            data = f.read()
        result = simulate(data, parameter)
        with open('results.json', 'a') as f:
            f.write(json.dumps(result))
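One possible direction (a sketch, not the only answer): read each data file once instead of once per parameter, and write all results in a single operation at the end, assuming the data and results fit in memory:
import json

# read each data file once, instead of re-reading it for every parameter
datasets = []
for datafile in datafiles:
    with open(datafile) as f:
        datasets.append(f.read())

# collect results in memory and write them out once at the end
results = []
for parameter in parameters:
    for data in datasets:
        results.append(simulate(data, parameter))

with open('results.json', 'w') as f:
    json.dump(results, f)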
Exercise 2.2
Find an example machine learning training script at https://github.com/coderefinery/CIFAR100_example. The data used is small enough to run on most systems, and the workflow is not especially problematic from an I/O perspective.
You are preparing to use the workflow for a significantly larger dataset. What should you take into account?
Count the number of file operations in a single epoch (one way to count them is sketched after this list).
Study the code and find where the file operations actually happen. Can you improve the workflow to reduce the load on the disk? Does this improve performance?
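One way to count file operations (a sketch, assuming a Linux system; train.py is a hypothetical stand-in for the training script) is strace's system call summary:
# -f follows child processes, -c prints a per-syscall count summary
strace -f -c -e trace=openat,read,write python train.py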
Exercise 2.3
Use dd to generate a large file. Try this on your local machine and on an HPC system. Which is faster? When would the HPC system be slower than a desktop?
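For example (a sketch; adjust the size, block size and target path for your system):
# write a 1 GiB file of zeros and report the throughput
dd if=/dev/zero of=testfile bs=1M count=1024 status=progress
On the HPC system, it can be instructive to compare the shared filesystem with node-local storage, if available.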
Exercise 2.4: Meteorological data processing
Find an example meteorological data processing pipeline at https://github.com/coderefinery/meteorological-data-processing-exercise.
Read the instructions in the readme and generate example data.
Research commonly used data types for this type of data.
When does the process load data? How would you study possible I/O issues? (One monitoring approach is sketched after this list.)
Can you improve data handling in the code?
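To watch I/O while the pipeline runs, per-process disk statistics are one option (a sketch, assuming a Linux system with the sysstat tools installed; the process name python is an assumption about how the pipeline is started):
# report disk reads and writes of python processes once per second
pidstat -d -C python 1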
Exercise X: Bring your own code and issues
Study I/O patterns in your own code. How much time does your code spend waiting for I/O? Is this a significant portion of the time? How can you improve this?
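As a starting point, you can time the I/O sections directly. A minimal sketch, where read_inputs and compute are hypothetical stand-ins for your own functions:
import time

t0 = time.perf_counter()
data = read_inputs()       # the I/O-bound part of your code
t1 = time.perf_counter()
results = compute(data)    # the compute-bound part
t2 = time.perf_counter()

print(f"I/O:     {t1 - t0:.2f} s")
print(f"Compute: {t2 - t1:.2f} s")
Tools such as strace -c or /usr/bin/time -v can give a system-level view of the same question.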