List of exercises

Full list

This is a list of all exercises and solutions in this lesson, mainly as a reference for helpers and instructors. This list is automatically generated from all of the other pages in the lesson. Any single teaching event will probably cover only a subset of these, depending on the interests of the participants.

Organizing your projects

In organizing-projects.md:

Recording dependencies

In dependencies.md:

Dependencies-1: Time-capsule of dependencies

Situation: 5 students (A, B, C, D, E) each wrote code that depends on a couple of libraries. They uploaded their projects to GitHub. We now travel 3 years into the future, find their GitHub repositories, and try to re-run their code before adapting it.

  • Which version do you expect to be easiest to re-run? Why?

  • What problems do you anticipate in each solution?

    A: You find a couple of library imports across the code but that’s it.

    B: The README file lists which libraries were used but does not mention any versions.

    C: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy
      - numpy
      - sympy
      - click
      - python
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@master
        - git+https://github.com/anotheruser/anotherproject.git@master
    

    D: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - pip
      - pip:
        - git+https://github.com/someuser/someproject.git@d7b2c7e
        - git+https://github.com/anotheruser/anotherproject.git@sometag
    

    E: You find an environment.yml file with:

    name: student-project
    channels:
      - conda-forge
    dependencies:
      - scipy=1.3.1
      - numpy=1.16.4
      - sympy=1.4
      - click=7.0
      - python=3.8
      - someproject=1.2.3
      - anotherproject=2.3.4
    

In dependencies.md:

Dependencies-2: Create a time-capsule for the future

Now it is time to create your own time-capsule and share it with the future world. If we asked you now which dependencies your project is using, what would you answer? How would you find out? And how would you communicate this information?

Try this either with your own project or inside the “coderefinery” conda environment:

$ conda env export > environment.yml

Have a look at the generated file and discuss what you see.

In the future you can re-create this environment with:

$ conda env create -f environment.yml
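
Two export variants can be worth trying (both are described in the conda documentation; exact behavior may vary between conda versions): --no-builds drops platform-specific build strings from the exported file, and --from-history records only the packages you explicitly installed:

$ conda env export --no-builds > environment.yml
$ conda env export --from-history > environment.yml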

More information: https://docs.conda.io/en/latest/

See also: https://github.com/mamba-org/mamba

Recording computational steps

In workflow-management.md:

Workflow-1: Scripted solution for processing 4 books

Somebody wrote a script (script.sh) to process all 4 books:

#!/usr/bin/env bash

# loop over all books
for title in abyss isles last sierra; do
    python statistics/count.py data/${title}.txt > statistics/${title}.data
    python plot/plot.py --data-file statistics/${title}.data --plot-file plot/${title}.png
done

We can run it with:

$ bash script.sh

  • What are the advantages of this solution compared to processing the books one by one? (Manual one-by-one commands are sketched after this list.)

  • Is the scripted solution reproducible?

  • Imagine adding more steps to the analysis and imagine the steps being time consuming. What problems do you anticipate with a scripted solution?
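
For comparison, processing the books one by one without the script would mean typing each pair of commands manually; for the first book this would look like (the same commands as in the loop above):

$ python statistics/count.py data/abyss.txt > statistics/abyss.data
$ python plot/plot.py --data-file statistics/abyss.data --plot-file plot/abyss.png

...and likewise for isles, last, and sierra.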

In workflow-management.md:

Workflow-2: Workflow solution using Snakemake

How Snakemake works

Somebody wrote a Snakemake solution and the interesting file here is the Snakefile:

# a list of all the books we are analyzing
DATA = glob_wildcards('data/{book}.txt').book

rule all:
    input:
        expand('statistics/{book}.data', book=DATA),
        expand('plot/{book}.png', book=DATA)

# count words in one of our books
rule count_words:
    input:
        script='statistics/count.py',
        book='data/{file}.txt'
    output: 'statistics/{file}.data'
    conda: 'environment.yml'
    log: 'statistics/{file}.log'
    shell: 'python {input.script} {input.book} > {output}'

# create a plot for each book
rule make_plot:
    input:
        script='plot/plot.py',
        book='statistics/{file}.data'
    output: 'plot/{file}.png'
    conda: 'environment.yml'
    log: 'plot/{file}.log'
    shell: 'python {input.script} --data-file {input.book} --plot-file {output}'

Snakemake uses declarative style: we describe dependencies but we let Snakemake figure out the series of steps to produce results (targets). Snakefiles contain rules that relate targets (output) to dependencies (input) and commands (shell).

Exercise goals:

  1. Clone the example to your computer: $ git clone https://github.com/coderefinery/word-count.git

  2. Study the Snakefile. How does it know what to do first and what to do then?

  3. Try to run it. Since Snakemake version 5.11, one needs to specify the number of cores (or jobs) using -j, --jobs, or --cores:

    $ snakemake --delete-all-output -j 1
    $ snakemake -j 1
    

    The --delete-all-output part makes sure that we remove all generated files before we start.

  4. Try running snakemake again; observe that it refuses to rerun all steps, and discuss why:

    $ snakemake -j 1
    
    Building DAG of jobs...
    Nothing to be done (all requested files are present and up to date).
    
  5. Make a tiny modification to the plot.py script, run $ snakemake -j 1 again, and observe how it will only re-run the plot steps (see the dry-run tip after this list).

  6. Make a tiny modification to one of the books and run $ snakemake -j 1 again and observe how it only regenerates files for this book.

  7. Discuss possible advantages compared to a scripted solution.

  8. Question for R developers: Imagine you want to rewrite the two Python scripts and use R instead. Which lines in the Snakefile would you have to modify so that it uses your R code?

  9. If you make changes to the Snakefile, validate it using $ snakemake --lint.
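
A tip for steps 5 and 6: before re-running, Snakemake can show which steps it would execute, without actually running them, using a dry run:

$ snakemake -n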

Recording environments

In environments.md:

Containers-1: Time travel

Scenario: A researcher has written and published their research code, which requires a number of libraries and system dependencies. They ran their code on a Linux computer (Ubuntu). One very nice thing they did was to also publish a container image with all dependencies included, as well as the definition file (below) used to create the container image.

Now we travel 3 years into the future and want to reuse their work and adapt it for our data. However, the container registry where they uploaded the container image no longer exists. But luckily we still have the definition file (below)! From this we should be able to create a new container image.

  • Can you anticipate problems using the definition file 3 years after its creation? Which possible problems can you point out?

  • Discuss possible take-aways for creating more reusable containers.

Bootstrap: docker
From: ubuntu:latest

%post
    # Set environment variables
    export VIRTUAL_ENV=/app/venv

    # Install system dependencies and Python 3
    apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc \
        libgomp1 \
        python3 \
        python3-venv \
        python3-distutils \
        python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

    # Set up the virtual environment
    python3 -m venv $VIRTUAL_ENV
    . $VIRTUAL_ENV/bin/activate

    # Install Python libraries
    pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r /app/requirements.txt

%files
    # Copy project files
    ./requirements.txt /app/requirements.txt
    ./app.py /app/app.py
    # Copy data
    /home/myself/data /app/data
    # Workaround to fix dependency on fancylib
    /home/myself/fancylib /usr/lib/fancylib

%environment
    # Set the environment variables
    export LANG=C.UTF-8 LC_ALL=C.UTF-8
    export VIRTUAL_ENV=/app/venv

%runscript
    # Activate the virtual environment
    . $VIRTUAL_ENV/bin/activate
    # Run the application
    python /app/app.py
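
Even without the published container image, a new image can be rebuilt from the definition file. Assuming it is saved as container.def (the filename here is just an example), one way with Apptainer is:

$ apptainer build container.sif container.def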

In environments.md:

Containers-2: Explore two really useful Docker images

You can try the following if you have Docker installed. If you have Singularity/Apptainer and not Docker, the goal of the exercise can be to run the Docker containers through Singularity/Apptainer (a sketch for this follows the list).

  1. Run a specific version of RStudio:

    $ docker run --rm -p 8787:8787 -e PASSWORD=yourpasswordhere rocker/rstudio
    

    Then open your browser at http://localhost:8787 and log in with the username rstudio and the password “yourpasswordhere” set in the previous command.

    If you want to try an older version you can check the tags at https://hub.docker.com/r/rocker/rstudio/tags and run for example:

    $ docker run --rm -p 8787:8787 -e PASSWORD=yourpasswordhere rocker/rstudio:3.3
    
  2. Run a specific version of Anaconda3 from https://hub.docker.com/r/continuumio/anaconda3:

    $ docker run -i -t continuumio/anaconda3 /bin/bash
    
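If you only have Singularity/Apptainer, a possible way to try the Anaconda3 image (a sketch; the name of the generated .sif file may differ):

$ apptainer pull docker://continuumio/anaconda3
$ apptainer exec anaconda3_latest.sif /bin/bash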

Sharing code and data

In sharing.md:

Sharing-1: Get a DOI from Zenodo

Digital object identifiers (DOIs) are the backbone of the academic reference and metrics system. In this exercise we will see how to make a GitHub repository citable by archiving it on the Zenodo archiving service. Zenodo is a general-purpose open access repository created by OpenAIRE and CERN.

  1. Sign in to Zenodo using your GitHub account. For this exercise, use the sandbox service: https://sandbox.zenodo.org/login/. This is a test version of the real Zenodo platform.

  2. Go to https://sandbox.zenodo.org/account/settings/github/.

  3. Find the repository you wish to publish, and flip the switch to ON.

  4. Go to GitHub and create a release by clicking the Create a new release button on the right-hand side (a release is based on a Git tag, but is a higher-level GitHub feature). You will need to enter a tag name (e.g. v0.1) in the “Choose a tag” box; a command-line alternative for creating the tag is sketched after this list.

  5. Creating a new release will trigger Zenodo to archive your repository, and a DOI badge will be displayed next to your repository after a minute or two. You can include it in your GitHub README file: click the DOI badge and copy the relevant format (Markdown, RST, HTML).
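
If you prefer the command line, the tag in step 4 can also be created and pushed before making the release (the tag name v0.1 is just an example):

$ git tag v0.1
$ git push origin v0.1

On GitHub you can then select this existing tag in the “Choose a tag” box.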