Good practices and tools

Objectives

  • How does good Python code look like? And if we only had 30 minutes, which good practices should we highlight?

  • Some of the points are inspired by the excellent Effective Python book by Brett Slatkin.

Follow the PEP 8 style guide

  • Please browse the PEP 8 style guide so that you are familiar with the most important rules.

  • Using a consistent style makes your code easier to read and understand for others.

  • You don’t have to check and adjust your code manually. There are tools that can do this for you (see below).

Linting and static type checking

A linter is a tool that analyzes source code to detect potential errors, unused imports, unused variables, code style violations, and to improve readability.

We recommend Ruff since it can do both checking and formatting and you don’t have to switch between multiple tools.

Linters and formatters can be configured to your liking

These tools typically have good defaults. But if you don’t like the defaults, you can configure what they should ignore or how they should format or not format.

This code example (which we possibly recognize from the previous section about Profiling) has few problems (highlighted):

import re
import requests


def count_unique_words(file_path: str) -> int:
    unique_words = set()
    forgotten_variable = 13
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b\w+\b", line.lower()))
            for word in words:
                unique_words.add(word)
    return len(unique_words)

Please try whether you can locate these problems using Ruff:

$ ruff check

If you use version control and like to have your code checked or formatted before you commit the change, you can use tools like pre-commit.

Many editors can be configured to automatically check your code as you type. Ruff can also be used as a language server.

Use an auto-formatter

Ruff is one of the best tools to automatically format your code according to a consistent style.

To demonstrate how it works, let us try to auto-format a code example which is badly formatted and also difficult to read:

import re
def  count_unique_words (file_path : str)->int:
  unique_words=set()
  with open(file_path,"r",encoding="utf-8") as file:
    for line in file:
      words=re.findall(r"\b\w+\b",line.lower())
      for word in words:
        unique_words.add(word)
  return len(   unique_words   )

Other popular formatters:

Many editors can be configured to automatically format for you when you save the file.

It is possible to automatically format your code in Jupyter notebooks! For this to work you need the following three dependencies installed:

jupyterlab-code-formatter
black
isort

More information and a screen-cast of how this works can be found at https://jupyterlab-code-formatter.readthedocs.io/.

Consider annotating your functions with type hints

Compare these two versions of the same function and discuss how the type hints can help you and the Python interpreter to understand the function better:

def count_unique_words(file_path):
    unique_words = set()
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b\w+\b", line.lower())
            for word in words:
                unique_words.add(word)
    return len(unique_words)

A (static) type checker is a tool that checks whether the types of variables in your code match the types that you have specified. Popular tools:

Consider using AI-assisted coding

We can use AI as an assistant/apprentice:

  • Code completion

  • Write a test based on an implementation

  • Write an implementation based on a test

Or we can use AI as a mentor:

  • Explain a concept

  • Improve code

  • Show a different (possibly better) way of implementing the same thing

Screenshot of ChatGPT

Example for using a chat-based AI tool.

Screen-cast of working with GitHub Copilot

Example for using AI to complete code in an editor.

AI tools open up a box of questions which are beyond our scope here

  • Legal

  • Ethical

  • Privacy

  • Lock-in/ monopolies

  • Lack of diversity

  • Will we still need to learn programming?

  • How will it affect learning and teaching programming?

Debugging with print statements

Print-debugging is a simple, effective, and popular way to debug your code like this:

print(f"file_path: {file_path}")

Or more elaborate:

print(f"I am in function count_unique_words and the value of file_path is {file_path}")

But there can be better alternatives:

  • Logging module

    import logging
    
    logging.basicConfig(level=logging.DEBUG)
    
    logging.debug("This is a debug message")
    logging.info("This is an info message")
    
  • IceCream offers compact helper functions for print-debugging

    from icecream import ic
    
    ic(file_path)
    

Often you can avoid using indices

Especially people coming to Python from other languages tend to use indices where they are not needed. Indices can be error-prone (off-by-one errors and reading/writing past the end of the collection).

Iterating

scores = [13, 5, 2, 3, 4, 3]

for i in range(len(scores)):
    print(scores[i])

Enumerate if you need the index

particle_masses = [7.0, 2.2, 1.4, 8.1, 0.9]

for i in range(len(particle_masses)):
    print(f"Particle {i} has mass {particle_masses[i]}")

Zip if you need to iterate over two collections

persons = ["Alice", "Bob", "Charlie", "David", "Eve"]
favorite_ice_creams = ["vanilla", "chocolate", "strawberry", "mint", "chocolate"]

for i in range(len(persons)):
    print(f"{persons[i]} likes {favorite_ice_creams[i]} ice cream")

Unpacking

coordinates = (0.1, 0.2, 0.3)

x = coordinates[0]
y = coordinates[1]
z = coordinates[2]

Prefer catch-all unpacking over indexing/slicing

scores = [13, 5, 2, 3, 4, 3]

sorted_scores = sorted(scores)

smallest = sorted_scores[0]
rest = sorted_scores[1:-1]
largest = sorted_scores[-1]

print(smallest, rest, largest)
# Output: 2 [3, 3, 4, 5] 13

List comprehensions, map, and filter instead of loops

string_numbers = ["1", "2", "3", "4", "5"]

integer_numbers = []
for element in string_numbers:
    integer_numbers.append(int(element))

print(integer_numbers)
# Output: [1, 2, 3, 4, 5]
def is_even(number: int) -> bool:
    return number % 2 == 0


numbers = [1, 2, 3, 4, 5, 6]

even_numbers = []
for number in numbers:
    if is_even(number):
        even_numbers.append(number)

print(even_numbers)
# Output: [2, 4, 6]

Know your collections

How to choose the right collection type:

  • Ordered and modifiable: list

  • Fixed and (rather) immutable: tuple

  • Key-value pairs: dict

  • Dictionary with default values: defaultdict from collections

  • Members are unique, no duplicates: set

  • Optimized operations at both ends: deque from collections

  • Cyclical iteration: cycle from itertools

  • Adding/removing elements in the middle: Create a linked list (e.g. using a dictionary or a dataclass)

  • Priority queue: heapq library

  • Search in sorted collections: bisect library

What to avoid:

  • Need to add/remove elements at the beginning or in the middle? Don’t use a list.

  • Need to make sure that elements are unique? Don’t use a list.

Making functions more ergonomic

  • Less error-prone API functions and fewer backwards-incompatible changes by enforcing keyword-only arguments:

    def send_message(*, message: str, recipient: str) -> None:
        print(f"Sending to {recipient}: {message}")
    
  • Use dataclasses or named tuples or dictionaries instead of too many input or output arguments.

  • Docstrings instead of comments:

    def send_message(*, message: str, recipient: str) -> None:
        """
        Sends a message to a recipient.
    
        Parameters:
        - message (str): The content of the message.
        - recipient (str): The name of the person receiving the message.
        """
        print(f"Sending to {recipient}: {message}")
    
  • Consider using DeprecationWarning from the warnings module for deprecating functions or arguments.

Iterating

  • When working with large lists or large data sets, consider using generators or iterators instead of lists. Discuss and compare these two:

    even_numbers1 = [number for number in range(10000000) if number % 2 == 0]
    
    even_numbers2 = (number for number in range(10000000) if number % 2 == 0)
    
  • Beware of functions which iterate over the same collection multiple times. With generators, you can iterate only once.

  • Know about itertools which provides a lot of functions for working with iterators.

Dataclasses

Dataclasses are often a good alternative to regular classes:

class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __repr__(self):
        return f"Point(x={self.x}, y={self.y}, z={self.z})"

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return self.x == other.x and self.y == other.y and self.z == other.z

Use relative paths and pathlib

  • Scripts that read data from absolute paths are not portable and typically break when shared with a colleague or support help desk or reused by the next student/PhD student/postdoc.

  • pathlib is a modern and portable way to handle paths in Python.

Project structure

  • As your project grows from a simple script, you should consider organizing your code into modules and packages.

  • Function too long? Consider splitting it into multiple functions.

  • File too long? Consider splitting it into multiple files.

  • Difficult to name a function or file? It might be doing too much or unrelated things.

  • If your script can be imported into other scripts, wrap your main function in a if __name__ == "__main__": block:

    def main():
        ...
    
    if __name__ == "__main__":
        main()
    
  • Why this construct? You can try to either import or run the following script:

    if __name__ == "__main__":
        print("I am being run as a script")  # importing will not run this part
    else:
        print("I am being imported")
    
  • Try to have all code inside some function. This can make it easier to understand, test, and reuse. It can also help Python to free up memory when the function is done.

Reading and writing files

  • Good construct to know to read a file:

    with open("input.txt", "r") as file:
        for line in file:
            print(line)
    
  • Reading a huge data file? Read and process it in chunks or buffered or use a library which does it for you.

  • On supercomputers, avoid reading and writing thousands of small files.

  • For input files, consider using standard formats like CSV, YAML, or TOML - then you don’t need to write a parser.

Use subprocess instead of os.system

  • Many things can go wrong when launching external processes from Python. The subprocess module is the recommended way to do this.

  • os.system is not portable and not secure enough.

Parallelizing

  • Use one of the many libraries: multiprocessing, mpi4py, Dask, Parsl, …

  • Identify independent tasks.

  • More often than not, you can convert an expensive loop into a command-line tool and parallelize it using workflow management tools like Snakemake.