Good practices and tools
Objectives
How does good Python code look like? And if we only had 30 minutes, which good practices should we highlight?
Some of the points are inspired by the excellent Effective Python book by Brett Slatkin.
Follow the PEP 8 style guide
Please browse the PEP 8 style guide so that you are familiar with the most important rules.
Using a consistent style makes your code easier to read and understand for others.
You don’t have to check and adjust your code manually. There are tools that can do this for you (see below).
Linting and static type checking
A linter is a tool that analyzes source code to detect potential errors, unused imports, unused variables, code style violations, and to improve readability.
We recommend Ruff since it can do both checking and formatting and you don’t have to switch between multiple tools.
Linters and formatters can be configured to your liking
These tools typically have good defaults. But if you don’t like the defaults, you can configure what they should ignore or how they should format or not format.
This code example (which we possibly recognize from the previous section about Profiling) has few problems (highlighted):
import re
import requests
def count_unique_words(file_path: str) -> int:
unique_words = set()
forgotten_variable = 13
with open(file_path, "r", encoding="utf-8") as file:
for line in file:
words = re.findall(r"\b\w+\b", line.lower()))
for word in words:
unique_words.add(word)
return len(unique_words)
Please try whether you can locate these problems using Ruff:
$ ruff check
If you use version control and like to have your code checked or formatted before you commit the change, you can use tools like pre-commit.
Many editors can be configured to automatically check your code as you type. Ruff can also be used as a language server.
Use an auto-formatter
Ruff is one of the best tools to automatically format your code according to a consistent style.
To demonstrate how it works, let us try to auto-format a code example which is badly formatted and also difficult to read:
import re
def count_unique_words (file_path : str)->int:
unique_words=set()
with open(file_path,"r",encoding="utf-8") as file:
for line in file:
words=re.findall(r"\b\w+\b",line.lower())
for word in words:
unique_words.add(word)
return len( unique_words )
import re
def count_unique_words(file_path: str) -> int:
unique_words = set()
with open(file_path, "r", encoding="utf-8") as file:
for line in file:
words = re.findall(r"\b\w+\b", line.lower())
for word in words:
unique_words.add(word)
return len(unique_words)
This was done using:
$ ruff format
Other popular formatters:
Many editors can be configured to automatically format for you when you save the file.
It is possible to automatically format your code in Jupyter notebooks! For this to work you need the following three dependencies installed:
jupyterlab-code-formatter
black
isort
More information and a screen-cast of how this works can be found at https://jupyterlab-code-formatter.readthedocs.io/.
Consider annotating your functions with type hints
Compare these two versions of the same function and discuss how the type hints can help you and the Python interpreter to understand the function better:
def count_unique_words(file_path):
unique_words = set()
with open(file_path, "r", encoding="utf-8") as file:
for line in file:
words = re.findall(r"\b\w+\b", line.lower())
for word in words:
unique_words.add(word)
return len(unique_words)
def count_unique_words(file_path: str) -> int:
unique_words = set()
with open(file_path, "r", encoding="utf-8") as file:
for line in file:
words = re.findall(r"\b\w+\b", line.lower())
for word in words:
unique_words.add(word)
return len(unique_words)
A (static) type checker is a tool that checks whether the types of variables in your code match the types that you have specified. Popular tools:
Consider using AI-assisted coding
We can use AI as an assistant/apprentice:
Code completion
Write a test based on an implementation
Write an implementation based on a test
Or we can use AI as a mentor:
Explain a concept
Improve code
Show a different (possibly better) way of implementing the same thing
AI tools open up a box of questions which are beyond our scope here
Legal
Ethical
Privacy
Lock-in/ monopolies
Lack of diversity
Will we still need to learn programming?
How will it affect learning and teaching programming?
Debugging with print statements
Print-debugging is a simple, effective, and popular way to debug your code like this:
print(f"file_path: {file_path}")
Or more elaborate:
print(f"I am in function count_unique_words and the value of file_path is {file_path}")
But there can be better alternatives:
Often you can avoid using indices
Especially people coming to Python from other languages tend to use indices where they are not needed. Indices can be error-prone (off-by-one errors and reading/writing past the end of the collection).
Iterating
scores = [13, 5, 2, 3, 4, 3]
for i in range(len(scores)):
print(scores[i])
scores = [13, 5, 2, 3, 4, 3]
for score in scores:
print(score)
Enumerate if you need the index
particle_masses = [7.0, 2.2, 1.4, 8.1, 0.9]
for i in range(len(particle_masses)):
print(f"Particle {i} has mass {particle_masses[i]}")
particle_masses = [7.0, 2.2, 1.4, 8.1, 0.9]
for i, mass in enumerate(particle_masses):
print(f"Particle {i} has mass {mass}")
Zip if you need to iterate over two collections
persons = ["Alice", "Bob", "Charlie", "David", "Eve"]
favorite_ice_creams = ["vanilla", "chocolate", "strawberry", "mint", "chocolate"]
for i in range(len(persons)):
print(f"{persons[i]} likes {favorite_ice_creams[i]} ice cream")
persons = ["Alice", "Bob", "Charlie", "David", "Eve"]
favorite_ice_creams = ["vanilla", "chocolate", "strawberry", "mint", "chocolate"]
for person, flavor in zip(persons, favorite_ice_creams):
print(f"{person} likes {flavor} ice cream")
Unpacking
coordinates = (0.1, 0.2, 0.3)
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
coordinates = (0.1, 0.2, 0.3)
x, y, z = coordinates
Prefer catch-all unpacking over indexing/slicing
scores = [13, 5, 2, 3, 4, 3]
sorted_scores = sorted(scores)
smallest = sorted_scores[0]
rest = sorted_scores[1:-1]
largest = sorted_scores[-1]
print(smallest, rest, largest)
# Output: 2 [3, 3, 4, 5] 13
scores = [13, 5, 2, 3, 4, 3]
sorted_scores = sorted(scores)
smallest, *rest, largest = sorted(scores)
print(smallest, rest, largest)
# Output: 2 [3, 3, 4, 5] 13
List comprehensions, map, and filter instead of loops
string_numbers = ["1", "2", "3", "4", "5"]
integer_numbers = []
for element in string_numbers:
integer_numbers.append(int(element))
print(integer_numbers)
# Output: [1, 2, 3, 4, 5]
string_numbers = ["1", "2", "3", "4", "5"]
integer_numbers = [int(element) for element in string_numbers]
print(integer_numbers)
# Output: [1, 2, 3, 4, 5]
string_numbers = ["1", "2", "3", "4", "5"]
integer_numbers = list(map(int, string_numbers))
print(integer_numbers)
# Output: [1, 2, 3, 4, 5]
def is_even(number: int) -> bool:
return number % 2 == 0
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = []
for number in numbers:
if is_even(number):
even_numbers.append(number)
print(even_numbers)
# Output: [2, 4, 6]
def is_even(number: int) -> bool:
return number % 2 == 0
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = [number for number in numbers if is_even(number)]
print(even_numbers)
# Output: [2, 4, 6]
def is_even(number: int) -> bool:
return number % 2 == 0
numbers = [1, 2, 3, 4, 5, 6]
even_numbers = list(filter(is_even, numbers))
print(even_numbers)
# Output: [2, 4, 6]
Know your collections
How to choose the right collection type:
Ordered and modifiable:
list
Fixed and (rather) immutable:
tuple
Key-value pairs:
dict
Dictionary with default values:
defaultdict
fromcollections
Members are unique, no duplicates:
set
Optimized operations at both ends:
deque
fromcollections
Cyclical iteration:
cycle
fromitertools
Adding/removing elements in the middle: Create a linked list (e.g. using a dictionary or a dataclass)
Priority queue:
heapq
librarySearch in sorted collections:
bisect
library
What to avoid:
Need to add/remove elements at the beginning or in the middle? Don’t use a list.
Need to make sure that elements are unique? Don’t use a list.
Making functions more ergonomic
Less error-prone API functions and fewer backwards-incompatible changes by enforcing keyword-only arguments:
def send_message(*, message: str, recipient: str) -> None: print(f"Sending to {recipient}: {message}")
Use dataclasses or named tuples or dictionaries instead of too many input or output arguments.
Docstrings instead of comments:
def send_message(*, message: str, recipient: str) -> None: """ Sends a message to a recipient. Parameters: - message (str): The content of the message. - recipient (str): The name of the person receiving the message. """ print(f"Sending to {recipient}: {message}")
Consider using
DeprecationWarning
from thewarnings
module for deprecating functions or arguments.
Iterating
When working with large lists or large data sets, consider using generators or iterators instead of lists. Discuss and compare these two:
even_numbers1 = [number for number in range(10000000) if number % 2 == 0] even_numbers2 = (number for number in range(10000000) if number % 2 == 0)
Beware of functions which iterate over the same collection multiple times. With generators, you can iterate only once.
Know about
itertools
which provides a lot of functions for working with iterators.
Dataclasses
Dataclasses are often a good alternative to regular classes:
class Point:
def __init__(self, x, y, z):
self.x = x
self.y = y
self.z = z
def __repr__(self):
return f"Point(x={self.x}, y={self.y}, z={self.z})"
def __eq__(self, other):
if not isinstance(other, Point):
return NotImplemented
return self.x == other.x and self.y == other.y and self.z == other.z
from dataclasses import dataclass
@dataclass
class Point:
x: float
y: float
z: float
from collections import namedtuple
Point = namedtuple("Point", ["x", "y", "z"])
Use relative paths and pathlib
Scripts that read data from absolute paths are not portable and typically break when shared with a colleague or support help desk or reused by the next student/PhD student/postdoc.
pathlib
is a modern and portable way to handle paths in Python.
Project structure
As your project grows from a simple script, you should consider organizing your code into modules and packages.
Function too long? Consider splitting it into multiple functions.
File too long? Consider splitting it into multiple files.
Difficult to name a function or file? It might be doing too much or unrelated things.
If your script can be imported into other scripts, wrap your main function in a
if __name__ == "__main__":
block:def main(): ... if __name__ == "__main__": main()
Why this construct? You can try to either import or run the following script:
if __name__ == "__main__": print("I am being run as a script") # importing will not run this part else: print("I am being imported")
Try to have all code inside some function. This can make it easier to understand, test, and reuse. It can also help Python to free up memory when the function is done.
Reading and writing files
Good construct to know to read a file:
with open("input.txt", "r") as file: for line in file: print(line)
Reading a huge data file? Read and process it in chunks or buffered or use a library which does it for you.
On supercomputers, avoid reading and writing thousands of small files.
For input files, consider using standard formats like CSV, YAML, or TOML - then you don’t need to write a parser.
Use subprocess instead of os.system
Many things can go wrong when launching external processes from Python. The
subprocess
module is the recommended way to do this.os.system
is not portable and not secure enough.
Parallelizing
Use one of the many libraries:
multiprocessing
,mpi4py
, Dask, Parsl, …Identify independent tasks.
More often than not, you can convert an expensive loop into a command-line tool and parallelize it using workflow management tools like Snakemake.