Concepts
Embarrassingly parallel code
Embarrassingly parallel code is code that essentially runs the same functionality for a large variety of input parameters, where each step does not depend on anything other than the input data. Here are two types of situations in pseudo-code that show embarrassinlgy parallel execution:
data = load_and_preprocess_data
for i in range(len(data)) do:
res[i] = calculate(data[i])
or
parameters1 = [ 1, 2, 3, 4, 5, 6 ]
parameters2 = [ "a", "b", "c", "d" ]
res = []
for p1 in parameters1:
for p2 in parameters2:
res.append(calculate(p1, p2))
The first being a pipeline applied to every datapoint in some input data, the second a classical (hyper-)parameter sweep for a model. Essentially you are running the same type of computation for each entry of a set of data. This is the type of code that can easily be parallelized on a scheduler level.
Size of jobs
In general, jobs should not be too short on a cluster. There is always an overhead in scheduling tasks, and if jobs of an array job are too short, this overhead can result in a waste of resources. If you have code that runs in a second but needs to be run millions of times, try to batch executions into batches, such that each individual job takes 10-60 minutes. This way, the overhead of starting and scheduling the jobs is reduced significantly.