We clone an example repository.
This example contains two scripts that represent, in a very simplified way, a real data processing pipeline.
The script count.py counts the 10 most frequent characters in a text read from standard input:
$ cat data/lorem.in | ./count.py
This produces:
i 42
e 38
t 32
o 29
u 29
a 29
n 24
l 22
r 22
d 19
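To make the counting step concrete, here is a minimal sketch of what a script like count.py might do. It is not the repository's actual code: the function name top_characters is hypothetical, and counting only alphabetic characters is an assumption suggested by the sample output, which lists only letters.

```python
from collections import Counter


def top_characters(text, n=10):
    """Return the n most frequent alphabetic characters with their counts."""
    # Assumption: only letters are counted; whitespace and punctuation are ignored.
    counts = Counter(ch for ch in text if ch.isalpha())
    return counts.most_common(n)


# Small demonstration on a fixed string instead of standard input.
for char, count in top_characters("aaaa bbb cc", n=3):
    print(char, count)
```

A real script would pass sys.stdin.read() to such a function so that it can sit in a shell pipeline as shown above.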
The output of this script can be piped into a second script, plot.py, which “plots” the result as a bar “chart”:
$ cat data/lorem.in | ./count.py | ./plot.py
This produces:
i #####################
e ###################
t ################
o ##############
u ##############
a ##############
n ############
l ###########
r ###########
d #########
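The plotting step can be sketched in the same spirit. This is a hypothetical stand-in for plot.py, not the repository's code: the function name bar_chart and the parameter counts_per_hash are made up for illustration, and the scaling of one "#" per two occurrences (rounded down) is an assumption that appears consistent with the sample output above.

```python
def bar_chart(lines, counts_per_hash=2):
    """Turn 'char count' lines into 'char ####' lines."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        char, count = line.split()
        # Assumption: one '#' per counts_per_hash occurrences, rounded down.
        rows.append(f"{char} {'#' * (int(count) // counts_per_hash)}")
    return rows


# Small demonstration on three of the counts shown earlier.
for row in bar_chart(["i 42", "e 38", "d 19"]):
    print(row)
```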
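Both scripts are extremely simplified stand-ins for real-life processing steps.
To automate such a pipeline we can use Make. A Makefile consists of rules of the following form: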
# rule
target: dependencies
	command(s)

# another rule
target: dependencies
	command(s)

# ...
Each command line must start with a tab character, not spaces!
Create a file called Makefile with the following content (mind the tabs):
all: data/lorem.tmp data/lorem.out

data/lorem.tmp: data/lorem.in
	cat data/lorem.in | ./count.py > data/lorem.tmp

data/lorem.out: data/lorem.tmp
	cat data/lorem.tmp | ./plot.py > data/lorem.out
And test it:
$ make
Run make again without modifying the input text. How does Make know whether the outputs need to be rebuilt?
Also try “updating” the input file with touch:
$ touch data/*.in
$ make
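The behavior observed in these experiments can be approximated by comparing file modification times: a target is rebuilt when it does not exist or when any dependency is newer than it. The following is a simplified sketch of that idea, not Make's actual implementation; the function name needs_rebuild is hypothetical and the file names are only borrowed from the example.

```python
import os
import tempfile


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "lorem.in")
    target = os.path.join(tmp, "lorem.tmp")
    for path in (source, target):
        open(path, "w").close()

    # Target is newer than its dependency: nothing to do.
    os.utime(source, (1000, 1000))
    os.utime(target, (2000, 2000))
    print(needs_rebuild(target, [source]))  # False

    # "touch" the source so that it becomes newer than the target.
    os.utime(source, (3000, 3000))
    print(needs_rebuild(target, [source]))  # True
```

This is why touch triggers a rebuild even though the content of the input file is unchanged: only the timestamp matters in this model.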
Now we will try a more sophisticated example:
SRCS = $(wildcard data/*.in)
OBJS = $(patsubst %.in,%.tmp,$(SRCS))
OBJS += $(patsubst %.in,%.out,$(SRCS))
all: $(OBJS)
# otherwise intermediate tmp files would be deleted
.PRECIOUS: %.tmp

%.tmp: %.in
	cat $< | ./count.py > $@

%.out: %.tmp
	cat $< | ./plot.py > $@
Discuss the changes and the motivations behind these changes.
Now we have a Makefile which can process thousands of files. It will also discover which files need to be rebuilt if we modify inputs.
In this simple example the dependency chains do not branch, but real-life projects can have complex dependency trees.
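To see what the SRCS and OBJS variables in the Makefile above evaluate to, the file-name manipulation can be mimicked in Python. This is only an illustrative sketch: patsubst_suffix is a hypothetical helper that handles the simple "%.old to %.new" case used here, not the full pattern syntax of Make's patsubst function.

```python
import glob


def patsubst_suffix(old_suffix, new_suffix, names):
    """Mimic $(patsubst %old,%new,names) for simple suffix patterns."""
    result = []
    for name in names:
        if name.endswith(old_suffix):
            result.append(name[: len(name) - len(old_suffix)] + new_suffix)
        else:
            result.append(name)  # names that do not match are kept unchanged
    return result


# Roughly what $(wildcard data/*.in) collects (empty if no such files exist).
srcs = sorted(glob.glob("data/*.in"))

# Every input contributes one .tmp and one .out target.
objs = patsubst_suffix(".in", ".tmp", srcs) + patsubst_suffix(".in", ".out", srcs)
print(objs)
```

Because the target list is computed from whatever files are present, adding a new input file to data/ requires no change to the Makefile.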
Now let us simulate a situation in which the processing takes a long time.
To do this, edit count.py and plot.py and insert an artificial pause (maybe 2 or 3 seconds) by changing num_seconds_sleep.
Then time the entire processing:
$ touch data/*.in
$ time make
cat data/lorem.in | ./count.py > data/lorem.tmp
cat data/lorem.tmp | ./plot.py > data/lorem.out
cat data/shakespeare.in | ./count.py > data/shakespeare.tmp
cat data/shakespeare.tmp | ./plot.py > data/shakespeare.out
cat data/faust.in | ./count.py > data/faust.tmp
cat data/faust.tmp | ./plot.py > data/faust.out
make 0.22s user 0.04s system 2% cpu 12.289 total
Now, if your laptop has 4 cores (adapt the number accordingly), try:
$ touch data/*.in
$ time make -j4
cat data/lorem.in | ./count.py > data/lorem.tmp
cat data/lorem.tmp | ./plot.py > data/lorem.out
cat data/shakespeare.in | ./count.py > data/shakespeare.tmp
cat data/shakespeare.tmp | ./plot.py > data/shakespeare.out
cat data/faust.in | ./count.py > data/faust.tmp
cat data/faust.tmp | ./plot.py > data/faust.out
make -j4 0.27s user 0.03s system 7% cpu 4.131 total
Discuss what just happened.
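One way to reason about the two timings: with a pause of about 2 seconds per script, 3 files times 2 steps times 2 seconds gives roughly 12 seconds when everything runs one after another. With -j4, the three files are independent and can be processed at the same time, but within each file the plot step must still wait for the count step, so the total is roughly 2 steps times 2 seconds, about 4 seconds. The sketch below imitates this with threads and short sleeps. It is a simplified model, not how Make schedules jobs internally; process_file and run are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_file(name, pause):
    """Stand-in for one file's chain: count, then plot (each just sleeps)."""
    time.sleep(pause)  # count step: .in -> .tmp
    time.sleep(pause)  # plot step: .tmp -> .out, must wait for the count step
    return name


def run(names, jobs, pause):
    """Process all files with at most `jobs` chains at once; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        list(pool.map(lambda n: process_file(n, pause), names))
    return time.perf_counter() - start


names = ["lorem", "shakespeare", "faust"]
print(f"1 job:  {run(names, 1, 0.1):.1f} s")
print(f"4 jobs: {run(names, 4, 0.1):.1f} s")
```

With three files, more than three parallel jobs bring no further gain in this model, because the two steps of each chain cannot overlap.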
Makefiles express targets, rules, and dependencies.