Git under the hood

Objectives

Verify that branches are pointers to commits and are extremely lightweight.

Instructor note

10 min teaching/type-along
15 min exercise

Down the rabbit hole

In everyday work with Git you should not need to go inside .git, but in this exercise we will, in order to learn how branches are implemented.

For this exercise create a new repository and commit a couple of changes.

Now that we’ve made a couple of commits let us look at what is happening under the hood.

$ cd .git
$ ls -l

drwxr-xr-x   - user 25 Aug 15:51 branches
.rw-r--r-- 499 user 25 Aug 15:52 COMMIT_EDITMSG
.rw-r--r--  92 user 25 Aug 15:51 config
.rw-r--r--  73 user 25 Aug 15:51 description
.rw-r--r--  21 user 25 Aug 15:51 HEAD
drwxr-xr-x   - user 25 Aug 15:51 hooks
.rw-r--r-- 137 user 25 Aug 15:52 index
drwxr-xr-x   - user 25 Aug 15:51 info
drwxr-xr-x   - user 25 Aug 15:52 logs
drwxr-xr-x   - user 25 Aug 15:52 objects
drwxr-xr-x   - user 25 Aug 15:51 refs

Git stores everything under the .git folder in your repository. In fact, the .git directory is the Git repository.

Previously, when you wrote commit messages in your text editor, each message was saved to COMMIT_EDITMSG (the file holds only the most recent one).

Each commit in Git is stored as an object. The commit object contains information about the author and the commit message. It references a tree object that lists the files present in the directory at the time, and the tree in turn references “blobs” that record the state of each file.

Commits are referenced by a SHA-1 hash (a 40-character hexadecimal string).

A commit inside Git. Image from the Pro Git book. License CC BY 3.0.

Once you have several commits, each commit object also records the hash of its parent (the previous commit). The commits form a directed acyclic graph (do not worry if the term is not familiar).

A commit and its parents. Image from the Pro Git book. License CC BY 3.0.

All branches and tags in Git are pointers to commits.
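The two claims above (a commit records its parent, and branches and tags are only pointers) can be checked with a few commands. This is a minimal sketch that works in a throwaway repository; the directory from mktemp, the file name file.txt, the tag name v0.1, the helper function and the author identity are all made up for illustration.

```shell
# Build a scratch repository with two commits (all names are hypothetical).
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is called main

# Small helper so the example does not depend on your global Git identity.
commit() { git -c user.name='Example' -c user.email='example@example.org' commit --quiet "$@"; }

echo one > file.txt
git add file.txt
commit -m "First"
echo two >> file.txt
commit -am "Second"
git tag v0.1                            # a lightweight tag on the second commit

# The second commit stores the hash of the first one in its "parent" line.
parent_line=$(git cat-file -p HEAD | grep '^parent ')
first_id=$(git rev-parse HEAD~1)
echo "$parent_line"
echo "parent $first_id"

# Every branch and tag is a name that resolves to a commit hash.
git for-each-ref --format='%(refname) -> %(objectname)'
```

The two echoed lines should be identical, and the last command should list refs/heads/main and refs/tags/v0.1 with the same hash.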

Git is basically a content-addressed storage system

CAS: “mechanism for storing information that can be retrieved based on its content, not its storage location”
Content address is the content digest (SHA-1 checksum)
Stored data does not change, so when we modify commits, we always create new commits. Git does not delete the old ones right away, which is why it is very hard to lose data once you have committed it.
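To see what "the address is the content digest" means in practice, the sketch below computes the ID of a small file in two ways: by asking Git, and by hashing the same bytes by hand. It assumes the default SHA-1 object format and the GNU sha1sum tool (on macOS, `shasum -a 1` should be a drop-in replacement). The variable names are made up for illustration.

```shell
# The content is the 6 bytes "hello\n".

# Way 1: ask Git for the object ID (nothing is written to any repository).
git_id=$(printf 'hello\n' | git hash-object --stdin)

# Way 2: SHA-1 over the header "blob <size in bytes>" plus a NUL byte,
# followed by the content itself.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "git hash-object: $git_id"
echo "manual SHA-1:    $manual_id"
```

Both lines should show the same 40-character hash. Any file with exactly this content gets this ID, in any repository, which is why identical files are stored only once.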

Let us poke a bit into raw objects! Start with:

$ git cat-file -p HEAD

Then explore the tree object, then the file object, etc. recursively using the hashes you see.
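If you would like a guided version of this walk, here is a minimal sketch that goes from the commit to its tree and on to a file. It uses a throwaway repository so that it is self-contained; the mktemp directory, the file notes.txt and the author identity are hypothetical. In your own repository you can skip the setup and run only the cat-file commands.

```shell
# Scratch repository with a single commit (all names are hypothetical).
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/main
echo "first line" > notes.txt
git add notes.txt
git -c user.name='Example' -c user.email='example@example.org' \
    commit --quiet -m "Add notes"

# 1. The commit object: tree hash, author, committer, message.
git cat-file -t HEAD
git cat-file -p HEAD

# 2. The tree object that the commit points to: one line per file or directory.
tree_id=$(git rev-parse 'HEAD^{tree}')
git cat-file -t "$tree_id"
git cat-file -p "$tree_id"

# 3. The blob that the tree entry for notes.txt points to: the file content.
blob_id=$(git rev-parse HEAD:notes.txt)
git cat-file -t "$blob_id"
git cat-file -p "$blob_id"
```

Here `-t` prints the object type (commit, tree or blob) and `-p` pretty-prints the object content.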

Demonstration: experimenting with branches

Let us lift the hood and create a few branches manually. The goal of this exercise is to create an “Aha!” moment and give us a good understanding of the underlying model.

We start in the working directory (run cd .. first if you are still inside .git), on the main branch, and create an idea branch:

$ git status

On branch main
nothing to commit, working tree clean

$ git switch --create idea

Switched to a new branch 'idea'

$ git branch

* idea
  main

Now let us go in:

$ cd .git
$ cd refs/heads
$ ls -l

.rw-r--r-- 41 user 25 Aug 15:54 idea
.rw-r--r-- 41 user 25 Aug 15:52 main

Let us check what the idea file looks like (do not worry if the hash is different):

$ cat idea

045e3db14740c60684d745e5fb891ae71e335611

Now let us replicate this file:

$ cp idea idea-2
$ cp idea idea-3
$ cp idea idea-4
$ cp idea idea-5

Let us go up two levels and inspect the file HEAD:

$ cd ../..
$ cat HEAD

ref: refs/heads/idea

Let us open this file in a text editor and change its content to:

ref: refs/heads/idea-3

Now we are ready for the aha moment! First let us go back to the working area:

$ cd ..

Now, which branch are we on?

$ git branch

  idea
  idea-2
* idea-3
  idea-4
  idea-5
  main
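Copying and editing files by hand works here because the refs are stored as individual files. Git can also pack refs into .git/packed-refs or use a different ref storage format, so outside of a demonstration it is safer to let Git write them. The sketch below repeats the experiment with the plumbing commands `git update-ref` and `git symbolic-ref`. It assumes a throwaway repository; the mktemp directory, the file idea.txt and the author identity are made up for illustration.

```shell
# Scratch repository with one commit on main (all names are hypothetical).
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/main
echo "draft" > idea.txt
git add idea.txt
git -c user.name='Example' -c user.email='example@example.org' \
    commit --quiet -m "Start"

# "Create a branch": write a new ref that holds the hash of the current commit.
git update-ref refs/heads/idea-3 "$(git rev-parse HEAD)"

# "Switch branch": point HEAD at the new ref. No files change in the working
# directory because both branches point to the same commit.
git symbolic-ref HEAD refs/heads/idea-3

git branch                       # idea-3 is now marked as the current branch
git symbolic-ref --short HEAD    # prints the branch name that HEAD points to

# With loose refs (an assumption about the ref storage), the whole branch is
# one small file: a 40-character hash plus a newline.
cat .git/refs/heads/idea-3
wc -c < .git/refs/heads/idea-3
```

In everyday work `git branch` and `git switch` do the same thing with extra safety checks. The point of the sketch is only that a branch costs about 41 bytes, no matter how large the repository is.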

Discussion

Discuss the findings with other course participants.