Exploring Git

I think Git may reasonably be my favorite piece of software ever written. I love the C# Programming Language, and I’m a huge fan of TypeScript. Visual Studio is pretty cool, and so is Azure, but at the center of nearly all the development work I do (including this blog), I use Git. It’s simply amazing. I’ve spent a ton of time deconstructing my own commit history, ranging from the simplest linear commit graphs to graphs including several merges (particularly, multiple merges on experimental development branches). This post explores some of the features that I feel are most pertinent to understanding and using Git.

I’m preparing to give a presentation on Git at work in the near future, and felt that blogging about Git would help me get into the proper headspace for the eventual presentation (which is in about a week). Part of the reason why I write this blog is to explore ideas and research gaps in my own knowledge, and I hope that both you and I benefit from this effort.

I’ll just get this out of the way up front. My favorite features in Git are:

Fantastic command-line interface
Porcelain and plumbing commands
The fact that Git is distributed. I’ll never see centralized systems the same after using Git
The brilliant, light-weight branching model
Tags. They’re amazing

Some people may not appreciate it, but I tend to really enjoy the command-line interface in Git. Perhaps it’s because I accepted that a GUI would just never offer the same power as the CLI, or perhaps it’s because I had to dive into the CLI early on with Git, when I made my first Open-Source contribution. Whatever the reason, the Git command-line just makes me happy. I think it’s very well thought-out, and personally find it to be a very productive experience.

The next item on the list is porcelain and plumbing commands. These are actually very interesting, and expose the true power of Git to any user. Porcelain commands are those “top-level” commands, like add and commit, that you normally use on the Git CLI, but they’re really just sugar on top of a bunch of plumbing commands (low-level Git sub-commands that do all the work). The hallmark of this whole approach is that it’s very light-weight and Unix-like (which is not surprising, given Git’s origins). It’s trivial to do things like obtain a revision (commit) list for a branch.
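To make the porcelain/plumbing split a little more concrete, here’s a minimal sketch that builds a throwaway repository and lists its history both ways. The temporary directory, the user details, and the commit messages are placeholders I made up for the illustration.

```shell
set -e

# Scratch repository in a temporary directory (all names are placeholders)
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Two empty commits, just to have some history to list
git commit -q --allow-empty -m "First commit"
git commit -q --allow-empty -m "Second commit"

# Porcelain: human-friendly view of the history
git log --oneline

# Plumbing: one full commit hash per line, newest first
git rev-list HEAD

# Plumbing output composes nicely with other Unix tools
commit_count="$(git rev-list HEAD | wc -l | tr -d ' ')"
echo "commits: $commit_count"
```

Because rev-list prints nothing but hashes, it’s easy to pipe into wc, grep, xargs, or your own scripts, which is the Unix-like quality I’m talking about.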

If you’re not familiar with what it means for Git to be distributed, allow me to explain. Many version control systems employ a single, central database, that all work gets checked into. This database is the “master,” and any operation involving the history of a project must be queried through the central system. This means that the database API must be either made available over the public internet, or that you need a VPN to access the system. Since I initially adopted Git, I’ve found centralized version control systems to be terribly unproductive and highly frustrating. Git supports a fully-offline workflow: the only time you need to connect to any other instance is when performing synchronization (such as getting the latest patches). Every Git repository is simply a node on a network, much the same way as your computer or your phone is a node in a larger network.

Git being distributed doesn’t mean that there aren’t shared Git servers: several platforms, such as GitHub, Azure DevOps, and GitLab, provide centralized hosting systems. These are fantastic for enabling collaboration, but the power of Git being distributed is that these centralized servers are not required for basic operations, such as viewing the commit history. A significant detail about this is that you’re only ever guaranteed to view a snapshot of the history: unless your repository instance is somehow designated as being authoritative (which usually occurs through community consensus), your history is just a snapshot. This is a great feature, though; because no single instance is inherently authoritative, any instance can be authoritative. If the agreed-upon Git server somehow gets corrupted, you just need to re-push a clone of the repository. You can manage Git repository backups by simply synchronizing a Git mirror on a nightly basis, and your backup is done: no third-party software, no additional licensing required. How cool is that?!
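Here’s a rough sketch of what that mirror-based backup could look like. I’m using a local directory called source as a stand-in for the shared server (in practice that would be a URL), and backup.git is just a name I picked for the mirror.

```shell
set -e

work="$(mktemp -d)"
cd "$work"

# A stand-in for the shared repository (placeholder; normally a remote URL)
git init -q source
git -C source config user.name "Example User"
git -C source config user.email "user@example.com"
git -C source commit -q --allow-empty -m "Initial commit"

# One-time setup: a bare mirror that copies every ref
git clone -q --mirror source backup.git

# New work lands in the shared repository...
git -C source commit -q --allow-empty -m "More work"

# ...and the nightly job only needs to refresh the mirror
git -C backup.git remote update --prune

source_head="$(git -C source rev-parse HEAD)"
backup_head="$(git -C backup.git rev-parse HEAD)"
echo "source: $source_head"
echo "backup: $backup_head"
```

The remote update line is the only part you’d need to schedule. If the shared server is ever lost, the mirror holds every branch and tag it had at the time of the last refresh.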

Before I get much further into this post, I would like to establish what degree of experience I expect a lot of readers will have. I’ll assume that you have a basic understanding, and possibly some experience, using some or all of the following Git features:

Cloning repositories. This is a common first step for many Git newcomers
Pushing to and pulling from Git remotes (servers)
Making and committing changes to a Git repository
Potentially, creating Git branches, and possibly merging them with other branches

If you haven’t used any of these, but have used other version control systems, then I believe you’ll still be able to follow along with most of what I have to say here. With that disclaimer out of the way, let’s dig in!

An Anatomical View of Commits in Git

There are three primary objects that Git uses for tracking your content:

Blobs (Binary Large Objects)
Trees
Commits

A blob is very simply the raw, binary content of a file. This can be plain text, compiled code, or whatever: it’s just the content of a file. An important feature of Git is that it stores each blob object one and only one time. It’s rather simple to demonstrate this, using the following Git commands:

# Create a scratch dir, and change directories to it
mkdir ./example
cd ./example

# Convert this directory into a Git repository
git init

# Create dummy files
echo Hello, World! > file1.txt
echo Hello, World! > file2.txt
echo Hello, World! > file3.txt
echo Hello, World! > file4.txt
echo Hello, World! > file5.txt

# Stage and commit the files
git add file*.txt
git commit -m "Add example files"

# Show the commit in raw diff format (one line per file, with blob hashes)
git show --raw

# Move back up to the original directory
cd ..

# For Windows (cmd.exe) users
rmdir /S /Q example

# For Bash users
rm -rf ./example

The result of running git show --raw should be similar to this:

commit 58eb8d5000f0767549e200098407c2c939f545d1
Author: John Doe <john@example.com>
Date:   Wed May 8 22:44:05 2019 -0400

    Add example files

:000000 100644 0000000... 8ab686e... A  file1.txt
:000000 100644 0000000... 8ab686e... A  file2.txt
:000000 100644 0000000... 8ab686e... A  file3.txt
:000000 100644 0000000... 8ab686e... A  file4.txt
:000000 100644 0000000... 8ab686e... A  file5.txt

The part of this command’s output that says 0000000... 8ab686e... A tells us a few things:

The SHA1 hash of the original object. In this case, there is no original object (hence the 0000000... part)
The SHA1 hash of the new object. This is the 8ab686e... part, which is an abbreviated hash
The A part tells us that an object was added (or created, or some other synonym)

This demonstrates that Git stores discrete content one and only one time. Each of these files contains the exact same content, given from the “Hello, World!” listings in the example above. It stores this content in an internal objects directory (.git/objects), where the first two characters of the object hash (8a in this case) are the directory name, and the remainder of the object hash is the file name. Git doesn’t attempt to store blobs multiple times, because it wouldn’t be terribly economical.

One note about this example: if you’re using Windows, and specifically Windows PowerShell, you may get a different blob hash. This has to do with how things like newline characters and text encodings work across different environments. (The commit hash will be different for everyone, because it also covers the author, the committer, and the timestamps.)

Blobs aren’t the only objects that get stored one-and-only-one time. Every single object that Git creates gets hashed, and the hash is used as a unique identifier. If Git finds that an object already exists, it simply re-uses it.
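If you’d like to see the de-duplication without making a commit at all, the hash-object plumbing command will do it. This is a small sketch in a throwaway repository; the file names are arbitrary, and I’m assuming a freshly created repository where new objects are stored as loose files.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init -q

# Two files with byte-for-byte identical content
printf 'Hello, World!\n' > file1.txt
printf 'Hello, World!\n' > file2.txt

# hash-object -w writes the blob to the object database and prints its ID
id1="$(git hash-object -w file1.txt)"
id2="$(git hash-object -w file2.txt)"
echo "$id1"
echo "$id2"

# Identical content, identical ID, and a single object file on disk:
# the first two characters are the directory, the rest is the file name
prefix="$(printf '%s' "$id1" | cut -c1-2)"
rest="$(printf '%s' "$id1" | cut -c3-)"
ls ".git/objects/$prefix/$rest"

# The blob holds content only; the file names are not part of it
git cat-file -p "$id1"
```

With the same content and line endings as the earlier example, both IDs should start with the same 8ab686e prefix shown in the git show --raw output.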

Git doesn’t associate blobs directly with files. Instead, it creates trees. A tree is a Git representation of a directory structure. Trees are how Git links blob instances with relative paths (for a repository), as well as mapping sub-trees. Here’s what a sample tree object might look like:

100644 blob 5edff5abac3f31cd8ff26045151c7278e9503e5e    .editorconfig
100644 blob 45c150536e5f3888554c294f27539c5d41072467    .gitignore
040000 tree 9ead4dcd336627a532665b0f91504ff505607982    src

The first part is the file mode (100644 for a regular file, 040000 for a directory). Git uses it when it re-constructs the tree on the file system. The next part tells you whether the item is a tree or a blob. As noted, blobs are stored one-and-only-one time in Git, and trees are how they get associated with files on the filesystem. Next, we see the SHA1 hash of the object. Finally, we have the name of the object. As you can see above, elements of a tree are listed in sorted order by name (.editorconfig, .gitignore, src). When a tree is the child of another tree, we’d refer to it as a sub-tree. This intrinsically means that trees in Git are recursive, just like they are on your filesystem.
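You can print a listing like the one above for any commit using the ls-tree plumbing command. The following is a sketch against a throwaway repository; the file names and contents are invented to resemble the sample tree.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# A tiny layout: two files at the root and one inside a sub-directory
mkdir src
printf 'root = true\n' > .editorconfig
printf 'bin/\n' > .gitignore
printf 'int main(void) { return 0; }\n' > src/main.c
git add .
git commit -q -m "Add sample layout"

# Root tree of the latest commit: mode, type, hash, name
git ls-tree HEAD

# The src entry is itself a tree; list what it contains
git ls-tree HEAD:src

# -r walks the sub-trees recursively and lists only the blobs
tracked="$(git ls-tree -r --name-only HEAD)"
echo "$tracked"
```

The first listing shows src as a tree entry, and the second shows the blob inside it, which is the recursion I described.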

This is all integrated through the commit object, which has a few attributes:

A single tree hash, which refers to the root of the repository
Zero or more parent commits (none for the very first commit, one for an ordinary commit, two or more for a merge)
Author info (the person that created the commit)
Committer info (the person that applied the commit to the repository)
A commit message

Commit messages are like e-mails: they have a subject line, followed by a message body. Usually, I’ll type the message body as either a bulleted list, or in paragraph form. I also use markdown syntax, because several tools will at some point display commit messages the same way. Here’s a sample commit:

tree a34f789016317ef654ba2839f33bd3b8cbb8352c
parent 541fed384c76d0f94db213b230300cddba8b1e89
author John Doe <john@example.com> 1557289111 -0400
committer John Doe <john@example.com> 1557289111 -0400

Example commit: this is the "subject" line

Message body. I'm using a paragraph style here, but for many changes, will
employ a bulleted-list, similar to:

 - One space before the "bullet" token
 - A token, such as `-` or `*`
 - One space after the token
 - The text, using a pseudo-sentence style (without ending punctuation)

Here, you see the tree <hash> line, which records the hash of the root tree. Next, there’s that parent <hash> thing. That’s the parent commit. For merge commits, you’ll see at least two parents, but Git can easily merge more than just two commits. The author info tells us who created the commit, their contact details, and a Unix timestamp for when the commit was made. The committer info tells us who actually applied the commit to the commit graph (committed, merged - which is still committing - rebased, etc.). I formatted this message body to explain the remainder, and how I generally structure my commits.
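To see how little there is to a commit, you can assemble one by hand from plumbing commands. This is only a sketch: the repository, user details, and messages are placeholders, and the commits are never attached to a branch.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'Hello, World!\n' > file1.txt
git add file1.txt

# Snapshot the index as a tree object
tree="$(git write-tree)"

# Wrap the tree in a commit; with no -p option it has no parent
first="$(git commit-tree -m "Root commit" "$tree")"

# A second commit that reuses the same tree and names the first as its parent
second="$(git commit-tree -p "$first" -m "Child commit" "$tree")"

# Raw commit object: tree, parent, author, committer, blank line, message
git cat-file -p "$second"

# The parent and tree recorded in the second commit
recorded_parent="$(git rev-parse "$second^")"
recorded_tree="$(git rev-parse "$second^{tree}")"
```

Printing the first commit with cat-file -p shows the same fields minus the parent line, since it’s the root of this little history.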

And that’s pretty much all you “need” to know about commits. Git commits are beautifully straightforward! Next, we’ll review branches and tags.

Git Branches

I mentioned that one of my favorite things about Git is the branching model. In many version control systems, a branch is literally a cloned folder from some point in the history. This means you end up with additional folder paths on your hard disk, and unfortunately, the common systems I’ve worked with (SVN, TFVC) store the branch as a second directory on the server. If you only have two branches, then this isn’t terrible (though it’s still not great), but when you scale up to having many branches, you end up with lots of duplicated folders. Even worse, a lot of the branches are “sticky” in your history, and never really go away.

Which leads me to why I love Git branches. A Git branch consists of a very simple object: it’s just a file in Git’s private repository directory, under .git/refs/heads. Most repositories will have a branch called master. If you navigate to .git/refs/heads/ from your repository root, you’ll see a file called master. When I examine this file in the repository I’m currently working in, it looks like this:

3afad76c02773ddf753a13e05821ad0537560c3a

There’s a little plumbing command called cat-file we can use to query what this hash refers to:

> git cat-file -t 3afad76c02773ddf753a13e05821ad0537560c3a
commit

That commit is the same commit I referred to earlier in this post. The content is:

> git cat-file -p 3afad76c02773ddf753a13e05821ad0537560c3a
tree a34f789016317ef654ba2839f33bd3b8cbb8352c
parent 541fed384c76d0f94db213b230300cddba8b1e89
author John Doe <john@example.com> 1557289111 -0400
committer John Doe <john@example.com> 1557289111 -0400

Example commit: this is the "subject" line

Message body. I'm using a paragraph style here, but for many changes, will employ
a bulleted-list, similar to:

 - One space before the "bullet" token
 - A token, such as `-` or `*`
 - One space after the token
 - The text, using a pseudo-sentence style (usually, without ending punctuation)

In Git, you navigate between branches using the git checkout command. When you perform a checkout, Git performs essentially these steps:

Open the branch file and obtain the commit hash
Locate the commit in the history, and open its tree object
Recursively delete any files or directories not listed in the tree
Recursively re-instantiate all other files and directories in the tree

So what am I getting at here about branches? Well, let’s recap:

A branch is just a file stored in Git’s private directory
A branch file just records the SHA1 hash of a commit
When you navigate branches (using the checkout command), Git reconstructs that commit recursively down the tree

The consequence of these facts is that a Git branch is nothing more than a commit pointer. There are no additional directories, such as with other version control systems. Branches are “non-persistent”: the branch name doesn’t stick around in history after the branch has been created and destroyed. And this is why I love branches in Git. “Destroying” a branch just means that you’ve deleted it (using the git branch command with either the -d option, which refuses to delete a branch that hasn’t been fully merged, or the -D option, which forces the deletion). The work itself isn’t purged from history, though: the commits you made on a branch stay in the Git history (assuming you’re not re-writing commits), but only if you choose to publish them by merging the branch (for this post, we’ll forget about things like publishing via the git push command).
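Here’s a short sketch of a branch being created, inspected, and destroyed. The branch name experiment is a placeholder, and I’m assuming the default file-based ref storage, where a freshly created branch is a loose file under .git/refs/heads.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

# Creating a branch writes one small file (the name is a placeholder)
git branch experiment
ref_file_content="$(cat .git/refs/heads/experiment)"
echo "$ref_file_content"

# The file content is simply the hash of the commit the branch points at
branch_target="$(git rev-parse experiment)"
head_commit="$(git rev-parse HEAD)"

# Deleting the branch removes the pointer; the commit itself is untouched
git branch -d experiment
git cat-file -t "$head_commit"
```

Nothing gets copied when the branch is created, and nothing but the pointer is removed when it’s deleted.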

Git Tags

We’ve covered a whole lot of ground here. We’ve discussed the most basic elements in Git, including blobs, trees, and commits. We’ve reviewed how commits are linked together to form a commit graph. And finally, we’ve talked about branches, and why I personally feel Git’s branching model is superior to other systems I’ve used. The last concept I want to review is tags, because I feel they complete the story in terms of Git repository concepts. If you’re coming from SVN, these should be familiar. If you’re coming from TFVC, the analogue in TFVC is a “label.”

So what are tags, and why do I feel they’re important? Well, a tag is just an alias for a commit … or, more succinctly, a commit pointer. That’s all. Nothing crazy, and nothing special.

But you just said that a branch is just a commit pointer. And now you’re telling me that a tag is nothing more than a commit pointer too? What gives?! If branches and tags are both “just commit pointers,” then why the distinction, and why should I choose one over the other?

Perhaps you’re not having this insane monologue with yourself. If not, I apologize. In any case, I’m going to attempt to answer all these questions as succinctly as I know how.

Tags are just named commits, or commit-pointers, and branches are just commit pointers. In each case, they can point at any commit in the repository’s history. So how are they different from each other? It’s actually quite simple: you can write new commits to Git branches, but you can’t write new commits to a Git tag. Here’s a short list of things you might do with tags:

Create tags using the git tag <name> [commit-id] operation
Overwrite an existing tag (using the -f option)
Delete a tag (using the -d option)

You can also optionally add a description to a tag using the -a option, which creates an annotated tag.

There is one interesting point about tags: if you run the cat-file sub-command with the -t option and a tag name, Git will report that the object is a tag, but only if you’ve annotated the tag. Otherwise, the tag name resolves directly to the commit, and cat-file reports commit. An annotated tag is stored as its own object, which includes some additional attributes that are not found on “normal” (lightweight) tags:

object 541fed384c76d0f94db213b230300cddba8b1e89
type commit
tag example-tag
tagger John Doe <john@example.com> 1557293448 -0400

This is an example tag, which is annotated.

The first line tells us that we’re pointing at an object with (abbreviated) ID 541fed. The next line tells us that object is a commit. The rest is pretty self-explanatory.
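The difference between the two kinds of tags is easy to check for yourself. In this sketch, light-tag and annotated-tag are names I made up, and the repository is a throwaway.

```shell
set -e

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

# A lightweight tag and an annotated tag on the same commit
git tag light-tag
git tag -a annotated-tag -m "This is an example tag, which is annotated."

# The lightweight tag resolves straight to the commit;
# the annotated tag resolves to a separate tag object
light_type="$(git cat-file -t light-tag)"
annotated_type="$(git cat-file -t annotated-tag)"
echo "$light_type $annotated_type"

# Print the tag object itself: object, type, tag, tagger, message
git cat-file -p annotated-tag

# Peeling either name with ^{commit} lands on the same commit
light_commit="$(git rev-parse 'light-tag^{commit}')"
annotated_commit="$(git rev-parse 'annotated-tag^{commit}')"
```

Either tag name works anywhere Git expects a commit, because Git follows the annotated tag object through to the commit it points at.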

So when should you use a tag, and when should you use a branch? My general guidance is that you use branches when you’re continuously incrementing the history, such as the case with having a shared develop branch and a release-quality master branch, or when you have two parallel branches that are equivalent to dual master branches. An example of why you might choose to do this is that you’re maintaining a library or a framework of some kind, you’re planning to make a breaking change, and you expect to patch both the version following the change and the previous major version with subsequent minor patches (I’ve had to do this before).

Unlike branches, tags are more retrospective in nature. We use tags to identify events that have occurred in a repository’s history. One common use of tags is the identification of a project’s releases in its history. You can use them for other things too, but I’m having a hard time recalling other use-cases for tagging. In any case, if you just want to name an event in your repository’s history, that’s what you use tags for. They’re more useful when accompanied by annotations, but naming any special event in history is better than leaving it to guesswork.

The final piece I’ll say about tags is this: just because you don’t have a branch, doesn’t mean that you can’t continue incrementing the history for a tag. In fact, it’s fairly trivial, but beyond the scope of things I’d like to cover in this post.

Wrapping Up

So I’ve talked about a whole bunch of stuff on here. I hope this broadens your view of Git. I didn’t bother trying to talk about everything there is to know to get started with Git, and mostly assume that you have some experience using Git. If I’m mistaken in that assumption, then I’d recommend you go create an account with a service that will host your Git repositories for you. I personally like and recommend GitHub and Azure DevOps, but have heard great things about GitLab. Once you’ve got an account, I’d recommend learning how to use all the following:

git clone
git pull
git add
git commit
git push

Once you feel you’ve got the hang of using Git, this post may be worth re-reading. There are also great videos on YouTube and other places that go into much greater depth than I have here. Finally, I can’t recommend the Pro Git book enough, which you can read for free on the Git website.

I hope this post has helped improve your understanding of how Git works. Now, get out there and start committing!

- Brian