Pragmatic Version Control Using Git

The following is an extract from the Pragmatic Bookshelf title Pragmatic Version Control Using Git by Travis Swicegood.

Extracted from Chapter 2: Version Control the Git Way

Simply put, a version control system (VCS) is a methodology or tool that helps you keep track of changes you make to the files in your project. In its simplest, manual form, a VCS is just you creating a copy of the file you're working with and adding the date and time to the end of it.
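To make the manual approach concrete, here is a minimal sketch of it in Python. The function name `save_manual_version`, the file name, and the timestamp format are all assumptions chosen for illustration; nothing here comes from a real VCS.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def save_manual_version(path, now=None):
    """Copy `path` to a sibling file whose name ends in a date-and-time stamp."""
    source = Path(path)
    if now is None:
        now = datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M%S")
    # Appending the stamp to the full file name mirrors "adding the
    # date and time to the end of it".
    backup = source.with_name(f"{source.name}.{stamp}")
    shutil.copy2(source, backup)  # copy2 also preserves the file's timestamps
    return backup


# Demo in a throwaway directory; the file name and the date are made up.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"
    report.write_text("first draft\n")
    backup = save_manual_version(report, now=datetime(2008, 12, 1, 9, 30, 0))
    backup_name = backup.name
    backup_text = backup.read_text()

print(backup_name)  # report.txt.20081201-093000
```

Every saved version is a full copy that you have to remember to make, name, and clean up yourself, which is exactly the chore a VCS tool takes over.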

Being pragmatic, we want something that will help automate that process. This is where VCS tools come in. They track all of the changes for us, keeping a copy of every change made to the code in our projects.

Distributed version control systems (DVCS) are no different. Their main goal is still to help us track changes we make to the projects we're working on. The difference between VCS and DVCS is how developers communicate their changes to each other.

In this chapter, we'll explore just what a VCS is and how a DVCS, Git in particular, differs from the traditional, centralized model. You'll learn:

All of these ideas revolve around the repository, so let's start there.

The Repository

The repository is where the version control system keeps track of all of the changes you make. Most store the current state of the code, along with when each change was made, who made it, and a text log message that explains why they made the change.
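As a rough illustration of what gets recorded with each change, the sketch below models a repository as a list of revisions. `Revision` and `ToyRepository` are hypothetical names invented for this example; real tools, Git included, store this information quite differently.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    files: dict          # file name -> content: the state of the code
    author: str          # who made the change
    message: str         # the log message explaining why
    timestamp: datetime  # when the change was made


@dataclass
class ToyRepository:
    history: list = field(default_factory=list)

    def commit(self, files, author, message):
        # Store a copy so later edits to `files` cannot rewrite history.
        revision = Revision(dict(files), author, message,
                            datetime.now(timezone.utc))
        self.history.append(revision)
        return revision


repo = ToyRepository()
working_files = {"hello.txt": "Hello\n"}
repo.commit(working_files, "Jane", "Add greeting")

working_files["hello.txt"] = "Hello, world\n"
repo.commit(working_files, "Joe", "Greet the whole world")

for rev in repo.history:
    print(rev.author, "-", rev.message)
```

The first revision still holds the original `hello.txt` after the second commit, which is the property that lets you go back and see what changed.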

Originally, these repositories were only accessible if you had access to the machine they were stored on. This model didn't scale, so tools such as CVS and later Subversion were created to remove that bottleneck. They use a network connection to transmit changes to and from the repository.

Subversion and CVS follow a centralized repository model. The centralized model, which is shown in Figure 1, gives each developer their own copy of the current code in the repository. They make changes to their copy, and when those changes are ready to share, they send them back to the central repository so it can create a new revision.

Figure 1. Centralized Repository Model

This model scales better than the old one, but only so far. It still requires that you maintain a network connection to the central repository in order to track the changes that you make.

What if you are somewhere that doesn't have a reliable Internet connection? If you hop on a plane, you have no chance of an Internet connection while you're in the air. And what about that funky little coffee shop down the street with no wifi?

This is where distributed version control systems (DVCS) come into play. With these, each developer has their own copy of the repository. Making a change to your copy affects just your repository.
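The independence of each copy can be sketched in a few lines. This is a deliberately tiny model, assuming a repository is nothing more than an ordered list of log messages; the developer names and messages are made up.

```python
import copy

# A repository reduced to its bare essentials: an ordered list of log messages.
central = ["Initial import", "Add build file"]

# Each developer starts from a full copy of the history, not just the latest files.
jane = copy.deepcopy(central)
joe = copy.deepcopy(central)

# Committing is a local operation: only Jane's copy grows.
jane.append("Fix typo in README")

print(len(jane), len(joe), len(central))  # 3 2 2
```

Nothing reaches Joe or the original until Jane decides to share it, and no network connection is involved in the commit itself.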

Git follows this DVCS model but provides a number of ways to keep your repository in sync with everyone else's on the team. You can treat it like a simple centralized repository, as with Subversion or CVS, you can create a complex distributed model based on meritocracy, or you can settle on something in between.

Figure 2 shows what a decentralized model might look like. Every developer just shares their changes directly through their own public repository.

Figure 2. Fully Decentralized Repository

Of course, keeping everyone in sync on even a small team of developers can be time-consuming with this model. Often, one person will be charged with keeping track of all of the changes. Everyone else on the team just uses that person's repository to keep up to date on the latest features.

This model is considered decentralized because there is no one central repository that will have everything. By convention, a team can accept that the team lead has the latest version, but nothing is stopping Joe or Jane from sharing changes directly.

The second way is through a shared repository. Figure 3 shows how this would look on a small team. It is very similar to the standard centralized model we talked about earlier. The difference is that each developer still has their own repository to track their changes.

Figure 3. Simple Shared Repository

The shared repository model scales well. Each team of developers can maintain their own public shared repository. The teams pull from other teams' repositories as necessary.

Sitting above all of the teams is the authoritative repository, the repository within the company that has the latest release-ready code. Each team pushes their finished changes back to this main repository. The code from this repository is then used to build releases. The diagram in Figure 4 shows how this would look.

Figure 4. Multiple Shared Repositories

These are the most common ways to structure a repository, and they give you a good jumping-off point for structuring your own. Be sure to check the Joe Asks note "What repository layout should I use?" for more information on choosing a layout.

However you decide to structure your repositories, with Git each developer will have their own private repository to keep track of their changes. They only need access to their own computer, and they are ready to start making changes. Now that you know what a repository is and how repositories work in Git, let's look at how to decide what to store in them.

Joe asks:
What repository layout should I use?

“Shared, decentralized, some hybrid of the two. All of these decisions are making my head spin. How do I know which repository layout is right for me?”

There's no simple answer. Each layout has its pluses and minuses. For people coming to Git from a traditional centralized repository such as Subversion or CVS, the shared model will be the most familiar and possibly the easiest for everyone to understand.

For developers who have never used version control or who want to change completely, I recommend a hybrid of the fully decentralized and shared methods. In it, there is one main repository that has the latest release-ready code.

Every developer will have their own public repository to share their changes, but the team leads are responsible for pushing their team's finished changes back to the main repository.

This gives the team leads an extra opportunity to review the code that was created by their team before sharing it with everyone else. It makes sure there are multiple checks along the way so only the most ready code makes it into the main repository.

Since there is no wrong or right way to handle repositories in Git, though, experiment with a couple of different methods. You might find that your team thrives in a fully decentralized environment, or a shared repository might be all you need.

What Should We Store?

The short answer: everything. The slightly longer answer is everything that you need to work on your project. Your repository should have a copy of everything you need to modify and enhance your project and to build new versions of it.

The first and most obvious thing you should store in it is your project's source code. Without that, you can't fix bugs or implement new features.

Most projects have some sort of build files. Two common ones are Makefiles and Ant build.xml files. These need to be stored so you can compile your source code into something usable.

Other common items to store in your repository are sample configuration files, documentation, images that are used in the application, and of course your unit tests.

Determining what to include is easy. Ask yourself, "If I didn't have X, could I do my work on this project?" If the answer is no, X should be included.

Like all good rules, this one has an exception: it doesn't apply to the tools you use to do the work. You should include the Ant build.xml file, but not the entire Ant program.

It's not a hard exception, though. Sometimes storing a copy of Ant, JUnit, or some other program in your repository can make sure that the entire team is using the same version of those tools. These should be stored separately from your project, however.

Working Trees

So far we've discussed the repository and talked about all of the files that you're storing in it, but we haven't talked about where you make all of your changes. This happens in your working tree.

Some VCS refer to this as your working copy, but in Git it's the working tree. Git treats your entire history as a “tree” of changes. The working tree is the current view of the tree that you're working with.

The working tree contains the files from your repository: the source code, build files, unit tests, and so on.

People coming to Git for the first time from another VCS often have trouble telling the working tree and the repository apart. In a version control system like Subversion, your repository exists “over there” on another server.

In Git, “over there” just means inside the .git directory inside your project's directory on your local computer. This means you can look at the history of the repository and see what's changed without having to communicate with the repository on another server.
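One way to see the relationship is to look for the `.git` directory yourself. The sketch below walks upward from a starting directory until it finds one, which is similar in spirit to how Git locates the repository when you run a command from a subdirectory. The function name `find_repository_root` and the project layout are assumptions for illustration, and the sketch ignores details such as Git's environment variables.

```python
import tempfile
from pathlib import Path


def find_repository_root(start):
    """Return the nearest directory at or above `start` that contains `.git`, else None."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        # `.git` is usually a directory; exists() also accepts the less
        # common case where it is a file pointing elsewhere.
        if (candidate / ".git").exists():
            return candidate
    return None


# Demo with a fake project layout; the directory names are made up.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "myproject"
    (project / ".git").mkdir(parents=True)        # stands in for the repository
    (project / "src" / "lib").mkdir(parents=True)  # part of the working tree
    found = find_repository_root(project / "src" / "lib")
    found_matches = found == project

print(found_matches)  # True
```

Everything under `myproject` except `.git` plays the role of the working tree; the history lives inside `.git`, right next to the files you edit.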

So how do you get this working tree in the first place? Well, you can start your own project and then add Git's repository to it; or you can clone an existing repository.

Cloning makes a copy of another repository and then checks out a copy of its master branch, its main line of development. That checkout becomes your working tree. We'll talk more about cloning repositories in the section "Cloning a Remote Repository."

Of course, a VCS is all about tracking changes. So far, we've talked about repositories and your working tree, which is your current view of your repository, but we haven't talked about those changes yet. Now we'll cover that.