Managing Huge Repositories with Git
Linus Torvalds created Git in the mid-2000s to solve a problem that other open source version control systems of the time could not: being distributed, reliable and fast.
As he mentions in this Google tech talk on Git, Git was created out of necessity for the Linux project. At the time the talk was given, Git was very young and people were getting used to it. It seemed to solve all the problems faced by version control software, and this contributed to its meteoric rise.
Fast forward a few years, and people started noticing the first real flaw in Git: it was difficult to handle very large repositories. How large are we talking here? It's Facebook large. Facebook's team projected that in a few years, a simple git status would take up to half a minute to show the result, as Facebook adds a large number of commits from thousands of developers every day. Facebook shifted its whole codebase to Mercurial, and its team actively started contributing to Mercurial to meet Facebook's needs.
Where did Git fail? A Mercurial contributor, Prasoon Shukla, comparing how Git and Mercurial scale, attributes this to the way the two systems store commits. Mercurial manages a couple of objects (or files) for each file in your repository, whereas Git creates new objects for each commit. As the number of commits increases, the number of objects in Mercurial remains constant, whereas in Git it grows linearly. So when you run a simple git status command, Git has to sift through all these objects, which takes a considerable amount of time (in spite of the high efficiency of Git).
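To see this growth for yourself, here's a minimal sketch that makes three commits in a throwaway repository and prints the object count after each one. The file name and identity values are placeholders I've made up for the demo. In this setup each commit adds a commit object, a tree object and a blob object.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Placeholder identity so the commits work on a machine with no global config
git config user.name "Example User"
git config user.email "user@example.com"

for i in 1 2 3; do
  echo "change $i" > notes.txt
  git add notes.txt
  git commit -q -m "Commit $i"
  git count-objects   # the object count goes up after every commit
done

# Keep only the leading number from "<n> objects, <k> kilobytes"
objects=$(git count-objects | cut -d' ' -f1)
echo "loose objects after 3 commits: $objects"
```

Running git count-objects -v in one of your own repositories gives a fuller picture, including objects that have already been packed.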
Another area where Git might fall short is managing large binary files in the repository. Git's diff and merge tools are line-based, so they can't interpret a change in the content of a binary file, and its delta compression often finds little to reuse between two versions of a compressed or encoded binary. As a result, the size of the repository increases with every commit that changes such a file, because Git ends up storing close to the full binary each time, rather than just the change from the last version.
Over the years, Git's developers have tried to solve these problems. Third-party services have also come up with solutions to enable Git to manage larger repositories, such as GitHub's Large File Storage extension.
This post looks at techniques for handling large repositories in Git, whether the problem is a large history, large binary files, or both.
Projects with a Large Number of Commits
I'll first look at a few ways to manage repositories with large histories more efficiently.
Shallow clone of repositories
As mentioned earlier, the primary reason why projects with large histories slow down is the huge number of commits. In a distributed system like Git, when you clone a repository, its full project history gets downloaded. However, Git provides a way to specify the number of commits you want to have in your clone of a project. This is known as a shallow clone. When you get the number of commits down, your Git operations run faster.
To perform a shallow clone, add the --depth option, with the number of commits you want, to the git clone command:
git clone --depth [number_of_commits] [url_of_remote]
In earlier versions of Git, there was limited support for shallow clones. If your truncated history didn't stretch back far enough, you weren't allowed to push or pull. However, with the release of Git 1.9.0, support for shallow clones improved significantly.
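Here's a small sketch you can run locally to watch a shallow clone in action. It builds a throwaway "remote" with five commits, clones it with a depth of 2, and later pulls in the rest of the history with git fetch --unshallow. The directory names and identity values are made up for the demo. I've used a file:// URL because Git ignores --depth when cloning from a plain local path.

```shell
set -e
work=$(mktemp -d)

# Build a throwaway "remote" with five commits
git init -q "$work/origin"
cd "$work/origin"
git config user.name "Example User"
git config user.email "user@example.com"
for i in 1 2 3 4 5; do
  echo "line $i" >> log.txt
  git add log.txt
  git commit -q -m "Commit $i"
done

# Shallow clone: keep only the two most recent commits
cd "$work"
git clone -q --depth 2 "file://$work/origin" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "commits in shallow clone: $shallow_count"

# If you need the full history later, convert the clone in place
git -C shallow fetch -q --unshallow
full_count=$(git -C shallow rev-list --count HEAD)
echo "commits after --unshallow: $full_count"
```

The second half is worth remembering: a shallow clone isn't a one-way decision, as you can deepen it whenever the missing history becomes relevant.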
Clone a single branch
When you clone a repository, all the branches in the remote get downloaded. (If you run git branch in a newly cloned repository, it shows only the master branch. You should run git branch -a to list all the branches that were a part of the remote.) It's probable that many of the commits present in other branches are irrelevant to one developer's work. Therefore, you can clone just the master or the branch relevant to your development. Doing so significantly reduces the number of commits that make up the history of the cloned version, especially if branches in the repository have divergent histories.
To clone only a single branch of a remote, you can run the following command:
git clone [url_of_remote] --branch [branch_name] --single-branch
This command instructs Git to clone only the branch_name branch from the remote.
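As a rough illustration, the sketch below sets up a throwaway remote whose default branch and a feature branch have diverged, then clones only feature. The branch, file and directory names are invented for the demo. Comparing commit counts shows that the work on the default branch never reaches the clone.

```shell
set -e
work=$(mktemp -d)

# Throwaway "remote" with a default branch and a diverging feature branch
git init -q "$work/origin"
cd "$work/origin"
git config user.name "Example User"
git config user.email "user@example.com"
echo "base" > README
git add README
git commit -q -m "Initial commit"

git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "Work on feature"

git checkout -q -            # back to the default branch
echo "more" >> README
git commit -q -am "More work on the default branch"

# Clone only the feature branch
cd "$work"
git clone -q --branch feature --single-branch "file://$work/origin" feature-only
git -C feature-only branch -r   # lists the remote-tracking branches that were fetched

origin_commits=$(git -C origin rev-list --count --all)
clone_commits=$(git -C feature-only rev-list --count --all)
echo "commits in remote: $origin_commits, commits in clone: $clone_commits"
```

You can combine this with --depth as well, which is handy when even a single branch carries more history than you need.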
Projects With Large Files
The next problem that arises is the presence of large binary files (as opposed to traditional text files). Git has no meaningful way to diff a binary file, so in practice each changed version is stored more or less in full. If the binary files are large (like 3D models or graphic designs), the size of the repository increases considerably with every commit that changes them.
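To get a feel for the effect, here's a hedged sketch that commits two versions of a 1 MiB file filled with random bytes (a stand-in for a real binary asset, with a made-up file name), packs the repository, and reports the pack size. Random data is a worst case, as nothing in it can be compressed or shared between versions, so the pack ends up holding both copies in full.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Commit two unrelated "versions" of a 1 MiB binary file
for version in 1 2; do
  head -c 1048576 /dev/urandom > asset.bin
  git add asset.bin
  git commit -q -m "Asset version $version"
done

# Pack everything so Git gets a chance to delta-compress the two versions
git gc -q
pack_kib=$(git count-objects -v | awk '/^size-pack:/ {print $2}')
echo "pack size after two versions: ${pack_kib} KiB"
```

Real binaries usually fare somewhat better than random data, but the trend is the same: the repository grows by roughly the size of the file every time it changes.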
One way out of the problem of large files is to use submodules, which enable you to manage one Git repository within another. You can create a submodule, which contains all your binary files, keeping the rest of the code separately in the parent repository, and update the submodule only when necessary. This logically separates the core part of your project from the large files, and helps in managing them separately.
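A minimal sketch of that setup might look like the following. The repository names (assets and project) and file names are hypothetical. The -c protocol.file.allow=always override is only there because this demo adds the submodule from a local file:// URL, which recent Git versions block by default. You wouldn't need it for a submodule hosted over HTTPS or SSH.

```shell
set -e
work=$(mktemp -d)

# Hypothetical repository that holds only the large binary files
git init -q "$work/assets"
cd "$work/assets"
git config user.name "Example User"
git config user.email "user@example.com"
head -c 1024 /dev/urandom > model.bin
git add model.bin
git commit -q -m "Add binary asset"

# Hypothetical parent repository that holds the code
git init -q "$work/project"
cd "$work/project"
git config user.name "Example User"
git config user.email "user@example.com"
echo "core code lives here" > README
git add README
git commit -q -m "Initial commit"

# Attach the assets repository as a submodule in the assets/ directory
git -c protocol.file.allow=always submodule add "file://$work/assets" assets
git commit -q -m "Add assets submodule"

git submodule status   # shows the assets commit the parent is pinned to
cat .gitmodules        # records the submodule's path and URL
```

The parent repository stores only a reference to a specific commit of assets, so its own history stays small no matter how often the binaries change.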
Using a third-party extension
There are many extensions for Git, built by other developers, to handle large files. One option is to use git-annex, which allows you to manage files in Git without checking the contents of the file into Git. Another extension (the development of which has now been stopped) is git-bigfiles.
GitHub recently launched Git Large File Storage, an open source extension for Git, to manage large binary files in Git. LFS stores these large files on a remote server like GitHub, whereas only text pointers are stored in your Git repository. SitePoint recently published a tutorial on how to get started with Git LFS.
In a short period of time, Git LFS has gained popularity, signaling that it provides a good way of handling such large binaries in Git.
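Under the hood, Git LFS decides which files to handle through a .gitattributes entry. With the extension installed, you'd normally run git lfs install once and then git lfs track "*.psd", which writes that entry for you. The sketch below writes the same line by hand so that it runs even without the extension, and then asks Git which filter applies to a hypothetical design.psd file.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# The line that `git lfs track "*.psd"` adds to .gitattributes,
# written by hand here so the sketch doesn't need Git LFS installed
echo '*.psd filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# Ask Git which filter it would apply to a (hypothetical) design.psd
filter=$(git check-attr filter -- design.psd)
echo "$filter"   # design.psd: filter: lfs
```

Remember to commit .gitattributes along with your files. Once the extension is installed, what gets committed for a matching file is a small text pointer recording a spec version, a SHA-256 object ID and the file's size.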
It has been a long time since Facebook announced its move to Mercurial (although it still continues to use Git for side projects like ReactJS). It's good to see Git developers and third-party developers have both reacted positively to it and come up with innovative solutions to the problems at hand. If you're thinking about learning version control, I'd recommend you go for Git—as the future is definitely bright!
If you'd like to learn more about Git and its amazing powers, check out Shaumik's new book Jump Start Git, published right here at SitePoint!