Managing Huge Repositories with Git

By Shaumik Daityari

Linus Torvalds created Git in the mid-2000s to solve a problem that other open source version control systems of the time could not: to be distributed, reliable and fast.

As he mentions in this Google tech talk on Git, Git was created out of necessity for the Linux project. At the time the talk was given, Git was very young and people were getting used to it. It seemed to solve all the problems faced by version control software, and this contributed to its meteoric rise.

Git's Shortcomings

Fast forward a few years, and people started noticing the first real flaw in Git: it was difficult to handle very large repositories. How large are we talking here? It's Facebook large. Facebook's team projected that in a few years, a simple git status would take up to half a minute to show the result, as Facebook adds a large number of commits from thousands of developers every day. Facebook shifted its whole codebase to Mercurial, and its team actively started contributing to Mercurial to meet Facebook's needs.

Where did Git fail? Prasoon Shukla, a Mercurial contributor, attributes the difference in scaling to the way Mercurial and Git store commits. Mercurial manages a couple of objects (or files) for each file in your repository, whereas Git creates new objects for each commit. As the number of commits grows, the number of objects in Mercurial therefore remains constant, whereas in Git it increases linearly. As a result, when you run a simple git status command, Git has to sift through all these objects, which takes a considerable amount of time (in spite of Git's high efficiency).

Another area where Git might fall short is managing large binary files in the repository. Git is built around tracking changes in text files, and it can't meaningfully interpret a change in the content of a binary file. As a result, the size of the repository increases with every commit that modifies a binary, because Git has to store the whole new binary rather than just the change from the last version.

Over the years, developers of Git have tried to solve these problems. Third-party services have also come up with solutions that enable Git to manage larger repositories, such as GitHub's Large File Storage extension.

This post looks at techniques for handling Git repositories that have large histories, large binary files, or both.

Projects with a Large Number of Commits

I'll first look at a few ways to manage repositories with large histories more efficiently.

Shallow clone of repositories

As mentioned earlier, the primary reason why projects with large histories slow down is the huge number of commits. In a distributed system like Git, when you clone a repository, its full project history gets downloaded. However, Git provides a way to specify the number of commits you want to have in your clone of a project. This is known as a shallow clone. When you get the number of commits down, your Git operations run faster.

To perform a shallow clone, add the --depth option, with the number of commits you want, to the clone command:

git clone --depth [number_of_commits] [url_of_remote]

In earlier versions of Git, there was limited support for shallow clones. If your truncated history didn't stretch back far enough, you weren't allowed to push or pull. However, with the release of Git 1.9.0, support for shallow clones improved significantly.
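To see the effect of --depth without touching a real project, here's a minimal sketch that builds a throwaway repository with five commits and then shallow-clones it. The directory names (origin, shallow), the demo identity and the commit messages are made up for illustration. The file:// prefix is only there because Git ignores --depth when cloning from a plain local path.

```shell
#!/bin/sh
# Sketch: compare a shallow clone with the full history of a throwaway repo.
set -e
work=$(mktemp -d)

# Build a hypothetical "remote" with five commits.
git init -q "$work/origin"
cd "$work/origin"
git config user.email "demo@example.com"
git config user.name "Demo"
for i in 1 2 3 4 5; do
  echo "$i" > file.txt
  git add file.txt
  git commit -q -m "commit $i"
done

# Clone only the two most recent commits.
cd "$work"
git clone -q --depth 2 "file://$work/origin" shallow
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "commits in shallow clone: $shallow_count"

# If the full history is needed later, it can be fetched on demand.
git -C shallow fetch -q --unshallow
full_count=$(git -C shallow rev-list --count HEAD)
echo "commits after --unshallow: $full_count"
```

A shallow clone isn't a one-way decision: git fetch --unshallow turns it back into a full clone whenever the older history becomes relevant.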

Clone a single branch

When you clone a repository, all the branches in the remote get downloaded. (If you run git branch in a newly cloned repository, it shows only the master branch. You should run git branch -a to list all the branches that were a part of the remote.) It's probable that many of the commits present in other branches are irrelevant to one developer's work. Therefore, you can clone just the master or the branch relevant to your development. Doing so significantly reduces the number of commits that make up the history of the cloned version, especially if branches in the repository have divergent histories.

To clone only a single branch of a remote, you can run the following command:

git clone [url_of_remote] --branch [branch_name] --single-branch

This command instructs Git to clone only the branch_name branch from the remote.
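As a rough illustration, the sketch below creates a throwaway repository with two branches and clones only one of them. The branch names (main, feature) and directory names are assumptions made for the demo, not something your project needs to match.

```shell
#!/bin/sh
# Sketch: clone a single branch, then add a second branch later on.
set -e
work=$(mktemp -d)

# Build a hypothetical "remote" with two branches.
git init -q "$work/origin"
cd "$work/origin"
git config user.email "demo@example.com"
git config user.name "Demo"
echo base > file.txt
git add file.txt
git commit -q -m "base"
git branch -M main            # make the branch name predictable for the demo
git checkout -q -b feature
echo feature > feature.txt
git add feature.txt
git commit -q -m "feature work"
git checkout -q main

# Clone only the main branch.
cd "$work"
git clone -q --branch main --single-branch "file://$work/origin" single
# Count remote-tracking branches, ignoring the symbolic origin/HEAD entry.
before_count=$(git -C single branch -r | grep -v -e '->' | wc -l | tr -d ' ')
echo "remote branches after single-branch clone: $before_count"

# Later, start tracking one more branch and fetch it.
git -C single remote set-branches --add origin feature
git -C single fetch -q origin
after_count=$(git -C single branch -r | grep -v -e '->' | wc -l | tr -d ' ')
echo "remote branches after adding feature: $after_count"
```

The git remote set-branches --add step shows that a single-branch clone can be widened one branch at a time, so you only download the history you actually need.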

Projects With Large Files

The next problem that arises is the presence of large binary files (which are not traditional text files). Git can't track the changes inside a binary file the way it does with text, which is why any change in a binary file is stored as the whole binary file itself. If binary files are large (like 3D models, or graphic designs), the size of the repository increases considerably with every commit that changes them.

Using submodules

One way out of the problem of large files is to use submodules, which enable you to manage one Git repository within another. You can create a submodule, which contains all your binary files, keeping the rest of the code separately in the parent repository, and update the submodule only when necessary. This logically separates the core part of your project from the large files, and helps in managing them separately.
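Here's a minimal sketch of that layout, using two throwaway repositories. The names (assets, project, code-only, model.bin, main.py) are placeholders for the demo. The protocol.file.allow=always setting is only needed because the demo adds a submodule from a local path, which recent Git versions block by default; with a normal remote URL you wouldn't need it.

```shell
#!/bin/sh
# Sketch: keep large files in a submodule, separate from the code.
set -e
work=$(mktemp -d)

# Hypothetical repository that holds only the large binary files.
git init -q "$work/assets"
cd "$work/assets"
git config user.email "demo@example.com"
git config user.name "Demo"
printf 'binary-placeholder' > model.bin
git add model.bin
git commit -q -m "Add model"

# Parent repository with the code, plus the assets as a submodule.
git init -q "$work/project"
cd "$work/project"
git config user.email "demo@example.com"
git config user.name "Demo"
echo 'print("hello")' > main.py
git add main.py
git commit -q -m "Add code"
git -c protocol.file.allow=always submodule --quiet add "$work/assets" assets
git commit -q -m "Add assets submodule"
submodule_path=$(git config --file .gitmodules --get submodule.assets.path)

# A plain clone of the parent gets the code but not the large files.
git clone -q "$work/project" "$work/code-only"
if [ -e "$work/code-only/assets/model.bin" ]; then before=present; else before=absent; fi
echo "model.bin after plain clone: $before"

# The submodule contents are fetched only when asked for.
git -c protocol.file.allow=always -C "$work/code-only" submodule --quiet update --init
if [ -e "$work/code-only/assets/model.bin" ]; then after=present; else after=absent; fi
echo "model.bin after submodule update: $after"
```

Developers who never touch the binaries can skip the submodule update step entirely and work with the code-only clone.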

Using a third-party extension

There are many extensions for Git, built by other developers, to handle large files. One option is to use git-annex, which allows you to manage files in Git without checking the contents of the file into Git. Another extension (the development of which has now been stopped) is git-bigfiles.

GitHub recently launched Git Large File Storage (LFS), an open source extension for Git, to manage large binary files. LFS stores these large files on a remote server such as GitHub, while only text pointers are stored in your Git repository. SitePoint recently published a tutorial on how to get started with Git LFS.

In a short period of time, Git LFS has gained popularity, signaling that it provides a good way of handling such large binaries in Git.
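As a rough sketch of what setting up LFS looks like, the script below tells LFS to handle Photoshop files in a throwaway repository. It assumes the git-lfs extension is installed separately and skips the LFS steps if it isn't. The *.psd pattern and the design directory name are just examples.

```shell
#!/bin/sh
# Sketch: route *.psd files through Git LFS in a throwaway repository.
set -e
work=$(mktemp -d)
git init -q "$work/design"
cd "$work/design"

lfs_available=no
if git lfs version >/dev/null 2>&1; then
  lfs_available=yes
  # Set up the LFS filters for this repository only.
  git lfs install --local
  # Record the pattern in .gitattributes so matching files go through LFS.
  git lfs track "*.psd"
  git add .gitattributes
  cat .gitattributes
else
  echo "git-lfs is not installed; install it to try this example"
fi
```

The tracking rule lives in .gitattributes, which is committed like any other file, so everyone who clones the repository picks up the same LFS patterns.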

Final Thoughts

It has been a long time since Facebook announced its move to Mercurial (although it still continues to use Git for side projects like ReactJS). It's good to see Git developers and third-party developers have both reacted positively to it and come up with innovative solutions to the problems at hand. If you're thinking about learning version control, I'd recommend you go for Git—as the future is definitely bright!


If you'd like to learn more about Git and its amazing powers, check out Shaumik's new book Jump Start Git, published right here at SitePoint!

  • Understand Git’s core philosophy.
  • Get started with Git: install it, learn the basic commands, and set up your first project.
  • Work with Git as part of a collaborative team.
  • Use Git’s debugging tools for maximum debug efficiency.
  • Take control with Git’s advanced features: reflog, rebase, stash, and more.
  • Use Git with cloud-based Git repository host services like GitHub and Bitbucket.
  • See how Git’s used effectively on large open-source projects.

  • btistaa

    how to manage a large number of repos?

  • http://www.hexedit.com Andrew Phillips

    One way to better handle binary files is to store them as text instead, if the tool that creates them supports it. Git can then delta the text.

    For example, Delphi resource files are binary by default but there is an option to store them as text. Another example is to save Word documents as .DOCX (xml) files rather than .DOC (binary).
