Varnish Software Blog

Speeding Up Git with HTTP Caching

Written by Alve Elde | 8/26/25 11:15 PM

In this blog post, we will take a look at how Git clients communicate with Git repositories and how we can employ caching to drastically increase the speed of Git operations.

Drastic as in this:

time git clone --depth 1 -q https://github.com/torvalds/linux.git

real    1m28,628s
user    0m31,416s
sys     0m6,501s

Versus this:

time git clone --depth 1 -q http://localhost:6081/torvalds/linux.git

real    0m13,723s
user    0m13,375s
sys     0m2,114s

That's right, in the right circumstances, caching Git can net you a 6.5x speedup!

I get that you as a developer don't clone the Linux kernel every day, but for the many workers in a CI/CD pipeline somewhere, reducing the time it takes to retrieve source code from Git repositories means the work completes faster. That really adds up over time.

Our goals are as follows:

  • Reduce CI/CD pipeline runtimes by speeding up Git clone
  • Reduce developer friction by speeding up Git fetch and Git pull
  • Safely cache both public and private Git repositories

And to do that without breaking everything, we have a couple of constraints:

  • Access control for private repositories must always be maintained
  • The Git cache must stay up-to-date with the upstream repositories

With the "what" and "why" out of the way, let's get into the "how".

A Brief Overview of Git Protocols

All Git clients communicate with repositories using the Git wire protocol. This protocol can run over 4 distinct transports: HTTP, SSH, Local, and Git. In practice, you are going to use either HTTP or SSH when communicating with a remote repository, and for reasons that will become apparent later, we will focus on Git over HTTP.

The Git HTTP transport comes in two flavors: the dumb protocol and the smart protocol. Can you tell that Git has a bit of a personality? The dumb protocol is more or less obsolete these days, so we are going to pretend it doesn't exist.

But we are not done yet; there are two distinct versions of the smart protocol: v1 (sometimes called v0) and v2. Protocol v2 is relatively new, with GitHub adding support in late 2018, and it is not all that different from v1. Since v1 is more straightforward, I'm going to stick with that for this blog post.

Dissecting a Git Clone

Let's start off with a practical example, here is a bog-standard Git clone:

alve@hyperion:~$ git clone https://github.com/varnishcache/varnish-cache.git
Cloning into 'varnish-cache'...
remote: Enumerating objects: 176943, done.
remote: Counting objects: 100% (1045/1045), done.
remote: Compressing objects: 100% (372/372), done.
remote: Total 176943 (delta 838), reused 699 (delta 673), pack-reused 175898 (from 3)
Receiving objects: 100% (176943/176943), 36.60 MiB | 3.12 MiB/s, done.
Resolving deltas: 100% (142147/142147), done.

When the clone is complete, we are left with a new and shiny clone of the Varnish Cache source code. Before we can start caching these operations, we first need to understand how the client and the server communicate.

By placing Varnish as a proxy between myself and GitHub, we can run the clone again, but this time direct it at the Varnish instance running on my local machine:

alve@hyperion:~$ git clone http://localhost:6084/varnishcache/varnish-cache.git
- - ✂ - -

Note: I'm not using Git's http.proxy configuration option because the Git client would establish an HTTP CONNECT tunnel to GitHub, which would prevent us from observing the individual HTTP calls.

Using varnishncsa to inspect the request URLs, we can see that there are actually only two HTTP requests made by the client:

alve@hyperion:~$ varnishncsa -F "%r"
GET http://localhost:6084/varnishcache/varnish-cache.git/info/refs?service=git-upload-pack HTTP/1.1
POST http://localhost:6084/varnishcache/varnish-cache.git/git-upload-pack HTTP/1.1

The first request GET /info/refs?service=git-upload-pack is our client asking the repository to list all its references, or in other words, “give me the tips of your branches and tags”. And the response from the repository is a long list of commit SHAs paired with Git references, like the following:

- - ✂ - -
003c7d20e995c3c98644eb1c58a136628b12e9f00a78 refs/heads/1.0
003c93e944c9f728a4b9da506e622592e4e3688a805c refs/heads/1.1
003cef2cbad5843a607236b45e5f50fa4318e0580e04 refs/heads/1.2
003c394ef395e023d42bc46c589a90d56e14f6133a41 refs/heads/2.0
- - ✂ - -

Note: If you are wondering about the first 4 characters of each line being the same, that's just the pkt-line format telling us how long each line is.
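To make the pkt-line framing concrete, here is a minimal parsing sketch in Python. The function name and the sample input are purely illustrative (a real info/refs response also starts with a service announcement and carries a capability list on the first ref line):

def parse_pkt_lines(data: bytes):
    # Split a pkt-line stream into its payload lines.
    lines = []
    pos = 0
    while pos < len(data):
        # The first 4 bytes of every pkt-line are its total length in hex,
        # including the 4-byte length prefix itself.
        length = int(data[pos:pos + 4], 16)
        if length == 0:  # "0000" is a flush-pkt that separates sections
            pos += 4
            continue
        lines.append(data[pos + 4:pos + length].rstrip(b"\n"))
        pos += length
    return lines

# "003c" is hex for 60: 4 (prefix) + 40 (SHA) + 1 (space) + 14 (ref name) + 1 (newline)
print(parse_pkt_lines(b"003c7d20e995c3c98644eb1c58a136628b12e9f00a78 refs/heads/1.0\n"))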

Now our Git client knows what the current state of the repository is, and has to decide which references it wants. Since we ran a Git clone, our client decides it wants all of them. So it sends a POST /git-upload-pack with a request body describing all the things it wants:

- - ✂ - -
0032want 7d20e995c3c98644eb1c58a136628b12e9f00a78
0032want 93e944c9f728a4b9da506e622592e4e3688a805c
0032want ef2cbad5843a607236b45e5f50fa4318e0580e04
0032want 394ef395e023d42bc46c589a90d56e14f6133a41
- - ✂ - -

Note: The list is quite a bit longer, but it’s very repetitive, so this is a small excerpt.

The request body for the POST /git-upload-pack request contains (some of) the commit hashes that were received from the GET /info/refs?service=git-upload-pack request. While the syntax differs slightly, it’s easy to understand that through this request I’m selecting the branches and tags I need as part of my Git clone command.
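To make that a bit more tangible, here is a small Python sketch of how such a request body could be assembled. It is a simplification: real clients advertise capabilities on the first want line (which is why that line is longer in actual traffic), and the negotiation can span multiple rounds. The helper names are mine, not Git's:

def pkt_line(text: str) -> bytes:
    # Prefix the payload with its total length (payload + 4-byte prefix) in hex
    payload = text.encode()
    return f"{len(payload) + 4:04x}".encode() + payload

def upload_pack_request(wants, haves=()):
    body = b"".join(pkt_line(f"want {sha}\n") for sha in wants)
    body += b"0000"  # flush-pkt: end of the want section
    body += b"".join(pkt_line(f"have {sha}\n") for sha in haves)
    body += pkt_line("done\n")  # tell the repository we are done negotiating
    return body

# A fresh clone has no "have" lines; each want line comes out as "0032want <sha>\n"
print(upload_pack_request(["7d20e995c3c98644eb1c58a136628b12e9f00a78"]))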

When the repository receives the client's wishlist, it wraps the requested objects up into one big packfile and sends it to the client. Our client decompresses the pack and resolves the deltas to build a Git repository on our local file system that we can actually use.

And that’s it! I’m glossing over all the things that happen on the client and server side before and after the transfer because it's not super relevant.

A Generation That Ignores History

There is a trick to speed up Git clones, even without any caching. Our Git clone example above cloned the entire repository, including its history. As the history gets longer, cloning the repository takes longer. While the history is important for developers working on the codebase, it is less relevant for transient workflows that just want a snapshot of the codebase. This is when you use the git clone --depth 1 trick that I used when cloning the Linux kernel at the top.

When you set --depth 1, the client’s POST /git-upload-pack request body will include the line deepen 1, which tells the repository that we only want the history at a depth of …one!
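Reusing the hypothetical pkt_line helper from the sketch above, the extra line is easy to reproduce:

# A shallow clone simply adds a deepen line to the want section of the request body
print(pkt_line("deepen 1\n"))  # b'000ddeepen 1\n'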

This is a trick typically employed by CI/CD pipelines to reduce runtime, and I would highly recommend using it where you can. Not only does it decrease the amount of data transferred, but it can significantly reduce the amount of CPU-heavy post-processing you need to do in the final delta resolution stage.

Git Fetch is Just a Less Greedy Clone

To keep your local clone up-to-date with the remote repository, you typically run a Git fetch (or a Git pull, which is just shorthand for Git fetch + Git merge). My own VS Code editor does this periodically in the background, as does my ArgoCD deployment.

I learn best when I can actually see the gears turning, so we are going to observe the fetch operation the same way we did with the clone. And to do that without having to actually push something to the remote repository, we can engage in a bit of Git time traveling:

git reset --hard <sha>
git update-ref refs/remotes/origin/<branch> <sha>
git reflog expire --expire=now --all
git gc --prune=now
git fetch

Replace <sha> with the commit SHA you want to travel back to, and <branch> with the branch you are currently on. Inspecting HTTP traffic between our client and server, we are met by a familiar couple of log lines:

GET http://localhost:6084/varnishcache/varnish-cache.git/info/refs?service=git-upload-pack HTTP/1.1
POST http://localhost:6084/varnishcache/varnish-cache.git/git-upload-pack HTTP/1.1

The same two requests we saw for Git clone are being used to fetch the latest changes from the repository. If we inspect the request body though, the content is quite different:

00a4want 51a117587524cbdd59e43567e6cbd5a76e6a39ff (capabilities omitted)
0000
0032have 8282cff4b31dce12e100d4d6c78d30b1f4689dd3
0032have be83e3dae8265fdc4c91f11d5778b20ceb4e2479
0032have 7d46abdf9c5a3f119f645c8de6d87efffe3889b8
0032have 66e26fc5b617b13f77f425a8e1674f43596b6f5f
- - ✂ - -

The first line is the client saying what it wants, and the subsequent lines signal which commits it already has. When the repository receives this message, it will find the commit the client wants and search back through the history until it finds a commit the client has. All the commits in-between are wrapped into a single packfile and sent to the client.
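As a toy illustration of that search (this is not how Git actually implements it), picture the history as a simple child-to-parent map and walk it until a "have" is hit:

# Illustrative only: collect the commits between a "want" and the newest "have"
def commits_to_send(want, haves, parents):
    to_send, cursor = [], want
    while cursor is not None and cursor not in haves:
        to_send.append(cursor)
        cursor = parents.get(cursor)  # follow first-parent history only, for simplicity
    return to_send

history = {"c4": "c3", "c3": "c2", "c2": "c1", "c1": None}
print(commits_to_send("c4", {"c2"}, history))  # ['c4', 'c3'] get packed and sent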

Unless your local repository is either tiny or very outdated, fetch operations will be much faster than the initial clone. But each fetch operation still consumes a bit of CPU on the repository side, and if you are self-hosting a busy Git server, you may already have experienced load-related issues.

Git Caching

Okay, now that we understand how Git clone and Git fetch work, we can finally talk about how to implement caching at the HTTP level.

Let's get /info/refs?service=git-upload-pack out of the way first. Caching the response to this request for any significant amount of time would lead to issues. Pipelines could end up either failing or working on old code, and developers would not see the changes they just pushed. We can probably cache it for a handful of milliseconds to reap the benefits of request coalescing, but any more than that is likely to cause issues. Make it tunable and everyone's happy :)

The real speedup is to be gained in caching the packfiles, as they can be several orders of magnitude larger and the responses are generally produced on-demand by the Git repository server (consuming significant CPU in the process). There are two approaches we could take here:

  • Cache the entire packfile as a single unit
  • Cache the individual components of the packfile and assemble packs on demand

The first approach lends itself to a caching proxy server like Varnish, because the same POST request should always lead to the same response from the Git repository. As long as we add the POST request body to the cache key (in addition to the relevant headers), it's safe to cache packfiles as complete HTTP responses. This is the fastest and most resource-efficient way to handle POST /git-upload-pack requests.
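As a rough sketch of the idea (not how Varnish actually computes its hash), a cache key for a packfile response could look something like this; which headers count as "relevant" depends on your setup, and Host is just an illustrative choice:

import hashlib

def packfile_cache_key(host: str, url: str, body: bytes) -> str:
    # Two POST /git-upload-pack requests with the same URL and the same
    # want/have lines in the body map to the same cached packfile.
    h = hashlib.sha256()
    h.update(host.encode())
    h.update(url.encode())
    h.update(hashlib.sha256(body).digest())  # digest of the pkt-line request body
    return h.hexdigest()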

The downside of caching entire packfiles is that you can only serve them to clients requesting that exact set of Git references. While that may seem like a big deal, it's not as uncommon as you may think. CI/CD pipelines are good examples of this, where many identical workers clone the repository at the same time. Developers fetching frequently, or using development environments like VS Code that run Git fetch periodically, will also benefit from this type of caching.

Still, we are on a mission to reduce the end-to-end runtime of pipelines, and for that we are going to need some way to cache the individual components of a packfile. When the repository has changed and a new pipeline is started, we would ideally issue a single Git fetch from our Git cache that retrieves only the latest changes.

If you are thinking to yourself, "sounds like we need a Git mirror", then you would be correct. Git already has built-in functionality to mirror another Git repository, and we are taking advantage of that here. With a Git-mirror sidecar we have developed in-house, we are able to mirror Git repositories on demand and fetch only the latest commit pushed to the upstream branch.
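The sidecar itself is not something I can share here, but the Git plumbing it builds on is standard. A minimal sketch of the idea, with illustrative names and paths:

import subprocess
from pathlib import Path

def update_mirror(upstream_url: str, mirror_dir: Path) -> None:
    if not mirror_dir.exists():
        # --mirror fetches all refs and sets the repository up as a bare mirror
        subprocess.run(["git", "clone", "--mirror", upstream_url, str(mirror_dir)], check=True)
    else:
        # --prune drops refs that were deleted upstream, keeping the mirror in sync
        subprocess.run(["git", "-C", str(mirror_dir), "fetch", "--prune"], check=True)

update_mirror("https://github.com/varnishcache/varnish-cache.git",
              Path("/var/cache/git/varnish-cache.git"))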

By combining the two approaches, we get the best of both worlds. Caching entire packfiles is fast and incredibly scalable, while maintaining local Git mirrors lets us avoid the cache miss penalties almost entirely.

Caching Private Repositories

Git does not have any authentication mechanisms built into the protocol itself, but that hasn't stopped people from bolting it on top. If you create a private repository on GitHub, you will be presented with a wide variety of options to authenticate clients, including deploy keys, Personal Access Tokens (PAT), OAuth, and GitHub Apps. But at the Git transport level, they all boil down to two approaches: SSH keys or HTTP Authorization headers.

While SSH keys are often easier to deal with in smaller teams, larger organizations tend to prefer HTTP authorization because it makes it easier to control who can do what with which Git repositories. This is in fact one of the major reasons we are focusing on caching the HTTP Git traffic specifically.

Since we are placing a cache between the Git repository and the user, we have to make sure that the user has read access to that repository before we serve anything from the cache. While we could configure a separate access control system in the cache itself, it would be really annoying to have to keep the two systems in sync. No, we are going to use the Git repository itself as our “access control engine”, piggybacking off whatever access control has been configured there.

While we could do a bunch of complicated things to have our cache nodes talk to GitHub's API to check the permissions of each user, there is a much simpler approach that is also platform-agnostic. Remember that GET /info/refs?service=git-upload-pack? When our Git client sends this to a private repository, it gets a “401 Unauthorized” response, which prompts the client to re-attempt the request with its access credentials. If our proxy just forwards this next attempt to the Git repository and receives the response, we now know whether the client is authorized to read from the repository. And if we remember this for the subsequent POST /git-upload-pack, we can go ahead and serve straight from cache to that client. That is what we call a preflight authorization request, and it's both lightweight and cacheable.
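Here is a rough sketch of that preflight idea in Python, assuming the cache can reach the upstream repository directly. The names, the TTL, and the in-memory table are all illustrative; in practice this logic lives in the caching layer itself:

import hashlib
import time
import urllib.error
import urllib.request

_authorized = {}    # (repo, digest of credentials) -> expiry timestamp
PREFLIGHT_TTL = 60  # how long a successful preflight is remembered, in seconds

def is_authorized(upstream: str, repo: str, auth_header: str) -> bool:
    key = (repo, hashlib.sha256(auth_header.encode()).hexdigest())
    if _authorized.get(key, 0) > time.time():
        return True  # cached preflight result, no round trip to the repository
    req = urllib.request.Request(
        f"{upstream}/{repo}/info/refs?service=git-upload-pack",
        headers={"Authorization": auth_header} if auth_header else {},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            ok = resp.status == 200
    except urllib.error.HTTPError:
        ok = False  # 401/403 from upstream: the client is not authorized
    if ok:
        _authorized[key] = time.time() + PREFLIGHT_TTL
    return ok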

This is a simple, effective, and secure method of authorizing any Git HTTP client to any Git repository. And the best thing is, it’s completely automatic, no configuration needed.

Results

Benchmarking Git operations has turned out to be slightly tricky. Decompressing the packfile and resolving deltas on the client side can take a significant amount of time, especially once you cut down the data transfer time on cache HITs. I got wildly different results on my developer laptop compared to a tiny cloud VM because the VM was so CPU constrained.

As a reminder, here are the goals we set out to accomplish:

  • Reduce CI/CD pipeline runtimes by speeding up Git clone
  • Reduce developer friction by speeding up Git fetch and Git pull
  • Safely cache both public and private Git repositories

We want to speed up Git operations as a whole, but even if we make the data transfer from server to client instant, keep in mind that client-local processing will still take some amount of time.

When cloning varnish-cache to my local laptop from GitHub, I get pretty consistent results at about 12 seconds from start to finish:

time git clone https://github.com/varnishcache/varnish-cache.git
Cloning into 'varnish-cache'...
remote: Enumerating objects: 177168, done.
remote: Counting objects: 100% (1086/1086), done.
remote: Compressing objects: 100% (347/347), done.
remote: Total 177168 (delta 898), reused 739 (delta 739), pack-reused 176082 (from 4)
Receiving objects: 100% (177168/177168), 36.76 MiB | 3.71 MiB/s, done.
Resolving deltas: 100% (142320/142320), done.

real    0m11,887s
user    0m6,693s
sys     0m1,243s

And when I do the same clone through a Git cache running on my laptop, it takes a little over 2 seconds:

time git clone http://localhost:6081/varnishcache/varnish-cache.git
Cloning into 'varnish-cache'...
remote: Enumerating objects: 177168, done.
remote: Counting objects: 100% (1086/1086), done.
remote: Compressing objects: 100% (347/347), done.
remote: Total 177168 (delta 898), reused 739 (delta 739), pack-reused 176082 (from 4)
Receiving objects: 100% (177168/177168), 36.75 MiB | 55.76 MiB/s, done.
Resolving deltas: 100% (142324/142324), done.

real    0m2,211s
user    0m4,876s
sys     0m0,661s

Notice something weird? I am using the time tool to measure how long the clone command takes to finish, and the last three lines of output are from that tool. The first line, “real”, is wall-clock time, as in the real-world time it took from the moment I executed the command to the moment it finished. But the second line, “user”, is interesting, as it says the command took almost 5 seconds of user time. What's going on here?

If we take a look at the time manual, we find the explanation:

U Total number of CPU-seconds that the process used directly (in user mode), in seconds.

I'm pretty sure this means that our clone operation is taking almost 5 seconds of CPU time, but at least some of that work can be done concurrently, leading to a lower wall-clock time. However, this nicely illustrates how CPU-heavy those Git operations can be.

So let's try the --depth=1 trick to get rid of the history, as that should cut down on the CPU usage a great deal. First, we try cloning directly from GitHub:

time git clone --depth=1 https://github.com/varnishcache/varnish-cache.git
Cloning into 'varnish-cache'...
remote: Enumerating objects: 1770, done.
remote: Counting objects: 100% (1770/1770), done.
remote: Compressing objects: 100% (1714/1714), done.
remote: Total 1770 (delta 177), reused 568 (delta 49), pack-reused 0 (from 0)
Receiving objects: 100% (1770/1770), 2.48 MiB | 3.77 MiB/s, done.
Resolving deltas: 100% (177/177), done.

real    0m1,641s
user    0m0,261s
sys     0m0,084s

That’s a pretty significant difference! Wall clock time went down from 12 seconds to under 2 seconds, and CPU time was reduced from over 6 seconds to just 0.26 seconds. Here is what the same operation looks like with the Git cache in place:

time git clone --depth=1 http://localhost:6081/varnishcache/varnish-cache.git
Cloning into 'varnish-cache'...
remote: Enumerating objects: 1770, done.
remote: Counting objects: 100% (1770/1770), done.
remote: Compressing objects: 100% (1713/1713), done.
remote: Total 1770 (delta 177), reused 567 (delta 50), pack-reused 0 (from 0)
Receiving objects: 100% (1770/1770), 2.48 MiB | 36.75 MiB/s, done.
Resolving deltas: 100% (177/177), done.

real    0m0,464s
user    0m0,127s
sys     0m0,051s

Total wall clock time is down to under 0.5 seconds and the CPU time has plummeted even further to 0.12 seconds. 

I think the results speak for themselves, but I find it fascinating that the combination of --depth=1 and a Git cache can take a regular Git clone operation from almost 12 seconds to less than 0.5 seconds. Git gud!

Why Not Use a CDN?

I am of the opinion that all things that can be cached should be cached. Call it an occupational injury. The content delivery industry figured this out a long time ago, and these days most web and streaming services use some form of Content Delivery Network (CDN), which is basically just a globally distributed HTTP cache you put in front of your web server.

But once you enter the world of devops, caching often seems to be an afterthought. This is especially strange considering the cost of egress from many cloud platforms, not to mention the cost of having developers sitting around twiddling their thumbs. And I am not just talking about having to wait tens of minutes for my pipelines to complete (or fail); the few seconds here and there spent waiting for Git repositories, Docker images, software packages, data sets, etc. to download can really kill your flow as a developer.

While CDNs are a good way to distribute content to the masses, that strategy simply is not a good fit for the devops use case. Compared to normal web clients, developers and CI/CD pipelines are fewer in number, hyper-concentrated, and pull much larger volumes of data. In other words, we need to place the caches much closer to the users than even the largest CDNs can provide. Close as in the same building, the same room, or even the same rack.

The good news is that we are beginning to see a shift towards hyper-localized caching for devops environments. Companies are looking for ways to both save money and increase productivity, and I think people are also kind of sick of having to wait for the same data to be downloaded from the cloud again and again. Especially when a platform goes down and everything grinds to a halt. It’s high time for caching to become a standard component of a modern devops tech stack.