Git LFS Architecture Note
A description of the architecture of the Git LFS service implementation.
Astronomical software code repositories often contain data for test or demo purposes. These large binary files don’t tend to benefit from version control beyond provenance and update tracking, and their history expands repository size with no return. Historically there have been a number of tools to deal with this, such as git-annex and git-fat, but storing files in those back ends was too much of a departure from what users consider a “normal” workflow. Lacking a satisfactory option, the core package afwdata was left on the in-house gitolite server after the rest of the codebase migrated to GitHub.
In 2015, GitHub released a protocol and an open source reference implementation of Git LFS, a specification for dealing with large binary files in git. Aside from some upfront setup pain, the workflow was very close to “normal” GitHub flow. GitHub also released a paid hosted service for those files. Given the demand for storing data in our repositories, the cost would be non-trivial. Moreover, we did not wish to end up in a position where developers self-censor over what to store. Finally, the largest a Git LFS repository can be, if hosted at GitHub, is 5GB, no matter how much we’d be willing to pay.
Following a successful RFC, we decided to proceed with a Git LFS service. This would give users the advantages of working predominantly with the GitHub services, while allowing us to offer unmetered storage at the back end.
In 2023, we were faced with the facts that we did not want to store our data at AWS anymore, and that the software we used for Git LFS was dangerously ancient and unmaintained. After some consideration, we chose a different product to provide Git LFS service, and proposed it in RFC-966. This document now describes the current state of Git LFS at the Rubin Observatory.
After installing the git lfs client, a user can commit additions and modifications to LFS-backed data using git commands. The repo’s .gitattributes file determines what files are tracked by Git LFS.
Our Git LFS server is hosted by Roundtable and is protected by Gafaelfawr. Before the user can commit Git
LFS-tracked data, they must first request a token with the
write:git-lfs scope. Having done this, they must update their
~/.git-credentials file (or other Git credentials manager) with that
token, and update the Git LFS URL to
org is usually
repo is the repository name, e.g.
This process is described in the Developer Guide.
When the user pushes commits with LFS-tracked data, they are in fact generating two requests. The first one goes to the Git server containing the Git LFS pointer for the file. It looks something like this:
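As a minimal sketch, the pointer format defined by the Git LFS specification (a version line, a SHA-256 object ID, and the size in bytes) can be generated as shown here. The function name `make_lfs_pointer` and the sample payload are hypothetical and used only for illustration; the real pointer is written by the git lfs client, not by user code.

```python
import hashlib

# Version line defined by the Git LFS pointer specification.
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Return the text of a Git LFS pointer file for the given blob.

    Hypothetical helper for illustration: it mirrors the three keys a
    pointer carries (version, oid, size) but is not part of any tool.
    """
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_URL}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # The payload stands in for a large binary file.
    print(make_lfs_pointer(b"example binary payload"), end="")
```

Only this small text file is stored in the Git repository; the blob itself is addressed by its `oid` on the Git LFS side.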
The second request is made by the git lfs client (due to the smudge and clean filters) and uses the .lfsconfig to locate the Git LFS server it should be addressing. In our case that will be git-lfs-rw.lsst.cloud. The Git LFS server checks whether the requested blob exists in the backing store, and hands the client a URL that it can use to retrieve or push it. The client then uses the URL to fetch or push to our object store.
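The exchange between client and server can be sketched with the Git LFS Batch API, in which the client POSTs the object IDs it wants to an `objects/batch` endpoint under the LFS URL. The sketch below only builds the request and does not send it. The base URL, the helper name `build_batch_request`, and the omission of authentication are assumptions for illustration; how the token is actually presented to the server is not shown here.

```python
import json
import urllib.request

# Media type the Git LFS Batch API uses for requests and responses.
LFS_MEDIA_TYPE = "application/vnd.git-lfs+json"


def build_batch_request(lfs_url, operation, objects):
    """Build (but do not send) a Git LFS Batch API request.

    lfs_url   -- base LFS URL for the repository (hypothetical value below)
    operation -- "download" or "upload"
    objects   -- iterable of (oid, size) pairs taken from pointer files
    """
    if operation not in ("download", "upload"):
        raise ValueError(f"unsupported operation: {operation!r}")
    body = {
        "operation": operation,
        "transfers": ["basic"],
        "objects": [{"oid": oid, "size": size} for oid, size in objects],
    }
    return urllib.request.Request(
        url=lfs_url.rstrip("/") + "/objects/batch",
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": LFS_MEDIA_TYPE, "Content-Type": LFS_MEDIA_TYPE},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL and object ID; not the real service address.
    req = build_batch_request(
        "https://git-lfs.example.org/org/repo", "download", [("a" * 64, 12)]
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

In the Batch API, the server's JSON response lists each object together with an `actions` entry whose `href` is the URL the client then uses to download or upload the blob, which corresponds to the hand-off to the object store described above.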
It is recommended that users use the current stable release of the git lfs client (3.4.0 at the time of writing).
These are the repos involved in this deployment: