Git LFS Architecture Note
Abstract
A description of the architecture of the Git LFS service implementation.
Motivation
Astronomical software repositories often contain data for test or demo purposes. These large binary files gain little from version control beyond provenance and update tracking, and their history inflates repository size with no return. Several tools have historically addressed this, such as git-annex and git-fat, but storing files in those back ends departed too far from what users consider a “normal” workflow. Lacking a satisfactory option, the core package afwdata was left on the in-house gitolite server after the rest of the codebase migrated to GitHub.
In 2015, GitHub released a protocol and an open source reference implementation of Git LFS, a specification for dealing with large binary files in git. Aside from some upfront setup pain, the workflow was very close to “normal” GitHub flow. GitHub also released a paid hosted service for those files. Given the demand for storing data in our repositories, the cost would have been non-trivial. Moreover, we did not want developers to self-censor over what to store. Finally, the largest a Git LFS repository can be, if hosted at GitHub, is 5 GB, no matter how much we would be willing to pay.
Following a successful RFC, we decided to proceed with a Git LFS service. This would give users the advantages of working predominantly with the GitHub services, while allowing us to offer unmetered storage at the back end.
In 2023, we faced two facts: we no longer wanted to store our data at AWS, and the software we used for Git LFS was dangerously old and unmaintained. After some consideration, we chose a different product to provide Git LFS service, and proposed it in RFC-966. This document describes the current state of Git LFS at the Rubin Observatory.
Architecture
After installing the git lfs client, a user can commit additions and modifications to LFS-backed data using the normal git commands. The repo’s .gitattributes specifies which files are tracked by Git LFS.
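As a rough illustration of how .gitattributes marks files for Git LFS, the sketch below collects the patterns that carry the filter=lfs attribute and checks a path against them. The sample patterns (*.fits, *.fits.gz) are hypothetical, and the matching is a simplification that uses Python's fnmatch on the file name rather than git's full attribute-matching rules.

```python
from fnmatch import fnmatch


def lfs_patterns(gitattributes_text):
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


def is_lfs_tracked(path, patterns):
    """Approximate check: match the base name (or full path) against each pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(name, p) or fnmatch(path, p) for p in patterns)


# Hypothetical .gitattributes content, in the form written by `git lfs track`.
EXAMPLE = """\
# test data
*.fits filter=lfs diff=lfs merge=lfs -text
*.fits.gz filter=lfs diff=lfs merge=lfs -text
"""

patterns = lfs_patterns(EXAMPLE)
print(patterns)
print(is_lfs_tracked("data/image.fits", patterns))
```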
Our Git LFS server is hosted by Roundtable and is protected by Gafaelfawr. Before the user can commit Git LFS-tracked data, they must first request a token with the write:git-lfs scope. They must then add that token to their ~/.git-credentials file (or other Git credentials manager) and update the Git LFS URL to https://git-lfs-rw.lsst.cloud/<org>/<repo> (where org is usually lsst and repo is the repository name, e.g. testdata_subaru). This process is described in the Developer Guide.
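The two strings involved in this setup can be sketched as follows. This assumes the plain-text credential store format used by git (one URL per line, with user and password embedded); the username and token values are placeholders, and the Developer Guide remains the authoritative description of the procedure.

```python
from urllib.parse import quote

# Read-write host named in the text above.
LFS_RW_HOST = "git-lfs-rw.lsst.cloud"


def lfs_url(org, repo):
    """Build the read-write Git LFS URL for a repository."""
    return f"https://{LFS_RW_HOST}/{org}/{repo}"


def credential_line(username, token, host=LFS_RW_HOST):
    """Build one line in git's credential-store format.

    Both fields are percent-encoded so that characters such as "/" or "@"
    cannot be confused with URL delimiters.
    """
    return f"https://{quote(username, safe='')}:{quote(token, safe='')}@{host}"


# "someuser" and the token string are hypothetical placeholders.
print(lfs_url("lsst", "testdata_subaru"))
print(credential_line("someuser", "gt-abc/123"))
```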
When the user pushes commits with LFS-tracked data, they are in fact generating two requests. The first goes to the Git server and contains the Git LFS pointer for the file. The pointer looks something like this:
version https://git-lfs.github.com/spec/v1
oid sha256:7a6943ac4d8337727b93f410cf51b1ce748dabe9dc8e85c8942c97dd5c0a49e9
size 123840
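A pointer of this shape can be derived from the file content alone: the oid is the SHA-256 digest of the bytes and the size is their length. The following is a minimal sketch of that derivation and of the reverse parse; the function names are illustrative and are not part of any Git LFS tooling.

```python
import hashlib

POINTER_VERSION = "https://git-lfs.github.com/spec/v1"


def lfs_pointer(content: bytes) -> str:
    """Build the pointer text for a blob: version, SHA-256 oid, byte size."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {POINTER_VERSION}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


def parse_pointer(text: str) -> dict:
    """Split a pointer into its key/value fields, converting size to an int."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    fields["size"] = int(fields["size"])
    return fields


pointer = lfs_pointer(b"hello")
print(pointer, end="")
print(parse_pointer(pointer))
```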
The second request is made by the git lfs client (due to the smudge and clean filters) and uses the .lfsconfig to locate the Git LFS server it should be addressing. In our case that will be git-lfs-rw.lsst.cloud. The Git LFS server checks whether the requested blob exists in the backing store, and hands the client a URL that it can use to retrieve or push it. The client then uses that URL to fetch from or push to our object store.
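The exchange described above corresponds to the Git LFS batch API: the client posts a list of object ids and sizes to the objects/batch endpoint under the LFS URL, and the response carries a per-object href for the transfer. The sketch below only builds the request and reads a response; it sends nothing, omits authentication, and the response shown is a hand-written sample rather than real output from our server.

```python
import json
import urllib.request

LFS_MEDIA_TYPE = "application/vnd.git-lfs+json"


def build_batch_request(lfs_url, operation, objects):
    """Build (but do not send) a batch API request.

    operation is "download" or "upload"; objects is a list of (oid, size) pairs.
    """
    body = {
        "operation": operation,
        "transfers": ["basic"],
        "objects": [{"oid": oid, "size": size} for oid, size in objects],
    }
    return urllib.request.Request(
        url=lfs_url.rstrip("/") + "/objects/batch",
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": LFS_MEDIA_TYPE, "Content-Type": LFS_MEDIA_TYPE},
        method="POST",
    )


def transfer_href(batch_response, operation):
    """Return the transfer URL for the first object, or None if no action is offered.

    For uploads, a missing action means the server already holds the object.
    """
    obj = batch_response["objects"][0]
    action = obj.get("actions", {}).get(operation)
    return action["href"] if action else None


req = build_batch_request(
    "https://git-lfs-rw.lsst.cloud/lsst/testdata_subaru",
    "upload",
    [("7a6943ac4d8337727b93f410cf51b1ce748dabe9dc8e85c8942c97dd5c0a49e9", 123840)],
)
print(req.get_method(), req.full_url)

# Hand-written sample response; the href is a placeholder, not a real storage URL.
sample = {"objects": [{"oid": "abc", "size": 1,
                       "actions": {"download": {"href": "https://storage.example/blob"}}}]}
print(transfer_href(sample, "download"))
```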
Other git-lfs.lsst.cloud (and git-lfs-rw.lsst.cloud) components:
- Giftless provides the Git LFS server.
- Google Cloud Storage provides the backing store, using the S3 object storage protocol.
Client Requirements
We recommend that users run the current stable git-lfs client version (3.4.0 at the time of writing).
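One way to check an installed client against this recommendation is to parse the version string it reports. The sketch below assumes the output of git lfs version begins with git-lfs/<major>.<minor>.<patch>, and it treats 3.4.0 as a floor, which is an assumption of this example rather than a stated requirement; the sample strings are illustrative.

```python
import re

# Version named in the text above; using it as a minimum is this example's assumption.
RECOMMENDED = (3, 4, 0)


def parse_lfs_version(version_output):
    """Extract (major, minor, patch) from a string such as 'git-lfs/3.4.0 (...)'."""
    match = re.search(r"git-lfs/(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def is_recent_enough(version_output, minimum=RECOMMENDED):
    """Compare versions as integer tuples, so that 3.10.0 sorts after 3.4.0."""
    return parse_lfs_version(version_output) >= minimum


# Illustrative sample strings, not captured from a real installation.
print(is_recent_enough("git-lfs/3.4.0 (GitHub; linux amd64; go 1.20.6)"))
print(is_recent_enough("git-lfs/2.13.3 (GitHub; linux amd64; go 1.16.2)"))
```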
Repositories
These are the repos involved in this deployment:
- The application configuration is held in applications/giftless.
- The server implementation, which uses Datopian’s released Giftless container.
Documentation
- GitHub’s Git Large File Storage website: GitHub’s git-lfs site, which links to the canonical client source code, issues, and documentation.
- LSST’s developer guide for using git-lfs.
- The source for this note.