- 2.32.1 → 2.47.1 no changes
- 2.32.0 06/06/21
The sparse-checkout feature allows users to focus a working directory on
a subset of the files at HEAD. The cone mode patterns, enabled by
core.sparseCheckoutCone, allow for very fast pattern matching to
discover which files at HEAD belong in the sparse-checkout cone.
Three important scale dimensions for a Git working directory are:
- HEAD: How many files are present at HEAD?
- Populated: How many files are within the sparse-checkout cone.
- Modified: How many files has the user modified in the working directory?
We will use big-O notation — O(X) — to denote how expensive certain operations are in terms of these dimensions.
These dimensions are ordered by their magnitude: users (typically) modify
fewer files than are populated, and we can only populate files at HEAD.
Problems occur if there is an extreme imbalance in these dimensions. For
example, if HEAD contains millions of paths but the populated set has
only tens of thousands, then commands like git status and git add can
be dominated by operations that require O(HEAD) operations instead of
O(Populated). Primarily, the cost is in parsing and rewriting the index,
which is filled primarily with files at HEAD that are marked with the
SKIP_WORKTREE bit.
The sparse-index intends to take these commands that read and modify the
index from O(HEAD) to O(Populated). To do this, we need to modify the
index format in a significant way: add "sparse directory" entries.
With cone mode patterns, it is possible to detect when an entire
directory will have its contents outside of the sparse-checkout definition.
Instead of listing all of the files it contains as individual entries, a
sparse-index contains an entry with the directory name, referencing the
object ID of the tree at HEAD and marked with the SKIP_WORKTREE bit.
If we need to discover the details for paths within that directory, we
can parse trees to find that list.
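To make the idea of a sparse-directory entry concrete, here is a minimal Python sketch of collapsing a full index into a sparse one. It is a toy model, not Git's implementation: the Entry class, the in_cone() rule (a simplified reading of cone mode), and the tree_oids lookup table are all assumptions made for illustration, and conflict stages and submodules are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str                    # a path ending in "/" denotes a sparse directory
    oid: str                     # blob ID for files, tree ID for sparse directories
    skip_worktree: bool = False


def in_cone(path, cone_dirs):
    """Simplified cone test (assumption): a file is populated if it is in the
    root, under a cone directory, or directly inside an ancestor of one."""
    parent = path.rsplit("/", 1)[0] + "/" if "/" in path else ""
    if parent == "":
        return True
    return any(path.startswith(d) or d.startswith(parent) for d in cone_dirs)


def outermost_sparse_dir(path, cone_dirs):
    """Return the shortest directory prefix of `path` that contains no cone directory."""
    prefix = ""
    for part in path.split("/")[:-1]:
        prefix += part + "/"
        if not any(d.startswith(prefix) for d in cone_dirs):
            return prefix
    raise ValueError(f"{path} is inside the cone")


def collapse(full_entries, cone_dirs, tree_oids):
    """Replace runs of out-of-cone files with one sparse-directory entry each."""
    sparse, emitted = [], set()
    for e in full_entries:
        if in_cone(e.path, cone_dirs):
            sparse.append(e)
            continue
        d = outermost_sparse_dir(e.path, cone_dirs)
        if d not in emitted:
            emitted.add(d)
            # tree_oids is a hypothetical map from directory to its tree ID at HEAD
            sparse.append(Entry(d, tree_oids[d], skip_worktree=True))
    return sparse
```

In this model, the number of entries in the collapsed list grows with the populated set plus the number of collapsed directories rather than with the number of files at HEAD, which is the property the O(Populated) goal relies on.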
At time of writing, sparse-directory entries violate expectations about the index format and its in-memory data structure. There are many consumers in the codebase that expect to iterate through all of the index entries and see only files. In fact, these loops expect to see a reference to every staged file. One way to handle this is to parse trees to replace a sparse-directory entry with all of the files within that tree as the index is loaded. However, parsing trees is slower than parsing the index format, so that is a slower operation than if we left the index alone. The plan is to make all of these integrations "sparse aware" so this expansion through tree parsing is unnecessary and they use fewer resources than when using a full index.
The implementation plan below follows four phases to slowly integrate with the sparse-index. The intention is to incrementally update Git commands to interact safely with the sparse-index without significant slowdowns. This may not always be possible, but the hope is that the primary commands that users need in their daily work are dramatically improved.
Phase I: Format and initial speedups
During this phase, Git learns to enable the sparse-index and safely parse
one. Protections are put in place so that every consumer of the in-memory
data structure can operate with its current assumption of every file at
HEAD.
At first, every index parse will call a helper method,
ensure_full_index(), which scans the index for sparse-directory entries
(pointing to trees) and replaces them with the full list of paths (with
blob contents) by parsing tree objects. This will be slower in all cases.
The only noticeable change in behavior will be that the serialized index
file contains sparse-directory entries.
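The expansion step can be sketched as follows. This is a hedged Python model rather than the C helper itself: the Entry class, the ensure_full_index_sketch() name, and the `trees` dictionary standing in for the object database are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str                    # a path ending in "/" denotes a sparse directory
    oid: str                     # blob ID for files, tree ID for sparse directories
    skip_worktree: bool = False


def expand_tree(prefix, tree_oid, trees):
    """Yield one file entry per blob reachable from the tree, depth first.

    `trees` is a hypothetical object store: tree ID -> list of
    (name, is_tree, oid) tuples, assumed to be sorted by name.
    """
    for name, is_tree, oid in trees[tree_oid]:
        if is_tree:
            yield from expand_tree(prefix + name + "/", oid, trees)
        else:
            # Everything under a sparse directory is outside the cone.
            yield Entry(prefix + name, oid, skip_worktree=True)


def ensure_full_index_sketch(entries, trees):
    """Return a new list in which every sparse directory is replaced by its files."""
    full = []
    for e in entries:
        if e.path.endswith("/"):
            full.extend(expand_tree(e.path, e.oid, trees))
        else:
            full.append(e)
    return full
```

The recursive tree walk is where the extra cost comes from in this model: every collapsed directory requires reading one tree object per subdirectory before the index is usable.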
To start, we use a new required index extension, sdir, to allow
inserting sparse-directory entries into indexes with file format
versions 2, 3, and 4. This prevents Git versions that do not understand
the sparse-index from operating on one, while allowing tools that do not
understand the sparse-index to operate on repositories as long as they do
not interact with the index. A new format, index v5, will be introduced
that includes sparse-directory entries by default. It might also
introduce other features that have been considered for improving the
index, as well.
Next, consumers of the index will be guarded against operating on a
sparse-index by inserting calls to ensure_full_index() or
expand_index_to_path(). If a specific path is requested, then those will
be protected from within the index_file_exists() and index_name_pos()
API calls: they will call ensure_full_index() if necessary. The
intention here is to preserve existing behavior when interacting with a
sparse-checkout. We don’t want a change to happen by accident, without
tests. Many of these locations may not need any change before removing the
guards, but we should not do so without tests to ensure the expected
behavior happens.
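A guarded path lookup might look like the following Python sketch. It only illustrates the "expand if necessary" idea: the tuple-based index, the `trees` dictionary, and the name_pos() and expand() helpers are hypothetical stand-ins, not the signatures of the C functions named above.

```python
import bisect


def expand(index, trees):
    """Replace every sparse directory (path ending in "/") with its files.

    `index` is a sorted list of (path, oid) tuples; `trees` is a hypothetical
    object store mapping tree ID -> list of (name, is_tree, oid).
    """
    def walk(prefix, tree_oid):
        for name, is_tree, oid in trees[tree_oid]:
            if is_tree:
                yield from walk(prefix + name + "/", oid)
            else:
                yield (prefix + name, oid)

    full = []
    for path, oid in index:
        if path.endswith("/"):
            full.extend(walk(path, oid))
        else:
            full.append((path, oid))
    return sorted(full)


def name_pos(index, path, trees):
    """Return the position of `path`, expanding the index in place only if
    the path falls inside a sparse directory. Returns -1 if it is absent."""
    paths = [p for p, _ in index]
    i = bisect.bisect_left(paths, path)
    if i < len(paths) and paths[i] == path:
        return i
    # In sorted order, a sparse directory containing `path` sits just before it.
    if i > 0 and paths[i - 1].endswith("/") and path.startswith(paths[i - 1]):
        index[:] = expand(index, trees)
        return name_pos(index, path, trees)
    return -1
```

Lookups for populated paths never touch the tree store in this model, so callers that stay inside the cone keep the smaller index; only a request for a path inside a collapsed directory pays for expansion.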
It may be desirable to change the behavior of some commands in the presence of a sparse index or more generally in any sparse-checkout scenario. In such cases, these should be carefully communicated and tested. No such behavior changes are intended during this phase.
During a scan of the codebase, not every iteration of the cache entries
needs an ensure_full_index() check. The basic reasons include:
- The loop is scanning for entries with non-zero stage. These entries are not collapsed into a sparse-directory entry.
- The loop is scanning for submodules. These entries are not collapsed into a sparse-directory entry.
- The loop is part of the index API, especially around reading or writing the format.
- The loop is checking for correct order of cache entries and that is correct if and only if the sparse-directory entries are in the correct location.
- The loop ignores entries with the SKIP_WORKTREE bit set, or is otherwise already aware of sparse directory entries.
- The sparse-index is disabled at this point when using the split-index feature, so no effort is made to protect the split-index API.
Even after inserting these guards, we will keep expanding sparse-indexes
for most Git commands using the command_requires_full_index repository
setting. This setting will be on by default and disabled one builtin at a
time until we have sufficient confidence that all of the index operations
are properly guarded.
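The opt-out mechanism described above can be modeled in a few lines of Python. This is only a sketch of the control flow: the RepoSettings class, the read_index() function, and the `expand` callback are hypothetical names, and only the default value of the flag is taken from the text.

```python
class RepoSettings:
    def __init__(self):
        # On by default, as described in the text; an integrated builtin
        # would switch it off before reading the index.
        self.command_requires_full_index = True


def read_index(on_disk_entries, settings, expand):
    """Return the in-memory index, expanded unless the command opted out.

    `expand` is a caller-supplied function that replaces sparse-directory
    entries with file entries (an assumption of this sketch).
    """
    if settings.command_requires_full_index:
        return expand(on_disk_entries)
    return list(on_disk_entries)
```

Keeping the decision in one place means a builtin that has not yet been audited always sees the full index, while an integrated one changes behavior by flipping a single setting.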
To complete this phase, the commands git status and git add will be
integrated with the sparse-index so that they operate with O(Populated)
performance. They will be carefully tested for operations within and
outside the sparse-checkout definition.
Phase II: Careful integrations
This phase focuses on ensuring that all index extensions and APIs work
well with a sparse-index. This requires significant increases to our test
coverage, especially for operations that interact with the working
directory outside of the sparse-checkout definition. Some of these
behaviors may not be the desirable ones, such as some tests already
marked for failure in t1092-sparse-checkout-compatibility.sh.
The index extensions that may require special integrations are:
- FS Monitor
- Untracked cache
While integrating with these features, we should look for patterns that might lead to better APIs for interacting with the index. Coalescing common usage patterns into an API call can reduce the number of places where sparse-directories need to be handled carefully.
Phase III: Important command speedups
At this point, the patterns for testing and implementing sparse-directory logic should be relatively stable. This phase focuses on updating some of the most common builtins that use the index to operate as O(Populated). Here is a potential list of commands that could be valuable to integrate at this point:
- git commit
- git checkout
- git merge
- git rebase
Hopefully, commands such as git merge and git rebase can benefit
instead from merge algorithms that do not use the index as a data
structure, such as the merge-ORT strategy. As these topics mature, we
may enable the ORT strategy by default for repositories using the
sparse-index feature.
Along with git status and git add, these commands cover the majority
of users' interactions with the working directory. In addition, we can
integrate with these commands:
- git grep
- git rm
These have been proposed as some whose behavior could change when in a repo with a sparse-checkout definition. It would be good to include this behavior automatically when using a sparse-index. Any such behavior switch needs to be made clear to the user.
This phase is the first where parallel work might be possible without too many conflicts between topics.
Phase IV: The long tail
This last phase is less a "phase" and more "the new normal" after all of the previous work.
To start, the command_requires_full_index option could be removed in
favor of expanding only when hitting an API guard.
There are many Git commands that could use special attention to operate as O(Populated), while some might be so rare that it is acceptable to leave them with additional overhead when a sparse-index is present.
Here are some commands that might be useful to update:
- git sparse-checkout set
- git am
- git clean
- git stash