git/repository.h

262 lines
7.2 KiB
C
Raw Normal View History

#ifndef REPOSITORY_H
#define REPOSITORY_H
struct config_set;
fsmonitor: config settings are repository-specific Move fsmonitor config settings to a new and opaque `struct fsmonitor_settings` structure. Add a lazily-loaded pointer to this into `struct repo_settings` Create an `enum fsmonitor_mode` type in `struct fsmonitor_settings` to represent the state of fsmonitor. This lets us represent which, if any, fsmonitor provider (hook or IPC) is enabled. Create `fsm_settings__get_*()` getters to lazily look up fsmonitor- related config settings. Get rid of the `core_fsmonitor` global variable. Move the code to lookup the existing `core.fsmonitor` config value into the fsmonitor settings. Create a hook pathname variable in `struct fsmonitor-settings` and only set it when in hook mode. Extend the definition of `core.fsmonitor` to be either a boolean or a hook pathname. When true, the builtin FSMonitor is used. When false or unset, no FSMonitor (neither builtin nor hook) is used. The existing `core_fsmonitor` global variable was used to store the pathname to the fsmonitor hook *and* it was used as a boolean to see if fsmonitor was enabled. This dual usage and global visibility leads to confusion when we add the IPC-based provider. So lets hide the details in fsmonitor-settings.c and let it decide which provider to use in the case of multiple settings. This avoids cluttering up repo-settings.c with these private details. A future commit in builtin-fsmonitor series will add the ability to disqualify worktrees for various reasons, such as being mounted from a remote volume, where fsmonitor should not be started. Having the config settings hidden in fsmonitor-settings.c allows such worktree restrictions to override the config values used. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-25 19:02:46 +01:00
struct fsmonitor_settings;
struct git_hash_algo;
struct index_state;
struct lock_file;
struct pathspec;
struct raw_object_store;
struct submodule_cache;
struct promisor_remote_config;
struct remote_state;
enum untracked_cache_setting {
repo-settings.c: simplify the setup Simplify the setup code in repo-settings.c in various ways, making the code shorter, easier to read, and requiring fewer hacks to do the same thing as it did before: Since 7211b9e7534 (repo-settings: consolidate some config settings, 2019-08-13) we have memset() the whole "settings" structure to -1 in prepare_repo_settings(), and subsequently relied on the -1 value. Most of the fields did not need to be initialized to -1, and because we were doing that we had the enum labels "UNTRACKED_CACHE_UNSET" and "FETCH_NEGOTIATION_UNSET" purely to reflect the resulting state created this memset() in prepare_repo_settings(). No other code used or relied on them, more on that below. For the rest most of the subsequent "are we -1, then read xyz" can simply be removed by re-arranging what we read first. E.g. when setting the "index.version" setting we should have first read "feature.experimental", so that it (and "feature.manyfiles") can provide a default for our "index.version". Instead the code setting it, added when "feature.manyFiles"[1] was created, was using the UPDATE_DEFAULT_BOOL() macro added in an earlier commit[2]. That macro is now gone, since it was only needed for this pattern of reading things in the wrong order. This also fixes an (admittedly obscure) logic error where we'd conflate an explicit "-1" value in the config with our own earlier memset() -1. We can also remove the UPDATE_DEFAULT_BOOL() wrapper added in [3]. Using it is redundant to simply using the return value from repo_config_get_bool(), which is non-zero if the provided key exists in the config. Details on edge cases relating to the memset() to -1, continued from "more on that below" above: * UNTRACKED_CACHE_KEEP: In [4] the "unset" and "keep" handling for core.untrackedCache was consolidated. But it while we understand the "keep" value, we don't handle it differently than the case of any other unknown value. So let's retain UNTRACKED_CACHE_KEEP and remove the UNTRACKED_CACHE_UNSET setting (which was always implicitly UNTRACKED_CACHE_KEEP before). We don't need to inform any code after prepare_repo_settings() that the setting was "unset", as far as anyone else is concerned it's core.untrackedCache=keep. if "core.untrackedcache" isn't present in the config. * FETCH_NEGOTIATION_UNSET & FETCH_NEGOTIATION_NONE: Since these two two enum fields added in [5] don't rely on the memzero() setting them to "-1" anymore we don't have to provide them with explicit values. 1. c6cc4c5afd2 (repo-settings: create feature.manyFiles setting, 2019-08-13) 2. 31b1de6a09b (commit-graph: turn on commit-graph by default, 2019-08-13) 3. 31b1de6a09b (commit-graph: turn on commit-graph by default, 2019-08-13) 4. ad0fb659993 (repo-settings: parse core.untrackedCache, 2019-08-13) 5. aaf633c2ad1 (repo-settings: create feature.experimental setting, 2019-08-13) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-21 15:13:02 +02:00
UNTRACKED_CACHE_KEEP,
UNTRACKED_CACHE_REMOVE,
UNTRACKED_CACHE_WRITE,
};
enum fetch_negotiation_setting {
FETCH_NEGOTIATION_CONSECUTIVE,
repo-settings.c: simplify the setup Simplify the setup code in repo-settings.c in various ways, making the code shorter, easier to read, and requiring fewer hacks to do the same thing as it did before: Since 7211b9e7534 (repo-settings: consolidate some config settings, 2019-08-13) we have memset() the whole "settings" structure to -1 in prepare_repo_settings(), and subsequently relied on the -1 value. Most of the fields did not need to be initialized to -1, and because we were doing that we had the enum labels "UNTRACKED_CACHE_UNSET" and "FETCH_NEGOTIATION_UNSET" purely to reflect the resulting state created this memset() in prepare_repo_settings(). No other code used or relied on them, more on that below. For the rest most of the subsequent "are we -1, then read xyz" can simply be removed by re-arranging what we read first. E.g. when setting the "index.version" setting we should have first read "feature.experimental", so that it (and "feature.manyfiles") can provide a default for our "index.version". Instead the code setting it, added when "feature.manyFiles"[1] was created, was using the UPDATE_DEFAULT_BOOL() macro added in an earlier commit[2]. That macro is now gone, since it was only needed for this pattern of reading things in the wrong order. This also fixes an (admittedly obscure) logic error where we'd conflate an explicit "-1" value in the config with our own earlier memset() -1. We can also remove the UPDATE_DEFAULT_BOOL() wrapper added in [3]. Using it is redundant to simply using the return value from repo_config_get_bool(), which is non-zero if the provided key exists in the config. Details on edge cases relating to the memset() to -1, continued from "more on that below" above: * UNTRACKED_CACHE_KEEP: In [4] the "unset" and "keep" handling for core.untrackedCache was consolidated. But it while we understand the "keep" value, we don't handle it differently than the case of any other unknown value. So let's retain UNTRACKED_CACHE_KEEP and remove the UNTRACKED_CACHE_UNSET setting (which was always implicitly UNTRACKED_CACHE_KEEP before). We don't need to inform any code after prepare_repo_settings() that the setting was "unset", as far as anyone else is concerned it's core.untrackedCache=keep. if "core.untrackedcache" isn't present in the config. * FETCH_NEGOTIATION_UNSET & FETCH_NEGOTIATION_NONE: Since these two two enum fields added in [5] don't rely on the memzero() setting them to "-1" anymore we don't have to provide them with explicit values. 1. c6cc4c5afd2 (repo-settings: create feature.manyFiles setting, 2019-08-13) 2. 31b1de6a09b (commit-graph: turn on commit-graph by default, 2019-08-13) 3. 31b1de6a09b (commit-graph: turn on commit-graph by default, 2019-08-13) 4. ad0fb659993 (repo-settings: parse core.untrackedCache, 2019-08-13) 5. aaf633c2ad1 (repo-settings: create feature.experimental setting, 2019-08-13) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-21 15:13:02 +02:00
FETCH_NEGOTIATION_SKIPPING,
FETCH_NEGOTIATION_NOOP,
};
refs: introduce reftable backend Due to scalability issues, Shawn Pearce has originally proposed a new "reftable" format more than six years ago [1]. Initially, this new format was implemented in JGit with promising results. Around two years ago, we have then added the "reftable" library to the Git codebase via a4bbd13be3 (Merge branch 'hn/reftable', 2021-12-15). With this we have landed all the low-level code to read and write reftables. Notably missing though was the integration of this low-level code into the Git code base in the form of a new ref backend that ties all of this together. This gap is now finally closed by introducing a new "reftable" backend into the Git codebase. This new backend promises to bring some notable improvements to Git repositories: - It becomes possible to do truly atomic writes where either all refs are committed to disk or none are. This was not possible with the "files" backend because ref updates were split across multiple loose files. - The disk space required to store many refs is reduced, both compared to loose refs and packed-refs. This is enabled both by the reftable format being a binary format, which is more compact, and by prefix compression. - We can ignore filesystem-specific behaviour as ref names are not encoded via paths anymore. This means there is no need to handle case sensitivity on Windows systems or Unicode precomposition on macOS. - There is no need to rewrite the complete refdb anymore every time a ref is being deleted like it was the case for packed-refs. This means that ref deletions are now constant time instead of scaling linearly with the number of refs. - We can ignore file/directory conflicts so that it becomes possible to store both "refs/heads/foo" and "refs/heads/foo/bar". - Due to this property we can retain reflogs for deleted refs. We have previously been deleting reflogs together with their refs to avoid file/directory conflicts, which is not necessary anymore. - We can properly enumerate all refs. With the "files" backend it is not easily possible to distinguish between refs and non-refs because they may live side by side in the gitdir. Not all of these improvements are realized with the current "reftable" backend implementation. At this point, the new backend is supposed to be a drop-in replacement for the "files" backend that is used by basically all Git repositories nowadays. It strives for 1:1 compatibility, which means that a user can expect the same behaviour regardless of whether they use the "reftable" backend or the "files" backend for most of the part. Most notably, this means we artificially limit the capabilities of the "reftable" backend to match the limits of the "files" backend. It is not possible to create refs that would end up with file/directory conflicts, we do not retain reflogs, we perform stricter-than-necessary checks. This is done intentionally due to two main reasons: - It makes it significantly easier to land the "reftable" backend as tests behave the same. It would be tough to argue for each and every single test that doesn't pass with the "reftable" backend. - It ensures compatibility between repositories that use the "files" backend and repositories that use the "reftable" backend. Like this, hosters can migrate their repositories to use the "reftable" backend without causing issues for clients that use the "files" backend in their clones. It is expected that these artificial limitations may eventually go away in the long term. Performance-wise things very much depend on the actual workload. The following benchmarks compare the "files" and "reftable" backends in the current version: - Creating N refs in separate transactions shows that the "files" backend is ~50% faster. This is not surprising given that creating a ref only requires us to create a single loose ref. The "reftable" backend will also perform auto compaction on updates. In real-world workloads we would likely also want to perform pack loose refs, which would likely change the picture. Benchmark 1: update-ref: create refs sequentially (refformat = files, refcount = 1) Time (mean ± σ): 2.1 ms ± 0.3 ms [User: 0.6 ms, System: 1.7 ms] Range (min … max): 1.8 ms … 4.3 ms 133 runs Benchmark 2: update-ref: create refs sequentially (refformat = reftable, refcount = 1) Time (mean ± σ): 2.7 ms ± 0.1 ms [User: 0.6 ms, System: 2.2 ms] Range (min … max): 2.4 ms … 2.9 ms 132 runs Benchmark 3: update-ref: create refs sequentially (refformat = files, refcount = 1000) Time (mean ± σ): 1.975 s ± 0.006 s [User: 0.437 s, System: 1.535 s] Range (min … max): 1.969 s … 1.980 s 3 runs Benchmark 4: update-ref: create refs sequentially (refformat = reftable, refcount = 1000) Time (mean ± σ): 2.611 s ± 0.013 s [User: 0.782 s, System: 1.825 s] Range (min … max): 2.597 s … 2.622 s 3 runs Benchmark 5: update-ref: create refs sequentially (refformat = files, refcount = 100000) Time (mean ± σ): 198.442 s ± 0.241 s [User: 43.051 s, System: 155.250 s] Range (min … max): 198.189 s … 198.670 s 3 runs Benchmark 6: update-ref: create refs sequentially (refformat = reftable, refcount = 100000) Time (mean ± σ): 294.509 s ± 4.269 s [User: 104.046 s, System: 190.326 s] Range (min … max): 290.223 s … 298.761 s 3 runs - Creating N refs in a single transaction shows that the "files" backend is significantly slower once we start to write many refs. The "reftable" backend only needs to update two files, whereas the "files" backend needs to write one file per ref. Benchmark 1: update-ref: create many refs (refformat = files, refcount = 1) Time (mean ± σ): 1.9 ms ± 0.1 ms [User: 0.4 ms, System: 1.4 ms] Range (min … max): 1.8 ms … 2.6 ms 151 runs Benchmark 2: update-ref: create many refs (refformat = reftable, refcount = 1) Time (mean ± σ): 2.5 ms ± 0.1 ms [User: 0.7 ms, System: 1.7 ms] Range (min … max): 2.4 ms … 3.4 ms 148 runs Benchmark 3: update-ref: create many refs (refformat = files, refcount = 1000) Time (mean ± σ): 152.5 ms ± 5.2 ms [User: 19.1 ms, System: 133.1 ms] Range (min … max): 148.5 ms … 167.8 ms 15 runs Benchmark 4: update-ref: create many refs (refformat = reftable, refcount = 1000) Time (mean ± σ): 58.0 ms ± 2.5 ms [User: 28.4 ms, System: 29.4 ms] Range (min … max): 56.3 ms … 72.9 ms 40 runs Benchmark 5: update-ref: create many refs (refformat = files, refcount = 1000000) Time (mean ± σ): 152.752 s ± 0.710 s [User: 20.315 s, System: 131.310 s] Range (min … max): 152.165 s … 153.542 s 3 runs Benchmark 6: update-ref: create many refs (refformat = reftable, refcount = 1000000) Time (mean ± σ): 51.912 s ± 0.127 s [User: 26.483 s, System: 25.424 s] Range (min … max): 51.769 s … 52.012 s 3 runs - Deleting a ref in a fully-packed repository shows that the "files" backend scales with the number of refs. The "reftable" backend has constant-time deletions. Benchmark 1: update-ref: delete ref (refformat = files, refcount = 1) Time (mean ± σ): 1.7 ms ± 0.1 ms [User: 0.4 ms, System: 1.2 ms] Range (min … max): 1.6 ms … 2.1 ms 316 runs Benchmark 2: update-ref: delete ref (refformat = reftable, refcount = 1) Time (mean ± σ): 1.8 ms ± 0.1 ms [User: 0.4 ms, System: 1.3 ms] Range (min … max): 1.7 ms … 2.1 ms 294 runs Benchmark 3: update-ref: delete ref (refformat = files, refcount = 1000) Time (mean ± σ): 2.0 ms ± 0.1 ms [User: 0.5 ms, System: 1.4 ms] Range (min … max): 1.9 ms … 2.5 ms 287 runs Benchmark 4: update-ref: delete ref (refformat = reftable, refcount = 1000) Time (mean ± σ): 1.9 ms ± 0.1 ms [User: 0.5 ms, System: 1.3 ms] Range (min … max): 1.8 ms … 2.1 ms 217 runs Benchmark 5: update-ref: delete ref (refformat = files, refcount = 1000000) Time (mean ± σ): 229.8 ms ± 7.9 ms [User: 182.6 ms, System: 46.8 ms] Range (min … max): 224.6 ms … 245.2 ms 6 runs Benchmark 6: update-ref: delete ref (refformat = reftable, refcount = 1000000) Time (mean ± σ): 2.0 ms ± 0.0 ms [User: 0.6 ms, System: 1.3 ms] Range (min … max): 2.0 ms … 2.1 ms 3 runs - Listing all refs shows no significant advantage for either of the backends. The "files" backend is a bit faster, but not by a significant margin. When repositories are not packed the "reftable" backend outperforms the "files" backend because the "reftable" backend performs auto-compaction. Benchmark 1: show-ref: print all refs (refformat = files, refcount = 1, packed = true) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.5 ms … 2.0 ms 1729 runs Benchmark 2: show-ref: print all refs (refformat = reftable, refcount = 1, packed = true) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.5 ms … 1.8 ms 1816 runs Benchmark 3: show-ref: print all refs (refformat = files, refcount = 1000, packed = true) Time (mean ± σ): 4.3 ms ± 0.1 ms [User: 0.9 ms, System: 3.3 ms] Range (min … max): 4.1 ms … 4.6 ms 645 runs Benchmark 4: show-ref: print all refs (refformat = reftable, refcount = 1000, packed = true) Time (mean ± σ): 4.5 ms ± 0.2 ms [User: 1.0 ms, System: 3.3 ms] Range (min … max): 4.2 ms … 5.9 ms 643 runs Benchmark 5: show-ref: print all refs (refformat = files, refcount = 1000000, packed = true) Time (mean ± σ): 2.537 s ± 0.034 s [User: 0.488 s, System: 2.048 s] Range (min … max): 2.511 s … 2.627 s 10 runs Benchmark 6: show-ref: print all refs (refformat = reftable, refcount = 1000000, packed = true) Time (mean ± σ): 2.712 s ± 0.017 s [User: 0.653 s, System: 2.059 s] Range (min … max): 2.692 s … 2.752 s 10 runs Benchmark 7: show-ref: print all refs (refformat = files, refcount = 1, packed = false) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.5 ms … 1.9 ms 1834 runs Benchmark 8: show-ref: print all refs (refformat = reftable, refcount = 1, packed = false) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.4 ms … 2.0 ms 1840 runs Benchmark 9: show-ref: print all refs (refformat = files, refcount = 1000, packed = false) Time (mean ± σ): 13.8 ms ± 0.2 ms [User: 2.8 ms, System: 10.8 ms] Range (min … max): 13.3 ms … 14.5 ms 208 runs Benchmark 10: show-ref: print all refs (refformat = reftable, refcount = 1000, packed = false) Time (mean ± σ): 4.5 ms ± 0.2 ms [User: 1.2 ms, System: 3.3 ms] Range (min … max): 4.3 ms … 6.2 ms 624 runs Benchmark 11: show-ref: print all refs (refformat = files, refcount = 1000000, packed = false) Time (mean ± σ): 12.127 s ± 0.129 s [User: 2.675 s, System: 9.451 s] Range (min … max): 11.965 s … 12.370 s 10 runs Benchmark 12: show-ref: print all refs (refformat = reftable, refcount = 1000000, packed = false) Time (mean ± σ): 2.799 s ± 0.022 s [User: 0.735 s, System: 2.063 s] Range (min … max): 2.769 s … 2.836 s 10 runs - Printing a single ref shows no real difference between the "files" and "reftable" backends. Benchmark 1: show-ref: print single ref (refformat = files, refcount = 1) Time (mean ± σ): 1.5 ms ± 0.1 ms [User: 0.4 ms, System: 1.0 ms] Range (min … max): 1.4 ms … 1.8 ms 1779 runs Benchmark 2: show-ref: print single ref (refformat = reftable, refcount = 1) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.4 ms … 2.5 ms 1753 runs Benchmark 3: show-ref: print single ref (refformat = files, refcount = 1000) Time (mean ± σ): 1.5 ms ± 0.1 ms [User: 0.3 ms, System: 1.1 ms] Range (min … max): 1.4 ms … 1.9 ms 1840 runs Benchmark 4: show-ref: print single ref (refformat = reftable, refcount = 1000) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.5 ms … 2.0 ms 1831 runs Benchmark 5: show-ref: print single ref (refformat = files, refcount = 1000000) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.5 ms … 2.1 ms 1848 runs Benchmark 6: show-ref: print single ref (refformat = reftable, refcount = 1000000) Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 1.1 ms] Range (min … max): 1.5 ms … 2.1 ms 1762 runs So overall, performance depends on the usecases. Except for many sequential writes the "reftable" backend is roughly on par or significantly faster than the "files" backend though. Given that the "files" backend has received 18 years of optimizations by now this can be seen as a win. Furthermore, we can expect that the "reftable" backend will grow faster over time when attention turns more towards optimizations. The complete test suite passes, except for those tests explicitly marked to require the REFFILES prerequisite. Some tests in t0610 are marked as failing because they depend on still-in-flight bug fixes. Tests can be run with the new backend by setting the GIT_TEST_DEFAULT_REF_FORMAT environment variable to "reftable". There is a single known conceptual incompatibility with the dumb HTTP transport. As "info/refs" SHOULD NOT contain the HEAD reference, and because the "HEAD" file is not valid anymore, it is impossible for the remote client to figure out the default branch without changing the protocol. This shortcoming needs to be handled in a subsequent patch series. As the reftable library has already been introduced a while ago, this commit message will not go into the details of how exactly the on-disk format works. Please refer to our preexisting technical documentation at Documentation/technical/reftable for this. [1]: https://public-inbox.org/git/CAJo=hJtyof=HRy=2sLP0ng0uZ4=S-DpZ5dR1aF+VHVETKG20OQ@mail.gmail.com/ Original-idea-by: Shawn Pearce <spearce@spearce.org> Based-on-patch-by: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-07 08:20:31 +01:00
#define REF_STORAGE_FORMAT_UNKNOWN 0
#define REF_STORAGE_FORMAT_FILES 1
#define REF_STORAGE_FORMAT_REFTABLE 2
struct repo_settings {
int initialized;
int core_commit_graph;
commit-graph: pass repo_settings instead of repository The parse_commit_graph() function takes a 'struct repository *' pointer, but it only ever accesses config settings (either directly or through the .settings field of the repo struct). Move all relevant config settings into the repo_settings struct, and update parse_commit_graph() and its existing callers so that it takes 'struct repo_settings *' instead. Callers of parse_commit_graph() will now need to call prepare_repo_settings() themselves, or initialize a 'struct repo_settings' directly. Prior to ab14d0676c (commit-graph: pass a 'struct repository *' in more places, 2020-09-09), parsing a commit-graph was a pure function depending only on the contents of the commit-graph itself. Commit ab14d0676c introduced a dependency on a `struct repository` pointer, and later commits such as b66d84756f (commit-graph: respect 'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config settings, which were accessed through the `settings` field of the repository pointer. This field was initialized via a call to `prepare_repo_settings()`. Additionally, this fixes an issue in fuzz-commit-graph: In 44c7e62 (2021-12-06, repo-settings:prepare_repo_settings only in git repos), prepare_repo_settings was changed to issue a BUG() if it is called by a process whose CWD is not a Git repository. The combination of commits mentioned above broke fuzz-commit-graph, which attempts to parse arbitrary fuzzing-engine-provided bytes as a commit graph file. Prior to this change, parse_commit_graph() called prepare_repo_settings(), but since we run the fuzz tests without a valid repository, we are hitting the BUG() from 44c7e62 for every test case. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-07-14 23:43:06 +02:00
int commit_graph_generation_version;
int commit_graph_read_changed_paths;
int gc_write_commit_graph;
int fetch_write_commit_graph;
repository.h: don't use a mix of int and bitfields Change the bitfield added in 58300f47432 (sparse-index: add index.sparse config option, 2021-03-30) and 3964fc2aae7 (sparse-index: add guard to ensure full index, 2021-03-30) to just use an "int" boolean instead. It might be smart to optimize the space here in the future, but by consistently using an "int" we can take its address and pass it to repo_cfg_bool(), and therefore don't need to handle "sparse_index" as a special-case when reading the "index.sparse" setting. There's no corresponding config for "command_requires_full_index", but let's change it too for consistency and to prevent future bugs creeping in due to one of these being "unsigned". Using "int" consistently also prevents subtle bugs or undesired control flow creeping in here. Before the preceding commit the initialization of "command_requires_full_index" in prepare_repo_settings() did nothing, i.e. this: r->settings.command_requires_full_index = 1 Was redundant to the earlier memset() to -1. Likewise for "sparse_index" added in 58300f47432 (sparse-index: add index.sparse config option, 2021-03-30) the code and comment added there was misleading, we weren't initializing it to off, but re-initializing it from "1" to "0", and then finally checking the config, and perhaps setting it to "1" again. I.e. we could have applied this patch before the preceding commit: + assert(r->settings.command_requires_full_index == 1); r->settings.command_requires_full_index = 1; /* * Initialize this as off. */ + assert(r->settings.sparse_index == 1); r->settings.sparse_index = 0; if (!repo_config_get_bool(r, "index.sparse", &value) && value) r->settings.sparse_index = 1; Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-21 15:13:03 +02:00
int command_requires_full_index;
int sparse_index;
pack-revindex: introduce `pack.readReverseIndex` Since 1615c567b8 (Documentation/config/pack.txt: advertise 'pack.writeReverseIndex', 2021-01-25), we have had the `pack.writeReverseIndex` configuration option, which tells Git whether or not it is allowed to write a ".rev" file when indexing a pack. Introduce a complementary configuration knob, `pack.readReverseIndex` to control whether or not Git will read any ".rev" file(s) that may be available on disk. This option is useful for debugging, as well as disabling the effect of ".rev" files in certain instances. This is useful because of the trade-off[^1] between the time it takes to generate a reverse index (slow from scratch, fast when reading an existing ".rev" file), and the time it takes to access a record (the opposite). For example, even though it is faster to use the on-disk reverse index when computing the on-disk size of a packed object, it is slower to enumerate the same value for all objects. Here are a couple of examples from linux.git. When computing the above for a single object, using the on-disk reverse index is significantly faster: $ git rev-parse HEAD >in $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" <in' Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in Time (mean ± σ): 302.5 ms ± 12.5 ms [User: 258.7 ms, System: 43.6 ms] Range (min … max): 291.1 ms … 328.1 ms 10 runs Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in Time (mean ± σ): 3.9 ms ± 0.3 ms [User: 1.6 ms, System: 2.4 ms] Range (min … max): 2.0 ms … 4.4 ms 801 runs Summary 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" <in' ran 77.29 ± 7.14 times faster than 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" <in' , but when instead trying to compute the on-disk object size for all objects in the repository, using the ".rev" file is a disadvantage over creating the reverse index from scratch: $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 8.258 s ± 0.035 s [User: 7.949 s, System: 0.308 s] Range (min … max): 8.199 s … 8.293 s 10 runs Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 16.976 s ± 0.107 s [User: 16.706 s, System: 0.268 s] Range (min … max): 16.839 s … 17.105 s 10 runs Summary 'git.compile -c pack.readReverseIndex=false cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' ran 2.06 ± 0.02 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --batch-check="%(objectsize:disk)" --batch-all-objects' Luckily, the results when running `git cat-file` with `--unordered` are closer together: $ hyperfine -L v false,true 'git.compile -c pack.readReverseIndex={v} cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' Benchmark 1: git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 5.066 s ± 0.105 s [User: 4.792 s, System: 0.274 s] Range (min … max): 4.943 s … 5.220 s 10 runs Benchmark 2: git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects Time (mean ± σ): 6.193 s ± 0.069 s [User: 5.937 s, System: 0.255 s] Range (min … max): 6.145 s … 6.356 s 10 runs Summary 'git.compile -c pack.readReverseIndex=false cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' ran 1.22 ± 0.03 times faster than 'git.compile -c pack.readReverseIndex=true cat-file --unordered --batch-check="%(objectsize:disk)" --batch-all-objects' Because the equilibrium point between these two is highly machine- and repository-dependent, allow users to configure whether or not they will read any ".rev" file(s) with this configuration knob. [^1]: Generating a reverse index in memory takes O(N) time (where N is the number of objects in the repository), since we use a radix sort. Reading an entry from an on-disk ".rev" file is slower since each operation is bound by disk I/O instead of memory I/O. In order to compute the on-disk size of a packed object, we need to find the offset of our object, and the adjacent object (the on-disk size difference of these two). Finding the first offset requires a binary search. Finding the latter involves a single .rev lookup. Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-13 00:20:30 +02:00
int pack_read_reverse_index;
pack-bitmap.c: use commit boundary during bitmap traversal When reachability bitmap coverage exists in a repository, Git will use a different (and hopefully faster) traversal to compute revision walks. Consider a set of positive and negative tips (which we'll refer to with their standard bitmap parlance by "wants", and "haves"). In order to figure out what objects exist between the tips, the existing traversal in `prepare_bitmap_walk()` does something like: 1. Consider if we can even compute the set of objects with bitmaps, and fall back to the usual traversal if we cannot. For example, pathspec limiting traversals can't be computed using bitmaps (since they don't know which objects are at which paths). The same is true of certain kinds of non-trivial object filters. 2. If we can compute the traversal with bitmaps, partition the (dereferenced) tips into two object lists, "haves", and "wants", based on whether or not the objects have the UNINTERESTING flag, respectively. 3. Fall back to the ordinary object traversal if either (a) there are more than zero haves, none of which are in the bitmapped pack or MIDX, or (b) there are no wants. 4. Construct a reachability bitmap for the "haves" side by walking from the revision tips down to any existing bitmaps, OR-ing in any bitmaps as they are found. 5. Then do the same for the "wants" side, stopping at any objects that appear in the "haves" bitmap. 6. Filter the results if any object filter (that can be easily computed with bitmaps alone) was given, and then return back to the caller. When there is good bitmap coverage relative to the traversal tips, this walk is often significantly faster than an ordinary object traversal because it can visit far fewer objects. But in certain cases, it can be significantly *slower* than the usual object traversal. Why? Because we need to compute complete bitmaps on either side of the walk. If either one (or both) of the sides require walking many (or all!) objects before they get to an existing bitmap, the extra bitmap machinery is mostly or all overhead. One of the benefits, however, is that even if the walk is slower, bitmap traversals are guaranteed to provide an *exact* answer. Unlike the traditional object traversal algorithm, which can over-count the results by not opening trees for older commits, the bitmap walk builds an exact reachability bitmap for either side, meaning the results are never over-counted. But producing non-exact results is OK for our traversal here (both in the bitmap case and not), as long as the results are over-counted, not under. Relaxing the bitmap traversal to allow it to produce over-counted results gives us the opportunity to make some significant improvements. Instead of the above, the new algorithm only has to walk from the *boundary* down to the nearest bitmap, instead of from each of the UNINTERESTING tips. The boundary-based approach still has degenerate cases, but we'll show in a moment that it is often a significant improvement. The new algorithm works as follows: 1. Build a (partial) bitmap of the haves side by first OR-ing any bitmap(s) that already exist for UNINTERESTING commits between the haves and the boundary. 2. For each commit along the boundary, add it as a fill-in traversal tip (where the traversal terminates once an existing bitmap is found), and perform fill-in traversal. 3. Build up a complete bitmap of the wants side as usual, stopping any time we intersect the (partial) haves side. 4. Return the results. And is more-or-less equivalent to using the *old* algorithm with this invocation: $ git rev-list --objects --use-bitmap-index $WANTS --not \ $(git rev-list --objects --boundary $WANTS --not $HAVES | perl -lne 'print $1 if /^-(.*)/') The new result performs significantly better in many cases, particularly when the distance from the boundary commit(s) to an existing bitmap is shorter than the distance from (all of) the have tips to the nearest bitmapped commit. Note that when using the old bitmap traversal algorithm, the results can be *slower* than without bitmaps! Under the new algorithm, the result is computed faster with bitmaps than without (at the cost of over-counting the true number of objects in a similar fashion as the non-bitmap traversal): # (Computing the number of tagged objects not on any branches # without bitmaps). $ time git rev-list --count --objects --tags --not --branches 20 real 0m1.388s user 0m1.092s sys 0m0.296s # (Computing the same query using the old bitmap traversal). $ time git rev-list --count --objects --tags --not --branches --use-bitmap-index 19 real 0m22.709s user 0m21.628s sys 0m1.076s # (this commit) $ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index 19 real 0m1.518s user 0m1.234s sys 0m0.284s The new algorithm is still slower than not using bitmaps at all, but it is nearly a 15-fold improvement over the existing traversal. In a more realistic setting (using my local copy of git.git), I can observe a similar (if more modest) speed-up: $ argv="--count --objects --branches --not --tags" hyperfine \ -n 'no bitmaps' "git.compile rev-list $argv" \ -n 'existing traversal' "git.compile rev-list --use-bitmap-index $argv" \ -n 'boundary traversal' "git.compile -c pack.useBitmapBoundaryTraversal=true rev-list --use-bitmap-index $argv" Benchmark 1: no bitmaps Time (mean ± σ): 124.6 ms ± 2.1 ms [User: 103.7 ms, System: 20.8 ms] Range (min … max): 122.6 ms … 133.1 ms 22 runs Benchmark 2: existing traversal Time (mean ± σ): 368.6 ms ± 3.0 ms [User: 325.3 ms, System: 43.1 ms] Range (min … max): 365.1 ms … 374.8 ms 10 runs Benchmark 3: boundary traversal Time (mean ± σ): 167.6 ms ± 0.9 ms [User: 139.5 ms, System: 27.9 ms] Range (min … max): 166.1 ms … 169.2 ms 17 runs Summary 'no bitmaps' ran 1.34 ± 0.02 times faster than 'boundary traversal' 2.96 ± 0.05 times faster than 'existing traversal' Here, the new algorithm is also still slower than not using bitmaps, but represents a more than 2-fold improvement over the existing traversal in a more modest example. Since this algorithm was originally written (nearly a year and a half ago, at the time of writing), the bitmap lookup table shipped, making the new algorithm's result more competitive. A few other future directions for improving bitmap traversal times beyond not using bitmaps at all: - Decrease the cost to decompress and OR together many bitmaps together (particularly when enumerating the uninteresting side of the walk). Here we could explore more efficient bitmap storage techniques, like Roaring+Run and/or use SIMD instructions to speed up ORing them together. - Store pseudo-merge bitmaps, which could allow us to OR together fewer "summary" bitmaps (which would also help with the above). Helped-by: Jeff King <peff@peff.net> Helped-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-05-08 19:38:12 +02:00
int pack_use_bitmap_boundary_traversal;
int pack_use_multi_pack_reuse;
repository: create read_replace_refs setting The 'read_replace_refs' global specifies whether or not we should respect the references of the form 'refs/replace/<oid>' to replace which object we look up when asking for '<oid>'. This global has caused issues when it is not initialized properly, such as in b6551feadfd (merge-tree: load default git config, 2023-05-10). To make this more robust, move its config-based initialization out of git_default_config and into prepare_repo_settings(). This provides a repository-scoped version of the 'read_replace_refs' global. The global still has its purpose: it is disabled process-wide by the GIT_NO_REPLACE_OBJECTS environment variable or by a call to disable_replace_refs() in some specific Git commands. Since we already encapsulated the use of the constant inside replace_refs_enabled(), we can perform the initialization inside that method, if necessary. This solves the problem of forgetting to check the config, as we will check it before returning this value. Due to this encapsulation, the global can move to be static within replace-object.c. There is an interesting behavior change possible here: we now have a repository-scoped understanding of this config value. Thus, if there was a command that recurses into submodules and might follow replace refs, then it would now respect the core.useReplaceRefs config value in each repository. 'git grep --recurse-submodules' is such a command that recurses into submodules in-process. We can demonstrate the granularity of this config value via a test in t7814. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-06 15:24:37 +02:00
/*
* Does this repository have core.useReplaceRefs=true (on by
* default)? This provides a repository-scoped version of this
* config, though it could be disabled process-wide via some Git
* builtins or the --no-replace-objects option. See
* replace_refs_enabled() for more details.
*/
int read_replace_refs;
fsmonitor: config settings are repository-specific Move fsmonitor config settings to a new and opaque `struct fsmonitor_settings` structure. Add a lazily-loaded pointer to this into `struct repo_settings` Create an `enum fsmonitor_mode` type in `struct fsmonitor_settings` to represent the state of fsmonitor. This lets us represent which, if any, fsmonitor provider (hook or IPC) is enabled. Create `fsm_settings__get_*()` getters to lazily look up fsmonitor- related config settings. Get rid of the `core_fsmonitor` global variable. Move the code to lookup the existing `core.fsmonitor` config value into the fsmonitor settings. Create a hook pathname variable in `struct fsmonitor-settings` and only set it when in hook mode. Extend the definition of `core.fsmonitor` to be either a boolean or a hook pathname. When true, the builtin FSMonitor is used. When false or unset, no FSMonitor (neither builtin nor hook) is used. The existing `core_fsmonitor` global variable was used to store the pathname to the fsmonitor hook *and* it was used as a boolean to see if fsmonitor was enabled. This dual usage and global visibility leads to confusion when we add the IPC-based provider. So lets hide the details in fsmonitor-settings.c and let it decide which provider to use in the case of multiple settings. This avoids cluttering up repo-settings.c with these private details. A future commit in builtin-fsmonitor series will add the ability to disqualify worktrees for various reasons, such as being mounted from a remote volume, where fsmonitor should not be started. Having the config settings hidden in fsmonitor-settings.c allows such worktree restrictions to override the config values used. Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-25 19:02:46 +01:00
struct fsmonitor_settings *fsmonitor; /* lazily loaded */
int index_version;
int index_skip_hash;
enum untracked_cache_setting core_untracked_cache;
int pack_use_sparse;
enum fetch_negotiation_setting fetch_negotiation_algorithm;
int core_multi_pack_index;
};
struct repo_path_cache {
char *squash_msg;
char *merge_msg;
char *merge_rr;
char *merge_mode;
char *merge_head;
char *fetch_head;
char *shallow;
};
struct repository {
/* Environment */
/*
* Path to the git directory.
* Cannot be NULL after initialization.
*/
char *gitdir;
/*
* Path to the common git directory.
* Cannot be NULL after initialization.
*/
char *commondir;
/*
* Holds any information related to accessing the raw object content.
*/
struct raw_object_store *objects;
/*
* All objects in this repository that have been parsed. This structure
* owns all objects it references, so users of "struct object *"
* generally do not need to free them; instead, when a repository is no
* longer used, call parsed_object_pool_clear() on this structure, which
* is called by the repositories repo_clear on its desconstruction.
*/
struct parsed_object_pool *parsed_objects;
/*
* The store in which the refs are held. This should generally only be
* accessed via get_main_ref_store(), as that will lazily initialize
* the ref object.
*/
struct ref_store *refs_private;
/*
* Contains path to often used file names.
*/
struct repo_path_cache cached_paths;
/*
* Path to the repository's graft file.
* Cannot be NULL after initialization.
*/
char *graft_file;
/*
* Path to the current worktree's index file.
* Cannot be NULL after initialization.
*/
char *index_file;
/*
* Path to the working directory.
* A NULL value indicates that there is no working directory.
*/
char *worktree;
/*
* Path from the root of the top-level superproject down to this
* repository. This is only non-NULL if the repository is initialized
* as a submodule of another repository.
*/
char *submodule_prefix;
struct repo_settings settings;
/* Subsystems */
/*
* Repository's config which contains key-value pairs from the usual
* set of config files (i.e. repo specific .git/config, user wide
* ~/.gitconfig, XDG config file and the global /etc/gitconfig)
*/
struct config_set *config;
/* Repository's submodule config as defined by '.gitmodules' */
struct submodule_cache *submodule_cache;
/*
* Repository's in-memory index.
* 'repo_read_index()' can be used to populate 'index'.
*/
struct index_state *index;
/* Repository's remotes and associated structures. */
struct remote_state *remote_state;
/* Repository's current hash algorithm, as serialized on disk. */
const struct git_hash_algo *hash_algo;
/* Repository's compatibility hash algorithm. */
const struct git_hash_algo *compat_hash_algo;
/* Repository's reference storage format, as serialized on disk. */
unsigned int ref_storage_format;
/* A unique-id for tracing purposes. */
int trace2_repo_id;
upload-pack: disable commit graph more gently for shallow traversal When the client has asked for certain shallow options like "deepen-since", we do a custom rev-list walk that pretends to be shallow. Before doing so, we have to disable the commit-graph, since it is not compatible with the shallow view of the repository. That's handled by 829a321569 (commit-graph: close_commit_graph before shallow walk, 2018-08-20). That commit literally closes and frees our repo->objects->commit_graph struct. That creates an interesting problem for commits that have _already_ been parsed using the commit graph. Their commit->object.parsed flag is set, their commit->graph_pos is set, but their commit->maybe_tree may still be NULL. When somebody later calls repo_get_commit_tree(), we see that we haven't loaded the tree oid yet and try to get it from the commit graph. But since it has been freed, we segfault! So the root of the issue is a data dependency between the commit's lazy-load of the tree oid and the fact that the commit graph can go away mid-process. How can we resolve it? There are a couple of general approaches: 1. The obvious answer is to avoid loading the tree from the graph when we see that it's NULL. But then what do we return for the tree oid? If we return NULL, our caller in do_traverse() will rightly complain that we have no tree. We'd have to fallback to loading the actual commit object and re-parsing it. That requires teaching parse_commit_buffer() to understand re-parsing (i.e., not starting from a clean slate and not leaking any allocated bits like parent list pointers). 2. When we close the commit graph, walk through the set of in-memory objects and clear any graph_pos pointers. But this means we also have to "unparse" any such commits so that we know they still need to open the commit object to fill in their trees. So it's no less complicated than (1), and is more expensive (since we clear objects we might not later need). 3. Stop freeing the commit-graph struct. Continue to let it be used for lazy-loads of tree oids, but let upload-pack specify that it shouldn't be used for further commit parsing. 4. Push the whole shallow rev-list out to its own sub-process, with the commit-graph disabled from the start, giving it a clean memory space to work from. I've chosen (3) here. Options (1) and (2) would work, but are non-trivial to implement. Option (4) is more expensive, and I'm not sure how complicated it is (shelling out for the actual rev-list part is easy, but we do then parse the resulting commits internally, and I'm not clear which parts need to be handling shallow-ness). The new test in t5500 triggers this segfault, but see the comments there for how horribly intimate it has to be with how both upload-pack and commit graphs work. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-12 16:44:45 +02:00
/* True if commit-graph has been disabled within this process. */
int commit_graph_disabled;
/* Configurations related to promisor remotes. */
char *repository_format_partial_clone;
struct promisor_remote_config *promisor_remote_config;
/* Configurations */
int repository_format_worktree_config;
/* Indicate if a repository has a different 'commondir' from 'gitdir' */
unsigned different_commondir:1;
};
extern struct repository *the_repository;
#ifdef USE_THE_INDEX_VARIABLE
extern struct index_state the_index;
#endif
/*
* Define a custom repository layout. Any field can be NULL, which
* will default back to the path according to the default layout.
*/
struct set_gitdir_args {
const char *commondir;
const char *object_dir;
const char *graft_file;
const char *index_file;
const char *alternate_db;
int disable_ref_updates;
};
void repo_set_gitdir(struct repository *repo, const char *root,
const struct set_gitdir_args *extra_args);
void repo_set_worktree(struct repository *repo, const char *path);
void repo_set_hash_algo(struct repository *repo, int algo);
void repo_set_compat_hash_algo(struct repository *repo, int compat_algo);
void repo_set_ref_storage_format(struct repository *repo, unsigned int format);
void initialize_the_repository(void);
RESULT_MUST_BE_USED
int repo_init(struct repository *r, const char *gitdir, const char *worktree);
/*
* Initialize the repository 'subrepo' as the submodule at the given path. If
* the submodule's gitdir cannot be found at <path>/.git, this function calls
* submodule_from_path() to try to find it. treeish_name is only used if
* submodule_from_path() needs to be called; see its documentation for more
* information.
* Return 0 upon success and a non-zero value upon failure.
*/
struct object_id;
RESULT_MUST_BE_USED
int repo_submodule_init(struct repository *subrepo,
struct repository *superproject,
const char *path,
const struct object_id *treeish_name);
void repo_clear(struct repository *repo);
/*
* Populates the repository's index from its index_file, an index struct will
* be allocated if needed.
*
* Return the number of index entries in the populated index or a value less
* than zero if an error occurred. If the repository's index has already been
* populated then the number of entries will simply be returned.
*/
int repo_read_index(struct repository *repo);
int repo_hold_locked_index(struct repository *repo,
struct lock_file *lf,
int flags);
int repo_read_index_unmerged(struct repository *);
/*
* Opportunistically update the index but do not complain if we can't.
* The lockfile is always committed or rolled back.
*/
void repo_update_index_if_able(struct repository *, struct lock_file *);
void prepare_repo_settings(struct repository *r);
/*
* Return 1 if upgrade repository format to target_version succeeded,
* 0 if no upgrade is necessary, and -1 when upgrade is not possible.
*/
int upgrade_repository_format(int target_version);
#endif /* REPOSITORY_H */