git/commit-reach.h

143 lines
4.4 KiB
C
Raw Permalink Normal View History

#ifndef COMMIT_REACH_H
#define COMMIT_REACH_H
commit-reach.h: add missing declarations (hdr-check) Add the necessary #includes and forward declarations to allow the header file to pass the 'hdr-check' target. Note that, since this header includes the commit-slab implementation header file (indirectly via commit-slab.h), some of the commit-slab inline functions (e.g contains_cache_at_peek()) will not compile without the complete type of 'struct commit'. Hence, we replace the forward declaration of 'struct commit' with the an #include of the 'commit.h' header file. It is possible, using the 'commit-slab-{decl,impl}.h' files, to avoid this inclusion of the 'commit.h' header. Commit a9f1f1f9f8 ("commit-slab.h: code split", 2018-05-19) separated the commit-slab interface from its implementation, to allow for the definition of a public commit-slab data structure. This enabled us to avoid including the commit-slab implementation in a header file, which could result in the replication of the commit-slab functions in each compilation unit in which it was included. Indeed, if you compile with optimizations disabled, then run this script: $ cat -n dup-static.sh 1 #!/bin/sh 2 3 nm $1 | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | 4 sort -rn | grep -v ' 1' $ $ ./dup-static.sh git | grep contains 24 init_contains_cache_with_stride 24 init_contains_cache 24 contains_cache_peek 24 contains_cache_at_peek 24 contains_cache_at 24 clear_contains_cache $ you will find 24 copies of the commit-slab routines for the contains_cache. Of course, when you enable optimizations again, these duplicate static functions (mostly) disappear. Compiling with gcc at -O2, leaves two static functions, thus: $ nm commit-reach.o | grep contains_cache 0000000000000870 t contains_cache_at_peek.isra.1.constprop.6 $ nm ref-filter.o | grep contains_cache 00000000000002b0 t clear_contains_cache.isra.14 $ However, using a shared 'contains_cache' would result in all six of the above functions as external public functions in the git binary. At present, only three of these functions are actually called, so the trade-off seems to favour letting the compiler inline the commit-slab functions. Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-27 03:53:57 +02:00
#include "commit.h"
#include "commit-slab.h"
struct commit_list;
struct ref_filter;
commit-reach.h: add missing declarations (hdr-check) Add the necessary #includes and forward declarations to allow the header file to pass the 'hdr-check' target. Note that, since this header includes the commit-slab implementation header file (indirectly via commit-slab.h), some of the commit-slab inline functions (e.g contains_cache_at_peek()) will not compile without the complete type of 'struct commit'. Hence, we replace the forward declaration of 'struct commit' with the an #include of the 'commit.h' header file. It is possible, using the 'commit-slab-{decl,impl}.h' files, to avoid this inclusion of the 'commit.h' header. Commit a9f1f1f9f8 ("commit-slab.h: code split", 2018-05-19) separated the commit-slab interface from its implementation, to allow for the definition of a public commit-slab data structure. This enabled us to avoid including the commit-slab implementation in a header file, which could result in the replication of the commit-slab functions in each compilation unit in which it was included. Indeed, if you compile with optimizations disabled, then run this script: $ cat -n dup-static.sh 1 #!/bin/sh 2 3 nm $1 | grep ' t ' | cut -d' ' -f3 | sort | uniq -c | 4 sort -rn | grep -v ' 1' $ $ ./dup-static.sh git | grep contains 24 init_contains_cache_with_stride 24 init_contains_cache 24 contains_cache_peek 24 contains_cache_at_peek 24 contains_cache_at 24 clear_contains_cache $ you will find 24 copies of the commit-slab routines for the contains_cache. Of course, when you enable optimizations again, these duplicate static functions (mostly) disappear. Compiling with gcc at -O2, leaves two static functions, thus: $ nm commit-reach.o | grep contains_cache 0000000000000870 t contains_cache_at_peek.isra.1.constprop.6 $ nm ref-filter.o | grep contains_cache 00000000000002b0 t clear_contains_cache.isra.14 $ However, using a shared 'contains_cache' would result in all six of the above functions as external public functions in the git binary. At present, only three of these functions are actually called, so the trade-off seems to favour letting the compiler inline the commit-slab functions. Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-27 03:53:57 +02:00
struct object_id;
struct object_array;
int repo_get_merge_bases(struct repository *r,
struct commit *rev1,
struct commit *rev2,
struct commit_list **result);
int repo_get_merge_bases_many(struct repository *r,
struct commit *one, int n,
struct commit **twos,
struct commit_list **result);
/* To be used only when object flags after this call no longer matter */
int repo_get_merge_bases_many_dirty(struct repository *r,
struct commit *one, int n,
struct commit **twos,
struct commit_list **result);
int get_octopus_merge_bases(struct commit_list *in, struct commit_list **result);
int repo_is_descendant_of(struct repository *r,
struct commit *commit,
struct commit_list *with_commit);
int repo_in_merge_bases(struct repository *r,
struct commit *commit,
struct commit *reference);
int repo_in_merge_bases_many(struct repository *r,
struct commit *commit,
int nr_reference, struct commit **reference,
int ignore_missing_commits);
/*
* Takes a list of commits and returns a new list where those
* have been removed that can be reached from other commits in
* the list. It is useful for, e.g., reducing the commits
* randomly thrown at the git-merge command and removing
* redundant commits that the user shouldn't have given to it.
*
* This function destroys the STALE bit of the commit objects'
* flags.
*/
struct commit_list *reduce_heads(struct commit_list *heads);
/*
* Like `reduce_heads()`, except it replaces the list. Use this
* instead of `foo = reduce_heads(foo);` to avoid memory leaks.
*/
void reduce_heads_replace(struct commit_list **heads);
int ref_newer(const struct object_id *new_oid, const struct object_id *old_oid);
/*
* Unknown has to be "0" here, because that's the default value for
* contains_cache slab entries that have not yet been assigned.
*/
enum contains_result {
CONTAINS_UNKNOWN = 0,
CONTAINS_NO,
CONTAINS_YES
};
define_commit_slab(contains_cache, enum contains_result);
int commit_contains(struct ref_filter *filter, struct commit *commit,
struct commit_list *list, struct contains_cache *cache);
/*
* Determine if every commit in 'from' can reach at least one commit
* that is marked with 'with_flag'. As we traverse, use 'assign_flag'
* as a marker for commits that are already visited. Do not walk
commit-reach: make can_all_from_reach... linear The can_all_from_reach_with_flags() algorithm is currently quadratic in the worst case, because it calls the reachable() method for every 'from' without tracking which commits have already been walked or which can already reach a commit in 'to'. Rewrite the algorithm to walk each commit a constant number of times. We also add some optimizations that should work for the main consumer of this method: fetch negotitation (haves/wants). The first step includes using a depth-first-search (DFS) from each 'from' commit, sorted by ascending generation number. We do not walk beyond the minimum generation number or the minimum commit date. This DFS is likely to be faster than the existing reachable() method because we expect previous ref values to be along the first-parent history. If we find a target commit, then we mark everything in the DFS stack as a RESULT. This expands the set of targets for the other 'from' commits. We also mark the visited commits using 'assign_flag' to prevent re- walking the same commits. We still need to clear our flags at the end, which is why we will have a total of three visits to each commit. Performance was measured on the Linux repository using 'test-tool reach can_all_from_reach'. The input included rows seeded by tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as v3.[0-9]*. This mimics a (very large) fetch that says "I have all major v3 releases and want all major v4 releases." The "large" case included X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate tags to the set, which does not greatly increase the number of objects that are considered, but does increase the number of 'from' commits, demonstrating the quadratic nature of the previous code. Small Case: Before: 1.52 s After: 0.26 s Large Case: Before: 3.50 s After: 0.27 s Note how the time increases between the two cases in the two versions. The new code increases relative to the number of commits that need to be walked, but not directly relative to the number of 'from' commits. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20 18:33:28 +02:00
* commits with date below 'min_commit_date' or generation below
* 'min_generation'.
*/
int can_all_from_reach_with_flag(struct object_array *from,
unsigned int with_flag,
unsigned int assign_flag,
commit-reach: make can_all_from_reach... linear The can_all_from_reach_with_flags() algorithm is currently quadratic in the worst case, because it calls the reachable() method for every 'from' without tracking which commits have already been walked or which can already reach a commit in 'to'. Rewrite the algorithm to walk each commit a constant number of times. We also add some optimizations that should work for the main consumer of this method: fetch negotitation (haves/wants). The first step includes using a depth-first-search (DFS) from each 'from' commit, sorted by ascending generation number. We do not walk beyond the minimum generation number or the minimum commit date. This DFS is likely to be faster than the existing reachable() method because we expect previous ref values to be along the first-parent history. If we find a target commit, then we mark everything in the DFS stack as a RESULT. This expands the set of targets for the other 'from' commits. We also mark the visited commits using 'assign_flag' to prevent re- walking the same commits. We still need to clear our flags at the end, which is why we will have a total of three visits to each commit. Performance was measured on the Linux repository using 'test-tool reach can_all_from_reach'. The input included rows seeded by tag values. The "small" case included X-rows as v4.[0-9]* and Y-rows as v3.[0-9]*. This mimics a (very large) fetch that says "I have all major v3 releases and want all major v4 releases." The "large" case included X-rows as "v4.*" and Y-rows as "v3.*". This adds all release-candidate tags to the set, which does not greatly increase the number of objects that are considered, but does increase the number of 'from' commits, demonstrating the quadratic nature of the previous code. Small Case: Before: 1.52 s After: 0.26 s Large Case: Before: 3.50 s After: 0.27 s Note how the time increases between the two cases in the two versions. The new code increases relative to the number of commits that need to be walked, but not directly relative to the number of 'from' commits. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-20 18:33:28 +02:00
time_t min_commit_date,
timestamp_t min_generation);
int can_all_from_reach(struct commit_list *from, struct commit_list *to,
int commit_date_cutoff);
commit-reach: implement get_reachable_subset The existing reachability algorithms in commit-reach.c focus on finding merge-bases or determining if all commits in a set X can reach at least one commit in a set Y. However, for two commits sets X and Y, we may also care about which commits in Y are reachable from at least one commit in X. Implement get_reachable_subset() which answers this question. Given two arrays of commits, 'from' and 'to', return a commit_list with every commit from the 'to' array that is reachable from at least one commit in the 'from' array. The algorithm is a simple walk starting at the 'from' commits, using the PARENT2 flag to indicate "this commit has already been added to the walk queue". By marking the 'to' commits with the PARENT1 flag, we can determine when we see a commit from the 'to' array. We remove the PARENT1 flag as we add that commit to the result list to avoid duplicates. The order of the resulting list is a reverse of the order that the commits are discovered in the walk. There are a couple shortcuts to avoid walking more than we need: 1. We determine the minimum generation number of commits in the 'to' array. We do not walk commits with generation number below this minimum. 2. We count how many distinct commits are in the 'to' array, and decrement this count when we discover a 'to' commit during the walk. If this number reaches zero, then we can terminate the walk. Tests will be added using the 'test-tool reach' helper in a subsequent commit. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-02 14:14:45 +01:00
/*
* Return a list of commits containing the commits in the 'to' array
* that are reachable from at least one commit in the 'from' array.
* Also add the given 'flag' to each of the commits in the returned list.
*
* This method uses the PARENT1 and PARENT2 flags during its operation,
* so be sure these flags are not set before calling the method.
*/
struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
struct commit **to, int nr_to,
unsigned int reachable_flag);
commit-reach: implement ahead_behind() logic Fully implement the commit-counting logic required to determine ahead/behind counts for a batch of commit pairs. This is a new library method within commit-reach.h. This method will be linked to the for-each-ref builtin in the next change. The interface for ahead_behind() uses two arrays. The first array of commits contains the list of all starting points for the walk. This includes all tip commits _and_ base commits. The second array specifies base/tip pairs by pointing to commits within the first array, by index. The second array also stores the resulting ahead/behind counts for each of these pairs. This implementation of ahead_behind() allows multiple bases, if desired. Even with multiple bases, there is only one commit walk used for counting the ahead/behind values, saving time when the base/tip ranges overlap significantly. This interface for ahead_behind() also makes it very easy to call ensure_generations_valid() on the entire array of bases and tips. This call is necessary because it is critical that the walk that counts ahead/behind values never walks a commit more than once. Without generation numbers on every commit, there is a possibility that a commit date skew could cause the walk to revisit a commit and then double-count it. For this reason, it is strongly recommended that 'git ahead-behind' is only run in a repository with a commit-graph file that covers most of the reachable commits, storing precomputed generation numbers. If no commit-graph exists, this walk will be much slower as it must walk all reachable commits in ensure_generations_valid() before performing the counting logic. It is possible to detect if generation numbers are available at run time and redirect the implementation to another algorithm that does not require this property. However, that implementation requires a commit walk per base/tip pair _and_ can be slower due to the commit date heuristics required. Such an implementation could be considered in the future if there is a reason to include it, but most Git hosts should already be generating a commit-graph file as part of repository maintenance. Most Git clients should also be generating commit-graph files as part of background maintenance or automatic GCs. Now, let's discuss the ahead/behind counting algorithm. The first array of commits are considered the starting commits. The index within that array will play a critical role. We create a new commit slab that maps commits to a bitmap. For a given commit (anywhere in the history), its bitmap stores information relative to which of the input commits can reach that commit. The ith bit will be on if the ith commit from the starting list can reach that commit. It is important to notice that these bitmaps are not the typical "reachability bitmaps" that are stored in .bitmap files. Instead of signalling which objects are reachable from the current commit, they instead signal "which starting commits can reach me?" It is also important to know that the bitmap is not necessarily "complete" until we walk that commit. We will perform a commit walk by generation number in such a way that we can guarantee the bitmap is correct when we visit that commit. At the beginning of the ahead_behind() method, we initialize the bitmaps for each of the starting commits. By enabling the ith bit for the ith starting commit, we signal "the ith commit can reach itself." We walk commits by popping the commit with maximum generation number out of the queue, guaranteeing that we will never walk a child of that commit in any future steps. As we walk, we load the bitmap for the current commit and perform two main steps. The _second_ step examines each parent of the current commit and adds the current commit's bitmap bits to each parent's bitmap. (We create a new bitmap for the parent if this is our first time seeing that parent.) After adding the bits to the parent's bitmap, the parent is added to the walk queue. Due to this passing of bits to parents, the current commit has a guarantee that the ith bit is enabled on its bitmap if and only if the ith commit can reach the current commit. The first step of the walk is to examine the bitmask on the current commit and decide which ranges the commit is in or not. Due to the "bit pushing" in the second step, we have a guarantee that the ith bit of the current commit's bitmap is on if and only if the ith starting commit can reach it. For each ahead_behind_count struct, check the base_index and tip_index to see if those bits are enabled on the current bitmap. If exactly one bit is enabled, then increment the corresponding 'ahead' or 'behind' count. This increment is the reason we _absolutely need_ to walk commits at most once. The only subtle thing to do with this walk is to check to see if a parent has all bits on in its bitmap, in which case it becomes "stale" and is marked with the STALE bit. This allows queue_has_nonstale() to be the terminating condition of the walk, which greatly reduces the number of commits walked if all of the commits are nearby in history. It avoids walking a large number of common commits when there is a deep history. We also use the helper method insert_no_dup() to add commits to the priority queue without adding them multiple times. This uses the PARENT2 flag. Thus, we must clear both the STALE and PARENT2 bits of all commits, in case ahead_behind() is called multiple times in the same process. Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20 12:26:53 +01:00
struct ahead_behind_count {
/**
* As input, the *_index members indicate which positions in
* the 'tips' array correspond to the tip and base of this
* comparison.
*/
size_t tip_index;
size_t base_index;
/**
* These values store the computed counts for each side of the
* symmetric difference:
*
* 'ahead' stores the number of commits reachable from the tip
* and not reachable from the base.
*
* 'behind' stores the number of commits reachable from the base
* and not reachable from the tip.
*/
unsigned int ahead;
unsigned int behind;
};
/*
* Given an array of commits and an array of ahead_behind_count pairs,
* compute the ahead/behind counts for each pair.
*/
void ahead_behind(struct repository *r,
struct commit **commits, size_t commits_nr,
struct ahead_behind_count *counts, size_t counts_nr);
commit-reach: add tips_reachable_from_bases() Both 'git for-each-ref --merged=<X>' and 'git branch --merged=<X>' use the ref-filter machinery to select references or branches (respectively) that are reachable from a set of commits presented by one or more --merged arguments. This happens within reach_filter(), which uses the revision-walk machinery to walk history in a standard way. However, the commit-reach.c file is full of custom searches that are more efficient, especially for reachability queries that can terminate early when reachability is discovered. Add a new tips_reachable_from_bases() method to commit-reach.c and call it from within reach_filter() in ref-filter.c. This affects both 'git branch' and 'git for-each-ref' as tested in p1500-graph-walks.sh. For the Linux kernel repository, we take an already-fast algorithm and make it even faster: Test HEAD~1 HEAD ------------------------------------------------------------------- 1500.5: contains: git for-each-ref --merged 0.13 0.02 -84.6% 1500.6: contains: git branch --merged 0.14 0.02 -85.7% 1500.7: contains: git tag --merged 0.15 0.03 -80.0% (Note that we remove the iterative 'git rev-list' test from p1500 because it no longer makes sense as a comparison to 'git for-each-ref' and would just waste time running it for these comparisons.) The algorithm is implemented in commit-reach.c in the method tips_reachable_from_base(). This method takes a string_list of tips and assigns the 'util' for each item with the value 1 if the base commit can reach those tips. Like other reachability queries in commit-reach.c, the fastest way to search for "can A reach B?" is to do a depth-first search up to the generation number of B, preferring to explore first parents before later parents. While we must walk all reachable commits up to that generation number when the answer is "no", the depth-first search can answer "yes" much faster than other approaches in most cases. This search becomes trickier when there are multiple targets for the depth-first search. The commits with lower generation number are more likely to be within the history of the start commit, but we don't want to waste time searching commits of low generation number if the commit target with lowest generation number has already been found. The trick here is to take the input commits and sort them by generation number in ascending order. Track the index within this order as min_generation_index. When we find a commit, if its index in the list is equal to min_generation_index, then we can increase the generation number boundary of our search to the next-lowest value in the list. With this mechanism, the number of commits to search is minimized with respect to the depth-first search heuristic. We will walk all commits up to the minimum generation number of a commit that is _not_ reachable from the start, but we will walk only the necessary portion of the depth-first search for the reachable commits of lower generation. Add extra tests for this behavior in t6600-test-reach.sh as the interesting data shape of that repository can sometimes demonstrate corner case bugs. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-20 12:26:55 +01:00
/*
* For all tip commits, add 'mark' to their flags if and only if they
* are reachable from one of the commits in 'bases'.
*/
void tips_reachable_from_bases(struct repository *r,
struct commit_list *bases,
struct commit **tips, size_t tips_nr,
int mark);
#endif