Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PATH WALK I: The path-walk API #1818

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
63 changes: 63 additions & 0 deletions Documentation/technical/api-path-walk.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
Path-Walk API
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, karthik nayak wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> From: Derrick Stolee <[email protected]>
>
> In anticipation of a few planned applications, introduce the most basic form
> of a path-walk API. It currently assumes that there are no UNINTERESTING
> objects, and does not include any complicated filters. It calls a function
> pointer on groups of tree and blob objects as grouped by path. This only
> includes objects the first time they are discovered, so an object that
> appears at multiple paths will not be included in two batches.
>
> These batches are collected in 'struct type_and_oid_list' objects, which
> store an object type and an oid_array of objects.
>
> The data structures are documented in 'struct path_walk_context', but in
> summary the most important are:
>
>   * 'paths_to_lists' is a strmap that connects a path to a
>     type_and_oid_list for that path. To avoid conflicts in path names,
>     we make sure that tree paths end in "/" (except the root path with
>     is an empty string) and blob paths do not end in "/".
>
>   * 'path_stack' is a string list that is added to in an append-only
>     way. This stores the stack of our depth-first search on the heap
>     instead of using recursion.
>
>   * 'path_stack_pushed' is a strmap that stores path names that were
>     already added to 'path_stack', to avoid repeating paths in the
>     stack. Mostly, this saves us from quadratic lookups from doing
>     unsorted checks into the string_list.
>
> The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
> push_to_stack() method. Call this instead of inserting into these
> structures directly.
>
> The walk_objects_by_path() method initializes these structures and
> starts walking commits from the given rev_info struct. The commits are
> used to find the list of root trees which populate the start of our
> depth-first search.

Isn't this more of breadth-first search? Reading through the code, the
algorithm seems something like:

- For each commit in list of commits (from rev_info)
  - Tackle each root tree, add root path to the stack.
- For each path in stack left
  - Call the callback provided by client.
  - Find all its first level children, add each to the stack.

So wouldn't this go through the tree in level by level basis? Making it
a BFS?

Apart from this, the patch itself looks solid. I ended up writing a
small client to play with this API, and was very pleased how quickly I
could get it running.

[snip]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/1/24 9:12 AM, karthik nayak wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> >> From: Derrick Stolee <[email protected]>
>>
>> The walk_objects_by_path() method initializes these structures and
>> starts walking commits from the given rev_info struct. The commits are
>> used to find the list of root trees which populate the start of our
>> depth-first search.
> > Isn't this more of breadth-first search? Reading through the code, the
> algorithm seems something like:
> > - For each commit in list of commits (from rev_info)
>    - Tackle each root tree, add root path to the stack.
> - For each path in stack left
>    - Call the callback provided by client.
>    - Find all its first level children, add each to the stack.
> > So wouldn't this go through the tree in level by level basis? Making it
> a BFS?

While we are adding all children to the stack, we only pop off the top
of the stack, making it a DFS. (We do visit the paths in reverse-
lexicographic order, though.)

To make it a BFS, we would need to visit the paths in the order they
are added to the list. Instead, we visit them in Last-In First-Out
order.

I initially had built it as a BFS, but ran into memory issues when
running it on very large repos.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, karthik nayak wrote (reply to this):

karthik nayak <[email protected]> writes:

> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
>
>> From: Derrick Stolee <[email protected]>
>>
>> In anticipation of a few planned applications, introduce the most basic form
>> of a path-walk API. It currently assumes that there are no UNINTERESTING
>> objects, and does not include any complicated filters. It calls a function
>> pointer on groups of tree and blob objects as grouped by path. This only
>> includes objects the first time they are discovered, so an object that
>> appears at multiple paths will not be included in two batches.
>>
>> These batches are collected in 'struct type_and_oid_list' objects, which
>> store an object type and an oid_array of objects.
>>
>> The data structures are documented in 'struct path_walk_context', but in
>> summary the most important are:
>>
>>   * 'paths_to_lists' is a strmap that connects a path to a
>>     type_and_oid_list for that path. To avoid conflicts in path names,
>>     we make sure that tree paths end in "/" (except the root path with
>>     is an empty string) and blob paths do not end in "/".
>>
>>   * 'path_stack' is a string list that is added to in an append-only
>>     way. This stores the stack of our depth-first search on the heap
>>     instead of using recursion.
>>
>>   * 'path_stack_pushed' is a strmap that stores path names that were
>>     already added to 'path_stack', to avoid repeating paths in the
>>     stack. Mostly, this saves us from quadratic lookups from doing
>>     unsorted checks into the string_list.
>>
>> The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
>> push_to_stack() method. Call this instead of inserting into these
>> structures directly.
>>
>> The walk_objects_by_path() method initializes these structures and
>> starts walking commits from the given rev_info struct. The commits are
>> used to find the list of root trees which populate the start of our
>> depth-first search.
>
> Isn't this more of breadth-first search? Reading through the code, the
> algorithm seems something like:
>
> - For each commit in list of commits (from rev_info)
>   - Tackle each root tree, add root path to the stack.
> - For each path in stack left
>   - Call the callback provided by client.
>   - Find all its first level children, add each to the stack.
>
> So wouldn't this go through the tree in level by level basis? Making it
> a BFS?

My bad here, thinking more about it, it is DFS indeed. Although we add
all the children of a level to the stack, we pop each of them from the
stack and end up traversing down that level.

>
> Apart from this, the patch itself looks solid. I ended up writing a
> small client to play with this API, and was very pleased how quickly I
> could get it running.
>
> [snip]

=============

The path-walk API is used to walk reachable objects, but to visit objects
in batches based on a common path they appear in, or by type.

For example, all reachable commits are visited in a group. All tags are
visited in a group. Then, all root trees are visited. At some point, all
blobs reachable via a path `my/dir/to/A` are visited. When there are
multiple paths possible to reach the same object, then only one of those
paths is used to visit the object.

Basics
------

To use the path-walk API, include `path-walk.h` and call
`walk_objects_by_path()` with a customized `path_walk_info` struct. The
struct is used to set all of the options for how the walk should proceed.
Let's dig into the different options and their use.

`path_fn` and `path_fn_data`::
The most important option is the `path_fn` option, which is a
function pointer to the callback that can execute logic on the
object IDs for objects grouped by type and path. This function
also receives a `data` value that corresponds to the
`path_fn_data` member, for providing custom data structures to
this callback function.

`revs`::
To configure the exact details of the reachable set of objects,
use the `revs` member and initialize it using the revision
machinery in `revision.h`. Initialize `revs` using calls such as
`setup_revisions()` or `parse_revision_opt()`. Do not call
`prepare_revision_walk()`, as that will be called within
`walk_objects_by_path()`.
+
It is also important that you do not specify the `--objects` flag for the
`revs` struct. The revision walk should only be used to walk commits, and
the objects will be walked in a separate way based on those starting
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, karthik nayak wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> From: Derrick Stolee <[email protected]>
>
> The rev_info that is specified for a path-walk traversal may specify
> visiting tag refs (both lightweight and annotated) and also may specify
> indexed objects (blobs and trees). Update the path-walk API to walk
> these objects as well.
>
> When walking tags, we need to peel the annotated objects until reaching
> a non-tag object. If we reach a commit, then we can add it to the
> pending objects to make sure we visit in the commit walk portion. If we

Nit: s/in/it in/

[snip]

> +		case OBJ_BLOB:
> +			if (!info->blobs)
> +				continue;
> +			if (pending->path) {
> +				struct type_and_oid_list *list;
> +				char *path = pending->path;
> +				if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
> +					CALLOC_ARRAY(list, 1);
> +					list->type = OBJ_BLOB;
> +					strmap_put(&ctx->paths_to_lists, path, list);
> +				}
> +				oid_array_append(&list->oids, &obj->oid);
> +			} else {
> +				/* assume a root tree, such as a lightweight tag. */

Shouldn't this comment be for tagged blobs?

> +				oid_array_append(&tagged_blobs->oids, &obj->oid);
> +			}
> +			break;

[snip]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/1/24 10:25 AM, karthik nayak wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> >> From: Derrick Stolee <[email protected]>
>>
>> The rev_info that is specified for a path-walk traversal may specify
>> visiting tag refs (both lightweight and annotated) and also may specify
>> indexed objects (blobs and trees). Update the path-walk API to walk
>> these objects as well.
>>
>> When walking tags, we need to peel the annotated objects until reaching
>> a non-tag object. If we reach a commit, then we can add it to the
>> pending objects to make sure we visit in the commit walk portion. If we
> > Nit: s/in/it in/

thanks!

>> +		case OBJ_BLOB:
>> +			if (!info->blobs)
>> +				continue;
>> +			if (pending->path) {
>> +				struct type_and_oid_list *list;
>> +				char *path = pending->path;
>> +				if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
>> +					CALLOC_ARRAY(list, 1);
>> +					list->type = OBJ_BLOB;
>> +					strmap_put(&ctx->paths_to_lists, path, list);
>> +				}
>> +				oid_array_append(&list->oids, &obj->oid);
>> +			} else {
>> +				/* assume a root tree, such as a lightweight tag. */
> > Shouldn't this comment be for tagged blobs?

Yes. This is a copy-paste error.

Thanks for the careful reading.
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Nov 09, 2024 at 07:41:10PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 24cf04c1e7d..2ca08402367 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -98,6 +98,10 @@ static int add_children(struct path_walk_context *ctx,
>  		if (S_ISGITLINK(entry.mode))
>  			continue;
>
> +		/* If the caller doesn't want blobs, then don't bother. */
> +		if (!ctx->info->blobs && type == OBJ_BLOB)
> +			continue;
> +

I was going to ask why we're not reusing the existing property of the
rev_info struct to specify what object type(s) we do/don't want, but the
answer is obvious: we don't want that to change the behavior of the
commit-level walk which is used to seed the actual path walk based on
its results.

However, it would be kind of nice to have a single place to specify how
you want to traverse objects, and using the rev_info struct seems like a
good choice there since you can naively ask Git to parse command-line
arguments like --blobs, --trees, --objects, etc. and set the appropriate
bits.

I wonder if it might make sense to do something like:

    unsigned int tmp_blobs = revs->blob_objects;
    unsigned int tmp_trees = revs->tree_objects;
    unsigned int tmp_tags = revs->tag_objects;

    if (prepare_revision_walk(revs))
        die(_("failed to setup revision walk"));

    /* commit-level walk */

    revs->blob_objects = tmp_blobs;
    revs->tree_objects = tmp_trees;
    revs->tag_objects = tmp_tags;

    /* path-level walk */

I don't have strong feelings about it, but it feels like this approach
could be cleaner from a caller's perspective. But I can see an argument
to the contrary that it does introduce some awkwardness with the
dual-use of those fields.

Thanks,
Taylor

commits.

`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
the objects or to have fewer calls to `path_fn`.
+
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.

`prune_all_uninteresting`::
By default, all reachable paths are emitted by the path-walk API.
This option allows consumers to declare that they are not
interested in paths where all included objects are marked with the
`UNINTERESTING` flag. This requires using the `boundary` option in
the revision walk so that the walk emits commits marked with the
`UNINTERESTING` flag.

Examples
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, karthik nayak wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

[snip]

> diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
> new file mode 100755
> index 00000000000..1f277b88291
> --- /dev/null
> +++ b/t/t6601-path-walk.sh
> @@ -0,0 +1,118 @@
> +#!/bin/sh
> +
> +test_description='direct path-walk API tests'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test repository' '
> +	git checkout -b base &&
> +
> +	mkdir left &&
> +	mkdir right &&
> +	echo a >a &&
> +	echo b >left/b &&
> +	echo c >right/c &&
> +	git add . &&
> +	git commit -m "first" &&
> +
> +	echo d >right/d &&
> +	git add right &&
> +	git commit -m "second" &&
> +
> +	echo bb >left/b &&
> +	git commit -a -m "third" &&
> +
> +	git checkout -b topic HEAD~1 &&
> +	echo cc >right/c &&
> +	git commit -a -m "topic"
> +'
> +

Nit: Since the root level tree is already special cased out, we only
check one level of path here, would be nice to add another level of tree
to this.

> +test_expect_success 'all' '
> +	test-tool path-walk -- --all >out &&
> +
> +	cat >expect <<-EOF &&
> +	TREE::$(git rev-parse topic^{tree})
> +	TREE::$(git rev-parse base^{tree})
> +	TREE::$(git rev-parse base~1^{tree})
> +	TREE::$(git rev-parse base~2^{tree})
> +	TREE:left/:$(git rev-parse base:left)
> +	TREE:left/:$(git rev-parse base~2:left)
> +	TREE:right/:$(git rev-parse topic:right)
> +	TREE:right/:$(git rev-parse base~1:right)
> +	TREE:right/:$(git rev-parse base~2:right)
> +	trees:9
> +	BLOB:a:$(git rev-parse base~2:a)
> +	BLOB:left/b:$(git rev-parse base~2:left/b)
> +	BLOB:left/b:$(git rev-parse base:left/b)
> +	BLOB:right/c:$(git rev-parse base~2:right/c)
> +	BLOB:right/c:$(git rev-parse topic:right/c)
> +	BLOB:right/d:$(git rev-parse base~1:right/d)
> +	blobs:6
> +	EOF
> +
> +	test_cmp_sorted expect out
> +'

Isn't the order deterministic? Why do we need to sort it?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

I haven't looked thoroughly at the rest of the patches yet, but had a
comment about this test. Rearranging:

"Derrick Stolee via GitGitGadget" <[email protected]> writes:
> +test_expect_success 'all' '
> +	test-tool path-walk -- --all >out &&
> +
> +	cat >expect <<-EOF &&
> +	TREE::$(git rev-parse topic^{tree})
> +	TREE::$(git rev-parse base^{tree})
> +	TREE::$(git rev-parse base~1^{tree})
> +	TREE::$(git rev-parse base~2^{tree})
> +	TREE:left/:$(git rev-parse base:left)
> +	TREE:left/:$(git rev-parse base~2:left)
> +	TREE:right/:$(git rev-parse topic:right)
> +	TREE:right/:$(git rev-parse base~1:right)
> +	TREE:right/:$(git rev-parse base~2:right)
> +	trees:9

[snip rest of "expect"]

The way you're testing this, wouldn't the tests pass even if the OIDs
aren't emitted in path order? (E.g. if topic:right and base~1:right
were somehow grouped into two different groups, even though they have
the same path.)

I would have expected the test output to be something like:

  TREE:right/ $(rp :right topic base~1 base~2)

where rp is a function that takes in a suffix and one or more prefixes -
I haven't figured out its contents yet, but

  echo $(git rev-parse HEAD^^ HEAD^ HEAD | sort)

gives us a space-separated list, so it doesn't seem too difficult to
define such a function.

> +static int emit_block(const char *path, struct oid_array *oids,
> +		      enum object_type type, void *data)
> +{
> +	struct path_walk_test_data *tdata = data;
> +	const char *typestr;
> +
> +	switch (type) {
> +	case OBJ_TREE:
> +		typestr = "TREE";
> +		tdata->tree_nr += oids->nr;
> +		break;
> +
> +	case OBJ_BLOB:
> +		typestr = "BLOB";
> +		tdata->blob_nr += oids->nr;
> +		break;
> +
> +	default:
> +		BUG("we do not understand this type");
> +	}
> +
> +	for (size_t i = 0; i < oids->nr; i++)
> +		printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));

Then here, you would print typestr and path before the "for" loop. In
the "for" loop you would add oid_to_hex() results to a sorted string
list, have another "for" loop that prints each element preceded by a
space, then print a "\n" after both "for" loops.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/1/24 6:23 PM, Jonathan Tan wrote:
> I haven't looked thoroughly at the rest of the patches yet, but had a
> comment about this test. Rearranging:
> > "Derrick Stolee via GitGitGadget" <[email protected]> writes:
>> +test_expect_success 'all' '
>> +	test-tool path-walk -- --all >out &&
>> +
>> +	cat >expect <<-EOF &&
>> +	TREE::$(git rev-parse topic^{tree})
>> +	TREE::$(git rev-parse base^{tree})
>> +	TREE::$(git rev-parse base~1^{tree})
>> +	TREE::$(git rev-parse base~2^{tree})
>> +	TREE:left/:$(git rev-parse base:left)
>> +	TREE:left/:$(git rev-parse base~2:left)
>> +	TREE:right/:$(git rev-parse topic:right)
>> +	TREE:right/:$(git rev-parse base~1:right)
>> +	TREE:right/:$(git rev-parse base~2:right)
>> +	trees:9
> > [snip rest of "expect"]
> > The way you're testing this, wouldn't the tests pass even if the OIDs
> aren't emitted in path order? (E.g. if topic:right and base~1:right
> were somehow grouped into two different groups, even though they have
> the same path.)

You are correct that if the path-walk API emitted multiple batches
with the same path name, then we would not detect that via the current
testing strategy.

The main reason to use the sort is to avoid adding a restriction on
the order in which objects appear within the batch.

Your recommendation to group a batch into a single line does not
strike me as a suitable approach, because long lines become hard to
read and difficult to parse diffs. (Also, the order within the batch
becomes baked in as a requirement.)

The biggest question I'd like to ask is this: do you see a risk of
a path being repeated? There are cases where it will happen, such as
indexed objects that are not reachable anywhere else.

The way I would consider modifying these tests to reflect the batching
would be to associate each batch with a number, causing the order of
the paths to become hard-coded in the test. Something like

  0:COMMIT::$(git rev-parse ...)
  0:COMMIT::$(git rev-parse ...)
  1:TREE::$(git rev-parse ...)
  1:TREE::$(git rev-parse ...)
  2:TREE:right/:$(git rev-parse ...)
  3:BLOB:right/a:$(...)
  4:TREE:left/:$(git rev-parse ...)
  5:BLOB:left/b:$(...)

This would imply some amount of order that maybe should become a
requirement of the API.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

Derrick Stolee <[email protected]> writes:
> You are correct that if the path-walk API emitted multiple batches
> with the same path name, then we would not detect that via the current
> testing strategy.
> 
> The main reason to use the sort is to avoid adding a restriction on
> the order in which objects appear within the batch.
> 
> Your recommendation to group a batch into a single line does not
> strike me as a suitable approach, because long lines become hard to
> read and difficult to parse diffs. (Also, the order within the batch
> becomes baked in as a requirement.)

The hashes in a line can be abbreviated if line length is a concern.
Also, note that I am suggesting sorting the OIDs within a line (that is,
a batch), and also sorting the lines (batches) as a whole.

> The biggest question I'd like to ask is this: do you see a risk of
> a path being repeated? There are cases where it will happen, such as
> indexed objects that are not reachable anywhere else.

I was thinking that the whole point of this feature is that we group
objects by path, so it seems desirable to test that paths are not
repeated. (Or repeated as little as possible, if it is not possible
to avoid repetition e.g. in the case you describe.)

> The way I would consider modifying these tests to reflect the batching
> would be to associate each batch with a number, causing the order of
> the paths to become hard-coded in the test. Something like
> 
>    0:COMMIT::$(git rev-parse ...)
>    0:COMMIT::$(git rev-parse ...)
>    1:TREE::$(git rev-parse ...)
>    1:TREE::$(git rev-parse ...)
>    2:TREE:right/:$(git rev-parse ...)
>    3:BLOB:right/a:$(...)
>    4:TREE:left/:$(git rev-parse ...)
>    5:BLOB:left/b:$(...)
> 
> This would imply some amount of order that maybe should become a
> requirement of the API.
> 
> Thanks,
> -Stolee

If we're willing to declare an order in which we will return paths to
the user, that would work too. (I'm not sure that we need to declare an
order, though.)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/4/24 6:39 PM, Jonathan Tan wrote:
> Derrick Stolee <[email protected]> writes:
> >> The biggest question I'd like to ask is this: do you see a risk of
>> a path being repeated? There are cases where it will happen, such as
>> indexed objects that are not reachable anywhere else.
> > I was thinking that the whole point of this feature is that we group
> objects by path, so it seems desirable to test that paths are not
> repeated. (Or repeated as little as possible, if it is not possible
> to avoid repetition e.g. in the case you describe.)
In addition to determining the order of the batches, it can be
helpful to demonstrate that we don't call the path_fn with an
empty batch! I discovered this while making the appropriate
changes today and putting the fixes in the right places.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Patrick Steinhardt wrote (reply to this):

On Thu, Oct 31, 2024 at 06:27:00AM +0000, Derrick Stolee via GitGitGadget wrote:
[snip]
> +int cmd__path_walk(int argc, const char **argv)
> +{
> +	int res;
> +	struct rev_info revs = REV_INFO_INIT;
> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
> +	struct path_walk_test_data data = { 0 };
> +	struct option options[] = {
> +		OPT_END(),
> +	};
> +
> +	initialize_repository(the_repository);
> +	setup_git_directory();
> +	revs.repo = the_repository;
> +
> +	argc = parse_options(argc, argv, NULL,
> +			     options, path_walk_usage,
> +			     PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
> +
> +	if (argc > 1)
> +		setup_revisions(argc, argv, &revs, NULL);
> +	else
> +		usage(path_walk_usage[0]);
> +
> +	info.revs = &revs;
> +	info.path_fn = emit_block;
> +	info.path_fn_data = &data;
> +
> +	res = walk_objects_by_path(&info);
> +
> +	printf("trees:%" PRIuMAX "\n"
> +	       "blobs:%" PRIuMAX "\n",
> +	       data.tree_nr, data.blob_nr);
> +
> +	return res;
> +}

This function is leaking memory. I'd propose to add below patch on top
to plug them, which makes t6601 pass with the leak sanitizer enabled.

Patrick

diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 06b103d876..fa3bfe46b5 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -85,7 +85,6 @@ int cmd__path_walk(int argc, const char **argv)
 		OPT_END(),
 	};
 
-	initialize_repository(the_repository);
 	setup_git_directory();
 	revs.repo = the_repository;
 
@@ -110,5 +109,6 @@ int cmd__path_walk(int argc, const char **argv)
 	       "tags:%" PRIuMAX "\n",
 	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
 
+	release_revisions(&revs);
 	return res;
 }

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 11/6/24 9:04 AM, Patrick Steinhardt wrote:
> On Thu, Oct 31, 2024 at 06:27:00AM +0000, Derrick Stolee via GitGitGadget wrote:
> [snip]

> This function is leaking memory. I'd propose to add below patch on top
> to plug them, which makes t6601 pass with the leak sanitizer enabled.
Thanks! Applied for the next version.

-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Nov 09, 2024 at 07:41:09PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
>
> Add some tests based on the current behavior, doing interesting checks
> for different sets of branches, ranges, and the --boundary option. This
> sets a baseline for the behavior and we can extend it as new options are
> introduced.
>
> Store and output a 'batch_nr' value so we can demonstrate that the paths are
> grouped together in a batch and not following some other ordering. This
> allows us to test the depth-first behavior of the path-walk API. However, we
> purposefully do not test the order of the objects in the batch, so the
> output is compared to the expected output through a sort.
>
> It is important to mention that the behavior of the API will change soon as
> we start to handle UNINTERESTING objects differently, but these tests will
> demonstrate the change in behavior.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---

Nice. I like the approach of implementing the API in a single commit,
and then demonstrating a trivial "caller" by way of a custom test
helper. I think that the artifact of having a test helper here is useful
on its own, but it also serves as a good example of how to use the API,
and provides something to actually test the implementation with.

I'm going to steal this pattern the next time I need to work on
something that necessitates a complex new API ;-).

> +static int emit_block(const char *path, struct oid_array *oids,
> +		      enum object_type type, void *data)
> +{
> +	struct path_walk_test_data *tdata = data;
> +	const char *typestr;
> +
> +	switch (type) {
> +	case OBJ_TREE:
> +		typestr = "TREE";
> +		tdata->tree_nr += oids->nr;
> +		break;
> +
> +	case OBJ_BLOB:
> +		typestr = "BLOB";
> +		tdata->blob_nr += oids->nr;
> +		break;
> +	default:
> +		BUG("we do not understand this type");
> +	}

I think you could write this as:

    if (type == OBJ_TREE)
        tdata->tree_nr += oids->nr;
    else if (type == OBJ_BLOB)
        tdata->blob_nr += oids->nr;
    else
        BUG("we do not understand this type");

    typestr = type_name(type);

Which DRYs things up a bit and uses the type_name() helper. That will
give you strings like "tree" and "blob" instead of "TREE" and "BLOB",
but I'm not sure if the casing is important.

Thanks,
Taylor

--------

See example usages in:
`t/helper/test-path-walk.c`
2 changes: 2 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
TEST_BUILTINS_OBJS += test-path-walk.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
Expand Down Expand Up @@ -1094,6 +1095,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
Expand Down
Loading
Loading