Add AlpinePackages pipeline #272

quepop · 2021-08-05T04:21:30Z

Fixes: aboutcode-org/purldb#307
Depends on: aboutcode-org/fetchcode#54, aboutcode-org/fetchcode#56, aboutcode-org/scancode-toolkit#2598

When developing I ran into some issues that I couldn't fix on my own so I've decided to list them here and mark this PR as a draft:

- The function fetch_via_git from aboutcode-org/fetchcode#54 ("Add providing location for fetch_via_{vcs,git}") doesn't provide any successful/failed checkout feedback, which is essential to this PR.
- I'm not confident about how I split the entire commit into functions, especially complement_missing_packages_data. It feels too big for a pipeline function (compared to other pipelines). I tried to keep it as clean as possible but at this point I'm out of ideas.
- I'm not sure about file headers/copyrights.
- Do package copyrights require a year (in a legal sense)? This commit uses the scancode --summary option, which is very convenient, but it also prunes any year from the original copyright message.
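A minimal sketch of the kind of checkout feedback the first point asks for, assuming a plain `git` subprocess call rather than fetchcode itself. `CheckoutResult` and `checkout_branch` are hypothetical names used only for illustration; they are not part of fetchcode or this PR.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckoutResult:
    """Outcome of a checkout attempt (hypothetical type, not part of fetchcode)."""

    success: bool
    message: str


def checkout_branch(repo_dir, branch, run=subprocess.run):
    """
    Check out `branch` in the git clone at `repo_dir` and report the outcome.
    `run` is injectable so the function can be exercised without git installed.
    """
    completed = run(
        ["git", "-C", str(repo_dir), "checkout", branch],
        capture_output=True,
        text=True,
    )
    if completed.returncode == 0:
        return CheckoutResult(success=True, message=completed.stdout.strip())
    # git writes the reason for a failed checkout to stderr.
    return CheckoutResult(success=False, message=completed.stderr.strip())
```

With a result object like this, a caller could skip or report a package whose Alpine branch failed to check out instead of continuing silently.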

Signed-off-by: Mateusz Perc [email protected]

pombredanne

Thank you ++ for this! I made a few comments for your review.
Also we would need some automated tests for sure.
Can we set up a time to discuss all these in a live session?

scanpipe/pipelines/alpine_packages.py

scanpipe/pipes/alpine.py

aalexanderr · 2021-08-09T15:46:24Z

scanpipe/pipes/alpine.py

+
+def prepare_scan_dir(package_name, scan_target_path, aports_dir_path=None):
+    """
+    Find package's aports path and if found execute the following steps:


If it is not found, shouldn't that be indicated somehow to provide input for further investigation?

That was the question I forgot to ask in the PR message. @pombredanne How should logging and error handling be done?

How do you think it should be handled? Simple stdout logging?
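One possible answer, sketched with the standard `logging` module instead of plain stdout. `find_apkbuild_dir` is a hypothetical helper, and the directory layout it searches (`<aports_dir_path>/<subdir>/<package_name>`) is a simplified assumption about the path built in the diff.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def find_apkbuild_dir(aports_dir_path, package_name, subdir_names):
    """
    Return the first existing `<aports_dir_path>/<subdir>/<package_name>`
    directory, or None (after logging a warning) when no candidate exists.
    """
    for subdir_name in subdir_names:
        candidate = Path(aports_dir_path) / subdir_name / package_name
        if candidate.is_dir():
            return candidate
    # Leave a trace for later investigation instead of skipping silently.
    logger.warning(
        "No aports directory found for package %r under %s (searched: %s)",
        package_name,
        aports_dir_path,
        ", ".join(subdir_names),
    )
    return None
```

Returning None keeps the caller in control of whether a missing package is fatal, while the warning records which package was skipped and where it was looked for.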

scanpipe/pipes/alpine.py

quepop · 2021-08-10T09:44:31Z

"Can we set up a time to discuss all these in a live session?"

Yeah, it would be great.

pombredanne

Thank you for the quick replies!
I added some extra comments for your review.

scanpipe/pipes/alpine.py

scanpipe/pipes/docker.py

pombredanne · 2021-08-10T10:09:03Z

scanpipe/pipes/alpine.py

+        apkbuild_dir = aports_dir_path / APORTS_DIR_NAME / subdir_name / package_name
+        if not apkbuild_dir.exists():
+            continue
+        copytree(apkbuild_dir, scan_target_path)


OK, so if I understand correctly you are:

1. making a copy of the aports directory of a given package (which would typically include the APKBUILD and some patches)
2. in this copied directory, also fetching the sources (or at least only the remote sources as identified by a URL)
3. finally (extracting, then running a scan of sorts on this directory?)

I think it could be better if you separated each operation, and the process could benefit from more documentation.
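The separation suggested above could look roughly like the sketch below. All function names are hypothetical, and `download`, `extract`, and `scan` are caller-supplied callables standing in for the real fetch, extract, and scan steps, whose actual APIs are not shown here.

```python
import shutil
from pathlib import Path


def copy_aports_files(apkbuild_dir, scan_target_path):
    """Step 1: copy the package's aports directory (APKBUILD, patches)."""
    # copytree creates scan_target_path; it must not already exist.
    shutil.copytree(apkbuild_dir, scan_target_path)
    return Path(scan_target_path)


def fetch_remote_sources(source_urls, scan_target_path, download):
    """Step 2: download every remote source into the scan directory."""
    return [download(url, scan_target_path) for url in source_urls]


def prepare_and_scan(
    apkbuild_dir, scan_target_path, source_urls, download, extract, scan
):
    """Run the steps in order; each one can be documented and tested alone."""
    target = copy_aports_files(apkbuild_dir, scan_target_path)
    fetch_remote_sources(source_urls, target, download)
    extract(target)  # Step 3a: unpack any downloaded archives.
    return scan(target)  # Step 3b: scan the prepared directory.
```

Passing the network-bound and scan steps in as callables is only one design option; it makes each step replaceable in tests without touching the others.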

scanpipe/pipes/alpine.py

A pipeline that complements missing package data.

It downloads the aports repository and all its necessary branches (Alpine versions), then iterates over all Alpine packages associated with the pipeline's project. For each package it copies additional files from the aports repository into the scan target directory, downloads and extracts all the source archives, performs a scan, and saves its output to the package's database entry.

Signed-off-by: Mateusz Perc <[email protected]>

quepop · 2021-09-20T06:47:23Z

@pombredanne @aalexanderr I cannot decide how to test the entire pipeline. It looks like the alpine pipe tests which I already committed cover the majority of the AlpinePackages pipeline. Integration tests are the only viable option in my opinion, but they are slow and problematic (due to using fetchcode etc.). Either way they won't be big, so tell me what you think and I will be quick to write them.

Added new tests for functions:
- download_or_checkout_aports
- get_unscanned_packages_from_db
- prepare_scan_dir
- extract_summary_fields

Signed-off-by: Mateusz Perc <[email protected]>

tdruez

@quepop "Integration tests are the only viable option in my opinion but they are slow and problematic (due to using fetchcode etc.)"

What about mocking the fetchcode calls?
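A minimal sketch of what mocking the network-bound call could look like with `unittest.mock`. The `fetcher` namespace and `download_sources` function are hypothetical stand-ins defined here for illustration; in the real tests the patch target would be wherever the pipe module looks up its fetchcode function.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module that wraps the network-bound fetch call.
fetcher = types.SimpleNamespace()


def _real_fetch(url):
    raise RuntimeError("network access is not available in unit tests")


fetcher.fetch = _real_fetch


def download_sources(urls):
    """Fetch every URL through `fetcher.fetch` and return the local paths."""
    return [fetcher.fetch(url) for url in urls]


def test_download_sources_without_network():
    # Replace the fetch call for the duration of the test only.
    with mock.patch.object(fetcher, "fetch", return_value="/tmp/fake.tar.gz") as fake:
        paths = download_sources(["https://example.com/a.tar.gz"])
    assert paths == ["/tmp/fake.tar.gz"]
    fake.assert_called_once_with("https://example.com/a.tar.gz")
```

The patch is undone when the `with` block exits, so other tests still see the original function.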

tdruez · 2021-09-23T06:35:36Z

scanpipe/pipelines/alpine_packages.py

+        Iterate over every alpine version associated with this project.
+        Download corresponding aports repository branches (alpine versions).
+        """
+        self.aports_dir_path = self.project.tmp_path


Is the self.aports_dir_path variable really needed?

tdruez · 2021-09-23T06:37:44Z

scanpipe/pipelines/alpine_packages.py

+        Download corresponding aports repository branches (alpine versions).
+        """
+        self.aports_dir_path = self.project.tmp_path
+        for image_id, alpine_version in self.alpine_versions.items():


Since image_id is not used, I would suggest:
for alpine_version in self.alpine_versions.values():

Using items() was @pombredanne's commit suggestion.

@quepop values() would be better since you do not use the image_id variable.
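For illustration, the difference on a small dictionary. The mapping below is a made-up example of image id to Alpine version; the real contents of `self.alpine_versions` are not shown in this thread.

```python
# Hypothetical mapping of image id -> alpine version, mirroring self.alpine_versions.
alpine_versions = {"image-1": "v3.13", "image-2": "v3.14", "image-3": "v3.13"}

# items() would force an unused loop variable; values() yields only what is needed.
versions_in_order = [alpine_version for alpine_version in alpine_versions.values()]

# Assumption: if several images share a version, a set avoids repeated downloads.
unique_versions = sorted(set(alpine_versions.values()))
```

Whether deduplication is needed depends on how the download function handles a branch it has already fetched.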

tdruez · 2021-09-23T06:39:47Z

scanpipe/pipelines/alpine_packages.py

+                aports_dir_path=self.project.tmp_path, alpine_version=alpine_version
+            )
+
+    def complement_missing_package_data(self):


The following code should be made more digestible and readable.

tdruez · 2021-09-23T06:42:47Z

scanpipe/pipelines/alpine_packages.py

+        ) in get_unscanned_packages_from_db(
+            project=self.project, alpine_versions=self.alpine_versions
+        ):


In general, when the name of the keyword argument and the name of the provided variable are the same, it's explicit enough to pass only the variable.

For example:

get_unscanned_packages_from_db(project=self.project, alpine_versions=self.alpine_versions)

I think the following is as explicit and more readable:

get_unscanned_packages_from_db(self.project, self.alpine_versions)

It makes sense to keep the keyword args in the following example though:

run_scancode(
    location=str(scan_target_path),
    output_file=str(scan_result_path),
    options=self.scancode_options,
)

I used unnamed positional arguments before and @pombredanne commented that I should use named arguments everywhere:

"Here and in general, would you mind using named keyword arguments rather than un-named positional arguments? This makes reading much easier and is more resistant to refactorings that add or reorder arguments."

I disagree with the "makes reading much easier" in the cases mentioned above but "more resistant to refactorings" may be a fair point.
You can leave it as-is then ;)

You could store the result of that call in an unscanned_packages variable to help with the for loop layout.
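A small sketch of both points from this thread: keyword arguments surviving a parameter reordering, and storing the call result before the loop. The two functions are hypothetical stand-ins for get_unscanned_packages_from_db, not its real implementation.

```python
def get_unscanned(project, alpine_versions):
    """Hypothetical stand-in for get_unscanned_packages_from_db."""
    return [(project, version) for version in alpine_versions.values()]


def get_unscanned_refactored(alpine_versions, project):
    """The same function after a refactoring that swapped the parameter order."""
    return [(project, version) for version in alpine_versions.values()]


project = "demo"
alpine_versions = {"image-1": "v3.14"}

# Keyword calls give the same result before and after the reordering;
# a positional call to the refactored function would pass the arguments swapped.
before = get_unscanned(project=project, alpine_versions=alpine_versions)
after = get_unscanned_refactored(project=project, alpine_versions=alpine_versions)

# Storing the result first keeps the for-loop header on a single line.
unscanned_packages = get_unscanned(project=project, alpine_versions=alpine_versions)
for project_name, alpine_version in unscanned_packages:
    print(project_name, alpine_version)
```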

tdruez · 2021-09-23T06:46:17Z

scanpipe/pipelines/alpine_packages.py

+                package_name=package.name, scan_target_path=scan_target_path
+            ):
+                continue
+            run_extractcode(location=str(scan_target_path))


This run_extractcode function no longer exists in the main branch since b035f00. You need to migrate to the new scancode.extract_archives API.
See https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/scan_codebase.py#L73 for an example.

tdruez · 2021-09-23T06:53:07Z

scanpipe/pipelines/alpine_packages.py

+            ):
+                continue
+            run_extractcode(location=str(scan_target_path))
+            run_scancode(


I would suggest calling the ScanCode scancode.api.get_copyrights function directly instead of starting a full scancode subprocess.
This will be more efficient and will remove the need for extract_summary_fields.

pombredanne · 2024-02-26T13:38:32Z

I saved a clone of this repo and its branches so we can revisit this, but this will happen in PurlDB using package sets rather than in ScanCode.io proper. We have already implemented this for some package types, and PurlDB is where the feature would be best homed.

pombredanne · 2024-02-26T13:41:05Z

See aboutcode-org/purldb#307 for the follow up.

tdruez requested a review from pombredanne August 6, 2021 07:09

pombredanne requested changes Aug 9, 2021

aalexanderr reviewed Aug 9, 2021

pombredanne requested changes Aug 10, 2021

pombredanne mentioned this pull request Aug 10, 2021

Create new pipeline to fetch dependency provenance data #284

Open

quepop marked this pull request as draft August 11, 2021 22:19

quepop force-pushed the issue-191 branch from 26d89a6 to c8aee9d Compare August 31, 2021 02:50

quepop force-pushed the issue-191 branch from c8aee9d to be62961 Compare September 2, 2021 19:29

quepop marked this pull request as ready for review September 20, 2021 06:47

Added test for new alpine pipe functions

ee6d8a2

quepop force-pushed the issue-191 branch from 998e345 to ee6d8a2 Compare September 22, 2021 19:39

tdruez requested changes Sep 23, 2021

pombredanne marked this pull request as draft February 26, 2024 13:34

pombredanne closed this Feb 26, 2024

pombredanne mentioned this pull request Feb 26, 2024

Enhance Alpine package scan results aboutcode-org/purldb#307

Open

pombredanne mentioned this pull request Apr 8, 2024

PURLDB: Collect and return Alpine package metadata on demand for a PURL aboutcode-org/purldb#380

Open

1 task
