Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AlpinePackages pipeline #272

Closed
wants to merge 2 commits into from
Closed

Conversation

quepop
Copy link

@quepop quepop commented Aug 5, 2021

Fixes: aboutcode-org/purldb#307
Depends on: aboutcode-org/fetchcode#54, aboutcode-org/fetchcode#56, aboutcode-org/scancode-toolkit#2598

When developing I ran into some issues that I couldn't fix on my own so I've decided to list them here and mark this PR as a draft:

  • The function fetch_via_git from Add providing location for fetch_via_{vcs,git} fetchcode#54 doesn't provide any successful/failed checkout feedback which is essential to this PR.
  • I'm not confident with how I split the entire commit into functions. Especially complement_missing_packages_data. It feels too big for a pipeline function (compared to other pipelines). I tried to be as clean as possible but at this point I'm out of ideas.
  • I'm not sure about file headers/copyrights.
  • Do package copyrights require a year (in a legal sense)? This commit uses --summary scancode option which is very convenient but it also prunes any year from the original copyright message.

Signed-off-by: Mateusz Perc [email protected]

@tdruez tdruez requested a review from pombredanne August 6, 2021 07:09
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ++ for this! I made a few comments for your review.
Also we would need some automated tests for sure.
Can we setup a time to discuss all these in a live session?

scanpipe/pipelines/alpine_packages.py Show resolved Hide resolved
scanpipe/pipelines/alpine_packages.py Outdated Show resolved Hide resolved
scanpipe/pipelines/alpine_packages.py Outdated Show resolved Hide resolved
scanpipe/pipelines/alpine_packages.py Outdated Show resolved Hide resolved
scanpipe/pipelines/alpine_packages.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipelines/alpine_packages.py Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved

def prepare_scan_dir(package_name, scan_target_path, aports_dir_path=None):
"""
Find package's aports path and if found execute the following steps:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is not found shouldn't it be indicated somehow to provide input for further investigation?

Copy link
Author

@quepop quepop Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the question i forgot to ask in the PR message. @pombredanne How to handle logging and error handling?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think it should be handled? Simple stdout logging?

scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
@quepop
Copy link
Author

quepop commented Aug 10, 2021

Can we setup a time to discuss all these in a live session?

Yeah, it would be great.

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick replies!
I added some extra comments for your review.

scanpipe/pipes/alpine.py Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/docker.py Outdated Show resolved Hide resolved
apkbuild_dir = aports_dir_path / APORTS_DIR_NAME / subdir_name / package_name
if not apkbuild_dir.exists():
continue
copytree(apkbuild_dir, scan_target_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so if I understand correctly you are:

  1. making a copy of the aports directory of a given package (which would typically include the APKBUILD and some patches)
  2. in this copied directory, you will also fetch the sources (or at least only the remote sources as identified by a URL)
  3. finally you will (extract then run a scan of sorts on this directory? )

I think it could be better if you separate each operation and the process could benefit from more documentation.

scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
scanpipe/pipes/alpine.py Outdated Show resolved Hide resolved
A pipeline that complements missing package data. Downloads aports
repository and all its necessary branches (alpine versions) then
iterates over all alpine packages associated with the pipeline's
project. For each package it copies additional files from the aports
repository into scan target directory then downloads and extract all
the source archives, performs a scan and saves it's output to package's
database entry.

Signed-off-by: Mateusz Perc <[email protected]>
@quepop
Copy link
Author

quepop commented Sep 20, 2021

@pombredanne @aalexanderr I cannot decide on how to test the entire pipeline. It looks like alpine pipe tests which i commited already test the majority of the AlpinePackages pipeline. Integration tests are the only viable option in my opinion but they are slow and problematic (due to using fetchcode etc.). Either way they won't be big, so tell me what you think and i will be quick to write them.

@quepop quepop marked this pull request as ready for review September 20, 2021 06:47
Added new tests for functions:
 -download_or_checkout_aports
 -get_unscanned_packages_from_db
 -prepare_scan_dir
 -extract_summary_fields

Signed-off-by: Mateusz Perc <[email protected]>
Copy link
Contributor

@tdruez tdruez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quepop Integration tests are the only viable option in my opinion but they are slow and problematic (due to using fetchcode etc.)

What about mocking the fetchcode calls?

Iterate over every alpine version associated with this project.
Download corresponding aports repository branches (alpine versions).
"""
self.aports_dir_path = self.project.tmp_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the self.aports_dir_path variable really needed?

Download corresponding aports repository branches (alpine versions).
"""
self.aports_dir_path = self.project.tmp_path
for image_id, alpine_version in self.alpine_versions.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since image_id is not used, I would suggest:
for alpine_version in self.alpine_versions.values()

Copy link
Author

@quepop quepop Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using items() was @pombredanne 's commit suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quepop values() would be better since you do not use the image_id variable.

aports_dir_path=self.project.tmp_path, alpine_version=alpine_version
)

def complement_missing_package_data(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following code should be made more digest and readable.

Comment on lines +85 to +87
) in get_unscanned_packages_from_db(
project=self.project, alpine_versions=self.alpine_versions
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, when the name of the keyword argument and the provided variable is the same, it's explicit enough to only keep the variable.

For example:

get_unscanned_packages_from_db(project=self.project, alpine_versions=self.alpine_versions)

I think the following is as explicit and more readable:

get_unscanned_packages_from_db(self.project, self.alpine_versions)

It make sense to keep the keyword agrs in the following example though:

run_scancode(
    location=str(scan_target_path),
    output_file=str(scan_result_path),
    options=self.scancode_options,
)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used unnamed positional arguments before and @pombredanne commented that i should use named positionals everywhere.

Here and in general, do you mind to use named keyword arguments rather than un-named positional arguments? This makes reading much easier and is more resistant to refactorings that adds or reorders arguments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with the "makes reading much easier" in the cases mentioned above but "more resistant to refactorings" may be a fair point.
You can leave it as-is then ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could store that call in a unscanned_packages to help with the for loop layout.

package_name=package.name, scan_target_path=scan_target_path
):
continue
run_extractcode(location=str(scan_target_path))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This run_extractcode function does not exists in the main branch anymore since b035f00. You need to migrate to the new scancode.extract_archives API.
See https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/scan_codebase.py#L73 for an example.

):
continue
run_extractcode(location=str(scan_target_path))
run_scancode(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to call directly the ScanCode scancode.api.get_copyrights function instead of starting a full scancode subprocess.
This will be more efficient and will remove the need for extract_summary_fields.

@pombredanne pombredanne marked this pull request as draft February 26, 2024 13:34
@pombredanne
Copy link
Member

I saved a clone of this repo and branches so we can revisit this when we have this but this will happen in PurlDB using package sets rather than in ScanCode.io proper. We have implemented this for some package types already, and this is where the feature would be best homed.

@pombredanne
Copy link
Member

See aboutcode-org/purldb#307 for the follow up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance Alpine package scan results
4 participants