-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AlpinePackages pipeline #272
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you ++ for this! I made a few comments for your review.
Also we would need some automated tests for sure.
Can we setup a time to discuss all these in a live session?
scanpipe/pipes/alpine.py
Outdated
|
||
def prepare_scan_dir(package_name, scan_target_path, aports_dir_path=None): | ||
""" | ||
Find package's aports path and if found execute the following steps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is not found shouldn't it be indicated somehow to provide input for further investigation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the question i forgot to ask in the PR message. @pombredanne How to handle logging and error handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you think it should be handled? Simple stdout logging?
Yeah, it would be great. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the quick replies!
I added some extra comments for your review.
apkbuild_dir = aports_dir_path / APORTS_DIR_NAME / subdir_name / package_name | ||
if not apkbuild_dir.exists(): | ||
continue | ||
copytree(apkbuild_dir, scan_target_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so if I understand correctly you are:
- making a copy of the aports directory of a given package (which would typically include the APKBUILD and some patches)
- in this copied directory, you will also fetch the sources (or at least only the remote sources as identified by a URL)
- finally you will (extract then run a scan of sorts on this directory? )
I think it could be better if you separate each operation and the process could benefit from more documentation.
A pipeline that complements missing package data. Downloads aports repository and all its necessary branches (alpine versions) then iterates over all alpine packages associated with the pipeline's project. For each package it copies additional files from the aports repository into scan target directory then downloads and extract all the source archives, performs a scan and saves it's output to package's database entry. Signed-off-by: Mateusz Perc <[email protected]>
@pombredanne @aalexanderr I cannot decide on how to test the entire pipeline. It looks like alpine pipe tests which i commited already test the majority of the AlpinePackages pipeline. Integration tests are the only viable option in my opinion but they are slow and problematic (due to using fetchcode etc.). Either way they won't be big, so tell me what you think and i will be quick to write them. |
Added new tests for functions: -download_or_checkout_aports -get_unscanned_packages_from_db -prepare_scan_dir -extract_summary_fields Signed-off-by: Mateusz Perc <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quepop Integration tests are the only viable option in my opinion but they are slow and problematic (due to using fetchcode etc.)
What about mocking the fetchcode calls?
Iterate over every alpine version associated with this project. | ||
Download corresponding aports repository branches (alpine versions). | ||
""" | ||
self.aports_dir_path = self.project.tmp_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the self.aports_dir_path
variable really needed?
Download corresponding aports repository branches (alpine versions). | ||
""" | ||
self.aports_dir_path = self.project.tmp_path | ||
for image_id, alpine_version in self.alpine_versions.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since image_id
is not used, I would suggest:
for alpine_version in self.alpine_versions.values()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using items()
was @pombredanne 's commit suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quepop values() would be better since you do not use the image_id
variable.
aports_dir_path=self.project.tmp_path, alpine_version=alpine_version | ||
) | ||
|
||
def complement_missing_package_data(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following code should be made more digest and readable.
) in get_unscanned_packages_from_db( | ||
project=self.project, alpine_versions=self.alpine_versions | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, when the name of the keyword argument and the provided variable is the same, it's explicit enough to only keep the variable.
For example:
get_unscanned_packages_from_db(project=self.project, alpine_versions=self.alpine_versions)
I think the following is as explicit and more readable:
get_unscanned_packages_from_db(self.project, self.alpine_versions)
It make sense to keep the keyword agrs in the following example though:
run_scancode(
location=str(scan_target_path),
output_file=str(scan_result_path),
options=self.scancode_options,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used unnamed positional arguments before and @pombredanne commented that i should use named positionals everywhere.
Here and in general, do you mind to use named keyword arguments rather than un-named positional arguments? This makes reading much easier and is more resistant to refactorings that adds or reorders arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with the "makes reading much easier" in the cases mentioned above but "more resistant to refactorings" may be a fair point.
You can leave it as-is then ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could store that call in a unscanned_packages
to help with the for loop layout.
package_name=package.name, scan_target_path=scan_target_path | ||
): | ||
continue | ||
run_extractcode(location=str(scan_target_path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This run_extractcode
function does not exists in the main branch anymore since b035f00. You need to migrate to the new scancode.extract_archives
API.
See https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/scan_codebase.py#L73 for an example.
): | ||
continue | ||
run_extractcode(location=str(scan_target_path)) | ||
run_scancode( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to call directly the ScanCode scancode.api.get_copyrights
function instead of starting a full scancode subprocess.
This will be more efficient and will remove the need for extract_summary_fields
.
I saved a clone of this repo and branches so we can revisit this when we have this but this will happen in PurlDB using package sets rather than in ScanCode.io proper. We have implemented this for some package types already, and this is where the feature would be best homed. |
See aboutcode-org/purldb#307 for the follow up. |
Fixes: aboutcode-org/purldb#307
Depends on: aboutcode-org/fetchcode#54, aboutcode-org/fetchcode#56, aboutcode-org/scancode-toolkit#2598
When developing I ran into some issues that I couldn't fix on my own so I've decided to list them here and mark this PR as a draft:
fetch_via_git
from Add providing location for fetch_via_{vcs,git} fetchcode#54 doesn't provide any successful/failed checkout feedback which is essential to this PR.complement_missing_packages_data
. It feels too big for a pipeline function (compared to other pipelines). I tried to be as clean as possible but at this point I'm out of ideas.--summary
scancode option which is very convenient but it also prunes any year from the original copyright message.Signed-off-by: Mateusz Perc [email protected]