GSOC 2023

Ayan Sinha Mahapatra edited this page Mar 21, 2023 · 18 revisions

AboutCode will be applying as a GSoC mentoring org for 2023! See https://summerofcode.withgoogle.com/programs/2023 for more details about the program this year. Here is the complete timeline: https://developers.google.com/open-source/gsoc/timeline

TL;DR See our list of ideas: https://github.com/nexB/aboutcode/wiki/GSOC-2023#project-ideas-index

This page contains information for aspiring contributors interested in participating and helping with the GSoC 2023 program.

AboutCode: Scan code for origin, license and vulnerabilities

AboutCode is a family of FOSS projects to uncover data ... about software code:

  • where does the code come from? which software package?
  • what is its license? copyright?
  • is the code vulnerable, maintained, well coded?
  • what are its dependencies, and are there vulnerabilities/licensing issues?

All these are questions that are important to answer: there are millions of free and open source software components available on the web for reuse.

Knowing where a software package comes from, what its license is and whether it is vulnerable should be a problem of the past such that everyone can safely consume more free and open source software. We support not only open source software, but also open data, generated and curated by our applications.

Join us to make it so!

Our tools help detect and report the origin and license of source code, packages and binaries, discover software and package dependencies, and track security vulnerabilities, bugs and other important software package attributes. They also support creating SBOMs and other disclosure documents with this information and support leading standards like SPDX, CycloneDX and VEX. They are a suite of database-backed web and API servers, command line applications and desktop applications, often working together to create and provide data about software usability and health.

AboutCode projects are...

NOTE: If you are looking for the Project Ideas List instead of their parent Projects, see https://github.com/nexB/aboutcode/wiki/GSOC-2023#project-ideas-index

The AboutCode project repositories that are the main focus of GSoC 2023 are:

  • purlDB consists of tools to create and expose a database of purls (Package URLs) and also has package data for all of these packages created from scans.

  • VulnerableCode is a web-based API and database to collect and track all known software package vulnerabilities, with affected and fixed packages and references. It includes a standalone tool, Vulntotal, to compare this vulnerability information across similar tools.

  • ScanCode.io is a web-based application and API to run and review scans in rich scripted ScanPipe pipelines on different kinds of inputs (containers, Docker images, package archives, source packages, manifests, etc.) to get source, license and vulnerability information.

  • univers is a package to parse and compare all the package versions and all the ranges.

  • ScanCode Toolkit is a popular command line tool to scan code for licenses, copyrights and packages, used by many organizations and FOSS projects, small and large.

  • FetchCode is a library to reliably fetch any code via HTTP, FTP and version control systems such as git.

  • python-inspector and nuget inspector inspect manifests and code to resolve dependencies (vulnerable and non-vulnerable) for Python and NuGet packages respectively.

GSoC proposals for the repositories above will receive the most interest from AboutCode mentors.

There are many other aboutcode projects:

  • Scancode Workbench is a TypeScript, React based desktop application to visualize and review scan results for scancode-toolkit scans.

  • AboutCode Toolkit is a command line tool to document and inventory known packages and licenses and generate attribution docs, typically using the results of analyzed and reviewed scans.

  • TraceCode Toolkit is a command line tool to find which source code file is used to create a compiled binary by tracing and graphing a build.

  • DeltaCode is a command line tool to compare scans and determine if and where there are material differences that affect licensing.

  • license-expression is a library to parse, analyze, simplify and render boolean license expressions (such as SPDX license expressions).

  • container-inspector is a command line tool to analyze the code in Docker and container images.

We have also co-founded and/or are contributing to important projects for other organizations:

  • Package URL, an emerging standard to reference software packages of all types with simple, readable and concise URLs.

  • SPDX, aka Software Package Data Exchange, a spec to document the origin and licensing of packages.

  • CycloneDX, aka OWASP CycloneDX, a full-stack Bill of Materials (BOM) standard that provides advanced supply chain capabilities for cyber risk reduction.

  • ClearlyDefined to review and help FOSS projects improve their licensing and documentation clarity.

Contact

Join the chat online at our Element chatroom: aboutcode-org/discuss. Introduce yourself and start the discussion!

Please try asking questions the smart way: http://www.catb.org/~esr/faqs/smart-questions.html

For personal issues, you can contact the primary org admin directly: @pombredanne and [email protected]

Technology

Discovering the origin, license and security of code is a vast topic. We primarily use Python, with some C/C++, Rust and Go for performance-sensitive code. We use Django, PostgreSQL and JavaScript for web apps and API servers.

Our domain includes text analysis and processing (for instance for copyright and license detection), parsing (for package manifest formats), binary analysis (to detect the origin and license of binaries, primarily based on the corresponding source code), vulnerability data aggregation and processing, dependency resolution, mining and matching package data (from scanning and fetching metadata from package managers), and maintaining databases of packages (using PURL). These are realized through web-based tools and APIs (to expose the tools and libraries as web services), scripting in Python to automate workflows, and low-level data structures for efficient matching (such as high-performance string search automatons).

Skills

Incoming students will need the following skills:

  • Intermediate to strong Python programming. For some projects, familiarity with Django and PostgreSQL would be great.

  • Familiarity with git as a version control system. Take the time to learn git!

  • Ability to set up your own development environment

  • An interest in open source security, licensing and software composition analysis in general.

We are happy to help you get up to speed, and the more you are able to demonstrate ability and skills in advance, the more likely we are to choose your application!

About your project application

Make sure you read the GSoC student guide carefully. Also follow the writing a proposal guide.

We expect your application to be in the range of 1000 words. Anything less than that will probably not contain enough information for us to determine whether you are the right person for the job. Your proposal should contain at least the following information, plus anything you think is relevant:

Personal Information

We need this information to communicate with you during the project duration and for other communication related to GSoC.

  • Your name

  • Country/Timezone you are from (just for scheduling purposes)

  • Email and Gitter/Element username

  • Link to your GitHub profile

  • Mention the details of your academic studies, any previous work, internships

  • Relevant skills that will help you to achieve the goal (programming languages, frameworks)

  • Do you plan to have any other commitments during GSoC that may affect your work? Any vacations/holidays? Will you be available full time to work on your project? (Hint: do not bother applying if this is not a serious main time commitment during the GSoC time frame.) We also have weekly status meetings on Mondays, at the same time as the community call. Would you be able to attend them?

  • We will be following the 12 week standard coding period as the default for all our projects, unless unforeseen circumstances arise. Do you accept the standard coding period as the default?

Proposal Details

  • Title of your proposal

  • Abstract of your proposal

  • Project Size: medium (175 hours) or large (350 hours). This should match what we have listed on our project idea page (and be re-confirmed by the mentors on your proposal).

  • Link to the original project idea on the project ideas page (if applicable)

  • Detailed description of your idea, including an explanation of why it is innovative and what it will contribute to the project

    • Explain your data structures and your planned main processing flows in detail.
  • Mention the key deliverables of the project

  • Description of previous work on the same issue, existing solutions (links to prototypes, bibliography are more than welcome)

  • A complete timeline of your project, where the project is broken down into smaller tasks with their own deliverables/goals, by time. Please keep some buffer time at the end and consider that it will take some time to address feedback, write docs and other related work. We will help with this on your proposal.

Note that you have to submit a PDF of your proposal on the GSoC website, and you can keep updating it with a new proposal PDF until the deadline on April 4th at 18:00 UTC.

Your contributions

The best way to demonstrate your capability would be to submit a small patch ahead of the project selection for an existing issue or a new issue.
We will always consider and prefer a project submission where you have submitted a patch over any other submission without a patch.

Note that only useful code contributions demonstrate your ability to successfully complete the project you are proposing, and insignificant/documentation contributions will not support your proposal as much as quality contributions will. Also try to contribute to issues similar to your project idea, for more impact.

  • Any previous open-source projects (or even previous GSoC) you have contributed to and links.

  • Detailed list of your code contributions to aboutcode, by project, with links and brief description.

  • You can also list documentation/other contributions, issues opened etc. optionally.

Take feedback on proposal

  • You should share your proposal early to take feedback from mentors.

    • Don't be afraid to share your proposal even if it's a draft, keep updating after you share.
    • Share your proposal in a publicly viewable Google Doc. It should also have comment access enabled so mentors can provide feedback.
    • Do share the proposal publicly rather than privately to mentors. Respect the spirit of open source! Don't be afraid of plagiarism as we check for it.
  • Discuss the proposal on open issues/public chat/weekly community calls for more feedback and discussion. Act upon feedback already received and keep improving it.

  • Don't wait till the last day/moment to submit your proposal, submit early! The proposal is editable so you can always update later. Announce on the public channel after submitting your proposal on the GSoC website.

Selection criteria

While creating your proposal, think about how we select proposals from all the submissions we get, to make your proposal better. Think about whether your proposal is readable by a person who is not part of AboutCode, and whether they would still understand the problem, solution and steps. Also ask yourself whether the mentors will be satisfied with the level of detail and clarity in your proposal. Do you understand the main bottlenecks/challenges? Where do you expect help? Do you think your timeline is reasonable? Do you understand the deliverables correctly and have intermediate goals/deliverables by timeline?

Also consider the main factors we look at when judging proposals. We mentors want successful and impactful GSoC projects, and there are a few factors that help predict success:

  • Your contributions: We need to know whether you are capable of finishing the project successfully without significant hand holding. If you have multiple impactful and accepted code contributions, we know you are comfortable reading and writing code, understand the git/github and review workflow, and can solve problems yourself with a little bit of help. If your contributions are in the same project or area of your proposal, it also demonstrates your familiarity with the problem space, which is an added bonus.

  • Proposal clarity and detail: This is discussed in detail above in the Proposal Details section. We need to know you really understand the problem and the solution you are suggesting, and that your tasks and timeline are reasonable.

  • Communication and Feedback acceptance: Open source and GSoC are collaborative in nature. As beginners, you are not expected to know everything, and so taking feedback from more experienced community members and mentors is key to your success. This should happen at all the steps: on individual issues/PR review, proposal review and also throughout GSoC. So we need to know you can take constructive criticism and keep integrating feedback in your work.

Make sure you follow these guidelines to make your proposal stand out, for more chances of getting selected. If you are already doing everything mentioned above, you already have a very good chance of being accepted. In the rare case there are two proposals on the same project, we can only select one, and mentors have to make the hard choice of selecting one, based on these factors.


Our Project ideas

Here are some project related attributes you need to keep in mind while looking into prospective project ideas, see also finding the right project guide:

Project Priority

  1. The repositories/projects are sorted in order of importance (i.e. PURLdb, vulnerablecode and scancode.io are the most important ones, in that order, and then there are all other projects).

  2. The project ideas within a project are not sorted by priority.

  3. This doesn't mean we will always consider a project proposal with a higher priority idea over a relatively lower priority one, no matter the merit of the proposal. This is only one metric of selection, mostly to prioritize important projects.

  4. You can also suggest your own project ideas/discuss changes/updates/enhancements based on the provided ideas, but you need to really know what you are doing here and have lots of discussions with the maintainers.

Project Length

There are two project lengths:

  1. Medium (~175 hours)
  2. Large (~350 hours)

If you are proposing an idea from this ideas list, it should match what is listed here, and additionally please have a discussion with the mentors about your proposed length and timeline.

We have marked our ideas with medium/large but this is tentative and a best guess only. In a few cases both sizes are used to mark a project, as it can be either. Still, most of these are on the larger side, as these are large complex projects and you're likely underestimating the complexity (and how much we'll bug you to make sure everything is up to our standards) if you're proposing a medium length project anyway. You must discuss your proposal and the size of project you are proposing with a mentor, as otherwise we cannot consider your proposal fairly.

Please also note that the stipend differs based on the project size you select.

Project Tags

Here are all the tags we use for specific projects, feel free to search this page using these if you only want to look into projects with specific technical background.

[Django], [PostgreSQL], [Web], [DataStructures], [Scanning], [Javascript], [UI], [LiveServer], [API], [Metadata], [PackageManagers], [SBOM], [Security], [BinaryAnalysis], [Scraping], [NLP], [Social], [Communication], [Review], [Decentralized/Distributed], [Curation]

Project Difficulty Level

We generally use two levels of difficulty to characterize the projects:

  • Intermediate
  • Advanced

If a project is marked Advanced, significant domain knowledge is required to tackle it successfully. While this domain knowledge is not a hard prerequisite before you start, you must consult with mentors/maintainers early, ask a lot of domain-specific questions and be ready to research and tackle greenfield projects if you choose a project in this difficulty category.

Most Intermediate projects do not require this much domain knowledge, and what is needed can easily be acquired during proposal writing/contributing if you're familiar with the tech stack used in the project.


Here is a list of candidate project ideas for your consideration. Your own ideas are welcome too! Please chat about them to get early feedback!

Project Ideas Index

PURLdb:

Vulnerablecode:

scancode.io:

scancode-toolkit:

Other Project Ideas: https://github.com/nexB/aboutcode/wiki/GSOC-2023#all-otherarchived-project-ideas


PURLdb project ideas


PurlDB: Fetch, scan and index existing PurlDB packages

Repository: https://github.com/nexB/purldb

Project code: https://github.com/nexB/purldb/tree/main/matchcode

Size: Medium

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [Web], [DataStructures], [Scanning]

Mentors:

  • @jyang
  • @pombredanne
  • @AyanSinhaMahapatra

Related Issues:

Description:

The objective is to create an indexing queue to fetch, scan and index existing PurlDB packages in matchcode.

PURLdb has an existing index of package archives (which is used for matching across packages) and package data for each package, created by scans with scancode-toolkit. We need to create an indexing queue, fed by package repository visitor processes (or by manual addition), which would add a PackageURL/package to the queue for re-fetching, scanning (possibly with an updated version of scancode-toolkit) and then re-indexing this package.
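
As a rough illustration of the queue described above, here is a minimal in-memory sketch. The `IndexingQueue` class and its method names are hypothetical and are not existing PurlDB code; the fetch, scan and index steps are passed in as plain callables, and a real implementation would most likely persist the queue in the database.

```python
from collections import deque


class IndexingQueue:
    """Hypothetical FIFO queue of PURLs waiting to be re-fetched, scanned and indexed."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()  # PURLs currently queued, used to skip duplicates

    def add(self, purl):
        """Queue a PURL; return False if it is already waiting."""
        if purl in self._pending:
            return False
        self._pending.add(purl)
        self._queue.append(purl)
        return True

    def process_next(self, fetch, scan, index):
        """Run the fetch -> scan -> index steps on the oldest queued PURL."""
        if not self._queue:
            return None
        purl = self._queue.popleft()
        self._pending.discard(purl)
        archive = fetch(purl)
        scan_results = scan(archive)
        index(purl, scan_results)
        return purl
```

Visitor processes and manual additions would both call `add()`, while a worker would repeatedly call `process_next()` with the real fetch, scan and index functions.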


PURLdb: Add UI and deploy a live public server

Repository: https://github.com/nexB/purldb

Project code: https://github.com/nexB/purldb/tree/main/purldb

Size: Large

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [Javascript], [Web], [UI], [LiveServer]

Mentors:

  • @jyang
  • @tdruez
  • @pombredanne
  • @tg1999

Related Issues:

Description:

There are two tasks here:

  1. Add UI:

Add a basic Django UI for the project supporting query by packages, scanning and matching. We would be heavily reusing elements from scancode.io and vulnerablecode to give it the same look and feel.

  2. Deploy a public server similar to https://public.vulnerablecode.io/ as a demo with packageDB data. See https://github.com/nexB/vulnerablecode for reference.

PURLdb: On-demand retrieval of package metadata/archives

Repository: https://github.com/nexB/purldb, https://github.com/nexB/fetchcode and https://github.com/nexB/scancode.io

Project code: https://github.com/nexB/purldb/tree/main/purldb

Size: Large

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [Javascript], [API], [Metadata], [PackageManagers]

Mentors:

  • @jyang
  • @pombredanne
  • @tg1999
  • @AyanSinhaMahapatra

Related Issues:

Description:

Given a PackageURL or a list of PURLs:

  • fetch package metadata
  • fetch package archives

Fetching would be from their respective package manager API; for example, see https://pypi.org/pypi/attrs/22.2.0/json. This is already WIP, as a lot of the code for a number of package managers is present, but the following has to be done:

  1. standardize the code across package managers and make this available for query by PURL
  2. create package URL -> download URL mapping functions for every package type; some of this is already in fetchcode, and more support needs to be added there
  3. this can be added either in PURLdb or as a pipeline in scancode.io
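
The PURL to URL mapping idea can be sketched as follows. This is a minimal illustration only: all function names and the `METADATA_URL_BUILDERS` registry are hypothetical, the parser handles only simple `pkg:type/name@version` PURLs (no namespace, qualifiers or subpath), and only the PyPI JSON URL pattern shown in the example above is covered.

```python
def parse_simple_purl(purl):
    """Split a simple 'pkg:type/name@version' PURL into its parts.

    Namespaces, qualifiers and subpaths are deliberately not handled here.
    """
    scheme, _, rest = purl.partition(":")
    if scheme != "pkg":
        raise ValueError(f"not a PURL: {purl!r}")
    package_type, _, name_version = rest.partition("/")
    name, _, version = name_version.partition("@")
    if not package_type or not name:
        raise ValueError(f"incomplete PURL: {purl!r}")
    return package_type, name, version or None


def pypi_metadata_url(name, version):
    """Return the PyPI JSON API URL for a package, with or without a version."""
    if version:
        return f"https://pypi.org/pypi/{name}/{version}/json"
    return f"https://pypi.org/pypi/{name}/json"


# Hypothetical registry: one mapping function per supported package type.
METADATA_URL_BUILDERS = {"pypi": pypi_metadata_url}


def purl_to_metadata_url(purl):
    """Dispatch to the mapping function registered for the PURL's type."""
    package_type, name, version = parse_simple_purl(purl)
    try:
        builder = METADATA_URL_BUILDERS[package_type]
    except KeyError:
        raise NotImplementedError(f"unsupported package type: {package_type}")
    return builder(name, version)
```

Supporting another package manager would then mean adding one function to the registry, which is one possible way to standardize the code across package types.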

PURLdb/ScanCode.io: Enrich an SBOM based on OSSF Security Score Card

Repository: https://github.com/nexB/purldb and https://github.com/nexB/scancode.io

Reference: https://github.com/ossf/scorecard

Size: Medium

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [SBOM], [Metadata], [Security]

Mentors:

  • @jyang
  • @tdruez
  • @pombredanne
  • @tg1999
  • @AyanSinhaMahapatra

Related Issues:

Description:

We already have SBOM export (and import) options in scancode.io supporting SPDX and CycloneDX SBOMs, and we can enrich this data using the public https://github.com/ossf/scorecard#public-data or the RestAPI at: https://api.securityscorecards.dev/.

The specific tasks for this project are:

  • Research and figure out how best to consume this data
  • Map this data to SPDX/CycloneDX SBOM elements i.e. how it can be exported in a BOM
  • Use this in a pipeline in scancode.io AND/OR have this as an element in packageDB
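
One possible shape for the mapping task is sketched below. It assumes that the REST API exposes results under a `/projects/<host>/<owner>/<repo>` path and that a result contains an aggregate `score` plus a list of `checks` with `name` and `score` fields; both assumptions should be verified against the Scorecard documentation. The `ossf-scorecard:` property names are made up for illustration, and the output follows the name/value property pattern used by CycloneDX.

```python
def scorecard_api_url(repo_url):
    """Build the Scorecard REST API URL for a repository URL (assumed endpoint layout)."""
    path = repo_url.split("://", 1)[-1].rstrip("/")
    return f"https://api.securityscorecards.dev/projects/{path}"


def scorecard_to_properties(scorecard):
    """Convert a Scorecard result dict into CycloneDX-style name/value properties."""
    properties = [
        {"name": "ossf-scorecard:score", "value": str(scorecard["score"])},
    ]
    for check in scorecard.get("checks", []):
        properties.append({
            "name": f"ossf-scorecard:check:{check['name']}",
            "value": str(check["score"]),
        })
    return properties
```

The resulting list could be attached to the matching package entry when exporting a BOM; how this maps onto SPDX is left open here, as it is part of the research task.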

PURLdb: Create relationships between source and binary packages

Repository: https://github.com/nexB/purldb

Project code: https://github.com/nexB/purldb/tree/main/matchcode

Size: Large

Difficulty Level: Advanced

Tags: [Django], [PostgreSQL], [BinaryAnalysis], [Metadata], [Security]

Mentors:

  • @pombredanne
  • @mjherzog
  • @chinyeungli

Description:

The proposed functionality is to match files from the source code tree of a given package to the binary created for this package, in order to apply the license/metadata/conclusions obtained from the source package scans to the binary and analyze its license/attribution obligations.

An example could be a pipeline for end to end Java app binaries reverse engineering. This project can also be realised similarly for the following ecosystems:

  • JavaScript
  • Android
  • iOS
  • Go
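
For the Java case, a first path-based matching step could look like the sketch below. It is a naive illustration with hypothetical function names: it only maps `.class` paths found in a binary archive to `.java` paths in a source tree by path suffix, and it ignores generated classes, other JVM languages and content-based matching.

```python
def class_to_source_path(class_path):
    """Map 'org/foo/Bar$Inner.class' to the expected 'org/foo/Bar.java'."""
    if not class_path.endswith(".class"):
        return None
    stem = class_path[: -len(".class")]
    directory, _, class_name = stem.rpartition("/")
    # Inner classes are compiled from the source file of their outer class.
    outer_class = class_name.split("$", 1)[0]
    source_name = outer_class + ".java"
    return f"{directory}/{source_name}" if directory else source_name


def map_binary_to_source(class_paths, source_paths):
    """Return {class_path: source_path or None} by matching path suffixes."""
    mapping = {}
    for class_path in class_paths:
        expected = class_to_source_path(class_path)
        match = None
        if expected:
            for source_path in source_paths:
                if source_path == expected or source_path.endswith("/" + expected):
                    match = source_path
                    break
        mapping[class_path] = match
    return mapping
```

Once a binary file is mapped to a source file, the license and copyright detected on the source file can be carried over to the binary file. Unmapped files (value `None`) are the ones that need further analysis.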

VulnerableCode project ideas

There are two main categories of projects for VulnerableCode:

  • A. COLLECTION: this category is to mine and collect or infer more new and improved data. This includes collecting new data sources, inferring and improving existing data or collecting new primary data (such as finding a fix commit of a vulnerability)

  • B. USAGE: this category is about using and consuming the vulnerability database and includes the API proper, the GUI, the integrations, and data sharing, feedback and curation.


VulnerableCode: Process unstructured data sources for vulnerabilities (Category A)

Repository: https://github.com/nexB/vulnerablecode

Reference: https://github.com/nexB/vulnerablecode/issues/251

Size: Large

Difficulty Level: Advanced

Tags: [Python], [Django], [PostgreSQL], [Security], [Vulnerability], [NLP]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @Hritik14
  • @AyanSinhaMahapatra

Related Issues:

Description:

The project would be to provide a way to effectively mine unstructured data sources for possible unreported vulnerabilities.

For a start this should be focused on a few prominent repos. This project could also find Fix Commits.

Some sources are:

  • mailing lists
  • changelogs
  • reflogs of commit
  • bug and issue trackers

This requires systems to "understand" vulnerability descriptions, as security advisories often do not provide structured information on which package and package versions are vulnerable. The end goal is creating a system which would infer the vulnerable package name and version(s) by parsing the vulnerability description using specialised techniques and heuristics.

One option is to use NLP/machine learning and automate it all, potentially training data masking algorithms to find these specific data (this also involves creating a dataset), but that is going to be very difficult.

Alternatively, we could start by crafting a curation queue and parsing as much as we can to make it easy for humans to curate, and progressively improve some mini NLP models and classification to help further automate the work.
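
To illustrate the heuristic end of this spectrum, here is a small sketch that extracts version constraints from a few common advisory phrasings with regular expressions. The patterns and the `extract_version_constraints` name are hypothetical, they only cover dotted numeric versions, and they make no attempt to identify the package name; the output would be a candidate for human curation rather than a final answer.

```python
import re

# Hypothetical patterns for a few common advisory phrasings.
VERSION = r"v?(\d+(?:\.\d+)+)"
PATTERNS = [
    ("<", re.compile(rf"\b(?:before|prior to)\s+(?:version\s+)?{VERSION}", re.IGNORECASE)),
    ("<=", re.compile(rf"\b(?:through|up to and including)\s+(?:version\s+)?{VERSION}", re.IGNORECASE)),
]


def extract_version_constraints(description):
    """Return a list of (comparator, version) tuples found in free text."""
    constraints = []
    for comparator, pattern in PATTERNS:
        for match in pattern.finditer(description):
            constraints.append((comparator, match.group(1)))
    return constraints
```

For example, a description such as "Foo before 1.2.3 allows XSS." would yield a single "less than 1.2.3" constraint that a reviewer can confirm or correct.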


VulnerableCode: Add more data sources and mine the graph to find correlations between vulnerabilities (Category A)

Repository: https://github.com/nexB/vulnerablecode

Reference: https://github.com/nexB/vulnerablecode/issues?q=is%3Aissue+is%3Aopen+label%3A"Data+collection"

Size: Large

Difficulty Level: Intermediate

Tags: [Django], [PostgreSQL], [Security], [Vulnerability], [API], [Scraping]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @Hritik14
  • @jmhoran

Related Issues:

Description:

See https://github.com/nexB/vulnerablecode#how for background info. We want to search for more vulnerability data sources and consume them.

There is a large number of pending tickets for data sources. See https://github.com/nexB/vulnerablecode/issues?q=is%3Aissue+is%3Aopen+label%3A"Data+collection"

Also see tutorials for adding new importers and improvers:

More reference documentation in improvers and importers:

Note that this is similar to this GSoC 2022 project (a continuation):


VulnerableCode: On demand live evaluation of packages (Category A)

Repository: https://github.com/nexB/vulnerablecode

Size: Large

Difficulty Level: Intermediate

Tags: [Python], [Django], [PostgreSQL], [Security], [web], [Vulnerability], [API]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space

Related Issues:

Description:

Currently vulnerablecode runs importers in bulk where all the data from advisories are imported and stored to be displayed.

The objective of this project is to have another endpoint and API where we can:

  • support querying a specific package by PURL
  • visit advisories/package ecosystem specific vulnerability datasources and query them for this specific package
  • do this irrespective of whether data related to this package is already present in the db (i.e. both for new packages and for refreshing old packages)

VulnerableCode: Implement new improvers (Category A)

Repository: https://github.com/nexB/vulnerablecode

Reference:

Size: Large

Difficulty Level: Intermediate

Tags: [Python], [Django], [PostgreSQL], [Security], [web], [Vulnerability], [API]

Mentors:

  • @pombredanne
  • @tg1999
  • @keshav-space
  • @jmhoran

Related Issues:

Description:

One example is to make an improver that infers the affected ranges for advisory data sources that only give fixed versions.

Take for example the AlpineLinux importer: we can only get the fixed versions from the importer. The aim of this project is to make a generic improver to infer the affected ranges with the help of fixed versions.
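
The inference can be sketched as follows, under two loud assumptions: every known version older than the fixed version is treated as affected (which may be wrong if the vulnerability was introduced late), and versions are plain dotted numbers. The function names are hypothetical; a real improver would presumably rely on univers for ecosystem-specific version parsing and comparison instead of the naive key used here.

```python
def version_key(version):
    """Naive sort key for dotted numeric versions such as '1.10.2'."""
    return tuple(int(part) for part in version.split("."))


def infer_affected_versions(known_versions, fixed_version):
    """Assume every known version older than the fix is affected."""
    fixed = version_key(fixed_version)
    return sorted(
        (v for v in known_versions if version_key(v) < fixed),
        key=version_key,
    )
```

Note that comparing versions as tuples of integers rather than as strings is what makes `1.9.3` sort before `1.10.0`.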


VulnerableCode: Decentralized vulnerability data peer-review (Category B)

Repository: https://github.com/nexB/vulnerablecode (this is possibly a new research project)

Reference: https://www.tdcommons.org/cgi/viewcontent.cgi?article=6738&context=dpubs_series

Size: Large

Difficulty Level: Advanced

Tags: [Django], [PostgreSQL], [Security], [web], [Vulnerability], [Social], [Communication], [Review], [Decentralized/Distributed], [Metadata], [Curation]

Mentors:

  • @pombredanne
  • @tdruez
  • @jyang
  • @tg1999

Related Issues:

Description:

See the reference paper by @pombredanne above for more details. The goal would be a new system and approach that supports decentralized and federated metadata aggregation and sharing to remove data silos, distribute control, improve availability and enable distributed and social review of this metadata. In the context of this project, this will be vulnerability metadata/advisories that need review.

This can be created using the ActivityPub W3C standard: https://www.w3.org/TR/activitypub/

This is a research project to explore this idea and create a proof-of-concept with a minimal set of features. It does not have to be a full social network, as that is out of scope for this project.


VulnerableCode/Vulntotal: Browser Extension (Category B)

Repository: https://github.com/nexB/vulnerablecode

Reference: https://github.com/nexB/vulnerablecode/tree/main/vulntotal

Size: Medium

Difficulty Level: Intermediate

Tags: [Python], [Security], [Web], [Vulnerability], [BrowserExtension], [UI]

Mentors:

  • @keshav-space
  • @pombredanne
  • @tg1999

Related Issues:

Description:

Implement a Firefox/Chrome browser extension which would run vulntotal on the client side and query the vulnerability datasources to compare them. The input will be a PURL, as in vulntotal.

  • research tools to run python code in a browser (brython/pyscript)
  • implement the browser extension to run vulntotal

ScanCode.io project ideas


Upgrade VulnerableCode Integration in ScanCode.io (Category B)

Repository: https://github.com/nexB/scancode.io

Reference: public.vulnerablecode.io/

Size: Large

Difficulty Level: Intermediate

Tags: [Python], [Django], [PostgreSQL], [UI], [Security], [Vulnerability], [SBOM]

Mentors:

  • @pombredanne
  • @tdruez
  • @keshav-space
  • @AyanSinhaMahapatra

Related Issue:

Description:

The goal of this project is to detect vulnerable packages found in a ScanCode.io project, store this information in the database efficiently with specific models and report this information in standard VDR formats in SBOMs.

We already have a minimally working vulnerability pipeline in scancode.io, see: https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/find_vulnerabilities.py

But there is still a lot of work left to make this more useful and usable:

  • Design the models for the data coming from vulnerablecode.
  • Add UI for showing vulnerable packages similar to vulnerablecode.
  • Add SPDX and CycloneDX output support for these vulnerabilities as VDRs (VDR: Vulnerability Disclosure Report)
  • Docs on the vulnerability pipeline and how to connect with https://public.vulnerablecode.io/

ScanCode.io: Create GitHub SBOM creation action(s):

Repository: https://github.com/nexB/scancode.io

Reference: https://github.com/nexB/aboutcode/wiki/Project-Ideas-Create-GitHub-SBOM-action

Size: Large

Difficulty Level: Intermediate

Tags: [Python], [Django], [CI], [Security], [Vulnerability], [SBOM]

Mentors:

  • @pombredanne
  • @tdruez
  • @keshav-space
  • @tg1999
  • @AyanSinhaMahapatra

Related Issue:

Description:

Create a GitHub action using scancode.io:

  • use package/dependencies/vulnerability data from scancode.io
  • to output a SPDX/CycloneDX SBOM
  • upload this as an artifact created by the action (like artifacts created on tag push/release)

ScanCode Toolkit project ideas


Create pure-python fallback dependencies:

Repository: https://github.com/nexB/scancode-toolkit

Reference: https://github.com/WojciechMula/pyahocorasick/ and https://github.com/inveniosoftware-contrib/intbitset

Size: Large

Difficulty Level: Advanced

Tags: [Python], [DS/Algo]

Mentors:

  • @pombredanne
  • @jyang
  • @AyanSinhaMahapatra

Related Issue:

Description:

We often have portability/installation issues as scancode-toolkit depends on some native C code dependencies for performance-critical parts. We should have fallback versions of the same libraries that are degraded and not as fast, but pure Python, for portability. The key dependencies in this case are:

  • pyahocorasick: could expand the built-in simpler pure Python implementation to implement the pyahocorasick APIs
  • intbitset: could roll out a simple set-based fallback
  • lxml (and other libs based on it such as xmldict): could use stdlib xml.etree instead
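
The fallback pattern can be illustrated with intbitset. This is a minimal sketch, not the scancode-toolkit implementation: the stand-in class only covers the basic operations inherited from the built-in `set` and does not reproduce the full intbitset API or its performance.

```python
try:
    # Fast native implementation, used when it is installed.
    from intbitset import intbitset
except ImportError:
    class intbitset(set):
        """Degraded pure-Python stand-in covering only basic set operations.

        Unlike the native type, it does not reject negative or non-integer
        items, and set operators return plain ``set`` objects.
        """


# Calling code is the same whichever implementation was imported.
matched = intbitset([1, 2, 3]) & intbitset([2, 3, 4])
```

Because callers only see the `intbitset` name, the choice between the native and the pure Python version is made once at import time, and installation no longer fails on platforms where the native library cannot be built.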

ScanCode Toolkit: Create API docs automatically from ScanCode data models:

Reference: https://github.com/nexB/aboutcode/wiki/Project-Ideas-Create-docs-automatically-from-scancode-data

Size: Medium

Difficulty Level: Intermediate

Tags: [Python], [Django], [Sphinx], [Documentation]

Mentors:

  • @pombredanne
  • @AyanSinhaMahapatra

Related Issue:


All Other/Archived Project Ideas

These are lower priority project ideas from all the projects, or older project ideas from previous years' GSoC, archived here.


VulnerableCode/scancode.io: return SPDX or CycloneDX report for VEX (Category B)

Repository: https://github.com/nexB/vulnerablecode and https://github.com/nexB/scancode.io

Reference: https://ntia.gov/files/ntia/publications/vex_one-page_summary.pdf

Size: Medium

Difficulty Level: Intermediate

Tags: [Python], [Security], [Vulnerability], [SBOM]

Mentors:

  • @dmclark
  • @pombredanne

Description:

VEX stands for Vulnerability Exploitability eXchange.

The goal of this project is to provide export capabilities to produce VEX documents that comply with industry-recognized formats. This can be in scancode.io, enriching the already existing CycloneDX/SPDX outputs (or in vulnerablecode?).

See the example VEX at https://github.com/CycloneDX/bom-examples/blob/master/VEX/vex.json

There is a descriptive overview of the CycloneDX approach to VEX here https://github.com/CycloneDX/bom-examples/tree/master/VEX
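
As a rough sketch of what a CycloneDX-style VEX export could build for one vulnerability: the function name `build_vex_entry` and its parameters are hypothetical, and the field names and the list of analysis states reflect one reading of the CycloneDX 1.4 schema, so they should be checked against the specification and the example linked above.

```python
# Analysis states as understood from the CycloneDX 1.4 schema (assumption:
# verify against the specification before relying on this list).
ANALYSIS_STATES = {
    "resolved",
    "resolved_with_pedigree",
    "exploitable",
    "in_triage",
    "false_positive",
    "not_affected",
}


def build_vex_entry(vulnerability_id, affected_ref, state, detail=""):
    """Return one entry for the "vulnerabilities" array of a CycloneDX BOM.

    `affected_ref` is the bom-ref of the affected component.
    """
    if state not in ANALYSIS_STATES:
        raise ValueError(f"unknown analysis state: {state!r}")
    return {
        "id": vulnerability_id,
        "analysis": {"state": state, "detail": detail},
        "affects": [{"ref": affected_ref}],
    }


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    entry = build_vex_entry(
        "CVE-0000-0000",
        "pkg:pypi/example-package@1.0.0",
        "not_affected",
        detail="The vulnerable function is never called.",
    )
    print(entry)
```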


VulnerableCode: Create a purl "virtual" database, library and service. (Category B)

A key attraction of VulnerableCode is its built-in support for purl. The goal of this project is to make purl more accessible and visible and:

  • enhance the purl2url and url2purl support of the packageurl Python library such that it can process more common package types
  • enhance the packageurl Python library to convert more purl-like data to purl and in particular the OSV format, the new NVD 5.0 reference, the ORT coordinates, etc.
  • enhance the purl2cpe VulnerableCode utility such that it can process more cases to create better purls. Create a script to publish a continuously updated repository with the purl2cpe data.
  • expose a url2purl API service in VulnerableCode to help create correct purls
  • expose a purl2url API service in VulnerableCode to help return a list of URLs given a purl.
  • publish

This is a large size project idea.


VulnerableCode: Create a Vulnerability review app (Category B)

The goal of this web app (integrated in the core VulnerableCode) would be to assist in the curation of vulnerabilities and the operation of VulnerableCode.

The UI would enable reviewers to triage, refine, improve and curate vulnerability data. This could include linking and displaying remote references in place.

The UI should also help display importer and improver errors and provide ways to act on these to fix errors that require data resolution.

There are also data models needed to support an efficient review queue.

This is a large size project idea.


VulnerableCode: Vulnerability code scanners (e.g. static code analysis): (Category B)

Create scanners that would verify whether a codebase is vulnerable to a vulnerability. Once we know that a vulnerable package is in use, a scanner could check whether the vulnerable code is called, or whether environmental conditions or configuration are conducive to the vulnerability, etc. This could be based on yara rules, OpenVAS or similar, or on Eclipse Steady and deeper code analysis, static or dynamic.


ScanCode.io: web-based automated Conclusions app and GUI review app

This project is to create a new web application in ScanCode.io to help reach conclusions on an analysis project with respect to the origin, license or vulnerabilities of a codebase. This is an important project that comprises:

  • design the data models for conclusions
  • create a mini framework to run "bots" that can automate reaching "conclusions" on licensing and origin including spotting issues and exceptions
  • create the UI to visualize these conclusions and eventually update them by hand

This is a large size project.


ScanCode.io: Improve the web UI experience in SCIO

We have limited ways to navigate the data in ScanCode.io. The goal of these project(s) is to improve the UI in several areas and in particular:

  • enabling better linking to resource details from the graphics view
  • provide streamlined simpler resource views that only display the important data and have fewer details (but still provide ways to drill down)
  • improve the way match details are visualized in a single resource page, such that which license and which copyright were detected where is more obvious and the actual license scoring is

This can be a large or medium size project.


ScanCode.io: external storage and archival of scanned code.

This project should extend ScanCode.io such that it can use external storage for the scanned code. The problem is that when you run a large number of projects the volume of storage that is used in ScanCode.io grows a lot. For this we can now archive projects, but we cannot archive the corresponding code that was scanned. The goal of this project is to add a new option in ScanCode.io to also archive to some blob storage the code that was scanned such that:

  • this can be done at the same time a project is archived

  • it can be possible to restore from this archival a state that is essentially the same as the original project state in terms of files and data

  • it would mean to archive the code input of a project or the whole workspace of a project

  • as a bonus it should also export the project's data, codebase resources, packages and other models, such that this can be imported in another instance of the same version of ScanCode.io

This is a medium or large size project idea.


ScanCode.io: pluggable advanced and extended pipelines with custom data models and UI.

This project should create a new framework for advanced ScanCode.io pipelines such that it becomes possible to:

  • include pluggable new data models specific to a pipeline (for instance to store the debug symbols found in a binary file)
  • add pluggable UI for a pipeline that would include ways to navigate the data models
  • add pluggable reporting for a pipeline that would include standard reports

As a practical implementation, this project should implement a concrete UI and extension to store and display extended information for Docker image and VM image projects, such as the OS, FS and layer details (displayed today as simple plain text).

This is a large size project idea.


ScanCode.io: create a system and web UI to scan ALL the packages from Debian and fix and review all of them

This project would become a prototype to help scan and curate the package licensing of a specific ecosystem. It would include:

  • specific pipelines tuned to collect lists of all the packages and organize the scans of these correctly
  • specific UI to visualize the queue of scan projects
  • specific libraries to detect common licensing issues of this package type
  • a UI to organize the community/peer review of all these package scans and issues
  • extension to create reports and update the package type manifests (here Debian machine readable copyright files)

See also: Create web application for massive scanning campaign of a whole package ecosystem

This is a large size project idea.


In particular, this project could add a new pipeline for integration with external matching services. This would include tools such as SoftwareHeritage or Scanoss and other component or package identification integrations. The goal would be to create "pipes" and an improved package scanning pipeline that would include matching.


This is a large project idea.


ScanCode.io: Add web service for software package and project evaluations and comparisons (djangopackages-like)

This project would build on the djangopackages/opencomparison code to provide:

  • a general purpose and easy way to create and share package comparison grids
  • their scanning integration in ScanCode.io

This is a large or medium project idea.


This is a medium size project idea.


This is to have faster license and copyright detection using less memory.

This is a large size project idea.


TraceCode/ScanCode Toolkit/ScanCode.io: Source to binary reverse engineering

This project is about integrating multiple existing plugins and tools with a single goal: to find which source code was used to create a compiled binary, using symbols, debug symbols, strings or more.

This is a large size project idea and this requires quite a bit of knowledge of binaries and source and build processes.


ScanCode Toolkit: License Language Server Protocol server for IDE integration

This project would implement a Language Server Protocol server for license and copyright that would use ScanCode Toolkit and provide live license and copyright feedback directly in IDEs. It would also provide a plugin for integration in at least one IDE such as Atom, VSCode or Eclipse.

This is a large size project idea.


Univers: Validate that the univers library can handle all the versions and ranges of all the packages!

Project(s) in this domain would consist of building test suites for all the versions in univers and fixing the failures they reveal. Practically, this means downloading all the versions of all the packages of an ecosystem (for instance PyPI) and validating that we can compare versions as well as the reference package management tool for this ecosystem. For instance, for Alpine, see https://git.alpinelinux.org/apk-tools/tree/test/version.sh?h=v2.12.9

Some specific highlights would cover:

  • writing code that can collect the list of all the versions of all the packages in a given package ecosystem (for instance PyPI, npm, etc.). This code would likely be in FetchCode or the ScanCode Toolkit packagedcode module. This could be extended to collect all the version ranges.

  • write an automated test harness to ensure that the univers library can properly parse (and unparse) all the versions and version ranges of all the packages.

  • write an automated test harness to ensure that the univers library can properly sort all the versions of each package in an ecosystem.

  • Update the univers library accordingly and create a unit test suite as needed
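
The sorting harness described in the list above could be sketched as follows. It takes a list of versions already ordered by the reference tool of an ecosystem, sorts it again with a candidate key function, and reports every position where the two orders differ. `simple_key` is a hypothetical stand-in for a real univers version class, and the helper names are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def simple_key(version: str) -> Tuple[int, ...]:
    """Hypothetical stand-in parser that handles dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def find_sort_mismatches(
    reference_order: Sequence[str],
    key: Callable[[str], object],
) -> List[Tuple[int, str, str]]:
    """Compare a candidate sort against the reference ordering.

    Returns (index, expected, got) tuples for each differing position.
    An empty list means the candidate key reproduces the reference order.
    """
    candidate_order = sorted(reference_order, key=key)
    return [
        (index, expected, got)
        for index, (expected, got) in enumerate(
            zip(reference_order, candidate_order)
        )
        if expected != got
    ]


if __name__ == "__main__":
    reference = ["1.2", "1.10", "2.0"]
    # A numeric key agrees with the reference order.
    print(find_sort_mismatches(reference, simple_key))
    # Plain string comparison does not: "1.10" sorts before "1.2".
    print(find_sort_mismatches(reference, str))
```

Run over every package of an ecosystem, the non-empty results become the list of version schemes that still need fixes in univers.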


Package URL: Chrome and Firefox extension to support browsing Package URL.

Browsing pkg:pypi/packageurl-python/ should go to https://pypi.org/project/packageurl-python/

And create/register a DuckDuckGo bang mapper for https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py. Also add multiple URLs in purl2url.py.
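
The PyPI mapping mentioned in this idea can be illustrated with a small standard-library sketch. This is not the purl2url.py implementation from packageurl-python: the function name `pypi_purl_to_url` is hypothetical, and the parsing is deliberately simplified (it ignores purl qualifiers and subpaths and handles only the pypi type).

```python
def pypi_purl_to_url(purl: str) -> str:
    """Map a pkg:pypi/... Package URL to its PyPI project page URL."""
    scheme, _, remainder = purl.partition(":")
    if scheme != "pkg":
        raise ValueError(f"not a Package URL: {purl!r}")

    # Drop the optional subpath (#...) and qualifiers (?...).
    remainder = remainder.split("#", 1)[0].split("?", 1)[0]

    package_type, _, path = remainder.strip("/").partition("/")
    if package_type.lower() != "pypi":
        raise ValueError(f"unsupported package type: {package_type!r}")

    name, _, version = path.partition("@")
    if not name:
        raise ValueError(f"missing package name: {purl!r}")

    url = f"https://pypi.org/project/{name}/"
    if version:
        url += f"{version}/"
    return url


if __name__ == "__main__":
    print(pypi_purl_to_url("pkg:pypi/packageurl-python/"))
```

A browser extension would apply the same kind of mapping whenever a `pkg:` address is entered or clicked.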





CommonCode: Package name and version inference from a file name: get package name and version reliably

This project would provide a more reliable way to infer a package name and version from a package archive name. For instance, the simple case of "log4j-1.2.3.jar" could yield type:maven, name:log4j, version:1.2.3. The existing regex-based code in commoncode at https://github.com/nexB/commoncode/blob/main/src/commoncode/version.py is a bit complex to maintain. The project could possibly use some machine learning. In all cases, part of the project is to collect a test dataset of a large number of released archive names from various sources (sf.net, SWH, Debian, Fedora) to use as a test set (and possibly as a training set for ML).
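
For the simple case above, a naive baseline could look like the sketch below. It is not the existing commoncode code: the function name, the extension list and the regular expression are all assumptions, and it splits on the first hyphen that is followed by a digit, which will fail on many real archive names. It leaves out the package type; a test dataset would show where such a baseline breaks.

```python
import re
from typing import Optional, Tuple

# Hypothetical, incomplete list of archive extensions to strip.
KNOWN_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".zip", ".jar")

# Name, then a hyphen, then a version that starts with a digit.
NAME_VERSION = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.+~-]*)$")


def infer_name_version(filename: str) -> Optional[Tuple[str, str]]:
    """Return (name, version) guessed from an archive file name, or None."""
    stem = filename
    for extension in KNOWN_EXTENSIONS:
        if filename.lower().endswith(extension):
            stem = filename[: -len(extension)]
            break
    match = NAME_VERSION.match(stem)
    if not match:
        return None
    return match.group("name"), match.group("version")


if __name__ == "__main__":
    print(infer_name_version("log4j-1.2.3.jar"))
```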


In search of popularity and prominence metric for software packages

See https://github.com/nexB/aboutcode/wiki/Project-Ideas-Project-popularity
