GSoC 2023 Project idea: Improved product representation & meta-info about products. #2633

@terriko

cve-bin-tool: Improved product representation & meta-info about products.

Project description

  • In binary scans we currently report whatever name we use for a product internally, and in non-binary scans (such as SBOM or package list parsing) whatever name appears in the file. It would be nice to include things like Software Heritage designations, especially to allow for de-duplication if we combine scans from multiple BOMs. We might also want to see how viable it is to provide other commonly desired meta-info such as licensing, source URLs, packaging data, etc.
  • For packages that come through our "language parsers", some of this data may already be available to us; we just aren't storing or using it fully. We should be able to use a lot of databases of meta-information directly and periodically refresh them from live sources as needed.
  • We also mostly just do "search by exact name in NVD" style matching. Using metadata from package managers may allow us to do better in cases where the product name is very generic and re-used in different environments (e.g. something like json-parser), or where a product has changed names/CPE designations and needs more than one (looking at our other checkers, this happens fairly frequently in the Linux package data). We may also want to allow users to add data to improve scans.
  • You will almost certainly need to build a data format for de-dupe / metadata and allow users to add to it, similar to how we have checkers right now.
  • The goal here isn't perfection: tagging, say, 50-70% of things with additional metadata might be enough.
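
As a rough illustration of the kind of data format the bullets above describe, here is a minimal sketch of a metadata record that ships with the tool and can be extended by users. Everything in it is an assumption made for illustration: the `ProductMeta` class, its field names, and the `load_records` / `apply_user_additions` helpers are hypothetical and are not part of cve-bin-tool.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProductMeta:
    """Hypothetical metadata record; field names are illustrative only."""

    canonical_name: str
    nvd_pairs: list = field(default_factory=list)      # [vendor, product] pairs in NVD
    package_names: dict = field(default_factory=dict)  # ecosystem -> package name
    license: str = ""
    source_url: str = ""


def load_records(text: str) -> dict:
    """Parse a JSON list of records into a dict keyed by canonical name."""
    return {r["canonical_name"]: ProductMeta(**r) for r in json.loads(text)}


def apply_user_additions(builtin: dict, user: dict) -> dict:
    """Return a copy of `builtin` where non-empty user fields override built-in ones."""
    merged = dict(builtin)
    for name, extra in user.items():
        if name not in merged:
            merged[name] = extra  # entirely new product supplied by the user
            continue
        base = asdict(merged[name])
        for key, value in asdict(extra).items():
            if value:  # empty strings, lists and dicts leave the built-in value alone
                base[key] = value
        merged[name] = ProductMeta(**base)
    return merged
```

In this sketch a user-supplied list or dict replaces the built-in one wholesale rather than being merged item by item; whether that is the right behaviour is one of the judgement calls the project would need to make.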

An example of where this gets messy:

The text-parsing library Beautiful Soup is available in pip as beautifulsoup4, on Debian-based systems as python3-bs4 and on Fedora-based systems as python-beautifulsoup4. If we were detecting this package in all those formats, it would be useful to have metadata that told us:

  • what {vendor, product} pair(s) it has in NVD
  • a mapping showing that all of those packages refer to the same software. That includes correlating the names but also the sources: we don't want to accidentally assume that, say, a Ruby json-parser and a Python json-parser are the same code.
  • additional data such as source URL and license that we could include in reports (canonical source URLs are surprisingly hard; some distros point to their own forks, for example)
  • and so on.
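
The name-mapping part of this example could be sketched roughly as follows. This is only an illustration under stated assumptions: the `ALIASES` table, the ecosystem labels (`pypi`, `deb`, `rpm`, `gem`), and the `canonical_id` / `dedupe` helpers are hypothetical and do not exist in cve-bin-tool. Only the three Beautiful Soup package names come from the text above.

```python
# Hypothetical alias table: (ecosystem, package name) -> canonical product id.
ALIASES = {
    ("pypi", "beautifulsoup4"): "beautifulsoup",
    ("deb", "python3-bs4"): "beautifulsoup",
    ("rpm", "python-beautifulsoup4"): "beautifulsoup",
}


def canonical_id(ecosystem: str, name: str) -> str:
    """Map a package to its canonical id.

    Unknown packages stay namespaced by ecosystem, so two packages that only
    share a name (e.g. a Ruby and a Python json-parser) are never merged.
    """
    return ALIASES.get((ecosystem, name), f"{ecosystem}:{name}")


def dedupe(components):
    """Group (ecosystem, name, version) tuples by canonical id."""
    groups = {}
    for ecosystem, name, version in components:
        key = canonical_id(ecosystem, name)
        groups.setdefault(key, []).append((ecosystem, name, version))
    return groups


# Illustrative input: version numbers here are made up for the example.
found = [
    ("pypi", "beautifulsoup4", "4.12.0"),
    ("deb", "python3-bs4", "4.12.0"),
    ("gem", "json-parser", "1.0"),
    ("pypi", "json-parser", "1.0"),
]
groups = dedupe(found)
```

Keying the table on the (ecosystem, name) pair rather than the bare name is the design choice that keeps same-named packages from different ecosystems apart unless someone has explicitly recorded that they are the same software.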

Skills

  • Python
  • ability to read JSON/XML data formats (if you don't know them yet, you can learn before we start)
  • some understanding of software packaging for at least one language/Linux distro would be helpful
  • understanding of SBOMs or license management would be helpful

Difficulty level

  • medium/hard.
  • The actual code needed here will often be simple, but you will need to grok packaging, learn how it works across a variety of systems, and make a lot of judgement calls on how to use data, what to store, etc.
  • There are a lot of "unsolved" packaging issues and cases where groups disagree about optimal solutions.

Project Length

  • 350 hours (e.g. full-time for 10 weeks or part-time for longer)

GSoC Participants Only

This issue is a potential project idea for GSoC 2023, and is reserved for completion by a selected GSoC contributor. Please do not work on it outside of that program. If you'd like to apply to do it through GSoC, please start by reading #2230.