@art1f1c3R
Member

@art1f1c3R art1f1c3R commented Dec 13, 2024

This PR adds a new heuristic to the PyPI malware analyzer that identifies anomalistic version numbers. A version number is classified as anomalistic when it is unrealistically high for a single release. This heuristic depends on the single-release heuristic, and currently implements thresholds for determining suspicious version numbers on the epoch, major, and minor version components. Version numbers must adhere to the PyPI standard (PEP 440, as implemented by the packaging module); otherwise they will not be analysed.

This heuristic attempts to identify the versioning pattern to reduce false positives. Calendar versioning is defined as a versioning pattern where the major is in the form of YYYY or YY, the minor is in the form of MM or M, and the micro is in the form of DD or D, with no other release components (i.e. in the form YYYY.MM.DD only). To be classified as calendar versioning, these values must match the upload time to within a 48-hour window.

Calendar-semantic versioning is defined as a versioning pattern where the major is in the form of YYYY or YY, but all other components are not detected as calendar versioning. All other versioning patterns are classed as semantic versioning.
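
The three patterns described above can be sketched as follows. This is only an illustration, not the code in this PR: the function names are hypothetical, the release is passed as a tuple of integers (such as the `release` attribute of `packaging.version.Version`), two-digit years are assumed to fall in the 2000s, and calendar-semantic versioning is assumed to require the year to match the upload year.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def _as_year(major: int) -> Optional[int]:
    """Interpret a major component as YYYY or YY (assumption: YY means 20YY)."""
    if 1000 <= major <= 9999:
        return major
    if 0 <= major <= 99:
        return 2000 + major
    return None


def classify_versioning(
    release: Tuple[int, ...],
    upload_time: datetime,
    window: timedelta = timedelta(hours=48),
) -> str:
    """Return 'calendar', 'calendar-semantic', or 'semantic' (hypothetical labels)."""
    if not release:
        return "semantic"
    year = _as_year(release[0])
    if year is None:
        return "semantic"

    # Calendar versioning: exactly YYYY.MM.DD, close to the upload time.
    if len(release) == 3:
        try:
            version_date = datetime(
                year, release[1], release[2], tzinfo=upload_time.tzinfo
            )
        except ValueError:
            version_date = None  # minor/micro do not form a valid month/day
        if version_date is not None and abs(upload_time - version_date) <= window:
            return "calendar"

    # Calendar-semantic: the major looks like the upload year, the rest does not
    # form a matching date.
    if year == upload_time.year:
        return "calendar-semantic"
    return "semantic"
```

For example, `classify_versioning((17, 7), datetime(2017, 7, 13))` falls into the calendar-semantic branch, because the major matches the upload year but there is no day component.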

Outstanding tasks for this PR:

  • run Macaron with this new feature to determine suitable default thresholds
  • make the thresholds configurable through defaults.ini
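
One possible shape for the second task is sketched below, using the standard-library `configparser`. The section name, key names, and values are placeholders chosen for illustration; they are not the names or defaults used in Macaron's `defaults.ini`.

```python
import configparser
from typing import Dict

# Hypothetical fragment of a defaults.ini file; names and values are placeholders.
SAMPLE_INI = """
[heuristic.pypi]
epoch_threshold = 3
major_threshold = 20
"""


def load_thresholds(ini_text: str) -> Dict[str, int]:
    """Read the thresholds, falling back to built-in values when a key is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = "heuristic.pypi"  # hypothetical section name
    return {
        "epoch_threshold": parser.getint(section, "epoch_threshold", fallback=5),
        "major_threshold": parser.getint(section, "major_threshold", fallback=20),
    }


thresholds = load_thresholds(SAMPLE_INI)
```

Using `fallback` keeps the heuristic usable when the section or a key is missing from the user's configuration.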

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Dec 13, 2024
@art1f1c3R art1f1c3R self-assigned this Dec 16, 2024
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/anomalistic-version branch from 303f33f to 69b6bd8 Compare December 18, 2024 05:44
@art1f1c3R
Member Author

art1f1c3R commented Dec 20, 2024

5537 existing PyPI packages with a single release were analyzed for their epoch, major, minor, and micro version numbers. All of these packages had an epoch of 0 (the implied value, as none of them specified an epoch explicitly). The data for major, minor, and micro version numbers included too many anomalies to plot nicely. Based on the data, the following changes will be made:

  • Minor and micro versions are not worth analyzing, as they can contain many obscure values that are intended to be interpreted in a different way (e.g. aggregateGithubCommits with version 3.20200806, where the minor is actually a date, or annize version 5.0.6379 published May 31 2021). They are also a less likely attack vector, because raising them does not guarantee that a package is prioritized over another.
  • The epoch threshold will be lowered to 3. An epoch is generally used so that, when a project changes versioning systems, versions under the new system are prioritized over the old ones. It will still be included as it is a realistic attack vector, since most packages do not use an epoch at all.
  • The calendar versioning will be updated to account for different orderings (e.g. YYYY.MM.DD or DD.MM.YYYY), with a motivating example of abbrfix with version 26.2.2024 published on February 27 2024.
  • The threshold on the number of days of error allowed for calendar versioning will be increased to 4. A motivating example is aiocmdline version 2024.1.25 released on January 28 2024.
  • The major version number may be checked to see whether it represents a date by itself (e.g. andronnlib version 160122 published Jan 18 2022).
  • The restriction for calendar versioning to only have 3 release components will be removed. A motivating example is aigc with version 2023.2.15.14.18.59 published February 15 2023. Checking the implied major version from this (14) is unnecessary as the date would be prioritized over this.
  • The major version threshold of 20 will be kept. Many version numbers around this range use calendar-semantic or calendar versioning for the year (e.g. adwords-client version 17.7 published July 13 2017), so any major version number that is above this threshold and does not correspond to the calendar year is sufficiently large to be considered suspicious.

There were some exceptions to these rules (e.g. almavik version 362 published June 24 2023, or abqpy2016 to abqpy2025, all with versions 20xx.7.9 and all published Dec 8 2024), but not enough to warrant relaxing the threshold.
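
Taken together, the revised rules can be sketched roughly as below. This is a minimal illustration rather than the merged implementation: all names are hypothetical, the comparisons against the thresholds are assumed to be strict (`>`), two-digit years are assumed to fall in the 2000s, the set of date layouts tried for a single-number major is an assumption, and the caller is assumed to have already confirmed that the package has a single release.

```python
from datetime import date
from itertools import permutations
from typing import Optional, Tuple

EPOCH_THRESHOLD = 3      # from the analysis above
MAJOR_THRESHOLD = 20     # from the analysis above
MAX_DAY_DIFFERENCE = 4   # allowed gap between version date and upload date


def _make_date(year: int, month: int, day: int) -> Optional[date]:
    """Build a date, treating a two-digit year as 20YY (assumption)."""
    if 0 <= year <= 99:
        year += 2000
    try:
        return date(year, month, day)
    except (ValueError, OverflowError):
        return None


def _near_upload(candidate: Optional[date], upload: date) -> bool:
    return candidate is not None and abs((candidate - upload).days) <= MAX_DAY_DIFFERENCE


def is_calendar_release(release: Tuple[int, ...], upload: date) -> bool:
    """True if the first three components form a date near the upload, in any order."""
    if len(release) < 3:
        return False
    return any(
        _near_upload(_make_date(y, m, d), upload)
        for y, m, d in permutations(release[:3])
    )


def major_is_date(major: int, upload: date) -> bool:
    """True if the major alone encodes a date near the upload (assumed layouts)."""
    digits = str(major)
    if len(digits) == 6:
        a, b, c = int(digits[:2]), int(digits[2:4]), int(digits[4:])
        candidates = [(c, b, a), (a, b, c)]  # DDMMYY, YYMMDD
    elif len(digits) == 8:
        candidates = [
            (int(digits[:4]), int(digits[4:6]), int(digits[6:])),  # YYYYMMDD
            (int(digits[4:]), int(digits[2:4]), int(digits[:2])),  # DDMMYYYY
        ]
    else:
        return False
    return any(_near_upload(_make_date(y, m, d), upload) for y, m, d in candidates)


def is_anomalistic(epoch: int, release: Tuple[int, ...], upload: date) -> bool:
    """Flag a single-release package whose epoch or major is suspiciously high."""
    if epoch > EPOCH_THRESHOLD:
        return True
    if not release:
        return False
    major = release[0]
    if is_calendar_release(release, upload) or major_is_date(major, upload):
        return False
    if major in (upload.year, upload.year % 100):
        return False  # major corresponds to the calendar year
    return major > MAJOR_THRESHOLD
```

Under these assumptions, the motivating examples (abbrfix 26.2.2024, aiocmdline 2024.1.25, andronnlib 160122, aigc 2023.2.15.14.18.59, adwords-client 17.7) are not flagged, while the noted exception almavik 362 still is.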

@art1f1c3R art1f1c3R force-pushed the art1f1c3R/anomalistic-version branch from ba54226 to 73b5eeb Compare December 20, 2024 06:41
@art1f1c3R
Member Author

I am going to request review on this, as I don't believe an integration test is appropriate for this feature. There is no guarantee that any given package will always have only one release, so if the package I use for the test releases a new version, the test no longer makes sense.

@art1f1c3R art1f1c3R marked this pull request as ready for review January 6, 2025 01:56
@art1f1c3R art1f1c3R requested a review from behnazh-w as a code owner January 6, 2025 01:56
@art1f1c3R art1f1c3R requested a review from tromai January 6, 2025 01:56
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/anomalistic-version branch from 69db6cd to 633ff50 Compare January 7, 2025 05:41
@behnazh-w behnazh-w requested a review from benmss January 8, 2025 09:57
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/anomalistic-version branch from dd2dd17 to 434e322 Compare January 9, 2025 23:59
Contributor

@tromai tromai left a comment

LGTM. Thanks!

@art1f1c3R art1f1c3R merged commit 65f9325 into staging Jan 14, 2025
10 checks passed
@art1f1c3R art1f1c3R deleted the art1f1c3R/anomalistic-version branch June 18, 2025 03:56