
Investigate reasons why table isn't fully sorted by url_surtkey #12

@sebastian-nagel

Description


The table is written from the cdx-*.gz files, which are sorted by the SURT key. However, it looks like the Parquet files are no longer fully sorted:

$> parquet-tools cat --json \
         s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-22/subset=warc/part-00243-f48ad7b9-218e-42ff-aff7-02e2fdcb7739.c000.gz.parquet \
    | jq -r .url_surtkey
...
org,commoncrawl)/
org,catho85,jubile2017)/evenements/exposition-jean-michel-solves
org,commoncrawl)/2011/12/mapreduce-for-the-masses
org,catho85,jubile2017)/evenements/exposition-la-fleur-de-lage
org,commoncrawl)/2012/03/data-2-0-summit
...

The output looks like the result of zipping two sorted streams into the final Parquet file. This is observable across various crawls, i.e. it is not bound to a specific Spark / Parquet version.
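To illustrate the "zipped streams" hypothesis, here is a minimal pure-Python sketch (using the SURT keys from the sample output above) showing that alternating records from two individually sorted streams reproduces the observed interleaved pattern, while a proper k-way merge (`heapq.merge`) would keep the global order:

```python
# Two individually sorted streams of SURT keys, taken from the sample above.
import heapq

stream_a = [
    "org,commoncrawl)/",
    "org,commoncrawl)/2011/12/mapreduce-for-the-masses",
    "org,commoncrawl)/2012/03/data-2-0-summit",
]
stream_b = [
    "org,catho85,jubile2017)/evenements/exposition-jean-michel-solves",
    "org,catho85,jubile2017)/evenements/exposition-la-fleur-de-lage",
]

def zip_streams(a, b):
    """Alternate records from two streams round-robin, with no merge step."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if i < len(a):
            out.append(a[i]); i += 1
        if j < len(b):
            out.append(b[j]); j += 1
    return out

zipped = zip_streams(stream_a, stream_b)
print(zipped == sorted(zipped))   # False: matches the broken sample output

merged = list(heapq.merge(stream_a, stream_b))
print(merged == sorted(merged))   # True: a real merge preserves sort order
```

The round-robin output is exactly the sequence seen in the Parquet file, which supports the idea that two (or more) sorted inputs were concatenated record-wise rather than merge-sorted.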

Ensuring the sort order by SURT key will also ensure that derived columns (domain name, and partially the host name) are properly sorted. This should improve compression and speed up look-ups in the Parquet files thanks to more precise min/max ranges per row group.
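The min/max effect can be simulated in pure Python (hypothetical keys and row-group size, not real Parquet statistics): with a sorted column, a point look-up only touches the one row group whose min/max range contains the key, while with unsorted data nearly every group's range spans the key and must be scanned.

```python
# Sketch: per-row-group min/max pruning, as a Parquet reader would apply it.
import random

def row_group_ranges(values, group_size):
    """Split a column into row groups and record each group's (min, max)."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]

def groups_scanned(ranges, needle):
    """Count row groups whose [min, max] range could contain the needle."""
    return sum(1 for lo, hi in ranges if lo <= needle <= hi)

# 1000 hypothetical SURT keys, 10 row groups of 100 rows each.
keys = [f"org,example)/page-{i:04d}" for i in range(1000)]
random.seed(42)
shuffled = keys[:]
random.shuffle(shuffled)

sorted_ranges = row_group_ranges(sorted(keys), group_size=100)
mixed_ranges = row_group_ranges(shuffled, group_size=100)

needle = "org,example)/page-0500"
print(groups_scanned(sorted_ranges, needle))  # sorted: a single group
print(groups_scanned(mixed_ranges, needle))   # shuffled: likely all groups
```

The same reasoning explains the compression benefit: sorted keys share long common prefixes within a page, which dictionary and delta encodings exploit.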

The same issue is described in a post on Stack Overflow. Initial experiments show that the order is preserved if the data is not partitioned (into crawl and subset).
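A plausible fix is to re-establish the order within each output partition after partitioning (in Spark, roughly a repartition on the partition columns followed by `sortWithinPartitions`). Here is a pure-Python sketch of that idea with hypothetical records; the field names mirror the table schema, but the rows are invented for illustration:

```python
# Sketch: group rows by the partition columns (crawl, subset) first, then sort
# each partition's rows by url_surtkey before writing them to a file.
from collections import defaultdict

rows = [
    {"crawl": "CC-MAIN-2018-22", "subset": "warc",
     "url_surtkey": "org,commoncrawl)/"},
    {"crawl": "CC-MAIN-2018-22", "subset": "warc",
     "url_surtkey": "org,catho85,jubile2017)/evenements/exposition-jean-michel-solves"},
    {"crawl": "CC-MAIN-2018-22", "subset": "crawldiagnostics",
     "url_surtkey": "org,commoncrawl)/2011/12/mapreduce-for-the-masses"},
]

# Step 1: partition (what writing with partitionBy does to the data).
partitions = defaultdict(list)
for row in rows:
    partitions[(row["crawl"], row["subset"])].append(row)

# Step 2: sort within each partition, so every output file is fully sorted.
for part in partitions.values():
    part.sort(key=lambda r: r["url_surtkey"])

for key, part in sorted(partitions.items()):
    print(key, [r["url_surtkey"] for r in part])
```

The key point is that the sort must happen after (or be aware of) the partitioning step; a global sort before partitioning can be destroyed when rows are redistributed into the crawl/subset partitions.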

Once fixed, and if better compression or faster look-ups are measurable, it might be worth regenerating the index for older crawls.
