The table is written from the cdx-*.gz files, which are sorted by the SURT key. However, it looks like the Parquet files are no longer fully sorted:
$> parquet-tools cat --json \
s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-22/subset=warc/part-00243-f48ad7b9-218e-42ff-aff7-02e2fdcb7739.c000.gz.parquet \
| jq -r .url_surtkey
...
org,commoncrawl)/
org,catho85,jubile2017)/evenements/exposition-jean-michel-solves
org,commoncrawl)/2011/12/mapreduce-for-the-masses
org,catho85,jubile2017)/evenements/exposition-la-fleur-de-lage
org,commoncrawl)/2012/03/data-2-0-summit
...
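A quick way to spot the problem is to check whether the extracted `url_surtkey` values are monotonically non-decreasing. A minimal sketch in plain Python, using the sample keys from the output above — note that while the combined stream is out of order, the even- and odd-indexed subsequences are each sorted, which matches the "two zipped streams" pattern described below:

```python
def is_sorted(keys):
    """True if the sequence of SURT keys is non-decreasing."""
    return all(a <= b for a, b in zip(keys, keys[1:]))

# Sample url_surtkey values from the parquet-tools output above.
keys = [
    "org,commoncrawl)/",
    "org,catho85,jubile2017)/evenements/exposition-jean-michel-solves",
    "org,commoncrawl)/2011/12/mapreduce-for-the-masses",
    "org,catho85,jubile2017)/evenements/exposition-la-fleur-de-lage",
    "org,commoncrawl)/2012/03/data-2-0-summit",
]

print(is_sorted(keys))        # False: the combined stream is not sorted
print(is_sorted(keys[0::2]))  # True: the even-indexed subsequence is sorted
print(is_sorted(keys[1::2]))  # True: so is the odd-indexed one
```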
The output looks like the result of zipping two sorted streams into the final Parquet file. This is observable across various crawls, i.e. it is not bound to a specific Spark or Parquet version.
Ensuring the sort order by SURT key would also ensure that other columns (the domain name, and partially the host name) are properly sorted. Hence, it should improve compression and speed up look-ups in the Parquet files thanks to more precise min/max ranges.
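To illustrate why the sort order matters for look-ups: Parquet keeps min/max statistics per row group, and a reader can skip any row group whose [min, max] range cannot contain the searched key. A small plain-Python sketch (with hypothetical 4-row "row groups") comparing the statistics of fully sorted data against two interleaved sorted streams:

```python
def row_group_stats(keys, group_size):
    """Simulated per-row-group min/max statistics."""
    groups = (keys[i:i + group_size] for i in range(0, len(keys), group_size))
    return [(min(g), max(g)) for g in groups]

sorted_keys = ["a", "b", "c", "d", "e", "f", "g", "h"]
# Two sorted halves zipped together, as observed in the issue:
interleaved = ["a", "e", "b", "f", "c", "g", "d", "h"]

print(row_group_stats(sorted_keys, 4))  # [('a', 'd'), ('e', 'h')]: disjoint ranges
print(row_group_stats(interleaved, 4))  # [('a', 'f'), ('c', 'h')]: overlapping ranges
```

With disjoint ranges, a point look-up for, say, "f" touches only one row group; with overlapping ranges, both row groups must be read, and the wider value ranges also tend to compress less well.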
The same issue is described in a post on Stack Overflow. Initial experiments show that without partitioning the data (into crawl and subset) the order is preserved.
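If the shuffle introduced by the partitioned write is what breaks the order, one possible fix is to repartition by the partition columns first and then sort within each partition, so no further shuffle is needed at write time. A rough PySpark sketch (column names taken from the cc-index schema; the output path is a placeholder, and this is untested against the actual build job):

```python
# Sketch only: repartition by the output partition columns, then sort
# each partition by the SURT key before the partitioned Parquet write.
df.repartition("crawl", "subset") \
  .sortWithinPartitions("url_surtkey") \
  .write \
  .partitionBy("crawl", "subset") \
  .parquet("s3a://.../cc-index/table/cc-main/warc/")
```

The idea is that `partitionBy` then only splits rows that are already colocated and sorted, instead of merging multiple sorted streams into one file.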
If this is fixed and better compression or faster look-ups are measurable, it might be worth regenerating the index for older crawls.