The table is written from the cdx-*.gz files, which are sorted by the SURT key. However, it looks like the Parquet files are no longer fully sorted:
$> parquet-tools cat --json \
s3a://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-22/subset=warc/part-00243-f48ad7b9-218e-42ff-aff7-02e2fdcb7739.c000.gz.parquet \
| jq -r .url_surtkey
...
org,commoncrawl)/
org,catho85,jubile2017)/evenements/exposition-jean-michel-solves
org,commoncrawl)/2011/12/mapreduce-for-the-masses
org,catho85,jubile2017)/evenements/exposition-la-fleur-de-lage
org,commoncrawl)/2012/03/data-2-0-summit
...
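A quick way to spot the problem is to check whether the extracted `url_surtkey` values are monotonically non-decreasing. A minimal sketch in plain Python, using the sample keys from the output above — note that while the combined stream is out of order, the even- and odd-indexed subsequences are each sorted, which matches the "two zipped streams" pattern described below:

```python
def is_sorted(keys):
    """True if the sequence of SURT keys is non-decreasing."""
    return all(a <= b for a, b in zip(keys, keys[1:]))

# Sample url_surtkey values from the parquet-tools output above.
keys = [
    "org,commoncrawl)/",
    "org,catho85,jubile2017)/evenements/exposition-jean-michel-solves",
    "org,commoncrawl)/2011/12/mapreduce-for-the-masses",
    "org,catho85,jubile2017)/evenements/exposition-la-fleur-de-lage",
    "org,commoncrawl)/2012/03/data-2-0-summit",
]

print(is_sorted(keys))        # False: the combined stream is not sorted
print(is_sorted(keys[0::2]))  # True: the even-indexed subsequence is sorted
print(is_sorted(keys[1::2]))  # True: so is the odd-indexed one
```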
The output looks like the result of zipping two sorted streams into the final Parquet file. This is observable across various crawls, i.e. it is not bound to a specific Spark or Parquet version.
Ensuring the sort order by SURT key would also ensure that other columns (the domain name, and partially the host name) are properly sorted. Hence, it should improve compression and speed up look-ups in the Parquet files thanks to more precise min/max ranges.
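To illustrate why the sort order matters for look-ups: Parquet keeps min/max statistics per row group, and a reader can skip any row group whose [min, max] range cannot contain the searched key. A small plain-Python sketch (with hypothetical 4-row "row groups") comparing the statistics of fully sorted data against two interleaved sorted streams:

```python
def row_group_stats(keys, group_size):
    """Simulated per-row-group min/max statistics."""
    groups = (keys[i:i + group_size] for i in range(0, len(keys), group_size))
    return [(min(g), max(g)) for g in groups]

sorted_keys = ["a", "b", "c", "d", "e", "f", "g", "h"]
# Two sorted halves zipped together, as observed in the issue:
interleaved = ["a", "e", "b", "f", "c", "g", "d", "h"]

print(row_group_stats(sorted_keys, 4))  # [('a', 'd'), ('e', 'h')]: disjoint ranges
print(row_group_stats(interleaved, 4))  # [('a', 'f'), ('c', 'h')]: overlapping ranges
```

With disjoint ranges, a point look-up for, say, "f" touches only one row group; with overlapping ranges, both row groups must be read, and the wider value ranges also tend to compress less well.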
The same issue is described in a post on Stack Overflow. Initial experiments show that without partitioning the data (into crawl and subset) the order is preserved.
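If the shuffle introduced by the partitioned write is what breaks the order, one possible fix is to repartition by the partition columns first and then sort within each partition, so no further shuffle is needed at write time. A rough PySpark sketch (column names taken from the cc-index schema; the output path is a placeholder, and this is untested against the actual build job):

```python
# Sketch only: repartition by the output partition columns, then sort
# each partition by the SURT key before the partitioned Parquet write.
df.repartition("crawl", "subset") \
  .sortWithinPartitions("url_surtkey") \
  .write \
  .partitionBy("crawl", "subset") \
  .parquet("s3a://.../cc-index/table/cc-main/warc/")
```

The idea is that `partitionBy` then only splits rows that are already colocated and sorted, instead of merging multiple sorted streams into one file.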
If this is fixed and better compression or faster look-ups are measurable, it might be worth regenerating the index for older crawls.