@damian0815 damian0815 commented Oct 29, 2025

Add a tool that checks whether the row group .min / .max statistics for a particular column (e.g. url_surtkey) are strictly increasing within a single Parquet file, or within each file of a collection of Parquet files. See the README for more information and limitations. In particular, this does not check whether the rows are sorted, only that the row group min/max values within a single Parquet file are strictly increasing. The tool is intended to help check for #12.

  • Initial implementation
  • Unit tests
  • GitHub workflow
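
A minimal sketch of the kind of check described above, not the code from this PR. The function names are hypothetical, and it assumes that "strictly increasing" means each row group's min must be greater than the previous row group's max. The metadata reader uses pyarrow, imported lazily so that the pure comparison function has no dependencies.

```python
from typing import Iterable, List, Optional, Tuple

MinMax = Tuple[str, str]


def row_group_min_max(parquet_path: str, column: str) -> List[MinMax]:
    """Collect the (min, max) statistics of `column` for each row group, in file order."""
    # Third-party dependency, imported here so the check below works without it.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(parquet_path).metadata
    ranges: List[MinMax] = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            if stats is None or not stats.has_min_max:
                raise ValueError(
                    f"no min/max statistics for {column!r} in row group {rg_index}"
                )
            ranges.append((stats.min, stats.max))
    return ranges


def first_out_of_order(ranges: Iterable[MinMax]) -> Optional[int]:
    """Return the index of the first row group whose min is not greater than
    the previous row group's max, or None if no such row group exists.

    Assumption: this is what "strictly increasing" means; the PR may define it differently.
    """
    previous_max = None
    for index, (low, high) in enumerate(ranges):
        if previous_max is not None and low <= previous_max:
            return index
        previous_max = high
    return None
```

Usage would look like `first_out_of_order(row_group_min_max(path, "url_surtkey"))`, where `path` is a local Parquet file. A result of `None` only says the row group ranges do not overlap; it says nothing about row order inside each row group.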

@damian0815 damian0815 force-pushed the damian/feat/is_table_sorted branch from fec818b to 2914f42 Compare October 29, 2025 14:31
@damian0815
Author

TBD: is it expected that URLs are sorted across the parquet files, i.e. should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

@damian0815 damian0815 marked this pull request as ready for review October 29, 2025 14:58
@damian0815 damian0815 requested a review from wumpus October 30, 2025 14:32
Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks, @damian0815!

Would you mind adding some context to the description of the PR? Namely, that it checks for #12, plus a short description of how the tool works. The latter could also go in the README or the command-line help of the tool.

Since the tool only checks, from the row group metadata, whether the min/max values of a single column overlap, the name is_table_sorted.py is not quite precise and may raise expectations the tool cannot meet. Maybe the name and the corresponding function names can be adjusted?

I've successfully tested the tool on data from CC-MAIN-2022-05 (#12 not yet fixed) and CC-MAIN-2022-21 (#12 fixed):

  • It failed to detect that the column url_surtkey is not properly sorted on some input files of the first crawl. This is guaranteed to happen when a file has only a single row group, which is not unlikely for the robots.txt partition, e.g. this file.
  • But when run over more or all of the files, the check does detect the problem.
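
To illustrate why a single row group defeats the check, here is a small self-contained sketch. It does not use the tool from this PR: the helper names and the sample keys are made up, and row groups are simulated by slicing a list.

```python
from typing import List, Sequence, Tuple


def min_max_per_group(rows: Sequence[str], group_size: int) -> List[Tuple[str, str]]:
    """Simulate row group statistics: split rows into groups and take min/max of each."""
    groups = [rows[i:i + group_size] for i in range(0, len(rows), group_size)]
    return [(min(group), max(group)) for group in groups]


def ranges_increase(ranges: Sequence[Tuple[str, str]]) -> bool:
    """True if every group's min is greater than the previous group's max."""
    return all(prev[1] < cur[0] for prev, cur in zip(ranges, ranges[1:]))


# Hypothetical keys, unsorted within each pair.
unsorted_rows = ["org,b", "org,a", "org,d", "org,c"]

# One row group: there are no adjacent pairs to compare, so the check passes vacuously.
single_group_ok = ranges_increase(min_max_per_group(unsorted_rows, group_size=4))

# Two row groups: ranges are ("org,a", "org,b") and ("org,c", "org,d"), so the
# check still passes although the rows inside each group are unsorted.
two_groups_ok = ranges_increase(min_max_per_group(unsorted_rows, group_size=2))

# Disorder across row groups is what the check can see.
swapped_rows = ["org,c", "org,d", "org,a", "org,b"]
swapped_ok = ranges_increase(min_max_per_group(swapped_rows, group_size=2))
```

Here `single_group_ok` and `two_groups_ok` are both `True` and only `swapped_ok` is `False`, which matches the observation that the check needs several row groups (or several files) before it can flag anything.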

    with urlopen(path_or_url) as f:
        content = f.read()
else:
    with open(path_or_url, "r") as f:
Contributor

Should be 'rb'. Otherwise:

  • f.read() fails to read binary content (if gzipped), or
  • content.decode("utf-8") fails on decoding strings (if a plain text file was read)

Generally: it's nice that various kinds of input are supported (s3, http, https, files), as well as gzipped input. However, this requires testing the full 4x2 matrix (four sources, each gzipped or not). I'd be lazier from the start, that is, support fewer input formats.
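
A minimal sketch of the binary-mode approach for the local-file case, assuming the file holds UTF-8 text that may or may not be gzipped. The function name `read_text` and the magic-byte detection are illustrative choices, not taken from the PR.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def read_text(path: str, encoding: str = "utf-8") -> str:
    """Read a local file as text, transparently handling gzip compression."""
    # Always open in binary mode: f.read() then returns bytes for both
    # gzipped and plain files, so the decode step below works in either case.
    with open(path, "rb") as f:
        content = f.read()
    if content.startswith(GZIP_MAGIC):
        content = gzip.decompress(content)
    return content.decode(encoding)
```

Detecting gzip by content rather than by file extension keeps the number of code paths small, which also shrinks the test matrix.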

Author

I have removed the HTTP support; now we only support local files and S3 URLs, gzipped or not.

@damian0815 damian0815 changed the title Check if tables are sorted Add a tool to check if row groups .min / .max are strictly increasing within a parquet file Oct 31, 2025
@damian0815
Author

I have updated the title and description to better correspond with what the tool does.

@damian0815
Author

TBD: is it expected that URLs are sorted across the parquet files, i.e. should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

Determined: this is not intended, i.e. part-00001.max may be out of order w.r.t. part-00002.min.

@jenenglish

@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct?
