-
Notifications
You must be signed in to change notification settings - Fork 14
Add a tool to check if row groups .min / .max are strictly increasing within a parquet file
#40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
fec818b to
2914f42
Compare
|
TBD: is it expected that urls are sorted in between the parquet files, ie should |
sebastian-nagel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @damian0815!
Would you mind adding some context to the description of the PR? Namely checking for #12 and a short description how the tool works. The latter could be also in the README or the command-line help of the tool.
Since the tools checks only the row group metadata whether the min/max values of a single column overlap, its name is_table_sorted.py is not quite precise resp. may raise undeliverable expectations. Maybe the name and corresponding function names can be adjusted?
I've successfully tested the tool on data from CC-MAIN-2022-05 (#12 not yet fixed) and CC-MAIN-2022-21 (#12 fixed):
- it failed to detect that the column url_surtkey is not properly sorted on some input files of the first crawl. Definitely, if there is only a single row group. That's not unlikely for the robots.txt partition, e.g. this file.
- but if run over more or all files the test works.
src/util/is_table_sorted.py
Outdated
| with urlopen(path_or_url) as f: | ||
| content = f.read() | ||
| else: | ||
| with open(path_or_url, "r") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be 'rb'. Otherwise:
f.read()fails to read binary content (if gzipped), orcontent.decode("utf-8")fails on decoding strings (if a plan text file was read)
Generally: it's nice that various kinds of inputs are supported (s3, http, https, files) and also gzipped input. However, this requires that the 4x2 matrix is tested. I'd be more lazy from the beginning on, that is support less input formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed the HTTP support; now we just support local files and s3 urls, gzip or otherwise.
…hecking Signed-off-by: Damian Stewart <[email protected]>
.min / .max are strictly increasing within a parquet file
|
I have updated the title and description to better correspond with what the tool does. |
Determined: this is not intended, ie part-00001.max may be out of order w.r.t part-00002.min |
|
@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct? |
Add a tool to check if row groups
.min/.maxfor a particular column (egurl_surtkey) are strictly increasing within a particular parquet file or collection of parquet files; see README for more information and limitations - in particular, this does not check of the rows are sorted, just that the row groups min/max within a single parquet file are strictly increasing. The tool is intended to help check for #12.