-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix(parquet): write single file if option is set #17009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
1c05032 to
f1d58c5
Compare
|
I restarted the checks |
|
Sorry for the delay -- this PR seems to have quite. few conflicts. |
00148b2 to
651cfb8
Compare
No worries! My bad, I forgot to sync before rebasing the branch onto main again. Thx! |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way of implicitly encoding single fileness via adding an extension to the path does seem quite hacky; not sure if this is the correct approach to follow 🤔
| let path = if file_type.get_ext() != DEFAULT_PARQUET_EXTENSION | ||
| && options.single_file_output | ||
| { | ||
| let mut path = path.to_owned(); | ||
| path.push_str(SINGLE_FILE_EXTENSION); | ||
| path | ||
| } else { | ||
| path.to_owned() | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part confuses me; If I'm not wrong file_type.get_ext() != DEFAULT_PARQUET_EXTENSION will always be true, because:
datafusion/datafusion/datasource-parquet/src/file_format.rs
Lines 157 to 162 in 602475f
| impl GetExt for ParquetFormatFactory { | |
| fn get_ext(&self) -> String { | |
| // Removes the dot, i.e. ".parquet" -> "parquet" | |
| DEFAULT_PARQUET_EXTENSION[1..].to_string() | |
| } | |
| } |
| let test_path = std::path::Path::new(&path); | ||
| assert!( | ||
| test_path.is_dir(), | ||
| "No extension and default DataFrameWriteOptons should have yielded a dir." | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would need to also check there are indeed multiple parquet files
| let test_path = std::path::Path::new(&path); | ||
| assert!( | ||
| test_path.is_file(), | ||
| "No extension and DataFrameWriteOptons::with_single_file_output(true) should have yielded a single file." | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here we would need to check that only one file is written, not just a file is written
|
@alamb @Jefffrey Sorry for not responding earlier. I will have some head room this week. And will address the comments. @Jefffrey The implicit extension is anything but ideal. Looking into the issue, you will find a discussion around approaches. We probably would have to change the signature to have the more ideal scenario. I will update my findings this evening (CET). |
@Jefffrey Agreed. Where would you suggest to implement it to have a cleaner version? Maybe setting an option in |
I'll try find some time to look into this |
DataFrameWriteOptions::with_single_file_outputproduces a directory #13323Which issue does this PR close?
DataFrameWriteOptions::with_single_file_outputproduces a directory #13323Rationale for this change
DF.write_parquetwrites multiple files / one directory even ifoptions.single_file_outputis set.What changes are included in this PR?
Introduce an internal
.singleextension.Are these changes tested?
Yes, tests are part of this PR.
Are there any user-facing changes?
Not in this implementation. There might be, if we decide to move to an
FileSinkConfigbased solution.Quoting: #13323 (comment)