@dhegberg
Contributor

Which issue does this PR close?

Closes #9684.

Rationale for this change

DataFrame's write_parquet() was found to incorrectly treat paths without an extension as single-file output.

This change updates start_demuxer_task to respect the suggested behaviour:

    tmp/dataset/ -> is a folder since it ends in /
    tmp/dataset -> is still a folder since it does not end in / but has no valid file extension
    tmp/file.parquet -> is a file since it does not end in / and has a valid file extension .parquet
    tmp/file.parquet/ -> is a folder since it ends in /
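The four cases above can be sketched as a small standalone predicate. This is only an illustration of the rule, not the code in start_demuxer_task: the function name is_single_file_output is hypothetical, and it assumes that any non-empty extension counts as "valid", which the description does not spell out.

```rust
use std::path::Path;

/// Hypothetical helper: returns true when `path` should be written as a
/// single file, false when it should be treated as a folder.
/// Assumption: any non-empty extension counts as a "valid" extension.
fn is_single_file_output(path: &str) -> bool {
    // A trailing slash always means "folder", even if the last
    // component looks like a file name (e.g. "tmp/file.parquet/").
    if path.ends_with('/') {
        return false;
    }
    // Otherwise the path is a file only if its last component
    // carries a non-empty extension.
    Path::new(path)
        .extension()
        .map_or(false, |ext| !ext.is_empty())
}
```

The trailing-slash check has to come first because Path ignores a trailing separator when it looks up the extension.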

What changes are included in this PR?

  • Add file_extension() to ListingTableUrl, returning an optional extension
  • Update start_demuxer_task() to require the presence of an extension from the ListingTableUrl to set single_file_output to true
  • Rename file_extension to default_extension to indicate that it is ignored when single_file_output is triggered.
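As a rough picture of what an extension lookup that returns an Option might look like, here is a minimal sketch over a plain path string. It is an assumption-laden stand-in: the real method lives on ListingTableUrl and works on a parsed URL, whereas this free function (hypothetically named the same) only inspects the last '/'-separated segment.

```rust
/// Hypothetical stand-in for an extension lookup on a URL path.
/// Returns the text after the final '.' in the last path segment,
/// or None when there is no such non-empty suffix.
fn file_extension(path: &str) -> Option<&str> {
    // Only the last segment can carry an extension; a trailing '/'
    // leaves an empty last segment and therefore yields None.
    let last_segment = path.rsplit('/').next()?;
    // Split at the final dot; a segment without a dot has no extension.
    let (_stem, ext) = last_segment.rsplit_once('.')?;
    if ext.is_empty() {
        None
    } else {
        Some(ext)
    }
}
```

With a helper shaped like this, a caller could set single_file_output only when the result is Some(..), and fall back to default_extension for generated file names otherwise.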

Are these changes tested?

  • Unit tests added for file_extension()
  • Unit tests added for DataFrame.write_parquet() for paths with and without extensions.
  • No direct testing for start_demuxer_task since there was no direct testing originally. I can revise and test this directly if preferred.

Tested with cargo test -- --test-threads=1

Are there any user-facing changes?

  • Yes, the file output write behaviour is changing.

@github-actions github-actions bot added the core Core DataFusion crate label Oct 23, 2024
@alamb
Contributor

alamb commented Oct 30, 2024

Thanks @dhegberg -- I plan to review this later today

@alamb alamb left a comment

Thank you @dhegberg -- this is a really nice PR -- I think the code and tests are well written.

Thank you 🙏

cc @progval

I also tried it locally with datafusion-cli and it works as expected 👌

> copy (values (1), (2)) to '/tmp/foo' STORED AS parquet;
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.030 seconds.

>
\q
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/datafusion-cli$ ls -ltr /tmp/foo
total 8
-rw-r--r--@ 1 andrewlamb  wheel   342B Nov  1 12:31 MrzgxU8HT1fn3wTB_0.parquet
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/datafusion-cli$

}

#[test]
fn test_file_extension() {
Really nice tests 👏

Ok(())
}

#[tokio::test]
This is really nice too

@alamb alamb merged commit 87f0838 into apache:main Nov 1, 2024
24 checks passed
@alamb
Contributor

alamb commented Nov 1, 2024

Here is a small follow-on PR to add some more docs: #13216 (really, to get the great writeup you did on this PR into the code)

@sergiimk
Contributor

sergiimk commented Nov 9, 2024

I suspect this introduced a regression - would appreciate your opinion on #13323

@dhegberg dhegberg deleted the write_files_when_extension branch December 14, 2024 22:50
Labels

core Core DataFusion crate

Development

Successfully merging this pull request may close these issues.

to_parquet with path not ending in a slash writes to a file instead of a directory since v36

3 participants