Regression: DataFrameWriteOptions::with_single_file_output produces a directory #13323

@sergiimk

Description

Describe the bug

Consider a snippet like this:

df.write_parquet(
  "dir/data",
  DataFrameWriteOptions::new().with_single_file_output(true),
  None
).await

Before v43 this would write a single file called data, but in v43 it creates data as a directory containing one or more randomly named files.

This seems to be related to #13079 (cc @dhegberg) that added an extension-based heuristic.

I see this as a regression: single-file output is requested explicitly, so I don't want a heuristic to be applied.

We are using Parquet files with a content-addressable file system and our files don't have extensions.

To Reproduce

See above

Expected behavior

Considering the introduction of the extension-based heuristic, I would suggest the following behavior:

  • with_single_file_output is not called (single_file_output == None) - apply the heuristic
  • with_single_file_output(true) - produce a single file at the exact path specified
  • with_single_file_output(false) - create a directory at the specified path if it doesn't exist and write one or more files into it
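
The three cases above could be sketched roughly as follows. This is only an illustration of the proposed decision logic, not DataFusion's actual code: `OutputMode` and `resolve_output_mode` are hypothetical names, and the heuristic branch assumes that #13079 treats a path with a file extension (and no trailing slash) as a single file.

```rust
use std::path::Path;

/// Hypothetical type for illustration: how the writer lays out its output.
#[derive(Debug, PartialEq, Eq)]
enum OutputMode {
    /// Write exactly one file at the given path.
    SingleFile,
    /// Treat the path as a directory and write one or more files into it.
    Directory,
}

/// Hypothetical helper: resolve the layout from a tri-state option.
/// `None` means `with_single_file_output` was never called.
fn resolve_output_mode(path: &str, single_file_output: Option<bool>) -> OutputMode {
    match single_file_output {
        // An explicit request always wins over the heuristic.
        Some(true) => OutputMode::SingleFile,
        Some(false) => OutputMode::Directory,
        // No explicit request: fall back to the (assumed) extension-based heuristic.
        None => {
            let looks_like_file =
                !path.ends_with('/') && Path::new(path).extension().is_some();
            if looks_like_file {
                OutputMode::SingleFile
            } else {
                OutputMode::Directory
            }
        }
    }
}
```

With this shape, an extension-less path such as "dir/data" still resolves to a single file when `Some(true)` is passed, which is the case that matters for content-addressable storage.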

Additional context

Labels

bug, good first issue, help wanted, regression
