[SPARK-49893] Respect user schema nullability for file data sources when DSV2 Table is used. #48321
Conversation
    val schemaStr = df.schema.treeString
    assert(schemaStr.contains("f2: string (nullable = false)"))
We can use df.schema directly; there is no need for the string conversion in this test.
I will add a JIRA item as well in the future, if we decide to go with this PR.
@MaxGekk What do you think about this change?

cc @HyukjinKwon
    }
    fileIndex match {
      case _: MetadataLogFileIndex => schema
      case _ if SQLConf.get.respectUserSchemaNullabilityForFileDataSources => schema
Why change only V2?
Other code paths are covered in DataFrameReader: if the flag LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION is set to true, schema nullability is preserved there.
        "`DataFrameReader.schema(schema).json(path)` and .csv(path) and .xml(path) is respected. " +
        "Otherwise, they are turned to a nullable schema forcibly.")
      .version("4.0.0")
      .fallbackConf(LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION)
why fall back to the config of text data source?
This is a follow-up of sorts to the PR that introduced LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION: schema nullability is not preserved when a path to a file is given, even if that flag is set to true. We had users who complained that schema nullability was not preserved even though they had set this flag. To make it easier for them, the new config falls back to the flag from the previous PR. However, since it is risky to rely on the existing flag alone (a mitigation for a potential issue would affect other code paths), I introduced a new flag that falls back to the old one. If something goes wrong, the new flag can be changed on its own, while customers who already overrode the old flag to true need no additional action.
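The fallback relationship described above can be sketched roughly as follows (a minimal sketch using Spark's SQLConf builder pattern; the exact builder chain and doc string are assumptions, not the PR's verbatim code):

```scala
// Hypothetical sketch: a new conf whose default is taken from the legacy flag.
// If a user has already set the legacy flag to true, the new conf is true too;
// the new conf can still be overridden independently if something goes wrong.
val RESPECT_USER_SCHEMA_NULLABILITY_FOR_FILE_DATA_SOURCES =
  buildConf("spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath")
    .doc("When true, the nullability of a user-provided schema is respected " +
      "for file-based data sources when a DSV2 Table is used.")
    .version("4.0.0")
    .fallbackConf(LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION)
```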
IIRC, this is a long-standing behavior to avoid unexpected nullability. Could you explain further why we have to support a non-nullable user-provided schema? cc @cloud-fan who probably has the most context on this one
I have to talk with the users, but for now they explicitly wanted the nullability of the provided schema to be respected, and I suppose they want an exception during parsing if some field is null.
When the initial PR mentioned in this PR was merged for json and csv (#33436), I realised that I had underestimated the problem (#33436 (comment)). For reading actual files, there would be many more cases where it could break. One of those cases was streaming, where the schema changes over time, etc.
Thanks a lot @HyukjinKwon. What can we do for the users here? They want to get an error when a column is null or missing.
Is this only a problem for file source v2? If yes, I think we should just fix file source v2 to respect it
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
DataFrameReader has 3 APIs for JSON reading:
- json(Dataset[String])
- json(RDD)
- json(filePath)

The first two APIs respect the provided user schema's nullability when the Spark flag spark.sql.legacy.respectNullabilityInTextDatasetConversion is set to true, but the third one does not: the provided schema's nullability is always overridden to true. E.g. dataFrameReader.json(jsonRDD) and dataFrameReader.json(jsonDataSet) will check the mentioned config, but dataFrameReader.json(path) hits a totally different code path and ends up in FileTable, where the dataSchema getter overrides field nullability to true.

Why are the changes needed?
Some users just want to validate their data and get an exception when some field is null.
Does this PR introduce any user-facing change?
When customers set the newly added Spark conf spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath, the provided user schema's nullability will no longer be overridden to true. The default value for the flag is false.
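The intended user-facing behavior could look like the sketch below (a hedged example assuming the conf name introduced by this PR and a hypothetical input path; not verbatim test code from the PR):

```scala
// Sketch: with the new flag enabled, the non-nullable field in the
// user-provided schema is kept as-is when reading from a file path,
// instead of being forced to nullable = true.
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("f1", LongType, nullable = true),
  StructField("f2", StringType, nullable = false)))

spark.conf.set(
  "spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath", "true")

// "/path/to/input.json" is a placeholder path for illustration.
val df = spark.read.schema(schema).json("/path/to/input.json")

// Expected with the flag on: df.schema("f2").nullable remains false,
// so parsing can fail fast when f2 is null or missing.
```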
How was this patch tested?
Using an integration test in the base JsonSuite class.
Was this patch authored or co-authored using generative AI tooling?
No