[SPARK-49893] Respect user schema nullability for file data sources when DSV2 Table is used. #48321
Conversation
    val schemaStr = df.schema.treeString
    assert(schemaStr.contains("f2: string (nullable = false)"))
We can use df.schema directly; there is no need for the string conversion in this test.
I will add a JIRA item as well in the future, if we decide to go with this PR.
@MaxGekk What do you think about this change?

cc @HyukjinKwon
    }
    fileIndex match {
      case _: MetadataLogFileIndex => schema
      case _ if SQLConf.get.respectUserSchemaNullabilityForFileDataSources => schema
Why change only V2?
Other code paths are covered in DataFrameReader: if the flag LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION is set to true, schema nullability is preserved there.
        "`DataFrameReader.schema(schema).json(path)` and .csv(path) and .xml(path) is respected. " +
        "Otherwise, they are turned to a nullable schema forcibly.")
      .version("4.0.0")
      .fallbackConf(LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION)
why fall back to the config of text data source?
This is a follow-up of sorts to the PR that introduced LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION: schema nullability is not preserved when a path to a file is given, even if that flag is set to true. We had users who complained that schema nullability was not preserved even though they had set this flag. To make it easier for them, the new config falls back to the flag from the previous PR. However, since it is risky to rely on the existing flag alone (a mitigation for a potential issue would affect other code paths), I introduced a new flag that falls back to the old one. If something goes wrong, the new flag can be changed on its own, while customers who already overrode the old flag to true need no additional action.
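The fallback relationship described above can be sketched roughly as follows (a minimal sketch using Spark's SQLConf builder pattern; the exact builder chain and doc string are assumptions, not the PR's verbatim code):

```scala
// Hypothetical sketch: a new conf whose default is taken from the legacy flag.
// If a user has already set the legacy flag to true, the new conf is true too;
// the new conf can still be overridden independently if something goes wrong.
val RESPECT_USER_SCHEMA_NULLABILITY_FOR_FILE_DATA_SOURCES =
  buildConf("spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath")
    .doc("When true, the nullability of a user-provided schema is respected " +
      "for file-based data sources when a DSV2 Table is used.")
    .version("4.0.0")
    .fallbackConf(LEGACY_RESPECT_NULLABILITY_IN_TEXT_DATASET_CONVERSION)
```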
IIRC, this is a long-standing behavior to avoid unexpected nullability. Could you explain further why we have to support a non-nullable user-provided schema? cc @cloud-fan who probably has the most context on this one
I have to talk with the users, but for now they explicitly wanted the nullability of the provided schema to be respected, and I suppose they want an exception during parsing if some field is null.
When the initial PR mentioned in this PR was merged for json and csv (#33436), I realised that I had underestimated the problem (#33436 (comment)). For reading actual files, there would be many more cases where it could break. One of those cases was streaming, where the schema changes over time, etc.
Thanks a lot @HyukjinKwon. What can we do for the users here? They want to get an error when a column is null or missing.
Is this only a problem for file source v2? If yes, I think we should just fix file source v2 to respect it
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
DataFrameReader has 3 APIs for JSON reading:
- json(Dataset[String])
- json(RDD)
- json(filePath)

The first two APIs respect the provided user schema's nullability when the Spark flag spark.sql.legacy.respectNullabilityInTextDatasetConversion is set to true, but the third one does not: the provided schema's nullability is always overridden to true. E.g. dataFrameReader.json(jsonRDD) and dataFrameReader.json(jsonDataSet) will check the mentioned config, but dataFrameReader.json(path) hits a totally different code path and ends up in FileTable, where the dataSchema getter overrides field nullability to true.

Why are the changes needed?
Some users just want to validate their data and get an exception when some field is null.
Does this PR introduce any user-facing change?
When customers set the newly added Spark conf spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath, the provided user schema's nullability will no longer be overridden to true. The default value for the flag is false.
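The intended user-facing behavior could look like the sketch below (a hedged example assuming the conf name introduced by this PR and a hypothetical input path; not verbatim test code from the PR):

```scala
// Sketch: with the new flag enabled, the non-nullable field in the
// user-provided schema is kept as-is when reading from a file path,
// instead of being forced to nullable = true.
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("f1", LongType, nullable = true),
  StructField("f2", StringType, nullable = false)))

spark.conf.set(
  "spark.sql.respectUserSchemaNullabilityForFileDataSourceWithFilePath", "true")

// "/path/to/input.json" is a placeholder path for illustration.
val df = spark.read.schema(schema).json("/path/to/input.json")

// Expected with the flag on: df.schema("f2").nullable remains false,
// so parsing can fail fast when f2 is null or missing.
```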
How was this patch tested?
Using an integration test in the base JsonSuite class.
Was this patch authored or co-authored using generative AI tooling?
No