[SPARK-19950][SQL] Fix to ignore nullable when df.load() is executed for file-based data source #17293
Conversation
Test build #74538 has finished for PR 17293 at commit

Test build #74549 has finished for PR 17293 at commit
Oh @kiszk, I think this deals with essentially the same problem as #14124, in a different way. I initially proposed forcing all cases into a nullable schema in similar places (please refer to the details in the JIRA and PR). Per the discussion there, it has since turned into a proposal to force the schemas into nullable ones when we set them. I guess the problem is essentially the same.

@HyukjinKwon Thank you for your clarification. I understand that when we enable user-specified schemas, we have to add code for validating the data against the schema.
Test build #74849 has finished for PR 17293 at commit

Test build #74875 has finished for PR 17293 at commit

Test build #74887 has finished for PR 17293 at commit
@HyukjinKwon I added data validation using schema information for the Parquet reader, as @gatorsmile suggested in https://www.mail-archive.com/[email protected]/msg39233.html. Could you please take a look?

Uh.. I am actually not sure whether we want the non-nullability, per #14124 (comment). I am willing to do my best to help test and verify whether the non-nullability is okay, but I hope someone who is qualified decides what we want first. Another personal, humble opinion: I think we should enable/disable this for all data sources or for none, if the change is not too big (assuming this PR enables non-nullability for Parquet only). cc @cloud-fan, @marmbrus and @liancheng, do we want to enable this non-nullability?
Test build #74948 has finished for PR 17293 at commit

Test build #74950 has finished for PR 17293 at commit

Test build #74957 has finished for PR 17293 at commit

Test build #74958 has finished for PR 17293 at commit
This failure occurs due to an issue described at #17302.

@HyukjinKwon If we agree on adding a data-cleaning phase, I will do this for the other data sources in another PR.
Test build #75149 has started for PR 17293 at commit
Jenkins, retest this please |
Test build #75158 has finished for PR 17293 at commit
cc: @HyukjinKwon, @cloud-fan, @marmbrus and @liancheng, |
Test build #77529 has finished for PR 17293 at commit

Test build #77530 has finished for PR 17293 at commit

Test build #77541 has finished for PR 17293 at commit
ping @HyukjinKwon, @gatorsmile |
(Just FWIW, I am waiting for feedback on #17293 (comment). I can help review, but I think we need a decision (which I can't make by rule, I believe) on whether we should support this case or not.)
This issue causes a lot of headaches for us when picking up Parquet datasets. To get around it, we write the schema alongside the Parquet files in a side-band, and then, when loading, create the correct schema object (with non-nullable columns) and swap it in, like the sketch below. This is extremely ugly code, particularly in PySpark, and I'd rather remove it. I don't see why we wouldn't want to trust the nullable flags that Spark itself correctly writes to Parquet files.
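(The snippet that originally followed was not preserved in this thread. Below is a minimal Scala sketch of the described workaround, under the assumption that the schema was saved as JSON next to the data; the paths and file name are hypothetical.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical side-band file written at save time, e.g. via df.schema.json.
val schemaJson = spark.sparkContext
  .textFile("/data/events/_schema.json")
  .collect()
  .mkString("\n")
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Load normally, then swap in the correct schema (with its non-nullable
// columns) by rebuilding the DataFrame from the underlying RDD.
val loaded = spark.read.parquet("/data/events")
val df = spark.createDataFrame(loaded.rdd, schema)
```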
@HyukjinKwon thank you for reminding us of this discussion. I agree that we need a discussion. @cloud-fan, @marmbrus, @liancheng, do you have any comments on this?
- * 2. If A doesn't exist in `that`, it's included in the result schema.
- * 3. If B doesn't exist in `this`, it's also included in the result schema.
+ * 2. If A doesn't exist in `that`, it's included in the result schema with nullable.
+ * 3. If B doesn't exist in `this`, it's also included in the result schema with nullable.
good catch!
My main concern is how to validate it. This is one pain point of Spark: we don't control the storage layer, so we can't trust any constraints like nullable, primary key, etc. I think it's a safe choice to always treat input data as nullable. And I think it's mostly about performance: validating also has a performance penalty, which may eliminate the benefit of the non-nullable optimization. BTW, another reason is that Spark doesn't work well if the nullable information is wrong; we may return wrong results, which are very hard to debug.
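(To make the wrong-result risk above concrete: a hedged Scala sketch, assuming files at a hypothetical path that actually contain nulls in `c` while the schema forced onto the DataFrame claims otherwise.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// Force a schema that wrongly declares `c` non-nullable onto data that
// really contains nulls.
val loaded = spark.read.parquet("/tmp/bad-nullability")
val wrong = spark.createDataFrame(
  loaded.rdd,
  StructType(Seq(StructField("c", IntegerType, nullable = false))))

// With the wrong flag, the optimizer may fold isNotNull(c) to a constant
// true and drop the filter, so null rows survive -- exactly the kind of
// wrong result that is very hard to debug.
val filtered = wrong.filter(col("c").isNotNull)
```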
Reliable nullability information is about far more than the non-nullable optimization to us. I would happily opt in to any performance penalty that validated that non-nullable columns were actually non-nullable, with a hard fail if an unexpected null were encountered. This is a real problem we face running a reasonably large data warehouse at scale. In fact, we already do this in another project through the use of a custom function called
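(The function's name was not preserved above; the following is a minimal Scala sketch of the kind of validation described, hard-failing when a column declared non-nullable contains a null. Note that it runs one Spark action per checked column.)

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Returns the DataFrame unchanged, or throws if any column whose schema
// says nullable = false actually contains null values.
def validateNonNullable(df: DataFrame): DataFrame = {
  val nonNullableCols = df.schema.fields.filterNot(_.nullable).map(_.name)
  for (c <- nonNullableCols) {
    val nulls = df.filter(col(c).isNull).count()
    require(nulls == 0,
      s"Column '$c' is declared non-nullable but contains $nulls null value(s)")
  }
  df
}
```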
Actually Spark already has a

I see.
Does the closure of this PR imply that setting nullable=false in a custom (user-defined) schema will never have an effect when loading CSV or JSON data from a file? In other words, if someone sets |
We should discuss this further, but yes, for now that's what happens.
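(For illustration, a hedged Scala sketch of that current behavior for CSV; the path and column are assumptions.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

val schema = StructType(Seq(StructField("age", IntegerType, nullable = false)))

// Even though the user-specified schema says nullable = false, the loaded
// DataFrame currently reports the column as nullable.
val df = spark.read.schema(schema).csv("/tmp/people.csv")
df.printSchema() // age: integer (nullable = true)
```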
Thanks for the answer, @HyukjinKwon. Is there any place that specifies the full syntax (DDL and/or JSON) and known limitations of custom schemas? I searched in various ways through the docs root here but couldn't find anything. We are incorporating support for custom schemas into our tool and I was hoping to find something I could link our users to. |
The Apache Spark side doesn't document the full SQL syntax. You should take a look at the ANTLR definition. Let's ask this on the mailing list, since that subject is orthogonal.
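(For illustration, the two schema forms `DataFrameReader.schema` accepts, a DDL-style string and a programmatic `StructType`; the exact grammar lives in the ANTLR definition mentioned above. Paths and columns are assumptions.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// DDL-style string form.
val df1 = spark.read.schema("name STRING, age INT").csv("/tmp/people.csv")

// Equivalent programmatic form.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val df2 = spark.read.schema(schema).csv("/tmp/people.csv")
```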
This issue still persists in 2023.
What changes were proposed in this pull request?
This PR fixes a problem that was reported on the Databricks forum. When code like the sketch below is executed, the schema for `id` has `nullable = true` even though the schema passed to `read.schema()` specifies `nullable = false`. I realized this is because current Spark sets `nullable` to `true` in its schema for file-based data sources. This PR makes this conservative setting precise.
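(The code sample referenced above was not preserved; the following is a minimal Scala sketch of the reported behavior, with an illustrative path.)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Write a small Parquet dataset with a single column `id`.
Seq(1L, 2L, 3L).toDF("id").write.mode("overwrite").parquet("/tmp/spark-19950")

// Read it back with a user-specified schema declaring `id` non-nullable.
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))
val df = spark.read.schema(schema).parquet("/tmp/spark-19950")

// Prints `id: long (nullable = true)`: the user-specified nullable = false
// is silently replaced by true.
df.printSchema()
```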
How was this patch tested?

Added a new test in `DataFrameReaderWriterSuite`.