-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-30151][SQL] Issue better error message when user-specified schema mismatched #26781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #114950 has finished for PR 26781 at commit
|
|
The failed test |
|
Jenkins, retest this please. |
|
Test build #114955 has finished for PR 26781 at commit
|
| .unzip | ||
| if (persistentFields.nonEmpty) { | ||
| val errorMsg = | ||
| s"Mismatched fields detected between persistent schema and user specified schema: " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: seems like we can remove ss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean: fields -> filed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I meant the s for string interpolation (s"...)
| val persistentSize = persistentSchema.size | ||
| val specifiedSize = specifiedSchema.size | ||
| if (persistentSize == specifiedSize) { | ||
| val (persistentFields, specifiedFields) = persistentSchema.zip(specifiedSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to improve such error message case across the codebase, we might also think about having a common method (maybe something called assertEquality in StructType?) that checks each type recursively and shows a better message. Can we at least have a private method here for this case in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure whether we'd require this similar functionality in some cases in the future. But, maybe, we could still give it a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it wont handle nested cases. There are other external data sources that support nested schema and the current code tells only root columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there are many cases to show better error messages like this. E.g., StructType.merge or _merge_type in Python's schema inference (https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1097-L1111)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @HyukjinKwon , after discussing with wenchen offline, we decide not to make it too complicated here. If schemas are detected not match, we simply show the whole schema to user rather than those mismatched fields as previously did. Please see de036b6.
| // only implements the RelationProvider or the SchemaRelationProvider. | ||
| Seq("TEMPORARY VIEW", "TABLE").foreach { tableType => | ||
| val schemaNotAllowed = intercept[Exception] { | ||
| val schemaNotMatch = intercept[Exception] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the variable name to make it be more readable according to current error message.
|
Test build #115037 has finished for PR 26781 at commit
|
|
Test build #115040 has finished for PR 26781 at commit
|
| "you're using DataFrameReader.schema API or creating a table, please do not " + | ||
| "specify the schema. Or if you're scanning an existed table, please drop " + | ||
| "it and re-create it." | ||
| throw new AnalysisException(errorMsg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: format like this?
throw new AnalysisException("The user-specified schema doesn't match the actual schema: " +
s"user-specified: ${schema.toDDL}, actual: ${baseRelation.schema.toDDL}. If " +
"you're using DataFrameReader.schema API or creating a table, please do not " +
"specify the schema. Or if you're scanning an existed table, please drop " +
"it and re-create it.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thanks!
|
Test build #115083 has finished for PR 26781 at commit
|
|
Jenkins, retest this please. |
|
Test build #115093 has finished for PR 26781 at commit
|
|
thanks, merging to master! |
|
thanks @cloud-fan @HyukjinKwon @maropu |
What changes were proposed in this pull request?
Issue better error message when user-specified schema and not match relation schema
Why are the changes needed?
Inspired by #25248 (comment), user could get a weird error message when type mapping behavior change between Spark schema and datasource schema(e.g. JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell user what is really happening here to make user be more clearly about the error.
Does this PR introduce any user-facing change?
Yes, user will see error message changes.
How was this patch tested?
Updated existed tests.