[SPARK-6322][SQL] CTAS should consider the case where no file format or storage handler is given #5014
Conversation
Test build #28567 has finished for PR 5014 at commit
That's my bad; I didn't drop a comment here.
Actually we are not using the ignores. Previously we tried to parse the storage format information within HiveQl.scala, but the logic was quite complicated and error-prone, so we decided to reuse the Hive code for that purpose; that's why we always pass down the node (ASTNode) for further analysis. The node is required to be passed down even when no storage format is specified, since Hive provides a default storage format during analysis. So I don't think the change here is reasonable.
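The argument above can be sketched in a few lines. This is a simplified, hypothetical model (the names here are illustrative, not the actual Spark classes): the parser always forwards the raw ASTNode, so Hive's own analysis can fill in a default storage format when the query specifies none.

```scala
// Simplified sketch, not the actual Spark code.
case class ASTNode(text: String)

case class CreateTableAsSelect(tableName: String, node: Option[ASTNode])

def parseCtas(tableName: String,
              formatClause: Option[String],
              node: ASTNode): CreateTableAsSelect = {
  // Even when formatClause is None, the node is still passed down;
  // Hive derives the default storage format from it during analysis.
  CreateTableAsSelect(tableName, Some(node))
}
```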
No. Actually, ignores could be None here when no file format or storage handler is given.
Even without the storage handler specified, Hive still needs the node to get the default format.
See https://github.com/apache/spark/pull/5014/files#diff-ee66e11b56c21364760a5ed2b783f863L539
When it is None and hive.convertCTAS is true, the Hive CTAS statement becomes a data source table with the default data source. Please refer to CreateTables in HiveMetastoreCatalog.scala.
Otherwise it will fall into the default format specified by hive.conf.defaultDataSourceName, which is probably not the default Hive behavior.
In CreateTables, the node is parsed by Hive's code to see whether a file format or storage handler is given. If not, and hive.convertCTAS is true, the Hive CTAS statement becomes a data source table with the default data source, too.
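The dispatch being described can be sketched as follows. This is a hedged, simplified model (names are hypothetical, not the actual CreateTables implementation): desc == None stands for "no file format or storage handler was given".

```scala
// Simplified sketch of the CTAS dispatch described above.
sealed trait Plan
case class DataSourceCtas(provider: String) extends Plan
case class HiveCtas(format: String) extends Plan

def planCtas(desc: Option[String],
             convertCTAS: Boolean,
             defaultSource: String): Plan =
  desc match {
    // No format given and hive.convertCTAS enabled: become a data
    // source table with the default data source (e.g. Parquet).
    case None if convertCTAS => DataSourceCtas(defaultSource)
    // Otherwise fall back to Hive, which supplies its own default.
    case d => HiveCtas(d.getOrElse("hive-default"))
  }
```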
@chenghao-intel Please refer to #4639, which uses Parquet as the default file format for CTAS statements.
Oh, OK, I see. Thanks for the explanation. But what if hive.convertCTAS is false and the user didn't specify the storage format in HiveQl? How can we get the default Hive storage format?
Yes. In that case, the parsed CreateTableDesc will be passed to execution.CreateTableAsSelect.
@yhuai @chenghao-intel I updated it. Please take a look when you have time. Thanks!
Test build #28769 has finished for PR 5014 at commit
@yhuai Any other comments?
/cc @yhuai @liancheng Can you take a look at this if you have time? Thanks!
Not super familiar with this part of the code, but according to the context and discussion, I think this change makes sense. @yhuai Could you help confirm this?
LGTM
Is it the only place where we create a CreateTableAsSelect? If so, can we get rid of the Option and just pass the node?
Currently, it is. If we are sure that CreateTableAsSelect is only used by the Hive dialect, we can remove the Option.
Test build #28999 has finished for PR 5014 at commit
Let's get rid of the type parameter and rename it to CreateHiveTableAsSelect (to be a little more specific about what this one does).
The CreateTableAsSelect is designed as a common logical plan node; that's why I made the desc a type parameter T, and also an optional parameter. Otherwise, every SQL dialect would have to implement its own CTAS node (logical plan).
Or is CreateTableUsingAsSelect a more generic interface for the same purpose?
I think CreateTableUsingAsSelect is just for the data source API?
I added a specific CreateHiveTableAsSelect but still keep CreateTableAsSelect as a common logical plan for the use of other SQL dialects.
…f CreateTableAsSelect.
Test build #29045 has finished for PR 5014 at commit
@yhuai Is this good to go now?
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
Test build #29164 has finished for PR 5014 at commit
@yhuai please take a look, thanks.
@yhuai Can you check if this is ready to merge?
@liancheng @yhuai It has been quite a while waiting for this to be merged; could you take a look?
Instead of having this trait here, can we just make the implementation below a Command?
I guess we cannot use Command because a Command is a LeafNode. If we make it a Command, the child logical plan will not be analyzed and optimized.
I do think it will be good to have a different type of Command that is not a LeafNode.
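The objection can be made concrete with a toy model. These are not the real Catalyst classes, just a sketch of the type structure under discussion: a Command modeled as a leaf hides any nested plan from the analyzer and optimizer, while a hypothetical non-leaf command keeps its child visible.

```scala
// Illustrative sketch, not the real Catalyst class hierarchy.
abstract class LogicalPlan { def children: Seq[LogicalPlan] }

// Command as it exists: always a leaf, so a query nested inside it
// would never be analyzed or optimized.
abstract class Command extends LogicalPlan {
  final def children: Seq[LogicalPlan] = Nil
}

// A hypothetical non-leaf command: the child plan stays visible,
// so it still flows through analysis and optimization.
abstract class UnaryCommand extends LogicalPlan {
  def child: LogicalPlan
  final def children: Seq[LogicalPlan] = Seq(child)
}

case class LeafOnlyCommand() extends Command
case class CtasCommand(child: LogicalPlan) extends UnaryCommand
```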
One minor comment; otherwise this LGTM. ping @yhuai
Test build #30106 has finished for PR 5014 at commit
ping @yhuai
Test build #31053 has started for PR 5014 at commit
When creating CreateTableAsSelect in HiveQl, it doesn't consider the case where no file format or storage handler is given. So later in CreateTables, one of the CreateTableAsSelect cases will never be run. The two CreateTableAsSelect cases are basically the same code except for checking CreateTableDesc. This PR fixes the issue.
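The deduplication the description refers to can be sketched as follows. This is a hedged, simplified model (names are hypothetical): once the parsed storage description is an Option, a single code path covers both situations that previously needed two near-identical CreateTableAsSelect cases.

```scala
// Simplified sketch of the fix, not the actual CreateTables code.
case class HiveCtasNode(desc: Option[String])

def rewrite(ctas: HiveCtasNode, convertCTAS: Boolean): String =
  // One path handles desc == None (no format/storage handler given)
  // and desc == Some(...) alike; only the convertCTAS check differs.
  if (ctas.desc.isEmpty && convertCTAS) "data source table (default source)"
  else "execution.CreateTableAsSelect"
```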