[SPARK-46272][SQL] Support CTAS using DSv2 sources #44190
Conversation
When we change src/main, you don't need to use [TESTS], @allisonwang-db.
I see! Thanks for letting me know!
cc @cloud-fan
This is not related to the data source, and we probably don't need it here.
I'm a bit confused here. For this data source, we probably don't check anything when creating the table. But when we load the table and get the Table instance in V2SessionCatalog, shall we check whether Table#schema matches what we stored in HMS?
This is actually tricky, and the behavior depends on the implementation of the data source.
In createTable, we store the query's schema (col1, col2, col3) in HMS, but this data source's getTable method does not take into account the input schema.
So, when the DSv2 relation is created, it uses table.columns.asSchema, which falls back to the data source's own schema (i, j).
Lines 210 to 211 in bacdb3b:
val schema = CharVarcharUtils.replaceCharVarcharWithStringInSchema(table.columns.asSchema)
DataSourceV2Relation(table, toAttributes(schema), catalog, identifier, options)
However, if the data source uses the user-specified schema in getTable, it won't throw this exception.
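For context, here is a minimal sketch, not the actual test source from this PR (the class name and the reported columns are illustrative), of a DSv2 TableProvider whose getTable ignores the schema passed to it. This is the situation described above: the catalog has stored (col1, col2, col3), but the returned table keeps reporting (i, j).

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.{IntegerType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class FixedSchemaSource extends TableProvider {
  // The source always reports this schema, regardless of what was stored in the catalog.
  private val fixedSchema = new StructType().add("i", IntegerType).add("j", IntegerType)

  override def supportsExternalMetadata(): Boolean = true

  override def inferSchema(options: CaseInsensitiveStringMap): StructType = fixedSchema

  override def getTable(
      schema: StructType,                 // user-specified schema, e.g. (col1, col2, col3) from HMS
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = {
    // The `schema` argument is ignored here, so table.columns.asSchema ends up as (i, j).
    new Table {
      override def name(): String = "fixed_schema_table"
      override def schema(): StructType = fixedSchema
      override def capabilities(): util.Set[TableCapability] =
        util.Collections.emptySet[TableCapability]()
    }
  }
}
```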
I think we should enforce something here. When calling TableProvider.getTable with a user-specified schema, we should make sure the returned table reports the same schema as the user-specified one (probably ignoring nullability). It's actually a documented requirement:
* Return a {@link Table} instance with the specified table schema, partitioning and properties
* to do read/write. The returned table should report the same schema and partitioning with the
* specified ones, or Spark may fail the operation.
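Sketched only as an illustration of the check being suggested (the helper name is made up, and Spark's actual enforcement may differ): compare the schema the returned table reports against the user-specified schema, ignoring nullability.

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: does the table returned by TableProvider.getTable report
// the same schema as the user-specified one, ignoring top-level nullability?
def reportsUserSpecifiedSchema(reported: StructType, userSpecified: StructType): Boolean = {
  reported.length == userSpecified.length &&
    reported.fields.zip(userSpecified.fields).forall { case (got, expected) =>
      got.name == expected.name && got.dataType == expected.dataType // nullability ignored
    }
}
```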
sqlState is required now. Maybe 42K02 is fine.
Shall we do this check at a lower level, where we instantiate the v2 Table?
Makes sense.
sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
Can we check it in loadTable, in case the data source is a bit random and only returns the wrong schema the second time?
I tried it, but it looks like loadTable is used in many other places, such as tableExists. If we check the schema there, it will fail commands other than CREATE TABLE (e.g., DROP TABLE t will fail because the schema in catalogTable does not match the table schema).
LGTM if all tests pass
@cloud-fan the test failure seems unrelated
Can you retrigger the GA jobs?
Thanks, merging to master!
What changes were proposed in this pull request?
#43949 supports CREATE TABLE using DSv2 sources. This PR supports CREATE TABLE AS SELECT (CTAS) using DSv2 sources. It turns out that we don't need additional code changes. This PR simply adds more test cases for CTAS queries.
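As a rough illustration only (the source name, columns, and values below are placeholders, not the exact tests added here), the new CTAS tests have roughly this shape inside a suite such as DataSourceV2Suite:

```scala
import org.apache.spark.sql.Row

// "simpleDataSourceV2" stands in for a DSv2 source registered for tests;
// withTable, sql, and checkAnswer are the usual Spark SQL test helpers.
withTable("t") {
  sql("CREATE TABLE t USING simpleDataSourceV2 AS SELECT 1 AS col1, 'a' AS col2")
  checkAnswer(sql("SELECT * FROM t"), Seq(Row(1, "a")))
}
```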
Why are the changes needed?
To add tests for CTAS for DSv2 sources.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests
Was this patch authored or co-authored using generative AI tooling?
No