
Conversation

@allisonwang-db
Contributor

What changes were proposed in this pull request?

#43949 supports CREATE TABLE using DSv2 sources. This PR supports CREATE TABLE AS SELECT (CTAS) using DSv2 sources. It turns out that we don't need additional code changes. This PR simply adds more test cases for CTAS queries.
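
For illustration, the new tests exercise CTAS queries of roughly this shape (a minimal sketch assuming the usual QueryTest/SQLTestUtils helpers such as withTable, sql, and checkAnswer; the data source name testSource, the table name t, and the columns are invented here and are not the names used in the actual test suite):

withTable("t") {
  // CTAS against a hypothetical DSv2 source registered as "testSource".
  sql("CREATE TABLE t USING testSource AS SELECT 1 AS i, 'a' AS j")
  checkAnswer(sql("SELECT * FROM t"), Row(1, "a"))
}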

Why are the changes needed?

To add tests for CTAS with DSv2 sources.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests

Was this patch authored or co-authored using generative AI tooling?

No

Member

When a PR changes src/main, you don't need to use [TESTS] in the PR title, @allisonwang-db.

Contributor Author

I see! Thanks for letting me know!

@dongjoon-hyun changed the title from "[SPARK-46272][SQL][TESTS] Support CTAS using DSv2 sources" to "[SPARK-46272][SQL] Support CTAS using DSv2 sources" on Dec 5, 2023
@allisonwang-db
Contributor Author

cc @cloud-fan

Contributor

This is not related to the data source, and we probably don't need it here.

Contributor

I'm a bit confused here. For this data source, we probably don't check anything when creating the table. But when we load the table and get the Table instance in V2SessionCatalog, shall we check whether Table#schema matches what we stored in HMS?

Contributor Author
@allisonwang-db Dec 11, 2023

This is actually tricky, and the behavior depends on the implementation of the data source.

In createTable, we store the query's schema (col1, col2, col3) in HMS, but this data source's getTable method does not take into account the input schema.

So when the DSv2 relation is created, it uses table.columns.asSchema, which falls back to the default implementation and reports the data source's own schema (i, j).

val schema = CharVarcharUtils.replaceCharVarcharWithStringInSchema(table.columns.asSchema)
DataSourceV2Relation(table, toAttributes(schema), catalog, identifier, options)

However, if the data source uses the user-specified schema in getTable, it won't throw this exception.
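
For illustration, a provider with the behavior described above could look roughly like this (a hedged sketch, not the PR's test source; the class, table, and column names are invented, and read/write capabilities are omitted for brevity):

import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class SchemaIgnoringSource extends TableProvider {
  override def supportsExternalMetadata(): Boolean = true

  // The source's own schema, independent of whatever CTAS stored in HMS.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("i", "int").add("j", "string")

  // Ignores the `schema` argument (the query schema persisted in HMS), so the
  // returned table always reports (i, j) and the DSv2 relation picks that up.
  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new Table {
    override def name(): String = "schema_ignoring_table"
    override def schema(): StructType =
      new StructType().add("i", "int").add("j", "string")
    override def capabilities(): util.Set[TableCapability] =
      util.Collections.emptySet[TableCapability]()
  }
}

A provider that honors the contract would instead build the returned Table from the schema argument, in which case the relation's schema matches what CTAS stored in HMS and no mismatch arises.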

Contributor

I think we should enforce something here. When calling TableProvider.getTable with a user-specified schema, we should make sure the returned table reports the same schema as the user-specified one (probably ignoring nullability). It's actually a documented requirement:

   * Return a {@link Table} instance with the specified table schema, partitioning and properties
   * to do read/write. The returned table should report the same schema and partitioning with the
   * specified ones, or Spark may fail the operation.
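
For illustration, the kind of check being suggested could look roughly like this (a hedged sketch of the idea, not the code this PR ends up adding; the object and method names are invented, and the comparison only skips field-level nullability to stay short):

import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

object SchemaConsistency {
  // Compares top-level field names and data types, ignoring StructField
  // nullability; nested nullability is compared as-is to keep the sketch short.
  private def schemasMatch(expected: StructType, actual: StructType): Boolean =
    expected.length == actual.length &&
      expected.fields.zip(actual.fields).forall { case (e, a) =>
        e.name == a.name && e.dataType == a.dataType
      }

  // Wraps TableProvider.getTable and enforces the documented contract: the
  // returned table must report the user-specified schema.
  def getTableAndCheck(
      provider: TableProvider,
      userSchema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = {
    val table = provider.getTable(userSchema, partitioning, properties)
    // Table.schema() is deprecated in favor of columns(); used here for brevity.
    if (!schemasMatch(userSchema, table.schema())) {
      throw new IllegalArgumentException(
        s"Data source returned schema ${table.schema()} which does not match " +
          s"the user-specified schema $userSchema.")
    }
    table
  }
}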

Contributor

sqlState is required now. Maybe 42K02 is fine.

Contributor

Shall we do this check at a lower level, where we instantiate the v2 Table?

Contributor Author

Makes sense.

Contributor

Can we check it in loadTable, in case the data source is a bit random and only returns the wrong schema the second time?

Contributor Author

I tried it, but it looks like loadTable is used in many other places, such as tableExists. If we check the schema there, it will fail commands other than CREATE TABLE (e.g. DROP TABLE t will fail because the schema in catalogTable does not match the table schema).

@cloud-fan
Contributor

LGTM if all tests pass

@allisonwang-db
Contributor Author

@cloud-fan the test failure seems unrelated

@cloud-fan
Contributor

cloud-fan commented Dec 16, 2023

Can you retrigger the GA jobs?

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan closed this in cc9d991 on Dec 20, 2023