[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog #16487
Conversation
Test build #70983 has finished for PR 16487 at commit
How about `Seq(1 -> "a").toDF("i", "j").write.format("hive").save()`? We will get the following error:
If path is not provided, it still works. However, based on our latest design decision, users must provide a path when creating an external table.
In the design decision, we want to hide the managed/external concept from users. I'm not sure if we want to rename this API...
Maybe we should just throw an exception when users do not provide a path? Otherwise, we have to add new APIs.
I'll address this problem in a follow-up PR. Other data sources also have this problem, e.g. users can create an external parquet table without a path, so this PR doesn't introduce new problems.
+1
Are we also facing the same issue in the `insertInto(tableIdent: TableIdentifier)` API?
`insertInto` is different: it generates an `InsertIntoTable` plan instead of a `CreateTable` plan.
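To make the distinction concrete, here is a minimal sketch of the two code paths; the table name `t1` and the sample data are hypothetical, not taken from the PR:

```scala
val df = Seq((1, "a")).toDF("i", "j")

// saveAsTable creates the table if it does not exist:
// this goes through a CreateTable plan.
df.write.format("hive").saveAsTable("t1")

// insertInto writes into an already-existing table:
// this goes through an InsertIntoTable plan, and the table's
// existing format is used regardless of the writer's format().
df.write.insertInto("t1")
```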
Seq((1, 2)).toDF("i", "j").write.format("parquet").saveAsTable(tableName)
table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)

We captured the exceptions when the format is parquet. Now, when the format is hive, should we do the same thing?
`DataFrameWriter.insertInto` will ignore the specified provider, won't it?
Although we ignore the specified provider, we still respect the actual format of the table. For example, below is a Hive table, and we are not blocking it. Should we block it to make them consistent?
sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS key, 'abc' AS value")
val df = sql(s"SELECT key, value FROM $tableName")
df.write.mode("overwrite").insertInto(tableName)
We should not block it. This generates `InsertIntoTable`, and it supports hive tables. What we should block is `saveAsTable` with `Overwrite` mode, which generates `CreateTable`. Insert overwrite is different from create table with overwrite mode.
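A minimal sketch of the two "overwrite" paths being contrasted; the table name `t` and the data are hypothetical:

```scala
val df = Seq((1, "a")).toDF("i", "j")

// Insert overwrite: the table must already exist; only its data is
// replaced, and its definition (format, schema) is kept.
// This generates an InsertIntoTable plan, which supports hive tables.
df.write.mode("overwrite").insertInto("t")

// saveAsTable with Overwrite mode: drops and re-creates the table
// using the writer's settings. This generates a CreateTable plan,
// which is the case being discussed for blocking.
df.write.mode("overwrite").format("hive").saveAsTable("t")
```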
Test build #71110 has finished for PR 16487 at commit
test("save API - format hive")
Dataset.ofRows(sparkSession,
  sparkSession.sessionState.catalog.lookupRelation(
    sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)))

can be simplified to:

sparkSession.table(tableName)
+1
LGTM pending test
LGTM
Test build #71112 has finished for PR 16487 at commit
Test build #71120 has finished for PR 16487 at commit
Thanks for the review, merging to master!
Author: Wenchen Fan <[email protected]>
Closes apache#16487 from cloud-fan/hive-table.
What changes were proposed in this pull request?

After unifying the CREATE TABLE syntax in #16296, it's pretty easy to support creating hive tables with `DataFrameWriter` and `Catalog` now. This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and adds tests.

How was this patch tested?

New tests in `HiveDDLSuite`.
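As a hedged usage sketch of what this PR enables: the session setup, table names, and path below are illustrative assumptions, not taken from the PR itself.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()  // hive support is required for the hive provider
  .getOrCreate()
import spark.implicits._

// Previously rejected: creating a hive-format table via DataFrameWriter.
Seq(1 -> "a").toDF("i", "j").write.format("hive").saveAsTable("hive_t")

// Likewise via the Catalog API, with an explicit path for external data.
spark.catalog.createExternalTable(
  "hive_ext", "hive", Map("path" -> "/tmp/hive_ext_data"))
```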