# [SPARK-18123][SQL] Use db column names instead of RDD column ones during JDBC Writing #15664
## Conversation
Test build #67673 has finished for PR 15664 at commit
Actually, this is an approach similar to normalizePartitionSpec in PartitioningUtils.scala.
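For reference, the shared idea is to resolve a user-supplied name against the authoritative names with a pluggable resolver. A minimal sketch of that pattern (the helper below is illustrative, not Spark's actual `normalizePartitionSpec`):

``` scala
// Resolve a user-provided column name to the authoritative spelling,
// if any, using a resolver that is either exact or case-insensitive.
def normalizeName(
    userName: String,
    tableNames: Seq[String],
    resolver: (String, String) => Boolean): Option[String] = {
  tableNames.find(tableName => resolver(tableName, userName))
}
```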
Retest this please.
Test build #68459 has finished for PR 15664 at commit
The two failures seem to be unrelated.
Retest this please.
Test build #68481 has finished for PR 15664 at commit
Test build #69584 has finished for PR 15664 at commit
`nameMap` -> `lowercaseNameMap`.
Thank you for the review, @viirya.
This is a bug fix, right? Will review this tomorrow.
Yes, right! Thank you, @gatorsmile!
Test build #69658 has finished for PR 15664 at commit
I think the table schema won't change while all the data is being inserted (or is that possible?). You ask for the table schema for every insert statement now. Can we do this once on the caller side (i.e., savePartition) and reuse the schema?
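A minimal sketch of this suggestion, with illustrative signatures rather than the exact Spark internals:

``` scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.StructType

// Look the table schema up once on the driver, then reuse it in every
// partition instead of asking the database again per INSERT statement.
def saveWithSchemaOnce(
    df: DataFrame,
    lookupTableSchema: () => Option[StructType],
    savePartition: (Iterator[Row], Option[StructType]) => Unit): Unit = {
  val tableSchema = lookupTableSchema() // queried exactly once
  df.rdd.foreachPartition { rows =>
    savePartition(rows, tableSchema)    // reused by every partition
  }
}
```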
Thank you for the review, @viirya. I'll try to update it like that.
Test build #70577 has finished for PR 15664 at commit
We can get the table schema when we check whether the table exists.
Thank you for the review, @gatorsmile.
Yes, that looks great! We can use `getSchemaQuery` instead of `tableExists`.
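A hedged sketch of how one round trip can answer both questions; `parseMetadata` stands in for Spark's real metadata-to-schema helper:

``` scala
import java.sql.{Connection, ResultSetMetaData, SQLException}
import org.apache.spark.sql.types.StructType

// If the schema query (e.g. "SELECT * FROM t WHERE 1=0") prepares
// successfully, the table exists and its metadata yields the schema;
// if the driver throws, the table is treated as absent.
def getSchemaOption(
    conn: Connection,
    schemaQuery: String,
    parseMetadata: ResultSetMetaData => StructType): Option[StructType] = {
  try {
    val statement = conn.prepareStatement(schemaQuery)
    try Some(parseMetadata(statement.getMetaData))
    finally statement.close()
  } catch {
    case _: SQLException => None
  }
}
```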
The name resolution should still be controlled by `spark.sql.caseSensitive`, right?
Yep. I'll fix that.
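For illustration, the resolution could key off the session conf roughly like this (assuming a `sparkSession` in scope inside Spark's SQL code):

``` scala
// spark.sql.caseSensitive decides which comparison the resolver uses.
val caseSensitive: Boolean =
  sparkSession.sessionState.conf.caseSensitiveAnalysis
val resolver: (String, String) => Boolean =
  if (caseSensitive) (a, b) => a == b
  else (a, b) => a.equalsIgnoreCase(b)
```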
Can we build the INSERT SQL statement in `saveTable` based on the schema? Then there is no need to prepare the generated statement before calling `saveTable`.
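A toy sketch of generating the statement from resolved column names (helper names are illustrative):

``` scala
// Build "INSERT INTO t (c1,c2,...) VALUES (?,?,...)", quoting each
// column name with the dialect-specific quoter.
def insertStatement(
    table: String,
    columnNames: Seq[String],
    quoteIdentifier: String => String): String = {
  val columns = columnNames.map(quoteIdentifier).mkString(",")
  val placeholders = columnNames.map(_ => "?").mkString(",")
  s"INSERT INTO $table ($columns) VALUES ($placeholders)"
}
```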
The PR is updated accordingly.
Test build #70642 has finished for PR 15664 at commit
Test build #70643 has finished for PR 15664 at commit
``` scala
case Success(v) =>
  Some(v)
case Failure(e) =>
  None
```
Please do not use `Try`/`Success`/`Failure`. https://github.com/databricks/scala-style-guide#exception-handling-try-vs-try
I see. Thank you for the review! I wrote it that way to keep the same logic as the existing `tableExists`, because I thought the guideline was about returning `Try` as a return value.
Sure, I'll remove the usage of `Try`/`Success`/`Failure`.
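A sketch of the style-guide-compliant shape, assuming the only expected failure is a `SQLException`:

``` scala
import java.sql.SQLException

// Explicit try/catch over a narrow exception type, instead of wrapping
// the call in Try and pattern matching on Success/Failure.
def tryQuery[T](run: () => T): Option[T] = {
  try {
    Some(run())
  } catch {
    case _: SQLException => None // e.g. the table does not exist
  }
}
```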
``` scala
  }
}

/**
```
`savingSchema` is not right. We need a better name here.
Also, `rddSchema's sequence and tableSchema's name` -> `rddSchema's column sequence and tableSchema's column names`.
Here, we need to explain why we use the column sequence from `rddSchema` and why we use the column names from `tableSchema`.
Yep. I'll add more accurate details here.
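A hedged sketch of what that explanation could say (wording is illustrative, not the merged doc comment):

``` scala
/**
 * Keep the column *sequence* of `rddSchema` because each Row binds its
 * values to the INSERT statement positionally, but take the column
 * *names* from `tableSchema` because databases such as Postgres and
 * Oracle match quoted identifiers case-sensitively, so the spelling
 * stored in the table must win over the RDD's spelling.
 */
```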
``` scala
if (nameMap.isDefinedAt(f.name)) {
  // identical names
  schema = schema.add(nameMap(f.name))
} else if (!caseSensitive && lowercaseNameMap.isDefinedAt(f.name.toLowerCase)) {
```
Need to improve the comments. Actually, we return case-sensitive column names.
My bad. I meant case-insensitively identical names. I'll revise this.
``` scala
} else if (!caseSensitive && lowercaseNameMap.isDefinedAt(f.name.toLowerCase)) {
  // case-insensitively identical names
  schema = schema.add(lowercaseNameMap(f.name.toLowerCase))
} else {
```
`org.apache.spark.SparkException` -> `AnalysisException`
Sure!
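For illustration, the error could surface like this (message text is illustrative; `AnalysisException`'s constructor is accessible from inside Spark's own sql packages, where `JdbcUtils` lives):

``` scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.StructType

// Report a column that cannot be resolved against the table schema as
// a user-facing analysis error instead of an internal SparkException.
def missingColumnError(name: String, tableSchema: StructType): Nothing =
  throw new AnalysisException(
    s"""Column "$name" not found in schema $tableSchema""")
```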
``` scala
  saveTable(df, url, table, jdbcOptions)
  df.schema
}
saveTable(df, url, table, savingSchema, jdbcOptions)
```
How about passing the table schema and resolving/merging the schemas inside `saveTable`? It might simplify the code.
That could work. But to do that, `JdbcUtils.saveTable` would need to understand `SaveMode` too. Is that okay?
Currently, `JdbcUtils` only provides somewhat primitive APIs.
Test build #70662 has finished for PR 15664 at commit
Test build #70663 has finished for PR 15664 at commit
The failure is not related to this PR.
…Failure. Add comments.
``` scala
// In this case, we should truncate table and then load.
truncateTable(conn, table)
saveTable(df, url, table, jdbcOptions)
val tableSchema = JdbcUtils.getSchemaOption(conn, url, table)
```
I moved this into the case statements.
Since `JdbcUtils.tableExists` is used, `getSchemaOption` can be skipped for the other `SaveMode`s.
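A rough sketch of that control flow (all helper signatures are illustrative, not the exact Spark ones):

``` scala
import java.sql.Connection
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types.StructType

// Only the branches that insert into an already-existing table pay for
// the schema lookup; the drop-and-recreate and error paths skip it.
def writeToExistingTable(
    conn: Connection,
    mode: SaveMode,
    isTruncate: Boolean,
    getSchemaOption: Connection => Option[StructType],
    truncateTable: Connection => Unit,
    saveTable: Option[StructType] => Unit): Unit = mode match {
  case SaveMode.Overwrite if isTruncate =>
    truncateTable(conn)
    saveTable(getSchemaOption(conn)) // table survives, so keep its schema
  case SaveMode.Append =>
    saveTable(getSchemaOption(conn)) // use the existing table's names
  case _ =>
    saveTable(None)                  // recreated table: no lookup needed
}
```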
``` scala
    isCaseSensitive: Boolean,
    dialect: JdbcDialect): String = {
  val columns = if (tableSchema.isEmpty) {
    rddSchema.fields.map(x => dialect.quoteIdentifier(x.name)).mkString(",")
```
The legacy behavior is used when `tableSchema` is `None`.
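A minimal sketch of that fallback; the case-insensitive match below is a simplification of the resolver-based lookup:

``` scala
import org.apache.spark.sql.types.StructType

// When the table schema is unavailable (None), fall back to the RDD's
// own column names, preserving the pre-patch behavior; otherwise map
// each RDD column to its database-side spelling.
def resolvedColumnNames(
    rddSchema: StructType,
    tableSchema: Option[StructType]): Seq[String] = tableSchema match {
  case None => rddSchema.fields.map(_.name).toSeq // legacy behavior
  case Some(schema) =>
    rddSchema.fields.toSeq.map { f =>
      schema.fieldNames.find(_.equalsIgnoreCase(f.name)).getOrElse(f.name)
    }
}
```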
Test build #70746 has finished for PR 15664 at commit
|
The only failure is unrelated to this PR.

Retest this please.
Test build #70749 has finished for PR 15664 at commit
LGTM
Merging to master. Thanks!
Thank you, @gatorsmile.
[SPARK-18123][SQL] Use db column names instead of RDD column ones during JDBC Writing
## What changes were proposed in this pull request?
Apache Spark supports the following cases **by quoting RDD column names** while saving through JDBC.
- Allow a reserved keyword as a column name, e.g., `order`.
- Allow mixed-case column names, e.g., `[a: int, A: int]`:
``` scala
scala> val df = sql("select 1 a, 1 A")
df: org.apache.spark.sql.DataFrame = [a: int, A: int]
...
scala> df.write.mode("overwrite").format("jdbc").options(option).save()
scala> df.write.mode("append").format("jdbc").options(option).save()
```
This PR aims to use **database column names** instead of the RDD column names in order to additionally support the following case.
Note that this case succeeds with `MySQL` but fails on `Postgres`/`Oracle` before this patch: MySQL compares column names case-insensitively, while Postgres and Oracle treat quoted identifiers case-sensitively, so an INSERT that quotes the RDD name `"A"` does not match an existing column `a`.
``` scala
val df1 = sql("select 1 a")
val df2 = sql("select 1 A")
...
df1.write.mode("overwrite").format("jdbc").options(option).save()
df2.write.mode("append").format("jdbc").options(option).save()
```
## How was this patch tested?
Pass the Jenkins tests with a new test case.
Author: Dongjoon Hyun <[email protected]>
Author: gatorsmile <[email protected]>
Closes apache#15664 from dongjoon-hyun/SPARK-18123.