
Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

  • Override the default SQL strings in the DB2 Dialect for:

    • ALTER TABLE UPDATE COLUMN TYPE
    • ALTER TABLE UPDATE COLUMN NULLABILITY
  • Add a new Docker integration test suite: jdbc/v2/DB2IntegrationSuite.scala

Why are the changes needed?

In SPARK-24907, we implemented the JDBC v2 Table Catalog, but it does not yet support some ALTER TABLE operations. This PR adds DB2-specific ALTER TABLE support.
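For reference, here is a minimal sketch of what such DB2-specific overrides could look like. It assumes JdbcDialect exposes the getUpdateColumnTypeQuery and getUpdateColumnNullabilityQuery hooks introduced for the Oracle dialect; the exact code in this PR may differ.

import org.apache.spark.sql.jdbc.JdbcDialect

// Sketch only: DB2 uses "ALTER COLUMN ... SET DATA TYPE" and
// "SET NOT NULL" / "DROP NOT NULL" instead of the default SQL strings.
private object DB2Dialect extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  override def getUpdateColumnTypeQuery(
      tableName: String,
      columnName: String,
      newDataType: String): String =
    s"ALTER TABLE $tableName ALTER COLUMN $columnName SET DATA TYPE $newDataType"

  override def getUpdateColumnNullabilityQuery(
      tableName: String,
      columnName: String,
      isNullable: Boolean): String = {
    val nullability = if (isNullable) "DROP NOT NULL" else "SET NOT NULL"
    s"ALTER TABLE $tableName ALTER COLUMN $columnName $nullability"
  }
}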

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

By running new integration test suite:

$ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"

@huaxingao huaxingao changed the title Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect) [SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect) Oct 7, 2020
@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34139/

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34139/

@SparkQA

SparkQA commented Oct 8, 2020

Test build #129534 has finished for PR 29972 at commit 8e2da8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

cc @cloud-fan @maropu @MaxGekk Could you please take a look? Thanks!

/**
* To run this test suite for a specific version (e.g., ibmcom/db2:11.5.4.0):
* {{{
* DB2_DOCKER_IMAGE_NAME=ibmcom/db2:11.5.4.0
Contributor

DB2 docker test is much simpler than Oracle? aea78d2#diff-a003dfa2ba6f747fa3ac7f4563e78325R34-R54

Contributor Author

The Oracle docker test includes instructions for how to build the docker image. The DB2 docker test and all the other docker tests assume the docker images are already available and only include instructions for how to run the tests. That's why the DB2 docker test looks much simpler.
For example, here is what we have for the MS SQL Server docker test:

/**
 * To run this test suite for a specific version (e.g., 2019-GA-ubuntu-16.04):
 * {{{
 *   MSSQLSERVER_DOCKER_IMAGE_NAME=2019-GA-ubuntu-16.04
 *     ./build/sbt -Pdocker-integration-tests "test-only *MsSqlServerIntegrationSuite"
 * }}}
 */

Member

@MaxGekk MaxGekk left a comment

Could you put the tests into a common trait and mix it into the Oracle and DB2 dialect test suites? The diff should only be in the catalog name, oracle vs db2, right?

@huaxingao
Contributor Author

Could you put the tests into a common trait and mix it into the Oracle and DB2 dialect test suites? The diff should only be in the catalog name, oracle vs db2, right?

There are actually a few more differences. For example, DB2 doesn't have a STRING type; it uses CHAR and VARCHAR. Also, Oracle allows updating a column's data type from INTEGER to STRING, but DB2 doesn't allow updating a column's data type from INTEGER to VARCHAR.

@MaxGekk
Member

MaxGekk commented Oct 8, 2020

For example, DB2 doesn't have a STRING type. It uses CHAR and VARCHAR.

You could define a type in the common trait, like StrType, and override it in the Oracle and DB2 test suites.

Oracle allows updating a column's data type from INTEGER to STRING, but DB2 doesn't allow updating a column's data type from INTEGER to VARCHAR.

ohh, I think we can remove such checks because:

  • I don't see any reason to test DBMS behavior in Spark's tests.
  • The checks don't improve test coverage.

Or, as an option, we could extract the dialect-specific tests into separate tests. For now, while reading the integration tests, it is hard to say why we duplicate the code.

trait V2JDBCTest extends SharedSparkSession {
  val catalogName: String
  // dialect specific update column type test
  def updateColumnType: Unit
Contributor Author

I can't find an update-column-type case that works for both DB2 and Oracle, so I will do this separately. For DB2, we update the column type from INT to DOUBLE; for Oracle, we update the column type from INT to STRING.
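To illustrate the shape this could take, here is a rough sketch of a shared trait with dialect-specific overrides for the update-column-type test. The names (V2JDBCTest, testUpdateColumnType) and the exact statements are illustrative, and the Docker/database setup members of the suites are omitted; the actual suites in this PR may differ in details.

// Sketch only: shared test trait; each dialect suite supplies the type change
// that its database actually supports.
trait V2JDBCTest extends SharedSparkSession {
  val catalogName: String

  // Dialect-specific: creates the table and performs the supported type change.
  def testUpdateColumnType(tbl: String): Unit

  test("SPARK-33034: ALTER TABLE ... update column type") {
    withTable(s"$catalogName.alt_table") {
      testUpdateColumnType(s"$catalogName.alt_table")
    }
  }
}

class DB2IntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest {
  override val catalogName: String = "db2"
  // DB2 supports changing INTEGER to DOUBLE, but not INTEGER to VARCHAR.
  override def testUpdateColumnType(tbl: String): Unit = {
    sql(s"CREATE TABLE $tbl (ID INTEGER) USING _")
    sql(s"ALTER TABLE $tbl ALTER COLUMN ID TYPE DOUBLE")
  }
}

class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest {
  override val catalogName: String = "oracle"
  // Oracle supports changing INTEGER to STRING.
  override def testUpdateColumnType(tbl: String): Unit = {
    sql(s"CREATE TABLE $tbl (ID INTEGER) USING _")
    sql(s"ALTER TABLE $tbl ALTER COLUMN ID TYPE STRING")
  }
}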

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34175/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34176/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34175/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34176/

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129569 has finished for PR 29972 at commit 6c2919d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DB2IntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest
  • class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest
  • trait V2JDBCTest extends SharedSparkSession

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129570 has finished for PR 29972 at commit 0f9284b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait V2JDBCTest extends SharedSparkSession


test("SPARK-33034: ALTER TABLE ... update column type") {
withTable(s"$catalogName.alt_table") {
updateColumnType
Contributor

should the name be prepareTableForUpdateTypeTest(tbl: String)?

Contributor

Ah, it has tests as well. How about testUpdateColumnType(tbl: String)?

assert(msg2.contains("Cannot update missing field bad_column"))
// Update column to wrong type
val msg3 = intercept[ParseException] {
  sql(s"ALTER TABLE $catalogName.alt_table ALTER COLUMN id TYPE bad_type")
Contributor

I think we should remove it, as it's a parser error that is not related to JDBC.

We can move it to DDLParserSuite if it's not tested there.
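For example, a parser-level check along these lines could live in DDLParserSuite (sketch only; parsePlan is assumed to be the suite's helper for parsing a SQL string, and no specific error message is asserted):

// Sketch: the same bad_type check expressed as a pure parser test.
intercept[ParseException] {
  parsePlan("ALTER TABLE t ALTER COLUMN id TYPE bad_type")
}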

override def dataPreparation(conn: Connection): Unit = {}

override def updateColumnType: Unit = {
  sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
Contributor

Can we test the table schema right after creation?
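Something along these lines could do it (sketch only; StructType and IntegerType come from org.apache.spark.sql.types, and the expected schema assumes DB2's single ID INTEGER column maps back to Spark's IntegerType):

// Sketch: verify the table schema immediately after creation.
sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
val t = spark.table(s"$catalogName.alt_table")
val expectedSchema = new StructType().add("ID", IntegerType)
assert(t.schema === expectedSchema)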

val msg = intercept[AnalysisException] {
  sql(s"ALTER TABLE oracle.not_existing_table ADD COLUMNS (C4 STRING)")

override def updateColumnType: Unit = {
  sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
Contributor

ditto

Contributor

@cloud-fan cloud-fan left a comment

thanks for unifying the tests!

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34312/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34312/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129706 has finished for PR 29972 at commit 8d0226b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sql(s"CREATE TABLE $catalogName.alt_table (ID STRING NOT NULL) USING _")
var t = spark.table(s"$catalogName.alt_table")
// nullable is true in the expecteSchema because Spark always sets nullable to true
// regardless of the JDBC metadata https://github.com/apache/spark/pull/18445
Contributor

I think we can change it in JDBC V2, as the table metadata is stored in the remote JDBC server directly. This can be done in a follow-up.

Contributor Author

I did a couple of quick tests using the V2 write API:

sql("INSERT INTO h2.test.people SELECT 'bob', null")

and

sql("SELECT null AS ID, 'bob' AS NAME").writeTo("h2.test.people")

I got an exception from the H2 JDBC driver:

Caused by: org.h2.jdbc.JdbcSQLException: NULL not allowed for column "ID"; SQL statement:
INSERT INTO "test"."people" ("NAME","ID") VALUES (?,?) [23502-195]
	at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)

So we are able to pass the null value for the NOT NULL column ID to H2, and H2 blocks the insert.

However, if I change the current code in JDBCRDD.resolveTable to set alwaysNullable = false, so that the real nullability is picked up,

  def resolveTable(options: JDBCOptions): StructType = {

          ......

          JdbcUtils.getSchema(rs, dialect, alwaysNullable = false)

For the insert, I got an exception from Spark:

Cannot write incompatible data to table 'test.people':
- Cannot write nullable values to non-null column 'ID';
org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'test.people':
- Cannot write nullable values to non-null column 'ID';
	at org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:72)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation$$anonfun$apply$31.applyOrElse(Analyzer.scala:3040)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation$$anonfun$apply$31.applyOrElse(Analyzer.scala:3035)

Spark blocks the insert, and we are not able to pass the null value for the NOT NULL column ID to H2. Since the whole point of #18445 is to let the underlying database decide how to process null for a NOT NULL column, I guess we will not change alwaysNullable for JDBC V2?

Contributor

This does expose a problem in Spark: most databases allow writing nullable data to a non-nullable column and fail at runtime if they see null values. I think Spark shouldn't block it at compile time. After all, nullability is more like a constraint, not part of the data type itself. cc @rdblue @dongjoon-hyun @viirya @maropu @MaxGekk

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in af3e2f7 Oct 13, 2020
@huaxingao huaxingao deleted the db2_docker branch October 13, 2020 17:40
@huaxingao
Contributor Author

Thanks! @cloud-fan @MaxGekk
