
Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

  • Override the default SQL strings in the DB2 Dialect for:

    • ALTER TABLE UPDATE COLUMN TYPE
    • ALTER TABLE UPDATE COLUMN NULLABILITY
  • Add a new Docker integration test suite: jdbc/v2/DB2IntegrationSuite.scala

Why are the changes needed?

In SPARK-24907, we implemented the JDBC v2 Table Catalog, but it does not yet support some ALTER TABLE operations. This PR adds DB2-specific ALTER TABLE support.
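For reference, here is a minimal sketch of what such DB2-specific overrides could look like. It assumes JdbcDialect exposes the getUpdateColumnTypeQuery and getUpdateColumnNullabilityQuery hooks introduced for the Oracle dialect; the exact code in this PR may differ.

import org.apache.spark.sql.jdbc.JdbcDialect

// Sketch only: DB2 uses "ALTER COLUMN ... SET DATA TYPE" and
// "SET NOT NULL" / "DROP NOT NULL" instead of the default SQL strings.
private object DB2Dialect extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  override def getUpdateColumnTypeQuery(
      tableName: String,
      columnName: String,
      newDataType: String): String =
    s"ALTER TABLE $tableName ALTER COLUMN $columnName SET DATA TYPE $newDataType"

  override def getUpdateColumnNullabilityQuery(
      tableName: String,
      columnName: String,
      isNullable: Boolean): String = {
    val nullability = if (isNullable) "DROP NOT NULL" else "SET NOT NULL"
    s"ALTER TABLE $tableName ALTER COLUMN $columnName $nullability"
  }
}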

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

By running new integration test suite:

$ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"

@huaxingao huaxingao changed the title Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect) [SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect) Oct 7, 2020
@SparkQA

SparkQA commented Oct 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34139/

@SparkQA

SparkQA commented Oct 8, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34139/

@SparkQA

SparkQA commented Oct 8, 2020

Test build #129534 has finished for PR 29972 at commit 8e2da8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

cc @cloud-fan @maropu @MaxGekk Could you please take a look? Thanks!

/**
* To run this test suite for a specific version (e.g., ibmcom/db2:11.5.4.0):
* {{{
* DB2_DOCKER_IMAGE_NAME=ibmcom/db2:11.5.4.0
Contributor

DB2 docker test is much simpler than Oracle? aea78d2#diff-a003dfa2ba6f747fa3ac7f4563e78325R34-R54

Contributor Author

The Oracle docker test includes instructions for how to build the docker image. The DB2 docker test and all the other docker tests assume the docker images are already available and only include instructions for how to run the tests. That's why the DB2 docker test looks much simpler.
For example, here is what we have for the MS SQL Server docker test:

/**
 * To run this test suite for a specific version (e.g., 2019-GA-ubuntu-16.04):
 * {{{
 *   MSSQLSERVER_DOCKER_IMAGE_NAME=2019-GA-ubuntu-16.04
 *     ./build/sbt -Pdocker-integration-tests "test-only *MsSqlServerIntegrationSuite"
 * }}}
 */

Member

@MaxGekk MaxGekk left a comment

Could you put the tests into a common trait and mix it into the Oracle and DB2 dialect test suites? The diff should only be in the catalog name, oracle vs db2, right?

@huaxingao
Contributor Author

Could you put the tests into a common trait and mix it into the Oracle and DB2 dialect test suites? The diff should only be in the catalog name, oracle vs db2, right?

There are actually a few more differences. For example, DB2 doesn't have a STRING type; it uses CHAR and VARCHAR. Also, Oracle allows updating a column's data type from INTEGER to STRING, but DB2 doesn't allow updating a column's data type from INTEGER to VARCHAR.

@MaxGekk
Member

MaxGekk commented Oct 8, 2020

For example, DB2 doesn't have a STRING type. It uses CHAR and VARCHAR.

You could define a type in the common trait, like StrType, and override it in the Oracle and DB2 test suites.

Oracle allows updating a column's data type from INTEGER to STRING, but DB2 doesn't allow updating a column's data type from INTEGER to VARCHAR.

ohh, I think we can remove such checks because:

  • I don't see any reason to test DBMS behavior in Spark's tests.
  • The checks don't improve test coverage.

Or, as an option, we could extract the dialect-specific tests into separate tests. For now, while reading the integration tests, it is hard to say why we duplicate the code.

trait V2JDBCTest extends SharedSparkSession {
  val catalogName: String
  // dialect specific update column type test
  def updateColumnType: Unit
Contributor Author

I can't find an update-column-type case that works for both DB2 and Oracle, so I will do this separately. For DB2, we update the column type from INT to DOUBLE; for Oracle, we update the column type from INT to STRING.
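To illustrate the shape this could take, here is a rough sketch of a shared trait with dialect-specific overrides for the update-column-type test. The names (V2JDBCTest, testUpdateColumnType) and the exact statements are illustrative, and the Docker/database setup members of the suites are omitted; the actual suites in this PR may differ in details.

// Sketch only: shared test trait; each dialect suite supplies the type change
// that its database actually supports.
trait V2JDBCTest extends SharedSparkSession {
  val catalogName: String

  // Dialect-specific: creates the table and performs the supported type change.
  def testUpdateColumnType(tbl: String): Unit

  test("SPARK-33034: ALTER TABLE ... update column type") {
    withTable(s"$catalogName.alt_table") {
      testUpdateColumnType(s"$catalogName.alt_table")
    }
  }
}

class DB2IntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest {
  override val catalogName: String = "db2"
  // DB2 supports changing INTEGER to DOUBLE, but not INTEGER to VARCHAR.
  override def testUpdateColumnType(tbl: String): Unit = {
    sql(s"CREATE TABLE $tbl (ID INTEGER) USING _")
    sql(s"ALTER TABLE $tbl ALTER COLUMN ID TYPE DOUBLE")
  }
}

class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest {
  override val catalogName: String = "oracle"
  // Oracle supports changing INTEGER to STRING.
  override def testUpdateColumnType(tbl: String): Unit = {
    sql(s"CREATE TABLE $tbl (ID INTEGER) USING _")
    sql(s"ALTER TABLE $tbl ALTER COLUMN ID TYPE STRING")
  }
}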

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34175/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34176/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34175/

@SparkQA

SparkQA commented Oct 9, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34176/

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129569 has finished for PR 29972 at commit 6c2919d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DB2IntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest
  • class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with V2JDBCTest
  • trait V2JDBCTest extends SharedSparkSession

@SparkQA

SparkQA commented Oct 9, 2020

Test build #129570 has finished for PR 29972 at commit 0f9284b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait V2JDBCTest extends SharedSparkSession


test("SPARK-33034: ALTER TABLE ... update column type") {
withTable(s"$catalogName.alt_table") {
updateColumnType
Contributor

should the name be prepareTableForUpdateTypeTest(tbl: String)?

Contributor

Ah, it has tests as well. How about testUpdateColumnType(tbl: String)?

assert(msg2.contains("Cannot update missing field bad_column"))
// Update column to wrong type
val msg3 = intercept[ParseException] {
  sql(s"ALTER TABLE $catalogName.alt_table ALTER COLUMN id TYPE bad_type")
Contributor

I think we should remove it, as it's a parser error that is not related to JDBC.

We can move it to DDLParserSuite if it's not tested there.
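For example, a parser-level check along these lines could live in DDLParserSuite (sketch only; parsePlan is assumed to be the suite's helper for parsing a SQL string, and no specific error message is asserted):

// Sketch: the same bad_type check expressed as a pure parser test.
intercept[ParseException] {
  parsePlan("ALTER TABLE t ALTER COLUMN id TYPE bad_type")
}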

override def dataPreparation(conn: Connection): Unit = {}

override def updateColumnType: Unit = {
  sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
Contributor

Can we test the table schema right after creation?
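Something along these lines could do it (sketch only; StructType and IntegerType come from org.apache.spark.sql.types, and the expected schema assumes DB2's single ID INTEGER column maps back to Spark's IntegerType):

// Sketch: verify the table schema immediately after creation.
sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
val t = spark.table(s"$catalogName.alt_table")
val expectedSchema = new StructType().add("ID", IntegerType)
assert(t.schema === expectedSchema)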

val msg = intercept[AnalysisException] {
  sql(s"ALTER TABLE oracle.not_existing_table ADD COLUMNS (C4 STRING)")

override def updateColumnType: Unit = {
  sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
Contributor

ditto

Contributor

@cloud-fan cloud-fan left a comment

thanks for unifying the tests!

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34312/

@SparkQA

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34312/

@SparkQA

SparkQA commented Oct 13, 2020

Test build #129706 has finished for PR 29972 at commit 8d0226b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sql(s"CREATE TABLE $catalogName.alt_table (ID STRING NOT NULL) USING _")
var t = spark.table(s"$catalogName.alt_table")
// nullable is true in the expecteSchema because Spark always sets nullable to true
// regardless of the JDBC metadata https://github.com/apache/spark/pull/18445
Contributor

I think we can change it in JDBC V2, as the table metadata is stored in the remote JDBC server directly. This can be done in a follow-up.

Contributor Author

I did a couple of quick tests using the V2 write API:

sql("INSERT INTO h2.test.people SELECT 'bob', null")

and

sql("SELECT null AS ID, 'bob' AS NAME").writeTo("h2.test.people")

I got an exception from the H2 JDBC driver:

Caused by: org.h2.jdbc.JdbcSQLException: NULL not allowed for column "ID"; SQL statement:
INSERT INTO "test"."people" ("NAME","ID") VALUES (?,?) [23502-195]
	at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)

So we are able to pass the null value for the NOT NULL column ID to H2, and H2 blocks the insert.

However, if I change the current code in JDBCRDD.resolveTable to set alwaysNullable = false, so that the real nullability is picked up,

  def resolveTable(options: JDBCOptions): StructType = {

          ......

          JdbcUtils.getSchema(rs, dialect, alwaysNullable = false)

For the insert, I got an exception from Spark:

Cannot write incompatible data to table 'test.people':
- Cannot write nullable values to non-null column 'ID';
org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'test.people':
- Cannot write nullable values to non-null column 'ID';
	at org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:72)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation$$anonfun$apply$31.applyOrElse(Analyzer.scala:3040)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation$$anonfun$apply$31.applyOrElse(Analyzer.scala:3035)

Spark blocks the insert, and we are not able to pass the null value for the NOT NULL column ID to H2. Since the whole point of #18445 is to let the underlying database decide how to process null for a NOT NULL column, I guess we will not change alwaysNullable for JDBC V2?

Contributor

This does expose a problem in Spark: most databases allow writing nullable data to a non-nullable column and fail at runtime if they see null values. I think Spark shouldn't block it at compile time. After all, nullability is more like a constraint, not part of the data type itself. cc @rdblue @dongjoon-hyun @viirya @maropu @MaxGekk

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in af3e2f7 Oct 13, 2020
@huaxingao huaxingao deleted the db2_docker branch October 13, 2020 17:40
@huaxingao
Contributor Author

Thanks! @cloud-fan @MaxGekk
