[SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect) #29972
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #129534 has finished for PR 29972 at commit
cc @cloud-fan @maropu @MaxGekk Could you please take a look? Thanks!
/**
 * To run this test suite for a specific version (e.g., ibmcom/db2:11.5.4.0):
 * {{{
 *   DB2_DOCKER_IMAGE_NAME=ibmcom/db2:11.5.4.0
The DB2 docker test is much simpler than the Oracle one? aea78d2#diff-a003dfa2ba6f747fa3ac7f4563e78325R34-R54
In the Oracle docker test, there are instructions for how to build the docker image. In the DB2 docker test and all the other docker tests, it is assumed that the docker image is already available, so they only have instructions for how to run the tests. That's why the DB2 docker test looks much simpler.
For example, here is what we have for the MS SQL Server docker test:
/**
* To run this test suite for a specific version (e.g., 2019-GA-ubuntu-16.04):
* {{{
* MSSQLSERVER_DOCKER_IMAGE_NAME=2019-GA-ubuntu-16.04
* ./build/sbt -Pdocker-integration-tests "test-only *MsSqlServerIntegrationSuite"
* }}}
*/
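By analogy, the DB2 suite added here is run the same way. The invocation below is assembled from the doc comment above and the sbt command in the "How was this patch tested?" section at the bottom of this PR, so the image tag and test pattern are the ones stated there:

DB2_DOCKER_IMAGE_NAME=ibmcom/db2:11.5.4.0
./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"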
...cker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala
MaxGekk left a comment
Could you put the tests into a common trait and mix it into the Oracle and DB2 dialect test suites? The diff should only be the catalog name, oracle vs db2, right?
There are actually a few more differences. For example, DB2 doesn't have a STRING type; it uses CHAR and VARCHAR. Also, Oracle allows updating a column's data type from INTEGER to STRING, but DB2 doesn't allow updating a column's data type from INTEGER to VARCHAR.
You could define a type in the common trait, like the sketch below.
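Something along these lines, perhaps (a sketch only; the stringType member and the sample test are assumptions, not the reviewer's actual snippet):

import org.apache.spark.sql.test.SharedSparkSession

trait V2JDBCTest extends SharedSparkSession {
  val catalogName: String
  // Dialect-specific type string, e.g. "STRING" where the dialect supports it
  // and a VARCHAR type for DB2, which has no STRING type.
  def stringType: String

  // Shared test that works for every dialect because it only refers to
  // the dialect-specific type through the abstract member above.
  test("ALTER TABLE ... add column") {
    withTable(s"$catalogName.alt_table") {
      sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
      sql(s"ALTER TABLE $catalogName.alt_table ADD COLUMNS (C1 $stringType)")
    }
  }
}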
ohh, I think we can remove such checks because:
Or, as an option, we could extract the dialect-specific tests into separate tests. For now, while reading the integration tests, it is hard to say why we duplicate the code.
trait V2JDBCTest extends SharedSparkSession {
  val catalogName: String
  // dialect specific update column type test
  def updateColumnType: Unit
I can't find an update-column-type case that works for both DB2 and Oracle, so I will do this separately. For DB2, we update the column type from INTEGER to DOUBLE. For Oracle, we update the column type from INTEGER to STRING.
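Based on that description, the two dialect-specific implementations would look roughly as follows (a hedged sketch: in the PR the overrides live directly in DB2IntegrationSuite and OracleIntegrationSuite, and the wrapper trait names here are only for illustration):

// DB2 allows widening a column from INTEGER to DOUBLE.
trait DB2UpdateColumnTypeTest extends V2JDBCTest {
  override def updateColumnType: Unit = {
    sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
    sql(s"ALTER TABLE $catalogName.alt_table ALTER COLUMN id TYPE DOUBLE")
  }
}

// Oracle allows changing a column from INTEGER to STRING.
trait OracleUpdateColumnTypeTest extends V2JDBCTest {
  override def updateColumnType: Unit = {
    sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
    sql(s"ALTER TABLE $catalogName.alt_table ALTER COLUMN id TYPE STRING")
  }
}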
Kubernetes integration test starting
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test status failure
Test build #129569 has finished for PR 29972 at commit
Test build #129570 has finished for PR 29972 at commit
test("SPARK-33034: ALTER TABLE ... update column type") {
  withTable(s"$catalogName.alt_table") {
    updateColumnType
should the name be prepareTableForUpdateTypeTest(tbl: String)?
ah it has tests as well. How about testUpdateColumnType(tbl: String)?
assert(msg2.contains("Cannot update missing field bad_column"))
// Update column to wrong type
val msg3 = intercept[ParseException] {
  sql(s"ALTER TABLE $catalogName.alt_table ALTER COLUMN id TYPE bad_type")
I think we should remove it as it's a parser error which is not related to JDBC.
We can move it to DDLParserSuite if it's not tested there.
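A parser-level test of that case could look roughly like this (a sketch, not an existing test in DDLParserSuite; the message check assumes the parser reports the unsupported type name):

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

class AlterColumnBadTypeParseSuite extends SparkFunSuite {
  test("ALTER TABLE ... ALTER COLUMN with an unknown type fails at parse time") {
    // The bad type is rejected by the SQL parser, before any catalog
    // (JDBC or otherwise) is involved.
    val e = intercept[ParseException] {
      CatalystSqlParser.parsePlan("ALTER TABLE t ALTER COLUMN id TYPE bad_type")
    }
    assert(e.getMessage.contains("bad_type"))
  }
}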
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
override def dataPreparation(conn: Connection): Unit = {}

override def updateColumnType: Unit = {
  sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
can we test the table schema right after creation?
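The requested check could look roughly like this (a sketch that would run inside the test body, where sql, spark and catalogName are in scope; the expected nullability follows the alwaysNullable behaviour discussed further down):

import org.apache.spark.sql.types.{IntegerType, StructType}

sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
// Verify the schema right after creation, before any ALTER TABLE.
val created = spark.table(s"$catalogName.alt_table")
// Nullable is true because the JDBC read path always reports columns as nullable.
val expectedSchema = new StructType().add("ID", IntegerType, nullable = true)
assert(created.schema === expectedSchema)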
val msg = intercept[AnalysisException] {
  sql(s"ALTER TABLE oracle.not_existing_table ADD COLUMNS (C4 STRING)")

override def updateColumnType: Unit = {
  sql(s"CREATE TABLE $catalogName.alt_table (ID INTEGER) USING _")
ditto
cloud-fan left a comment
thanks for unifying the tests!
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129706 has finished for PR 29972 at commit
| sql(s"CREATE TABLE $catalogName.alt_table (ID STRING NOT NULL) USING _") | ||
| var t = spark.table(s"$catalogName.alt_table") | ||
| // nullable is true in the expecteSchema because Spark always sets nullable to true | ||
| // regardless of the JDBC metadata https://github.com/apache/spark/pull/18445 |
I think we can change it in JDBC V2, as the table metadata is stored in the remote JDBC server directly. This can be done in a followup.
I did a couple of quick tests using the V2 write API:
sql("INSERT INTO h2.test.people SELECT 'bob', null")
and
sql("SELECT null AS ID, 'bob' AS NAME").writeTo("h2.test.people").append()
I got an exception from the H2 JDBC driver:
Caused by: org.h2.jdbc.JdbcSQLException: NULL not allowed for column "ID"; SQL statement:
INSERT INTO "test"."people" ("NAME","ID") VALUES (?,?) [23502-195]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
So we are able to pass the null value for the NOT NULL column ID to H2, and H2 blocks the insert.
However, if I change the current code in JDBCRDD.resolveTable to set alwaysNullable = false, so that we get the real nullability,
def resolveTable(options: JDBCOptions): StructType = {
  ......
  JdbcUtils.getSchema(rs, dialect, alwaysNullable = false)
then for the insert I get an exception from Spark:
Cannot write incompatible data to table 'test.people':
- Cannot write nullable values to non-null column 'ID';
org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'test.people':
- Cannot write nullable values to non-null column 'ID';
at org.apache.spark.sql.catalyst.analysis.TableOutputResolver$.resolveOutputColumns(TableOutputResolver.scala:72)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation$$anonfun$apply$31.applyOrElse(Analyzer.scala:3040)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation$$anonfun$apply$31.applyOrElse(Analyzer.scala:3035)
Spark blocks the insert, and we are not able to pass the null value for the NOT NULL column ID to H2. Since the whole point of #18445 is to let the underlying database decide how to process null for a NOT NULL column, I guess we will not change this alwaysNullable for JDBC V2?
This does expose a problem in Spark: most databases allow writing nullable data to a non-nullable column and fail at runtime if they see null values. I think Spark shouldn't block it at compile time. After all, nullability is more like a constraint, not part of the data type itself. cc @rdblue @dongjoon-hyun @viirya @maropu @MaxGekk
thanks, merging to master!
Thanks! @cloud-fan @MaxGekk
What changes were proposed in this pull request?
Override the default SQL strings in the DB2 dialect for updating the type and nullability of columns (see the sketch after this list).
Add a new docker integration test suite: jdbc/v2/DB2IntegrationSuite.scala
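A rough sketch of what the dialect-side changes amount to (hedged: the method names follow JdbcDialect's getUpdateColumnTypeQuery / getUpdateColumnNullabilityQuery hooks used by the JDBC v2 ALTER TABLE path, but the exact SQL strings below are assumptions based on DB2's ALTER COLUMN syntax, not copied from the diff):

import org.apache.spark.sql.jdbc.JdbcDialect

private case object DB2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  // DB2 syntax for changing a column's data type.
  override def getUpdateColumnTypeQuery(
      tableName: String,
      columnName: String,
      newDataType: String): String =
    s"ALTER TABLE $tableName ALTER COLUMN $columnName SET DATA TYPE $newDataType"

  // DB2 syntax for changing a column's nullability.
  override def getUpdateColumnNullabilityQuery(
      tableName: String,
      columnName: String,
      isNullable: Boolean): String = {
    val nullable = if (isNullable) "DROP NOT NULL" else "SET NOT NULL"
    s"ALTER TABLE $tableName ALTER COLUMN $columnName $nullable"
  }
}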
Why are the changes needed?
In SPARK-24907, we implemented the JDBC v2 Table Catalog, but it doesn't support some ALTER TABLE operations at the moment. This PR adds DB2-specific ALTER TABLE support.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
By running the new integration test suite:
$ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"