Conversation

@sureshthalamati
Contributor

sureshthalamati commented Dec 8, 2016

What changes were proposed in this pull request?

Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables in the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.

The solution is to allow users to specify the database column data types for the created table via a JDBC data source option (createTableColumnTypes) on write. The data type information can be specified in the same format as a table schema DDL (e.g. name CHAR(64), comments VARCHAR(1024)).

Not all target database types can be specified; the data types also have to be valid Spark SQL data types. For example, users cannot specify the target database CLOB data type. This will be supported in a follow-up PR.

Example:

df.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc(url, "TEST.DBCOLTYPETEST", properties)

How was this patch tested?

Added new test cases to the JDBCWriteSuite

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69851 has finished for PR 16209 at commit 6eec6ca.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69855 has started for PR 16209 at commit faa8172.

@gatorsmile
Member

gatorsmile commented Dec 8, 2016

@rxin @JoshRosen @srowen Which solution is preferred for supporting customized column types? Table-level JDBC option or column metadata property? Thanks!

FYI: This PR is based on the table-level JDBC option.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69871 has finished for PR 16209 at commit faa8172.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from faa8172 to ff71bac Compare January 16, 2017 00:45
@SparkQA

SparkQA commented Jan 16, 2017

Test build #71410 has finished for PR 16209 at commit ff71bac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

This PR follows the same approach that Avro took in Hive. The linked file is an example shown in our test case.

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 16, 2017
### What changes were proposed in this pull request?

Specifying the table schema in DDL formats is needed for different scenarios. For example,
- [specifying the schema in SQL function `from_json` using DDL formats](https://issues.apache.org/jira/browse/SPARK-19637), which was suggested by marmbrus,
- [specifying the customized JDBC data types](apache#16209).

These two PRs require users to use the JSON format to specify the table schema. This is not user-friendly.

This PR is to provide a `parseTableSchema` API in `ParserInterface`.

### How was this patch tested?
Added a test suite `TableSchemaParserSuite`

Author: Xiao Li <[email protected]>

Closes apache#17171 from gatorsmile/parseDDLStmt.
@gatorsmile
Member

@sureshthalamati #17171 has been resolved. Can you update your PR by allowing users to specify the schema in DDL format?

@sureshthalamati
Contributor Author

@gatorsmile sure. I will update the PR with the DDL format approach.

@sureshthalamati
Contributor Author

@gatorsmile I like the DDL schema format approach, but the method CatalystSqlParser.parseTableSchema(sql) works only if the target database data type the user wants to specify also exists in Spark. For example, if a user wants to specify CLOB(200K), it will not work because that is not a valid data type in Spark.

How about a simple comma-separated list, with the restriction that a comma cannot appear in a column name when this option is used? I am guessing that would work in most scenarios.

Any suggestions?
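
For illustration, a minimal sketch of the parser behavior being discussed (this assumes Spark's CatalystSqlParser from the catalyst parser package; the CLOB line is hypothetical input, not code from this PR):

    import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

    // Column lists made of valid Spark SQL types parse into a StructType;
    // CHAR/VARCHAR columns are treated as strings on the Spark side.
    val schema = CatalystSqlParser.parseTableSchema("name CHAR(64), comments VARCHAR(1024)")

    // A database-only type such as CLOB is rejected, because the DDL parser
    // only understands Spark SQL types:
    // CatalystSqlParser.parseTableSchema("notes CLOB(200K)")  // throws ParseException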

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from ff71bac to e76b7e0 Compare March 17, 2017 04:51
@SparkQA

SparkQA commented Mar 17, 2017

Test build #74721 has finished for PR 16209 at commit e76b7e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 18, 2017

Sorry for cutting in, but IMHO we need general logic to inject user-defined types via UDTRegistration into the DDL parser (CatalystSqlParser). If we had that logic, we could use the types in an expression string.

@gatorsmile
Member

Yes, we need to extend the DDL parser to support general user-defined types.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from e76b7e0 to 95ac9a0 Compare March 21, 2017 01:46
@sureshthalamati sureshthalamati changed the title [WIP][SPARK-10849][SQL] Adds option to the JDBC data source for user to specify database column type for the create table [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table Mar 21, 2017
@SparkQA

SparkQA commented Mar 21, 2017

Test build #74920 has finished for PR 16209 at commit 95ac9a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Are we assuming the name comparison is always case sensitive?

Contributor Author

Thank you for the review. Good question; I updated the PR with case-sensitivity handling. Now column names from the user-specified schema are matched against the DataFrame schema based on the SQLConf.CASE_SENSITIVE flag.
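
A rough sketch of that kind of conf-driven matching (the helper name and error type here are illustrative, not the exact code in this PR):

    import org.apache.spark.sql.types.StructType

    // Sketch only: verify that every user-specified column exists in the DataFrame
    // schema, with name equality driven by the case-sensitivity flag.
    def checkUserColumns(userSchema: StructType, dfSchema: StructType, caseSensitive: Boolean): Unit = {
      val nameEquality: (String, String) => Boolean = { (a: String, b: String) =>
        if (caseSensitive) a == b else a.equalsIgnoreCase(b)
      }
      userSchema.fieldNames.foreach { col =>
        if (!dfSchema.fieldNames.exists(f => nameEquality(f, col))) {
          throw new IllegalArgumentException(
            s"createTableColumnTypes option column $col not found in the DataFrame schema")
        }
      }
    }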

Member

We can create a partial function here.

Contributor Author

Done. Moved it to a separate function. Thanks for the suggestion.

Member

This is the case where users specify all the columns. In this case, we should mix up the order of the columns.

In addition, we also need a case where users specify only one or two columns.

Member

nit: Drop the s interpolation prefix at the head of the string.

Contributor Author

Thanks for the review, @maropu. Fixed it.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from 95ac9a0 to 95e47a7 Compare March 22, 2017 23:30
@gatorsmile
Member

LGTM pending Jenkins

cc @rxin @JoshRosen @srowen

This is a nice option to have for JDBC users. If there are no further comments, I will merge it to master tomorrow. Thanks!

@SparkQA

SparkQA commented Mar 23, 2017

Test build #75068 has finished for PR 16209 at commit 95e47a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya Mar 23, 2017

nit: looks like wrong indent.

Contributor Author

Will fix it.

Member

@viirya Mar 23, 2017

"The specified types should be valid spark sql data types": what does this mean? Do you mean VARCHAR(1024)?

Is VARCHAR(1024) a valid Spark SQL data type? This description might need to be changed.

Contributor Author

VARCHAR(1024) is a valid data type in Spark SQL; it gets translated to String internally in Spark. The data types specified in this property are meant for the target database; VARCHAR is used, for example, because many RDBMSs do not have a String data type.

Thank you for reviewing, @viirya.
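
To illustrate the point (spark, url, and properties are assumed to be in scope; this is not code from the PR):

    // The DataFrame column is a Spark string; createTableColumnTypes only controls
    // the database-side type used in the generated CREATE TABLE statement.
    val df = spark.createDataFrame(Seq((1, "some comment"))).toDF("id", "comments")
    df.write
      .option("createTableColumnTypes", "comments VARCHAR(1024)")
      .jdbc(url, "TEST.DBCOLTYPETEST", properties)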

Member

Yeah, I see it works internally. However, it looks like this kind of type is not explicitly documented: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types

So I have a little concern regarding the description here.

Member

@viirya Mar 23, 2017

Am I missing something? It looks like you have 4 columns, but the schema has only 3 fields. Is that intentional? You don't use the last column, though.

Contributor Author

Forgot to delete the extra value. Will fix it.

@viirya
Member

viirya commented Mar 23, 2017

LGTM except for a few minor comments.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from 95e47a7 to 6f51d3f Compare March 23, 2017 17:59
@SparkQA

SparkQA commented Mar 23, 2017

Test build #75106 has finished for PR 16209 at commit 6f51d3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@cbyn

cbyn commented Jan 3, 2018

@sureshthalamati is it possible to write strings as LONGTEXT? I'm having difficulty understanding which types are possible.

@robbyki

robbyki commented Jan 4, 2018

Is there a recommended workaround to achieve exactly this in Spark 2.1? I'm going through several resources to try to understand how to maintain my schema created outside of Spark and then just truncate my tables from Spark, followed by writing with a save mode of overwrite. My problem is exactly this issue: my database (Netezza) fails when it sees Spark trying to save a TEXT data type, so I have to specify in my custom JDBC dialect to use VARCHAR(n). That does work, but it replaces all of my VARCHAR columns (different lengths for different columns) with whatever length I specified in the dialect, which is not what I want. How can I have it save the TEXT as VARCHAR without specifying a single length in the custom dialect?

@sureshthalamati
Contributor Author

@cbyn The specified types have to be valid Spark SQL data types; LONGTEXT is probably not one of the types supported by Spark SQL syntax.

@robbyki The problem with a dialect, as you noticed, is that it applies to all the columns. The only workaround is to create the table explicitly in Netezza and then save to it. There is also a truncate option if you need to empty the table before saving; that typically keeps the table as you created it (see the sketch below).
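
A minimal sketch of that workaround, assuming the table was already created by hand with the desired column types (the URL, table name, and connection properties are placeholders):

    import org.apache.spark.sql.SaveMode

    // "truncate" empties the existing table instead of dropping and recreating it,
    // so the manually created table definition (and its column types) is kept.
    df.write
      .mode(SaveMode.Overwrite)
      .option("truncate", "true")
      .jdbc(netezzaUrl, "MYSCHEMA.MYTABLE", connectionProperties)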

Please post questions to the Spark user list; you will get answers quickly from other users and developers. People will not notice comments on closed PRs.

@cbyn

cbyn commented Jan 8, 2018

Thanks @sureshthalamati. I thought the idea was to specify the destination database type. E.g., writing Spark SQL strings as VARCHAR works, but VARCHAR is not a Spark SQL type. (I'm using the VARCHAR feature and I'm very grateful for the addition!)

@clwang24

clwang24 commented Nov 18, 2019

Is it now possible to specify the target database CLOB or BLOB data type? @sureshthalamati I used the "createTableColumnTypes" option when writing BLOB data to an Oracle database, but it returned the error "ORA-00902 invalid datatype". The Scala code is shown below:

    df.write.mode(SaveMode.Overwrite)
      .option("createTableColumnTypes", "id int, Name binary")
      .jdbc(url1, "TEST.USERDBTYPETEST", properties)

Then I tried changing the column type to BLOB, but it returned the error "DataType blob is not supported.(line 1, pos 437)". The Scala code is shown below:

    df.write.mode(SaveMode.Overwrite)
      .option("createTableColumnTypes", "id int, Name blob")
      .jdbc(url1, "TEST.USERDBTYPETEST", properties)

Finally, I tried not specifying the column type, but it returned another error, "ORA-12899 value too large for column (actual 581, maximum 255)", because the String data type is mapped to VARCHAR(255) by default. Can you spare some time to help me with this question? Thanks
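
For reference, the last error is the situation this option is meant for: the string column can be given an explicit, larger database-side type, while BLOB/CLOB still cannot be expressed through it because they are not Spark SQL types. A rough sketch, with the length chosen arbitrarily:

    // Sketch only: widen the database-side type so the default VARCHAR(255)
    // mapping no longer rejects long strings; blob/clob remain unsupported here.
    df.write.mode(SaveMode.Overwrite)
      .option("createTableColumnTypes", "id INT, Name VARCHAR(4000)")
      .jdbc(url1, "TEST.USERDBTYPETEST", properties)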
