Conversation

@sureshthalamati
Contributor

sureshthalamati commented Dec 8, 2016

What changes were proposed in this pull request?

Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables in the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.

The solution is to allow users to specify the database column data types for the created table via a JDBC data source option (createTableColumnTypes) on write. The data type information can be specified in the same format as a table schema DDL (e.g. name CHAR(64), comments VARCHAR(1024)).

Not all target database types can be specified; the data types also have to be valid Spark SQL data types. For example, users cannot specify the target database CLOB data type. This will be supported in a follow-up PR.

Example:

df.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc(url, "TEST.DBCOLTYPETEST", properties)

How was this patch tested?

Added new test cases to the JDBCWriteSuite

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69851 has finished for PR 16209 at commit 6eec6ca.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69855 has started for PR 16209 at commit faa8172.

@gatorsmile
Member

gatorsmile commented Dec 8, 2016

@rxin @JoshRosen @srowen Which solution is preferred for supporting customized column types? Table-level JDBC option or column metadata property? Thanks!

FYI: This PR is based on the table-level JDBC option.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Dec 8, 2016

Test build #69871 has finished for PR 16209 at commit faa8172.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from faa8172 to ff71bac Compare January 16, 2017 00:45
@SparkQA

SparkQA commented Jan 16, 2017

Test build #71410 has finished for PR 16209 at commit ff71bac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

This PR follows the same approach that Avro took in Hive. The linked file is an example shown in our test case.

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 16, 2017
### What changes were proposed in this pull request?

Specifying the table schema in DDL formats is needed for different scenarios. For example,
- [specifying the schema in SQL function `from_json` using DDL formats](https://issues.apache.org/jira/browse/SPARK-19637), which was suggested by marmbrus,
- [specifying the customized JDBC data types](apache#16209).

These two PRs require users to use the JSON format to specify the table schema. This is not user-friendly.

This PR is to provide a `parseTableSchema` API in `ParserInterface`.

### How was this patch tested?
Added a test suite `TableSchemaParserSuite`

Author: Xiao Li <[email protected]>

Closes apache#17171 from gatorsmile/parseDDLStmt.
@gatorsmile
Member

@sureshthalamati #17171 has been resolved. Can you update your PR by allowing users to specify the schema in DDL format?

@sureshthalamati
Contributor Author

@gatorsmile sure. I will update the PR with the DDL format approach.

@sureshthalamati
Contributor Author

@gatorsmile I like the DDL schema format approach, but the method CatalystSqlParser.parseTableSchema(sql) works only if the target database data type the user wants to specify also exists in Spark. For example, if a user wants to specify CLOB(200K), it will not work because that is not a valid data type in Spark.

How about a simple comma-separated list, with the restriction that a comma cannot appear in a column name when this option is used? I am guessing that would work in most scenarios.

Any suggestions?
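
For illustration, a minimal sketch of the parser behavior being discussed (this assumes Spark's CatalystSqlParser from the catalyst parser package; the CLOB line is hypothetical input, not code from this PR):

    import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

    // Column lists made of valid Spark SQL types parse into a StructType;
    // CHAR/VARCHAR columns are treated as strings on the Spark side.
    val schema = CatalystSqlParser.parseTableSchema("name CHAR(64), comments VARCHAR(1024)")

    // A database-only type such as CLOB is rejected, because the DDL parser
    // only understands Spark SQL types:
    // CatalystSqlParser.parseTableSchema("notes CLOB(200K)")  // throws ParseException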

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from ff71bac to e76b7e0 Compare March 17, 2017 04:51
@SparkQA

SparkQA commented Mar 17, 2017

Test build #74721 has finished for PR 16209 at commit e76b7e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 18, 2017

Sorry for cutting in, but IMHO we need general logic to inject user-defined types via UDTRegistration into the DDL parser (CatalystSqlParser). If we had that logic, we could use the types in an expression string.

@gatorsmile
Member

Yes, we need to extend the DDL parser to support general user-defined types.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from e76b7e0 to 95ac9a0 Compare March 21, 2017 01:46
@sureshthalamati sureshthalamati changed the title [WIP][SPARK-10849][SQL] Adds option to the JDBC data source for user to specify database column type for the create table [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table Mar 21, 2017
@SparkQA

SparkQA commented Mar 21, 2017

Test build #74920 has finished for PR 16209 at commit 95ac9a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

Are we assuming the name comparison is always case sensitive?

Contributor Author

Thank you for the review. Good question; I updated the PR with case-sensitivity handling. Now column names from the user-specified schema are matched against the DataFrame schema based on the SQLConf.CASE_SENSITIVE flag.
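
A rough sketch of that kind of conf-driven matching (the helper name and error type here are illustrative, not the exact code in this PR):

    import org.apache.spark.sql.types.StructType

    // Sketch only: verify that every user-specified column exists in the DataFrame
    // schema, with name equality driven by the case-sensitivity flag.
    def checkUserColumns(userSchema: StructType, dfSchema: StructType, caseSensitive: Boolean): Unit = {
      val nameEquality: (String, String) => Boolean = { (a: String, b: String) =>
        if (caseSensitive) a == b else a.equalsIgnoreCase(b)
      }
      userSchema.fieldNames.foreach { col =>
        if (!dfSchema.fieldNames.exists(f => nameEquality(f, col))) {
          throw new IllegalArgumentException(
            s"createTableColumnTypes option column $col not found in the DataFrame schema")
        }
      }
    }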

Member

We can create a partial function here.

Contributor Author

Done. Moved it to a separate function. Thanks for the suggestion.

Member

This is the case where users specify all the columns. In this case, we should mix up the order of the columns.

In addition, we also need a case where users specify only one or two columns.

Member

nit: Drop the s interpolation prefix at the head of the string.

Contributor Author

Thanks for the review, @maropu. Fixed it.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from 95ac9a0 to 95e47a7 Compare March 22, 2017 23:30
@gatorsmile
Member

LGTM pending Jenkins

cc @rxin @JoshRosen @srowen

This is a nice option to have for JDBC users. If there are no further comments, I will merge it to master tomorrow. Thanks!

@SparkQA

SparkQA commented Mar 23, 2017

Test build #75068 has finished for PR 16209 at commit 95e47a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya Mar 23, 2017

nit: looks like wrong indent.

Contributor Author

Will fix it.

Member

@viirya Mar 23, 2017

"The specified types should be valid spark sql data types": what does this mean? Do you mean VARCHAR(1024)?

Is VARCHAR(1024) a valid Spark SQL data type? This description might need to be changed.

Contributor Author

VARCHAR(1024) is a valid data type in Spark SQL; it gets translated to String internally in Spark. The data types specified in this property are meant for the target database; VARCHAR is used, for example, because many RDBMSs do not have a String data type.

Thank you for reviewing, @viirya.
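
To illustrate the point (spark, url, and properties are assumed to be in scope; this is not code from the PR):

    // The DataFrame column is a Spark string; createTableColumnTypes only controls
    // the database-side type used in the generated CREATE TABLE statement.
    val df = spark.createDataFrame(Seq((1, "some comment"))).toDF("id", "comments")
    df.write
      .option("createTableColumnTypes", "comments VARCHAR(1024)")
      .jdbc(url, "TEST.DBCOLTYPETEST", properties)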

Member

Yeah, I see it works internally. However, it looks like this kind of type is not explicitly documented: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types

So I have a little concern regarding the description here.

Member

@viirya Mar 23, 2017

Am I missing something? It looks like you have 4 columns, but the schema has only 3 fields. Is that intentional? You don't use the last column, though.

Contributor Author

Forgot to delete the extra value. Will fix it.

@viirya
Member

viirya commented Mar 23, 2017

LGTM except for a few minor comments.

@sureshthalamati sureshthalamati force-pushed the jdbc_custom_dbtype_option_json-spark-10849 branch from 95e47a7 to 6f51d3f Compare March 23, 2017 17:59
@SparkQA

SparkQA commented Mar 23, 2017

Test build #75106 has finished for PR 16209 at commit 6f51d3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@cbyn

cbyn commented Jan 3, 2018

@sureshthalamati is it possible to write strings as LONGTEXT? I'm having difficulty understanding which types are possible.

@robbyki

robbyki commented Jan 4, 2018

Is there a recommended workaround to achieve exactly this in Spark 2.1? I'm going through several resources to try to understand how to maintain my schema created outside of Spark and then just truncate my tables from Spark, followed by writing with a save mode of overwrite. My problem is exactly this issue: my database (Netezza) fails when it sees Spark trying to save a TEXT data type, so I have to specify in my custom JDBC dialect to use VARCHAR(n). That does work, but it replaces all of my VARCHAR columns (different lengths for different columns) with whatever length I specified in the dialect, which is not what I want. How can I have it save the TEXT as VARCHAR without specifying a single length in the custom dialect?

@sureshthalamati
Contributor Author

@cbyn The specified types have to be valid Spark SQL data types; LONGTEXT is probably not one of the types supported by Spark SQL syntax.

@robbyki The problem with a dialect, as you noticed, is that it applies to all the columns. The only workaround is to create the table explicitly in Netezza and then save to it. There is also a truncate option if you need to empty the table before saving; that typically keeps the table as you created it (see the sketch below).
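
A minimal sketch of that workaround, assuming the table was already created by hand with the desired column types (the URL, table name, and connection properties are placeholders):

    import org.apache.spark.sql.SaveMode

    // "truncate" empties the existing table instead of dropping and recreating it,
    // so the manually created table definition (and its column types) is kept.
    df.write
      .mode(SaveMode.Overwrite)
      .option("truncate", "true")
      .jdbc(netezzaUrl, "MYSCHEMA.MYTABLE", connectionProperties)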

Please post questions to the Spark user list; you will get answers quickly from other users and developers. People will not notice comments on closed PRs.

@cbyn

cbyn commented Jan 8, 2018

Thanks @sureshthalamati. I thought the idea was to specify the destination database type. E.g., writing Spark SQL strings as VARCHAR works, but VARCHAR is not a Spark SQL type. (I'm using the VARCHAR feature and I'm very grateful for the addition!)

@clwang24

clwang24 commented Nov 18, 2019

Is it now possible to specify the target database CLOB or BLOB data type? @sureshthalamati I used the "createTableColumnTypes" option when writing BLOB data to an Oracle database, but it returned the error "ORA-00902 invalid datatype". The Scala code is shown below:

    df.write.mode(SaveMode.Overwrite)
      .option("createTableColumnTypes", "id int, Name binary")
      .jdbc(url1, "TEST.USERDBTYPETEST", properties)

Then I tried changing the column type to BLOB, but it returned the error "DataType blob is not supported.(line 1, pos 437)". The Scala code is shown below:

    df.write.mode(SaveMode.Overwrite)
      .option("createTableColumnTypes", "id int, Name blob")
      .jdbc(url1, "TEST.USERDBTYPETEST", properties)

Finally, I tried not specifying the column type, but it returned another error, "ORA-12899 value too large for column (actual 581, maximum 255)", because the String data type is mapped to VARCHAR(255) by default. Can you spare some time to help me with this question? Thanks
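
For reference, the last error is the situation this option is meant for: the string column can be given an explicit, larger database-side type, while BLOB/CLOB still cannot be expressed through it because they are not Spark SQL types. A rough sketch, with the length chosen arbitrarily:

    // Sketch only: widen the database-side type so the default VARCHAR(255)
    // mapping no longer rejects long strings; blob/clob remain unsupported here.
    df.write.mode(SaveMode.Overwrite)
      .option("createTableColumnTypes", "id INT, Name VARCHAR(4000)")
      .jdbc(url1, "TEST.USERDBTYPETEST", properties)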
