[SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table #16209
Conversation
Test build #69851 has finished for PR 16209 at commit
Test build #69855 has started for PR 16209 at commit
@rxin @JoshRosen @srowen Which solution is preferred for supporting customized column types: a table-level JDBC option or a column metadata property? Thanks! FYI: this PR is based on the table-level JDBC option.
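For readers following along, here is a rough sketch of what the two alternatives could look like from the user's side; the option name, metadata key, URL, and table name below are illustrative placeholders, not a settled API:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder().appName("jdbc-write-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("alice", "a short bio"), ("bob", "another bio")).toDF("name", "comments")
val jdbcUrl = "jdbc:h2:mem:testdb"            // placeholder JDBC URL
val connectionProperties = new Properties()   // user/password etc. would go here

// Alternative 1: a table-level JDBC option (the approach this PR takes).
df.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc(jdbcUrl, "test.people", connectionProperties)

// Alternative 2: a per-column metadata property attached to the DataFrame schema.
// The metadata key "jdbcType" is hypothetical.
val typedComments = df("comments").as(
  "comments", new MetadataBuilder().putString("jdbcType", "VARCHAR(1024)").build())
df.select(df("name"), typedComments)
  .write
  .jdbc(jdbcUrl, "test.people", connectionProperties)
```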
retest this please
Test build #69871 has finished for PR 16209 at commit
faa8172 to ff71bac
Test build #71410 has finished for PR 16209 at commit
This PR follows the same approach as what Avro did in Hive. This linked file is an example used in our test case:
What changes were proposed in this pull request?
Specifying the table schema in DDL format is needed for several scenarios. For example:
- [specifying the schema in the SQL function `from_json` using DDL format](https://issues.apache.org/jira/browse/SPARK-19637), as suggested by marmbrus,
- [specifying customized JDBC data types](apache#16209).
These two PRs currently require users to specify the table schema in JSON format, which is not user friendly. This PR provides a `parseTableSchema` API in `ParserInterface`.
How was this patch tested?
Added a test suite `TableSchemaParserSuite`.
Author: Xiao Li <[email protected]>
Closes apache#17171 from gatorsmile/parseDDLStmt.
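As a rough sketch of how the parser entry point added by apache#17171 can be used (assuming Spark 2.2-era APIs):

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Parse a DDL-formatted column list into a Catalyst StructType.
val schema = CatalystSqlParser.parseTableSchema("name CHAR(64), comments VARCHAR(1024)")

// CHAR/VARCHAR are accepted by the parser and represented as StringType internally.
schema.fields.foreach(f => println(s"${f.name}: ${f.dataType.simpleString}"))
```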
@sureshthalamati #17171 has been resolved. Can you update your PR to allow users to specify the schema in DDL format?
@gatorsmile sure. I will update the PR with the DDL format approach.
@gatorsmile I like the DDL schema format approach, but the method … How about a simple comma-separated list, with the restriction that a comma cannot appear in a column name, to use this option? I am guessing that would work in most scenarios. Any suggestions?
ff71bac to e76b7e0
Test build #74721 has finished for PR 16209 at commit
Sorry for cutting in, but IMHO we need general logic to inject user-defined types via …
Yes, we need to extend the DDL parser to support the general user-defined types. |
e76b7e0 to 95ac9a0
Test build #74920 has finished for PR 16209 at commit
Are we assuming the name comparison is always case sensitive?
Thank you for the review. Good question; updated the PR with case-sensitive handling. Now column names from the user-specified schema are matched against the DataFrame schema based on the SQLConf.CASE_SENSITIVE flag.
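A minimal sketch of that kind of name matching (not the exact PR code; `resolveUserColumns` and its error message are made up for illustration):

```scala
import org.apache.spark.sql.types.StructType

// Validate the user-specified columns against the DataFrame schema, honoring
// spark.sql.caseSensitive (SQLConf.CASE_SENSITIVE).
def resolveUserColumns(userSchema: StructType, dfSchema: StructType, caseSensitive: Boolean): Unit = {
  def nameEquality(a: String, b: String): Boolean =
    if (caseSensitive) a == b else a.equalsIgnoreCase(b)

  userSchema.fieldNames.foreach { userCol =>
    require(dfSchema.fieldNames.exists(nameEquality(_, userCol)),
      s"createTableColumnTypes option column $userCol not found in the DataFrame schema")
  }
}
```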
We can create a partial function here.
Done. Moved it to a separate function. Thanks for the suggestion.
This is the case where users specify all the columns. In this case, we should mix up the order of the columns.
In addition, we also need a case where users specify only one or two columns.
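A sketch of the two requested scenarios, reusing the placeholder `df`, `jdbcUrl`, and `connectionProperties` from the earlier sketch (the real assertions in `JDBCWriteSuite` differ):

```scala
// Case 1: all columns specified, but in a different order than the DataFrame schema.
df.write
  .option("createTableColumnTypes", "comments VARCHAR(1024), name CHAR(64)")
  .jdbc(jdbcUrl, "test.all_cols_mixed_order", connectionProperties)

// Case 2: only a subset of the columns specified; the rest fall back to the
// default dialect type mapping.
df.write
  .option("createTableColumnTypes", "name CHAR(64)")
  .jdbc(jdbcUrl, "test.partial_cols", connectionProperties)
```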
nit: Drop the interpolation `s` at the head of the string.
Thanks for the review, @maropu. Fixed it.
95ac9a0 to 95e47a7
LGTM pending Jenkins. This is a nice option to have for JDBC users. If there are no further comments, I will merge it to master tomorrow. Thanks!
Test build #75068 has finished for PR 16209 at commit
nit: looks like the indentation is wrong.
Will fix it.
docs/sql-programming-guide.md
"The specified types should be valid Spark SQL data types"? What does that mean? Do you mean VARCHAR(1024)?
Is VARCHAR(1024) a valid Spark SQL data type? This description might need to be changed.
VARCHAR(1024) is a valid data type in Spark SQL; it gets translated to String internally in Spark. The data types specified in this property are meant for the target database; VARCHAR is used, for example, because many RDBMSs do not have a String data type.
Thank you for reviewing, @viirya.
Yeah, I see it works internally. However, it looks like this kind of type is not explicitly documented: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
So I have a little concern regarding the description here.
Am I missing something? It looks like you have 4 columns, but the schema has only 3 fields. Is that intentional? You don't use the last column, though.
Forgot to delete the extra value. Will fix it.
LGTM except for a few minor comments.
…column types when table is created on write
… Addressed review comments.
95e47a7 to 6f51d3f
Test build #75106 has finished for PR 16209 at commit
Thanks! Merging to master.
@sureshthalamati is it possible to write strings as LONGTEXT? I'm having difficulty understanding which types are possible.
Is there a recommended workaround to achieve exactly this in Spark 2.1? I'm going through several resources to try to understand how to maintain a schema created outside of Spark, and then just truncate my tables from Spark followed by writing with a save mode of overwrite. My problem is exactly this issue: my database (Netezza) fails when it sees Spark trying to save a text data type, so I have to specify in my custom JDBC dialect to use varchar(n). That does work, but it replaces all of my varchar columns (different lengths for different columns) with whatever I specified in the dialect, which is not what I want. How can I just have it save TEXT as varchar without specifying a length in the custom dialect?
@cbyn The specified types should be valid Spark SQL data types; LONGTEXT is probably not one of the types supported by Spark SQL syntax.
@robbyki The problem with a dialect, as you noticed, is that it will be the same for all the columns. The only workaround is to create the table explicitly in Netezza and then save to it. There is a truncate option also. Please post questions to the Spark user list; you will get answers quickly from other users and developers. People will not notice comments on closed PRs.
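For reference, a custom dialect along these lines (a sketch; the Netezza URL prefix and the VARCHAR length are assumptions) shows why the mapping is per Spark SQL type rather than per column:

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// A JdbcDialect maps Spark SQL data types, not individual columns, so every
// StringType column in the DataFrame gets the same target database type.
object NetezzaVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:netezza")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR)) // one length for all string columns
    case _ => None
  }
}

JdbcDialects.registerDialect(NetezzaVarcharDialect)
```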
Thanks @sureshthalamati. I thought the idea was to specify the destination database type; e.g. writing Spark SQL strings as VARCHAR works, but VARCHAR is not a Spark SQL type. (I'm using the VARCHAR feature and I'm very grateful for the addition!)
Is it now possible to specify the target database CLOB or BLOB data type? @sureshthalamati I use … Then I tried changing the column type to BLOB, but it returned the error "DataType blob is not supported.(line 1, …". Last, I tried not specifying the column type, but it returned another error, "ora-12899 value too large for …".
What changes were proposed in this pull request?
Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables in the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.
The solution is to allow users to specify the database column data type for the created table as a JDBC data source option (createTableColumnTypes) on write. The data type information can be specified in the same DDL format as a table schema (e.g.
name CHAR(64), comments VARCHAR(1024)). Not all supported target database types can be specified; the data types also have to be valid Spark SQL data types. For example, users cannot specify the target database CLOB data type. This will be supported in a follow-up PR.
Example:
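A representative usage, with a placeholder DataFrame `df`, JDBC URL, table name, and connection properties:

```scala
// Override the CREATE TABLE column types for "name" and "comments"; all other
// columns use the default JDBC dialect mapping.
df.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc(jdbcUrl, "test.people", connectionProperties)
```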
How was this patch tested?
Added new test cases to the JDBCWriteSuite