Conversation

@rama-mullapudi

No description provided.

@rxin
Contributor

rxin commented Aug 25, 2015

I think the current solution is to use the jdbc dialect, similar to the pull request here: https://github.com/apache/spark/pull/8393/files

Are you seeing a database that we don't support yet? Maybe we can just add a dialect for it.

@rama-mullapudi
Author

I was writing to Oracle and IBM Netezza databases. I thought of adding a JDBC dialect, but for strings CLOB is the only type without a character limit that could be added to the dialect, and most databases (e.g. Oracle, DB2, Teradata) restrict the usage of CLOB columns; for example, CLOB columns cannot be used in GROUP BY or DISTINCT. Since VARCHAR(n) is the preferred option for these databases, and in most cases it's VARCHAR(255) or less, what would be the best way to implement string-to-VARCHAR conversion?
My idea is to provide an optional JDBC connection parameter so the developer can decide to use VARCHAR(20), VARCHAR(2000), or CLOB, pass it via the JDBC connection options, and have all string columns for that table use the data type the developer specified.
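
For illustration only, a hypothetical sketch of how such an option might look to a developer; the stringDataType name is this comment's proposal, not an existing Spark JDBC option, and the URL, table, and connectionProperties are placeholders:

// Hypothetical usage of the proposed option; "stringDataType" does not exist in Spark.
df.write
  .option("stringDataType", "VARCHAR(255)")
  .jdbc("jdbc:oracle:thin:@host:1521:orcl", "myschema.mytable", connectionProperties)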

@marmbrus
Contributor

In spark-redshift we are planning to use an optional metadata field on the column, maxLength, that when present will let the data source use a fixed-length column. It would be good to standardize here.

The other advantage of this approach is that it is per column instead of global to the table.

@rama-mullapudi
Author

Since String to VARCHAR(n) is a common problem across multiple databases, would it be possible to push the spark-redshift change into Spark core so everyone can benefit from it? Should I go ahead and close this pull request so we can have a more flexible solution?

@marmbrus
Contributor

I don't think there is anything in conflict with implementing this in both libraries. Really, the Redshift library is just a simplified version of this, since it only needs to handle one dialect (the reason it is its own library is that it does extra work, using S3 to extract data in parallel instead of using JDBC as the channel).

I'm only suggesting that we use the same option so that users don't have to learn two different concepts. Can we just add some logic like this (https://github.com/databricks/spark-redshift/pull/54/files#diff-69806564231efb590460b162532ba683R145) in the appropriate dialects here in Spark core?

Changed getJDBCType to take two parameters, DataType and Metadata.

Usage in Scala:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder
val metadata = new MetadataBuilder().putLong("maxlength", 10).build()
df.withColumn("colName", col("colName").as("colName", metadata))
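
For reference, a minimal sketch of how a dialect could consume the maxlength metadata under the two-parameter getJDBCType proposed here; ExampleVarcharDialect is a made-up name and the two-parameter signature is this PR's proposal, not the existing JdbcDialect contract:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{DataType, Metadata, StringType}

case object ExampleVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  // Proposed metadata-aware overload: use VARCHAR(n) when maxlength is present,
  // otherwise fall back to an unbounded string type.
  def getJDBCType(dt: DataType, md: Metadata): Option[JdbcType] = dt match {
    case StringType if md.contains("maxlength") =>
      Some(JdbcType(s"VARCHAR(${md.getLong("maxlength")})", java.sql.Types.VARCHAR))
    case StringType => Some(JdbcType("CLOB", java.sql.Types.CLOB))
    case _ => None
  }
}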
@rama-mullapudi
Author

Added maxlength to field metadata so string types can be converted to VARCHAR(n).

Added JDBCDialects for OracleDialect and NetezzaDialect, and modified DB2Dialect to use VARCHAR when maxlength exists.

Modified the existing getJDBCType function to take two parameters, DataType and Metadata, so the maxlength metadata can be used in the JDBCDialects.

Can you review the change?

Contributor

While this is a DeveloperApi, it is public, so it would be good to fix this without breaking binary compatibility.

@pallavipr

The DB2 team is working on its own dialect that works with different flavors of DB2. The code will be submitted shortly.

…nd MetaData and removed DB2 JdbcDialect as DB2 team is working on the dialect.
@rama-mullapudi
Author

Left getJDBCType with a single parameter for binary compatibility and removed the DB2 dialect for the DB2 team. Can you take a look?

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
	sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
@pallavipr

Looks good, Rama. We are almost done with the DB2 changes and will send them for review soon.

One question: did you introduce a stringDataType property in the connection URL, so that StringType will be mapped to the value provided for stringDataType?

Thanks,
Pallavi

@rama-mullapudi
Author

Can one of the admins review the changes?

@rama-mullapudi rama-mullapudi changed the title [SPARK-10101] [SQL] Added stringDataType option to jdbc connection properties [SPARK-10101] [SQL] Added maxlength to JDBC field metadata and override JDBCDialects for strings as VARCHAR. Sep 26, 2015
@rama-mullapudi rama-mullapudi changed the title [SPARK-10101] [SQL] Added maxlength to JDBC field metadata and override JDBCDialects for strings as VARCHAR. [SPARK-10101] [SQL] Add maxlength to JDBC field metadata and override JDBCDialects for strings as VARCHAR. Sep 26, 2015
@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Sep 27, 2015

Test build #43054 has finished for PR 8374 at commit dddc137.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor

I am new to git, so I am not sure if I am reading the changes correctly. It looks like you are undoing the https://issues.apache.org/jira/browse/SPARK-9078 changes in your last merge.

I see you fixed the current dialect implementations in the code to use the new method, but there can be custom dialect implementations using the old getJDBCType(dt: DataType) method. After this change, existing user-specified custom dialects may not work.

Example:
import org.apache.spark.sql.jdbc._
JdbcDialects.registerDialect(MyNetEzzaDialect)

Another minor thing: any particular reason for setting java.sql.Types.CHAR instead of java.sql.Types.VARCHAR in the following statements?

OracleDialect:
Some(JdbcType(s"VARCHAR(${md.getLong("maxlength")})", java.sql.Types.CHAR))
NetezzaDialect:
Some(JdbcType("VARCHAR(255)", java.sql.Types.CHAR))

@rama-mullapudi
Author

The intent of the change is to make field metadata available to getJDBCType so it can be used by dialects to handle special cases. The case I was trying to solve: by default, Java string types are stored as the TEXT data type in the database, and most databases don't support TEXT, so I'm trying to use VARCHAR, which needs a max length in order to create a VARCHAR field in the database.

What would be the best way to handle custom dialects, since this is a DeveloperApi and can change?

Thanks for looking into the change; I will update the code to use java.sql.Types.VARCHAR instead of java.sql.Types.CHAR.

…java.sql.Types.CHAR to java.sql.Types.VARCHAR when type used is VARCHAR
@SparkQA

SparkQA commented Sep 28, 2015

Test build #43063 has finished for PR 8374 at commit 5f532e8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pallavipr

Rama, while specifying the length via DataFrame metadata is a good approach, can we please have dbStringMappingType and dbStringMappingLength as properties on the connection URL as well? It would make things much easier for developers, as opposed to dealing with DataFrame lengths.

We have it implemented for IBM data servers and are trying to see if we can get it into mainline Spark, since it's a useful feature.

VARCHAR has limitations on length, so under many circumstances users may want to map StringType to CLOB of varying lengths if the data server supports it.

Let me know if we can pursue this; we have the code ready.

Thanks.
Pallavi

@marmbrus
Contributor

Just to let you know, we are busy wrapping up 1.5.1, but I have put reviewing this PR on our schedule for the next two-week sprint.

@SparkQA

SparkQA commented Sep 28, 2015

Test build #43068 has finished for PR 8374 at commit 27f118b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 28, 2015

Test build #43071 has finished for PR 8374 at commit 44e1978.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 29, 2015

Test build #43074 timed out for PR 8374 at commit e605a11 after a configured wait of 250m.

@SparkQA

SparkQA commented Sep 29, 2015

Test build #43089 has finished for PR 8374 at commit 4c2a7a4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor

The new contract defined in the JdbcDialect class is that subclasses can define either one of the getJDBCType methods, or both. Both methods have to be called to check whether the dialect has specified a different mapping. I think it makes sense to try the call with metadata first, and if that is not defined, then call the one without the metadata. If we address this, the existing custom dialects I mentioned in my previous comment should also work fine.

I think the code in JdbcUtils.scala that calls getJDBCType() has to be changed to something like the following:

val typ: String =
  dialect.getJDBCType(field.dataType, field.metadata).map(_.databaseTypeDefinition)
    .orElse(dialect.getJDBCType(field.dataType).map(_.databaseTypeDefinition))
    .getOrElse(field.dataType match { …

I hope that helps. Thank you for working on this issue.

@SparkQA

SparkQA commented Sep 29, 2015

Test build #43099 has finished for PR 8374 at commit a0cb024.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 4, 2015

Test build #43224 has finished for PR 8374 at commit d50bdf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rama-mullapudi
Author

Please review the pull for approval to merge.

Member

How about calling getJDBCType(dt, Metadata.empty) instead of returning None in getJDBCType(dt: DataType)? Classes that extend JdbcDialect may implement only one of the two methods, and then the behaviours of the two functions would be totally different.
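
As a concrete illustration of this suggestion, a sketch of the delegation, assuming the two-parameter overload proposed in this PR rather than the shipped JdbcDialect API:

import org.apache.spark.sql.jdbc.JdbcType
import org.apache.spark.sql.types.{DataType, Metadata}

abstract class JdbcDialectSketch extends Serializable {
  // Existing single-parameter entry point delegates with empty metadata,
  // so dialects that only implement the two-parameter overload still work.
  def getJDBCType(dt: DataType): Option[JdbcType] =
    getJDBCType(dt, Metadata.empty)

  // Proposed metadata-aware overload; None means "use the default mapping".
  def getJDBCType(dt: DataType, md: Metadata): Option[JdbcType] = None
}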

Author

Sorry, I did not get the change you are suggesting. Do you mean to call getJDBCType(dt, Metadata.empty) from getJDBCType(dt: DataType)?


@maropu
Member

maropu commented Oct 29, 2015

Great work! I left some trivial comments.

@pallavipr

Rama, can we please get the metadata maxlength changes in? We have a serious problem with writing StringTypes, and being able to map according to the DataFrame maxlength would be an alternative way to get around this issue.

Thanks.

@rishitesh

Is this PR going to be merged shortly? We are also expecting this PR to go in so that we don't have to modify Spark ourselves. Can anybody respond on what the status is? If any help is needed, do let me know.

@maropu
Member

maropu commented Apr 21, 2016

Seems not, because it has gone stale. If nobody takes this, I'll do it.

@HyukjinKwon
Member

HyukjinKwon commented Oct 12, 2016

It seems this PR is stale. Ping @rama-mullapudi

@HyukjinKwon
Member

ping @rama-mullapudi

@robbyki

robbyki commented Jan 4, 2018

Apologies for misunderstanding this issue, but I'm going through several resources trying to understand how to maintain a schema created outside of Spark and then just truncate my tables from Spark, followed by writing with a save mode of overwrite. My problem is exactly this issue: my database (Netezza) fails when it sees Spark trying to save a TEXT data type, so I then have to specify in my new JDBC dialect to use VARCHAR(n), which does work. However, that replaces all of my VARCHAR columns (which have different lengths) with whatever length I specified in the dialect, which is not what I want. How can I have it save the TEXT as VARCHAR without specifying a single length in the custom dialect?
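
For what it's worth, a sketch of per-column lengths under the maxlength metadata approach discussed in this PR; the column names, jdbcUrl, and connectionProperties are placeholders, and whether a given dialect honours maxlength depends on the Spark version in use:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val shortMeta = new MetadataBuilder().putLong("maxlength", 50).build()
val longMeta  = new MetadataBuilder().putLong("maxlength", 2000).build()

val out = df
  .withColumn("code", col("code").as("code", shortMeta))    // intended as VARCHAR(50)
  .withColumn("notes", col("notes").as("notes", longMeta))  // intended as VARCHAR(2000)

out.write
  .mode("overwrite")
  .option("truncate", "true")  // truncate instead of dropping, to keep the existing DDL
  .jdbc(jdbcUrl, "my_table", connectionProperties)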
