[SPARK-10101] [SQL] Add maxlength to JDBC field metadata and override JDBCDialects for strings as VARCHAR. #8374
Conversation
|
I think the current solution is to use the jdbc dialect, similar to the pull request here: https://github.com/apache/spark/pull/8393/files Are you seeing a database that we don't support yet? Maybe we can just add a dialect for it. |
|
I was using Oracle and IBM Netezza databases to write to. I thought of adding a JDBC dialect, but for strings CLOB is the only type without a character limit that could be added to the dialect, and most databases (Oracle, DB2, Teradata) have restrictions on the usage of CLOB columns, e.g. CLOB columns cannot be used in GROUP BY or DISTINCT. Since VARCHAR(n) is the preferred option for these databases, and in most cases it is VARCHAR(255) or less, what would be the best way to implement string to VARCHAR conversion? |
|
In spark-redshift we are planning to use an optional metadata field on the column. The other advantage to this approach is that it is per column instead of global to a table. |
|
Since String to VARCHAR(n) is a common problem across multiple databases, would it be possible to push the spark-redshift change into Spark core so everyone can benefit from it? Should I go ahead and close this pull request so we can have a more flexible solution? |
|
I don't think there is anything in conflict with implementing this in both libraries. Really the redshift library is just a simplified version of this, since it only needs to handle one dialect (the reason it is its own library is that it actually does extra work using S3 to parallel-extract data instead of using JDBC as the channel). I'm only suggesting that we use the same option so that users don't have to learn two different concepts. Can we just add some logic like this (https://github.com/databricks/spark-redshift/pull/54/files#diff-69806564231efb590460b162532ba683R145) in the appropriate dialects here in Spark core? |
Changed getJDBCType to take 2 parameters, DataType and Metadata.
Usage in Scala:
import org.apache.spark.sql.types.MetadataBuilder
val metadata = new MetadataBuilder().putLong("maxlength", 10).build()
df.withColumn("colName", col("colName").as("colName", metadata))
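For context, a fuller usage sketch of the column-metadata approach above, assuming df is an existing DataFrame and a dialect that honours the maxlength hint; the URL, table name, and column name are placeholders:

import java.util.Properties
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Attach a maxlength hint to the column so a dialect can emit VARCHAR(10)
// instead of an unbounded text type.
val metadata = new MetadataBuilder().putLong("maxlength", 10).build()
val dfWithHint = df.withColumn("colName", col("colName").as("colName", metadata))

// Write through JDBC as usual; the dialect decides the DDL type per column.
dfWithHint.write
  .mode("append")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/service", "MY_TABLE", new Properties())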
|
Added a maxlength field metadata key so string types can be converted to VARCHAR(n). Added JDBC dialects OracleDialect and NetezzaDialect, and modified DB2Dialect to use VARCHAR when maxlength exists. Modified the existing getJDBCType function to take 2 parameters, DataType and Metadata, so the maxlength metadata can be used in the dialects. Can you review the change? |
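As a rough illustration of the dialect-side behaviour described above (a minimal sketch, not the PR's actual diff): it assumes the two-parameter getJDBCType(DataType, Metadata) overload this PR adds to JdbcDialect, and NetezzaLikeDialect is an illustrative name rather than the dialect added here.

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{DataType, Metadata, StringType}

case object NetezzaLikeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:netezza")

  // Emit VARCHAR(n) for strings carrying a maxlength hint; return None to
  // defer to Spark's default mapping for everything else.
  override def getJDBCType(dt: DataType, md: Metadata): Option[JdbcType] = dt match {
    case StringType if md.contains("maxlength") =>
      Some(JdbcType(s"VARCHAR(${md.getLong("maxlength")})", java.sql.Types.VARCHAR))
    case _ => None
  }
}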
While this is a DeveloperApi, it is public, so it would be good to fix this without breaking binary compatibility.
|
DB2 team is working on its own dialect that works with different flavors of DB2. Code will shortly be submitted. |
…nd MetaData and removed DB2 JdbcDialect as DB2 team is working on the dialect.
|
Left the single-parameter getJDBCType in place for binary compatibility and removed the DB2 dialect so the DB2 team can contribute theirs. Can you take a look? |
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
|
Looks good, Rama. We are almost done with the DB2 changes and will send them for review soon. One question: did you introduce a stringDataType property in the connection URL? And will StringType be mapped to the value provided for stringDataType? Thanks. |
|
Can one of the admins review the changes? |
|
ok to test |
|
Test build #43054 has finished for PR 8374 at commit
|
|
I am new to git, so I am not sure if I am reading the changes correctly. I see you fixed the current dialect implementations in the code to use the new method, but there may be existing custom dialects that only implement the old single-parameter method. Another minor thing: is there any particular reason for setting java.sql.Types.CHAR instead of java.sql.Types.VARCHAR in OracleDialect? |
|
The intent of the change is to make field metadata available to getJDBCType so it can be used by dialects to handle special cases. One case I was trying to solve: by default Java string types are stored as the TEXT data type in the database, and most databases don't support TEXT, so I am trying to use VARCHAR, which needs a max length to create a varchar field in the database. What would be the best way to handle custom dialects, since this is a DeveloperApi and can change? Thanks for looking into the change; I will update the code to use java.sql.Types.VARCHAR instead of java.sql.Types.CHAR. |
…java.sql.Types.CHAR to java.sql.Types.VARCHAR when type used is VARCHAR
|
Test build #43063 has finished for PR 8374 at commit
|
|
Rama, while specifying the length in DataFrame metadata is a good approach, can we please have dbStringMappingType and dbStringMappingLength as properties on the connection URL as well? It would make it really easy for developers, as opposed to dealing with DataFrame length. We have it implemented for IBM data servers and are trying to see if we can make it mainline Spark, since it's a useful feature. VARCHAR has its limitations on length, hence under many circumstances users may want to map StringType to CLOB of varying lengths if the data server supports it. Let me know if we can pursue this - we have the code ready. Thanks. |
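To make the proposal concrete, here is a rough sketch of how a writer could consult such properties; dbStringMappingType and dbStringMappingLength are the names proposed in this comment, not existing Spark options, and stringDdlType is a hypothetical helper:

import java.util.Properties

// Hypothetical: derive the DDL type for StringType columns from connection
// properties, e.g. dbStringMappingType=CLOB and dbStringMappingLength=2000,
// falling back to VARCHAR(255) when the properties are absent.
def stringDdlType(props: Properties): String = {
  val baseType = Option(props.getProperty("dbStringMappingType")).getOrElse("VARCHAR")
  val length = Option(props.getProperty("dbStringMappingLength")).getOrElse("255")
  s"$baseType($length)"
}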
|
Just to let you know, we are busy wrapping up 1.5.1, but I have put reviewing this PR on our schedule for the next two-week sprint. |
|
Test build #43068 has finished for PR 8374 at commit
|
|
Test build #43071 has finished for PR 8374 at commit
|
|
Test build #43074 timed out for PR 8374 at commit |
|
Test build #43089 has finished for PR 8374 at commit
|
|
The new contract defined in the JdbcDialect class is that subclasses can define either one of the getJDBCType methods or both. Both methods have to be called to check whether the dialect has specified a different mapping. I think it makes sense to put the call with metadata first, and if it is not defined, then call the one without the metadata. If we address this, any existing custom dialects I mentioned in my previous comment should also work fine. I think the code in JdbcUtils.scala that calls getJDBCType() has to be changed to something like the following: val typ: String = dialect.getJDBCType(field.dataType, field.metadata).map(_.databaseTypeDefinition).orElse(dialect.getJDBCType(field.dataType).map(_.databaseTypeDefinition)).getOrElse(… I hope that helps. Thank you for working on this issue. |
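Reformatted as a standalone sketch of that lookup order, assuming the two-parameter getJDBCType overload from this PR; ddlTypeFor and defaultTypeFor are illustrative placeholders (the latter standing in for whatever default mapping JdbcUtils already applies), not real Spark functions:

import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.StructField

// Try the metadata-aware overload first, then the single-argument one, then
// fall back to a default mapping supplied by the caller.
def ddlTypeFor(dialect: JdbcDialect, field: StructField,
               defaultTypeFor: StructField => String): String =
  dialect.getJDBCType(field.dataType, field.metadata).map(_.databaseTypeDefinition)
    .orElse(dialect.getJDBCType(field.dataType).map(_.databaseTypeDefinition))
    .getOrElse(defaultTypeFor(field))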
|
Test build #43099 has finished for PR 8374 at commit
|
…etadata and one with only datatype.
|
Test build #43224 has finished for PR 8374 at commit
|
|
Please review the pull request for approval to merge. |
How about calling getJDBCType(dt, Metadata.empty) instead of returning None in getJDBCType(dt: DataType)? The classes that extend JdbcDialect will possibly implement only one of them, and then the behaviours of the two functions would be totally different.
Sorry, I did not get the change you are suggesting. Do you mean to call getJDBCType(dt, Metadata.empty) from getJDBCType(dt: DataType)?
@rama-mullapudi yes.
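For concreteness, a minimal sketch of the delegation agreed on above; JdbcDialectSketch is a stand-in for the real JdbcDialect class, shown only to illustrate how the single-parameter method could delegate with Metadata.empty while staying binary compatible:

import org.apache.spark.sql.jdbc.JdbcType
import org.apache.spark.sql.types.{DataType, Metadata}

abstract class JdbcDialectSketch {
  // Metadata-aware overload; dialects override this one (or the one below).
  def getJDBCType(dt: DataType, md: Metadata): Option[JdbcType] = None

  // Old single-parameter method kept for binary compatibility; it delegates
  // with Metadata.empty so both entry points behave consistently.
  def getJDBCType(dt: DataType): Option[JdbcType] = getJDBCType(dt, Metadata.empty)
}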
|
Great work! I left some trivial comments. |
|
Rama, can we please get the metadata maxlength changes in? We have a serious problem with writing StringTypes, and being able to map according to the DataFrame maxlength would be an alternative way to get around this issue. Thanks. |
|
Is this PR going to be merged shortly? We are also expecting this PR to go in so that we don't have to change Spark. Can anybody respond on what the status is? If any help is needed, do let me know. |
|
Seems not, because it has gone stale. If nobody takes this, I'll do it. |
|
It seems this PR is stale. Ping @rama-mullapudi |
|
ping @rama-mullapudi |
|
Apologies for misunderstanding this issue, but I'm going through several resources to try to understand how to maintain my schema created outside of Spark and then just truncate my tables from Spark, followed by writing with a save mode of overwrite. My problem is exactly this issue: my Netezza database fails when it sees Spark trying to save a TEXT data type, so I then have to specify in my new JDBC dialect to use VARCHAR(n). That does work; however, it replaces all of my varchar columns (different lengths for different columns) with whatever length I specified in my dialect, which is not what I want. How can I just have it save the TEXT as varchar without specifying a length in the custom dialect? |
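One way to get per-column lengths with the maxlength metadata discussed in this PR is to attach a different maxlength to each string column before writing, instead of hard-coding one length in the custom dialect. A rough sketch, with made-up column names and lengths, assuming a dialect that honours the maxlength hint:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Give each string column its own VARCHAR length via column metadata, so the
// dialect can emit VARCHAR(n) per column instead of one global size.
def withVarcharHints(df: DataFrame, lengths: Map[String, Long]): DataFrame =
  lengths.foldLeft(df) { case (acc, (name, len)) =>
    val md = new MetadataBuilder().putLong("maxlength", len).build()
    acc.withColumn(name, col(name).as(name, md))
  }

// e.g. withVarcharHints(df, Map("name" -> 100L, "city" -> 50L, "notes" -> 1000L))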