Conversation

@rama-mullapudi

No description provided.

@rxin
Contributor

rxin commented Aug 25, 2015

I think the current solution is to use the jdbc dialect, similar to the pull request here: https://github.com/apache/spark/pull/8393/files

Are you seeing a database that we don't support yet? Maybe we can just add a dialect for it.

@rama-mullapudi
Author

I was writing to Oracle and IBM Netezza databases. I thought of adding a JDBC dialect, but for strings CLOB is the only type without a character limit that could be added to the dialect, and most databases (e.g. Oracle, DB2, Teradata) restrict the usage of CLOB columns; for example, CLOB columns cannot be used in GROUP BY or DISTINCT. Since VARCHAR(n) is the preferred option for these databases, and in most cases it's VARCHAR(255) or less, what would be the best way to implement string-to-VARCHAR conversion?
My idea is to provide an optional JDBC connection parameter so the developer can decide to use VARCHAR(20), VARCHAR(2000), or CLOB, pass it via the JDBC connection options, and have all string columns for that table use the data type the developer specified.
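
For illustration only, a hypothetical sketch of how such an option might look to a developer; the stringDataType name is this comment's proposal, not an existing Spark JDBC option, and the URL, table, and connectionProperties are placeholders:

// Hypothetical usage of the proposed option; "stringDataType" does not exist in Spark.
df.write
  .option("stringDataType", "VARCHAR(255)")
  .jdbc("jdbc:oracle:thin:@host:1521:orcl", "myschema.mytable", connectionProperties)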

@marmbrus
Contributor

In spark-redshift we are planning to use an optional metadata field on the column, maxLength, that when present will let the data source use a fixed-length column. It would be good to standardize here.

The other advantage of this approach is that it is per column instead of global to the table.

@rama-mullapudi
Author

Since String to VARCHAR(n) is a common problem across multiple databases, would it be possible to push the spark-redshift change into Spark core so everyone can benefit from it? Should I go ahead and close this pull request so we can have a more flexible solution?

@marmbrus
Contributor

I don't think there is anything in conflict with implementing this in both libraries. Really, the Redshift library is just a simplified version of this, since it only needs to handle one dialect (the reason it is its own library is that it does extra work, using S3 to extract data in parallel instead of using JDBC as the channel).

I'm only suggesting that we use the same option so that users don't have to learn two different concepts. Can we just add some logic like this (https://github.com/databricks/spark-redshift/pull/54/files#diff-69806564231efb590460b162532ba683R145) in the appropriate dialects here in Spark core?

Changed getJDBCType to take two parameters, DataType and Metadata.

Usage in Scala:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder
val metadata = new MetadataBuilder().putLong("maxlength", 10).build()
df.withColumn("colName", col("colName").as("colName", metadata))
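
For reference, a minimal sketch of how a dialect could consume the maxlength metadata under the two-parameter getJDBCType proposed here; ExampleVarcharDialect is a made-up name and the two-parameter signature is this PR's proposal, not the existing JdbcDialect contract:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types.{DataType, Metadata, StringType}

case object ExampleVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  // Proposed metadata-aware overload: use VARCHAR(n) when maxlength is present,
  // otherwise fall back to an unbounded string type.
  def getJDBCType(dt: DataType, md: Metadata): Option[JdbcType] = dt match {
    case StringType if md.contains("maxlength") =>
      Some(JdbcType(s"VARCHAR(${md.getLong("maxlength")})", java.sql.Types.VARCHAR))
    case StringType => Some(JdbcType("CLOB", java.sql.Types.CLOB))
    case _ => None
  }
}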
@rama-mullapudi
Author

Added maxlength to field metadata so string types can be converted to VARCHAR(n).

Added JDBCDialects for OracleDialect and NetezzaDialect, and modified DB2Dialect to use VARCHAR when maxlength exists.

Modified the existing getJDBCType function to take two parameters, DataType and Metadata, so the maxlength metadata can be used in the JDBCDialects.

Can you review the change?

Contributor

While this is a DeveloperApi, it is public, so it would be good to fix this without breaking binary compatibility.

@pallavipr

The DB2 team is working on its own dialect that works with different flavors of DB2. The code will be submitted shortly.

…nd MetaData and removed DB2 JdbcDialect as DB2 team is working on the dialect.
@rama-mullapudi
Author

Left getJDBCType with a single parameter for binary compatibility and removed the DB2 dialect for the DB2 team. Can you take a look?

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
	sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
@pallavipr

Looks good, Rama. We are almost done with the DB2 changes and will send them for review soon.

One question: did you introduce a stringDataType property in the connection URL, so that StringType will be mapped to the value provided for stringDataType?

Thanks,
Pallavi

@rama-mullapudi
Author

Can one of the admins review the changes?

@rama-mullapudi rama-mullapudi changed the title [SPARK-10101] [SQL] Added stringDataType option to jdbc connection properties [SPARK-10101] [SQL] Added maxlength to JDBC field metadata and override JDBCDialects for strings as VARCHAR. Sep 26, 2015
@rama-mullapudi rama-mullapudi changed the title [SPARK-10101] [SQL] Added maxlength to JDBC field metadata and override JDBCDialects for strings as VARCHAR. [SPARK-10101] [SQL] Add maxlength to JDBC field metadata and override JDBCDialects for strings as VARCHAR. Sep 26, 2015
@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Sep 27, 2015

Test build #43054 has finished for PR 8374 at commit dddc137.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor

I am new to git, so I am not sure if I am reading the changes correctly. It looks like you are undoing the https://issues.apache.org/jira/browse/SPARK-9078 changes in your last merge.

I see you fixed the current dialect implementations in the code to use the new method, but there can be custom dialect implementations using the old getJDBCType(dt: DataType) method. After this change, existing user-specified custom dialects may not work.

Example:
import org.apache.spark.sql.jdbc._
JdbcDialects.registerDialect(MyNetEzzaDialect)

Another minor thing: any particular reason for setting java.sql.Types.CHAR instead of java.sql.Types.VARCHAR in the following statements?

OracleDialect:
Some(JdbcType(s"VARCHAR(${md.getLong("maxlength")})", java.sql.Types.CHAR))
NetezzaDialect:
Some(JdbcType("VARCHAR(255)", java.sql.Types.CHAR))

@rama-mullapudi
Author

The intent of the change is to make field metadata available to getJDBCType so it can be used by dialects to handle special cases. The case I was trying to solve: by default, Java string types are stored as the TEXT data type in the database, and most databases don't support TEXT, so I'm trying to use VARCHAR, which needs a max length in order to create a VARCHAR field in the database.

What would be the best way to handle custom dialects, since this is a DeveloperApi and can change?

Thanks for looking into the change; I will update the code to use java.sql.Types.VARCHAR instead of java.sql.Types.CHAR.

…java.sql.Types.CHAR to java.sql.Types.VARCHAR when type used is VARCHAR
@SparkQA

SparkQA commented Sep 28, 2015

Test build #43063 has finished for PR 8374 at commit 5f532e8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pallavipr

Rama, while specifying the length via DataFrame metadata is a good approach, can we please have dbStringMappingType and dbStringMappingLength as properties on the connection URL as well? It would make things much easier for developers, as opposed to dealing with DataFrame lengths.

We have it implemented for IBM data servers and are trying to see if we can get it into mainline Spark, since it's a useful feature.

VARCHAR has limitations on length, so under many circumstances users may want to map StringType to CLOB of varying lengths if the data server supports it.

Let me know if we can pursue this; we have the code ready.

Thanks.
Pallavi

@marmbrus
Contributor

Just to let you know, we are busy wrapping up 1.5.1, but I have put reviewing this PR on our schedule for the next two-week sprint.

@SparkQA

SparkQA commented Sep 28, 2015

Test build #43068 has finished for PR 8374 at commit 27f118b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 28, 2015

Test build #43071 has finished for PR 8374 at commit 44e1978.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 29, 2015

Test build #43074 timed out for PR 8374 at commit e605a11 after a configured wait of 250m.

@SparkQA

SparkQA commented Sep 29, 2015

Test build #43089 has finished for PR 8374 at commit 4c2a7a4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sureshthalamati
Contributor

The new contract defined in the JdbcDialect class is that subclasses can define either one of the getJDBCType methods, or both. Both methods have to be called to check whether the dialect has specified a different mapping. I think it makes sense to try the call with metadata first, and if that is not defined, then call the one without the metadata. If we address this, the existing custom dialects I mentioned in my previous comment should also work fine.

I think the code in JdbcUtils.scala that calls getJDBCType() has to be changed to something like the following:

val typ: String =
  dialect.getJDBCType(field.dataType, field.metadata).map(_.databaseTypeDefinition)
    .orElse(dialect.getJDBCType(field.dataType).map(_.databaseTypeDefinition))
    .getOrElse(field.dataType match { …

I hope that helps. Thank you for working on this issue.

@SparkQA

SparkQA commented Sep 29, 2015

Test build #43099 has finished for PR 8374 at commit a0cb024.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 4, 2015

Test build #43224 has finished for PR 8374 at commit d50bdf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rama-mullapudi
Author

Please review the pull for approval to merge.

Member

How about calling getJDBCType(dt, Metadata.empty) instead of returning None in getJDBCType(dt: DataType)? Classes that extend JdbcDialect may implement only one of the two methods, and then the behaviours of the two functions would be totally different.
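
As a concrete illustration of this suggestion, a sketch of the delegation, assuming the two-parameter overload proposed in this PR rather than the shipped JdbcDialect API:

import org.apache.spark.sql.jdbc.JdbcType
import org.apache.spark.sql.types.{DataType, Metadata}

abstract class JdbcDialectSketch extends Serializable {
  // Existing single-parameter entry point delegates with empty metadata,
  // so dialects that only implement the two-parameter overload still work.
  def getJDBCType(dt: DataType): Option[JdbcType] =
    getJDBCType(dt, Metadata.empty)

  // Proposed metadata-aware overload; None means "use the default mapping".
  def getJDBCType(dt: DataType, md: Metadata): Option[JdbcType] = None
}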

Author

Sorry, I did not get the change you are suggesting. Do you mean to call getJDBCType(dt, Metadata.empty) from getJDBCType(dt: DataType)?


@maropu
Member

maropu commented Oct 29, 2015

Great work! I left some trivial comments.

@pallavipr

Rama, can we please get the metadata maxlength changes in? We have a serious problem with writing StringTypes, and being able to map according to the DataFrame maxlength would be an alternative way to get around this issue.

Thanks.

@rishitesh

Is this PR going to be merged shortly? We are also expecting this PR to go in so that we don't have to modify Spark ourselves. Can anybody respond on what the status is? If any help is needed, do let me know.

@maropu
Member

maropu commented Apr 21, 2016

Seems not, because it has gone stale. If nobody takes this, I'll do it.

@HyukjinKwon
Member

HyukjinKwon commented Oct 12, 2016

It seems this PR is stale. Ping @rama-mullapudi

@HyukjinKwon
Member

ping @rama-mullapudi

@robbyki

robbyki commented Jan 4, 2018

Apologies for misunderstanding this issue, but I'm going through several resources trying to understand how to maintain a schema created outside of Spark and then just truncate my tables from Spark, followed by writing with a save mode of overwrite. My problem is exactly this issue: my database (Netezza) fails when it sees Spark trying to save a TEXT data type, so I then have to specify in my new JDBC dialect to use VARCHAR(n), which does work. However, that replaces all of my VARCHAR columns (which have different lengths) with whatever length I specified in the dialect, which is not what I want. How can I have it save the TEXT as VARCHAR without specifying a single length in the custom dialect?
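
For what it's worth, a sketch of per-column lengths under the maxlength metadata approach discussed in this PR; the column names, jdbcUrl, and connectionProperties are placeholders, and whether a given dialect honours maxlength depends on the Spark version in use:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val shortMeta = new MetadataBuilder().putLong("maxlength", 50).build()
val longMeta  = new MetadataBuilder().putLong("maxlength", 2000).build()

val out = df
  .withColumn("code", col("code").as("code", shortMeta))    // intended as VARCHAR(50)
  .withColumn("notes", col("notes").as("notes", longMeta))  // intended as VARCHAR(2000)

out.write
  .mode("overwrite")
  .option("truncate", "true")  // truncate instead of dropping, to keep the existing DDL
  .jdbc(jdbcUrl, "my_table", connectionProperties)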
