[SPARK-20427][SQL] Read JDBC table use custom schema #18266
Conversation
|
Test build #77887 has finished for PR 18266 at commit
|
|
Test build #77895 has finished for PR 18266 at commit
|
|
Jenkins, retest this please |
| def putMetadataArray(key: String, value: Array[Metadata]): this.type = put(key, value) | ||
|
|
||
| /** Puts a name. */ | ||
| def putName(name: String): this.type = put("name", name) |
This interface change is not desired. See PR #16209.
You can further enhance our parser by supporting the data types that are not natively supported by Spark.
|
Test build #77908 has finished for PR 18266 at commit
|
|
Test build #78093 has finished for PR 18266 at commit
|
| } | ||
|
|
||
| test("SPARK-16848: jdbc API throws an exception for user specified schema") { | ||
| ignore("SPARK-16848: jdbc API throws an exception for user specified schema") { |
?
JDBC didn't support a user-specified schema before:
https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L188
Then, we should remove this test case.
| StructField("N1", IntegerType, true, new MetadataBuilder().putString("name", "N1").build()), | ||
| StructField("N2", BooleanType, true, new MetadataBuilder().putString("name", "N2").build()))) | ||
|
|
||
| val dfRead = spark.read.schema(schema).jdbc(jdbcUrl, "custom_column_types", new Properties()) |
?
| val schema = StructType(Seq( | ||
| StructField("ID", DecimalType(DecimalType.MAX_PRECISION, 0), true, | ||
| new MetadataBuilder().putString("name", "ID").build()), | ||
| StructField("N1", IntegerType, true, new MetadataBuilder().putString("name", "N1").build()), |
Why add new MetadataBuilder().putString("name", "N1").build()?
JDBCRDD will read metadata:
https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L85
I'll change this in the next commit.
| */ | ||
| private def pruneSchema(schema: StructType, columns: Array[String]): StructType = { | ||
| val fieldMap = Map(schema.fields.map(x => x.metadata.getString("name") -> x): _*) | ||
| val fieldMap = Map(schema.fields.map(x => x.name -> x): _*) |
x.metadata.getString("name") always equals x.name:
https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L291
This is not a related change. Could you revert it back?
CatalystSqlParser.parseTableSchema(columnTypes) constructs a StructType without metadata, so we get this error message:
key not found: name
java.util.NoSuchElementException: key not found: name
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.sql.types.Metadata.get(Metadata.scala:111)
at org.apache.spark.sql.types.Metadata.getString(Metadata.scala:60)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:83)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:83)
|
Test build #78526 has finished for PR 18266 at commit
|
| // TODO: to reuse the existing partition parameters for those partition specific options | ||
| val createTableOptions = parameters.getOrElse(JDBC_CREATE_TABLE_OPTIONS, "") | ||
| val createTableColumnTypes = parameters.get(JDBC_CREATE_TABLE_COLUMN_TYPES) | ||
| val customSchema = parameters.get(JDBC_CUSTOM_SCHEMA) |
convert it to StructType here.
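For illustration, a minimal sketch of that conversion; the helper name is hypothetical, and it relies on CatalystSqlParser.parseTableSchema, the same parser discussed later in this thread:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.StructType

// Parse the user-supplied "col TYPE, col TYPE, ..." string into a StructType
// as soon as the JDBC options are constructed, so later code works with types
// rather than a raw string.
def parseCustomSchema(customSchema: Option[String]): Option[StructType] =
  customSchema.map(CatalystSqlParser.parseTableSchema)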
|
I am fine with supporting a customized schema for the read path of the JDBC relation. However, we need to check whether the user-specified schema matches the underlying table schema. If it does not match, we need to catch it earlier and issue a proper error message. |
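A rough sketch of such a check (names are assumptions; it mirrors the name-resolution approach the PR eventually adopts, and is assumed to live inside Spark's org.apache.spark.sql package so that AnalysisException can be constructed):

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.types.StructType

// Fail early if a user-specified column does not exist in the schema
// resolved from the JDBC table.
def validateCustomSchema(
    tableSchema: StructType,
    userSchema: StructType,
    nameEquality: Resolver): Unit = {
  userSchema.fieldNames.foreach { col =>
    if (!tableSchema.fields.exists(f => nameEquality(f.name, col))) {
      throw new AnalysisException(
        s"Column $col not found in the JDBC table schema: " +
          tableSchema.fieldNames.mkString(", "))
    }
  }
}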
|
Test build #79139 has finished for PR 18266 at commit
|
|
Test build #79137 has finished for PR 18266 at commit
|
|
Test build #79136 has finished for PR 18266 at commit
|
|
The example in the PR description looks a little bit confusing: val dfRead = spark.read.schema(schema).jdbc(jdbcUrl, "tableWithCustomSchema", new Properties()). Could you update it? |
| this.extraOptions ++= properties.asScala | ||
| // explicit url and dbtable should override all | ||
| this.extraOptions += (JDBCOptions.JDBC_URL -> url, JDBCOptions.JDBC_TABLE_NAME -> table) | ||
| if (userSpecifiedSchema.isDefined) { |
Please also update the other API at line 273.
|
|
||
| // default will throw IllegalArgumentException | ||
| val e = intercept[org.apache.spark.SparkException] { | ||
| spark.read.jdbc(jdbcUrl, "custom_column_types", new Properties()).collect() |
Nit: Change the table names in all the test cases.
|
Test build #81084 has finished for PR 18266 at commit
|
|
As far as I understand the proposed solution recommends using |
|
Yes, mapping to Double seems fine. This test passed:
test("SPARK-20427/SPARK-20921: read table use custom schema by jdbc api") {
// default will throw IllegalArgumentException
val e = intercept[org.apache.spark.SparkException] {
spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", new Properties()).collect()
}
assert(e.getMessage.contains(
"requirement failed: Decimal precision 39 exceeds max precision 38"))
// custom schema can read data
val props = new Properties()
props.put("customDataFrameColumnTypes",
s"ID double, N1 int, N2 boolean")
val dfRead = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)
val rows = dfRead.collect()
// verify the data type
val types = rows(0).toSeq.map(x => x.getClass.toString)
assert(types(0).equals("class java.lang.Double"))
assert(types(1).equals("class java.lang.Integer"))
assert(types(2).equals("class java.lang.Boolean"))
// verify the value
val values = rows(0)
assert(values.getDouble(0).equals(12312321321321312312312312123D))
assert(values.getInt(1).equals(1))
assert(values.getBoolean(2).equals(false))
}
|
|
Will review this today. |
docs/sql-programming-guide.md
Outdated
| </tr> | ||
|
|
||
| <tr> | ||
| <td><code>customDataFrameColumnTypes</code></td> |
customSchema
docs/sql-programming-guide.md
Outdated
| <tr> | ||
| <td><code>customDataFrameColumnTypes</code></td> | ||
| <td> | ||
| The DataFrame column data types to use instead of the defaults when reading data from jdbc API. (e.g: <code>"id DECIMAL(38, 0), name STRING")</code>. The specified types should be valid spark sql data types. This option applies only to reading. |
This is not limited to DataFrame.
The custom schema to use for reading data from JDBC connectors. For example,
"id DECIMAL(38, 0), name STRING"). The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. This option applies only to reading.
| .option("dbtable", "schema.tablename") \ | ||
| .option("user", "username") \ | ||
| .option("password", "password") \ | ||
| .option("customDataFrameColumnTypes", "id DECIMAL(38, 0), name STRING") \ |
customSchema
| val jdbcDF2 = spark.read | ||
| .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties) | ||
| // Specifying dataframe column data types on read | ||
| connectionProperties.put("customDataFrameColumnTypes", "id DECIMAL(38, 0), name STRING") |
customSchema
| connectionProperties.put("password", "password") | ||
| val jdbcDF2 = spark.read | ||
| .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties) | ||
| // Specifying dataframe column data types on read |
Specifying the custom data types of the read schema
| */ | ||
| def jdbc(url: String, table: String, properties: Properties): DataFrame = { | ||
| assertNoSpecifiedSchema("jdbc") | ||
| assertJdbcAPISpecifiedDataFrameSchema() |
Users should be able to do it either way. If users specify them in both the schema() API and the customSchema option, we should issue an exception.
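For illustration, a minimal sketch of that guard; the method name and where it is called from are assumptions:

import org.apache.spark.sql.types.StructType

// Reject the read if a schema is supplied through both the schema() API
// and the customSchema JDBC option.
def assertNoConflictingSchemas(
    userSpecifiedSchema: Option[StructType],
    customSchema: Option[String]): Unit = {
  if (userSpecifiedSchema.isDefined && customSchema.isDefined) {
    throw new IllegalArgumentException(
      "Specify the schema either via the schema() API or the customSchema option, not both.")
  }
}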
|
Test build #81536 has finished for PR 18266 at commit
|
| */ | ||
| private def pruneSchema(schema: StructType, columns: Array[String]): StructType = { | ||
| val fieldMap = Map(schema.fields.map(x => x.metadata.getString("name") -> x): _*) | ||
| val fieldMap = Map(schema.fields.map(x => x.name -> x): _*) |
Sorry, I did not get your point. Could you show me an example? Is it a behavior-breaking change?
scala> org.apache.spark.sql.catalyst.parser.CatalystSqlParser.parseTableSchema("id int, name string").fields.map(x => x.metadata.getString("name") -> x)
java.util.NoSuchElementException: key not found: name
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at org.apache.spark.sql.types.Metadata.get(Metadata.scala:111)
at org.apache.spark.sql.types.Metadata.getString(Metadata.scala:60)
at $anonfun$1.apply(<console>:24)
at $anonfun$1.apply(<console>:24)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
... 48 elided
|
Test build #81601 has finished for PR 18266 at commit
|
gatorsmile left a comment
Looks pretty good! Thanks!
| .option("dbtable", "schema.tablename") \ | ||
| .option("user", "username") \ | ||
| .option("password", "password") \ | ||
| .option("customDataFrameColumnTypes", "id DECIMAL(38, 0), name STRING") \ |
Please rename this.
| */ | ||
| private def pruneSchema(schema: StructType, columns: Array[String]): StructType = { | ||
| val fieldMap = Map(schema.fields.map(x => x.metadata.getString("name") -> x): _*) | ||
| val fieldMap = Map(schema.fields.map(x => x.name -> x): _*) |
I see. Could we just get rid of the line where we put name in the metadata?
It seems safe to remove this line.
| sqlContext.sessionState.conf.resolver) | ||
| } else { | ||
| schema | ||
| } |
val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
jdbcOptions.customSchema match {
  case Some(customSchema) => JdbcUtils.parseUserSpecifiedColumnTypes(
    tableSchema, customSchema, sparkSession.sessionState.conf.resolver)
  case None => tableSchema
}
| */ | ||
| def parseUserSpecifiedColumnTypes( | ||
| schema: StructType, | ||
| columnTypes: String, |
def getCustomSchema(
    tableSchema: StructType,
    customSchema: String,
    nameEquality: Resolver): StructType = {
| userSchema.fieldNames.foreach { col => | ||
| schema.find(f => nameEquality(f.name, col)).getOrElse { | ||
| throw new AnalysisException( | ||
| s"${JDBCOptions.JDBC_CUSTOM_DATAFRAME_COLUMN_TYPES} option column $col not found in " + |
val colNames = tableSchema.fieldNames.mkString(",")
throw new AnalysisException(s"Please provide all the columns, all columns are: $colNames")
|
Test build #81719 has finished for PR 18266 at commit
|
|
@wangyum Could you update the example in the PR description? |
|
@gatorsmile Done |
|
LGTM |
|
Thanks! Merged to master. |
…partial fields.

## What changes were proposed in this pull request?

apache#18266 added a new feature to support reading a JDBC table with a custom schema, but all the fields must be specified. For simplicity, this PR supports specifying partial fields.

## How was this patch tested?

unit tests

Author: Yuming Wang <[email protected]>

Closes apache#19231 from wangyum/SPARK-22002.
What changes were proposed in this pull request?
The auto-generated Oracle schema is sometimes not what we expect:
- number(1) is auto-mapped to BooleanType, which is sometimes not what we expect, per SPARK-20921.
- number is auto-mapped to Decimal(38,10), which can't read big values, per SPARK-20427.
This PR fixes this issue by letting users specify a custom schema, as shown in the sketch below.
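A minimal sketch of the intended usage, based on the test case earlier in this thread and on the renamed customSchema option requested by the reviewers; the table name tableWithCustomSchema, the jdbcUrl value, and the SparkSession spark are assumed to be in scope:

import java.util.Properties

// Override the default Oracle type mapping: NUMBER would otherwise be read as
// Decimal(38, 10) and fail on values whose precision exceeds 38.
val props = new Properties()
props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean")
val dfRead = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)
dfRead.show()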
How was this patch tested?
unit tests