Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Dec 16, 2016

What changes were proposed in this pull request?

CSV type inference causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type seen in a partition. Specifically, `inferRowType`, the `seqOp` of `aggregate`, returns the last decimal type. This PR fixes it to use `findTightestCommonType` instead.

decimal.csv

9.03E+12
1.19E+11

BEFORE

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(3,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3

AFTER

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(4,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
|      _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
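
For intuition, here is a minimal plain-Scala sketch of the decimal-widening rule the fix relies on (`Dec` and `commonDecimal` are hypothetical names, not Spark's API): keep the larger scale and the larger number of integral digits, instead of keeping the last type seen.

```scala
case class Dec(precision: Int, scale: Int) // hypothetical stand-in for DecimalType

def commonDecimal(a: Dec, b: Dec): Dec = {
  val scale = math.max(a.scale, b.scale)                             // digits kept right of the point
  val range = math.max(a.precision - a.scale, b.precision - b.scale) // digits kept left of the point
  Dec(range + scale, scale)
}

// 9.03E+12 infers as Dec(3, -10) and 1.19E+11 as Dec(3, -9). Keeping only the
// last type gives Dec(3, -9), which cannot hold 9.03E+12 at that scale (it
// needs precision 4, hence "precision 4 exceeds max precision 3" above).
commonDecimal(Dec(3, -10), Dec(3, -9)) // Dec(4, -9), the AFTER schema
```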

How was this patch tested?

Pass the newly added test case.

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70280 has finished for PR 16320 at commit 9e59ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @rxin, @falaki, and @HyukjinKwon.
Could you review this PR?

Member

Aha, so `typeSoFar` should keep the precision and scale while being (partially) aggregated within each partition.

Member Author

Thank you for the review, @HyukjinKwon.

Contributor

Looks like the fallback policy here is to use StringType; should we follow it?

@dongjoon-hyun (Member Author) commented Jan 3, 2017

Thank you for the review, @cloud-fan. I used NullType since `mergeRowTypes` does.

  // Merges the column types inferred from two partitions, padding the shorter
  // row with NullType and falling back to NullType when no tightest common
  // type exists for a column.
  def mergeRowTypes(first: Array[DataType], second: Array[DataType]): Array[DataType] = {
    first.zipAll(second, NullType, NullType).map { case (a, b) =>
      findTightestCommonType(a, b).getOrElse(NullType)
    }
  }

Member

Yes; otherwise it might end up with an incorrect data type. For example,

val path = "/tmp/test1"
Seq(s"${Long.MaxValue}1", "2015-12-01 00:00:00", "1").toDF().coalesce(1).write.text(path)
spark.read.option("inferSchema", true).csv(path).printSchema()
root
 |-- _c0: integer (nullable = true)

Member Author

You're correct. I'll change it to StringType.
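
For illustration, here is a self-contained sketch of the difference between the two fallbacks; the tiny `tightest` table is a hypothetical stand-in for Spark's `findTightestCommonType`, not the actual patch.

```scala
import org.apache.spark.sql.types._

// Toy widening table standing in for findTightestCommonType.
def tightest(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (t1, t2) if t1 == t2                              => Some(t1)
  case (NullType, t)                                     => Some(t)
  case (t, NullType)                                     => Some(t)
  case (IntegerType, LongType) | (LongType, IntegerType) => Some(LongType)
  case _                                                 => None
}

def mergeWithNullFallback(a: DataType, b: DataType): DataType =
  tightest(a, b).getOrElse(NullType)

def mergeWithStringFallback(a: DataType, b: DataType): DataType =
  tightest(a, b).getOrElse(StringType)

// The mixed column in the example above infers row types roughly like this.
val rowTypes = Seq[DataType](LongType, TimestampType, IntegerType)

rowTypes.reduce(mergeWithNullFallback)   // IntegerType: NullType "forgets" the conflict
rowTypes.reduce(mergeWithStringFallback) // StringType: the conflict stays visible
```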

@HyukjinKwon
Member

@dongjoon-hyun would we need an end-to-end test too?

@dongjoon-hyun
Member Author

Actually, I first wrote an end-to-end test based on the use-case example in the JIRA, but I removed it from here because the current test case is a minimal version of it.

@dongjoon-hyun
Member Author

Hi, @gatorsmile.
Could you review this PR?

@dongjoon-hyun
Member Author

Hi, @rxin and @falaki.
If any committers are going to review this part, I think you are the best people for it.
Could you share your opinion on this when you have some time?

@SparkQA

SparkQA commented Dec 26, 2016

Test build #70576 has finished for PR 16320 at commit 308de12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @gatorsmile.
Could you review this PR when you have some time?

Member

Could you add another test case here using an input constant with precision greater than 38?

Member Author

Oh, I missed your comment here. Let me try!

@dongjoon-hyun
Member Author

Thank you for the review, @gatorsmile.
I added another test case for an input constant with precision greater than 38.
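
For reference, "precision" here means the number of significant digits, as reported by java.math.BigDecimal; Spark's `DecimalType.MAX_PRECISION` is 38, so a constant like the one sketched below cannot be inferred as a decimal.

```scala
// A 39-digit constant: one digit more than DecimalType.MAX_PRECISION (38).
val tooWide = new java.math.BigDecimal("9" * 39)
tooWide.precision // 39, so no Spark DecimalType can represent it exactly
```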

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70685 has finished for PR 16320 at commit c1e07a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this CSVInferSchema issue again, @gatorsmile?

@gatorsmile
Member

LGTM cc @cloud-fan

@dongjoon-hyun
Member Author

Thank you for the review, @gatorsmile.
Happy New Year!

@gatorsmile
Member

Happy New Year!

@dongjoon-hyun
Member Author

Thank you again, @cloud-fan and @HyukjinKwon. I updated the fallback data type.

@gatorsmile
Member

Could you please add the test case?

@dongjoon-hyun
Member Author

I assume you meant this one, right?

val path = "/tmp/test1"
Seq(s"${Long.MaxValue}1", "2015-12-01 00:00:00", "1").toDF().coalesce(1).write.text(path)
spark.read.option("inferSchema", true).csv(path).printSchema()
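
With the StringType fallback, this example should now infer the column as a plain string, presumably printing the following (expected, not captured, output):

```scala
root
 |-- _c0: string (nullable = true)
```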

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 3, 2017

Yep. I added the testcase as a minimized version, too. @gatorsmile

@SparkQA

SparkQA commented Jan 3, 2017

Test build #70798 has started for PR 16320 at commit e59631b.

@SparkQA

SparkQA commented Jan 3, 2017

Test build #70795 has finished for PR 16320 at commit 393d3a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Jan 3, 2017

The test case coverage in the suite `CSVInferSchemaSuite.scala` looks random. I am afraid future code changes could easily break the existing type-inference rules. Could you improve it in a separate PR? You might find more issues when you try to improve the test cases.

@dongjoon-hyun
Member Author

Retest this please

@dongjoon-hyun
Member Author

I see, @gatorsmile. I will make a PR to improve the coverage.

For this issue, the failure on the last commit (adding a test case) was an R failure, so it's irrelevant. I reran the tests.

@SparkQA

SparkQA commented Jan 3, 2017

Test build #70809 has finished for PR 16320 at commit e59631b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Thanks, merging to master!

@cloud-fan
Contributor

Oh, I forgot to backport it to 2.1/2.0. @gatorsmile, can you do it? My connection is bad right now.

@asfgit asfgit closed this in 7a2b5f9 Jan 3, 2017
@dongjoon-hyun
Member Author

Thank you, @cloud-fan !

@gatorsmile
Member

@dongjoon-hyun Could you submit a backport PR to 2.1? I am unable to merge this PR to 2.1. Thanks!

@dongjoon-hyun
Member Author

Sure. I'll create a backport PR for 2.1.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 4, 2017
…find a common type with `typeSoFar`

Author: Dongjoon Hyun <[email protected]>

Closes apache#16320 from dongjoon-hyun/SPARK-18877.
@dongjoon-hyun dongjoon-hyun deleted the SPARK-18877 branch January 6, 2017 18:18
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…find a common type with `typeSoFar`

Author: Dongjoon Hyun <[email protected]>

Closes apache#16320 from dongjoon-hyun/SPARK-18877.