Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Dec 16, 2016

What changes were proposed in this pull request?

CSV type inference causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type seen in a partition. Specifically, `inferRowType`, the `seqOp` of `aggregate`, returns the last decimal type. This PR fixes it to use `findTightestCommonType` instead.

decimal.csv

9.03E+12
1.19E+11

BEFORE

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(3,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3

AFTER

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(4,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
|      _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
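
For intuition, here is a minimal plain-Scala sketch of the decimal-widening rule the fix relies on (`Dec` and `commonDecimal` are hypothetical names, not Spark's API): keep the larger scale and the larger number of integral digits, instead of keeping the last type seen.

```scala
case class Dec(precision: Int, scale: Int) // hypothetical stand-in for DecimalType

def commonDecimal(a: Dec, b: Dec): Dec = {
  val scale = math.max(a.scale, b.scale)                             // digits kept right of the point
  val range = math.max(a.precision - a.scale, b.precision - b.scale) // digits kept left of the point
  Dec(range + scale, scale)
}

// 9.03E+12 infers as Dec(3, -10) and 1.19E+11 as Dec(3, -9). Keeping only the
// last type gives Dec(3, -9), which cannot hold 9.03E+12 at that scale (it
// needs precision 4, hence "precision 4 exceeds max precision 3" above).
commonDecimal(Dec(3, -10), Dec(3, -9)) // Dec(4, -9), the AFTER schema
```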

How was this patch tested?

Pass the newly added test case.

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70280 has finished for PR 16320 at commit 9e59ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @rxin, @falaki, and @HyukjinKwon.
Could you review this PR?

Member

Aha, so `typeSoFar` should keep the precision and scale while being (partially) aggregated within each partition.

Member Author

Thank you for the review, @HyukjinKwon.

Contributor

Looks like the fallback policy here is to use StringType; should we follow it?

@dongjoon-hyun (Member Author) commented Jan 3, 2017

Thank you for the review, @cloud-fan. I used NullType since `mergeRowTypes` does.

  // Merges the column types inferred from two partitions, padding the shorter
  // row with NullType and falling back to NullType when no tightest common
  // type exists for a column.
  def mergeRowTypes(first: Array[DataType], second: Array[DataType]): Array[DataType] = {
    first.zipAll(second, NullType, NullType).map { case (a, b) =>
      findTightestCommonType(a, b).getOrElse(NullType)
    }
  }

Member

Yes; otherwise it might end up with an incorrect data type. For example,

val path = "/tmp/test1"
Seq(s"${Long.MaxValue}1", "2015-12-01 00:00:00", "1").toDF().coalesce(1).write.text(path)
spark.read.option("inferSchema", true).csv(path).printSchema()
root
 |-- _c0: integer (nullable = true)

Member Author

You're correct. I'll change it to StringType.
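
For illustration, here is a self-contained sketch of the difference between the two fallbacks; the tiny `tightest` table is a hypothetical stand-in for Spark's `findTightestCommonType`, not the actual patch.

```scala
import org.apache.spark.sql.types._

// Toy widening table standing in for findTightestCommonType.
def tightest(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (t1, t2) if t1 == t2                              => Some(t1)
  case (NullType, t)                                     => Some(t)
  case (t, NullType)                                     => Some(t)
  case (IntegerType, LongType) | (LongType, IntegerType) => Some(LongType)
  case _                                                 => None
}

def mergeWithNullFallback(a: DataType, b: DataType): DataType =
  tightest(a, b).getOrElse(NullType)

def mergeWithStringFallback(a: DataType, b: DataType): DataType =
  tightest(a, b).getOrElse(StringType)

// The mixed column in the example above infers row types roughly like this.
val rowTypes = Seq[DataType](LongType, TimestampType, IntegerType)

rowTypes.reduce(mergeWithNullFallback)   // IntegerType: NullType "forgets" the conflict
rowTypes.reduce(mergeWithStringFallback) // StringType: the conflict stays visible
```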

@HyukjinKwon
Member

@dongjoon-hyun would we need an end-to-end test too?

@dongjoon-hyun
Member Author

Actually, I first wrote an end-to-end test based on the use-case example in the JIRA, but I removed it from here because the current test case is a minimal version of it.

@dongjoon-hyun
Member Author

Hi, @gatorsmile.
Could you review this PR?

@dongjoon-hyun
Member Author

Hi, @rxin and @falaki.
If any committers are going to review this part, I think you are the best people for it.
Could you share your opinion on this when you have some time?

@SparkQA

SparkQA commented Dec 26, 2016

Test build #70576 has finished for PR 16320 at commit 308de12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @gatorsmile.
Could you review this PR when you have some time?

Member

Could you add another test case here using an input constant with precision greater than 38?

Member Author

Oh, I missed your comment here. Let me try!

@dongjoon-hyun
Member Author

Thank you for the review, @gatorsmile.
I added another test case for an input constant with precision greater than 38.
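
For reference, "precision" here means the number of significant digits, as reported by java.math.BigDecimal; Spark's `DecimalType.MAX_PRECISION` is 38, so a constant like the one sketched below cannot be inferred as a decimal.

```scala
// A 39-digit constant: one digit more than DecimalType.MAX_PRECISION (38).
val tooWide = new java.math.BigDecimal("9" * 39)
tooWide.precision // 39, so no Spark DecimalType can represent it exactly
```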

@SparkQA

SparkQA commented Dec 29, 2016

Test build #70685 has finished for PR 16320 at commit c1e07a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Could you review this CSVInferSchema issue again, @gatorsmile?

@gatorsmile
Member

LGTM cc @cloud-fan

@dongjoon-hyun
Member Author

Thank you for the review, @gatorsmile.
Happy New Year!

@gatorsmile
Member

Happy New Year!

@dongjoon-hyun
Member Author

Thank you again, @cloud-fan and @HyukjinKwon. I updated the fallback data type.

@gatorsmile
Member

Could you please add the test case?

@dongjoon-hyun
Member Author

I assume you meant this one, right?

val path = "/tmp/test1"
Seq(s"${Long.MaxValue}1", "2015-12-01 00:00:00", "1").toDF().coalesce(1).write.text(path)
spark.read.option("inferSchema", true).csv(path).printSchema()
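
With the StringType fallback, this example should now infer the column as a plain string, presumably printing the following (expected, not captured, output):

```scala
root
 |-- _c0: string (nullable = true)
```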

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jan 3, 2017

Yep. I added the testcase as a minimized version, too. @gatorsmile

@SparkQA

SparkQA commented Jan 3, 2017

Test build #70798 has started for PR 16320 at commit e59631b.

@SparkQA

SparkQA commented Jan 3, 2017

Test build #70795 has finished for PR 16320 at commit 393d3a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Jan 3, 2017

The test case coverage in the suite `CSVInferSchemaSuite.scala` looks random. I am afraid future code changes could easily break the existing type-inference rules. Could you improve it in a separate PR? You might find more issues when you try to improve the test cases.

@dongjoon-hyun
Member Author

Retest this please

@dongjoon-hyun
Member Author

I see, @gatorsmile. I will make a PR to improve the coverage.

For this issue, the failure on the last commit (adding a test case) was an R failure, so it's irrelevant. I reran the tests.

@SparkQA

SparkQA commented Jan 3, 2017

Test build #70809 has finished for PR 16320 at commit e59631b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Thanks, merging to master!

@cloud-fan
Contributor

Oh, I forgot to backport it to 2.1/2.0. @gatorsmile, can you do it? My connection is bad right now.

@asfgit asfgit closed this in 7a2b5f9 Jan 3, 2017
@dongjoon-hyun
Member Author

Thank you, @cloud-fan !

@gatorsmile
Member

@dongjoon-hyun Could you submit a backport PR to 2.1? I am unable to merge this PR to 2.1. Thanks!

@dongjoon-hyun
Member Author

Sure. I'll create a backport PR for 2.1.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 4, 2017
…find a common type with `typeSoFar`

Author: Dongjoon Hyun <[email protected]>

Closes apache#16320 from dongjoon-hyun/SPARK-18877.
@dongjoon-hyun dongjoon-hyun deleted the SPARK-18877 branch January 6, 2017 18:18
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…find a common type with `typeSoFar`

Author: Dongjoon Hyun <[email protected]>

Closes apache#16320 from dongjoon-hyun/SPARK-18877.