
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

The code below is currently inferred as decimal, whereas previously it was inferred as string.

In branch-2.4, the type inference path for decimals and the data parsing path are different, so the code below:

scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema()

produced string as its type:

root
 |-- _c0: string (nullable = true)

In the current master, it now infers decimal as below:

root
 |-- _c0: decimal(2,0) (nullable = true)

This happened after #22979 because, after that PR, there is now only one way to parse decimals:

(s: String) => new java.math.BigDecimal(s.replaceAll(",", ""))
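
For illustration, a minimal snippet (not from the PR) showing why "1,2" now looks like a valid decimal during inference:

// Sketch only: the comma is stripped before parsing, so "1,2" becomes "12",
// a valid BigDecimal, and inference picks decimal(2,0).
val parser = (s: String) => new java.math.BigDecimal(s.replaceAll(",", ""))
println(parser("1,2"))  // prints 12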

After the fix:

root
 |-- _c0: string (nullable = true)

This PR proposes to restore the previous behaviour in CSVInferSchema.

How was this patch tested?

Manually tested and unit tests were added.
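
For example, a check along these lines (a sketch assuming an existing SparkSession named spark, not the PR's exact test) covers the restored behaviour:

import org.apache.spark.sql.types.StringType
import spark.implicits._

// A field containing "," with a non-comma delimiter should again be inferred
// as string rather than decimal.
val schema = spark.read
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .csv(Seq("1,2").toDS())
  .schema
assert(schema.head.dataType == StringType)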

@HyukjinKwon
Member Author

Honestly, I think it would also be okay to drop this odd previous behaviour, but I decided to follow the intent of keeping backward compatibility for now, since that is simple.

I think we should drop this replaceAll(",", "") behaviour entirely soon.

@HyukjinKwon
Member Author

cc @MaxGekk and @cloud-fan

@MaxGekk
Member

MaxGekk commented Apr 22, 2019

What about supporting a prefersDecimal option for CSV, as we have for JSON:

val prefersDecimal =
  parameters.get("prefersDecimal").map(_.toBoolean).getOrElse(false)
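
For reference, a hedged example of how the existing JSON option is used today (a CSV equivalent would be a new addition; assumes a SparkSession named spark):

import spark.implicits._

// With prefersDecimal enabled, JSON floating-point values are inferred as
// DecimalType instead of double.
spark.read
  .option("prefersDecimal", "true")
  .json(Seq("""{"v": 1.0}""").toDS())
  .printSchema()
// v is reported as a decimal type rather than double.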

@HyukjinKwon
Member Author

HyukjinKwon commented Apr 22, 2019

Makes sense, but I think that doesn't fully address the current issue; some people could still complain about changing this behaviour.

IIUC, prefersDecimal is a switch that decides whether we infer decimals or not. In this issue, it looks like users still want decimal inference but want to avoid the one case where the value contains ,.

Let's consider that option later, when we drop this behaviour.

}

Seq("en-US", "ko-KR", "ru-RU", "de-DE").foreach(checkDecimalInfer(_, DecimalType(7, 0)))
// input like '1,0' is inferred as strings for backward compatibility.
Contributor

Sorry, I may not know all the context, but doesn't 1,0 mean two int columns?

Member

The field separator here is |, not a comma, so 1,0 cannot be considered as separate columns.
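
A quick illustration (a sketch, assuming a SparkSession named spark) of how such a row is read with | as the delimiter:

import spark.implicits._

// "1,0" contains no "|", so the whole token stays in a single column _c0.
spark.read.option("delimiter", "|").csv(Seq("1,0").toDS()).show()
// +---+
// |_c0|
// +---+
// |1,0|
// +---+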

@SparkQA

SparkQA commented Apr 22, 2019

Test build #104804 has finished for PR 24437 at commit b9b76d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    options.locale)

-  private val decimalParser = {
+  private val decimalParser = if (options.locale == Locale.US) {
@MaxGekk
Member

MaxGekk commented Apr 22, 2019

It looks slightly ugly that we handle the special case of Locale.US for JSON inside of ExprUtils.getDecimalParser but for CSV outside of it. Maybe we could unify the implementation and handle the special cases outside of the generic ExprUtils.getDecimalParser (or process both special cases inside of it)?

Member Author

Yes, I think we have to do that. Let's do it later, when we drop the replacement behaviour entirely. I realised that the CSV inference path alone handles it without the replacement.
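
For context, the shape the CSV-side special case presumably takes (a sketch reconstructed from the diff above, not verbatim from the PR):

// Sketch: for the default US locale, keep the old behaviour by using a plain
// BigDecimal parser during inference; "1,2" then fails to parse and falls
// back to string inference. Other locales use the locale-aware parser.
private val decimalParser = if (options.locale == Locale.US) {
  (s: String) => new java.math.BigDecimal(s)
} else {
  ExprUtils.getDecimalParser(options.locale)
}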

@MaxGekk
Member

MaxGekk commented Apr 22, 2019

ExprUtils.getDecimalParser is used in UnivocityParser:

Decimal(decimalParser(datum), dt.precision, dt.scale)

Do we need to handle the backward compatibility case there as well?

@HyukjinKwon
Member Author

Nope. For parsing itself, the replacement was somehow already being done before, and it looks like it is handled fine:

(s: String) => new java.math.BigDecimal(s.replaceAll(",", ""))
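
As a quick illustration (snippet not from the PR, using the parser above), explicit-schema parsing simply strips the grouping separator before building the BigDecimal:

// Sketch only: "," is removed, so US-grouped input still parses as expected.
val parse = (s: String) => new java.math.BigDecimal(s.replaceAll(",", ""))
println(parse("1,000.5"))  // prints 1000.5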

@HyukjinKwon
Member Author

Let me get this in if you guys don't mind. This literally just restores the previous behaviour as it was.

@HyukjinKwon
Member Author

Thanks, @MaxGekk and @cloud-fan. Merged to master.

@HyukjinKwon HyukjinKwon deleted the SPARK-27512 branch March 3, 2020 01:18