[SPARK-49082][SQL] Widening type promotions in `AvroDeserializer` #47582

wayneguow · 2024-08-02T09:53:42Z

What changes were proposed in this pull request?

This PR adds widening type promotions to AvroDeserializer. The following promotions are supported (Avro type -> Spark type):

Int -> Long

Int -> Double

Float -> Double
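
For readers who want to see the shape of these promotions, here is a minimal sketch in plain Scala. It is not the actual AvroDeserializer code: the types `AvroPrimitive` and `SparkPrimitive`, the object `WideningSketch`, and the `converter` method are all hypothetical names introduced for illustration, and the real implementation works on Avro schemas and Catalyst data types rather than on these toy case objects.

```scala
// Hypothetical, simplified model of the three promotions listed above.
// None of these names come from Spark or Avro.
sealed trait AvroPrimitive
case object AvroInt extends AvroPrimitive
case object AvroFloat extends AvroPrimitive

sealed trait SparkPrimitive
case object SparkLong extends SparkPrimitive
case object SparkDouble extends SparkPrimitive

object WideningSketch {
  // One value converter per supported promotion.
  private val intToLong: Any => Any = { case i: Int => i.toLong }
  private val intToDouble: Any => Any = { case i: Int => i.toDouble }
  private val floatToDouble: Any => Any = { case f: Float => f.toDouble }

  // Returns a converter when the (Avro, Spark) pair is one of the three
  // promotions from the PR description, otherwise None.
  def converter(from: AvroPrimitive, to: SparkPrimitive): Option[Any => Any] =
    (from, to) match {
      case (AvroInt, SparkLong)     => Some(intToLong)
      case (AvroInt, SparkDouble)   => Some(intToDouble)
      case (AvroFloat, SparkDouble) => Some(floatToDouble)
      case _                        => None // e.g. Float -> Long is not a widening
    }
}
```

Looking up an unsupported pair such as `WideningSketch.converter(AvroFloat, SparkLong)` yields `None`, which mirrors the idea that only the listed promotions are accepted.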

Why are the changes needed?

Similar to PR #44368 for the Parquet reader, the Avro deserializer should support type promotion/widening.

Does this PR introduce any user-facing change?

Yes. Avro data can now be read with a wider Spark type for the promotions listed above, which is more convenient for users.

How was this patch tested?

Pass GA and add a new test case.

Was this patch authored or co-authored using generative AI tooling?

No.

wayneguow · 2024-08-05T02:39:03Z

cc @cloud-fan and @LuciferYang

connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala

wayneguow · 2024-08-05T11:03:27Z

connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala

@cloud-fan I made a change here.

When converting maximum or minimum values from float to double, calling the implicit toDouble directly gives a result that does not meet expectations: the difference between the result and the expected value is actually a very large number.

I think the Parquet reader faces the same problem:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

Line 330 in d431d40

this.updater.setDouble(value)

I also think the float-to-double logic in our Cast expression is worth discussing, since it has a similar problem: for float, case x: NumericType is matched and toDouble is called directly.

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

Lines 1008 to 1031 in 6d32472

    private[this] def castToDouble(from: DataType): Any => Any = from match {
      case _: StringType =>
        buildCast[UTF8String](_, s => {
          val doubleStr = s.toString
          try doubleStr.toDouble catch {
            case _: NumberFormatException =>
              val d = Cast.processFloatingPointSpecialLiterals(doubleStr, false)
              if(ansiEnabled && d == null) {
                throw QueryExecutionErrors.invalidInputInCastToNumberError(
                  DoubleType, s, getContextOrNull())
              } else {
                d
              }
          }
        })
      case BooleanType =>
        buildCast[Boolean](_, b => if (b) 1d else 0d)
      case DateType =>
        buildCast[Int](_, d => null)
      case TimestampType =>
        buildCast[Long](_, t => timestampToDouble(t))
      case x: NumericType =>
        val numeric = PhysicalNumericType.numeric(x)
        b => numeric.toDouble(b)

What's the behavior of SQL CAST? We should be consistent with it.

It calls toDouble directly, just like in the code above numeric.toDouble(b).

But the actual difference will be very large.

Seq(Float.MinValue, Float.MinPositiveValue, Float.MaxValue)

Shall we use the same code instead of .toString.toDouble?

Well, I agree we should be consistent with the Cast expression to avoid unnecessary trouble. I will update soon.

(But actually, if we want to ensure accurate calculations, we'd better avoid using toDouble directly.)
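
To make the discussion above concrete, the following sketch compares the two conversions on the values mentioned in the thread. It uses only the Scala standard library; the object and method names (`FloatWideningDemo`, `widenExact`, `widenViaString`) are hypothetical and not part of Spark. Note that a direct `toDouble` is an exact IEEE 754 widening of the float's binary value, while `.toString.toDouble` goes through the shortest decimal string that identifies the float, so the two doubles can differ even though both round back to the same float.

```scala
object FloatWideningDemo {
  // Exact IEEE 754 widening; this is what a direct toDouble call does.
  def widenExact(f: Float): Double = f.toDouble

  // The alternative mentioned in the thread: go through the decimal string.
  def widenViaString(f: Float): Double = f.toString.toDouble

  def main(args: Array[String]): Unit = {
    // Same sample values as in the comment above, plus 0.1f for illustration.
    Seq(Float.MinValue, Float.MinPositiveValue, Float.MaxValue, 0.1f).foreach { f =>
      println(s"$f exact=${widenExact(f)} viaString=${widenViaString(f)}")
    }
  }
}
```

For `Float.MaxValue` the two results differ by more than 1e30 in absolute terms, which is the "very large" difference described above, although the relative difference is tiny and both values convert back to `Float.MaxValue`.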

jackierwzhang · 2024-08-06T23:14:16Z

connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala

Thanks for implementing this (I posted the original ticket :D)!

Is there a plan to convert Date -> TimestampNTZ (Parquet PR) and Int -> Decimal (Parquet PR)?

Yes, I will open two separate PRs for those.

cloud-fan

LGTM if CI passes

wayneguow · 2024-08-07T03:49:55Z

Rebased on master and waiting for CI~

cloud-fan · 2024-08-07T14:26:36Z

thanks, merged to master

### What changes were proposed in this pull request?

This change adds support for widening type promotions from `Date` to `TimestampNTZ` in `AvroDeserializer`. This PR is a follow-up to #47582 which adds support for other widening type promotions.

### Why are the changes needed?

When reading Avro files with a mix of Date and TimestampNTZ for a given column, the reader should be able to read all files and promote Date to TimestampNTZ instead of throwing an error when reading files with Date. Although [SPARK-49082](https://issues.apache.org/jira/browse/SPARK-49082) was resolved by #47582, that PR did not include Date -> TimestampNTZ widening. The change in this PR is very similar to #44368 which adds support for Date -> TimestampNTZ widening for the Parquet reader.

### Does this PR introduce _any_ user-facing change?

Yes, users will no longer see an error when attempting to read a file containing Date when the read schema contains TimestampNTZ. The time will be set to 00:00, as has been done in #44368.

### How was this patch tested?

New test in `AvroSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50315 from aldenlau-db/SPARK-49082.

Authored-by: Alden Lau <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
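
The "time will be set to 00:00" behavior can be illustrated with a small standalone sketch. It assumes a date is represented as days since 1970-01-01 and a TimestampNTZ as microseconds since 1970-01-01T00:00 without a time zone; the object `DateToNtzSketch` and its methods are hypothetical helpers written with `java.time`, not the code used in the follow-up PR.

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}
import java.time.temporal.ChronoUnit

object DateToNtzSketch {
  private val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // Widen a day count to microseconds at midnight (00:00) of that day.
  // multiplyExact throws on overflow instead of silently wrapping.
  def daysToNtzMicros(days: Int): Long =
    Math.multiplyExact(days.toLong, MicrosPerDay)

  // Helper for inspection: interpret micros as a zone-less local date-time.
  def microsToLocalDateTime(micros: Long): LocalDateTime =
    LocalDateTime.ofEpochSecond(0L, 0, ZoneOffset.UTC).plus(micros, ChronoUnit.MICROS)
}
```

Converting `LocalDate.of(2024, 8, 2).toEpochDay.toInt` with `daysToNtzMicros` and back with `microsToLocalDateTime` gives that date at the start of the day, which is the midnight value the commit message describes.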

github-actions bot added SQL AVRO labels Aug 2, 2024

wayneguow force-pushed the SPARK-49082 branch from 3bcc664 to fe3558d Compare August 2, 2024 11:41

cloud-fan reviewed Aug 5, 2024

cloud-fan approved these changes Aug 5, 2024

cloud-fan reviewed Aug 5, 2024

wayneguow commented Aug 5, 2024

wayneguow force-pushed the SPARK-49082 branch from c41e0d9 to 6812f9b Compare August 5, 2024 12:16

wayneguow requested a review from cloud-fan August 6, 2024 01:42

jackierwzhang reviewed Aug 6, 2024

cloud-fan approved these changes Aug 7, 2024

wayneguow added 3 commits August 7, 2024 11:47

update

0e8042f

update

418170e

update

a7b7793

wayneguow force-pushed the SPARK-49082 branch from 0a41aa0 to a7b7793 Compare August 7, 2024 03:48

cloud-fan closed this in 7769ef1 Aug 7, 2024

wayneguow deleted the SPARK-49082 branch February 11, 2025 04:25

aldenlau-db mentioned this pull request Mar 19, 2025

[SPARK-49082][SQL] Support widening Date to TimestampNTZ in Avro reader #50315

Closed
