@wayneguow
Contributor

What changes were proposed in this pull request?

This PR adds widening type promotions to AvroDeserializer. The following promotions are supported (Avro type -> Spark type):

  • Int -> Long
  • Int -> Double
  • Float -> Double
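
To illustrate what these promotions mean at the value level, here is a minimal sketch in plain Scala. It is not the AvroDeserializer code from this PR: `WideningSketch`, `SparkType`, `LongT`, `DoubleT`, and `widen` are hypothetical names used only for illustration, and the real deserializer works against Avro schemas and Spark `DataType`s rather than raw values.

```scala
// Hypothetical, simplified model of the three promotions listed above.
// This is NOT the actual AvroDeserializer implementation.
object WideningSketch {
  // Stand-ins for the Spark target types involved in the promotions.
  sealed trait SparkType
  case object LongT extends SparkType
  case object DoubleT extends SparkType

  // Returns the widened value, or None when the (value, target) pair
  // is not one of the three supported promotions.
  def widen(avroValue: Any, target: SparkType): Option[Any] =
    (avroValue, target) match {
      case (i: Int, LongT)     => Some(i.toLong)   // Int -> Long
      case (i: Int, DoubleT)   => Some(i.toDouble) // Int -> Double
      case (f: Float, DoubleT) => Some(f.toDouble) // Float -> Double
      case _                   => None             // anything else is unsupported here
    }
}
```

All three conversions are lossless in the sense that every Int fits in a Long or a Double, and every Float is exactly representable as a Double.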

Why are the changes needed?

Similar to PR #44368 for the Parquet reader, we should enable type promotion/widening for the Avro deserializer.

Does this PR introduce any user-facing change?

Yes. Avro data can now be read with a wider Spark type for the promotions listed above, which is more convenient for users.

How was this patch tested?

Passes GA; a new test case was added.

Was this patch authored or co-authored using generative AI tooling?

No.

@wayneguow
Contributor Author

cc @cloud-fan and @LuciferYang

Contributor Author

@cloud-fan I made a change here.

When converting the maximum or minimum float values to double, calling toDouble directly does not give the expected result. For example, the difference between the result and the expected value is actually a very large number.

I think the Parquet reader faces the same problem.

Contributor Author

I also think the float-to-double logic in our Cast expression is worth discussing, since it has a similar problem: a float matches case x: NumericType, and toDouble is then called directly.

private[this] def castToDouble(from: DataType): Any => Any = from match {
  case _: StringType =>
    buildCast[UTF8String](_, s => {
      val doubleStr = s.toString
      try doubleStr.toDouble catch {
        case _: NumberFormatException =>
          val d = Cast.processFloatingPointSpecialLiterals(doubleStr, false)
          if(ansiEnabled && d == null) {
            throw QueryExecutionErrors.invalidInputInCastToNumberError(
              DoubleType, s, getContextOrNull())
          } else {
            d
          }
      }
    })
  case BooleanType =>
    buildCast[Boolean](_, b => if (b) 1d else 0d)
  case DateType =>
    buildCast[Int](_, d => null)
  case TimestampType =>
    buildCast[Long](_, t => timestampToDouble(t))
  case x: NumericType =>
    val numeric = PhysicalNumericType.numeric(x)
    b => numeric.toDouble(b)

Contributor

What's the behavior of SQL CAST? We should be consistent with that.

Contributor Author

It calls toDouble directly, as numeric.toDouble(b) does in the code above.

Contributor Author

@wayneguow wayneguow Aug 6, 2024

But the actual difference will be very large.

Seq(Float.MinValue, Float.MinPositiveValue, Float.MaxValue)
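
The gap being discussed can be reproduced without Spark. The following is a minimal sketch in plain Scala; `FloatWidening`, `direct`, and `viaString` are hypothetical names, not anything from the PR. `direct` is the plain `toDouble` widening, which preserves the float's exact binary value, while `viaString` re-parses the float's printed decimal form as a double.

```scala
// Hypothetical demo of the two float -> double strategies discussed in this thread.
object FloatWidening {
  val samples: Seq[Float] =
    Seq(Float.MinValue, Float.MinPositiveValue, Float.MaxValue)

  // Exact binary widening: the double holds precisely the float's value.
  def direct(f: Float): Double = f.toDouble

  // Parse the float's printed decimal form as a double instead.
  def viaString(f: Float): Double = f.toString.toDouble

  def main(args: Array[String]): Unit =
    samples.foreach { f =>
      val d = direct(f)
      val s = viaString(f)
      // Print both results and their absolute difference for comparison.
      println(s"float=$f direct=$d viaString=$s diff=${math.abs(d - s)}")
    }
}
```

For Float.MaxValue the two results differ by more than 1e30 in absolute terms, which is the "very large" difference mentioned above. Relative to the magnitude of the value (about 3.4e38), the gap is on the order of 1e-8, and the direct widening still round-trips back to the original float exactly.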

Contributor

Shall we use the same code as Cast instead of .toString.toDouble?

Contributor Author

OK, I agree we should be consistent with the Cast expression to avoid unnecessary trouble. I'll update this soon.

(That said, if we want to ensure accurate calculations, we should avoid calling toDouble directly.)

Contributor

Thanks for implementing this (I posted the original ticket :D)!

Is there a plan to convert Date -> TimestampNTZ (Parquet PR) and Int -> Decimal (Parquet PR)?

Contributor Author

Yes, I will open two separate PRs for those.

Contributor

@cloud-fan cloud-fan left a comment

LGTM if CI passes

@wayneguow
Contributor Author

Rebased on master; waiting for CI.

@cloud-fan cloud-fan closed this in 7769ef1 Aug 7, 2024
@cloud-fan
Contributor

thanks, merged to master

@wayneguow wayneguow deleted the SPARK-49082 branch February 11, 2025 04:25
HyukjinKwon pushed a commit that referenced this pull request Mar 24, 2025
### What changes were proposed in this pull request?

This change adds support for widening type promotions from `Date` to `TimestampNTZ` in `AvroDeserializer`. This PR is a follow-up to #47582, which adds support for other widening type promotions.

### Why are the changes needed?

When reading Avro files with a mix of Date and TimestampNTZ for a given column, the reader should be able to read all files and promote Date to TimestampNTZ instead of throwing an error when reading files with Date.

Although [SPARK-49082](https://issues.apache.org/jira/browse/SPARK-49082) was resolved by #47582, that PR did not include Date -> TimestampNTZ widening. The change in this PR is very similar to #44368 which adds support for Date -> TimestampNTZ widening for the Parquet reader.

### Does this PR introduce _any_ user-facing change?

Yes, users will no longer see an error when attempting to read a file containing Date when the read schema contains TimestampNTZ. The time will be set to 00:00, as has been done in #44368.
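
As a rough illustration of "the time will be set to 00:00", the sketch below assumes Spark's internal representations: a `Date` is an `Int` count of days since 1970-01-01, and a `TimestampNTZ` is a `Long` count of microseconds since 1970-01-01T00:00:00. `DateToTimestampNtzSketch`, `daysToMicros`, and `toLocalDateTime` are hypothetical names; this is not the code added to `AvroDeserializer`.

```scala
import java.time.{LocalDate, LocalDateTime, ZoneOffset}

// Hypothetical sketch of a Date -> TimestampNTZ widening.
object DateToTimestampNtzSketch {
  private val MicrosPerSecond: Long = 1000000L
  private val MicrosPerDay: Long = 24L * 60 * 60 * MicrosPerSecond

  // Days since epoch -> microseconds since epoch at midnight of that day.
  // multiplyExact throws ArithmeticException on overflow instead of wrapping.
  def daysToMicros(days: Int): Long =
    Math.multiplyExact(days.toLong, MicrosPerDay)

  // Render the microsecond value as a zone-less date-time, for inspection only.
  def toLocalDateTime(micros: Long): LocalDateTime =
    LocalDateTime.ofEpochSecond(
      Math.floorDiv(micros, MicrosPerSecond),
      (Math.floorMod(micros, MicrosPerSecond) * 1000L).toInt, // micros -> nanos
      ZoneOffset.UTC)
}
```

For example, converting `LocalDate.of(2024, 8, 7).toEpochDay.toInt` with `daysToMicros` and passing the result to `toLocalDateTime` yields midnight on 2024-08-07, with no time zone involved.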

### How was this patch tested?

New test in `AvroSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50315 from aldenlau-db/SPARK-49082.

Authored-by: Alden Lau <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>