[SPARK-49082][SQL] Widening type promotions in AvroDeserializer
#47582
Conversation
cc @cloud-fan and @LuciferYang
@cloud-fan I made a change here.
When converting maximum or minimum float values to double, using the implicit toDouble directly does not produce the expected result: the difference between the two results is actually a very large number. I think the Parquet reader faces the same problem.
Line 330 in d431d40:

```scala
this.updater.setDouble(value)
```
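A minimal sketch of the discrepancy described above (plain Scala, no Spark needed), using the maximum float value: widening a Float's binary value to Double is not the same as re-parsing its decimal rendering.

```scala
val f: Float = Float.MaxValue
val widened  = f.toDouble           // keeps the exact binary value: 3.4028234663852886E38
val reparsed = f.toString.toDouble  // round-trips the decimal form:  3.4028235E38

// The two doubles differ by more than 10^30 -- "a very large number".
println(reparsed - widened)
```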
At the same time, I think the float-to-double logic in our cast expression (for float, `case x: NumericType` is matched and then toDouble is called directly) is also worth discussing, since it has a similar problem.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Lines 1008 to 1031 in 6d32472:

```scala
private[this] def castToDouble(from: DataType): Any => Any = from match {
  case _: StringType =>
    buildCast[UTF8String](_, s => {
      val doubleStr = s.toString
      try doubleStr.toDouble catch {
        case _: NumberFormatException =>
          val d = Cast.processFloatingPointSpecialLiterals(doubleStr, false)
          if (ansiEnabled && d == null) {
            throw QueryExecutionErrors.invalidInputInCastToNumberError(
              DoubleType, s, getContextOrNull())
          } else {
            d
          }
      }
    })
  case BooleanType =>
    buildCast[Boolean](_, b => if (b) 1d else 0d)
  case DateType =>
    buildCast[Int](_, d => null)
  case TimestampType =>
    buildCast[Long](_, t => timestampToDouble(t))
  case x: NumericType =>
    val numeric = PhysicalNumericType.numeric(x)
    b => numeric.toDouble(b)
```
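In the `case x: NumericType` branch above, SQL CAST widens a Float with `numeric.toDouble`, i.e. plain binary widening rather than re-parsing the decimal rendering. A small check outside Spark, assuming `PhysicalNumericType.numeric` behaves like Scala's standard `Numeric[Float]` (an assumption, since that is a Spark internal):

```scala
// Assumption: Spark's PhysicalNumericType.numeric(FloatType) behaves like
// scala.math.Numeric[Float]; the standard instance is used here to illustrate.
val numeric = implicitly[Numeric[Float]]
val f = Float.MaxValue

// SQL CAST's path: plain widening, keeps the exact binary value.
assert(numeric.toDouble(f) == f.toDouble)

// It does NOT match re-parsing the decimal string form of the float.
assert(numeric.toDouble(f) != f.toString.toDouble)
```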
what's the behavior of SQL CAST? We should be consistent with that
shall we use the same code instead of .toString.toDouble?
Well, I agree we should be consistent with the Cast expression to avoid unnecessary trouble. Will update soon.
(But actually, if we want to ensure accurate calculations, we'd better avoid using toDouble directly.)
Yes, I will separate that into another two PRs.
cloud-fan left a comment
LGTM if CI passes
Rebase master and waiting for CI~
thanks, merged to master
### What changes were proposed in this pull request?
This change adds support for widening type promotions from `Date` to `TimestampNTZ` in `AvroDeserializer`. This PR is a follow-up to #47582, which adds support for other widening type promotions.

### Why are the changes needed?
When reading Avro files with a mix of Date and TimestampNTZ for a given column, the reader should be able to read all files and promote Date to TimestampNTZ instead of throwing an error when reading files with Date. Although [SPARK-49082](https://issues.apache.org/jira/browse/SPARK-49082) was resolved by #47582, that PR did not include Date -> TimestampNTZ widening. The change in this PR is very similar to #44368, which adds support for Date -> TimestampNTZ widening for the Parquet reader.

### Does this PR introduce _any_ user-facing change?
Yes, users will no longer see an error when attempting to read a file containing Date when the read schema contains TimestampNTZ. The time will be set to 00:00, as has been done in #44368.

### How was this patch tested?
New test in `AvroSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50315 from aldenlau-db/SPARK-49082.
Authored-by: Alden Lau <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
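The set-the-time-to-00:00 behavior described above can be sketched with plain `java.time` (the helper name below is illustrative, not Spark's actual internal code): an Avro date is a count of days since the epoch, and promoting it to `timestamp_ntz` pins the time-of-day to midnight.

```scala
import java.time.{LocalDate, LocalDateTime}

// Hypothetical helper (not Spark internals): promote an Avro date,
// stored as days since 1970-01-01, to a timestamp_ntz at 00:00.
def dateToTimestampNtz(epochDays: Int): LocalDateTime =
  LocalDate.ofEpochDay(epochDays.toLong).atStartOfDay()

println(dateToTimestampNtz(19000))  // 2022-01-08T00:00
```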



### What changes were proposed in this pull request?
This PR aims to widen type promotions in `AvroDeserializer`. The supported promotions are as follows (Avro Type -> Spark Type):

### Why are the changes needed?
Similar to PR #44368 for the Parquet reader, we'd better enable type promotion/widening for the Avro deserializer.

### Does this PR introduce any user-facing change?
Yes, but only in that reads become more convenient for users.

### How was this patch tested?
Pass GA and add a new test case.

### Was this patch authored or co-authored using generative AI tooling?
No.
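To make the idea concrete, here is a rough, hypothetical sketch of what a widening promotion looks like at the value level. The real `AvroDeserializer` installs per-field converter functions; all names below are illustrative only, not the actual Spark code.

```scala
// Illustrative only. These standard numeric widenings are lossless
// (Int -> Long, Int -> Double, Float -> Double), so a reader can promote
// a narrow Avro value to a wider Spark read type instead of failing.
sealed trait Target
case object ToLong   extends Target
case object ToDouble extends Target

def widen(v: Any, t: Target): Any = (v, t) match {
  case (i: Int,   ToLong)   => i.toLong
  case (i: Int,   ToDouble) => i.toDouble
  case (f: Float, ToDouble) => f.toDouble
  case (other, _)           => other  // no promotion needed/known: pass through
}
```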