[SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData #13847
Conversation
|
LGTM |
|
@maropu I think you also need to add these methods to Where in the spark code base do we compare two (unsafe) |
|
You could also just use the approach taken in |
|
It seems |
|
Aha, yes. It'd be better to take the same approach in |
|
Yeah you are right about I would take the same approach as |
|
okay, I'm fixing now. |
|
okay, done. |
|
Test build #61041 has started for PR 13847 at commit |
|
Test build #61038 has finished for PR 13847 at commit
|
|
Do we need to hash all values? This could be a performance issue if Story: MLlib had some performance issues caused by |
|
Does the current implementation of |
|
The performance of |
|
At the very least, we should leave a comment about that. |
| // This `hashCode` computation could consume much processor time for large data. | ||
| // If the computation becomes a bottleneck, we can use a light-weight logic; the first fixed bytes | ||
| // are used to compute `hashCode` (See `Vector.hashCode`). | ||
| // The same issue exists in `UnsafeMapData.hashCode`. |
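To make the trade-off in the code comment above concrete, here is a minimal sketch in plain Scala, not Spark code. The names `PrefixHashSketch`, `MaxHashedBytes`, `fullHash`, and `prefixHash` are hypothetical, and the 128-byte cap is an arbitrary assumption for illustration. It contrasts hashing every byte with hashing only a bounded prefix.

```scala
import java.util.Arrays

object PrefixHashSketch {
  // Hypothetical cap on how many leading bytes feed the hash; not a Spark constant.
  val MaxHashedBytes = 128

  // Full hash: reads every byte, so the cost grows with the size of the data.
  def fullHash(bytes: Array[Byte]): Int = Arrays.hashCode(bytes)

  // Prefix hash: reads at most MaxHashedBytes bytes, so the cost is bounded.
  // Inputs that share the same prefix collide. That is still a valid hashCode,
  // but hash-based lookups degrade for such inputs.
  def prefixHash(bytes: Array[Byte]): Int = {
    val n = math.min(bytes.length, MaxHashedBytes)
    var h = 1
    var i = 0
    while (i < n) {
      h = 31 * h + bytes(i)
      i += 1
    }
    h
  }
}
```

The bounded version trades collision resistance for predictable cost: two large values that differ only after the cap hash to the same bucket and must be told apart by a full equality check.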
The same issue also exists for UnsafeRow...
okay, I'll add now.
|
Test build #61052 has finished for PR 13847 at commit
|
|
Test build #61053 has finished for PR 13847 at commit
|
|
Test build #3126 has finished for PR 13847 at commit
|
|
I think we don't need to implement We should remove the |
|
Thx, good direction. The current master doesn't throw any exception in an analyzer when map-typed data are passed into |
|
yea we should improve the type check of |
|
Test build #61093 has finished for PR 13847 at commit
|
|
I'm now checking failed tests... |
|
Test build #61168 has finished for PR 13847 at commit
|
| new GenericArrayData(Seq.fill(length)(true)))) | ||
|
|
||
| if (!checkResult(actual, expected)) { | ||
| if (!actual.zip(expected).forall { case (data, answer) => checkResult(data, answer)}) { |
Can we convert the map data in `actual` to a Scala map and compare it with `expected`? (Also use a Scala map in `expected`.)
okay, I'll try to replace it.
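The suggestion above can be sketched roughly as follows. This is a plain-Scala illustration under stated assumptions, not the test code from this PR: `ArrayBackedMap` is a hypothetical stand-in for an array-backed map without `equals`/`hashCode`, and `MapCompareSketch.sameContent` is a made-up helper name.

```scala
// Hypothetical stand-in for an array-backed map: parallel key and value arrays,
// with no equals/hashCode override, so == falls back to reference equality.
final class ArrayBackedMap(val keys: Array[Any], val values: Array[Any]) {
  require(keys.length == values.length, "keys and values must have the same length")

  // Pair each key with its value and build an immutable Scala Map.
  def toScalaMap: Map[Any, Any] = keys.zip(values).toMap
}

object MapCompareSketch {
  // Compare by content: Scala's Map equality is structural and does not
  // depend on the order in which entries were stored.
  def sameContent(actual: ArrayBackedMap, expected: Map[Any, Any]): Boolean =
    actual.toScalaMap == expected
}
```

Note that this only compares flat keys and values structurally. Nested arrays inside the map would still compare by reference and would need their own conversion.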
|
looks pretty good, thanks for working on it! |
|
Test build #61173 has finished for PR 13847 at commit
|
|
Test build #61174 has finished for PR 13847 at commit
|
|
Test build #61175 has finished for PR 13847 at commit
|
|
Test build #61178 has finished for PR 13847 at commit
|
| } | ||
| } | ||
|
|
||
| // `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join |
Shall we put this in the class header?
okay
|
LGTM - @cloud-fan? |
|
Test build #61208 has finished for PR 13847 at commit
|
| import org.apache.spark.sql.types.DataType | ||
|
|
||
| /** | ||
| * `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join |
It's not your fault, but I think we need to add a comment for `MapData` itself, with this comment following it as a note. We can simply say: An internal data representation for map type in Spark SQL.
okay, how about this?
|
Test build #61237 has finished for PR 13847 at commit
|
|
retest this please |
|
LGTM, retest this as the last test pass was 2 days ago. |
|
Test build #61298 has finished for PR 13847 at commit
|
## What changes were proposed in this pull request?
This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.

## How was this patch tested?
Add a new test suite `MapDataSuite` for comparison tests.

Author: Takeshi YAMAMURO <[email protected]>

Closes #13847 from maropu/UnsafeMapTest.

(cherry picked from commit 3e4e868)
Signed-off-by: Wenchen Fan <[email protected]>
|
thanks, merging to master/2.0! |
…ion in NormalizePlan

### What changes were proposed in this pull request?
Substitute `LocalRelation` with `ComparableLocalRelation` in `NormalizePlan`. `ComparableLocalRelation` has `Seq[Seq[Expression]]` instead of `Seq[InternalRow]`. The conversion happens through `Literal`s.

### Why are the changes needed?
`LocalRelation`'s data field is incomparable if it contains maps, because `ArrayBasedMapData` doesn't define `equals`: #13847

### Does this PR introduce _any_ user-facing change?
No. This is to compare logical plans in the single-pass Analyzer.

### How was this patch tested?
N/A

### Was this patch authored or co-authored using generative AI tooling?
copilot.nvim.

Closes #49287 from vladimirg-db/vladimirg-db/normalize-local-relation.

Authored-by: Vladimir Golubev <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…izer Rule

### What changes were proposed in this pull request?
In the current version of Spark, it's possible to use `MapType` as a column for repartitioning. But `MapData` does not implement `equals` and `hashCode` (according to [SPARK-9415](https://issues.apache.org/jira/browse/SPARK-9415) and [[SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData](#13847)). As a result, the hash values of identical maps can differ.

An attempt to run the `xxhash64` or `hash` function on `MapType` throws:

```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.HASH_MAP_TYPE] Cannot resolve "xxhash64(value)" due to data type mismatch: Input to the function `xxhash64` cannot contain elements of the "MAP" type. In Spark, same maps may have different hashcode, thus hash expressions are prohibited on "MAP" elements. To restore previous behavior set "spark.sql.legacy.allowHashOnMapType" to "true".;
```

Also, when trying to run `ds.distinct(col("value"))`, where `value` has `MapType`, the following exception is thrown:

```
org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_FEATURE.SET_OPERATION_ON_MAP_TYPE] The feature is not supported: Cannot have MAP type columns in DataFrame which calls set operations (INTERSECT, EXCEPT, etc.), but the type of column `value` is "MAP<INT, STRING>".;
```

With the above in mind, a new `InsertMapSortInRepartitionExpressions` `Rule[LogicalPlan]` was implemented to insert `mapsort` for every `MapType` in `RepartitionByExpression.partitionExpressions`.

### Why are the changes needed?
To keep the `repartition` API consistent for `MapType`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49144 from ostronaut/features/map_repartition.

Authored-by: Dima <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
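The reasoning behind sorting map entries before hashing can be illustrated with a small plain-Scala sketch. It is an assumption-laden toy, not Spark's hash implementation or the actual rule: `MapSortHashSketch`, `entryHash`, `orderedHash`, and `sortedHash` are hypothetical names, and the 31-based fold is chosen only because it is easy to follow.

```scala
object MapSortHashSketch {
  // Entries kept in insertion order, as an array-backed map would store them.
  type Entries = Seq[(Int, String)]

  // Toy per-entry hash; an assumption for illustration only.
  def entryHash(e: (Int, String)): Int = 31 * e._1 + e._2.hashCode

  // Order-dependent hash: folding entries left to right means the same
  // logical map can hash differently if its entries are stored in a
  // different order.
  def orderedHash(entries: Entries): Int =
    entries.foldLeft(1)((h, e) => 31 * h + entryHash(e))

  // Sorting by key first gives equal maps the same entry sequence,
  // and therefore the same hash.
  def sortedHash(entries: Entries): Int =
    orderedHash(entries.sortBy(_._1))
}
```

For example, `Seq(1 -> "a", 2 -> "b")` and `Seq(2 -> "b", 1 -> "a")` describe the same map; `orderedHash` separates them, while `sortedHash` sends both to the same value, which is what a hash-based repartitioning needs.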
## What changes were proposed in this pull request?
This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.

## How was this patch tested?
Add a new test suite `MapDataSuite` for comparison tests.