[SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData #13847

maropu · 2016-06-22T14:51:23Z

What changes were proposed in this pull request?

This PR removes hashCode and equals from ArrayBasedMapData because the map type cannot be used as a join key, as a grouping key, or in equality tests.

How was this patch tested?

Add a new test suite MapDataSuite for comparison tests.

srowen · 2016-06-22T14:54:31Z

LGTM

hvanhovell · 2016-06-22T14:58:27Z

@maropu I think you also need to add these methods to UnsafeArray for this to work.

Where in the spark code base do we compare two (unsafe) MapData objects? Or are you comparing these in your own code?

hvanhovell · 2016-06-22T15:05:16Z

You could also just use the approach taken in UnsafeRow.

maropu · 2016-06-22T15:06:08Z

It seems UnsafeArrayData already has its own equals and hashCode.
Spark doesn't currently compare unsafe MapData objects, but I think the missing methods could cause subtle bugs in future code.

maropu · 2016-06-22T15:08:00Z

Aha, yes. Would it be better to take the same approach as UnsafeRow?

hvanhovell · 2016-06-22T15:14:37Z

Yeah you are right about UnsafeArrayData (my bad).

I would take the same approach as UnsafeRow.

maropu · 2016-06-22T15:15:38Z

okay, I'm fixing now.

maropu · 2016-06-22T15:34:38Z

okay, done.

SparkQA · 2016-06-22T15:56:30Z

Test build #61041 has started for PR 13847 at commit 30d28bc.

SparkQA · 2016-06-22T16:51:18Z

Test build #61038 has finished for PR 13847 at commit 50f8be3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-06-22T17:03:33Z

Do we need to hash all values? This could be a performance issue if hashCode is called frequently on very large arrays.

Story: MLlib had some performance issues caused by Vector.hashCode, which is called during Pyrolite serialization. It saves the Vector as the key in a hash map to avoid re-serialization of the same object. But the hashCode costs almost the same as re-serialization.

maropu · 2016-06-22T17:09:21Z

Does the current implementation of Vector.hashCode perform well enough? If so, I'm fine with following that implementation.

hvanhovell · 2016-06-22T17:14:12Z

The performance of hashCode() should be pretty good in this case, and this implementation is in line with the ones used in all other Unsafe* objects (Murmur hash). I'd rather be consistent. If this turns out to be a problem, we could always use only the first n bytes for hashCode() (similar to Vector.hashCode).
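To make the trade-off discussed above concrete, here is a minimal sketch of a "hash only the first n bytes" strategy next to a full hash. The class name `BoundedHash`, the cap `MAX_HASHED_BYTES`, and the mixing formula are assumptions made for illustration; they are not taken from Spark's `Vector.hashCode` or from any `Unsafe*` class.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Spark code.
final class BoundedHash {
    // Assumed cap on how many leading bytes contribute to the hash.
    static final int MAX_HASHED_BYTES = 64;

    // Full hash: touches every byte, so the cost grows with the array length.
    static int fullHash(byte[] data) {
        return Arrays.hashCode(data);
    }

    // Bounded hash: mixes the length with at most MAX_HASHED_BYTES leading bytes.
    // Equal arrays still produce equal hashes, so the equals/hashCode contract
    // holds, but arrays that differ only past the prefix will collide.
    static int prefixHash(byte[] data) {
        int n = Math.min(data.length, MAX_HASHED_BYTES);
        int h = data.length;
        for (int i = 0; i < n; i++) {
            h = 31 * h + data[i];
        }
        return h;
    }
}
```

The bounded variant trades a constant-time hash for more collisions among large arrays that share a prefix, which is why the thread treats it as a fallback rather than the default.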

maropu · 2016-06-22T17:17:50Z

At the very least, we should leave a comment about that.

hvanhovell · 2016-06-22T17:44:49Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java

+  // This `hashCode` computation could consume much processor time for large data.
+  // If the computation becomes a bottleneck, we can use a light-weight logic; the first fixed bytes
+  // are used to compute `hashCode` (See `Vector.hashCode`).
+  // The same issue exists in `UnsafeMapData.hashCode`.


The same issue also exists for UnsafeRow...

okay, I'll add now.

SparkQA · 2016-06-22T17:52:02Z

Test build #61052 has finished for PR 13847 at commit e8eaf70.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-22T18:05:36Z

Test build #61053 has finished for PR 13847 at commit b6bac43.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-22T21:58:00Z

Test build #3126 has finished for PR 13847 at commit b6bac43.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-23T01:03:47Z

I think we don't need to implement equals and hashCode for map type, as map type doesn't support equality and ordering by design, see https://issues.apache.org/jira/browse/SPARK-9415

We should remove the equals and hashCode on ArrayBasedMapData and fix tests if needed
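One way to see why a naive `equals` on an array-backed map can mislead is the sketch below: two maps holding the same entries in a different order compare as unequal when their key and value arrays are compared position by position. The class `ArrayBackedMap` and its method are hypothetical stand-ins, not Spark's `ArrayBasedMapData`, and this is an illustration rather than the rationale stated in SPARK-9415.

```java
import java.util.Arrays;

// Hypothetical stand-in for a map stored as two parallel arrays; not Spark code.
final class ArrayBackedMap {
    final int[] keys;
    final String[] values;

    ArrayBackedMap(int[] keys, String[] values) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        this.keys = keys;
        this.values = values;
    }

    // Positional comparison: the result depends on entry order,
    // which a logical map comparison should ignore.
    static boolean positionalEquals(ArrayBackedMap a, ArrayBackedMap b) {
        return Arrays.equals(a.keys, b.keys) && Arrays.equals(a.values, b.values);
    }
}
```

With `{1 -> "a", 2 -> "b"}` stored in one order and `{2 -> "b", 1 -> "a"}` in the other, `positionalEquals` returns false even though both describe the same mapping.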

maropu · 2016-06-23T02:25:16Z

Thanks, that's a good direction. On the current master, the analyzer doesn't throw any exception when map-typed data is passed into collect_set/collect_list. Should we add a check for that case there?
https://github.com/apache/spark/pull/13802/files#diff-4b06b5fe0cedf425de14469d1356d6ecR474

cloud-fan · 2016-06-23T03:04:33Z

yea we should improve the type check of CollectSet

SparkQA · 2016-06-23T05:15:51Z

Test build #61093 has finished for PR 13847 at commit 827785d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2016-06-23T08:37:17Z

I'm now checking failed tests...

SparkQA · 2016-06-24T10:48:58Z

Test build #61168 has finished for PR 13847 at commit 431a3fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-24T11:24:26Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala

      new GenericArrayData(Seq.fill(length)(true))))

-    if (!checkResult(actual, expected)) {
+    if (!actual.zip(expected).forall { case (data, answer) => checkResult(data, answer)}) {


Can we convert the map data in actual to a Scala map and compare it with expected? (Also use a Scala map in expected.)

okay, I'll try to replace it.
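The suggestion above, comparing through an ordinary map instead of the internal representation, can be sketched as follows. The helper `MapCompare.toMap` and its parallel-array input are assumptions for illustration; the actual test in this PR works with Spark's own types and Scala maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical test helper; not part of Spark.
final class MapCompare {
    // Copies parallel key/value arrays into a java.util.Map so that
    // equality no longer depends on the order in which entries are stored.
    static Map<Integer, String> toMap(int[] keys, String[] values) {
        if (keys.length != values.length) {
            throw new IllegalArgumentException("keys and values must have the same length");
        }
        Map<Integer, String> result = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            result.put(keys[i], values[i]);
        }
        return result;
    }
}
```

A test can then assert `toMap(actualKeys, actualValues).equals(expectedMap)` without requiring the internal map type to define `equals` itself.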

cloud-fan · 2016-06-24T11:25:03Z

looks pretty good, thanks for working on it!

SparkQA · 2016-06-24T11:28:59Z

Test build #61173 has finished for PR 13847 at commit 78b57c6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-24T12:47:15Z

Test build #61174 has finished for PR 13847 at commit 2feb984.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-24T13:46:53Z

Test build #61175 has finished for PR 13847 at commit 8ada73c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-24T16:18:18Z

Test build #61178 has finished for PR 13847 at commit d95824b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-06-24T17:18:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MapData.scala

    }
  }
+
+  // `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join


Shall we put this in the class header?

hvanhovell · 2016-06-24T17:30:08Z

LGTM - @cloud-fan?

SparkQA · 2016-06-25T00:40:15Z

Test build #61208 has finished for PR 13847 at commit e4b1384.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-25T11:15:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/MapData.scala

 import org.apache.spark.sql.types.DataType

+/**
+ * `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join


It's not your fault, but I think we need to add a comment describing MapData itself, with this comment following it as a note. We can simply say: An internal data representation for map type in Spark SQL.

okay, how about this?

SparkQA · 2016-06-25T17:31:16Z

Test build #61237 has finished for PR 13847 at commit 902fe5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-06-27T09:23:42Z

retest this please

cloud-fan · 2016-06-27T09:24:20Z

LGTM. Retesting because the last passing test run was 2 days ago.

SparkQA · 2016-06-27T12:56:17Z

Test build #61298 has finished for PR 13847 at commit 902fe5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

## What changes were proposed in this pull request?

This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.

## How was this patch tested?

Add a new test suite `MapDataSuite` for comparison tests.

Author: Takeshi YAMAMURO <[email protected]>

Closes #13847 from maropu/UnsafeMapTest.

(cherry picked from commit 3e4e868)
Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2016-06-27T13:49:38Z

thanks, merging to master/2.0!

…ion in NormalizePlan

### What changes were proposed in this pull request?

Substitute `LocalRelation` with `ComparableLocalRelation` in `NormalizePlan`. `ComparableLocalRelation` has `Seq[Seq[Expression]]` instead of `Seq[InternalRow]`. The conversion happens through `Literal`s.

### Why are the changes needed?

`LocalRelation`'s data field is incomparable if it contains maps, because `ArrayBasedMapData` doesn't define `equals`: #13847

### Does this PR introduce _any_ user-facing change?

No. This is to compare logical plans in the single-pass Analyzer.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

copilot.nvim.

Closes #49287 from vladimirg-db/vladimirg-db/normalize-local-relation.

Authored-by: Vladimir Golubev <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…izer Rule

### What changes were proposed in this pull request?

In the current version of Spark, it is possible to use `MapType` as a column for repartitioning. But `MapData` does not implement `equals` and `hashCode` (according to [SPARK-9415](https://issues.apache.org/jira/browse/SPARK-9415) and [[SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData](#13847)). As a result, the hash values of equal maps can differ.

An attempt to run the `xxhash64` or `hash` function on a `MapType` throws ```org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.HASH_MAP_TYPE] Cannot resolve "xxhash64(value)" due to data type mismatch: Input to the function `xxhash64` cannot contain elements of the "MAP" type. In Spark, same maps may have different hashcode, thus hash expressions are prohibited on "MAP" elements. To restore previous behavior set "spark.sql.legacy.allowHashOnMapType" to "true".;```

Also, when trying to run `ds.distinct(col("value"))`, where `value` has `MapType`, the following exception is thrown: ```org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_FEATURE.SET_OPERATION_ON_MAP_TYPE] The feature is not supported: Cannot have MAP type columns in DataFrame which calls set operations (INTERSECT, EXCEPT, etc.), but the type of column `value` is "MAP<INT, STRING>".;```

With the above in mind, a new `InsertMapSortInRepartitionExpressions` `Rule[LogicalPlan]` was implemented to insert `mapsort` for every `MapType` in `RepartitionByExpression.partitionExpressions`.

### Why are the changes needed?

To keep the `repartition` API for MapType consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49144 from ostronaut/features/map_repartition.

Authored-by: Dima <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

Add tests for UnsafeMapData

50f8be3

maropu mentioned this pull request Jun 22, 2016

[SPARK-16094][SQL] Support HashAggregateExec for non-partial aggregates #13802

Closed

Apply comments

30d28bc

Add comments

e8eaf70

hvanhovell reviewed Jun 22, 2016

Fix comments

b6bac43

Apply comments

827785d

cloud-fan reviewed Jun 24, 2016

Apply comments

d95824b

maropu force-pushed the UnsafeMapTest branch from 8ada73c to d95824b Compare June 24, 2016 14:34

hvanhovell reviewed Jun 24, 2016

Move comments

e4b1384

cloud-fan reviewed Jun 25, 2016

Update comments

902fe5f

asfgit closed this in 3e4e868 Jun 27, 2016

maropu deleted the UnsafeMapTest branch July 5, 2017 11:47

maropu mentioned this pull request May 17, 2021

[SPARK-34819][SQL] MapType supports comparable semantics #32552

Closed

This was referenced Sep 6, 2022

[SPARK-40315][SQL] Add equals() and hashCode() to ArrayBasedMapData #37771

Closed

[SPARK-40315][SQL] Add hashCode() for Literal of ArrayBasedMapData #37807

Closed

ostronaut mentioned this pull request Dec 11, 2024

[SPARK-50525][SQL] Define InsertMapSortInRepartitionExpressions Optimizer Rule #49144

Closed

vladimirg-db mentioned this pull request Dec 24, 2024

[SPARK-50665][SQL] Substitute LocalRelation with ComparableLocalRelation in NormalizePlan #49287

Closed
