Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Jun 22, 2016

What changes were proposed in this pull request?

This pr is to remove hashCode and equals in ArrayBasedMapData because the type cannot be used as join keys, grouping keys, or in equality tests.

How was this patch tested?

Add a new test suite MapDataSuite for comparison tests.

@srowen
Copy link
Member

srowen commented Jun 22, 2016

LGTM

@hvanhovell
Copy link
Contributor

@maropu I think you also need to add these methods to UnsafeArray for this to work.

Where in the spark code base do we compare two (unsafe) MapData objects? Or are you comparing these in your own code?

@hvanhovell
Copy link
Contributor

You could also just also use the approach taken in UnsafeRow.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

It seems UnsafeArrayData already has its own equals and hashCode`.
Currently, spark doesn't compare unsafe MapData though, I think this might cause implicit bugs in future codes.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

aha, yes. It'd better to take the same approach in UnsafeRow?

@hvanhovell
Copy link
Contributor

Yeah you are right about UnsafeArrayData (my bad).

I would take the same approach as UnsafeRow.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

okay, I'm fixing now.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

okay, done.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61041 has started for PR 13847 at commit 30d28bc.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61038 has finished for PR 13847 at commit 50f8be3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Jun 22, 2016

Do we need to hash all values? This could be a performance issue if hashCode is called frequently on very large arrays.

Story: MLlib had some performance issues caused by Vector.hashCode, which is called during Pyrolite serialization. It saves the Vector as the key in a hash map to avoid re-serialization of the same object. But the hashCode costs almost the same as re-serialization.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

Does the current implementation of Vector.hashCode have enough performance? If so, it's okay to follow the impl. to me.

@hvanhovell
Copy link
Contributor

hvanhovell commented Jun 22, 2016

The performance of hashCode() should be pretty good in this case, and this implementation is in line with the ones used in all other Unsafe* objects (MurMurHash). I'd rather be consistent. If this turns out to be a problem, we could always use the first n bytes (similar to the Vector.hashCode) for hashCode(), if it turns out to be a problem.

@maropu
Copy link
Member Author

maropu commented Jun 22, 2016

At least, we'd be better to leave comments for that.

// This `hashCode` computation could consume much processor time for large data.
// If the computation becomes a bottleneck, we can use a light-weight logic; the first fixed bytes
// are used to compute `hashCode` (See `Vector.hashCode`).
// The same issue exists in `UnsafeMapData.hashCode`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same issue also exists for UnsafeRow...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I'll add now.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61052 has finished for PR 13847 at commit e8eaf70.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #61053 has finished for PR 13847 at commit b6bac43.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 22, 2016

Test build #3126 has finished for PR 13847 at commit b6bac43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

I think we don't need to implement equals and hashCode for map type, as map type doesn't support equality and ordering by design, see https://issues.apache.org/jira/browse/SPARK-9415

We should remove the equals and hashCode on ArrayBasedMapData and fix tests if needed

@maropu
Copy link
Member Author

maropu commented Jun 23, 2016

Thx, good direction. The current master doesn't throw any exception in an analyzer when map-typed data are passed into collect_set/collect_list. Probably, should we check the case in there?
https://github.com/apache/spark/pull/13802/files#diff-4b06b5fe0cedf425de14469d1356d6ecR474

@cloud-fan
Copy link
Contributor

yea we should improve the type check of CollectSet

@SparkQA
Copy link

SparkQA commented Jun 23, 2016

Test build #61093 has finished for PR 13847 at commit 827785d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jun 23, 2016

I'm now checking failed tests...

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61168 has finished for PR 13847 at commit 431a3fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

new GenericArrayData(Seq.fill(length)(true))))

if (!checkResult(actual, expected)) {
if (!actual.zip(expected).forall { case (data, answer) => checkResult(data, answer)}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we turn map data in actual to scala map and compare it with expected? (also use scala map in expected)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I'll try to replace it.

@cloud-fan
Copy link
Contributor

looks pretty good, thanks for working on it!

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61173 has finished for PR 13847 at commit 78b57c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61174 has finished for PR 13847 at commit 2feb984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61175 has finished for PR 13847 at commit 8ada73c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 24, 2016

Test build #61178 has finished for PR 13847 at commit d95824b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

// `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put this in the class header?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

@hvanhovell
Copy link
Contributor

LGTM - @cloud-fan?

@SparkQA
Copy link

SparkQA commented Jun 25, 2016

Test build #61208 has finished for PR 13847 at commit e4b1384.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.types.DataType

/**
* `MapData` should not implement `equals` and `hashCode` because the type cannot be used as join
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not your fault, but I think we need to add some comment for MapData itself, and then follows this comment as a note. We can simply say: An internal data representation for map type in Spark SQL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, how about this?

@SparkQA
Copy link

SparkQA commented Jun 25, 2016

Test build #61237 has finished for PR 13847 at commit 902fe5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan
Copy link
Contributor

LGTM, retest this as the last test pass is 2 days ago.

@SparkQA
Copy link

SparkQA commented Jun 27, 2016

Test build #61298 has finished for PR 13847 at commit 902fe5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 27, 2016
## What changes were proposed in this pull request?
This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests.

## How was this patch tested?
Add a new test suite `MapDataSuite` for comparison tests.

Author: Takeshi YAMAMURO <[email protected]>

Closes #13847 from maropu/UnsafeMapTest.

(cherry picked from commit 3e4e868)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 3e4e868 Jun 27, 2016
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.0!

@maropu maropu deleted the UnsafeMapTest branch July 5, 2017 11:47
cloud-fan pushed a commit that referenced this pull request Dec 26, 2024
…ion in NormalizePlan

### What changes were proposed in this pull request?

Substitute `LocalRelation` with `ComparableLocalRelation` in `NormalizePlan`.

`ComparableLocalRelation` has `Seq[Seq[Expression]]` instead of `Seq[InternalRow]`. The conversion happens through `Literal`s.

### Why are the changes needed?

`LocalRelation`'s data field is incomparable if it contains maps, because `ArrayBasedMapData` doesn't define `equals`: #13847

### Does this PR introduce _any_ user-facing change?

No. This is to compare logical plans in the single-pass Analyzer.

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

copilot.nvim.

Closes #49287 from vladimirg-db/vladimirg-db/normalize-local-relation.

Authored-by: Vladimir Golubev <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Jan 10, 2025
…izer Rule

### What changes were proposed in this pull request?
In the current version of Spark, its possible to use `MapType` as column for repartitioning. But `MapData` does not implement `equals` and `hashCode` (in according to [SPARK-9415](https://issues.apache.org/jira/browse/SPARK-9415) and [[SPARK-16135][SQL] Remove hashCode and equals in ArrayBasedMapData](#13847)). Considering that, hash value for same Maps can be different.

In an attempt to run `xxhash64` or `hash` function on `MapType`, ```org.apache.spark.sql.catalyst.ExtendedAnalysisException: [DATATYPE_MISMATCH.HASH_MAP_TYPE] Cannot resolve "xxhash64(value)" due to data type mismatch: Input to the function `xxhash64` cannot contain elements of the "MAP" type. In Spark, same maps may have different hashcode, thus hash expressions are prohibited on "MAP" elements. To restore previous behavior set "spark.sql.legacy.allowHashOnMapType" to "true".;``` will be thrown.

Also, when trying to run `ds.distinct(col("value"))`, where `value` has `MapType`, the following exception is thrown: ```org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_FEATURE.SET_OPERATION_ON_MAP_TYPE] The feature is not supported: Cannot have MAP type columns in DataFrame which calls set operations (INTERSECT, EXCEPT, etc.), but the type of column `value` is "MAP<INT, STRING>".;```

With the above consideration, a new `InsertMapSortInRepartitionExpressions` `Rule[LogicalPlan]` was implemented to insert `mapsort` for every `MapType` in `RepartitionByExpression.partitionExpressions`.

### Why are the changes needed?

To keep `repartition` API for MapType consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49144 from ostronaut/features/map_repartition.

Authored-by: Dima <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants