Skip to content

Conversation

@mihailotim-db
Copy link
Contributor

What changes were proposed in this pull request?

Reverting to the old way of handling DeduplicateRelations in order to improve performance. Instead of checking attribute IDs linearly, we use HashSet:contains()

Why are the changes needed?

Improving perfomance

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Sep 10, 2024
@mihailotim-db mihailotim-db changed the title Improve DeduplicateRelations performance [SPARK-43838][SQL][FOLLOWUP] Improve DeduplicateRelations performance Sep 10, 2024
@cloud-fan
Copy link
Contributor

We need to re-generate the golden files

[info] *** 4 TESTS FAILED ***
[error] Failed: Total 3559, Failed 4, Errors 0, Passed 3555, Ignored 4
[error] Failed tests:
[error] 	org.apache.spark.sql.TPCDSV2_7_PlanStabilityWithStatsSuite
[error] 	org.apache.spark.sql.TPCDSV2_7_PlanStabilitySuite

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d72e8f9 Sep 11, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?
Reverting to the old way of handling DeduplicateRelations in order to improve performance.  Instead of checking attribute IDs linearly, we use `HashSet:contains()`

### Why are the changes needed?
Improving perfomance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#48053 from mihailotim-db/deduplicate_relations_perf_improvement.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Oct 9, 2024
…ove performance of `DeduplicateRelations`

### What changes were proposed in this pull request?
This PR replaces `HashSet` that is currently used with a `HashMap` to improve `DeduplicateRelations` performance.
Additionally, this PR reverts #48053 as that change is no longer needed

### Why are the changes needed?
Current implementation doesn't utilize `HashSet` properly, but instead performs multiple linear searches on the set creating a O(n^2) complexity

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?

Closes #48392 from mihailotim-db/mihailotim-db/master.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
### What changes were proposed in this pull request?
Reverting to the old way of handling DeduplicateRelations in order to improve performance.  Instead of checking attribute IDs linearly, we use `HashSet:contains()`

### Why are the changes needed?
Improving perfomance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#48053 from mihailotim-db/deduplicate_relations_perf_improvement.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…ove performance of `DeduplicateRelations`

### What changes were proposed in this pull request?
This PR replaces `HashSet` that is currently used with a `HashMap` to improve `DeduplicateRelations` performance.
Additionally, this PR reverts apache#48053 as that change is no longer needed

### Why are the changes needed?
Current implementation doesn't utilize `HashSet` properly, but instead performs multiple linear searches on the set creating a O(n^2) complexity

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#48392 from mihailotim-db/mihailotim-db/master.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants