
Conversation

@WangGuangxin
Contributor

What changes were proposed in this pull request?

When we shuffle on indeterminate expressions such as rand() and a shuffle fetch failure happens, we may get an incorrect result, since Spark retries only the failed map tasks.

To illustrate this, suppose we have a dataset with two columns (`range(1, 5)` as `a`, `rand()` as `b`) and we shuffle by `b` using two map tasks and two reduce tasks.
[figure: two map tasks hash rows by column b into two reduce partitions]

When a fetch failure forces map task 2 to rerun, the partitions it generates may differ from the previous attempt, and we end up with a duplicate record with a = 4.
[figure: after the rerun of map task 2, the row with a = 4 appears in both reduce partitions]
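
To make the failure mode concrete, here is a toy sketch in plain Scala (no Spark; all names are illustrative). The reduce partition for a row is essentially hash(rand()) % numReducers, so a re-executed map task re-draws its random keys and may route the same rows differently:

```scala
import scala.util.Random

object RandShuffleToy {
  val numReducers = 2

  // One "map task attempt": key each row with a fresh random draw, then
  // route it to a reducer by hashing that key.
  def route(rows: Seq[Int], attemptSeed: Long): Map[Int, Seq[Int]] = {
    val rng = new Random(attemptSeed)
    rows.groupBy(_ => Math.floorMod(rng.nextDouble().##, numReducers))
  }

  def main(args: Array[String]): Unit = {
    val mapTask2Rows = Seq(3, 4)
    println(route(mapTask2Rows, attemptSeed = 1L)) // original attempt
    println(route(mapTask2Rows, attemptSeed = 2L)) // rerun after fetch failure
    // If one reducer already fetched the original attempt's output while the
    // other reads the rerun's output, a row (e.g. a = 4) can be consumed
    // twice, or not at all.
  }
}
```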

This is very similar to the Repartition+Shuffle bug fixed by #22112.
This PR tries to fix it by reusing that mechanism.

Why are the changes needed?

It fixes a data inconsistency issue.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@WangGuangxin
Contributor Author

@cloud-fan @maropu Can you please help review this?

@AmplabJenkins

Can one of the admins verify this patch?

@WangGuangxin
Contributor Author

Also cc @srowen @viirya @yaooqinn. I found a similar report from before: https://issues.apache.org/jira/browse/SPARK-24607

@srowen
Member

srowen commented Feb 15, 2022

Isn't this why you shouldn't partition, shuffle, etc. on some random value? Use a hash?

@WangGuangxin
Contributor Author

> Isn't this why you shouldn't partition, shuffle, etc. on some random value? Use a hash?

Data analysts have all kinds of needs, such as `distribute by rand()` to redistribute data evenly,
or `select * from (select concat(key1, rand()) as key1 from tbl1) a join (select key2 from tbl2) b on a.key1 = b.key2` to work around skewed data; both are valid SQL in Spark.

Both of these queries generate a HashPartitioning with non-deterministic expressions.

If we don't want to support shuffling by a random value, we should explicitly disable it.
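
For concreteness, a runnable sketch of the two patterns (the tables `tbl1`/`tbl2` and columns `key1`/`key2` are taken from the comment and assumed to exist):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()

// Pattern 1: redistribute rows evenly by a random key.
spark.sql("SELECT * FROM tbl1 DISTRIBUTE BY rand()")

// Pattern 2: salt a skewed join key with rand().
spark.sql(
  """SELECT *
    |FROM (SELECT concat(key1, rand()) AS key1 FROM tbl1) a
    |JOIN (SELECT key2 FROM tbl2) b
    |ON a.key1 = b.key2""".stripMargin)

// Both plans contain a HashPartitioning whose expressions are non-deterministic.
```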

@srowen
Member

srowen commented Feb 15, 2022

Right, shouldn't we reject it? Distributing by `hash(ID)` or similar makes more sense, not least because it is reproducible and deterministic across runs and environments.

@WangGuangxin
Contributor Author

> Right, shouldn't we reject it? Distributing by `hash(ID)` or similar makes more sense, not least because it is reproducible and deterministic across runs and environments.

Rejecting it may not be a good idea.

  1. Both Hive and Presto support patterns like `distribute by rand` or join/group-by on rand, and Spark seems intent on supporting group-by on nondeterministic expressions; see [SPARK-18969][SQL] Support grouping by nondeterministic expressions #16404. Also, some UDFs and java_method calls are marked as indeterminate, so rejecting this would mean users cannot join or group by a column generated by a UDF or java_method (see the sketch after this list).
  2. The root cause of the data inconsistency when shuffling by a rand expression is that Spark retries only the failed map tasks on a shuffle fetch failure. If we retry the whole stage, there is no problem, and we can reuse the existing logic in DAGScheduler to achieve this.
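
A small sketch of point 1 (the UDF and column names are made up; `asNondeterministic` is the real Spark API for declaring a UDF nondeterministic):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[2]").getOrCreate()

// A salting UDF explicitly marked nondeterministic -- exactly the kind of
// column users then join or group by.
val salt = udf((k: String) => k + scala.util.Random.nextInt(10)).asNondeterministic()

spark.table("tbl1").groupBy(salt(col("key1"))).count()
```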

@mridulm
Contributor

mridulm commented Feb 16, 2022

If there is a fetch failure and the parent stage is INDETERMINATE, both the parent and child stages are recomputed.
Custom RDDs can override getOutputDeterministicLevel and return the right DeterministicLevel.
See #22112 for more details.
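
A minimal sketch of that core-side hook (the RDD itself is hypothetical; `getOutputDeterministicLevel` and `DeterministicLevel` are the API introduced by #22112):

```scala
import scala.util.Random

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.{DeterministicLevel, RDD}

// An RDD that keys each input row with a fresh random number. A re-run task
// produces different output, so we declare it INDETERMINATE; on a shuffle
// fetch failure the DAGScheduler then recomputes the whole stage (and its
// child stages) instead of retrying only the failed map tasks.
class RandomKeyRDD(prev: RDD[Long]) extends RDD[(Double, Long)](prev) {

  override def compute(split: Partition, context: TaskContext): Iterator[(Double, Long)] =
    firstParent[Long].iterator(split, context).map(v => (Random.nextDouble(), v))

  override protected def getPartitions: Array[Partition] = firstParent[Long].partitions

  override protected def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}
```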

@WangGuangxin
Contributor Author

> If there is a fetch failure and the parent stage is INDETERMINATE, both the parent and child stages are recomputed. Custom RDDs can override getOutputDeterministicLevel and return the right DeterministicLevel. See #22112 for more details.

Thanks for the reference. That's what this PR does for Spark SQL:
when we shuffle by rand with SQL like `distribute by rand`, the RDD generated by Spark SQL has DeterministicLevel INDETERMINATE after this PR.
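
One way to observe the effect (a sketch, not the PR's actual test: `outputDeterministicLevel` is `private[spark]`, so a check like this has to live under the `org.apache.spark` namespace):

```scala
package org.apache.spark.sql.sketch // inside the spark namespace for access

import org.apache.spark.rdd.DeterministicLevel
import org.apache.spark.sql.SparkSession

object CheckShuffleDeterminism {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").getOrCreate()
    val rdd = spark.sql("SELECT * FROM range(1, 5) DISTRIBUTE BY rand()")
      .queryExecution.toRdd
    // Expected to hold once this PR's change is in place.
    assert(rdd.outputDeterministicLevel == DeterministicLevel.INDETERMINATE)
  }
}
```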

@mridulm
Contributor

mridulm commented Feb 16, 2022

I was not referring to the SQL changes per se, @WangGuangxin; I will let @srowen or @cloud-fan, etc. review that.
Specifically for the changes in core, there is already a means to provide the deterministic level; we don't need isPartitionKeyIndeterminate and related changes.

```scala
/**
 * Checks if the shuffle partitioning contains an indeterminate expression/reference.
 */
private def isPartitioningIndeterminate(partitioning: Partitioning, plan: SparkPlan): Boolean = {
  // ... (body elided in the review snippet)
}
```
Contributor

I think we need to build a framework to properly propagate column-level nondeterminism information. This function looks quite fragile.

Contributor

For example, `Filter(rand_cond, Project(a, b, c, ...))`. I think all the columns are nondeterministic after the Filter, even though the attributes a, b, and c are deterministic.
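
A sketch of that example in DataFrame form (the table `tbl` and its columns are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, rand}

val spark = SparkSession.builder().master("local[2]").getOrCreate()

// Filter(rand_cond, Project(a, b, c)): the projected expressions are all
// deterministic, but which rows survive the filter differs per task attempt.
val filtered = spark.table("tbl")
  .select(col("a"), col("b"), col("c"))
  .where(rand() < 0.5)

// So shuffling by the "deterministic" column a downstream of the filter is
// still effectively indeterminate: a re-run map task may emit different rows.
filtered.repartition(col("a"))
```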

Contributor Author

You mean `QueryPlan`'s `deterministic`? See #34470.

Contributor

That's plan level, not column level. We need something more fine-grained.

Contributor Author

OK, I'll first try to add column-level nondeterminism information before this PR.
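
Purely to illustrate the direction being discussed (this is a hypothetical shape, not Spark's actual API):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute

// Hypothetical trait: each operator reports which of its output attributes are
// deterministic. A Filter with a nondeterministic condition would report an
// empty set, since the row membership of every column becomes attempt-dependent.
trait ColumnDeterminism {
  def deterministicOutput: Set[Attribute]
}
```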

@github-actions

github-actions bot commented Jun 5, 2022

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 5, 2022
@github-actions github-actions bot closed this Jun 6, 2022