[SPARK-12125][SQL] pull out nondeterministic expressions from Join #10128

zhonghaihua · 2015-12-03T10:20:59Z

Currently,nondeterministic expressions are only allowed in Project or Filter,And only when we use nondeterministic expressions in UnaryNode can be pulled out.

But,Sometime in many case,we will use nondeterministic expressions to process join keys avoiding data skew.for example:

select * 
from tableA a 
join 
(select * from tableB) b 
on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( (-rand() * 10000000 ) as string ) else a.brand_code end ))  = b.brand_code

This PR introduce a mechanism to pull out nondeterministic expressions from Join,so we can use nondeterministic expression in Join appropriately.

pull out nondeterministic expressions from Join pull out nondeterministic expressions from Join

zhonghaihua · 2015-12-03T10:35:38Z

@rxin @cloud-fan @chenghao-intel @jeanlyn Could you give some suggestions on this PR?

cloud-fan · 2015-12-03T12:52:29Z

I think it's not a good example to show that we need to allow nondeterministic expressions in join codition. We can use Repartition operator to fix data skew, like sqlContext.table(tblName).repartition(numPartitions).registerTempTable, which looks better than your random join approach. Do you find other cases that need to use nondeterministic expressions in join codition?

chenghao-intel · 2015-12-03T15:27:05Z

I can understand the motivation of this change, we do have workaround for relieving the data skew, but we probably don't want to change the existing SQL queries based on legacy system (like Hive).

jeanlyn · 2015-12-04T02:16:38Z

@cloud-fan I think your case is different from @zhonghaihua 's. The sql only deal with some join keys ('' and null) before shuffle to handle those pointless key cause skew during join operator, while repartition deal with all data before some map operator. This is two type data skew, right?

zhonghaihua · 2015-12-04T13:35:30Z

@cloud-fan Thanks for your advice. But, as @jeanlyn said,Repartition will deal with all data, and this PR will deal with join keys cause data skew.
Because in some situations, we will use this operator to avoid data skew in SQL, then I think maybe we should support this. What do you think?

cloud-fan · 2015-12-05T03:27:15Z

This makes sense, but I still wanna argue that spark SQL has RepartitionByExpression that deal with several columns...

I think the main point is to be compatiable with existing SQL queries, but I'm not sure if it worth.
cc @rxin @marmbrus

zhonghaihua · 2016-01-05T04:26:38Z

@rxin @yhuai Could you verify this patch?

marmbrus · 2016-01-05T18:28:07Z

This seems like a reasonable thing to do, but the implementation seems unnecessarily complex. Why not just:

transform the condition, matching on non deterministic subtrees with no references.
when you find one create an alias and add it to an ArrayBuffer, replacing the tree with alias.toAttribute
if the array buffer is empty, return the same tree. otherwise, add the two projections as you do now and use the transformed condition for the join.

marmbrus · 2016-01-05T18:28:12Z

ok to test

SparkQA · 2016-01-05T18:44:26Z

Test build #48778 has finished for PR 10128 at commit 6e16657.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

zhonghaihua · 2016-01-06T05:29:51Z

@marmbrus Thanks for your suggestions. I think your idea can simply solve problem. But in some situations, this seems not very suitable.
For example:
Join(testRelation, testRelation2, Inner,Some(And(EqualTo(a, b), EqualTo(Rand(33) * c, d)))) If c is an attribute belong to testRelation2, I think Rand(33) is more appropriately pulled out to the right child tree of Join, otherwise, if belong to testRelation, it is appropriately pulled out to left child tree.

When nondeterministic expressions is used with table attribute, I think pull out it should depend on the attribute. What do you think?

SparkQA · 2016-01-06T07:22:39Z

Test build #48833 has finished for PR 10128 at commit b2a4a6b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-01-06T18:43:07Z

When nondeterministic expressions is used with table attribute, I think pull out it should depend on the attribute. What do you think?

Why? Multiplying by c returns the same result wherever you do it. Its not worth complicating the code unless its going to change the answer.

jeanlyn · 2016-01-07T02:15:17Z

@marmbrus you are right. But i think @zhonghaihua 's solution is try to reduce cartesian product possibility, right?

marmbrus · 2016-01-07T02:18:38Z

How does this make a difference in join selection? I think the logic in ExtractEquiJoinKeys is already smart enough to do this correctly. Even if that was not the case we should not mix concerns here. Rules should be kept as simple as possible to avoid bugs.

marmbrus · 2016-01-07T02:21:07Z

Oh, I see what you are trying to do. Hmm, this feels like a hack to me. If we want to fix skewed joins I'm thinking we need a more principled solution.

jeanlyn · 2016-01-07T06:01:46Z

It's difference from join selection, it just pull out nondeterministic expressions of join condition to the left or right children, but it seems that it can reuse the code of ExtractEquiJoinKeys.

rxin · 2016-06-15T22:18:45Z

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

pull out nondeterministic expressions from Join

6e16657

pull out nondeterministic expressions from Join pull out nondeterministic expressions from Join

zhonghaihua changed the title ~~pull out nondeterministic expressions from Join~~ [SPARK-12125]pull out nondeterministic expressions from Join Dec 3, 2015

zhonghaihua changed the title ~~[SPARK-12125]pull out nondeterministic expressions from Join~~ [SPARK-12125][SQL] pull out nondeterministic expressions from Join Dec 3, 2015

fix the Scala style

b2a4a6b

asfgit closed this in 1a33f2e Jun 15, 2016

[SPARK-12125][SQL] pull out nondeterministic expressions from Join #10128

[SPARK-12125][SQL] pull out nondeterministic expressions from Join #10128

Uh oh!

Conversation

zhonghaihua commented Dec 3, 2015

Uh oh!

zhonghaihua commented Dec 3, 2015

Uh oh!

cloud-fan commented Dec 3, 2015

Uh oh!

chenghao-intel commented Dec 3, 2015

Uh oh!

jeanlyn commented Dec 4, 2015

Uh oh!

zhonghaihua commented Dec 4, 2015

Uh oh!

cloud-fan commented Dec 5, 2015

Uh oh!

zhonghaihua commented Jan 5, 2016

Uh oh!

marmbrus commented Jan 5, 2016

Uh oh!

marmbrus commented Jan 5, 2016

Uh oh!

SparkQA commented Jan 5, 2016

Uh oh!

zhonghaihua commented Jan 6, 2016

Uh oh!

SparkQA commented Jan 6, 2016

Uh oh!

marmbrus commented Jan 6, 2016

Uh oh!

jeanlyn commented Jan 7, 2016

Uh oh!

marmbrus commented Jan 7, 2016

Uh oh!

marmbrus commented Jan 7, 2016

Uh oh!

jeanlyn commented Jan 7, 2016

Uh oh!

rxin commented Jun 15, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants