-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-21520][SQL][FOLLOW-UP]fix a special case for non-deterministic projects in optimizer #18969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can one of the admins verify this patch? |
|
cc @gatorsmile @cloud-fan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should still consider whether the fields are non-deterministic. It makes sense only when the non-deterministic fields are not referencing any attribute. Thus, your use case is pretty rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks,
I have modify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @gatorsmile .
Could you review again?
e84425f to
6ab7f15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aha, :(
I misunderstood you.
Is not that we have added a condition.
case p @ Project(fields, child: LeafNode) if p.references.nonEmpty =>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might not worth to fix. This only covers a rare case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be a rare case. but in business scenario, there are still a lot of scenes to use the rare case. similar business scenarios:
1.Random grouping, add a random factor to each row of data before grouping.
2.Use the random value to fill a field, easy to follow algorithm for calculation or prevents querying data anomalies.
3.Data skew, data are discretized using random values.
thanks.
6ab7f15 to
28a7ffa
Compare
7b53c97 to
55af92b
Compare
55af92b to
4ebabc7
Compare
|
@heary-cao Maybe you can close this PR first? @jiangxb1987 will handle it in the previous PR. |
What changes were proposed in this pull request?
This is a follow-up of #18892 :
Currently, Did a lot of special handling for non-deterministic projects in optimizer. but not good enough. this patch add a new special case for non-deterministic projects. the application logic is as follows, in my spark-shell, Simple creates a 22 column Orc data table.
sql Query statement:
Before modifiy, executed Plan:
FileScanRDD read a row of userdata: [0,0,2,3,0,4,2,0,8,7,b800000001,c000000001,c800000001,d000000001,d800000001,e000000001,0,0,4000000000000000,4008000000000000,0,4010000000000000,0,30,30,32,33,30,34]
we read all columns of userdata from orc table. there is such a change, Because the fields of projects is non-deterministic, contains nondeterministic function(rand function). PhysicalOperation excluded that the fields of projects is non-deterministic.
After modifiy, executed Plan:
FileScanRDD read a row of userdata: [0,0,2,3,0]
we read only need to userdata from orc table.
In addition, HiveTableScans also scan more columns according to the execution plan:
and it will affect the performance of task.
Similar query statements as follows:
select k,k,sum(id) from (select d004 as id, floor(rand() * 10000) as k, ceil(c010) as cceila from t054) a group by k,k;In a spark cluster environment with 1 master and 2 work nodes, the performance before and after modification is 556s vs 5996s.
this PR describe ways to solve the problem.
How was this patch tested?
Should be covered existing test cases and add test cases.