-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-53066][SQL] Improve EXPLAIN output for DSv2 Join pushdown #51781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-53066][SQL] Improve EXPLAIN output for DSv2 Join pushdown #51781
Conversation
3563693 to
897ce34
Compare
|
cc: @cloud-fan |
|
@PetarVasiljevic-DB can you fix the merge conflicts? |
9aa996a to
c3d304d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's always 0 and 1, shall we use LEFT and RIGHT? or L and R?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is always 0 and 1 yes. I can change to L and R as the indentation will stay the same. With LEFT and RIGHT we would have some indentation issues.
c3d304d to
9a05c5f
Compare
|
@cloud-fan can we merge? |
| * The exmaple of resulting string is the following: | ||
| * | ||
| * PushedJoins: | ||
| * [0]: [PushedFilters: [ID_1 = (ID_2 + 1)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's update the comment
| * [1]: [Relation: join_pushdown_catalog.tbl4, PushedFilters: [ID IS NOT NULL]] | ||
| */ | ||
| private def getPushedJoinString( | ||
| joinedPushedDownOperators: Seq[PushedDownOperators], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's always two elements? shall we just have two parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we can have 2 parameters only.
| } | ||
|
|
||
| // TODO: test currently doesn't assert on anything. We can assert on pushedJoins from explain | ||
| // when https://github.com/apache/spark/pull/51686 is checked in, since we would have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's merged already, can we update?
| checkJoinPushed( | ||
| df, | ||
| s"""PushedFilters: [id_3 = (id_4 + 1)], PushedJoins: [ | ||
| |[L]: [PushedFilters: [${caseConvert("id_1")} = (id_3 + 1)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the string looks better if there is no wrapping [] around pushed filters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so [L]: PushedFilters: [${caseConvert("id_1")} = (id_3 + 1)?
what about the bottom one with relation and pushed filters:
[L]: [Relation: $catalogAndNamespace.${caseConvert(joinTableName1)}, PushedFilters: [${caseConvert("id")} IS NOT NULL]],
Would you prefer omitting brackets here as well, or should we keep them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we unify the leaf and non-leaf nodes a bit more
[L]: PushedFilters: ...
PushedJoins:
[L]: PushedFilters: ...
Relation: ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The topmost filter can't really by put in new line, because we are doing metadata.toSeq.sorted when generating the string from metadata Map, so the order would be different if I put Map("\nPushedJoins" -> ...) instead of Map("PushedJoins" -> ...).
I would still put [] in pushed filters because that is what we did before this PR for pushed filters. Also, semantically, this is an array of filters so it makes sense to wrap it into brackets.
Before committing anything, are you fine with the following structure? Because of the formatting, there is a weird formatting at the end (, ReadSchema) but it is due to how we generate the string from metadata Map
== Physical Plan ==
*(1) Project [ID#x, AMOUNT#x, ADDRESS#x, ID_1#x AS ID#x, NEXT_ID#x, SALARY#x, SURNAME#x, id_3#x AS id#x, id_1_2#x AS id_1#x, id_2#x, id_1_1#x, sid#x, id_4#x AS id#x, id_1_3#x AS id_1#x, id_2_2#x AS id_2#x, id_2_1#x, Sid_1#x AS Sid#x]
+- *(1) Scan JDBC v1 Relation from v2 scan join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1[ID#x,AMOUNT#x,ADDRESS#x,ID_1#x,NEXT_ID#x,SALARY#x,SURNAME#x,id_3#x,id_1_2#x,id_2#x,id_1_1#x,sid#x,id_4#x,id_1_3#x,id_2_2#x,id_2_1#x,Sid_1#x] PushedFilters: [id_3 = (id_4 + 1)], PushedJoins: [
[L]: PushedFilters: [ID_1 = (id_3 + 1)]
PushedJoins:
[L]: PushedFilters: [ID = (ID_1 + 1)]
PushedJoins:
[L]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1
PushedFilters: [ID IS NOT NULL]
[R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_2
PushedFilters: [ID IS NOT NULL]
[R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_3
PushedFilters: [id IS NOT NULL]
[R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_4
PushedFilters: [id IS NOT NULL]]
, ReadSchema: struct<ID:int,AMOUNT:decimal(10,2),ADDRESS:string,ID_1:int,NEXT_ID:int,SALARY:decimal(10,2),SURNAME:string,id_3:int,id_1_2:int,id_2:int,id_1_1:int,sid:int,id_4:int,id_1_3:int,id_2_2:int,id_2_1:int,Sid_1:int>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah this looks better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! Pushed the commit
|
|
||
| withSQLConf(SQLConf.DATA_SOURCE_V2_JOIN_PUSHDOWN.key -> "true") { | ||
| val df = sql(sqlQuery) | ||
| checkJoinPushed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we remove all these checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find these useless now. There is a separate test for testing the output of explain command. I can get them back if it's a blocker.
411e8c2 to
bd25ef4
Compare
bd25ef4 to
0531a70
Compare
|
The failed test passed locally, thanks, merging to master! |
What changes were proposed in this pull request?
Prior to this change,
EXPLAIN FORMATTEDon query e.g.:looked like:
With the change from PR, the output of
EXPLAIN FORMATTEDwould be:PushedFilters on top of PushedJoins are actually join conditions.
It can be seen that the name of
Scan JDBC v1 Relation from v2 scaniscatalog.tbl1. This should be fixed as well, but it won't be a part of this PR.Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?