Skip to content

Conversation

@PetarVasiljevic-DB
Copy link
Contributor

@PetarVasiljevic-DB PetarVasiljevic-DB commented Aug 1, 2025

What changes were proposed in this pull request?

Prior to this change, EXPLAIN FORMATTED on query e.g.:

SELECT *
FROM catalog.tbl1 t1
    JOIN catalog.tbl2 t2 ON t1.id1 = t2.id2
    JOIN catalog.tbl3 t3 ON t2.id2 = t3.id3
    JOIN catalog.tbl4 t4 ON t3.id3 = t4.id4;

looked like:

PushedJoins: [join_pushdown_catalog.tbl1, join_pushdown_catalog.tbl2, join_pushdown_catalog.tlb3, join_pushdown_catalog.tlb4]

With the change from PR, the output of EXPLAIN FORMATTED would be:

 == Physical Plan ==
   *(1) Project [ID#x, AMOUNT#x, ADDRESS#x, ID_1#x AS ID#x, NEXT_ID#x, SALARY#x, SURNAME#x, id_3#x AS id#x, id_1_2#x AS id_1#x, id_2#x, id_1_1#x, sid#x, id_4#x AS id#x, id_1_3#x AS id_1#x, id_2_2#x AS id_2#x, id_2_1#x, Sid_1#x AS Sid#x]
   +- *(1) Scan JDBC v1 Relation from v2 scan join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1[ID#x,AMOUNT#x,ADDRESS#x,ID_1#x,NEXT_ID#x,SALARY#x,SURNAME#x,id_3#x,id_1_2#x,id_2#x,id_1_1#x,sid#x,id_4#x,id_1_3#x,id_2_2#x,id_2_1#x,Sid_1#x] PushedFilters: [id_3 = (id_4 + 1)], PushedJoins: 
   [L]: PushedFilters: [ID_1 = (id_3 + 1)]
        PushedJoins:
        [L]: PushedFilters: [ID = (ID_1 + 1)]
             PushedJoins:
             [L]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1
                  PushedFilters: [ID IS NOT NULL]
             [R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_2
                  PushedFilters: [ID IS NOT NULL]
        [R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_3
             PushedFilters: [id IS NOT NULL]
   [R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_4
        PushedFilters: [id IS NOT NULL]
   , ReadSchema: struct<ID:int,AMOUNT:decimal(10,2),ADDRESS:string,ID_1:int,NEXT_ID:int,SALARY:decimal(10,2),SURNAME:string,id_3:int,id_1_2:int,id_2:int,id_1_1:int,sid:int,id_4:int,id_1_3:int,id_2_2:int,id_2_1:int,Sid_1:int>

PushedFilters on top of PushedJoins are actually join conditions.
It can be seen that the name of Scan JDBC v1 Relation from v2 scan is catalog.tbl1. This should be fixed as well, but it won't be a part of this PR.

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added the SQL label Aug 1, 2025
@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the improve_explain_command_for_dsv2_join_pushdown branch 4 times, most recently from 3563693 to 897ce34 Compare August 1, 2025 18:03
@PetarVasiljevic-DB
Copy link
Contributor Author

cc: @cloud-fan

@cloud-fan
Copy link
Contributor

@PetarVasiljevic-DB can you fix the merge conflicts?

@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the improve_explain_command_for_dsv2_join_pushdown branch 2 times, most recently from 9aa996a to c3d304d Compare August 4, 2025 14:43
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's always 0 and 1, shall we use LEFT and RIGHT? or L and R?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is always 0 and 1 yes. I can change to L and R as the indentation will stay the same. With LEFT and RIGHT we would have some indentation issues.

@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the improve_explain_command_for_dsv2_join_pushdown branch from c3d304d to 9a05c5f Compare August 4, 2025 16:27
@PetarVasiljevic-DB
Copy link
Contributor Author

@cloud-fan can we merge?

* The exmaple of resulting string is the following:
*
* PushedJoins:
* [0]: [PushedFilters: [ID_1 = (ID_2 + 1)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's update the comment

* [1]: [Relation: join_pushdown_catalog.tbl4, PushedFilters: [ID IS NOT NULL]]
*/
private def getPushedJoinString(
joinedPushedDownOperators: Seq[PushedDownOperators],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's always two elements? shall we just have two parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can have 2 parameters only.

}

// TODO: test currently doesn't assert on anything. We can assert on pushedJoins from explain
// when https://github.com/apache/spark/pull/51686 is checked in, since we would have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's merged already, can we update?

checkJoinPushed(
df,
s"""PushedFilters: [id_3 = (id_4 + 1)], PushedJoins: [
|[L]: [PushedFilters: [${caseConvert("id_1")} = (id_3 + 1)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the string looks better if there is no wrapping [] around pushed filters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so [L]: PushedFilters: [${caseConvert("id_1")} = (id_3 + 1)?
what about the bottom one with relation and pushed filters:

[L]: [Relation: $catalogAndNamespace.${caseConvert(joinTableName1)}, PushedFilters: [${caseConvert("id")} IS NOT NULL]],

Would you prefer omitting brackets here as well, or should we keep them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we unify the leaf and non-leaf nodes a bit more

[L]: PushedFilters: ...
     PushedJoins:
     [L]: PushedFilters: ...
          Relation: ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The topmost filter can't really by put in new line, because we are doing metadata.toSeq.sorted when generating the string from metadata Map, so the order would be different if I put Map("\nPushedJoins" -> ...) instead of Map("PushedJoins" -> ...).

I would still put [] in pushed filters because that is what we did before this PR for pushed filters. Also, semantically, this is an array of filters so it makes sense to wrap it into brackets.

Before committing anything, are you fine with the following structure? Because of the formatting, there is a weird formatting at the end (, ReadSchema) but it is due to how we generate the string from metadata Map

   == Physical Plan ==
   *(1) Project [ID#x, AMOUNT#x, ADDRESS#x, ID_1#x AS ID#x, NEXT_ID#x, SALARY#x, SURNAME#x, id_3#x AS id#x, id_1_2#x AS id_1#x, id_2#x, id_1_1#x, sid#x, id_4#x AS id#x, id_1_3#x AS id_1#x, id_2_2#x AS id_2#x, id_2_1#x, Sid_1#x AS Sid#x]
   +- *(1) Scan JDBC v1 Relation from v2 scan join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1[ID#x,AMOUNT#x,ADDRESS#x,ID_1#x,NEXT_ID#x,SALARY#x,SURNAME#x,id_3#x,id_1_2#x,id_2#x,id_1_1#x,sid#x,id_4#x,id_1_3#x,id_2_2#x,id_2_1#x,Sid_1#x] PushedFilters: [id_3 = (id_4 + 1)], PushedJoins: [
   [L]: PushedFilters: [ID_1 = (id_3 + 1)]
        PushedJoins:
        [L]: PushedFilters: [ID = (ID_1 + 1)]
             PushedJoins:
             [L]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1
                  PushedFilters: [ID IS NOT NULL]
             [R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_2
                  PushedFilters: [ID IS NOT NULL]
        [R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_3
             PushedFilters: [id IS NOT NULL]
   [R]: Relation: join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_4
        PushedFilters: [id IS NOT NULL]]
   , ReadSchema: struct<ID:int,AMOUNT:decimal(10,2),ADDRESS:string,ID_1:int,NEXT_ID:int,SALARY:decimal(10,2),SURNAME:string,id_3:int,id_1_2:int,id_2:int,id_1_1:int,sid:int,id_4:int,id_1_3:int,id_2_2:int,id_2_1:int,Sid_1:int>
   

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah this looks better!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Pushed the commit


withSQLConf(SQLConf.DATA_SOURCE_V2_JOIN_PUSHDOWN.key -> "true") {
val df = sql(sqlQuery)
checkJoinPushed(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we remove all these checks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find these useless now. There is a separate test for testing the output of explain command. I can get them back if it's a blocker.

@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the improve_explain_command_for_dsv2_join_pushdown branch 2 times, most recently from 411e8c2 to bd25ef4 Compare August 6, 2025 17:53
@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the improve_explain_command_for_dsv2_join_pushdown branch from bd25ef4 to 0531a70 Compare August 6, 2025 23:12
@cloud-fan
Copy link
Contributor

The failed test passed locally, thanks, merging to master!

@cloud-fan cloud-fan closed this in 88dbe42 Aug 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants