
Conversation

@beliefer
Contributor

@beliefer beliefer commented Jul 27, 2022

What changes were proposed in this pull request?

Currently, DS V2 aggregate push-down cannot work with DS V2 Top N push-down (ORDER BY col LIMIT m) or DS V2 Paging push-down (ORDER BY col LIMIT m OFFSET n).
If we can push down aggregates together with Top N or Paging, performance will be better.

This PR only lets aggregates be pushed down when the ORDER BY expressions are also GROUP BY expressions.

The idea of this PR is:

  1. When we set the expected outputs of ScanBuilderHolder, also hold a map from the expected outputs to the original expressions (which contain the original columns).
  2. When we try to push down Top N or Paging, restore the original expressions for SortOrder, as sketched below.
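
A minimal sketch of the restore step (all attribute names here are illustrative; `pushedAggOutputMap` follows the field name suggested in review below):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.{IntegerType, StringType}

// `groupOutput` is the attribute exposed after aggregate pushdown for the
// group expression CAST(DEPT AS STRING); `originExpr` is that original
// expression over the table column.
val dept = AttributeReference("DEPT", IntegerType)()
val originExpr = Cast(dept, StringType)
val groupOutput = AttributeReference("my_dept", StringType)()
val pushedAggOutputMap = AttributeMap(Seq(groupOutput -> originExpr))

// Before translating the sort order for Top N / Paging pushdown, replace
// post-aggregate attributes with the original expressions.
val order = SortOrder(groupOutput, Ascending)
val restored = order.transform {
  case a: Attribute => pushedAggOutputMap.getOrElse(a, a)
}.asInstanceOf[SortOrder]
// `restored` now orders by CAST(DEPT AS STRING), which the data source understands.
```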

Why are the changes needed?

Letting DS V2 aggregate push-down work with Top N or Paging (sort with group expressions) gives users better performance.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

@github-actions github-actions bot added the SQL label Jul 27, 2022
@beliefer beliefer changed the title [WIP][SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions) [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions) Jul 28, 2022
@beliefer
Contributor Author

ping @huaxingao cc @cloud-fan

Contributor

We can use AttributeMap
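
For context, `AttributeMap` keys entries by the attribute's `exprId` rather than by name or object identity, so lookups still succeed after an attribute is renamed or copied. A tiny illustration (the attribute itself is hypothetical):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference}
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("x", IntegerType)()
val m = AttributeMap(Seq(a -> "some value"))
m.contains(a.withName("renamed"))  // true: keyed by exprId, not by name
```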

Contributor Author

Thank you for the reminder.

Contributor

Suggested change
var pushedAggregateExpectedOutputMap: Map[AttributeReference, Expression] =
var pushedAggOutputMap: Map[AttributeReference, Expression] =

Contributor

I think we can just do something similar to replaceAlias:

sortOrder transform {
  case a: Attribute => sHolder.pushedAggOutputMap.getOrElse(a, a)
}

Contributor

I don't think we need to test the same query with both DataFrame and SQL

Contributor Author

This test case tests GROUP BY on a column with an alias.

Contributor

It's a bit hard to review the tests when SQL and the DataFrame API are mixed. Can we be consistent?

Contributor Author

OK

Contributor

let's check one more case: ... GROUP BY a, b ORDER BY a + b

Contributor Author

OK

Contributor

I think we can remove all these comments here.

Contributor

We already have tests to verify the alias; we don't need to test it again with 2 columns.

Contributor

can we add some comments to explain why top n can't be pushed here?

Contributor Author

@beliefer beliefer Aug 2, 2022

I added the comments into V2ScanRelationPushDown directly.

Contributor

We can translate agg expressions now, so why does this test still not trigger top-n pushdown?

Contributor

ditto

Contributor

This seems like a copy of the test case above but adding .offset(1)?

Contributor Author

Yes.

Contributor

can we reduce the code duplication?

Seq(true, false).foreach { hasOffset =>
  var df = ...
  if (hasOffset) df = df.offset(1)
  df = df.limit(1)
  ...
}

Contributor Author

I tried it and found it doesn't reduce the code size.

Contributor

can we use some functions that don't need ansi mode to simplify the test? e.g. log(10, dept)


Contributor Author

H2 doesn't support pushing down this SQL:

org.h2.jdbc.JdbcSQLSyntaxErrorException: Column "SALARY" must be in the GROUP BY list; SQL statement:
SELECT CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 10000.00) THEN "SALARY" ELSE 0.00 END,SUM("SALARY") FROM "test"."employee"   GROUP BY CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 10000.00) THEN "SALARY" ELSE 0.00 END  ORDER BY (CASE WHEN ("SALARY" > 8000.00) AND ("SALARY" < 10000.00) THEN "SALARY" ELSE 0.00 END) > 1.00 ASC NULLS FIRST LIMIT 1 [90016-214]
	at org.h2.message.DbException.getJdbcSQLException(DbException.java:632)

But PostgreSQL works well.

Contributor Author

@beliefer beliefer Aug 2, 2022

log(10, dept)

log(10, dept) still adds a Cast implicitly.

Contributor

// `ScanBuilderHolder` has different output columns after aggregate pushdown. Here we
// replace the attributes in ordering expressions with the original table output columns.

Contributor

we can remove this TODO now.

Contributor

and let's add tests for it.

Contributor Author

OK

Contributor

let's add a cast instead of just a simple alias

.select($"DEPT".cast("string").as("my_dept"), $"SALARY")

Contributor Author

OK

Contributor

we can remove this test after you address https://github.com/apache/spark/pull/37320/files#r938791655

Contributor

let's use cast now. group by a predicate is super weird.

Contributor Author

OK

Contributor

let's use cast instead of case when, which is simpler and easier to review.

@cloud-fan
Contributor

sorry there are conflicts...

@beliefer
Contributor Author

beliefer commented Aug 8, 2022

sorry there are conflicts...

Let's fix it. :)

@beliefer beliefer changed the title [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with group expressions) [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with expressions) Aug 8, 2022
public Expression[] children() { return new Expression[]{ expression() }; }

@Override
public String toString() {
Contributor

is this the only missing one? maybe we should add a default implementation in the base class Expression using ToStringSQLBuilder?

Contributor Author

No, some V2 expressions are missing toString too, e.g. Extract.
Adding a default implementation in the base class Expression using ToStringSQLBuilder is a good idea.

Contributor Author

I will fix it.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 07a5e5f Aug 9, 2022
@beliefer
Contributor Author

beliefer commented Aug 9, 2022

@cloud-fan Thank you very much!

chenzhx pushed a commit to chenzhx/spark that referenced this pull request Aug 15, 2022
[SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with expressions)

Closes apache#37320 from beliefer/SPARK-39819_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
chenzhx added a commit to Kyligence/spark that referenced this pull request Aug 17, 2022
[SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with expressions) (#525)

* [SPARK-39784][SQL] Put Literal values on the right side of the data source filter after translating Catalyst Expression to data source filter

### What changes were proposed in this pull request?

Even though the literal value could be on both sides of the filter, e.g. both `a > 1` and `1 < a` are valid, after translating Catalyst Expression to data source filter, we want the literal value on the right side so it's easier for the data source to handle these filters. We do this kind of normalization for V1 Filter. We should have the same behavior for V2 Filter.

Before this PR, for the filters that have literal values on the right side, e.g. `1 > a`, we keep it as is. After this PR, we will normalize it to `a < 1` so the data source doesn't need to check each of the filters (and do the flip).

### Why are the changes needed?
I think we should follow V1 Filter's behavior, normalize the filters during catalyst Expression to DS Filter translation time to make the literal values on the right side, so later on, data source doesn't need to check every single filter to figure out if it needs to flip the sides.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
new test

Closes apache#37197 from huaxingao/flip.

Authored-by: huaxingao <[email protected]>
Signed-off-by: huaxingao <[email protected]>

* [SPARK-39836][SQL] Simplify V2ExpressionBuilder by extract common method

### What changes were proposed in this pull request?
Currently, `V2ExpressionBuilder` has a lot of similar code; we can extract it into one common method.

We can simplify the implementation with the common method.

### Why are the changes needed?
Simplify `V2ExpressionBuilder` by extracting a common method.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update inner implementation.

### How was this patch tested?
N/A

Closes apache#37249 from beliefer/SPARK-39836.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39858][SQL] Remove unnecessary `AliasHelper` or `PredicateHelper` for some rules

### What changes were proposed in this pull request?
While using `AliasHelper`, I found that some rules inherit it without actually using it.

This PR removes unnecessary `AliasHelper` or `PredicateHelper` in the following cases:
- The rule inherits `AliasHelper` instead of using it. In this case, we can remove `AliasHelper` directly.
- The rule inherits `PredicateHelper` instead of using it. In this case, we can remove `PredicateHelper` directly.
- The rule inherits `AliasHelper` and `PredicateHelper`. In fact, `PredicateHelper` already extends `AliasHelper`. In this case, we can remove `AliasHelper`.
- The rule inherits `OperationHelper` and `PredicateHelper`. In fact, `OperationHelper` already extends `PredicateHelper`. In this case, we can remove `PredicateHelper`.
- The rule inherits `PlanTest` and `PredicateHelper`. In fact, `PlanTest` already extends `PredicateHelper`. In this case, we can remove `PredicateHelper`.
- The rule inherits `QueryTest` and `PredicateHelper`. In fact, `QueryTest` already extends `PredicateHelper`. In this case, we can remove `PredicateHelper`.

### Why are the changes needed?
Remove unnecessary `AliasHelper` or `PredicateHelper` for some rules

### Does this PR introduce _any_ user-facing change?
'No'.
Just improve the inner implementation.

### How was this patch tested?
N/A

Closes apache#37272 from beliefer/SPARK-39858.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39784][SQL][FOLLOW-UP] Use BinaryComparison instead of Predicate (if) for type check

### What changes were proposed in this pull request?
follow up this [comment](apache#37197 (comment))

### Why are the changes needed?
code simplification

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test

Closes apache#37278 from huaxingao/followup.

Authored-by: huaxingao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-39909] Organize the check of push down information for JDBCV2Suite

### What changes were proposed in this pull request?
This PR changes the check method from `check(one_large_string)` to `check(small_string1, small_string2, ...)`

### Why are the changes needed?
It helps us check the results individually and makes the code clearer.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

Closes apache#37342 from yabola/fix.

Authored-by: chenliang.lu <[email protected]>
Signed-off-by: huaxingao <[email protected]>

* [SPARK-39961][SQL] DS V2 push-down translate Cast if the cast is safe

### What changes were proposed in this pull request?
Currently, DS V2 push-down translates `Cast` only if ANSI mode is enabled.
In fact, if the cast is safe (e.g. casting a number to string, or an int to a long), we can translate it too.

This PR calls `Cast.canUpCast` so that we can translate `Cast` to a V2 `Cast` safely.

Note: the rule `SimplifyCasts` optimizes away some safe casts, e.g. int to long, so we may not always see the `Cast`.

### Why are the changes needed?
Expand the range of cases where DS V2 pushes down `Cast`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`Cast` could be pushed down to data source in more cases.

### How was this patch tested?
Test cases updated.

Closes apache#37388 from beliefer/SPARK-39961.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-38901][SQL] DS V2 supports push down misc functions

### What changes were proposed in this pull request?

Currently, Spark has some misc functions. Please refer to
https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L688

These functions are shown below:
`AES_ENCRYPT`,
`AES_DECRYPT`,
`SHA1`,
`SHA2`,
`MD5`,
`CRC32`

Function|PostgreSQL|ClickHouse|H2|MySQL|Oracle|Redshift|Snowflake|DB2|Vertica|Exasol|SqlServer|Yellowbrick|Mariadb|Singlestore|
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
`AesEncrypt`|Yes|Yes|Yes|Yes|Yes|NO|Yes|Yes|NO|NO|NO|Yes|Yes|Yes|
`AesDecrypt`|Yes|Yes|Yes|Yes|Yes|NO|Yes|Yes|NO|NO|NO|Yes|Yes|Yes|
`Sha1`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|
`Sha2`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|
`Md5`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|
`Crc32`|No|Yes|No|Yes|NO|Yes|NO|Yes|NO|NO|NO|NO|NO|Yes|

DS V2 should support pushing down these misc functions.

### Why are the changes needed?

DS V2 supports push down misc functions.

### Does this PR introduce _any_ user-facing change?

'No'.
New feature.

### How was this patch tested?

New tests.

Closes apache#37169 from chenzhx/misc.

Authored-by: chenzhx <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39964][SQL] DS V2 pushdown should unify the translate path

### What changes were proposed in this pull request?
Currently, DS V2 pushdown has two translation paths: `DataSourceStrategy.translateAggregate`, used to translate aggregate functions, and `V2ExpressionBuilder`, used to translate other functions and expressions. We can unify them.

After this PR, translation has only one code path, making it easier for developers to write and read.

### Why are the changes needed?
Unify the translate path for DS V2 pushdown.

### Does this PR introduce _any_ user-facing change?
'No'.
Just update the inner implementation.

### How was this patch tested?
N/A

Closes apache#37391 from beliefer/SPARK-39964.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with expressions)

Closes apache#37320 from beliefer/SPARK-39819_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39929][SQL] DS V2 supports push down string functions (non ANSI)

**What changes were proposed in this pull request?**

Support more commonly used string functions:

BIT_LENGTH
CHAR_LENGTH
CONCAT

The mainstream databases' support for these functions is shown below.

Function | PostgreSQL | ClickHouse | H2 | MySQL | Oracle | Redshift | Presto | Teradata | Snowflake | DB2 | Vertica | Exasol | SqlServer | Yellowbrick | Impala | Mariadb | Druid | Pig | SQLite | Influxdata | Singlestore | ElasticSearch
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
BIT_LENGTH | Yes | Yes | Yes | Yes | Yes | no | no | no | no | Yes | Yes | Yes | no | Yes | no | Yes | no | no | no | no | no | Yes
CHAR_LENGTH | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | no | Yes | Yes | Yes | Yes
CONCAT | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | no | no | no | Yes | Yes

**Why are the changes needed?**
DS V2 supports push down string functions

**Does this PR introduce any user-facing change?**
'No'.
New feature.

**How was this patch tested?**
New tests.

Closes apache#37427 from zheniantoushipashi/SPARK-39929.

Authored-by: biaobiao.sun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38899][SQL][FOLLOWUP] Fix bug extract datetime in DS V2 pushdown

### What changes were proposed in this pull request?

[SPARK-38899](apache#36663) supports the extract function in the JDBC data source, but the implementation is incorrect.
This PR adds a test case that fails without the fix.
The test case is shown below.
```
test("scan with filter push-down with date time functions")  {
    val df9 = sql("SELECT name FROM h2.test.datetime WHERE " +
      "dayofyear(date1) > 100 order by dayofyear(date1) limit 1")
    checkFiltersRemoved(df9)
    val expectedPlanFragment9 =
      "PushedFilters: [DATE1 IS NOT NULL, EXTRACT(DAY_OF_YEAR FROM DATE1) > 100], " +
      "PushedTopN: ORDER BY [EXTRACT(DAY_OF_YEAR FROM DATE1) ASC NULLS FIRST] LIMIT 1,"
    checkPushedInfo(df9, expectedPlanFragment9)
    checkAnswer(df9, Seq(Row("alex")))
  }
```

The test failure output is shown below.
```
"== Parsed Logical Plan ==
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Sort ['dayofyear('date1) ASC NULLS FIRST], true
      +- 'Project ['name]
         +- 'Filter ('dayofyear('date1) > 100)
            +- 'UnresolvedRelation [h2, test, datetime], [], false

== Analyzed Logical Plan ==
name: string
GlobalLimit 1
+- LocalLimit 1
   +- Project [name#x]
      +- Sort [dayofyear(date1#x) ASC NULLS FIRST], true
         +- Project [name#x, date1#x]
            +- Filter (dayofyear(date1#x) > 100)
               +- SubqueryAlias h2.test.datetime
                  +- RelationV2[NAME#x, DATE1#x, TIME1#x] h2.test.datetime test.datetime

== Optimized Logical Plan ==
Project [name#x]
+- RelationV2[NAME#x] test.datetime

== Physical Plan ==
*(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$145f6181a [NAME#x] PushedFilters: [DATE1 IS NOT NULL, EXTRACT(DAY_OF_YEAR FROM DATE1) > 100], PushedTopN: ORDER BY [org.apache.spark.sql.connector.expressions.Extract3b95fce9 ASC NULLS FIRST] LIMIT 1, ReadSchema: struct<NAME:string>

" did not contain "PushedFilters: [DATE1 IS NOT NULL, EXTRACT(DAY_OF_YEAR FROM DATE1) > 100], PushedTopN: ORDER BY [EXTRACT(DAY_OF_YEAR FROM DATE1) ASC NULLS FIRST] LIMIT 1,"
```

### Why are the changes needed?

Fix an implementation bug.
The cause is that the `Extract` function does not implement the `toString` method used when pushing down to the JDBC data source.

### Does this PR introduce _any_ user-facing change?

'No'.
New feature.

### How was this patch tested?

New test case.

Closes apache#37469 from chenzhx/spark-master.

Authored-by: chenzhx <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* code update

Signed-off-by: huaxingao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: huaxingao <[email protected]>
Co-authored-by: Jiaan Geng <[email protected]>
Co-authored-by: chenliang.lu <[email protected]>
Co-authored-by: biaobiao.sun <[email protected]>
chenzhx added a commit to chenzhx/spark that referenced this pull request Sep 15, 2022
[SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with expressions) (Kyligence#525)
chenzhx added a commit to Kyligence/spark that referenced this pull request Sep 15, 2022
[SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with expressions) (#525)
leejaywei pushed a commit to Kyligence/spark that referenced this pull request Oct 18, 2022
…aging (Sort with expressions) (#525)

* [SPARK-39784][SQL] Put Literal values on the right side of the data source filter after translating Catalyst Expression to data source filter

Even though the literal value could be on both sides of the filter, e.g. both `a > 1` and `1 < a` are valid, after translating Catalyst Expression to data source filter, we want the literal value on the right side so it's easier for the data source to handle these filters. We do this kind of normalization for V1 Filter. We should have the same behavior for V2 Filter.

Before this PR, for the filters that have literal values on the right side, e.g. `1 > a`, we keep it as is. After this PR, we will normalize it to `a < 1` so the data source doesn't need to check each of the filters (and do the flip).

I think we should follow V1 Filter's behavior, normalize the filters during catalyst Expression to DS Filter translation time to make the literal values on the right side, so later on, data source doesn't need to check every single filter to figure out if it needs to flip the sides.

no

new test

Closes apache#37197 from huaxingao/flip.

Authored-by: huaxingao <[email protected]>
Signed-off-by: huaxingao <[email protected]>

* [SPARK-39836][SQL] Simplify V2ExpressionBuilder by extract common method

Currently, `V2ExpressionBuilder` have a lot of similar code, we can extract them as one common method.

We can simplify the implement with the common method.

Simplify `V2ExpressionBuilder` by extract common method.

'No'.
Just update inner implementation.

N/A

Closes apache#37249 from beliefer/SPARK-39836.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39858][SQL] Remove unnecessary `AliasHelper` or `PredicateHelper` for some rules

When I using `AliasHelper`, I find that some rules inherit it instead of using it.

This PR removes unnecessary `AliasHelper` or `PredicateHelper` in the following cases:
- The rule inherit `AliasHelper` instead of using it. In this case, we can remove `AliasHelper` directly.
- The rule inherit `PredicateHelper` instead of using it. In this case, we can remove `PredicateHelper` directly.
- The rule inherit `AliasHelper` and `PredicateHelper`. In fact, `PredicateHelper` already extends `AliasHelper`. In this case, we can remove `AliasHelper`.
- The rule inherit `OperationHelper` and `PredicateHelper`. In fact, `OperationHelper` already extends `PredicateHelper`. In this case, we can remove `PredicateHelper`.
- The rule inherit `PlanTest` and `PredicateHelper`. In fact, `PlanTest` already extends `PredicateHelper`.  In this case, we can remove `PredicateHelper`.
- The rule inherit `QueryTest` and `PredicateHelper`. In fact, `QueryTest` already extends `PredicateHelper`.  In this case, we can remove `PredicateHelper`.

Remove unnecessary `AliasHelper` or `PredicateHelper` for some rules

'No'.
Just improve the inner implementation.

N/A

Closes apache#37272 from beliefer/SPARK-39858.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39784][SQL][FOLLOW-UP] Use BinaryComparison instead of Predicate (if) for type check

follow up this [comment](apache#37197 (comment))

code simplification

No

Existing test

Closes apache#37278 from huaxingao/followup.

Authored-by: huaxingao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-39909] Organize the check of push down information for JDBCV2Suite

This PR changes the check method from `check(one_large_string)` to `check(small_string1, small_string2, ...)`

It can help us check the results individually and make the code more clearer.

no

existing tests

Closes apache#37342 from yabola/fix.

Authored-by: chenliang.lu <[email protected]>
Signed-off-by: huaxingao <[email protected]>

* [SPARK-39961][SQL] DS V2 push-down translate Cast if the cast is safe

Currently, DS V2 push-down translate `Cast` only if the ansi mode is true.
In fact, if the cast is safe(e.g. cast number to string, cast int to long), we can translate it too.

This PR will call `Cast.canUpCast` so as we can translate `Cast` to V2 `Cast` safely.

Note: The rule `SimplifyCasts` optimize some safe cast, e.g. cast int to long, so we may not see the `Cast`.

Add the range for DS V2 push down `Cast`.

'Yes'.
`Cast` could be pushed down to data source in more cases.

Test cases updated.

Closes apache#37388 from beliefer/SPARK-39961.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

* [SPARK-38901][SQL] DS V2 supports push down misc functions

Currently, Spark have some misc functions. Please refer
https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L688

These functions show below:
`AES_ENCRYPT,`
`AES_DECRYPT`,
`SHA1`,
`SHA2`,
`MD5`,
`CRC32`

Function|PostgreSQL|ClickHouse|H2|MySQL|Oracle|Redshift|Snowflake|DB2|Vertica|Exasol|SqlServer|Yellowbrick|Mariadb|Singlestore|
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
`AesEncrypt`|Yes|Yes|Yes|Yes|Yes|NO|Yes|Yes|NO|NO|NO|Yes|Yes|Yes|
`AesDecrypt`|Yes|Yes|Yes|Yes|Yes|NO|Yes|Yes|NO|NO|NO|Yes|Yes|Yes|
`Sha1`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|
`Sha2`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|
`Md5`|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|Yes|
`Crc32`|No|Yes|No|Yes|NO|Yes|NO|Yes|NO|NO|NO|NO|NO|Yes|

DS V2 should support pushing down these misc functions.

DS V2 supports pushing down misc functions.
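For example (an illustrative suite-style query against the H2 test catalog; the exact pushed plan text may differ):

```
// SHA2 can now be evaluated by the database instead of by Spark; the plan
// is expected to mention a pushed SHA2(NAME, 256) predicate.
val df = sql("SELECT name FROM h2.test.employee WHERE sha2(name, 256) <> ''")
```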

'No'.
New feature.

New tests.

Closes apache#37169 from chenzhx/misc.

Authored-by: chenzhx <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39964][SQL] DS V2 pushdown should unify the translate path

Currently, DS V2 pushdown has two translation paths: `DataSourceStrategy.translateAggregate`, used to translate aggregate functions, and `V2ExpressionBuilder`, used to translate other functions and expressions. We can unify them.

After this PR, translation has a single code path, which is easier for developers to write and read.
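A hedged sketch of the unification (local stubs; Spark's real types and entry points differ):

```
// One builder now covers both ordinary expressions and aggregates, so an
// aggregate's arguments flow through the same translation logic.
sealed trait Expr
case class Col(name: String) extends Expr
case class Sum(child: Expr) extends Expr

def build(e: Expr): Option[String] = e match {
  case Col(n)     => Some(n.toUpperCase)
  case Sum(child) => build(child).map(c => s"SUM($c)") // reuses the same path
}
```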

Unify the translate path for DS V2 pushdown.

'No'.
Just update the inner implementation.

N/A

Closes apache#37391 from beliefer/SPARK-39964.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39819][SQL] DS V2 aggregate push down can work with Top N or Paging (Sort with expressions)

Currently, DS V2 aggregate push-down cannot work with DS V2 Top N push-down (`ORDER BY col LIMIT m`) or DS V2 Paging push-down (`ORDER BY col LIMIT m OFFSET n`).
If we can push down the aggregate together with Top N or Paging, performance will be better.

This PR only lets the aggregate be pushed down when the ORDER BY expressions are also GROUP BY expressions (see the example after the list below).

The idea of this PR is:
1. When we set the expected outputs of `ScanBuilderHolder`, keep a map from the expected outputs to the original expressions (including the original columns).
2. When we try to push down Top N or Paging, restore the original expressions for `SortOrder`.
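Concretely (an illustrative suite-style query; `h2.test.employee` is the fixture used elsewhere in these tests), both the aggregate and the Top N can be pushed because the sort key is also a grouping key:

```
// ORDER BY dept is a GROUP BY expression, so the scan can take the
// aggregate together with ORDER BY ... LIMIT.
val df = sql(
  "SELECT dept, count(*) FROM h2.test.employee GROUP BY dept ORDER BY dept LIMIT 3")
```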

Letting DS V2 aggregate push-down work with Top N or Paging (sort by group expressions) gives users better performance.

'No'.
New feature.

New test cases.

Closes apache#37320 from beliefer/SPARK-39819_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-39929][SQL] DS V2 supports push down string functions (non ANSI)

**What changes were proposed in this pull request?**

Support more commonly used string functions:

BIT_LENGTH
CHAR_LENGTH
CONCAT

Mainstream database support for these functions is shown below.

Function | PostgreSQL | ClickHouse | H2 | MySQL | Oracle | Redshift | Presto | Teradata | Snowflake | DB2 | Vertica | Exasol | SqlServer | Yellowbrick | Impala | Mariadb | Druid | Pig | SQLite | Influxdata | Singlestore | ElasticSearch
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
BIT_LENGTH | Yes | Yes | Yes | Yes | Yes | no | no | no | no | Yes | Yes | Yes | no | Yes | no | Yes | no | no | no | no | no | Yes
CHAR_LENGTH | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | no | Yes | Yes | Yes | Yes
CONCAT | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | no | no | no | Yes | Yes

**Why are the changes needed?**
DS V2 supports pushing down string functions.
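For example (illustrative, same H2 test fixture):

```
// CHAR_LENGTH is evaluated by the JDBC source once pushed down.
val df = sql("SELECT name FROM h2.test.employee WHERE char_length(name) > 5")
```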

**Does this PR introduce any user-facing change?**
'No'.
New feature.

**How was this patch tested?**
New tests.

Closes apache#37427 from zheniantoushipashi/SPARK-39929.

Authored-by: biaobiao.sun <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* [SPARK-38899][SQL][FOLLOWUP] Fix bug extract datetime in DS V2 pushdown

[SPARK-38899](apache#36663) supports the extract function in the JDBC data source, but the implementation is incorrect.
This PR adds a test case that fails without the fix.
The test case is shown below.
```
test("scan with filter push-down with date time functions")  {
    val df9 = sql("SELECT name FROM h2.test.datetime WHERE " +
      "dayofyear(date1) > 100 order by dayofyear(date1) limit 1")
    checkFiltersRemoved(df9)
    val expectedPlanFragment9 =
      "PushedFilters: [DATE1 IS NOT NULL, EXTRACT(DAY_OF_YEAR FROM DATE1) > 100], " +
      "PushedTopN: ORDER BY [EXTRACT(DAY_OF_YEAR FROM DATE1) ASC NULLS FIRST] LIMIT 1,"
    checkPushedInfo(df9, expectedPlanFragment9)
    checkAnswer(df9, Seq(Row("alex")))
  }
```

The test failure output is shown below.
```
"== Parsed Logical Plan ==
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Sort ['dayofyear('date1) ASC NULLS FIRST], true
      +- 'Project ['name]
         +- 'Filter ('dayofyear('date1) > 100)
            +- 'UnresolvedRelation [h2, test, datetime], [], false

== Analyzed Logical Plan ==
name: string
GlobalLimit 1
+- LocalLimit 1
   +- Project [name#x]
      +- Sort [dayofyear(date1#x) ASC NULLS FIRST], true
         +- Project [name#x, date1#x]
            +- Filter (dayofyear(date1#x) > 100)
               +- SubqueryAlias h2.test.datetime
                  +- RelationV2[NAME#x, DATE1#x, TIME1#x] h2.test.datetime test.datetime

== Optimized Logical Plan ==
Project [name#x]
+- RelationV2[NAME#x] test.datetime

== Physical Plan ==
*(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$145f6181a [NAME#x] PushedFilters: [DATE1 IS NOT NULL, EXTRACT(DAY_OF_YEAR FROM DATE1) > 100], PushedTopN: ORDER BY [org.apache.spark.sql.connector.expressions.Extract3b95fce9 ASC NULLS FIRST] LIMIT 1, ReadSchema: struct<NAME:string>

" did not contain "PushedFilters: [DATE1 IS NOT NULL, EXTRACT(DAY_OF_YEAR FROM DATE1) > 100], PushedTopN: ORDER BY [EXTRACT(DAY_OF_YEAR FROM DATE1) ASC NULLS FIRST] LIMIT 1,"
```

Fix an implementation bug.
The cause is that the V2 `Extract` expression does not implement the `toString` method when pushed down to the JDBC data source, so the plan text shows the object's default identity instead of the expected SQL form.
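A minimal sketch of the diagnosis (the real class is `org.apache.spark.sql.connector.expressions.Extract`; this stub only shows the missing override):

```
final class Extract(val field: String, val source: String) {
  // Without this override the plan text shows the JVM default identity,
  // e.g. "Extract@3b95fce9", so the expected
  // "EXTRACT(DAY_OF_YEAR FROM DATE1)" fragment is never found.
  override def toString: String = s"EXTRACT($field FROM $source)"
}

println(new Extract("DAY_OF_YEAR", "DATE1")) // EXTRACT(DAY_OF_YEAR FROM DATE1)
```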

'No'.
Bug fix.

New test case.

Closes apache#37469 from chenzhx/spark-master.

Authored-by: chenzhx <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

* code update

Signed-off-by: huaxingao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: huaxingao <[email protected]>
Co-authored-by: Jiaan Geng <[email protected]>
Co-authored-by: chenliang.lu <[email protected]>
Co-authored-by: biaobiao.sun <[email protected]>
hellozepp pushed a commit to hellozepp/spark that referenced this pull request Aug 10, 2023
…aging (Sort with expressions) (Kyligence#525)

* [SPARK-39784][SQL] Put Literal values on the right side of the data source filter after translating Catalyst Expression to data source filter

Even though the literal value could be on either side of the filter, e.g. both `a > 1` and `1 < a` are valid, after translating a Catalyst Expression to a data source filter we want the literal value on the right side, so it's easier for the data source to handle these filters. We do this kind of normalization for V1 Filter, and we should have the same behavior for V2 Filter.

Before this PR, filters that have the literal value on the left side, e.g. `1 > a`, are kept as is. After this PR, we normalize them to `a < 1` so the data source doesn't need to check each filter (and do the flip).

We should follow the V1 Filter behavior and normalize filters at Catalyst-Expression-to-DS-Filter translation time so that literal values end up on the right side; then the data source doesn't need to inspect every single filter to figure out whether it needs to flip the sides.
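A hedged sketch of the normalization (stub types; Spark's translator works on expression trees, not strings):

```
// Flip the comparison so the literal lands on the right:
// "1 > a" becomes "a < 1" during translation.
def flip(op: String): String = op match {
  case ">"   => "<"
  case "<"   => ">"
  case ">="  => "<="
  case "<="  => ">="
  case other => other // "=" and "<=>" are symmetric
}

def normalize(lit: String, op: String, attr: String): String =
  s"$attr ${flip(op)} $lit"

assert(normalize("1", ">", "a") == "a < 1")
```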

no

new test

Closes apache#37197 from huaxingao/flip.

Authored-by: huaxingao <[email protected]>
Signed-off-by: huaxingao <[email protected]>

zheniantoushipashi added a commit to Kyligence/spark that referenced this pull request Aug 21, 2023
…aging (Sort with expressions) (#525)

RolatZhang pushed a commit to Kyligence/spark that referenced this pull request Aug 29, 2023
…aging (Sort with expressions) (#525)
