[SPARK-31365][SQL] Enable nested predicate pushdown per data sources #28366

viirya · 2020-04-27T08:26:07Z

What changes were proposed in this pull request?

This patch proposes to replace NESTED_PREDICATE_PUSHDOWN_ENABLED with NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST, which configures the v1 data sources for which nested predicate pushdown is enabled.

Why are the changes needed?

We added a nested predicate pushdown feature that is configured by NESTED_PREDICATE_PUSHDOWN_ENABLED. However, this config is all-or-nothing and applies to all data sources.

To avoid introducing a breaking API change after enabling nested predicate pushdown, we'd like to enable nested predicate pushdown per data source. Please also refer to the comments #27728 (comment).
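To illustrate the difference, here is a minimal, Spark-free sketch of the two configuration styles. The names `PushdownConf`, `GlobalFlag`, `PerSourceList`, and `allows` are hypothetical and only model the behaviour described above; they are not Spark APIs.

```scala
import java.util.Locale

// Simplified model, not Spark code: contrasts an all-or-nothing flag with a
// per-source allow list. All names here are hypothetical.
sealed trait PushdownConf {
  def allows(source: String): Boolean
}

// Old style: one boolean applies to every data source.
final case class GlobalFlag(enabled: Boolean) extends PushdownConf {
  def allows(source: String): Boolean = enabled
}

// New style: only the sources named in the list get nested predicate pushdown.
final case class PerSourceList(sources: Set[String]) extends PushdownConf {
  def allows(source: String): Boolean =
    sources.contains(source.toLowerCase(Locale.ROOT))
}

object PushdownConfDemo {
  val oldConf: PushdownConf = GlobalFlag(enabled = true)
  val newConf: PushdownConf = PerSourceList(Set("parquet", "orc"))
}
```

With the global flag, a custom source that has never seen nested column names in its filters would receive them as soon as the flag is on; with the list, it only does so if it is named explicitly.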

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added/Modified unit tests.

SparkQA · 2020-04-27T13:46:31Z

Test build #121891 has finished for PR 28366 at commit 6feaaa4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class AbortableRpcFuture[T: ClassTag](val future: Future[T], onAbort: Throwable => Unit)
class AvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType, rebaseDateTime: Boolean)
class AvroSerializer(
class ExecutorResourceRequest(object):
class ExecutorResourceRequests(object):
class ResourceProfile(object):
class ResourceProfileBuilder(object):
class TaskResourceRequest(object):
class TaskResourceRequests(object):
abstract class CurrentTimestampLike() extends LeafExpression with CodegenFallback
case class CurrentTimestamp() extends CurrentTimestampLike
case class Now() extends CurrentTimestampLike
case class YearOfWeek(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
case class DateAddInterval(
class CacheManager extends Logging with AdaptiveSparkPlanHelper
case class AdaptiveExecutionContext(session: SparkSession, qe: QueryExecution)
class ParquetReadSupport(

SparkQA · 2020-04-28T14:14:21Z

Test build #121981 has finished for PR 28366 at commit e555a1c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-29T13:31:17Z

Test build #122051 has finished for PR 28366 at commit 84bc8dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-04-29T17:07:40Z

cc @cloud-fan @dbtsai @dongjoon-hyun @HyukjinKwon @gatorsmile @maropu

maropu · 2020-04-29T23:52:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .version("3.0.0")
-      .booleanConf
-      .createWithDefault(true)
+      .stringConf


Do we need .transform(_.toUpperCase(Locale.ROOT))? Also, could we validate the input with checkValues? By the way, is this feature expected to cover custom data sources in addition to the prebuilt ones (parquet, orc, ...)?

We compare this list using toLowerCase when we need it, so it seems fine to leave it as is here. A similar example is spark.sql.sources.useV1SourceList. As with useV1SourceList, checkValues does not seem to be needed.

Currently I think it is safer to assume custom data sources don't support this feature. I actually also think that if a custom data source wants to support it, it is better to adopt data source v2.

We don't have a common API for v1 data sources that tells whether a source supports nested predicate pushdown. If we really want to allow custom v1 data sources to have that, we can consider adding one common v1 API for the purpose. But, again, it seems to me that we should encourage adopting v2 instead of adding new things to v1.

> Currently I think it is safer to assume custom data sources don't support this feature.

Yea, +1 on your thought.

> Currently I think it is safer to assume custom data sources don't support this feature.

Looks fine. @dbtsai are you good with it? Do you have use cases that need nested predicate pushdown for non-file-source?

In v1, we don't have any use-case for supporting it in custom data source. I'm good with it.
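The outcome of this thread (only listed file sources are treated as supporting nested predicate pushdown, and any other v1 source is assumed not to) can be sketched as below. This is a simplified model under stated assumptions: `Relation`, `FileRelation`, `CustomV1Relation`, and `NestedPushdownSupport` are hypothetical stand-ins, not Spark's relation classes or the PR's `DataSourceUtils.supportNestedPredicatePushdown`.

```scala
import java.util.Locale

// Simplified stand-ins for Spark's relation types; hypothetical names.
sealed trait Relation
final case class FileRelation(format: String) extends Relation // e.g. parquet, orc
final case class CustomV1Relation(className: String) extends Relation

object NestedPushdownSupport {
  // Only file-based relations whose format is in the enabled list are treated
  // as supporting nested predicate pushdown; any other v1 relation is assumed
  // not to support it.
  def supports(relation: Relation, enabledSources: Set[String]): Boolean =
    relation match {
      case FileRelation(format) =>
        enabledSources.contains(format.toLowerCase(Locale.ROOT))
      case _ => false
    }
}
```

Defaulting the non-file case to `false` is the conservative choice discussed above: a custom v1 source never receives nested column names in its filters unless a future API lets it opt in.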

maropu · 2020-04-29T23:53:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

-        "while ORC only supports predicates for names containing `dots`. The other data sources" +
-        "don't support this feature yet.")
+  val NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST =
+    buildConf("spark.sql.optimizer.nestedPredicatePushdown.v1sourceList")


nit: How about v1sourceList -> supportedV1Sources?

maropu · 2020-04-29T23:55:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+      .internal()
+      .doc("A comma-separated list of data source short names or fully qualified data source " +
+        "implementation class names for which Spark tries to push down predicates for nested " +
+        "columns and or names containing `dots` to data sources. Currently, Parquet implements " +


nit: and or -> and?

Actually I copied the wording. I think it means and/or. I will modify it.

maropu · 2020-04-29T23:57:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "implementation class names for which Spark tries to push down predicates for nested " +
+        "columns and or names containing `dots` to data sources. Currently, Parquet implements " +
+        "both optimizations while ORC only supports predicates for names containing `dots`. The " +
+        "other data sources don't support this feature yet.")


How about listing the valid set of sources, e.g. "The value can be 'parquet', 'orc', .... The default value is 'parquet,orc'."?

maropu · 2020-04-30T00:04:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala

+        val supportedDatasources =
+          SQLConf.get.getConf(SQLConf.NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST)
+            .toLowerCase(Locale.ROOT)
+            .split(",").map(_.trim)


Could we use Utils.stringToSeq?
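The suggestion is to replace the hand-written split/trim chain with a shared helper. The sketch below defines a local `stringToSeq` as a stand-in, assuming the helper splits on commas, trims each entry, and drops empty ones; check Spark's `Utils.stringToSeq` for its actual behaviour. `SourceListParsing` and `parseSourceList` are hypothetical names.

```scala
import java.util.Locale

object SourceListParsing {
  // Local stand-in for a shared helper (assumption: split on commas, trim
  // each entry, drop empty entries).
  def stringToSeq(str: String): Seq[String] =
    str.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  // Lower-case once at read time so the list can be written in any casing.
  def parseSourceList(confValue: String): Seq[String] =
    stringToSeq(confValue.toLowerCase(Locale.ROOT))
}
```

Compared with a bare `split(",").map(_.trim)`, filtering out empty entries means an empty config value yields an empty list rather than a list containing one empty string.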

maropu · 2020-04-30T00:14:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala

   *                               translated [[Filter]]. The map is used for rebuilding
   *                               [[Expression]] from [[Filter]].
+   * @param nestedPredicatePushdownEnabled Whether nested predicate pushdown is enabled. Default is
+   *                                       disabled.


What does "Default is disabled" mean? Should we add a default value to the argument, like nestedPredicatePushdownEnabled: Boolean = false?

Oh, I forgot to change it. I had added a default value but removed it later.

maropu · 2020-04-30T00:28:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala

+          PushableColumnAndNestedColumn
+        } else {
+          PushableColumnWithoutNestedColumn
+        }


How about moving this check to the PushableColumn object?

    object PushableColumn {
      def apply(nestedPredicatePushdownEnabled: Boolean) = {
        if (nestedPredicatePushdownEnabled) {
          PushableColumnAndNestedColumn
        } else {
          PushableColumnWithoutNestedColumn
        }
      }
    }
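As a self-contained illustration of the factory suggested here, the sketch below models the two extractors over a tiny expression tree. `Expr`, `Attr`, `GetField`, and the body of `PushableColumnBase.unapply` are hypothetical simplifications rather than Catalyst classes, and the sketch does not handle quoting of column names that themselves contain dots.

```scala
// Simplified expression tree; hypothetical stand-ins for Catalyst expressions.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class GetField(child: Expr, field: String) extends Expr

// Extractor that returns the column name when an expression can be pushed down.
abstract class PushableColumnBase(nestedEnabled: Boolean) {
  def unapply(e: Expr): Option[String] = e match {
    case Attr(name) => Some(name)
    case GetField(child, field) if nestedEnabled =>
      unapply(child).map(parent => s"$parent.$field")
    case _ => None
  }
}

object PushableColumnAndNestedColumn extends PushableColumnBase(nestedEnabled = true)
object PushableColumnWithoutNestedColumn extends PushableColumnBase(nestedEnabled = false)

// Factory as suggested in the review: callers pick the extractor with one call.
object PushableColumn {
  def apply(nestedPredicatePushdownEnabled: Boolean): PushableColumnBase =
    if (nestedPredicatePushdownEnabled) PushableColumnAndNestedColumn
    else PushableColumnWithoutNestedColumn
}
```

A caller can bind the result to a `val`, for example `val pushable = PushableColumn(enabled)`, and then use it as a pattern in a match (`case pushable(name) => ...`), which keeps the if/else out of every call site.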

maropu · 2020-04-30T00:29:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala

+        DataSourceUtils.supportNestedPredicatePushdown(fsRelation)
+      val pushedFilters = dataFilters
+        .flatMap(DataSourceStrategy.translateFilter(_, supportNestedPredicatePushdown))
+      logInfo(s"Pushed Filters: " + s"${pushedFilters.mkString(",")}")


nit: logInfo(s"Pushed Filters: ${pushedFilters.mkString(",")}")

cloud-fan · 2020-05-01T07:02:23Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala

        case v1 if v1.supports(TableCapability.V1_BATCH_WRITE) =>
-          OverwriteByExpressionExecV1(v1, filters, writeOptions.asOptions, query) :: Nil
+          OverwriteByExpressionExecV1(
+            v1, transferFilters(filters, false), writeOptions.asOptions, query) :: Nil


This is the v1 fallback API, which is new in DS v2. I think we can always support nested filter pushdown.

ok. got it. thanks.

SparkQA · 2020-05-01T11:00:08Z

Test build #122162 has finished for PR 28366 at commit a49b73c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-05-01T21:05:23Z

Test build #122175 has finished for PR 28366 at commit 17d1094.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

No more comments for now; looks okay to me.

dbtsai · 2020-05-04T22:36:49Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala

+        DataSourceUtils.supportNestedPredicatePushdown(fsRelation)
+      val pushedFilters = dataFilters
+        .flatMap(DataSourceStrategy.translateFilter(_, supportNestedPredicatePushdown))
+      logInfo(s"Pushed Filters: ${pushedFilters.mkString(",")}")


Is it possible to have it propagated back so that when a user runs explain(true), the filters that are pushed down are shown?

These pushed-down filters are already shown in FileSourceScanExec.

dbtsai · 2020-05-04T22:48:04Z

LGTM. Thanks.

cloud-fan · 2020-05-05T05:52:14Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala

-      }.toArray
+      val filters = splitConjunctivePredicates(deleteExpr)
+      def transferFilters =
+        (filters: Seq[Expression], supportNestedPredicatePushdown: Boolean) => {


Do we need the supportNestedPredicatePushdown parameter here, given that the caller side always passes true?

cloud-fan · 2020-05-05T05:53:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

-        "while ORC only supports predicates for names containing `dots`. The other data sources" +
-        "don't support this feature yet.")
+  val NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST =
+    buildConf("spark.sql.optimizer.nestedPredicatePushdown.supportedV1Sources")


supportedV1Sources -> supportedFileSources?

DS v1 and file source are different APIs and have different planner rules/physical nodes.

SparkQA · 2020-05-05T07:05:02Z

Test build #122305 has finished for PR 28366 at commit 00b9d47.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-05-05T07:05:02Z

Test build #122304 has finished for PR 28366 at commit aa32dcc.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2020-05-05T07:19:25Z

retest this please

SparkQA · 2020-05-05T12:10:41Z

Test build #122310 has finished for PR 28366 at commit 00b9d47.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-05-06T04:50:04Z

thanks, merging to master/3.0!

### What changes were proposed in this pull request?

This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.

### Why are the changes needed?

We added nested predicate pushdown feature that is configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is all or nothing config, and applies on all data sources.

In order to not introduce API breaking change after enabling nested predicate pushdown, we'd like to set nested predicate pushdown per data sources. Please also refer to the comments #27728 (comment).

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added/Modified unit tests.

Closes #28366 from viirya/SPARK-31365.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 4952f1a)
Signed-off-by: Wenchen Fan <[email protected]>

HyukjinKwon

LGTM too, one comment. Thanks for working on this @viirya.

HyukjinKwon · 2020-05-06T08:04:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "implementation class names for which Spark tries to push down predicates for nested " +
+        "columns and/or names containing `dots` to data sources. Currently, Parquet implements " +
+        "both optimizations while ORC only supports predicates for names containing `dots`. The " +
+        "other data sources don't support this feature yet. So the default value is 'parquet,orc'.")


Seems we decided to only make this configuration effective against DSv1, which seems okay because only DSv1 will have compatibility issues.

But shall we at least explicitly mention that this configuration is only effective with DSv1 (or make this configuration effective against DSv2)? It seems like it's going to be confusing to both end users and developers.

I think the DSv2 API assumes nested column capabilities such as pushdown and pruning, so we only need to deal with DSv1 compatibility issues here. More precisely, file sources.

I will create a simple followup to refine the doc of this configuration for this point. Thanks.

…ate pushdown

### What changes were proposed in this pull request?

This is a followup to address the #28366 (comment) by refining the SQL config document.

### Why are the changes needed?

Make developers less confusing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Only doc change.

Closes #28468 from viirya/SPARK-31365-followup.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 9bf7387)
Signed-off-by: Takeshi Yamamuro <[email protected]>

viirya added 2 commits April 27, 2020 01:07

Add NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST.

9659699

Merge remote-tracking branch 'upstream/master' into SPARK-31365

6feaaa4

probot-autolabeler bot added the SQL label Apr 27, 2020

Add one test.

e555a1c

Add test.

84bc8dd

viirya changed the title ~~[WIP][SPARK-31365][SQL] Enable nested predicate pushdown per data sources~~ [SPARK-31365][SQL] Enable nested predicate pushdown per data sources Apr 29, 2020

maropu reviewed Apr 30, 2020

Address comments.

a49b73c

cloud-fan reviewed May 1, 2020

v1 fallback API can support nested predicate pushdown.

17d1094

maropu approved these changes May 1, 2020

dbtsai reviewed May 4, 2020

cloud-fan reviewed May 5, 2020

cloud-fan approved these changes May 5, 2020

viirya added 2 commits May 4, 2020 22:57

Address comments.

aa32dcc

Restore previous style.

00b9d47

viirya force-pushed the SPARK-31365 branch from 8fd933c to 00b9d47 Compare May 5, 2020 06:13

cloud-fan closed this in 4952f1a May 6, 2020

HyukjinKwon reviewed May 6, 2020

viirya mentioned this pull request May 6, 2020

[SPARK-31365][SQL][FOLLOWUP] Refine config document for nested predicate pushdown #28468

Closed

Tagar mentioned this pull request Jun 14, 2020

Does array search support push down? elastic/elasticsearch-hadoop#1076

Closed

tedyu mentioned this pull request Jan 10, 2021

[SPARK-33915][SQL] Allow json expression to be pushable column #30984

Closed

viirya deleted the SPARK-31365 branch December 27, 2023 18:23
