Conversation

@aokolnychyi (Contributor) commented Jun 15, 2021

What changes were proposed in this pull request?

This PR implements the proposal from the design doc for SPARK-35779.

Why are the changes needed?

Spark supports dynamic partition filtering that enables reusing parts of the query to skip unnecessary partitions in the larger table during joins. This optimization has proven to be beneficial for star-schema queries which are common in the industry. Unfortunately, dynamic pruning is currently limited to partition pruning during joins and is only supported for built-in v1 sources. As more and more Spark users migrate to Data Source V2, it is important to generalize dynamic filtering and expose it to all v2 connectors.

Please, see the design doc for more information on this effort.

Does this PR introduce any user-facing change?

Yes, this PR adds a new optional mix-in interface for Scan in Data Source V2.
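
Roughly, the mix-in lets a scan declare which attributes it can be filtered on and then receive filters at runtime. Below is a minimal Scala sketch of the shape discussed in this PR (the trait name here is illustrative; in Spark the mix-in is the Java interface SupportsRuntimeFiltering):

    import org.apache.spark.sql.connector.expressions.NamedReference
    import org.apache.spark.sql.connector.read.Scan
    import org.apache.spark.sql.sources.Filter

    // Minimal sketch of the new optional mix-in for Scan (Scala rendering).
    trait RuntimeFilteringSketch extends Scan {
      // attributes this scan can be dynamically filtered on (e.g. partition columns)
      def filterAttributes(): Array[NamedReference]
      // invoked by Spark at runtime with filters derived from the other side of a join
      def filter(filters: Array[Filter]): Unit
    }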

How was this patch tested?

This PR comes with tests.

@dongjoon-hyun (Member)

Thank you for making a PR, @aokolnychyi !

@sunchao (Member) commented Jun 15, 2021

cc @wangyum - this may be related to the runtime filtering that you guys are working on.

Contributor Author

The design doc has alternative ways to represent dynamic filters. It would be great to get feedback on this.

Contributor

The DS v2 API prefers to use the native data classes in Spark, e.g. InternalRow, UTF8String, etc. However, we keep using the v1 Filter API, which uses external data classes. Shall we consider adding a v2 Filter API that uses v2 Expression and native data classes?
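
To illustrate the data-class distinction being pointed at (both classes below already exist in Spark; no new API is implied by this example):

    import org.apache.spark.sql.sources.EqualTo
    import org.apache.spark.unsafe.types.UTF8String

    // v1 Filter: the value is an external JVM data class (java.lang.String)
    val v1Filter = EqualTo("name", "Alice")

    // the engine-native representation of the same literal value
    val nativeValue = UTF8String.fromString("Alice")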

Contributor Author

I think that would be best, @cloud-fan. Has there been any discussion on what the new API should look like? Since the old API has been exposed in SupportsPushDownFilters, what is the plan for introducing the new API? Will we introduce a new method with a default implementation that translates to the old API?

Member

@dbtsai used to have a PR on this: dbtsai#10, but it hasn't been updated for a while.

Contributor Author

Thanks for the pointer, @sunchao. I think @rdblue's comment here matches my proposal above about adding v2 filters in parallel and having a default implementation that converts v2 to v1.

I can pick up the v2 filter API but I'd like to do that independently of this PR if everyone is on board.

Contributor Author

Is it alright with everybody to consider v1 filters in the scope of this PR? I'll take over @dbtsai's PR later.

Contributor

If we target this PR for 3.3, then I'm fine to use v1 Filter here and replace it with v2 Filter later, as there is plenty of time. Otherwise, I'd like to have v2 Filter first, to avoid releasing this API with v1 Filter and breaking it later.

Contributor

I think this should use v1 filters. The new filters don't exist yet, and the DSv2 API uses v1 in other places. There is no need to block this on adding v2 filters. If this ships in the same release as the v2 filters, we can consider removing the v1 support, but at this point I think we should not assume that will happen.

Contributor Author

Since dynamic filtering can provide substantial performance improvements for v2 tables, I'd love to get this feature into 3.2. As we already use v1 filters in other Data Source V2 interfaces, I feel it should be alright to use them here too.

As I noted above, we don't really have to break this API once we have v2 filters. We can follow whatever we decide to do with SupportsPushDownFilters: introduce a separate interface or just add a method to the existing interface with a default implementation that would convert v2 filters into v1 filters.

Does this seem reasonable, @cloud-fan?
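
A hypothetical sketch of the "default implementation" idea mentioned above (FilterV2 and the converters are placeholders for illustration, not real Spark classes; the v1 method mirrors SupportsPushDownFilters#pushFilters):

    import org.apache.spark.sql.sources.Filter

    // Placeholder v2 filter type and converters, purely for illustration.
    trait FilterV2
    object FilterV2 {
      def toV1(f: FilterV2): Filter = ???  // assumed v2 -> v1 conversion
      def toV2(f: Filter): FilterV2 = ???  // assumed v1 -> v2 conversion
    }

    trait PushDownFiltersMigrationSketch {
      // existing v1-based method
      def pushFilters(filters: Array[Filter]): Array[Filter]
      // hypothetical new method; the default body converts to v1, delegates, converts back
      def pushFiltersV2(filters: Array[FilterV2]): Array[FilterV2] =
        pushFilters(filters.map(FilterV2.toV1)).map(FilterV2.toV2)
    }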

@viirya (Member), Jun 29, 2021

If we target this PR for 3.3, then I'm fine to use v1 Filter here and replace it with v2 Filter later, as there is plenty of time. Otherwise, I'd like to have v2 Filter first, to avoid releasing this API with v1 Filter and breaking it later.

Do we plan to remove the v1 Filter soon? Otherwise, we can still keep v1 Filter support in this API even if we decide to add a v2 Filter later. So it seems we wouldn't actually break it quickly (at least not in the next release).

So basically it seems we don't need to block this on that?

Contributor Author

I had to implement stats as tests rely on them.

@aokolnychyi (Contributor Author)

cc @huaxingao @dongjoon-hyun @sunchao @cloud-fan @maryannxue @viirya @rdblue @HyukjinKwon

It would be great to hear your feedback on this WIP PR. Some tests are expected to fail as we don't support ANALYZE statements for v2 tables yet.

@SparkQA

SparkQA commented Jun 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44357/

@SparkQA

SparkQA commented Jun 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44357/

@viirya (Member) commented Jun 15, 2021

Thanks @aokolnychyi for the PR!

@SparkQA

SparkQA commented Jun 16, 2021

Test build #139828 has finished for PR 32921 at commit 2a777fa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aokolnychyi force-pushed the dynamic-filtering-wip branch from 04ae0e3 to 202be14 on June 16, 2021 03:17
@SparkQA

SparkQA commented Jun 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44363/

@SparkQA

SparkQA commented Jun 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44363/

@SparkQA

SparkQA commented Jun 16, 2021

Test build #139835 has finished for PR 32921 at commit 202be14.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 16, 2021

Test build #139834 has finished for PR 32921 at commit 04ae0e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao (Member), Jun 16, 2021

Do we need to mention the related configs for this to kick in, e.g., spark.sql.optimizer.dynamicPartitionPruning.enabled and spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly? In general, I think we should link this to DPP since it can be used for that purpose in the future.

Does the V2 MergeInto use case allow this to be optional and controlled via existing DPP flags?
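
For reference, setting the existing DPP configs named above looks like the following (assuming a SparkSession named spark; whether and how the v2 runtime filtering path honors them was part of the open question here):

    // Existing Spark SQL configs referenced in the comment above.
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    // Only derive DPP filters from broadcast exchanges that can be reused.
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "true")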

Contributor

I think we should just say that Spark will push runtime filters if they are beneficial. No need to mention too many details; e.g., in the future Spark may have a more advanced cost model to decide whether to push down runtime filters.

Contributor Author

Agreed. Will add the details.

Contributor

Yeah, I agree with Wenchen. It is better to just state that Spark may use this to further refine the filters.

Member

What if some attributes in scan.filterAttributes cannot be resolved? Should we skip and continue?
Also, you may want to use DataSourceV2ScanRelation.relation, since the output could be pruned by column pruning. It seems we currently run PartitionPruning after V2ScanRelationPushDown.
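
A rough sketch of the resolution step under discussion (illustrative names and shape only, not the actual PR code; whether to skip or fail on unresolved attributes is exactly what the following comments settle):

    import org.apache.spark.sql.catalyst.analysis.Resolver
    import org.apache.spark.sql.catalyst.expressions.Attribute
    import org.apache.spark.sql.connector.read.SupportsRuntimeFiltering

    // Resolve the connector's filter attributes against the scan output; a None result
    // means at least one attribute could not be resolved.
    def resolveFilterAttrs(
        scan: SupportsRuntimeFiltering,
        output: Seq[Attribute],
        resolver: Resolver): Option[Seq[Attribute]] = {
      val resolved = scan.filterAttributes().toSeq.map { ref =>
        val name = ref.fieldNames.mkString(".")
        output.find(attr => resolver(attr.name, name))
      }
      if (resolved.forall(_.isDefined)) Some(resolved.flatten) else None
    }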

Contributor Author

+1 on skipping rather than failing if we cannot resolve the filter attrs.

W.r.t. which columns to use, I did use the scan output to resolve on purpose as we cannot derive a filter on an attribute that hasn't been projected.

Contributor Author

@sunchao, it looks like we always fail if partition columns cannot be resolved in the branch above this one. While it does seem safer not to fail the query if a filter attribute cannot be resolved, shall we be consistent with v1 tables? What do you think?

Member

Yeah, SGTM.

Member

nit: we might need to update the method name and doc because they are no longer accurate.

Member

This means the data source will need to plan input partitions twice; I'm not sure whether that could be expensive. Another idea is for Spark to provide the input partitions and ask the data source to filter on top of them, like:

InputPartition[] filter(Filter[] filters, InputPartition[] partitions);

but I guess this may be too restrictive, so feel free to ignore.

Also, it may be worth checking whether the original input partitions are used for making decisions before they are updated to filteredPartitions (and whether that would change the original decision). I can see that supportsColumnar uses them, but I'm not sure if there are other places.

Member

Hmm, I guess if a scan implementation supports this filtering, it can cache the original input partitions so the second planning is cheaper?

Contributor Author

@viirya is correct. Usually, there is no second planning. Instead, the existing input partitions that have already been planned are filtered using the dynamic filters.
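
A minimal sketch of that caching pattern, assuming the filterAttributes()/filter() mix-in added by this PR (the scan class, the partition column name, and the planFiles/matchesFilters helpers are all illustrative):

    import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
    import org.apache.spark.sql.connector.read.{Batch, InputPartition, Scan, SupportsRuntimeFiltering}
    import org.apache.spark.sql.sources.Filter

    abstract class MyScan extends Scan with SupportsRuntimeFiltering with Batch {
      // plan once and cache, so runtime filtering does not trigger a second full planning
      private lazy val plannedPartitions: Array[InputPartition] = planFiles()
      private var runtimeFilters: Array[Filter] = Array.empty

      override def filterAttributes(): Array[NamedReference] =
        Array(Expressions.column("part_col"))   // assumed partition column name

      override def filter(filters: Array[Filter]): Unit = {
        runtimeFilters = filters                // remember the filters; no re-planning here
      }

      override def planInputPartitions(): Array[InputPartition] =
        plannedPartitions.filter(p => matchesFilters(p, runtimeFilters))

      // connector-specific helpers (placeholders)
      protected def planFiles(): Array[InputPartition]
      protected def matchesFilters(p: InputPartition, filters: Array[Filter]): Boolean
    }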

Member

I see. In that case maybe it's useful to make this a bit clearer for data source implementors (I'm not sure there's enough signal for them that planInputPartitions will be called twice).

Contributor Author

Added a note on caching state to Scan.

Member

For the TODO items, could you file a JIRA issue and make an ID'd TODO like TODO(SPARK-XXX)? Otherwise, it's difficult for other contributors to pick them up.

Contributor Author

Will do. I am not sure about this point so we will need to discuss it a little bit.

Contributor Author

I removed the TODO for now. It is not a blocker; we can reconsider it separately.

Contributor Author

Created SPARK-35900 to think about in the future.

Member

Do you think we can spin off this test-case expansion into a separate contribution? It looks like we could merge it independently first.

Contributor

nit: runtime filter is a better name, I think, as the filter is generated after the query compilation phase, at query runtime.

Member

+1 for the naming.

@aokolnychyi (Contributor Author), Jun 18, 2021

I did use dynamicXXX as it is common throughout the code but SupportsRuntimeFiltering does sound more accurate. I'll update.

Contributor Author

Fixed.

Member

I think the proposal is not limited to only partition pruning?

Contributor Author

Well, we do filter input partitions, but the filtering can be done using a metadata column (e.g., file name).
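
Continuing the illustrative MyScan sketch from earlier in this thread, a connector could report a metadata column instead of a partition column ("_file" is an assumed metadata column name, not something a given connector necessarily exposes):

    // Illustrative: a scan keyed on a file-name metadata column instead of a partition column.
    abstract class MyFileScan extends MyScan {
      override def filterAttributes(): Array[NamedReference] =
        Array(Expressions.column("_file"))
    }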

Contributor

@viirya, InputPartition is referring to Spark's partition, not a storage partition. These are actually tasks.

This is another reason why Wenchen's suggestion is a good one. No need to mention what gets filtered or imply that you should produce InputPartition instances and then filter those. This only needs to state that additional filters may be added through this interface.

Comment on lines 47 to 48
Member

I recall we cannot change partitions arbitrarily in the streaming case due to stateful tasks. So I'm wondering whether dynamic filtering applies to streaming scans?

Contributor Author

I don't think it works for streaming plans right now. Shall I just refer to toBatch in the doc?

Member

I think we should only document the supported case. Otherwise it might mislead developers.

Contributor Author

Agreed. I'll update.

Contributor Author

Fixed.

Member

nit: @param filters the data source filter expressions used to dynamically filter the scan

Contributor Author

Fixed.

@aokolnychyi (Contributor Author)

Thanks for the initial review, @sunchao @dongjoon-hyun @cloud-fan @viirya! There are a couple of points like here and here that I'd like to discuss before updating the PR.

@cloud-fan (Contributor)

Right now, we don't have a dedicated phase for executing DPP subqueries. They are treated like normal subqueries and are executed right before we execute the main query.

Let's think about non-AQE first. We need to run EnsureRequirements after DPP in case the output partitioning changes. And we need to execute the DPP subqueries first. Before that, we need to optimize the main query and apply exchange/subquery reuse first.

That said, I think we should execute DPP subqueries after the query plan is fully optimized and ready to execute. For safety I think we should run the rule that triggers DPP subquery execution and apply DS v2 pushdown after all the existing physical rules are run. i.e.

      CoalesceBucketsInJoin,
      PlanDynamicPruningFilters(sparkSession),
      PlanSubqueries(sparkSession),
      ...
      ReuseExchangeAndSubquery,
      // Above are existing physical rules
      PushRuntimeFiltersToDataSource,
      EnsureRequirements // This is the second run

AQE would be more complicated as the fully optimized query plan is only available at the query stage optimization phase, where it's not allowed to change stage boundaries anymore.

I agree that it's better to allow the v2 source to change its output partitioning after runtime filter pushdown, but I'm not quite sure we should allow it if it introduces extra shuffles. The cost of extra shuffles can be large.

I think we can simplify the design if we don't allow runtime filter pushdown to introduce extra shuffles. Spark can give v2 source both the runtime filter and the required distribution, so that the v2 source can handle it properly and change output partitioning as long as it can still satisfy the required distribution.


  override lazy val inputRDD: RDD[InternalRow] = {
-   new DataSourceRDD(sparkContext, partitions, readerFactory, supportsColumnar, customMetrics)
+   if (filteredPartitions.isEmpty && outputPartitioning == SinglePartition) {
Member

Is this possible if we already check that the number of partitions in originalPartitioning must match the new partition number?

@aokolnychyi (Contributor Author), Jul 1, 2021

We check that the numbers of partitions before and after filtering match only if the source reported a specific partitioning through SupportsReportPartitioning; only in that case do we have DataSourcePartitioning. This situation, on the other hand, can happen if we inferred SinglePartition but the source did not report anything.

@viirya (Member) left a comment

Looks okay. Is there any concern or blocking comment for this?

@viirya (Member) commented Jul 1, 2021

This WIP PR has a working prototype for SPARK-35779 per design doc.

This is no longer a WIP PR. @aokolnychyi, could you update the description? Thanks.

@aokolnychyi (Contributor Author)

@viirya, I missed updating the PR description when I updated the title. Done.

@viirya (Member) commented Jul 1, 2021

Thanks @aokolnychyi!

I am not sure if we can still merge this after the branch cut. If not, maybe we can get this in first, if there are no major comments/concerns, and continue to address minor comments later before the release?

Any idea? @dongjoon-hyun @sunchao @cloud-fan @dbtsai @holdenk @rdblue?

@viirya (Member) commented Jul 1, 2021

also cc @gengliangwang

@sunchao (Member) commented Jul 1, 2021

This PR looks good to me now. I'm also curious whether this can be merged after the branch cut. It'd also be great if @cloud-fan could take one more look.

@dbtsai (Member) commented Jul 1, 2021

+1 to merge it as it is now if there is no major issue; we can work on follow-ups later to reduce the scope.

  protected[sql] def translateRuntimeFilter(expr: Expression): Option[Filter] = expr match {
    case in @ InSubqueryExec(e @ PushableColumnAndNestedColumn(name), _, _, _) =>
      val values = in.values().getOrElse {
        throw new AnalysisException(s"Can't translate $in to source filter, no subquery result")
      }
      Some(sources.In(name, values))
    case _ => None // remaining cases elided in this diff hunk
  }
Contributor

Shall we throw IllegalStateException? It seems only a bug can lead to this branch.

Contributor

We will only translate the runtime filter when we execute the physical plan, and at that time all subqueries must have been evaluated.

Contributor Author

Makes sense, I'll switch.

Contributor Author

Fixed.

@cloud-fan (Contributor) left a comment

LGTM except for two small comments. The new solution is much simpler!

@viirya (Member) commented Jul 1, 2021

Thank you, @cloud-fan!

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45048/

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45048/

@viirya (Member) commented Jul 1, 2021

In GA, all tests passed. Only "Hadoop 2 build with SBT" failed, which seems unrelated:

[error] /home/runner/work/spark/spark/sql/catalyst/src/test/scala-2.12/org/apache/spark/sql/catalyst/analysis/ExtractGeneratorSuite.scala:29:64: exception during macro expansion: 
[error] java.util.MissingResourceException: Can't find bundle for base name org.scalactic.ScalacticBundle, locale en
[error] 	at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1581)

@aokolnychyi and I ran "Hadoop 2 build with SBT" locally and it worked, so it seems to be a flaky issue only on GA. I don't want to block the branch cut for too long, so I'm going to merge this now. If we see any errors in Jenkins later, we can address them quickly.

@viirya (Member) commented Jul 1, 2021

Thanks @aokolnychyi for this work and all for the review! Merging to master!

@viirya closed this in fceabe2 on Jul 2, 2021
@aokolnychyi (Contributor Author)

Thanks for reviewing, @viirya @cloud-fan @sunchao @rdblue @dongjoon-hyun @holdenk!

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140535 has finished for PR 32921 at commit 881d2b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GithubZhitao commented Oct 15, 2021

Nice feature for users! But I have a small doubt about it.
Does Data Source V1 support dynamic filtering?
I tested the mongo-spark connector (https://www.mongodb.com/products/spark-connector), which may be a DataSource V1 example, and it failed to trigger dynamic filtering. After debugging, I found that it defines its own MongoRelation and therefore does not match the HadoopFsRelation case.
The master code renames getPartitionTableScan to getFilterableTableScan and adds two cases: case (resExp, l: HiveTableRelation) and case (resExp, r @ DataSourceV2ScanRelation(_, scan: SupportsRuntimeFiltering, _)).
As far as I can tell from the code, dynamic filtering can only be triggered when the logical plan contains a HadoopFsRelation, HiveTableRelation, or DataSourceV2ScanRelation.

cloud-fan pushed a commit that referenced this pull request Jul 28, 2022
### What changes were proposed in this pull request?
Use V2 Filter in run time filtering for V2 Table

### Why are the changes needed?
We should use V2 Filter in DS V2.
#32921 (comment)

### Does this PR introduce _any_ user-facing change?
Yes
new interface `SupportsRuntimeV2Filtering`

### How was this patch tested?
new test suite

Closes #36918 from huaxingao/v2filtering.

Authored-by: huaxingao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@LorenzoMartini (Contributor)

Hey @aokolnychyi

We are trying to use Spark Data Source V2 and noticed that Spark's built-in v2 data sources (e.g. Parquet, looking at ParquetScan) don't support this (neither SupportsRuntimeFiltering nor SupportsRuntimeV2Filtering) by default, creating a large performance difference between using the v1 and v2 data sources out of the box.

Is there a plan to have them support this? It would be really beneficial for the file scans to be able to do this, and given that they already benefit from some pushdowns, we were wondering why runtime filtering is not implemented. Or maybe I am missing something? In that case it would be great to understand how to have Spark file sources take advantage of DPP.

Thanks!

@aokolnychyi (Contributor Author)

Hi, @LorenzoMartini! I am not sure how much the SupportsRuntimeFiltering API will help for built-in sources because Spark treats them in a special way. For instance, PushDownUtils$pushFilters has a special branch that pushes Catalyst expressions directly instead of going via the public filter API. Based on that, I'd imagine having special behavior for built-in scans. The APIs added in this PR are meant for external connectors that always rely on the public connector API.

@LorenzoMartini (Contributor)

Hey @aokolnychyi, thank you for the answer. I see that those sources have special optimizations. However, we do have instances of data transformations being dramatically slower with Spark's v2 data sources, and the only difference in the query plans compared to the same transformations run with v1 data sources is the absence of the dynamic pruning expressions. Do you have any suggestions on how to improve those use cases, if not by implementing SupportsRuntimeFiltering for the Spark sources? Thanks
