Contributor

@WangGuangxin WangGuangxin commented Apr 23, 2022

What changes were proposed in this pull request?

Push down StringEndsWith and StringContains filters to Parquet so that we can leverage Parquet dictionary filtering.
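
As a rough illustration of why dictionary filtering helps here, the sketch below models a row group by the set of distinct values in its dictionary and skips the row group when no dictionary entry satisfies the predicate. `RowGroup`, `canSkip`, and `rowGroupsToRead` are hypothetical names for this sketch only; they are not Parquet or Spark APIs, and real dictionary filtering applies only under further conditions (for example, the column chunk must be dictionary encoded).

```scala
// Hypothetical model of a row group: its id plus the distinct values in its dictionary page.
final case class RowGroup(id: Int, dictionary: Set[String])

object DictionaryFilterSketch {
  // A row group can be skipped only when no dictionary entry satisfies the predicate.
  def canSkip(rowGroup: RowGroup, keep: String => Boolean): Boolean =
    !rowGroup.dictionary.exists(keep)

  // Ids of the row groups that still have to be read.
  def rowGroupsToRead(groups: Seq[RowGroup], keep: String => Boolean): Seq[Int] =
    groups.filterNot(canSkip(_, keep)).map(_.id)

  def main(args: Array[String]): Unit = {
    val groups = Seq(
      RowGroup(0, Set("110", "210", "310")),
      RowGroup(1, Set("121", "222", "323")))
    println(rowGroupsToRead(groups, _.endsWith("10"))) // only row group 0 qualifies
    println(rowGroupsToRead(groups, _.contains("22"))) // only row group 1 qualifies
  }
}
```

The predicate runs once per distinct dictionary value rather than once per row, which is where the saving comes from when values repeat.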

Why are the changes needed?

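Improve performance. In the benchmark results below, Parquet vectorized scans with pushdown run about 7x to 18x faster than the same scans without it (7.4X to 17.6X in the Relative column).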

FilterPushdownBenchmark:

================================================================================================
Pushdown benchmark for StringEndsWith
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
StringEndsWith filter: (value like '%10'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  7666           7771         117          2.1         487.4       1.0X
Parquet Vectorized (Pushdown)                        540            554          18         29.1          34.3      14.2X
Native ORC Vectorized                               8206           8417         203          1.9         521.7       0.9X
Native ORC Vectorized (Pushdown)                    8120           8674         422          1.9         516.2       0.9X

OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
StringEndsWith filter: (value like '%1000'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                    7007           7122         224          2.2         445.5       1.0X
Parquet Vectorized (Pushdown)                          423            485          92         37.2          26.9      16.6X
Native ORC Vectorized                                 7368           7629         373          2.1         468.5       1.0X
Native ORC Vectorized (Pushdown)                      7998           8349         270          2.0         508.5       0.9X

OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
StringEndsWith filter: (value like '%786432'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                      7012           7210         238          2.2         445.8       1.0X
Parquet Vectorized (Pushdown)                            419            431          14         37.6          26.6      16.7X
Native ORC Vectorized                                   7513           7995         447          2.1         477.6       0.9X
Native ORC Vectorized (Pushdown)                        8310           8811         448          1.9         528.3       0.8X


================================================================================================
Pushdown benchmark for StringContains
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
StringContains filter: (value like '%10%'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                   7588           8125         328          2.1         482.4       1.0X
Parquet Vectorized (Pushdown)                        1029           1068          25         15.3          65.4       7.4X
Native ORC Vectorized                                7803           7859          92          2.0         496.1       1.0X
Native ORC Vectorized (Pushdown)                     8944           9443         459          1.8         568.6       0.8X

OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
StringContains filter: (value like '%1000%'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                     7476           8343         710          2.1         475.3       1.0X
Parquet Vectorized (Pushdown)                           424            427           2         37.1          27.0      17.6X
Native ORC Vectorized                                  7503           8261         818          2.1         477.0       1.0X
Native ORC Vectorized (Pushdown)                       8124           8609         548          1.9         516.5       0.9X

OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Mac OS X 10.16
Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
StringContains filter: (value like '%786432%'):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                       7070           7274         199          2.2         449.5       1.0X
Parquet Vectorized (Pushdown)                             441            478          32         35.6          28.1      16.0X
Native ORC Vectorized                                    7564           7937         323          2.1         480.9       0.9X
Native ORC Vectorized (Pushdown)                         8623           8921         228          1.8         548.2       0.8X

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

@github-actions github-actions bot added the SQL label Apr 23, 2022
@WangGuangxin
Contributor Author

cc @wangyum @cloud-fan

.createWithDefault(true)

val PARQUET_FILTER_PUSHDOWN_STRING_PREDICATE_ENABLED =
buildConf("spark.sql.parquet.filterPushdown.stringPredicate")
Contributor

@jaceklaskowski jaceklaskowski Apr 23, 2022

Since spark.sql.parquet.filterPushdown.string.startsWith is internal, why not replace it?

Contributor Author

I'm afraid there are existing users who already use it.

@AmplabJenkins

Can one of the admins verify this patch?

}

case sources.StringEndsWith(name, prefix)
if pushDownStringPredicate && canMakeFilterOn(name, prefix) =>
Member

@viirya viirya Apr 24, 2022

nit: indent, following L750.

}

case sources.StringContains(name, value)
if pushDownStringPredicate && canMakeFilterOn(name, value) =>
Member

ditto

Contributor Author

updated

Option(prefix).map { v =>
FilterApi.userDefined(binaryColumn(nameToParquetField(name).fieldNames),
new UserDefinedPredicate[Binary] with Serializable {
private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
Member

You can just keep UTF8String.fromBytes(strToBinary.getBytes) instead of doing it every time in keep.

Contributor Author

good catch. updated
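
A minimal sketch of the suggestion above, assuming a simplified stand-in for the predicate: the constant side of the comparison is converted once when the predicate is built, and every `keep` call reuses it. `BytesPredicate` and `EndsWithSketch` are hypothetical names for illustration; this is not Parquet's `UserDefinedPredicate` API and not the code merged in this PR.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

// Hypothetical stand-in for a user-defined predicate over raw bytes.
trait BytesPredicate extends Serializable {
  def keep(value: Array[Byte]): Boolean
}

object EndsWithSketch {
  def endsWith(suffix: String): BytesPredicate = new BytesPredicate {
    // Converted once at construction time, then reused by every keep() call.
    private val suffixBytes: Array[Byte] = suffix.getBytes(UTF_8)

    override def keep(value: Array[Byte]): Boolean =
      value != null && value.length >= suffixBytes.length &&
        Arrays.equals(value.takeRight(suffixBytes.length), suffixBytes)
  }

  def main(args: Array[String]): Unit = {
    val predicate = endsWith("10")
    println(predicate.keep("2010".getBytes(UTF_8))) // true
    println(predicate.keep("2011".getBytes(UTF_8))) // false
  }
}
```

Since `keep` may be called for every value it is shown, hoisting the conversion out of it avoids repeating the same allocation on each call.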

Member

@viirya viirya left a comment

There are a few test failures in ParquetV2FilterSuite.

.doc("If true, enables Parquet filter push-down optimization for string predicate such " +
"as startsWith/endsWith/contains function. This configuration only has an effect when " +
"'${PARQUET_FILTER_PUSHDOWN_ENABLED.key}' is enabled.")
.version("3.3.0")
Member

3.4.0?

Contributor Author

updated

}
}

test("filter pushdown - StringEndsWith/Contains") {
Member

Don't we need to test StringStartsWith here?

Contributor Author

StringStartsWith pushdown is an existing feature and is already tested at L1426.

Member

Oh, I mean this test specifically uses testStringPredicateWithDictionaryFilter, but I don't see StringStartsWith included here. Don't we need to run testStringPredicateWithDictionaryFilter for it too?

Contributor

Can we merge the existing StringStartsWith test into the new one?

Contributor Author

updated

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-39002][SQL]StringEndsWith/Contains support push down to Parquet" to "[SPARK-39002][SQL] StringEndsWith/Contains support push down to Parquet" on Apr 25, 2022
@HyukjinKwon
Member

cc @wangyum

@viirya
Member

viirya commented Apr 26, 2022

LGTM if CI passes.

@viirya
Member

viirya commented Apr 26, 2022

cc @huaxingao

@huaxingao
Contributor

Looks good to me overall. Do we need a test similar to this one, test("filter pushdown - StringStartsWith"), to make sure the filters for StringEndsWith and StringContains are created and pushed down to Scan OK?

@WangGuangxin
Contributor Author

> Looks good to me overall. Do we need a test similar to this one, test("filter pushdown - StringStartsWith"), to make sure the filters for StringEndsWith and StringContains are created and pushed down to Scan OK?

done

@viirya
Member

viirya commented Apr 26, 2022

@WangGuangxin can you fix the error in "Scala 2.13 build with SBT"?

[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:1447:57: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `checkStringFilterPushdown`'s return type
[error]       sourceFilter: (String, String) => sources.Filter) {
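
For reference, the compiler message above points at Scala's deprecated procedure syntax (a method body in braces with no `=` and no declared result type). A minimal sketch of the suggested form follows; `check` and its parameters are hypothetical and only loosely echo the shape of the helper named in the error.

```scala
object ProcedureSyntaxSketch {
  // Deprecated procedure syntax, which the build above reports as an error:
  //   def check(name: String, make: (String, String) => String) { ... }

  // Explicit `: Unit =` form, as the compiler message suggests.
  def check(name: String, make: (String, String) => String): Unit = {
    println(make(name, "a"))
  }

  def main(args: Array[String]): Unit =
    check("value", (column, v) => s"$column like '%$v'")
}
```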

@WangGuangxin
Contributor Author

> @WangGuangxin can you fix the error in "Scala 2.13 build with SBT"?
>
> [error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala:1447:57: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `checkStringFilterPushdown`'s return type
> [error]       sourceFilter: (String, String) => sources.Filter) {

done

@viirya
Member

viirya commented Apr 27, 2022

Thanks. Merging to master.

@viirya viirya closed this in 1b7c636 Apr 27, 2022
Seq(
"value like 'a%'", // StartsWith
"value like '%a'", // EndsWith
"value like '%a%'" // Contains
Contributor

@sadikovi sadikovi May 4, 2022

A quick comment. How does this verify the "keep()" test? Shouldn't it also be "canDrop()"?

Contributor

Does this test assume that dictionary filtering is enabled or not?

Contributor

@sadikovi sadikovi May 4, 2022

I think the test is buggy and does not reflect the actual implementation of the filter. NumRowGroupsAcc does not actually count row groups; it counts the number of records passed through the filter. For example, for the contains filter we should still read all of the row groups.

Example of the log:

[canDrop] statistics=org.apache.parquet.filter2.predicate.Statistics@52cd90cd => false
[canDrop] statistics=org.apache.parquet.filter2.predicate.Statistics@64e59f7e => false
  [keep] statistics=Binary{1 constant bytes, [49]} => false
  [keep] statistics=Binary{1 constant bytes, [50]} => false
  [keep] statistics=Binary{1 constant bytes, [51]} => false
  [keep] statistics=Binary{1 constant bytes, [52]} => false
...

Can the author update the test to reflect the implementation? cc @cloud-fan @sunchao.
You may need to enforce things like row group/dictionary filtering as well as record level filtering.

Contributor Author

Hi @sadikovi, NumRowGroupsAcc records the number of row groups that remain after filtering. You can find it here:

if (accu.isDefined() && accu.get().getClass().getSimpleName().equals("NumRowGroupsAcc")) {

As for the keep() test, the dictionary filter is enabled and the test data contains duplicate records, so Parquet generates a dictionary when writing the data and uses the dictionary filter when reading it.

When we test canDrop, the test data has no duplicates, so Parquet generates no dictionary and the statistics-based row group filter is used, which calls canDrop.

Correct me if I'm wrong.
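
To make the two code paths in this thread concrete, here is a simplified model, not Parquet's actual implementation: `canDrop` decides from min/max statistics only, while `keep` is evaluated against individual values such as dictionary entries. All names (`Stats`, `Chunk`, `Predicate`, `shouldRead`) are hypothetical, the ordering is plain `String` comparison (ASCII data assumed), and the order of checks follows the log quoted above, where `canDrop` is called before `keep`.

```scala
// Hypothetical, simplified model; these are not Parquet's real classes.
final case class Stats(min: String, max: String)
final case class Chunk(stats: Stats, dictionary: Option[Set[String]])

trait Predicate {
  def canDrop(stats: Stats): Boolean // decided from min/max only
  def keep(value: String): Boolean   // decided per value
}

object RowGroupFilterSketch {
  def startsWith(prefix: String): Predicate = new Predicate {
    // Compare only the leading prefix.length characters of min and max.
    def canDrop(s: Stats): Boolean =
      s.max.take(prefix.length) < prefix || s.min.take(prefix.length) > prefix
    def keep(v: String): Boolean = v.startsWith(prefix)
  }

  def contains(sub: String): Predicate = new Predicate {
    // Assumption of this sketch: min/max cannot rule out a substring match.
    def canDrop(s: Stats): Boolean = false
    def keep(v: String): Boolean = v.contains(sub)
  }

  // Statistics are consulted first; the dictionary, when one exists, is consulted next.
  def shouldRead(chunk: Chunk, p: Predicate): Boolean =
    !p.canDrop(chunk.stats) && chunk.dictionary.forall(_.exists(p.keep))
}
```

In this model a contains filter can eliminate a row group only through the dictionary path, which is why a test of `keep` depends on a dictionary being written and on dictionary filtering being active when reading.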

Contributor

Yes, that was my point - the test needs to be updated to make sure dictionary pages are written and the dictionary filtering is enabled. Without it, the test does not verify the implementation.

Contributor

@sadikovi sadikovi May 4, 2022

Can you open a follow-up PR to update the test? You can explicitly enable dictionary filtering in the test for the "keep" part of the test to highlight that the test passes due to dictionary filtering, otherwise it could be confusing for people.

Contributor Author

Maybe the fourth param in testStringPredicate is what you need?

.option(ParquetOutputFormat.ENABLE_DICTIONARY, enableDictionary)

It's enabled (by default) in the keep test and disabled in the canDrop test.

Contributor

Enabling dictionary encoding does not control dictionary filtering; there is a separate flag for it.
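
The distinction can be sketched as two separate groups of settings. The keys below are, to the best of my knowledge, the string values behind parquet-mr's `ParquetOutputFormat.ENABLE_DICTIONARY`, `ParquetInputFormat.STATS_FILTERING_ENABLED`, and `ParquetInputFormat.DICTIONARY_FILTERING_ENABLED`; treat them as assumptions to verify against the Parquet version in use. How a Spark test would pass them through is deliberately left out, and `ParquetFilterFlagsSketch` is a hypothetical holder object.

```scala
object ParquetFilterFlagsSketch {
  // Write side: whether dictionary pages are produced at all.
  val writeOptions: Map[String, String] = Map(
    "parquet.enable.dictionary" -> "true")

  // Read side: which row-group level filters may use the pushed-down predicate.
  // Turning statistics filtering off here isolates the dictionary path.
  val readOptions: Map[String, String] = Map(
    "parquet.filter.stats.enabled" -> "false",
    "parquet.filter.dictionary.enabled" -> "true")
}
```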

Contributor

Given the discussion here, it seems this is not simple. @sadikovi can you open a follow-up PR directly to demonstrate your idea?

wangyum added a commit that referenced this pull request Jul 4, 2022
### What changes were proposed in this pull request?

This PR updates `FilterPushdownBenchmark` results.

### Why are the changes needed?

These PRs did not update, or did not fully update, the `FilterPushdownBenchmark` results:
#36328
#36629
#36892

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A.

Closes #37022 from wangyum/SPARK-39631.

Authored-by: Yuming Wang
Signed-off-by: Yuming Wang