[SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down #23004

wangyum · 2018-11-11T04:04:22Z

What changes were proposed in this pull request?

SPARK-24638 adds support for Parquet file StartsWith predicate push down.
InMemoryTable can also support this feature.

Here is an example of how it works. Imagine that the id column is stored as below:

Partition ID	lowerBound	upperBound
p1	'1'	'9'
p2	'10'	'19'
p3	'20'	'29'
p4	'30'	'39'
p5	'40'	'49'

For a filter such as df.filter($"id".startsWith("2")) or id like '2%',
we truncate lowerBound and upperBound to the length of the prefix:

Partition ID	lowerBound.substr(0, Length("2"))	upperBound.substr(0, Length("2"))
p1	'1'	'9'
p2	'1'	'1'
p3	'2'	'2'
p4	'3'	'3'
p5	'4'	'4'

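We can see that we only need to read p1 and p3, because they are the only partitions whose truncated bounds satisfy lowerBound <= '2' <= upperBound.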
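The pruning rule in this example can be sketched in plain Scala, outside Spark. This is only an illustration under stated assumptions: `PartitionStats`, `mightContainPrefix` and `partitionsToRead` are hypothetical names, not part of the patch, and Java string comparison stands in for Spark's own string ordering.

```scala
object StartsWithPruning {
  // Hypothetical stand-in for the per-partition min/max statistics of the id column.
  final case class PartitionStats(id: String, lowerBound: String, upperBound: String)

  // A partition can only contain a value starting with `prefix` if `prefix` lies
  // between the bounds truncated to the prefix length.
  // Assumption: String.compareTo approximates the ordering Spark uses for strings.
  def mightContainPrefix(stats: PartitionStats, prefix: String): Boolean = {
    val lower = stats.lowerBound.take(prefix.length)
    val upper = stats.upperBound.take(prefix.length)
    lower.compareTo(prefix) <= 0 && prefix.compareTo(upper) <= 0
  }

  // The partitions from the table in the PR description.
  val partitions: Seq[PartitionStats] = Seq(
    PartitionStats("p1", "1", "9"),
    PartitionStats("p2", "10", "19"),
    PartitionStats("p3", "20", "29"),
    PartitionStats("p4", "30", "39"),
    PartitionStats("p5", "40", "49")
  )

  // IDs of the partitions that cannot be skipped for the given prefix.
  def partitionsToRead(prefix: String): Seq[String] =
    partitions.filter(mightContainPrefix(_, prefix)).map(_.id)
}
```

Calling `StartsWithPruning.partitionsToRead("2")` keeps p1 and p3, matching the table. The check is conservative: a kept partition may still hold no matching row, but a skipped partition never holds one.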

How was this patch tested?

Unit tests and benchmark tests.

Benchmark test result:

================================================================================================
Pushdown benchmark for StringStartsWith
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                    12068 / 14198          1.3         767.3       1.0X
InMemoryTable Vectorized (Pushdown)           5457 / 8662          2.9         347.0       2.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                      5246 / 5355          3.0         333.5       1.0X
InMemoryTable Vectorized (Pushdown)           2185 / 2346          7.2         138.9       2.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                      5112 / 5312          3.1         325.0       1.0X
InMemoryTable Vectorized (Pushdown)           2292 / 2522          6.9         145.7       2.2X

SparkQA · 2018-11-11T07:32:59Z

Test build #98689 has finished for PR 23004 at commit 7bbdb07.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2018-11-13T08:08:23Z

cc @cloud-fan @HyukjinKwon @kiszk

cloud-fan · 2018-11-13T08:59:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala

      list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
        l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)
+
+    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>


Can you add a comment to explain it?

Added to the PR description.

Can you add the comment at line 240, too?

@maropu Done

HyukjinKwon · 2018-11-13T12:12:16Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala

+    case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
+      statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
+        l <= statsFor(a).upperBound.substr(0, Length(l))
+    case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>


BTW, a.startsWith(b) and b.startsWith(a) are not the same, so why are they handled the same way here?

same question

Good question. The last one should be removed; DataSourceStrategy has the same logic:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala

Lines 512 to 513 in 3d6b68b

case expressions.StartsWith(a: Attribute, Literal(v: UTF8String, StringType)) =>
  Some(sources.StringStartsWith(a.name, v.toString))
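To illustrate why the reversed case cannot reuse the same bounds test, here is a small counterexample in plain Scala. It is a sketch only: the partition values and the literal are made up, `boundsTest` is a hypothetical helper that mirrors the rule from the diff, and Java string comparison stands in for Spark's ordering.

```scala
object ReversedStartsWith {
  // The bounds test used for column.startsWith(literal).
  def boundsTest(lower: String, upper: String, literal: String): Boolean = {
    val n = literal.length
    lower.take(n).compareTo(literal) <= 0 && literal.compareTo(upper.take(n)) <= 0
  }

  // Hypothetical partition holding two values, and a hypothetical literal.
  val values: Seq[String] = Seq("2", "25")
  val literal: String = "29"

  // Does the bounds test keep this partition?
  val keptByBoundsTest: Boolean = boundsTest(values.min, values.max, literal)

  // Does any row satisfy the reversed predicate literal.startsWith(column)?
  val hasReversedMatch: Boolean = values.exists(v => literal.startsWith(v))
}
```

Here the bounds test would skip the partition, yet the row "2" satisfies `"29".startsWith("2")`. Applying the same rule to the reversed form could therefore drop matching rows, which is consistent with removing that case.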

HyukjinKwon · 2018-11-20T04:15:27Z

Looks fine to me

SparkQA · 2018-11-20T06:43:47Z

Test build #99040 has finished for PR 23004 at commit 0748deb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-03-07T10:04:38Z

Any update?

SparkQA · 2019-03-07T19:46:44Z

Test build #103146 has finished for PR 23004 at commit 15b43f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-03-08T03:56:05Z

cc: @cloud-fan @HyukjinKwon

cloud-fan · 2019-03-08T11:18:55Z

thanks, merging to master!

InMemoryTable support StartsWith predicate push down

7bbdb07

cloud-fan reviewed Nov 13, 2018

HyukjinKwon reviewed Nov 13, 2018

Fix error

0748deb

Add comment

15b43f9

maropu approved these changes Mar 8, 2019

cloud-fan closed this in 2036074 Mar 8, 2019
