@wangyum
Member

wangyum commented Nov 11, 2018

What changes were proposed in this pull request?

SPARK-24638 adds support for Parquet file StartsWith predicate push down.
InMemoryTable can also support this feature.

Here is an example of how it works. Suppose the id column is stored as follows:

Partition ID   lowerBound   upperBound
p1             '1'          '9'
p2             '10'         '19'
p3             '20'         '29'
p4             '30'         '39'
p5             '40'         '49'

Given a filter df.filter($"id".startsWith("2")) or id like '2%',
we truncate lowerBound and upperBound to the length of the prefix:

Partition ID   lowerBound.substr(0, Length("2"))   upperBound.substr(0, Length("2"))
p1             '1'                                 '9'
p2             '1'                                 '1'
p3             '2'                                 '2'
p4             '3'                                 '3'
p5             '4'                                 '4'

Only p1 and p3 have truncated bounds that satisfy lowerBound <= '2' <= upperBound, so we only need to read p1 and p3.
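
The pruning check in this example can be sketched in plain Scala. This is only an illustration, not the code in the patch: BatchStats, mightContain, and survivors are hypothetical names, and ordinary String comparison stands in for Spark's UTF8String ordering (an assumption that holds for the ASCII values used here).

```scala
// Hypothetical stand-in for the per-partition min/max statistics
// kept by the in-memory columnar cache.
final case class BatchStats(id: String, lowerBound: String, upperBound: String)

object StartsWithPruning {
  // A partition may contain a value starting with `prefix` only if the prefix
  // lies between the bounds truncated to the prefix length.
  def mightContain(stats: BatchStats, prefix: String): Boolean = {
    val lo = stats.lowerBound.take(prefix.length)
    val hi = stats.upperBound.take(prefix.length)
    lo <= prefix && prefix <= hi
  }

  // The five partitions from the example table.
  val partitions: Seq[BatchStats] = Seq(
    BatchStats("p1", "1", "9"),
    BatchStats("p2", "10", "19"),
    BatchStats("p3", "20", "29"),
    BatchStats("p4", "30", "39"),
    BatchStats("p5", "40", "49")
  )

  // IDs of the partitions that still have to be scanned for the given prefix.
  def survivors(prefix: String): Seq[String] =
    partitions.filter(mightContain(_, prefix)).map(_.id)
}
```

Calling StartsWithPruning.survivors("2") keeps p1 and p3 and skips the other three partitions. The check is conservative: a surviving partition may still contain no matching row, but a skipped one cannot contain any.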

How was this patch tested?

unit tests and benchmark tests

benchmark test result:

================================================================================================
Pushdown benchmark for StringStartsWith
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '10%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                    12068 / 14198          1.3         767.3       1.0X
InMemoryTable Vectorized (Pushdown)           5457 / 8662          2.9         347.0       2.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '1000%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                      5246 / 5355          3.0         333.5       1.0X
InMemoryTable Vectorized (Pushdown)           2185 / 2346          7.2         138.9       2.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InMemoryTable Vectorized                      5112 / 5312          3.1         325.0       1.0X
InMemoryTable Vectorized (Pushdown)           2292 / 2522          6.9         145.7       2.2X

@SparkQA

SparkQA commented Nov 11, 2018

Test build #98689 has finished for PR 23004 at commit 7bbdb07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member Author

wangyum commented Nov 13, 2018

cc @cloud-fan @HyukjinKwon @kiszk

list.map(l => statsFor(a).lowerBound <= l.asInstanceOf[Literal] &&
  l.asInstanceOf[Literal] <= statsFor(a).upperBound).reduce(_ || _)

case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
Contributor

Can you add a comment to explain this?

Member Author

Added to the PR description.

Member

Can you add the comment at line 240, too?

Member Author

@maropu Done

case StartsWith(a: AttributeReference, ExtractableLiteral(l)) =>
  statsFor(a).lowerBound.substr(0, Length(l)) <= l &&
    l <= statsFor(a).upperBound.substr(0, Length(l))
case StartsWith(ExtractableLiteral(l), a: AttributeReference) =>
Member

BTW, a.startsWith(b) and b.startsWith(a) are not the same, so why are they handled the same way here?

Contributor

Same question.

Member Author

Good question. The last case should be removed; DataSourceStrategy has the same logic:

case expressions.StartsWith(a: Attribute, Literal(v: UTF8String, StringType)) =>
  Some(sources.StringStartsWith(a.name, v.toString))
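
To illustrate why the reversed case cannot reuse the same bounds check, here is a small sketch in plain Scala. The object and helper names are hypothetical, and ordinary String comparison stands in for Spark's UTF8String ordering (an assumption that holds for the ASCII values used here).

```scala
object ReversedStartsWith {
  // The bounds check from the attribute-first case: truncate both bounds
  // to the literal's length and test whether the literal lies between them.
  def boundsCheck(lower: String, upper: String, literal: String): Boolean = {
    val n = literal.length
    lower.take(n) <= literal && literal <= upper.take(n)
  }

  // A batch whose column holds the single value "2", so both bounds are "2".
  val columnValues: Seq[String] = Seq("2")
  val literal = "20"

  // Reversed predicate: literal.startsWith(column). "20" starts with "2",
  // so this batch does contain a matching row.
  val hasMatch: Boolean = columnValues.exists(v => literal.startsWith(v))

  // Applying the attribute-first bounds check to the reversed predicate
  // would decide to skip the batch, which would drop the matching row.
  val kept: Boolean = boundsCheck(columnValues.min, columnValues.max, literal)
}
```

Here hasMatch is true while kept is false, so reusing the check for the literal-first form would prune a batch that still holds a matching row.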

@HyukjinKwon
Member

Looks fine to me

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99040 has finished for PR 23004 at commit 0748deb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 7, 2019

Any update?

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103146 has finished for PR 23004 at commit 15b43f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Mar 8, 2019

cc: @cloud-fan @HyukjinKwon

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 2036074 Mar 8, 2019