[SPARK-30648][SQL] Support filters pushdown in JSON datasource #27366
…s-pushdown # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScan.scala
Test build #117440 has finished for PR 27366 at commit
@HyukjinKwon @dongjoon-hyun @cloud-fan Could you review this PR? I keep it as
Test build #117452 has finished for PR 27366 at commit
Retest this please.
Test build #117454 has finished for PR 27366 at commit
HyukjinKwon
left a comment
LGTM
jenkins, retest this, please
Test build #125901 has finished for PR 27366 at commit
Test build #125912 has finished for PR 27366 at commit
Test build #125920 has finished for PR 27366 at commit
jenkins, retest this, please
Test build #125958 has finished for PR 27366 at commit
Let me merge this. The pip test passed in GitHub Actions.
Merged to master.
Thank you, @MaxGekk and @HyukjinKwon!
What changes were proposed in this pull request?
In this PR, I propose to support pushed-down filters in the JSON datasource. The reason for pushing a filter down to `JacksonParser` is to apply the filter as soon as all of its attributes become available, i.e. have been converted from JSON field values to the desired values according to the schema. This allows skipping the parsing of the rest of the JSON record, and the conversion of other values, if the filter returns `false`. This can improve performance when the pushed filters are highly selective and the conversion of JSON string fields to the desired values is comparatively expensive (for example, conversion to `TIMESTAMP` values).
The main idea behind `JsonFilters` is to group pushed-down filters by their references, convert the grouped filters to expressions, and then compile them to predicates. The predicates are indexed by schema field positions. Each predicate carries state: a reference counter of the row fields it depends on that have not been set yet. As soon as the counter reaches `0`, the predicate can be applied to the row because all of its dependencies have been set. Before a new row is processed, each predicate's reference counter is reset to the total number of the predicate's references (its dependencies in a row).
The common code shared between `CSVFilters` and `JsonFilters` is moved to the `StructFilters` class and its companion object.
Why are the changes needed?
The changes improve performance on synthetic benchmarks by up to 27 times on JDK 8 and 25 times on JDK 11.
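The speedup comes from rejecting a row before its remaining fields are converted. The reference-counting scheme from the PR description can be sketched roughly as below. This is a minimal Python illustration, not the Scala code in `JsonFilters` or `StructFilters`: the names `CountedPredicate`, `RowFilters`, and `parse` are hypothetical, predicates are plain callables rather than compiled Catalyst expressions, and a row whose referenced fields never appear is simply returned unfiltered.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Dict, List


@dataclass
class CountedPredicate:
    """Hypothetical predicate with a counter of referenced fields still unset."""
    fields: List[str]                       # distinct field names the predicate reads
    test: Callable[[Dict[str, Any]], bool]  # evaluated once all fields are set
    remaining: int = 0

    def reset(self) -> None:
        self.remaining = len(self.fields)


class RowFilters:
    """Illustrative stand-in for the idea described above (not Spark's API)."""

    def __init__(self, schema: List[str], predicates: List[CountedPredicate]):
        self.predicates = predicates
        # Index predicates by the schema position of each field they reference.
        self.by_position: List[List[CountedPredicate]] = [[] for _ in schema]
        for p in predicates:
            for name in p.fields:
                self.by_position[schema.index(name)].append(p)

    def reset(self) -> None:
        # Called before each new row: every dependency is unset again.
        for p in self.predicates:
            p.reset()

    def skip_row(self, row: Dict[str, Any], position: int) -> bool:
        """Call after the field at `position` is set; True means drop the row."""
        for p in self.by_position[position]:
            p.remaining -= 1
            if p.remaining == 0 and not p.test(row):
                return True
        return False


def parse(record, schema, converters, filters):
    """Convert fields in record order and stop at the first failing predicate.

    Returns (row or None, number of conversions performed).
    """
    filters.reset()
    row: Dict[str, Any] = {}
    conversions = 0
    for name, raw in record.items():
        if name not in schema:
            continue
        row[name] = converters[name](raw)
        conversions += 1
        if filters.skip_row(row, schema.index(name)):
            return None, conversions  # remaining fields are never converted
    return row, conversions


schema = ["id", "ts"]
converters = {"id": int, "ts": datetime.fromisoformat}
filters = RowFilters(schema, [CountedPredicate(["id"], lambda r: r["id"] > 10)])

rejected = parse({"id": "1", "ts": "2020-01-01T00:00:00"}, schema, converters, filters)
accepted = parse({"id": "42", "ts": "2020-01-01T00:00:00"}, schema, converters, filters)
```

In this sketch the first record is dropped after a single conversion, so the comparatively expensive timestamp conversion is skipped, while the second record is converted in full. If the filtered field came last in the record, nothing would be saved, which is consistent with the description's point that the benefit depends on filter selectivity and conversion cost.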
Does this PR introduce any user-facing change?
No
How was this patch tested?
`JsonFiltersSuite` and `JacksonParserSuite`.
`JsonSuite`.
`CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`.
`CSVBenchmark` and `JsonBenchmark`, run on Amazon EC2 with OpenJDK installed via `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`, using `./dev/run-benchmarks`.