[SPARK-33915][SQL] Allow json expression to be pushable column #30984
Conversation
Test build #133570 has finished for PR 30984 at commit

From https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133570/console: @cloud-fan Thanks

Looking at https://github.com/apache/spark/pull/30984/checks?check_run_id=1630709203, the failure above was not caused by my change.

I ran ImageFileFormatSuite locally with the patch:
Review thread on sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (outdated, resolved)
Test build #133607 has started for PR 30984 at commit

From @cloud-fan: My expectation is that, after this change is accepted, there will be more and more data sources that utilize this capability. I have shown in SPARK-33915 how a data source utilizes this change (for spark-cassandra-connector). If a data source doesn't push down json-related Filters, behavior is the same as the current status.

test this please

Kubernetes integration test starting

Kubernetes integration test status success

Test build #133617 has finished for PR 30984 at commit

Looks like some cleanup is needed on the test machine. @shaneknapp

@HyukjinKwon @cloud-fan Thanks

@HyukjinKwon I will be happy to answer / investigate. Thanks
@tedyu, the special characters are not allowed in some sources such as Hive, as you tested. However, they are allowed in some other sources when you use DSLs:

scala> spark.range(1).toDF("GetJsonObject(phone#37,$.phone)").write.option("header", true).mode("overwrite").csv("/tmp/foo")

scala> spark.read.option("header", true).csv("/tmp/foo").show()
+-------------------------------+
|GetJsonObject(phone#37,$.phone)|
+-------------------------------+
|                              0|
+-------------------------------+

In this case, the filters will still be pushed down to the datasource implementation site. We should have a way to identify whether the pushed column name refers to an actual column or to an expression. This makes me believe the current implementation based on strings is flaky and incomplete.
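To make the ambiguity concrete, here is a minimal sketch (the filter values are illustrative) of the two indistinguishable cases a connector would receive through the V1 Filter API:

```scala
import org.apache.spark.sql.sources.EqualTo

// Case 1: a filter on a real CSV column that happens to be named like an expression.
val fromRealColumn = EqualTo("GetJsonObject(phone#37,$.phone)", "0")

// Case 2: a json expression rendered to a string and pushed as a "column".
val fromJsonExpression = EqualTo("GetJsonObject(phone#37,$.phone)", "0")

// The connector sees identical Filter objects and cannot tell the cases apart.
assert(fromRealColumn == fromJsonExpression)
```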
Was the above output obtained with or without my change? I got the above with my change. I am also open to suggestions on an alternative approach.
@HyukjinKwon Looking at PushableColumnBase: it seems the existing code, including the GetStructField case, is String-oriented. Thanks
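For reference, a condensed paraphrase of the String-oriented logic in question (abridged from DataSourceStrategy.scala around the time of this PR; details may differ across Spark versions):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, GetStructField}

// Abridged paraphrase of PushableColumnBase#unapply: an expression is reduced
// to a sequence of name parts, which is later rendered as a single String.
def nameParts(e: Expression, nestedPredicatePushdownEnabled: Boolean): Option[Seq[String]] = e match {
  case a: Attribute =>
    // A top-level column; without nested pushdown, dotted names are rejected.
    if (nestedPredicatePushdownEnabled || !a.name.contains(".")) Some(Seq(a.name)) else None
  case s: GetStructField if nestedPredicatePushdownEnabled =>
    // A nested field access: append the field name to the child's name parts.
    nameParts(s.child, nestedPredicatePushdownEnabled).map(_ :+ s.extractFieldName)
  case _ =>
    None // anything else is not pushable as a column
}
```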
For SupportsPushDownRequiredColumns, the existing implementation uses regex matching to detect special functions: @HyukjinKwon
I added some logging in PushableColumnBase#unapply for the GetJsonObject case. For the above call, the additional log didn't show up. If I use df.select("get_json_object(phone, '$.code')"):
Ping @xuanyuanking for an opinion since he commented on SPARK-33915
viirya left a comment
Maybe it's best to loop in @rdblue, @viirya, @dbtsai who looked into this code. I doubt if this is a good approach, relying on strings to push down other expressions, but I am okay if other people think it's fine.

Hmm, I have the same feeling. This does not look like a reliable approach. PushableColumnBase is used to get a pushable column name. Converting an expression like GetJsonObject(column, field) to a string to push down may not be reliable and may have unexpected results.
@viirya Can you outline one scenario where the converted string would produce an unexpected result? I am interested in hearing an alternative approach to pushing down json expressions. Thanks
I don't think it's consistent with the existing implementation; otherwise we should be using strings all around, like a > 1 instead of GreaterThan(a, 1). The only exception I know of is nested columns. We use a dotted string like a.b to represent a nested column, but that representation is pretty standard. We probably need to merge the V1 and V2 filter APIs first.
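A small sketch of the contrast being drawn here, using the V1 Filter API:

```scala
import org.apache.spark.sql.sources.{Filter, GreaterThan}

// The V1 pushdown API is structured: a > 1 arrives as a typed Filter tree,
// not as the string "a > 1".
val typed: Filter = GreaterThan("a", 1)

// The one string-based exception is the column side: a nested field a.b is
// pushed as the dotted name inside an otherwise structured Filter.
val nestedColumn: Filter = GreaterThan("a.b", 1)
```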
Here is some background on how I came about the current approach. A canonical json expression is something like phone->code or phone->>code, where phone is the json(b) column and code is the field. I haven't spent much time investigating how the native jsonb expression can be directly supported in Spark, because the lambda is a fundamental notation.

bq. using strings all around like a > 1 instead of GreaterThan(a, 1)

In general, I agree strong typing is better than string matching.

bq. the string representation for nested columns is pretty standard.

As I mentioned above, the json path expression is quite standard. Along this line of thinking, I tend to think that using string matching for pushing down a json path expression is tantamount to pushing down a nested column.

I have updated the PR by adding the nestedPredicatePushdownEnabled condition to the GetJsonObject case. The implementation of SupportsPushDownFilters can decide whether the pushed-down Filter should be honored (since it has more knowledge about the underlying schema); if not, the Filter is handled on the driver side.

@cloud-fan If you can outline in relatively more detailed steps how the merge would help express json path expressions, that would be nice.
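A minimal sketch of the kind of case described above; this is a hypothetical rendering of the PR's idea (the actual diff and the exact string format may differ):

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, GetJsonObject, Literal}
import org.apache.spark.sql.types.StringType

// Hypothetical: treat get_json_object(col, path) as a pushable "column" only
// when nested predicate pushdown is enabled, by rendering it to a String.
def pushableJsonColumn(e: Expression, nestedPredicatePushdownEnabled: Boolean): Option[String] = e match {
  case GetJsonObject(a: Attribute, Literal(path, StringType)) if nestedPredicatePushdownEnabled =>
    // The connector's SupportsPushDownFilters implementation decides whether to
    // honor a Filter carrying this string; otherwise Spark applies it after the scan.
    Some(s"GetJsonObject(${a.name},$path)")
  case _ => None
}
```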
Kubernetes integration test starting

Kubernetes integration test status success

Test build #133947 has finished for PR 30984 at commit
@HyukjinKwon @cloud-fan Please also see #30984 (comment) where I outline how I got to the current formulation. Thanks
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This PR adds support for treating a json / jsonb expression as a pushable column via PushableColumnBase.
With this change, a SupportsPushDownFilters implementation is able to push such a Filter down to the DB.
Activation of this enhancement is controlled by the nestedPredicatePushdownEnabled flag.
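As a sketch of the consumer side (the method body and the column-string format here are illustrative, not from the PR): a connector's SupportsPushDownFilters implementation could accept the rendered json expression and translate it to the DB's native operator, returning the rest as post-scan filters.

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Illustrative connector-side logic: keep json-expression filters for the DB,
// hand everything else back to Spark to evaluate after the scan.
def pushFilters(filters: Array[Filter]): Array[Filter] = {
  val (pushed, postScan) = filters.partition {
    case EqualTo(col, _) => col.startsWith("GetJsonObject(") // recognize the rendered expression
    case _               => false
  }
  // ... translate `pushed` into the DB's native json operators (e.g. phone->>'code') ...
  postScan // filters Spark must still apply
}
```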
Why are the changes needed?
Currently, an implementation of SupportsPushDownFilters doesn't get a chance to perform pushdown even if the third-party DB engine supports json expression pushdown.
This poses a challenge when the table schema uses json / jsonb to encode a large amount of data.
Here is some background on how I came about the current approach.
A canonical json expression is something like phone->code or phone->>code, where phone is the json(b) column and code is the field.
However, the json expression is rejected by:
Does this PR introduce any user-facing change?
No
How was this patch tested?
With this change, plus corresponding changes in spark-cassandra-connector, a Filter involving a json expression can be pushed down.
Here is the plan prior to predicate pushdown: