[SPARK-19436][SQL] Add missing tests for approxQuantile #16776
Conversation
Test build #72278 has finished for PR 16776 at commit
Test build #72279 has finished for PR 16776 at commit
@zhengruifeng Could you split the test cases into multiple independent ones with meaningful titles? Thanks!
`val e: IllegalArgumentException` -> `val e`
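A minimal sketch of the pattern being discussed, assuming a ScalaTest suite (the suite name, column name, and quantile value are illustrative, not taken from the PR):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class ApproxQuantileArgSketch extends FunSuite {
  test("illegal quantile value is rejected") {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._
    val df = Seq(1.0, 2.0, 3.0).toDF("num")

    // The explicit type annotation is redundant: intercept already returns
    // the caught IllegalArgumentException, so `val e` is enough.
    val e = intercept[IllegalArgumentException] {
      df.stat.approxQuantile("num", Array(-0.1), 0.01) // quantile outside [0, 1]
    }
    assert(e.getMessage.nonEmpty)
    spark.stop()
  }
}
```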
Force-pushed from 917fd6e to 3ea1301.
Test build #72304 has finished for PR 16776 at commit
@gatorsmile Updated! Thanks for reviewing!
Thank you! What is the expected output if the input dataset is empty? Could you also add a test case?
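A sketch of such a test case in spark-shell style, assuming `spark` and its implicits are in scope (the column name is illustrative; what these calls should return is exactly what the rest of the thread debates):

```scala
import spark.implicits._

// Empty input: no rows at all for the requested column.
val emptyDF = Seq.empty[Double].toDF("num")

// Single-column variant.
val single = emptyDF.stat.approxQuantile("num", Array(0.1, 0.5, 0.9), 0.001)

// Multi-column variant.
val multi = emptyDF.stat.approxQuantile(Array("num"), Array(0.1, 0.5, 0.9), 0.001)
```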
You also need to fix this line, right?
4, update some docs for javadoc8
Force-pushed from 3ea1301 to 92bcf05.
Test build #72364 has finished for PR 16776 at commit
We should return null, instead of throwing an exception. Could you fix it?
Do we really want to return null? Seems very un-Scala-like to me.
Force-pushed from 92bcf05 to 6262f24.
Test build #72407 has finished for PR 16776 at commit
@gatorsmile Updated! Thanks for reviewing!
We will drop the whole row if any column has null or NaN. For example:
```scala
Seq[(java.lang.Long, java.lang.Double)]((null, 1.23), (3L, null), (4L, 3.45))
  .toDF("a", "b").na.drop().show()
```
That means users could get different results depending on which API they use:
```scala
df.stat.approxQuantile("col1", Array(q1), epsilon)
df.stat.approxQuantile("col2", Array(q1), epsilon)
```
versus
```scala
df.stat.approxQuantile(Array("col1", "col2"), Array(q1), epsilon)
```
I am wondering if this is the expected behavior?
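A spark-shell style illustration of that divergence, as described in this thread (a sketch; the exact quantile values are not the point):

```scala
import spark.implicits._

val df = Seq[(java.lang.Long, java.lang.Double)]((null, 1.23), (3L, null), (4L, 3.45))
  .toDF("a", "b")

// Single-column calls: each one only drops nulls in its own column,
// so "a" is computed over {3, 4} and "b" over {1.23, 3.45}.
val qa = df.stat.approxQuantile("a", Array(0.5), 0.0)
val qb = df.stat.approxQuantile("b", Array(0.5), 0.0)

// Multi-column call: na.drop() removes any row containing a null,
// so both columns are computed over the single remaining row (4, 3.45).
val qab = df.stat.approxQuantile(Array("a", "b"), Array(0.5), 0.0)
```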
This does not sound right to me. cc @holdenk @MLnick @jkbradley
I think that might be the reason why we did not provide such an API at the beginning.
If we want to make them consistent, we do not get any performance benefit from this API. Should we revert PR #12135?
@gatorsmile Good catch! Agree that this will cause confusing results.
I think there are two ways to make them consistent:
1, The na-dropping behavior was included in SPARK-17219 to enhance NaN value handling, and the single-column version of approxQuantile is only used in QuantileDiscretizer. So we can make the na-dropping happen in QuantileDiscretizer and remove the na-drop from approxQuantile (see the sketch after this list).
2, Modify the implementation, StatFunctions.multipleApproxQuantiles, to deal with null and NaN directly, and remove the na-drop from approxQuantile.
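A rough sketch of option 1 (an illustration of the idea only, not code from the PR or from QuantileDiscretizer itself): the caller drops nulls/NaNs for its own input column before asking for quantiles, and approxQuantile no longer calls `.na.drop()` internally.

```scala
import spark.implicits._

val raw = Seq[(java.lang.Double, java.lang.Double)]((null, 1.0), (2.0, 3.0), (4.0, 5.0))
  .toDF("feature", "other")

// Drop nulls/NaNs only for the column being discretized, as
// QuantileDiscretizer could do itself under option 1.
val cleaned = raw.na.drop(Seq("feature"))
val splits = cleaned.stat.approxQuantile("feature", Array(0.25, 0.5, 0.75), 0.001)
```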
@zhengruifeng Sure. If we want to make them consistent, I am fine with that. How about reverting #12135 first? In the meantime, we can work on the new solution.
Great catch! I vote for modifying multipleApproxQuantiles to handle null and NaN values. As far as reverting, I'm OK either way as long as we get the fix into 2.2. I'd actually recommend going ahead and merging this PR and creating a follow-up Critical Bug targeted at 2.2.
@MLnick I think dropping NAs from the cols passed as args still will not work. Say the user passes cols "a" and "b" as args, but some rows have (a = NaN, b = 1.0). Then those rows will be ignored.
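A small sketch of that scenario (values illustrative):

```scala
import spark.implicits._

// Row (NaN, 1.0): column "b" has a valid value, but because "a" is NaN
// the whole row is dropped when both columns are requested together.
val df = Seq((Double.NaN, 1.0), (2.0, 3.0), (4.0, 5.0)).toDF("a", "b")

// The quantiles for "b" are then computed from {3.0, 5.0} only,
// silently ignoring the 1.0 value.
val quantiles = df.stat.approxQuantile(Array("a", "b"), Array(0.5), 0.0)
```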
Let us add a TODO comment above this function and create a JIRA for tracking this issue. Thanks!
You're correct, I missed that!
Ok, so #14858 added the na dropping to approxQuantile. It wasn't there in the original impl. It can work on data that has NaN in the sense that it will return NaNs as part of the quantiles, in some cases. I'm not sure whether that was the intended behavior in the original impl or not, cc @thunterdb.
In that sense it's the same as aggs like min or max: those don't exclude NaN automatically, and it's up to the user to handle them. So it could be best to just remove the na dropping from approxQuantile (which would revert to the original impl behavior) and put it in QuantileDiscretizer, which was the original need.
"greater" -> "greater than"
The Scaladoc should describe this case
Force-pushed from 292cd02 to 4db82b4.
Test build #72634 has started for PR 16776 at commit
Jenkins, retest this please
Test build #72676 has finished for PR 16776 at commit
The above five lines can be shortened to `Option(res).map(_.head).orNull`.
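A sketch of the suggested simplification (the surrounding five lines and the type of `res` are reconstructions, not quoted from the PR):

```scala
// Stand-in for the possibly-null result of the multi-column computation.
val res: Array[Array[Double]] = null

// Reconstructed shape of the original (roughly):
// if (res != null) {
//   res.head
// } else {
//   null
// }

// Suggested one-liner: lift the nullable value into an Option, take the
// first column's quantiles when present, and fall back to null otherwise.
val single: Array[Double] = Option(res).map(_.head).orNull
```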
Did this link cause problems in doc generation? It looks like there is a missing ")" in the original link. If you add the ")" then will it generate correctly (with a hyperlink) for the Scala and Java docs?
@jkbradley Do you mean `@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]])`? I am not sure whether it works for the Java docs. @HyukjinKwon Could you help review this?
Force-pushed from 4db82b4 to a3171e4.
Test build #72794 has finished for PR 16776 at commit
Test build #72795 has finished for PR 16776 at commit
`DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile`
->
`DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile)`
Is that right? And how do I generate the Java 8 doc?
nit: DataFrameStatsFunctions -> DataFrameStatFunctions, or remove it.
For example, just
`approxQuantile(String, Array[Double], Double)`
or
`approxQuantile()` with some description.
In general we could just wrap them in backticks without `[[ ... ]]`. It seems the Scaladoc-specific annotation also does not work to disambiguate the argument types:
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error] * @see {@link DataFrameStatFunctions.approxQuantile(col:Str* approxQuantile)} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:45: error: unexpected text
[error] * @see #approxQuantile(String, Array[Double], Double) for detailed description.
[error] ^
```
I guess it does not necessarily need to be a link if making it one breaks the build.
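For reference, a sketch of the plain-backtick form being suggested (the doc wording and the stub signature are illustrative, not the PR's final text):

```scala
/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 *
 * @see `approxQuantile(String, Array[Double], Double)` for a detailed description.
 */
def approxQuantile(
    cols: Array[String],
    probabilities: Array[Double],
    relativeError: Double): Array[Array[Double]] = ???
```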
It seems the doc breaks have queued up a bit. Let me sweep them soon.
Test build #72802 has started for PR 16776 at commit
https://issues.apache.org/jira/browse/SPARK-19573 has been created to track the issue of inconsistent na-dropping.
Test build #72809 has finished for PR 16776 at commit
Force-pushed from 1f07901 to db23d11.
```scala
/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 * @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
```
I am sorry. Actually, I initially meant removing `DataFrameStatFunctions` and leaving just the method, because it is in the same class. Nevertheless, FWIW, I am fine with removing this `@see` as is, given the other functions here.
The * was me getting fancy to get the Scaladoc to link to the correct single-arg method (I did test it at the time and it does work for Scala, though there may be a mistake here somewhere).
It would still be good to provide a @see reference even if it does not link nicely (so the simple backtick method name, as you suggested?).
OK, I will add it back. Should it be `approxQuantile(col:Str* approxQuantile)` or `approxQuantile(String, Array[Double], Double)`?
Test build #72867 has finished for PR 16776 at commit
Test build #72868 has finished for PR 16776 at commit
Test build #72983 has finished for PR 16776 at commit
## What changes were proposed in this pull request?
The errors below seem to be caused by unidoc, which does not understand double-commented blocks.
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error] * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error] * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error] * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error] * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: illegal character: '#'
[error] * &apache#64;Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: class, interface, or enum expected
[error] * &apache#64;Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: class, interface, or enum expected
[error] * boolean shouldRemove = ...; // Decide whether to remove the state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: class, interface, or enum expected
[error] * if (shouldRemove) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: class, interface, or enum expected
[error] * } else {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: class, interface, or enum expected
[error] * state.update(newState); // Set the new state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: class, interface, or enum expected
[error] * }
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: error: class, interface, or enum expected
[error] * state.update(initialState);
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: error: class, interface, or enum expected
[error] * }
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: class, interface, or enum expected
[error] * </code></pre>
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:92: error: class, interface, or enum expected
[error] * tparam S User-defined type of the state to be stored for each key. Must be encodable into
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:93: error: class, interface, or enum expected
[error] * Spark SQL types (see {link Encoder} for more details).
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:94: error: class, interface, or enum expected
[error] * since 2.1.1
[error] ^
```
And another link seems unrecognisable.
```
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:16: error: reference not found
[error] * That is, in every batch of the {link streaming.StreamingQuery StreamingQuery},
[error]
```
Note that this PR does not fix the two breaks below:
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error] * see {link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error] * param relativeError The relative target precision to achieve (>= 0).
[error] ^
[error]
```
because these will probably be fixed soon in apache#16776, and I intended to avoid potential conflicts.
## How was this patch tested?
Manually via `jekyll build`
Author: hyukjinkwon <[email protected]>
Closes apache#16926 from HyukjinKwon/javadoc-break.
LGTM
Thanks! Merging to master. Please continue to finish the work https://issues.apache.org/jira/browse/SPARK-19573
Sorry I missed the conversation here. LGTM.
What changes were proposed in this pull request?
1, check the behavior with illegal `quantiles` and `relativeError` (see the sketch at the end of this description)
2, add tests for `relativeError` > 1
3, update tests for `null` data
4, update some docs for javadoc8
How was this patch tested?
local test in spark-shell
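A spark-shell style sketch of the kind of argument check covered by item 1 above (values are illustrative, and both calls are assumed to be rejected with an IllegalArgumentException by the argument checks):

```scala
import scala.util.Try
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0).toDF("num")

// Quantiles must lie in [0, 1].
val badQuantile = Try(df.stat.approxQuantile("num", Array(1.5), 0.01))
assert(badQuantile.failed.get.isInstanceOf[IllegalArgumentException])

// relativeError must be non-negative.
val badError = Try(df.stat.approxQuantile("num", Array(0.5), -0.1))
assert(badError.failed.get.isInstanceOf[IllegalArgumentException])
```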