[SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession #13135
Test build #58638 has finished for PR 13135 at commit
Could you please take a look? @MLnick
  # We may alternatively specify parameters using a parameter map.
  # paramMap overrides all lr parameters set earlier.
- paramMap = {lr.maxIter: 20, lr.thresholds: [0.45, 0.55], lr.probabilityCol: "myProbability"}
+ paramMap = {lr.maxIter: 20, lr.thresholds: [0.5, 0.5], lr.probabilityCol: "myProbability"}
Oh, it throws an exception when we make predictions because we want to find an authoritative threshold. This change is okay. Actually, we use threshold more frequently than thresholds in LogisticRegression, because LR does not currently support multiclass classification. The community is trying to find a way to harmonize the two params for LR, but has not found a final solution. You can refer to SPARK-11834 and SPARK-11543.
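As a rough illustration of how the two params relate in the binary case, the sketch below applies the formula quoted in this PR's description, `1 / (1 + thresholds(0) / thresholds(1))`. The function name `threshold_from_thresholds` is hypothetical and is not part of the Spark API; the guard against a zero second element is an assumption of this sketch.

```python
def threshold_from_thresholds(thresholds):
    """Return the binary threshold implied by a two-element thresholds list.

    Hypothetical helper for illustration only; not a Spark function.
    """
    if len(thresholds) != 2:
        raise ValueError("expected exactly two thresholds for binary classification")
    t0, t1 = thresholds
    if t1 == 0:
        # Assumption of this sketch: reject a zero denominator explicitly.
        raise ValueError("thresholds[1] must be non-zero")
    return 1.0 / (1.0 + t0 / t1)


# [0.5, 0.5] maps back to the default threshold of 0.5,
# while [0.45, 0.55] implies a threshold of about 0.55.
print(threshold_from_thresholds([0.5, 0.5]))
print(threshold_from_thresholds([0.45, 0.55]))
```

This is why `[0.5, 0.5]` is the value that stays consistent with an unchanged `threshold`.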
This looks good to me. It looks like the best fix currently.
@yanboliang Thank you so much for taking a close look and a detailed explanation!
This looks ok - though the Scala example doesn't throw the exception - why is that (since it is also setting
@MLnick It seems So.. when both values are set and different, it throws an exception.
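The behaviour described in the comment above (an exception when both values are set and disagree) can be sketched in plain Python. This is only a model of the idea, not Spark's implementation: the function name `check_threshold_consistency` and the tolerance `tol` are assumptions made for this example.

```python
import math


def check_threshold_consistency(threshold, thresholds, tol=1e-5):
    """Raise ValueError if both params are set but imply different thresholds.

    Hypothetical helper for illustration; `None` stands for "not set".
    """
    if threshold is None or thresholds is None:
        return  # Only one of the two is set, so there is nothing to compare.
    if len(thresholds) != 2:
        raise ValueError("thresholds must have length 2 for binary classification")
    # Formula quoted in the PR description.
    implied = 1.0 / (1.0 + thresholds[0] / thresholds[1])
    if not math.isclose(implied, threshold, abs_tol=tol):
        raise ValueError(
            "threshold (%g) and thresholds (equivalent to %g) disagree"
            % (threshold, implied)
        )


check_threshold_consistency(0.5, [0.5, 0.5])  # consistent, no error
try:
    check_threshold_consistency(0.5, [0.45, 0.55])
except ValueError as e:
    print("rejected:", e)
```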
"""

if __name__ == "__main__":
    if len(sys.argv) > 1:
It seems that the argv are never used in this example. So what about just removing this if segment?
Hm.. Isn't it there to make sure this script takes no arguments?
This check seems meaningless, and the Scala and Java examples don't have it.
I see.. Hm.. but I don't think this is meaningless; it explicitly says that the script takes no arguments.
Actually, I think all the examples that take no arguments should check this for consistency, because some of the example scripts that do take arguments already check this.
Strictly speaking, running this example with arguments might not be a proper way to run it.
@yanboliang @MLnick Do you mind if I ask for your thoughts as well? I don't mind either changing this example not to check sys.argv or making another PR to check sys.argv in all the other examples in this way.
We're moving most examples towards being simpler (with a few exceptions, such as keeping the longer ML examples that show a bit of how to build an app and use args parsing). As such, I agree we should remove this.
Thank you all!
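For readers following the thread above, here is a minimal sketch of the kind of guard that was under discussion: a script that explicitly rejects command-line arguments. The helper name `reject_arguments` and the usage message are hypothetical; the original example's exact message is not shown in this thread.

```python
def reject_arguments(argv):
    """Return a usage message if extra arguments were passed, else None.

    Hypothetical helper; `argv` has the same shape as `sys.argv`.
    """
    if len(argv) > 1:
        return "Usage: %s (takes no arguments)" % argv[0]
    return None


# No extra arguments: nothing to report.
print(reject_arguments(["simple_params_example.py"]))
# One extra argument: a usage message is returned.
print(reject_arguments(["simple_params_example.py", "extra"]))
```

The outcome of the review was to drop this kind of check so that the example stays short, matching the Scala and Java versions.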
@HyukjinKwon ah right, of course. I forgot the params get set during
Test build #58745 has finished for PR 13135 at commit
One minor issue: Since we changed
I see. Thanks! I will change them tomorrow.
Test build #58838 has finished for PR 13135 at commit
LGTM too. Merged to master/branch-2.0. Thanks!
…with SparkSession

## What changes were proposed in this pull request?

It seems most of the Python examples were changed to use SparkSession by #12809. That PR said both examples below:

- `simple_params_example.py`
- `aft_survival_regression.py`

were not changed because they did not work. It seems `aft_survival_regression.py` was changed by #13050 but `simple_params_example.py` has not been yet.

This PR corrects the example and makes it use SparkSession.

In more detail, it seems `threshold` was replaced with `thresholds` in several places by 5a23213. However, when it calls `lr.fit(training, paramMap)` this overwrites the values. So, `threshold` was 0.5 and `thresholds` becomes equivalent to 0.55 (by `1 / (1 + thresholds(0) / thresholds(1))`). According to the comment below, this is not allowed: https://github.com/apache/spark/blob/354f8f11bd4b20fa99bd67a98da3525fd3d75c81/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L58-L61.

So, in this PR, it sets the equivalent value so that this does not throw an exception.

## How was this patch tested?

Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)

Author: hyukjinkwon <[email protected]>

Closes #13135 from HyukjinKwon/SPARK-15031.

(cherry picked from commit e2ec32d)
Signed-off-by: Nick Pentreath <[email protected]>
What changes were proposed in this pull request?
It seems most of the Python examples were changed to use SparkSession by #12809. That PR said both examples below:
- `simple_params_example.py`
- `aft_survival_regression.py`
were not changed because they did not work. It seems `aft_survival_regression.py` was changed by #13050 but `simple_params_example.py` has not been yet.
This PR corrects the example and makes it use SparkSession.
In more detail, it seems `threshold` was replaced with `thresholds` in several places by 5a23213. However, when it calls `lr.fit(training, paramMap)` this overwrites the values. So, `threshold` was 0.5 and `thresholds` becomes equivalent to 0.55 (by `1 / (1 + thresholds(0) / thresholds(1))`).
According to the comment below, this is not allowed:
spark/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
Lines 58 to 61 in 354f8f1
So, in this PR, it sets the equivalent value so that this does not throw an exception.
How was this patch tested?
Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)