[SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession #13135

HyukjinKwon · 2016-05-16T11:12:31Z

What changes were proposed in this pull request?

It seems most of the Python examples were changed to use SparkSession by #12809. That PR said the two examples below:

simple_params_example.py
aft_survival_regression.py

were not changed because they did not work. It seems aft_survival_regression.py was changed by #13050, but simple_params_example.py has not been changed yet.

This PR corrects the example and makes it use SparkSession.

In more detail, it seems threshold was replaced with thresholds here and there by 5a23213. However, calling lr.fit(training, paramMap) overwrites the values. So threshold was 0.5 and the threshold implied by thresholds becomes 0.55 (by 1 / (1 + thresholds(0) / thresholds(1))).

According to the comment below, this is not allowed:

spark/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

Lines 58 to 61 in 354f8f1

    * Note: Calling this with threshold p is equivalent to calling `setThresholds(Array(1-p, p))`.
    *       When [[setThreshold()]] is called, any user-set value for [[thresholds]] will be cleared.
    *       If both [[threshold]] and [[thresholds]] are set in a ParamMap, then they must be
    *       equivalent.

So this PR sets thresholds to a value equivalent to threshold, so that no exception is thrown.
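As a rough illustration of the equivalence rule, the plain-Python sketch below applies the formula quoted in the description to the old and new `thresholds` values. It does not call Spark; the helper names `binary_threshold` and `is_equivalent` and the tolerance are assumptions made for this sketch, and it assumes the second element of `thresholds` is non-zero.

```python
import math


def binary_threshold(thresholds):
    """Threshold implied by a two-element thresholds list.

    Uses the formula quoted in the PR description:
    1 / (1 + thresholds[0] / thresholds[1]).
    Assumes thresholds[1] is non-zero.
    """
    t0, t1 = thresholds
    return 1.0 / (1.0 + t0 / t1)


def is_equivalent(threshold, thresholds, tol=1e-5):
    """True if threshold matches the value implied by thresholds.

    The tolerance is an arbitrary choice for this sketch.
    """
    return math.isclose(threshold, binary_threshold(thresholds), abs_tol=tol)


default_threshold = 0.5  # LogisticRegression's default threshold

# Old example value: implies roughly 0.55, which conflicts with 0.5.
print(binary_threshold([0.45, 0.55]), is_equivalent(default_threshold, [0.45, 0.55]))

# New example value: implies 0.5, which agrees with the default.
print(binary_threshold([0.5, 0.5]), is_equivalent(default_threshold, [0.5, 0.5]))
```

With `[0.5, 0.5]` the implied threshold agrees with the default, which is why the updated example no longer trips the consistency check.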

How was this patch tested?

Manually (mvn package -DskipTests && spark-submit simple_params_example.py)

SparkQA · 2016-05-16T11:25:09Z

Test build #58638 has finished for PR 13135 at commit fc49d30.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-05-16T11:35:51Z

Could you please take a look? @MLnick

yanboliang · 2016-05-16T15:44:12Z

examples/src/main/python/ml/simple_params_example.py

    # We may alternatively specify parameters using a parameter map.
    # paramMap overrides all lr parameters set earlier.
-    paramMap = {lr.maxIter: 20, lr.thresholds: [0.45, 0.55], lr.probabilityCol: "myProbability"}
+    paramMap = {lr.maxIter: 20, lr.thresholds: [0.5, 0.5], lr.probabilityCol: "myProbability"}


Oh, it throws an exception when we make predictions because we want to find an authoritative threshold. This change is okay. Actually, we use threshold more frequently than thresholds in LogisticRegression, because LR does not currently support multiclass classification. The community is trying to find a way to harmonize the two params for LR, but has not found a final solution yet. You can refer to SPARK-11834 and SPARK-11543.

yanboliang · 2016-05-16T16:04:00Z

This looks good to me. It looks like the best fix currently.

HyukjinKwon · 2016-05-16T22:39:29Z

@yanboliang Thank you so much for taking a close look and a detailed explanation!

MLnick · 2016-05-17T18:28:26Z

This looks ok - though the Scala example doesn't throw the exception - why is that (since it is also setting thresholds)?

HyukjinKwon · 2016-05-17T22:45:20Z

@MLnick It seems threshold and thresholds are mutually exclusive when set via the set... methods, but in Python both can be set via the fit method.

So when both values are set and differ, it throws an exception.
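The difference between the two paths can be sketched with a toy class. This is a hypothetical stand-in written for illustration, not Spark's `LogisticRegression`: the class name, method names, and tolerance are all assumptions. The consistency check is placed in `fit` for brevity, although the review comment above notes that the real exception surfaces when making predictions.

```python
class ToyLogisticRegression:
    """Hypothetical model of the threshold/thresholds params only."""

    def __init__(self):
        self.params = {"threshold": 0.5}  # default threshold

    def set_threshold(self, p):
        # Setter path: setting threshold clears any user-set thresholds.
        self.params.pop("thresholds", None)
        self.params["threshold"] = p
        return self

    def set_thresholds(self, values):
        # Setter path: setting thresholds clears any user-set threshold.
        self.params.pop("threshold", None)
        self.params["thresholds"] = list(values)
        return self

    def fit(self, param_map=None):
        # paramMap path: values are merged on top of the existing params
        # without clearing the other param, so both can end up set.
        merged = {**self.params, **(param_map or {})}
        self._check(merged)
        return merged

    @staticmethod
    def _check(params):
        if "threshold" in params and "thresholds" in params:
            t0, t1 = params["thresholds"]
            implied = 1.0 / (1.0 + t0 / t1)
            if abs(params["threshold"] - implied) > 1e-5:  # arbitrary tolerance
                raise ValueError(
                    "threshold (%s) and thresholds (implied %s) are not equivalent"
                    % (params["threshold"], implied)
                )


# Setter path: only one of the two params is ever set, so no conflict.
ToyLogisticRegression().set_thresholds([0.45, 0.55]).fit()

# paramMap path: threshold=0.5 is still set, so thresholds must agree with it.
ToyLogisticRegression().fit({"thresholds": [0.5, 0.5]})
try:
    ToyLogisticRegression().fit({"thresholds": [0.45, 0.55]})
except ValueError as err:
    print("conflict:", err)
```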

zhengruifeng · 2016-05-18T02:47:45Z

examples/src/main/python/ml/simple_params_example.py

 """

 if __name__ == "__main__":
    if len(sys.argv) > 1:


It seems that argv is never used in this example. So what about just removing this if block?

Hm, isn't it there to make sure this script takes no arguments?

This check seems meaningless, and the Scala and Java examples don't have it.

I see. Hm, I don't think it is meaningless; it explicitly means the script takes no arguments.

Actually, I think all the examples that take no arguments should check this for consistency, because some of the example scripts that do take arguments already check it.

Strictly speaking, running this example with arguments might not be a proper way to run it.

@yanboliang @MLnick Do you mind if I ask for your thoughts as well? I don't mind either changing this example not to check sys.argv or making another PR to check sys.argv in all the other examples in this way.

We're moving most examples towards being simpler (with a few exceptions, such as keeping the longer ML examples that show a bit of how to build an app and use args parsing). As such, I agree we should remove this.

Thank you all!

MLnick · 2016-05-18T04:27:08Z

@HyukjinKwon ah right, of course. I forgot the params get set during fit in Python.

SparkQA · 2016-05-18T05:10:18Z

Test build #58745 has finished for PR 13135 at commit bb88635.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-05-18T11:34:14Z

One minor issue: since we changed thresholds in the Python example to [0.5, 0.5], should we also update the Scala/Java examples to make them consistent? I know the Scala/Java examples work well with [0.45, 0.55], though. Otherwise LGTM.

HyukjinKwon · 2016-05-18T12:22:23Z

I see. Thanks! I will change them tomorrow.

SparkQA · 2016-05-19T04:07:57Z

Test build #58838 has finished for PR 13135 at commit 9ec58e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-05-19T06:51:05Z

LGTM too. Merged to master/branch-2.0. Thanks!

…with SparkSession

## What changes were proposed in this pull request?

It seems most of the Python examples were changed to use SparkSession by #12809. That PR said the two examples below:

- `simple_params_example.py`
- `aft_survival_regression.py`

were not changed because they did not work. It seems `aft_survival_regression.py` was changed by #13050, but `simple_params_example.py` has not been changed yet.

This PR corrects the example and makes it use SparkSession.

In more detail, it seems `threshold` was replaced with `thresholds` here and there by 5a23213. However, calling `lr.fit(training, paramMap)` overwrites the values. So `threshold` was 0.5 and the threshold implied by `thresholds` becomes 0.55 (by `1 / (1 + thresholds(0) / thresholds(1))`).

According to the comment below, this is not allowed: https://github.com/apache/spark/blob/354f8f11bd4b20fa99bd67a98da3525fd3d75c81/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L58-L61.

So this PR sets `thresholds` to a value equivalent to `threshold`, so that no exception is thrown.

## How was this patch tested?

Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)

Author: hyukjinkwon <[email protected]>

Closes #13135 from HyukjinKwon/SPARK-15031.

(cherry picked from commit e2ec32d)
Signed-off-by: Nick Pentreath <[email protected]>

HyukjinKwon added 2 commits May 16, 2016 20:03

Make an python example working with SparkSession

ade614b

Remove unused imports

fc49d30

yanboliang reviewed May 16, 2016

zhengruifeng reviewed May 18, 2016

Remove sys.argv checking

bb88635

Update thresholds for consistency for Java/Scala examples

9ec58e6

asfgit closed this in e2ec32d May 19, 2016

HyukjinKwon deleted the SPARK-15031 branch January 2, 2018 03:42
