Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented May 16, 2016

What changes were proposed in this pull request?

It seems most of Python examples were changed to use SparkSession by #12809. This PR said both examples below:

  • simple_params_example.py
  • aft_survival_regression.py

are not changed because it dose not work. It seems aft_survival_regression.py is changed by #13050 but simple_params_example.py is not yet.

This PR corrects the example and make this use SparkSession.

In more detail, it seems threshold is replaced to thresholds here and there by 5a23213. However, when it calls lr.fit(training, paramMap) this overwrites the values. So, threshold was 5 and thresholds becomes 5.5 (by 1 / (1 + thresholds(0) / thresholds(1)).

According to the comment below. this is not allowed,

* Note: Calling this with threshold p is equivalent to calling `setThresholds(Array(1-p, p))`.
* When [[setThreshold()]] is called, any user-set value for [[thresholds]] will be cleared.
* If both [[threshold]] and [[thresholds]] are set in a ParamMap, then they must be
* equivalent.
.

So, in this PR, it sets the equivalent value so that this does not throw an exception.

How was this patch tested?

Manully (mvn package -DskipTests && spark-submit simple_params_example.py)

@SparkQA
Copy link

SparkQA commented May 16, 2016

Test build #58638 has finished for PR 13135 at commit fc49d30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

Could you please take a look? @MLnick

# We may alternatively specify parameters using a parameter map.
# paramMap overrides all lr parameters set earlier.
paramMap = {lr.maxIter: 20, lr.thresholds: [0.45, 0.55], lr.probabilityCol: "myProbability"}
paramMap = {lr.maxIter: 20, lr.thresholds: [0.5, 0.5], lr.probabilityCol: "myProbability"}
Copy link
Contributor

@yanboliang yanboliang May 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it throws exception when we make predictions because we want to find an authoritative threshold. This change is okey. Actually we use threshold more frequently than thresholds in LogisticRegression, because LR does not support multi classification currently. The community is try to find a way to harmonize the two param for LR, but did not find a final solution. You can refer SPARK-11834 and SPARK-11543 .

@yanboliang
Copy link
Contributor

yanboliang commented May 16, 2016

This looks good to me. It looks like the best fix currently.

@HyukjinKwon
Copy link
Member Author

@yanboliang Thank you so much for taking a close look and a detailed explanation!

@MLnick
Copy link
Contributor

MLnick commented May 17, 2016

This looks ok - though the Scala example doesn't throw the exception - why is that (since it is also setting thresholds)?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented May 17, 2016

@MLnick It seems threshold or thresholds is set (mutually exclusive) via set.. methods but in Python both can be set via fit method.

So.. when both values are set and different, it throws an exception.

"""

if __name__ == "__main__":
if len(sys.argv) > 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the argv are never used in this example. So what about just removing this if segment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. Isn't it making sure of not taking arguments for this script?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checking seems meaningless. And scala and java example dont have it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.. Hm.. but I don't think this is meaningless but means explicitly not taking arguments.

Actually, I think all the examples (not taking arguments) should check this for consistency because some of example scripts (taking arguments) are already checking this.

Strictly, running this example with arguments might not be a proper way to run this example.

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanboliang @MLnick Do you mind If I ask your thoughts as well? I don't mind if I should change this example not to check the sys.argv or make another PR to check sys.argv for all other examples in this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're moving most examples towards being more simple (with a few exceptions, such as keeping the longer ML examples that show a bit how to build an app and use args parsing). As such I agree we should remove this.

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you all!

@MLnick
Copy link
Contributor

MLnick commented May 18, 2016

@HyukjinKwon ah right, of course. I forgot the params get set during fit in Python.

@SparkQA
Copy link

SparkQA commented May 18, 2016

Test build #58745 has finished for PR 13135 at commit bb88635.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

yanboliang commented May 18, 2016

One minor issue: Since we changed thresholds at Python example to [0.5, 0.5], should we also update Scala/Java examples to make them consistent. Although I known that Scala/Java examples work well for [0.45, 0.55]. Otherwise LGTM.

@HyukjinKwon
Copy link
Member Author

I see. Thanks! I will change them tomorrow.

@SparkQA
Copy link

SparkQA commented May 19, 2016

Test build #58838 has finished for PR 13135 at commit 9ec58e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented May 19, 2016

LGTM too. Merged to master/branch-2.0. Thanks!

asfgit pushed a commit that referenced this pull request May 19, 2016
…with SparkSession

## What changes were proposed in this pull request?

It seems most of Python examples were changed to use SparkSession by #12809. This PR said both examples below:

- `simple_params_example.py`
- `aft_survival_regression.py`

are not changed because it dose not work. It seems `aft_survival_regression.py` is changed by #13050 but `simple_params_example.py` is not yet.

This PR corrects the example and make this use SparkSession.

In more detail, it seems `threshold` is replaced to `thresholds` here and there by 5a23213. However, when it calls `lr.fit(training, paramMap)` this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1)`).

According to the comment below. this is not allowed, https://github.com/apache/spark/blob/354f8f11bd4b20fa99bd67a98da3525fd3d75c81/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L58-L61.

So, in this PR, it sets the equivalent value so that this does not throw an exception.

## How was this patch tested?

Manully (`mvn package -DskipTests && spark-submit simple_params_example.py`)

Author: hyukjinkwon <[email protected]>

Closes #13135 from HyukjinKwon/SPARK-15031.

(cherry picked from commit e2ec32d)
Signed-off-by: Nick Pentreath <[email protected]>
@asfgit asfgit closed this in e2ec32d May 19, 2016
@HyukjinKwon HyukjinKwon deleted the SPARK-15031 branch January 2, 2018 03:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants