-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() #29445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok to test |
1 similar comment
|
ok to test |
| avgMetrics = self.avgMetrics | ||
| subModels = self.subModels | ||
| return CrossValidatorModel(bestModel, avgMetrics, subModels) | ||
| return self._copyValues(CrossValidatorModel(bestModel, avgMetrics, subModels), extra=extra) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use Params.copy like CrossValidator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can. However I just found one potential issue with using Params.copy (not specific to CrossValidator). It creates a shallow copy of self (i.e. the models). Hence if we run the below snippet
cvModelCopied = cvModel.copy()
cvModel.avgMetrics[0] = 'foo'
assert cvModelCopied.avgMetrics[0] != 'foo' # This will failBased on the Scala equivalent I think avgMetrics should be shallow copied and subModels should be copied with the copying actions delegated to copy() of each model. I will push a change to this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You meant avgMetrics should be or should not be shallow copied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avgMetrics should be shallow copied, as tested in https://github.com/Louiszr/spark/blob/8ae74d000ac48d6e293feb4bc3dac31088bf6c6c/python/pyspark/ml/tests/test_tuning.py#L124-L129
I think Params.copy will shallow copy the CrossValidatorModel object and thus only copy the reference to avgMetrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm? I think the test makes sure it isn't shallow copy but deep copy, isn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By shallow copy I mean copy.copy() in python, which makes re-assigning cvModel.avgMetrics[0] not being propagated to cvModelCopied.avgMetrics[0].
I am also using copy.deepcopy() as the reference for deep copy. If cvModel.avgMetrics[0] is an class instance, then shallow copy will point to the same instance, while deep copy will create a copy of the instance.
I think here it doesn't make a difference because avgMetrics is a list of float, but in the future if it does become a list of objects then a shallow copy implementation will be sufficient to pass the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I get your point above. You meant if we just shallow copy the model itself, reassigning of element in avgMetrics will be propagated to the avgMetrics in copied model. Because two models use the same avgMetrics reference.
You want to shallow copy avgMetrics itself. So two models have difference avgMetrics references. Because avgMetrics is a list of float, it is no matter shallow copy or deep copy.
By shallow copy I mean copy.copy() in python, which makes re-assigning cvModel.avgMetrics[0] not being propagated to cvModelCopied.avgMetrics[0].
No matter deep copy or shallow copy, I think reassigning avgMetrics[0] won't propagate to the avgMetrics[0] of copied model. Shallow copy copies object references, reassigning changes references, so won't propagate. Deep copy copies object instance, reassigning changes references too, of course won't propagate either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with the above.
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked Scala API but want to confirm, this is only happened at Python API, right?
|
Test build #127493 has finished for PR 29445 at commit
|
AFAIK yes. The Scala API should be covered by the |
|
@viirya I think similar issues on |
|
If it is same issue, I'm ok to have the fix in this PR too. |
|
Test build #127512 has finished for PR 29445 at commit
|
|
Test build #127659 has finished for PR 29445 at commit
|
|
Test build #127663 has finished for PR 29445 at commit
|
|
LGTM |
|
@srowen @viirya @zhengruifeng |
|
Test build #127707 has finished for PR 29445 at commit
|
|
@huaxingao I think we need to put this bugfix into RC2. |
|
+1 to put in RC2. |
|
Merged to master/3.0 |
…lidatorModel.copy(), read() and write() ### What changes were proposed in this pull request? Changed the definitions of `CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed parameters (i.e. parameters with `get()` methods) are copied in these methods. ### Why are the changes needed? Parameters are copied in the respective Scala interface for `CrossValidatorModel.copy()`. It fits the semantics to persist parameters when calling `CrossValidatorModel.save()` and `CrossValidatorModel.load()` so that the user gets the same model by saving and loading it after. Not copying across `numFolds` also causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so that they check parameters before and after function calls. Closes #29445 from Louiszr/master. Authored-by: Louiszr <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit d9eb06e) Signed-off-by: Sean Owen <[email protected]>
|
@Louiszr This patch seems causing test error in branch-3.0. Can you check it? https://github.com/apache/spark/pull/29521/checks?check_run_id=1017861326: |
|
Sorry I just realized |
|
@viirya @huaxingao Yep I will open a PR. Shall I fork from branch-3.0? |
|
@Louiszr Yes, because it is failed in branch-3.0, you need to create a PR against branch-3.0, instead of master branch. |
### What changes were proposed in this pull request? - Removed `foldCol` related code introduced in #29445 which is causing issues in the base branch. - Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models. ### Why are the changes needed? - `foldCol` is from 3.1 hence causing tests to fail. - `CrossValidatorModel.copy()` is supposed to shallow copy models not lists of models. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Existing tests created in #29445 ran and passed. - Updated `test_copy` to make sure `copy()` is called on models instead of lists of models. Closes #29524 from Louiszr/remove-foldcol-3.0. Authored-by: Louiszr <[email protected]> Signed-off-by: Huaxin Gao <[email protected]>
… to copy models instead of list ### What changes were proposed in this pull request? Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models. ### Why are the changes needed? `copy()` was first changed in #29445 . The issue was found in CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated `test_copy` to make sure `copy()` is called on models instead of lists of models. Closes #29553 from Louiszr/fix-cv-copy. Authored-by: Louiszr <[email protected]> Signed-off-by: Huaxin Gao <[email protected]>
What changes were proposed in this pull request?
Changed the definitions of
CrossValidatorModel.copy()/_to_java()/_from_java()so that exposed parameters (i.e. parameters withget()methods) are copied in these methods.Why are the changes needed?
Parameters are copied in the respective Scala interface for
CrossValidatorModel.copy().It fits the semantics to persist parameters when calling
CrossValidatorModel.save()andCrossValidatorModel.load()so that the user gets the same model by saving and loading it after. Not copying acrossnumFoldsalso causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket).Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tests for
CrossValidatorModel.copy()andsave()/load()are updated so that they check parameters before and after function calls.