Skip to content

Conversation

@Louiszr
Copy link

@Louiszr Louiszr commented Aug 16, 2020

What changes were proposed in this pull request?

Changed the definitions of CrossValidatorModel.copy()/_to_java()/_from_java() so that exposed parameters (i.e. parameters with get() methods) are copied in these methods.

Why are the changes needed?

Parameters are copied in the respective Scala interface for CrossValidatorModel.copy().
It fits the semantics to persist parameters when calling CrossValidatorModel.save() and CrossValidatorModel.load() so that the user gets the same model by saving and loading it after. Not copying across numFolds also causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests for CrossValidatorModel.copy() and save()/load() are updated so that they check parameters before and after function calls.

@huaxingao huaxingao changed the title [SPARK-32092][MLlib] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() Aug 17, 2020
@huaxingao
Copy link
Contributor

ok to test

1 similar comment
@viirya
Copy link
Member

viirya commented Aug 17, 2020

ok to test

avgMetrics = self.avgMetrics
subModels = self.subModels
return CrossValidatorModel(bestModel, avgMetrics, subModels)
return self._copyValues(CrossValidatorModel(bestModel, avgMetrics, subModels), extra=extra)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use Params.copy like CrossValidator?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can. However I just found one potential issue with using Params.copy (not specific to CrossValidator). It creates a shallow copy of self (i.e. the models). Hence if we run the below snippet

cvModelCopied = cvModel.copy()
cvModel.avgMetrics[0] = 'foo'
assert cvModelCopied.avgMetrics[0] != 'foo'  # This will fail

Based on the Scala equivalent I think avgMetrics should be shallow copied and subModels should be copied with the copying actions delegated to copy() of each model. I will push a change to this function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You meant avgMetrics should be or should not be shallow copied?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avgMetrics should be shallow copied, as tested in https://github.com/Louiszr/spark/blob/8ae74d000ac48d6e293feb4bc3dac31088bf6c6c/python/pyspark/ml/tests/test_tuning.py#L124-L129
I think Params.copy will shallow copy the CrossValidatorModel object and thus only copy the reference to avgMetrics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm? I think the test makes sure it isn't shallow copy but deep copy, isn't?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By shallow copy I mean copy.copy() in python, which makes re-assigning cvModel.avgMetrics[0] not being propagated to cvModelCopied.avgMetrics[0].
I am also using copy.deepcopy() as the reference for deep copy. If cvModel.avgMetrics[0] is an class instance, then shallow copy will point to the same instance, while deep copy will create a copy of the instance.
I think here it doesn't make a difference because avgMetrics is a list of float, but in the future if it does become a list of objects then a shallow copy implementation will be sufficient to pass the test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I get your point above. You meant if we just shallow copy the model itself, reassigning of element in avgMetrics will be propagated to the avgMetrics in copied model. Because two models use the same avgMetrics reference.

You want to shallow copy avgMetrics itself. So two models have difference avgMetrics references. Because avgMetrics is a list of float, it is no matter shallow copy or deep copy.

By shallow copy I mean copy.copy() in python, which makes re-assigning cvModel.avgMetrics[0] not being propagated to cvModelCopied.avgMetrics[0].

No matter deep copy or shallow copy, I think reassigning avgMetrics[0] won't propagate to the avgMetrics[0] of copied model. Shallow copy copies object references, reassigning changes references, so won't propagate. Deep copy copies object instance, reassigning changes references too, of course won't propagate either.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with the above.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked Scala API but want to confirm, this is only happened at Python API, right?

@SparkQA
Copy link

SparkQA commented Aug 17, 2020

Test build #127493 has finished for PR 29445 at commit b662ee0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Louiszr Louiszr changed the title [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() [WIP][SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() Aug 17, 2020
@Louiszr
Copy link
Author

Louiszr commented Aug 17, 2020

I checked Scala API but want to confirm, this is only happened at Python API, right?

AFAIK yes. The Scala API should be covered by the copyValues call at the end of the function.

@Louiszr
Copy link
Author

Louiszr commented Aug 17, 2020

@viirya I think similar issues on copy/read/write is affecting TrainValidationSplitModel as well. Shall I fix it in the same PR?

@viirya
Copy link
Member

viirya commented Aug 17, 2020

If it is same issue, I'm ok to have the fix in this PR too.

@SparkQA
Copy link

SparkQA commented Aug 17, 2020

Test build #127512 has finished for PR 29445 at commit e7d79be.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 19, 2020

Test build #127659 has finished for PR 29445 at commit ba994fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Louiszr Louiszr changed the title [WIP][SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() [SPARK-32092][ML][PySpark] Fix parameters not being copied in CrossValidatorModel.copy(), read() and write() Aug 19, 2020
@SparkQA
Copy link

SparkQA commented Aug 19, 2020

Test build #127663 has finished for PR 29445 at commit 1c98218.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Louiszr Louiszr requested a review from viirya August 19, 2020 23:49
@huaxingao
Copy link
Contributor

LGTM

@huaxingao
Copy link
Contributor

@srowen @viirya @zhengruifeng
Do we need to put the fix in 3.0.1 too? I guess we can still make it in RC2.

@SparkQA
Copy link

SparkQA commented Aug 20, 2020

Test build #127707 has finished for PR 29445 at commit 8ae74d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor

@huaxingao I think we need to put this bugfix into RC2.

@viirya
Copy link
Member

viirya commented Aug 21, 2020

+1 to put in RC2.

@srowen srowen closed this in d9eb06e Aug 22, 2020
@srowen
Copy link
Member

srowen commented Aug 22, 2020

Merged to master/3.0

srowen pushed a commit that referenced this pull request Aug 22, 2020
…lidatorModel.copy(), read() and write()

### What changes were proposed in this pull request?

Changed the definitions of `CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed parameters (i.e. parameters with `get()` methods) are copied in these methods.

### Why are the changes needed?

Parameters are copied in the respective Scala interface for `CrossValidatorModel.copy()`.
It fits the semantics to persist parameters when calling `CrossValidatorModel.save()` and `CrossValidatorModel.load()` so that the user gets the same model by saving and loading it after. Not copying across `numFolds` also causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so that they check parameters before and after function calls.

Closes #29445 from Louiszr/master.

Authored-by: Louiszr <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit d9eb06e)
Signed-off-by: Sean Owen <[email protected]>
@viirya
Copy link
Member

viirya commented Aug 23, 2020

@Louiszr This patch seems causing test error in branch-3.0. Can you check it?

https://github.com/apache/spark/pull/29521/checks?check_run_id=1017861326:

Traceback (most recent call last):
  File "/home/runner/work/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 109, in test_copy
    self.assertEqual(param(cv), param(cvCopied))
  File "/home/runner/work/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 104, in <lambda>
    lambda x: x.getFoldCol(),
AttributeError: 'CrossValidator' object has no attribute 'getFoldCol'

@huaxingao
Copy link
Contributor

Sorry I just realized foldCol is for 3.1.0 only. Could you submit a follow-up PR to remove the foldCol related code from 3.0.1? @Louiszr

@Louiszr
Copy link
Author

Louiszr commented Aug 23, 2020

@viirya @huaxingao Yep I will open a PR. Shall I fork from branch-3.0?

@viirya
Copy link
Member

viirya commented Aug 23, 2020

@Louiszr Yes, because it is failed in branch-3.0, you need to create a PR against branch-3.0, instead of master branch.

huaxingao pushed a commit that referenced this pull request Aug 24, 2020
### What changes were proposed in this pull request?

- Removed `foldCol` related code introduced in #29445 which is causing issues in the base branch.
- Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models.

### Why are the changes needed?

- `foldCol` is from 3.1 hence causing tests to fail.
- `CrossValidatorModel.copy()` is supposed to shallow copy models not lists of models.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Existing tests created in #29445 ran and passed.
- Updated `test_copy` to make sure `copy()` is called on models instead of lists of models.

Closes #29524 from Louiszr/remove-foldcol-3.0.

Authored-by: Louiszr <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
huaxingao pushed a commit that referenced this pull request Aug 28, 2020
… to copy models instead of list

### What changes were proposed in this pull request?

Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models.

### Why are the changes needed?

`copy()` was first changed in #29445 . The issue was found in CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated `test_copy` to make sure `copy()` is called on models instead of lists of models.

Closes #29553 from Louiszr/fix-cv-copy.

Authored-by: Louiszr <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants