[SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug #20566

viirya · 2018-02-10T09:16:23Z

What changes were proposed in this pull request?

Since 2.3, Bucketizer supports multiple input/output columns. During transformation we check whether mutually exclusive params are set. E.g., if inputCols and outputCol are both set, an error is thrown.

However, when a Bucketizer is written, it looks like the default params and user-supplied params are merged. On load, all saved params are set on the newly created instance. So the default outputCol param from the HasOutputCol trait ends up in paramMap and becomes a user-supplied param. That makes the exclusive-params check fail.

This patch changes DefaultParamsWriter to save only user-supplied params.
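The round trip described here can be sketched with a toy model in plain Python. This is not Spark code: `ToyBucketizer`, `save_merged`, `save_user_only`, `load`, and the default value `"bucketizer_output"` are all hypothetical names chosen for illustration. The sketch only assumes what the description states, namely that defaults and user-supplied params live in separate maps and that every saved param is set on the loaded instance.

```python
class ToyBucketizer:
    """Toy stand-in for a Params object; not the Spark API."""

    def __init__(self):
        # Default inherited from a shared trait (placeholder value).
        self.default_params = {"outputCol": "bucketizer_output"}
        # Params explicitly set by the user.
        self.user_params = {}

    def set(self, name, value):
        self.user_params[name] = value
        return self

    def check_exclusive(self):
        # Mirrors the transform-time check: inputCols and outputCol
        # must not both be user-supplied.
        if "inputCols" in self.user_params and "outputCol" in self.user_params:
            raise ValueError("inputCols and outputCol are both set")


def save_merged(stage):
    # Behaviour described above: defaults and user params written together.
    return {**stage.default_params, **stage.user_params}


def save_user_only(stage):
    # Behaviour proposed by this patch: write user-supplied params only.
    return dict(stage.user_params)


def load(saved):
    # Every saved param is set on the new instance, so it becomes user-supplied.
    stage = ToyBucketizer()
    for name, value in saved.items():
        stage.set(name, value)
    return stage


original = ToyBucketizer().set("inputCols", ["a", "b"]).set("outputCols", ["x", "y"])
original.check_exclusive()  # passes: outputCol is only a default here

try:
    load(save_merged(original)).check_exclusive()
    merged_roundtrip_ok = True
except ValueError:
    # The default outputCol came back as a user-supplied param.
    merged_roundtrip_ok = False

load(save_user_only(original)).check_exclusive()  # passes with user-only saving
user_only_roundtrip_ok = True
```

In this toy model the merged round trip trips the exclusivity check while the user-only round trip does not, which is the behaviour the patch aims for.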

The multi-column QuantileDiscretizer also has the same issue.

How was this patch tested?

Modified test.

SparkQA · 2018-02-10T09:29:06Z

Test build #87283 has finished for PR 20566 at commit 7785cac.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91

I am just wondering whether we should persist the default params too (in case they are changed across multiple versions) but in a separate section. WDYT?
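One way to picture the "separate section" idea is a metadata record with two maps. The sketch below is a minimal illustration in plain Python, not the format this PR writes: the key names `paramMap` and `defaultParamMap`, the helper functions, and the `handleInvalid` values are assumptions made for the example.

```python
import json


def write_metadata(user_params, default_params):
    # Two separate sections: user-supplied params and the defaults in
    # effect at save time. Key names are illustrative.
    return json.dumps({"paramMap": user_params, "defaultParamMap": default_params})


def read_metadata(text, current_defaults, prefer_saved_defaults):
    meta = json.loads(text)
    # The loader can choose between the saved defaults and the defaults
    # of the running version; user-supplied values are kept apart.
    defaults = meta["defaultParamMap"] if prefer_saved_defaults else current_defaults
    return {"user": meta["paramMap"], "default": defaults}


saved = write_metadata({"inputCols": ["a", "b"]}, {"handleInvalid": "error"})
current_defaults = {"handleInvalid": "keep"}  # hypothetical new-version default

old_policy = read_metadata(saved, current_defaults, prefer_saved_defaults=True)
new_policy = read_metadata(saved, current_defaults, prefer_saved_defaults=False)
```

Keeping the defaults in their own section means a loaded default never turns into a user-supplied param, and the choice between old and new defaults can be made at load time.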

mgaido91 · 2018-02-10T09:36:29Z

mllib/src/main/scala/org/apache/spark/ml/param/params.scala

can't we just make paramMap private[ml]?

viirya

Either way is fine with me.

mgaido91

In this way I think you can also avoid the MiMa failure...

viirya

Looks like it still can't avoid the MiMa failure.

viirya · 2018-02-10T09:47:41Z

@mgaido91 I also considered the issue of default values changing across versions. I'm not sure which is more reasonable: using the old version's default value or the current version's.

mgaido91 · 2018-02-10T09:53:13Z

@viirya that's a good question. Honestly, my view is that if the user doesn't set a value, they don't care about it, so it is good to use the new version's default IMHO. But it is also true that changing a default may cause unexpected behavior in user code.

So, it LGTM, but I'd like to hear others' opinion on this too.

viirya · 2018-02-10T09:58:49Z

Yeah, IMHO, when the user loads a model saved by an old version into a new version, it is reasonable to run it with the current default value, because the param was not explicitly set and should use the "default" value of the current system.

Thanks for your comment. Let's wait for others' opinions.

SparkQA · 2018-02-10T10:04:17Z

Test build #87285 has finished for PR 20566 at commit 3b5e7c6.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-10T16:08:16Z

Test build #87289 has finished for PR 20566 at commit 6228006.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MrBago · 2018-02-11T00:33:39Z

I believe this will break persistence for LogisticRegression. The issue is that the threshold param on LogisticRegressionModel doesn't get a default directly, but only gets it during the call to fit on LogisticRegression. This is currently fine because the Model can only be created by fitting or by being read from disk, and in both cases some value gets set for threshold. With this change that's no longer the case. Here's a test to confirm: 5db2108.

I believe LinearRegression may have a similar issue.

Our current tests don't seem to cover this kind of thing so I think we should improve test coverage if we want to make this kind of change.

viirya · 2018-02-11T02:10:50Z

It's not only threshold: the default params of NaiveBayes and LogisticRegression (maybe more, I'm looking now) are all set in the estimator, not in the model. The models receive the default values at the end of fit.
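The concern raised here can be reproduced with a toy model in plain Python. This is not Spark code: `ToyEstimator`, `ToyModel`, and `roundtrip` are hypothetical names, and the `threshold` value of 0.5 is used only for illustration. The sketch assumes what the comments above state: the model declares no defaults of its own and receives them from the estimator at the end of fit.

```python
class ToyModel:
    def __init__(self):
        self.default_params = {}  # the model declares no defaults itself
        self.user_params = {}

    def get(self, name):
        if name in self.user_params:
            return self.user_params[name]
        return self.default_params[name]  # KeyError if no value is available


class ToyEstimator:
    default_params = {"threshold": 0.5}

    def fit(self):
        model = ToyModel()
        # Defaults are copied onto the model only at the end of fit.
        model.default_params.update(self.default_params)
        return model


def roundtrip(model, user_only):
    if user_only:
        saved = dict(model.user_params)
    else:
        saved = {**model.default_params, **model.user_params}
    loaded = ToyModel()  # a freshly constructed model has no defaults
    loaded.user_params.update(saved)
    return loaded


fitted = ToyEstimator().fit()
old_style = roundtrip(fitted, user_only=False)
new_style = roundtrip(fitted, user_only=True)

old_threshold = old_style.get("threshold")
try:
    new_style.get("threshold")
    new_has_threshold = True
except KeyError:
    # The default existed only on the fitted instance and was never saved.
    new_has_threshold = False
```

With merged saving the loaded model still has a threshold; with user-only saving it has none, because nothing on the model side declares that default.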

SparkQA · 2018-02-11T06:39:28Z

Test build #87299 has finished for PR 20566 at commit daceafe.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-02-11T08:05:01Z

Test build #87301 has finished for PR 20566 at commit c1fb657.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-02-11T08:12:33Z

retest this please.

SparkQA · 2018-02-11T11:56:36Z

Test build #87302 has finished for PR 20566 at commit c1fb657.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-02-11T11:58:31Z

cc @MLnick @jkbradley

jkbradley · 2018-02-12T17:47:35Z

Thanks for the patch @viirya
As always, I'll request that we put design decisions & long discussions in JIRA so that they are easier to uncover. It can also be good to get quick feedback about design before implementation. I'll comment in JIRA.

viirya · 2018-02-13T01:24:48Z

@jkbradley Thanks! I will post the problem and proposed design on the JIRA.

viirya · 2018-02-13T05:11:41Z

I'll close this in favor of the quick fix #20594, based on the discussion in JIRA. I'll re-open it if it is needed later.

mgaido91 reviewed Feb 10, 2018

Only save user-supplied params.

3b5e7c6

viirya force-pushed the SPARK-23377 branch from 7785cac to 3b5e7c6 Compare February 10, 2018 09:50

Fix mima.

6228006

Move default params to base trait.

daceafe

Fix mima.

c1fb657

viirya closed this Feb 13, 2018

viirya deleted the SPARK-23377 branch December 27, 2023 18:21
