Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Feb 10, 2018

What changes were proposed in this pull request?

Since 2.3, Bucketizer supports multiple input/output columns. We will check if exclusive params are set during transformation. E.g., if inputCols and outputCol are both set, an error will be thrown.

However, when we write Bucketizer, looks like the default params and user-supplied params are merged during writing. All saved params are loaded back and set to created model instance. So the default outputCol param in HasOutputCol trait will be set in paramMap and become an user-supplied param. That makes the check of exclusive params failed.

This patch changes DefaultParamsWriter and only save user-supplied params.

The multi-column QuantileDiscretizer also has the same issue.

How was this patch tested?

Modified test.

@SparkQA
Copy link

SparkQA commented Feb 10, 2018

Test build #87283 has finished for PR 20566 at commit 7785cac.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@mgaido91 mgaido91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just wondering whether we should persist the default params too (in case they are changed across multiple versions) but in a separate section. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just make paramMap private[ml]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way are good for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this way I think you can also avoid the MiMa failure...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it still can't avoid the MiMa failure.

@viirya
Copy link
Member Author

viirya commented Feb 10, 2018

@mgaido91 I also considered the issue of changed default values across versions. I'm not sure which is more reasonable, using old version's default value or using current version's default value.

@mgaido91
Copy link
Contributor

@viirya that's a good question. Honestly my idea is that if the user doesn't set a value, he/she doesn't care about it, so it is good to use the new version default IMHO. But it is also true that changing a default may cause unexpected behavior in user code.

So, it LGTM, but I'd like to hear others' opinion on this too.

@viirya
Copy link
Member Author

viirya commented Feb 10, 2018

Yeah, IMHO, when the user loads a model from old version into new version to run, I think it is reasonable to run it with current default value because the param is not explicitly set and should use "default" value of current system.

Thanks for your comment. Let's wait for others' option.

@SparkQA
Copy link

SparkQA commented Feb 10, 2018

Test build #87285 has finished for PR 20566 at commit 3b5e7c6.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2018

Test build #87289 has finished for PR 20566 at commit 6228006.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MrBago
Copy link
Contributor

MrBago commented Feb 11, 2018

I believe this will break persistence for LogisticRegression. I believe the issue is that the threshold param on LogisticRegressionModel doesn't get a default directly, but only gets it during the call to fit on LogisticRegression. This is currently fine because the Model can only be created by fitting or by being read from disk and in both case some value gets set for threshold. With this change that's no longer the case. Here's a test to confirm, 5db2108.

I believe LinearRegression may have a similar issue.

Our current tests don't seem to cover this kind of thing so I think we should improve test coverage if we want to make this kind of change.

@viirya
Copy link
Member Author

viirya commented Feb 11, 2018

Not only threshold, the default params of NaiveBayes, LogisticRegression (maybe more, I'm looking up now) are all set in the estimator, not in their model. The models are received the default values at the end of fit.

@SparkQA
Copy link

SparkQA commented Feb 11, 2018

Test build #87299 has finished for PR 20566 at commit daceafe.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 11, 2018

Test build #87301 has finished for PR 20566 at commit c1fb657.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Feb 11, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Feb 11, 2018

Test build #87302 has finished for PR 20566 at commit c1fb657.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Feb 11, 2018

cc @MLnick @jkbradley

@jkbradley
Copy link
Member

Thanks for the patch @viirya
As always, I'll request that we put design decisions & long discussions in JIRA so that they are easier to uncover. It can also be good to get quick feedback about design before implementation. I'll comment in JIRA.

@viirya
Copy link
Member Author

viirya commented Feb 13, 2018

@jkbradley Thanks! I will post the problem and proposed design on the JIRA.

@viirya
Copy link
Member Author

viirya commented Feb 13, 2018

I'd close this and favor the quick fix #20594 based on the discussion in JIRA. Will re-open it if it is needed later.

@viirya viirya closed this Feb 13, 2018
@viirya viirya deleted the SPARK-23377 branch December 27, 2023 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants