@mmenestret
Contributor

…CrossValidator fold

The fold split in the ML CrossValidator depends on a rand whose seed is set to 0, which leads the sql.functions rand to call sc._jvm.functions.rand() with no seed.
To make a cross-validation unit-testable, it would be a good idea to allow setting this seed, so that the output of the cross-validation (with featureSubsetStrategy set to "all") is always the same.
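
As an illustration of why a settable seed makes the folds reproducible, here is a minimal pure-Python sketch. It is not the Spark implementation: `assign_folds` and its parameters are hypothetical names, and the standard `random` module stands in for the random column used to split the data.

```python
import random


def assign_folds(n_rows, num_folds, seed=None):
    """Return one (train_indices, validation_indices) pair per fold.

    Hypothetical helper: each row gets one uniform draw in [0, 1),
    and the draw decides which fold validates that row.
    """
    # seed=None lets Python pick a seed (not reproducible);
    # any integer, including 0, gives a reproducible sequence.
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n_rows)]

    # Map each draw to a fold index; min() guards the upper edge.
    fold_of = [min(int(r * num_folds), num_folds - 1) for r in draws]

    folds = []
    for i in range(num_folds):
        validation = [j for j, f in enumerate(fold_of) if f == i]
        train = [j for j, f in enumerate(fold_of) if f != i]
        folds.append((train, validation))
    return folds
```

With a fixed seed, two calls return identical folds, which is the property a unit test needs. Note that the sketch treats 0 as a valid seed and only `None` as "unseeded".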

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Aug 18, 2015

That seems reasonable. @davies?

@davies
Contributor

davies commented Aug 18, 2015

LGTM. cc @mengxr Should we merge this into 1.5?

Member

Please update this line too.

@jkbradley
Member

Please make CrossValidator inherit from HasSeed.
Also, update setParams.
See RandomForestClassifier or other existing classes for examples.
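
A minimal pure-Python sketch of the mixin pattern this request describes. `HasSeedSketch` and `CrossValidatorSketch` are hypothetical stand-ins written for illustration; they are not the PySpark classes and do not reproduce their param machinery.

```python
class HasSeedSketch(object):
    """Hypothetical stand-in for a shared 'seed' param mixin."""

    _seed = None

    def setSeed(self, value):
        self._seed = value
        return self

    def getSeed(self):
        return self._seed


class CrossValidatorSketch(HasSeedSketch):
    """Hypothetical validator that gets its seed param by inheritance."""

    def __init__(self, numFolds=3, seed=None):
        self._numFolds = 3  # default before any params are applied
        self.setParams(numFolds=numFolds, seed=seed)

    def setParams(self, numFolds=None, seed=None):
        # Only overwrite the params that were actually passed.
        if numFolds is not None:
            self._numFolds = numFolds
        if seed is not None:
            self.setSeed(seed)
        return self

    def getNumFolds(self):
        return self._numFolds
```

Inheriting the seed accessors keeps them consistent across classes, and routing the seed through `setParams` lets callers set it at construction time or later.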

@mengxr
Contributor

mengxr commented Sep 8, 2015

@mmenestret We should first add seed to Scala's CrossValidator and then add wrappers to Python. You can use https://github.com/apache/spark/blob/master/python/pyspark/ml/clustering.py#L37 as a reference implementation.

Contributor

This is also not necessary if we make CrossValidator extend HasSeed.

@jkbradley
Member

Ping!

@jkbradley
Member

Ping. Please let me know if you don't have time to push an update.

@mmenestret
Contributor Author

I'm sorry, I'm on holiday right now. I'll be back in a week. Is that OK?

Sent from my HTC

----- Reply message -----
From: "jkbradley" [email protected]
To: "apache/spark" [email protected]
Cc: "mmenestret" [email protected]
Subject: [spark] SPARK-9690 Adding the possibility to set the seed of the rand in the … (#7997)
Date: Tue, Sept. 22, 2015 10:39

Ping. Please let me know if you don't have time to push an update.

@jkbradley
Member

Oh OK no problem. Enjoy your holiday!

@jkbradley
Member

Ping! Btw, the 1.6 code freeze is scheduled for the end of this month.

@jkbradley
Member

@mmenestret I'm going to create a new PR based on yours. You'll still be the primary author.

Please close this issue/PR, thanks.

asfgit pushed a commit that referenced this pull request Dec 16, 2015
Extend CrossValidator with HasSeed in PySpark.

This PR replaces [#7997]

CC: yanboliang thunterdb mmenestret  Would one of you mind taking a look?  Thanks!

Author: Joseph K. Bradley <[email protected]>
Author: Martin MENESTRET <[email protected]>

Closes #10268 from jkbradley/pyspark-cv-seed.
@asfgit asfgit closed this in ce5fd40 Dec 17, 2015