[SPARK-13449] Naive Bayes wrapper in SparkR #11486
Conversation
|
test it please |
|
Test build #52376 has finished for PR 11486 at commit
|
|
Test build #52382 has finished for PR 11486 at commit
|
|
Labels of ML |
|
I can see from mllib.NaiveBayes that the labels are sorted. But what if they are not 0-based or not continuous? Say, |
|
It's a good question! It's possible that the labels of the input dataset are not 0-based or not continuous. So we should use |
|
I think it works. I'll try to add it later. |
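To make the concern above concrete, here is a minimal sketch in plain Scala (no Spark dependency) of encoding labels that are neither 0-based nor continuous as 0-based indices, and of recovering the raw labels afterwards. The object name, the helper names `buildIndex` and `originalLabels`, and the sample labels are all hypothetical and chosen for illustration; the sorted ordering mirrors the remark about mllib.NaiveBayes and is not a claim about how any Spark transformer orders labels.

```scala
object LabelIndexing {
  /** Maps each distinct raw label to a 0-based index, in sorted label order. */
  def buildIndex(rawLabels: Seq[Double]): Map[Double, Int] =
    rawLabels.distinct.sorted.zipWithIndex.toMap

  /** Inverse lookup table: position i holds the raw label encoded as index i. */
  def originalLabels(index: Map[Double, Int]): Array[Double] =
    index.toSeq.sortBy(_._2).map(_._1).toArray

  def main(args: Array[String]): Unit = {
    // Hypothetical labels: not 0-based, with gaps between values.
    val raw = Seq(5.0, 2.0, 9.0, 2.0, 5.0)

    val index = buildIndex(raw)
    val labels = originalLabels(index)

    // Encode to indices, then decode back to the raw labels.
    val encoded = raw.map(index)
    val decoded = encoded.map(i => labels(i))

    println(encoded.mkString(", "))
    println(decoded.mkString(", "))
  }
}
```

Keeping the inverse table next to the model is what lets predictions be reported as raw labels instead of encoded indices.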
|
Test build #52574 has finished for PR 11486 at commit
|
|
For this PR, I extract labels manually from labelCol. But I still don't think it's good to assume up front that labels are 0-based and continuous, like |
|
retest it please |
|
retest this please |
|
Test build #52713 has finished for PR 11486 at commit
|
|
Test build #52723 has finished for PR 11486 at commit
|
R/pkg/R/mllib.R
Outdated
| #' | ||
| #' Fit a naive Bayes model, similarly to R's naiveBayes() except for omitting two arguments 'subset' | ||
| #' and 'na.action'. Users can use 'subset' function and 'fillna' or 'na.omit' function of DataFrame, | ||
| #' respectviely, to preprocess their DataFrame. We use na.omit in this interface to avoid potential |
typo: respectively
|
From SparkR test failure: |
|
Test build #53019 has finished for PR 11486 at commit
|
| }) | ||
|
|
||
| test_that("naiveBayes", { | ||
| training <- suppressWarnings(createDataFrame(sqlContext, iris)) |
iris is not a good dataset for naive Bayes. @yinxusen Could you take a look at other base datasets that come with R? https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html It would be great if we can find one with categorical labels and count data. Otherwise, we can make a really simple one here.
Previously I used the HouseVotes84 data because e1071::naiveBayes uses it. But if I use it, the test bed would need to have the mlbench package installed.
Then we can create a really small dataset with 3 categories and some count data. We can also verify against e1071::naiveBayes output.
Sure.
|
About labels, I think we should output the raw labels as predictions instead of the encoded indices. It is hard to extract the feature metadata in SparkR. |
|
I'll try to extract raw labels. |
|
@mengxr One more thing, could you take a look at https://issues.apache.org/jira/browse/SPARK-13641? If we extract feature names from the RFormulaModel transformed data, then for categorical data, we can only extract transformed feature names like I said in that JIRA. Do you think it's OK to extract those names? |
| @Since("1.6.0") | ||
| override def write: MLWriter = new NaiveBayesModel.NaiveBayesModelWriter(this) | ||
|
|
||
| private var featureNames: Option[Array[String]] = None |
@mengxr I removed the previous NaiveBayesSummary and added featureNames and labelNames, because these two variables need to be accessible from NaiveBayesModel.
|
Test build #53386 has finished for PR 11486 at commit
|
|
Test build #53387 has finished for PR 11486 at commit
|
| /** | ||
| * Get the original array of labels if exists. | ||
| */ | ||
| private[ml] def getOriginalLabels: Option[Array[String]] = { |
Should we add an IndexToString transformer at the end of the PipelineModel? I think it would be more general. Other functions, such as glm with the "binomial" family, should do the same.
I'm rewriting it now.
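For readers following the suggestion, a rough sketch of a pipeline that ends with an IndexToString stage might look like the following. It is not the code from this PR: it assumes the Spark 2.x or later Scala API (`SparkSession`, `org.apache.spark.ml.linalg`), the toy data and column names are hypothetical, and `StringIndexerModel.labels` is used to supply the raw labels (newer releases also expose `labelsArray`).

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object IndexToStringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("IndexToStringSketch")
      .getOrCreate()

    // Hypothetical toy data: string labels with small count features.
    val training = spark.createDataFrame(Seq(
      ("yes", Vectors.dense(3.0, 0.0)),
      ("yes", Vectors.dense(2.0, 1.0)),
      ("no", Vectors.dense(0.0, 4.0)),
      ("no", Vectors.dense(1.0, 3.0))
    )).toDF("label", "features")

    // Encode raw string labels as 0-based indices.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(training)

    val naiveBayes = new NaiveBayes()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")

    // Final stage: map predicted indices back to the raw labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val model = new Pipeline()
      .setStages(Array(labelIndexer, naiveBayes, labelConverter))
      .fit(training)

    model.transform(training).select("label", "predictedLabel").show()
    spark.stop()
  }
}
```

Placing the conversion inside the pipeline means any wrapper built on the fitted PipelineModel gets raw-label predictions without keeping a separate label array on the model.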
|
Test build #53622 has finished for PR 11486 at commit
|
|
@mengxr @yanboliang Since the |
|
Test build #53627 has finished for PR 11486 at commit
|
|
Test build #53629 has finished for PR 11486 at commit
|
What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13449
Add a Naive Bayes wrapper in SparkR, with predict, naiveBayes, summary.
How was this patch tested?
Tested with SparkR unit tests.