Skip to content

Conversation

@yinxusen
Copy link
Contributor

@yinxusen yinxusen commented Mar 3, 2016

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13449

Add a Naive Bayes wrapper in SparkR, with predict, naiveBayes, summary.

How was this patch tested?

Test with sparkR unit test.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 3, 2016

test it please

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52376 has finished for PR 11486 at commit 26d38e1.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52382 has finished for PR 11486 at commit a07beb2.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

Labels of ML NaiveBayesModel are sorted(FYI #7284), so we do not need to store it as member variable. Then it can pass the binary compatibility check.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 3, 2016

I can see from the mllib.NaiveBayes that the labels are sorted. But how about if it is not 0 based or not continuous? Say, 1.0, 3.0, 5.0, .... It has no effect on training/prediction, but has one on the summary. Otherwise I have to extract the labels from its labelCol again.

@yanboliang
Copy link
Contributor

It's a good question! It's possible that the label of input dataset is not 0 based or not continuous. So we should use StringIndexer to index label in [0, numLabels), and after training we use IndexToString to map index label to the original ones. We have already store the label map in the metadata of transformed label column.
All the models under ML package will follow this rule. For examples, if you train LogisticRegression with the input label "-1, +1", it will produce erroneous results. You should use StringIndexer to transform labels to "0, 1" firstly. cc @jkbradley

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 4, 2016

I think it works. I'll try to add it later.

@SparkQA
Copy link

SparkQA commented Mar 7, 2016

Test build #52574 has finished for PR 11486 at commit 1a685e1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 7, 2016

@yanboliang @jkbradley

For this PR, I extract labels manually from labelCol. But I still don't think it's good to make assumption first for labels to be 0-based and continuous like 0.0, 1.0, 2.0, .... Sure we can use a StringIndexer to re-index the labels if it does not match the assumption. But checking it is not efficient. I suggest keeping labels in ml.NaiveBayes.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 7, 2016

retest it please

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 9, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Mar 9, 2016

Test build #52713 has finished for PR 11486 at commit 1a685e1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 9, 2016

Test build #52723 has finished for PR 11486 at commit 30e9c37.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

R/pkg/R/mllib.R Outdated
#'
#' Fit a naive Bayes model, similarly to R's naiveBayes() except for omitting two arguments 'subset'
#' and 'na.action'. Users can use 'subset' function and 'fillna' or 'na.omit' function of DataFrame,
#' respectviely, to preprocess their DataFrame. We use na.omit in this interface to avoid potential
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: respectively

@felixcheung
Copy link
Member

From SparkR test failure:

1. Error: naiveBayes -----------------------------------------------------------
there is no package called 'mlbench'
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: data(HouseVotes84, package = "mlbench") at test_mllib.R:146
5: find.package(package, lib.loc, verbose = verbose)
6: stop(gettextf("there is no package called %s", sQuote(pkg)), domain = NA)
Error: Test failures

@SparkQA
Copy link

SparkQA commented Mar 13, 2016

Test build #53019 has finished for PR 11486 at commit 9991e79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

})

test_that("naiveBayes", {
training <- suppressWarnings(createDataFrame(sqlContext, iris))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iris is not a good dataset for naive Bayes. @yinxusen Could you take a look at other base datasets that come with R? https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html It would be great if we can find one with categorical labels and count data. Otherwise, we can make a really simple one here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously I use the HouseVote84 data because e1071::naiveBayes use that. But if I use it, then the testband should have mlbench package installed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we can create a really small dataset with 3 categories and some count data. We can also verify against e1071::naiveBayes output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mengxr
Copy link
Contributor

mengxr commented Mar 16, 2016

About labels, I think we should output the raw labels as predictions instead of the encoded indices. It is hard to extract the feature metadata in SparkR.

@yinxusen
Copy link
Contributor Author

I'll try to extract raw labels.

@yinxusen
Copy link
Contributor Author

@mengxr One more thing, could you take a look at https://issues.apache.org/jira/browse/SPARK-13641? If we extract feature names from the RFormulaModel transformed data, then for categorical data, we can only extract transformed feature names like I said in that JIRA. Do you think it's OK to extract those names?

@Since("1.6.0")
override def write: MLWriter = new NaiveBayesModel.NaiveBayesModelWriter(this)

private var featureNames: Option[Array[String]] = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mengxr I remove the previous NaiveBayesSummary and add these two featureNames and labelNames because we need these two variables to be accessed from NaiveBayesModel.

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53386 has finished for PR 11486 at commit 8e21393.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 17, 2016

Test build #53387 has finished for PR 11486 at commit 90b6ad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Get the original array of labels if exists.
*/
private[ml] def getOriginalLabels: Option[Array[String]] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a IndexToString transformer at the end of the PipelineModel? I think it would be more general. Other functions such as glm with "binomial" family should also do the same work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rewriting it now.

@SparkQA
Copy link

SparkQA commented Mar 20, 2016

Test build #53622 has finished for PR 11486 at commit b4ee1aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

@mengxr @yanboliang Since the ml.NaiveBayes making the assumption that its input data's label is 0-based indices, we should add a StringIndexer for labels after RFormula if the input data's label column is not a string column because the RFormula doesn't handle numerical labels. Now we can extract labels from the final IndexToString.

@SparkQA
Copy link

SparkQA commented Mar 20, 2016

Test build #53627 has finished for PR 11486 at commit 87fa0aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 20, 2016

Test build #53629 has finished for PR 11486 at commit 3d291de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Mar 22, 2016

@yinxusen I checked the implementation in e1071 and found that it supports both categorical and continuous features, which in MLlib we only support categorical features. So I updated the implementation with some refactor in #11890. Could you help review that PR? Thanks!

@asfgit asfgit closed this in d6dc12e Mar 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants