[SPARK-13449] Naive Bayes wrapper in SparkR #11486

yinxusen · 2016-03-03T06:09:05Z

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13449

Add a Naive Bayes wrapper in SparkR, with predict, naiveBayes, summary.

How was this patch tested?

Tested with SparkR unit tests.

yinxusen · 2016-03-03T06:10:08Z

test it please

SparkQA · 2016-03-03T06:27:25Z

Test build #52376 has finished for PR 11486 at commit 26d38e1.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-03T07:39:33Z

Test build #52382 has finished for PR 11486 at commit a07beb2.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-03-03T10:21:27Z

Labels of the ML NaiveBayesModel are sorted (FYI #7284), so we do not need to store them as a member variable. Then it can pass the binary compatibility check.

yinxusen · 2016-03-03T15:27:15Z

I can see from mllib.NaiveBayes that the labels are sorted. But what if they are not 0-based or not consecutive? Say, 1.0, 3.0, 5.0, .... That has no effect on training/prediction, but it does affect the summary. Otherwise I have to extract the labels from the labelCol again.

yanboliang · 2016-03-04T04:01:05Z

It's a good question! It's possible that the labels of the input dataset are not 0-based or not consecutive. So we should use StringIndexer to index the labels into [0, numLabels), and after training use IndexToString to map the indexed labels back to the original ones. We already store the label map in the metadata of the transformed label column.
All the models under the ML package will follow this rule. For example, if you train LogisticRegression with the input labels "-1, +1", it will produce erroneous results. You should use StringIndexer to transform the labels to "0, 1" first. cc @jkbradley
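To make the index/restore round trip above concrete, here is a minimal dependency-free Python sketch. It is not Spark code: `fit_label_index`, `to_index`, and `to_label` are hypothetical names for this illustration. It assumes the most frequent label gets index 0, as with StringIndexer's default ordering; the tie-breaking rule (by label string) is an assumption made only for this sketch.

```python
from collections import Counter


def fit_label_index(labels):
    """Return the distinct labels as strings; list position is the index.

    Assumption: labels are ordered by descending frequency, and ties are
    broken by label string (the tie rule is specific to this sketch).
    """
    counts = Counter(str(label) for label in labels)
    return sorted(counts, key=lambda label: (-counts[label], label))


def to_index(labels, vocab):
    """Map raw labels to indices in [0, len(vocab))."""
    lookup = {label: float(i) for i, label in enumerate(vocab)}
    return [lookup[str(label)] for label in labels]


def to_label(indices, vocab):
    """Map indices (for example, predictions) back to the original labels."""
    return [vocab[int(i)] for i in indices]


# Labels that are neither 0-based nor consecutive, as in the question above.
raw = [1.0, 3.0, 5.0, 3.0, 3.0, 5.0]
vocab = fit_label_index(raw)
indexed = to_index(raw, vocab)
restored = to_label(indexed, vocab)
```

The model only ever sees `indexed`; the summary and predictions can report `restored`, so the original labels never need to be stored on the model itself.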

yinxusen · 2016-03-04T23:53:26Z

I think it works. I'll try to add it later.

SparkQA · 2016-03-07T18:44:08Z

Test build #52574 has finished for PR 11486 at commit 1a685e1.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

yinxusen · 2016-03-07T18:49:04Z

@yanboliang @jkbradley

For this PR, I extract labels manually from the labelCol. But I still don't think it's good to assume up front that labels are 0-based and consecutive, like 0.0, 1.0, 2.0, .... Sure, we can use a StringIndexer to re-index the labels if they do not match the assumption, but checking that is not efficient. I suggest keeping labels in ml.NaiveBayes.

yinxusen · 2016-03-07T18:50:28Z

retest it please

yinxusen · 2016-03-09T01:23:31Z

retest this please

SparkQA · 2016-03-09T01:35:42Z

Test build #52713 has finished for PR 11486 at commit 1a685e1.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-09T05:22:54Z

Test build #52723 has finished for PR 11486 at commit 30e9c37.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-03-09T05:45:19Z

R/pkg/R/mllib.R

+#'
+#' Fit a naive Bayes model, similarly to R's naiveBayes() except for omitting two arguments 'subset'
+#' and 'na.action'. Users can use 'subset' function and 'fillna' or 'na.omit' function of DataFrame,
+#' respectviely, to preprocess their DataFrame. We use na.omit in this interface to avoid potential


typo: respectively

felixcheung · 2016-03-09T05:49:52Z

From SparkR test failure:

1. Error: naiveBayes -----------------------------------------------------------
there is no package called 'mlbench'
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: data(HouseVotes84, package = "mlbench") at test_mllib.R:146
5: find.package(package, lib.loc, verbose = verbose)
6: stop(gettextf("there is no package called %s", sQuote(pkg)), domain = NA)
Error: Test failures

SparkQA · 2016-03-13T03:59:05Z

Test build #53019 has finished for PR 11486 at commit 9991e79.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-03-16T07:12:56Z

R/pkg/inst/tests/testthat/test_mllib.R

 })
+
+test_that("naiveBayes", {
+  training <- suppressWarnings(createDataFrame(sqlContext, iris))


iris is not a good dataset for naive Bayes. @yinxusen Could you take a look at other base datasets that come with R? https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html It would be great if we can find one with categorical labels and count data. Otherwise, we can make a really simple one here.

Previously I used the HouseVotes84 data because e1071::naiveBayes uses it. But if I use it, then the test bed must have the mlbench package installed.

Then we can create a really small dataset with 3 categories and some count data. We can also verify against e1071::naiveBayes output.

I chose https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/infert.html.
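For illustration, the "really small dataset with 3 categories and some count data" idea could look like the following dependency-free Python sketch. It is a plain multinomial naive Bayes with additive (Laplace) smoothing, not the MLlib or e1071 implementation. The data, the smoothing of the priors, and the names `train_multinomial_nb` and `predict` are all assumptions made for this example.

```python
import math


def train_multinomial_nb(rows, labels, smoothing=1.0):
    """Fit log priors and per-class log feature probabilities from counts."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    model = {}
    for c in classes:
        class_rows = [r for r, l in zip(rows, labels) if l == c]
        # Smoothed class prior (smoothing the prior is a choice of this sketch).
        log_prior = math.log(
            (len(class_rows) + smoothing) / (len(rows) + smoothing * len(classes))
        )
        # Total count of each feature within the class.
        sums = [sum(r[j] for r in class_rows) for j in range(n_features)]
        total = sum(sums) + smoothing * n_features
        log_theta = [math.log((s + smoothing) / total) for s in sums]
        model[c] = (log_prior, log_theta)
    return model


def predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def score(c):
        log_prior, log_theta = model[c]
        return log_prior + sum(x * t for x, t in zip(row, log_theta))
    return max(model, key=score)


# Hypothetical toy data: three categories, three count-valued features.
rows = [[3, 0, 1], [4, 1, 0], [0, 5, 1], [1, 4, 0], [0, 1, 6], [1, 0, 5]]
labels = ["a", "a", "b", "b", "c", "c"]
model = train_multinomial_nb(rows, labels)
```

A dataset this small makes it straightforward to compare the fitted priors and conditional probabilities by hand against another implementation's output.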

mengxr · 2016-03-16T07:15:57Z

About labels, I think we should output the raw labels as predictions instead of the encoded indices. It is hard to extract the feature metadata in SparkR.

yinxusen · 2016-03-16T07:20:08Z

I'll try to extract raw labels.

yinxusen · 2016-03-16T07:22:27Z

@mengxr One more thing, could you take a look at https://issues.apache.org/jira/browse/SPARK-13641? If we extract feature names from the RFormulaModel transformed data, then for categorical data, we can only extract transformed feature names like I said in that JIRA. Do you think it's OK to extract those names?

yinxusen · 2016-03-17T02:26:11Z

mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala

  @Since("1.6.0")
  override def write: MLWriter = new NaiveBayesModel.NaiveBayesModelWriter(this)
+
+  private var featureNames: Option[Array[String]] = None


@mengxr I removed the previous NaiveBayesSummary and added featureNames and labelNames, because we need these two variables to be accessible from NaiveBayesModel.

SparkQA · 2016-03-17T02:30:26Z

Test build #53386 has finished for PR 11486 at commit 8e21393.

This patch fails R style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-17T03:30:16Z

Test build #53387 has finished for PR 11486 at commit 90b6ad9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-03-18T07:13:08Z

mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala

+  /**
+   * Get the original array of labels if exists.
+   */
+  private[ml] def getOriginalLabels: Option[Array[String]] = {


Should we add an IndexToString transformer at the end of the PipelineModel? I think it would be more general. Other functions, such as glm with the "binomial" family, should do the same.

I'm rewriting it now.

SparkQA · 2016-03-20T04:30:17Z

Test build #53622 has finished for PR 11486 at commit b4ee1aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yinxusen · 2016-03-20T08:16:40Z

@mengxr @yanboliang Since ml.NaiveBayes assumes that the labels of its input data are 0-based indices, we should add a StringIndexer for labels after RFormula if the input data's label column is not a string column, because RFormula doesn't handle numerical labels. Now we can extract labels from the final IndexToString.

SparkQA · 2016-03-20T08:54:57Z

Test build #53627 has finished for PR 11486 at commit 87fa0aa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-20T09:10:17Z

Test build #53629 has finished for PR 11486 at commit 3d291de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-03-22T16:18:11Z

@yinxusen I checked the implementation in e1071 and found that it supports both categorical and continuous features, whereas in MLlib we only support categorical features. So I updated the implementation with some refactoring in #11890. Could you help review that PR? Thanks!

yinxusen added 7 commits February 29, 2016 21:15

runable draft

fb1bca4

refine test and na handler

787f25f

refine getModelName

b66d3e5

remove default interface

a5ab2e6

refine code

9215faf

add summary for NaiveBayes

388e85d

refine

26d38e1

fix bugs

a07beb2

yinxusen added 2 commits March 7, 2016 09:58

revert NaiveBayes labels

afaba4a

refine extracing labels

1a685e1

fix error

30e9c37

felixcheung reviewed Mar 9, 2016

fix nit

9991e79

mengxr reviewed Mar 16, 2016

yinxusen added 3 commits March 16, 2016 16:32

fix nits

6c97cef

remove NaiveBayesModelSummary

721a8b7

add raw label prediction

8e21393

yinxusen reviewed Mar 17, 2016

fix r style

90b6ad9

yanboliang mentioned this pull request Mar 18, 2016

[SPARK-13010] [ML] [SparkR] Implement a simple wrapper of AFTSurvivalRegression in SparkR #11447

Closed

yanboliang reviewed Mar 18, 2016

merge with master

b4ee1aa

add IndexToString to extract labels

87fa0aa

remove useless imports

3d291de

mengxr mentioned this pull request Mar 22, 2016

[SPARK-13449] Naive Bayes wrapper in SparkR #11890

Closed

asfgit closed this in d6dc12e Mar 22, 2016
