Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Mar 16, 2017

What changes were proposed in this pull request?

Add docs and examples for spark.ml.feature.Imputer. Currently scala and Java examples are included. Python example will be added after #17316

How was this patch tested?

local doc generation and example execution

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74689 has finished for PR 17324 at commit f2e7a69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaImputerExample

@SparkQA
Copy link

SparkQA commented Mar 16, 2017

Test build #74690 has finished for PR 17324 at commit ac0683b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Mar 21, 2017

Will take a look this week - also we may want to add the Python example here once I merge #17316


By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from
other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0
and 4.0 respectively. After transformation, the output columns will not contain missing value anymore.
Copy link
Contributor

@MLnick MLnick Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "After transformation, the missing values in the output columns will be replaced by the surrogate value for that column"?

import org.apache.spark.ml.feature.Imputer
// $example off$
import org.apache.spark.sql.SparkSession

Copy link
Contributor

@MLnick MLnick Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most examples have a small doc string that includes a "Run with:" part - see e.g. the recent MinHashLSHExample (this should also be added for the Java example)

.getOrCreate()

// $example on$
val df = spark.createDataFrame( Seq(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Space in ( Seq( should be removed


/** Validates and transforms the input schema. */
protected def validateAndTransformSchema(schema: StructType): StructType = {
require(get(inputCols).isDefined, "Input cols must be defined first.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in #17316, is this really required? Since a non-set param for these will in any case throw an exception during transformSchema (or fit, or transform) with "no default value found"


## Imputer

Imputation transformer for completing missing values in the dataset, either using the mean or the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like "The Imputer transformer completes missing values in ..."


Imputation transformer for completing missing values in the dataset, either using the mean or the
median of the columns in which the missing value are located. The input columns should be of
DoubleType or FloatType. Currently Imputer does not support categorical features and possibly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backticks for DoubleType and FloatType

Imputation transformer for completing missing values in the dataset, either using the mean or the
median of the columns in which the missing value are located. The input columns should be of
DoubleType or FloatType. Currently Imputer does not support categorical features and possibly
creates incorrect values for a categorical feature. All Null values in the input column are
Copy link
Contributor

@MLnick MLnick Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps on a new line:

Note all null values in the input column ...

5.0 | 5.0
~~~

By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) with the mean (the default imputation strategy) from the other values in the corresponding columns".

~~~

By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from
other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0
Copy link
Contributor

@MLnick MLnick Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example, the surrogate values for columns a and b are ...

@MLnick
Copy link
Contributor

MLnick commented Mar 21, 2017

Generally looks fine - made a few small comments.

@SparkQA
Copy link

SparkQA commented Mar 22, 2017

Test build #75031 has started for PR 17324 at commit 4bbe2f7.

@MLnick
Copy link
Contributor

MLnick commented Mar 22, 2017

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Mar 22, 2017

Test build #75058 has finished for PR 17324 at commit 4bbe2f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 22, 2017

Test build #75059 has finished for PR 17324 at commit 8755dde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Mar 24, 2017

@hhbyyh #17316 is merged.

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #75197 has finished for PR 17324 at commit a2e24c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 25, 2017

Updated with python example.

Copy link
Contributor

@MLnick MLnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few mostly minor comments.

One missing thing is to include the Python example in user guide.


## Imputer

The `Imputer` transformer completes missing values in the dataset, either using the mean or the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"values in the dataset" -> "values in a dataset"

## Imputer

The `Imputer` transformer completes missing values in the dataset, either using the mean or the
median of the columns in which the missing value are located. The input columns should be of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"value" -> "values"

The `Imputer` transformer completes missing values in the dataset, either using the mean or the
median of the columns in which the missing value are located. The input columns should be of
`DoubleType` or `FloatType`. Currently `Imputer` does not support categorical features and possibly
creates incorrect values for a categorical feature.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... creates incorrect values for columns containing categorical features."


**Examples**

Suppose that we have a DataFrame with the column `a` and `b`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columns

5.0 | 5.0
~~~

In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backticks around Double.NaN

Dataset<Row> df = spark.createDataFrame(data, schema);

Imputer imputerModel = new Imputer()
.setStrategy("mean")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're using defaults we can remove the setStrategy call in all examples.

Copy link
Contributor Author

@hhbyyh hhbyyh Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the example code, can we keep it to introduce the primary API or important parameters?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal - still I think it's not necessary to illustrate setStrategy("mean") as we already mention in the user guide what the defaults are.

if __name__ == "__main__":
spark = SparkSession\
.builder\
.appName("imputer example")\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use "PythonImputerExample" to be consistent for app name used in other examples

Copy link
Contributor Author

@hhbyyh hhbyyh Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. For consistency, how about just use "ImputerExample"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

.getOrCreate()

# $example on$
dataFrame = spark.createDataFrame([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataFrame -> df to be consistent with other examples

from pyspark.ml.feature import Imputer
# $example off$
from pyspark.sql import SparkSession

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I see that not all Python examples have it, let's add the comment here too:

"""
An example demonstrating Imputer.
Run with:
  bin/spark-submit examples/src/main/python/ml/imputer_example.py
"""

@@ -0,0 +1,46 @@
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer filename imputer_example.py to be consistent with other Python examples for ML

@SparkQA
Copy link

SparkQA commented Mar 27, 2017

Test build #75271 has finished for PR 17324 at commit 7df70b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

});
Dataset<Row> df = spark.createDataFrame(data, schema);

Imputer imputerModel = new Imputer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just noticed this imputerModel here and model below. Let's call it imputer and model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding this.

], ["a", "b"])

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
imputerModel = imputer.fit(df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just model

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
imputerModel = imputer.fit(df)

imputedData = imputerModel.transform(df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other examples we just do model.transform(df).show() so let's be consistent.

Copy link
Contributor

@MLnick MLnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor clean up points, then I think it should be ready.

@SparkQA
Copy link

SparkQA commented Mar 29, 2017

Test build #75346 has started for PR 17324 at commit 48a1361.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Mar 30, 2017

The test was interrupted and need a retest.

@MLnick
Copy link
Contributor

MLnick commented Mar 30, 2017

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Mar 30, 2017

Test build #75383 has finished for PR 17324 at commit 48a1361.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)

model.transform(df).select("a", "b", "out_a", "out_b").show()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In previous comment I wasn't totally clear, sorry! I mean let's only have the transform(df).show() - so we can remove the select here as it's unnecessary.

Copy link
Contributor

@MLnick MLnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last tweak to Python example.

LGTM pending that.

@SparkQA
Copy link

SparkQA commented Mar 30, 2017

Test build #75396 has finished for PR 17324 at commit e17f997.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Apr 3, 2017

Viewed generated docs and ran examples locally.

👍

Merged to master. Thanks!

@asfgit asfgit closed this in 4d28e84 Apr 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants