[SPARK-19969] [ML] Imputer doc and example #17324
|
Test build #74689 has finished for PR 17324 at commit
|
|
Test build #74690 has finished for PR 17324 at commit
|
|
Will take a look this week - also we may want to add the Python example here once I merge #17316 |
docs/ml-features.md
Outdated
|
|
||
| By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from | ||
| other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0 | ||
| and 4.0 respectively. After transformation, the output columns will not contain missing value anymore. |
Perhaps "After transformation, the missing values in the output columns will be replaced by the surrogate value for that column"?
| import org.apache.spark.ml.feature.Imputer | ||
| // $example off$ | ||
| import org.apache.spark.sql.SparkSession | ||
|
|
Most examples have a small doc string that includes a "Run with:" part - see e.g. the recent MinHashLSHExample (this should also be added for the Java example)
| .getOrCreate() | ||
|
|
||
| // $example on$ | ||
| val df = spark.createDataFrame( Seq( |
Nit: Space in ( Seq( should be removed
|
|
||
| /** Validates and transforms the input schema. */ | ||
| protected def validateAndTransformSchema(schema: StructType): StructType = { | ||
| require(get(inputCols).isDefined, "Input cols must be defined first.") |
As I mentioned in #17316, is this really required? A param that has not been set will in any case throw an exception during transformSchema (or fit, or transform) with "no default value found".
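The point about the redundant `require` can be illustrated with a small, self-contained sketch. `ToyParams` and `validate_and_transform_schema` below are hypothetical stand-ins written for this note, not Spark's `Params` API, and the error text simply echoes the wording quoted in the comment. The sketch only shows the general pattern: if reading an unset param already fails, a separate "is it defined?" guard adds nothing.

```python
class ToyParams:
    """Hypothetical params container; a stand-in, not Spark's Params API."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})
        self._values = {}

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        # An explicitly set value wins, then the default, otherwise fail loudly.
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"no default value found for param {name}")


def validate_and_transform_schema(params, schema):
    """Return the output column list, with no separate 'is it set?' guard."""
    input_cols = params.get_or_default("inputCols")    # raises if unset
    output_cols = params.get_or_default("outputCols")  # raises if unset
    unknown = [c for c in input_cols if c not in schema]
    if unknown:
        raise ValueError(f"input columns not in schema: {unknown}")
    return list(schema) + list(output_cols)


# Unset params: the lookup itself reports the problem.
try:
    validate_and_transform_schema(ToyParams(), ["a", "b"])
    error_message = None
except KeyError as exc:
    error_message = exc.args[0]

# Fully configured params: validation passes and output columns are appended.
configured = ToyParams().set("inputCols", ["a", "b"]).set("outputCols", ["out_a", "out_b"])
result = validate_and_transform_schema(configured, ["a", "b"])

print(error_message)
print(result)
```

Whether the explicit check is still worth keeping for a friendlier message is a judgment call that this sketch does not settle.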
docs/ml-features.md
Outdated
|
|
||
| ## Imputer | ||
|
|
||
| Imputation transformer for completing missing values in the dataset, either using the mean or the |
Maybe something like "The Imputer transformer completes missing values in ..."
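As a rough illustration of the "mean or median" choice described in the doc text, here is a plain-Python sketch of how a per-column surrogate could be computed. The function name `surrogate`, the sample column, and the use of an exact median from the standard library are all assumptions made for this note; Spark's own median computation may differ (for example, it may be approximate).

```python
import math
import statistics


def surrogate(column, strategy="mean"):
    """Compute the replacement value for one column, ignoring NaN entries."""
    present = [v for v in column if not math.isnan(v)]
    if strategy == "mean":
        return sum(present) / len(present)
    if strategy == "median":
        # Exact median; an assumption of this sketch, not Spark's algorithm.
        return statistics.median(present)
    raise ValueError(f"unsupported strategy: {strategy!r}")


# Hypothetical column with one missing value and one outlier.
column = [1.0, 2.0, math.nan, 3.0, 100.0, 4.0]

mean_value = surrogate(column, "mean")
median_value = surrogate(column, "median")
print(mean_value, median_value)
```

With an outlier such as `100.0` in the column, the two strategies give quite different surrogates, which is the usual reason to offer both.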
docs/ml-features.md
Outdated
|
|
||
| Imputation transformer for completing missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of | ||
| DoubleType or FloatType. Currently Imputer does not support categorical features and possibly |
Backticks for DoubleType and FloatType
docs/ml-features.md
Outdated
| Imputation transformer for completing missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of | ||
| DoubleType or FloatType. Currently Imputer does not support categorical features and possibly | ||
| creates incorrect values for a categorical feature. All Null values in the input column are |
Perhaps on a new line:
Note all null values in the input column ...
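The caveat in the quoted doc text about categorical features can be made concrete with a short sketch. The label-to-index encoding and the sample column below are hypothetical, chosen only to show why a mean surrogate may be meaningless for a column that holds category indices.

```python
import math

# Hypothetical label-to-index encoding; the mapping is an assumption of this sketch.
category_index = {"red": 0.0, "blue": 1.0}
index_category = {index: label for label, index in category_index.items()}

# Encoded categorical column with one missing entry.
encoded = [0.0, 1.0, math.nan, 1.0, 0.0]

# Mean of the non-missing indices, as a mean-based imputation would use.
present = [v for v in encoded if not math.isnan(v)]
mean_surrogate = sum(present) / len(present)

# The surrogate does not correspond to any category in the encoding.
is_valid_category = mean_surrogate in index_category
print(mean_surrogate, is_valid_category)
```

Here the surrogate falls between two category indices, so the imputed rows would carry a value that maps to no label at all.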
docs/ml-features.md
Outdated
| 5.0 | 5.0 | ||
| ~~~ | ||
|
|
||
| By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from |
Perhaps "In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) with the mean (the default imputation strategy) from the other values in the corresponding columns".
docs/ml-features.md
Outdated
| ~~~ | ||
|
|
||
| By default, Imputer will replace all the `Double.NaN` (missing value) with the mean (strategy) from | ||
| other values in the corresponding columns. In our example, the surrogates for `a` and `b` are 3.0 |
In this example, the surrogate values for columns a and b are ...
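To make the surrogate values concrete, the following plain-Python sketch mimics the described default behaviour: `NaN` marks a missing value and the mean of the remaining values replaces it. The helper names and the sample rows are assumptions made for this note (the rows were picked so that the surrogates come out as the 3.0 and 4.0 mentioned in the doc text); this is not Spark code.

```python
import math


def is_missing(value, missing_value=math.nan):
    """NaN never equals itself, so it needs a dedicated check."""
    if math.isnan(missing_value):
        return math.isnan(value)
    return value == missing_value


def impute_mean(column, missing_value=math.nan):
    """Return (surrogate, imputed column) using the mean of the present values."""
    present = [v for v in column if not is_missing(v, missing_value)]
    surrogate = sum(present) / len(present)
    imputed = [surrogate if is_missing(v, missing_value) else v for v in column]
    return surrogate, imputed


# Hypothetical sample data for columns a and b.
a = [1.0, 2.0, math.nan, 4.0, 5.0]
b = [math.nan, math.nan, 3.0, 4.0, 5.0]

surrogate_a, out_a = impute_mean(a)
surrogate_b, out_b = impute_mean(b)
print(surrogate_a, out_a)
print(surrogate_b, out_b)
```

The sketch assumes each column has at least one non-missing value; an all-missing column would raise `ZeroDivisionError` here.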
|
Generally looks fine - made a few small comments. |
|
Test build #75031 has started for PR 17324 at commit |
|
Jenkins retest this please |
|
Test build #75058 has finished for PR 17324 at commit
|
|
Test build #75059 has finished for PR 17324 at commit
|
|
Test build #75197 has finished for PR 17324 at commit
|
|
Updated with Python example. |
MLnick
left a comment
A few mostly minor comments.
One missing thing is to include the Python example in user guide.
docs/ml-features.md
Outdated
|
|
||
| ## Imputer | ||
|
|
||
| The `Imputer` transformer completes missing values in the dataset, either using the mean or the |
"values in the dataset" -> "values in a dataset"
docs/ml-features.md
Outdated
| ## Imputer | ||
|
|
||
| The `Imputer` transformer completes missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of |
"value" -> "values"
docs/ml-features.md
Outdated
| The `Imputer` transformer completes missing values in the dataset, either using the mean or the | ||
| median of the columns in which the missing value are located. The input columns should be of | ||
| `DoubleType` or `FloatType`. Currently `Imputer` does not support categorical features and possibly | ||
| creates incorrect values for a categorical feature. |
"... creates incorrect values for columns containing categorical features."
docs/ml-features.md
Outdated
|
|
||
| **Examples** | ||
|
|
||
| Suppose that we have a DataFrame with the column `a` and `b`: |
columns
docs/ml-features.md
Outdated
| 5.0 | 5.0 | ||
| ~~~ | ||
|
|
||
| In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) |
backticks around Double.NaN
| Dataset<Row> df = spark.createDataFrame(data, schema); | ||
|
|
||
| Imputer imputerModel = new Imputer() | ||
| .setStrategy("mean") |
Since we're using defaults we can remove the setStrategy call in all examples.
For the example code, can we keep it to introduce the primary API or important parameters?
It's not a big deal - still I think it's not necessary to illustrate setStrategy("mean") as we already mention in the user guide what the defaults are.
| if __name__ == "__main__": | ||
| spark = SparkSession\ | ||
| .builder\ | ||
| .appName("imputer example")\ |
Let's use "PythonImputerExample" to be consistent for app name used in other examples
Sure. For consistency, how about just using "ImputerExample"?
ok
| .getOrCreate() | ||
|
|
||
| # $example on$ | ||
| dataFrame = spark.createDataFrame([ |
dataFrame -> df to be consistent with other examples
| from pyspark.ml.feature import Imputer | ||
| # $example off$ | ||
| from pyspark.sql import SparkSession | ||
|
|
While I see that not all Python examples have it, let's add the comment here too:
"""
An example demonstrating Imputer.
Run with:
bin/spark-submit examples/src/main/python/ml/imputer_example.py
"""
| @@ -0,0 +1,46 @@ | |||
| # | |||
Prefer filename imputer_example.py to be consistent with other Python examples for ML
|
Test build #75271 has finished for PR 17324 at commit
|
| }); | ||
| Dataset<Row> df = spark.createDataFrame(data, schema); | ||
|
|
||
| Imputer imputerModel = new Imputer() |
Sorry just noticed this imputerModel here and model below. Let's call it imputer and model.
Thanks for finding this.
| ], ["a", "b"]) | ||
|
|
||
| imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"]) | ||
| imputerModel = imputer.fit(df) |
just model
| imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"]) | ||
| imputerModel = imputer.fit(df) | ||
|
|
||
| imputedData = imputerModel.transform(df) |
In the other examples we just do model.transform(df).show() so let's be consistent.
MLnick
left a comment
A few minor clean up points, then I think it should be ready.
|
Test build #75346 has started for PR 17324 at commit |
|
The test was interrupted and needs a retest. |
|
Jenkins retest this please |
|
Test build #75383 has finished for PR 17324 at commit
|
| imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"]) | ||
| model = imputer.fit(df) | ||
|
|
||
| model.transform(df).select("a", "b", "out_a", "out_b").show() |
In previous comment I wasn't totally clear, sorry! I mean let's only have the transform(df).show() - so we can remove the select here as it's unnecessary.
MLnick
left a comment
One last tweak to Python example.
LGTM pending that.
|
Test build #75396 has finished for PR 17324 at commit
|
|
Viewed generated docs and ran examples locally. 👍 Merged to master. Thanks! |
What changes were proposed in this pull request?
Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. The Python example will be added after #17316.
How was this patch tested?
local doc generation and example execution