44 changes: 43 additions & 1 deletion mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -26,7 +26,7 @@ import org.apache.spark.annotation.{Experimental, Since}
import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap}
import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap, ParamValidators}
import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
import org.apache.spark.ml.util._
import org.apache.spark.sql.{DataFrame, Dataset}
@@ -37,6 +37,42 @@ import org.apache.spark.sql.types._
 */
private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

  /**
   * Param for how to order categories of a string FEATURE column used by `StringIndexer`.
   * The last category after ordering is dropped when encoding strings.
Contributor:

Is this correct? Do you have any references? AFAIK, the R formula drops the first category, in alphabetically ascending order, when encoding a string/category feature (which is consistent with your PR description). I think test("StringIndexer order types") in #17879 is correct. Could you double-check this?

   * Supported options: 'frequencyDesc', 'frequencyAsc', 'alphabetDesc', 'alphabetAsc'.
   * The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', `RFormula`
   * drops the same category as R when encoding strings.
Contributor:

The order should be alphabetAsc to match R.

   *
   * The options are explained using an example `'b', 'a', 'b', 'a', 'c', 'b'`:
   * {{{
   * +-----------------+---------------------------------------+----------------------------------+
Member:

I would like to suggest just writing this out as prose with a simple list, if we are all fine with that for now, which I guess we would generally agree with.

Contributor Author:

@HyukjinKwon Would you please clarify what you mean by a list? Thanks.
I would like to preserve the table structure because it helps show the difference.

Member:

Ah, sure, I initially meant an HTML list like the ones we are already using:

* <ul>
* <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li>
* <li>`prefersDecimal` (default `false`): infers all floating-point values as a decimal
* type. If the values do not fit in decimal, then it infers them as doubles.</li>
* <li>`allowComments` (default `false`): ignores Java/C++ style comment in JSON records</li>
* <li>`allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names</li>
* <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* </li>
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.
* <ul>
* <li>`PERMISSIVE` : sets other fields to `null` when it meets a corrupted record, and puts
* the malformed string into a field configured by `columnNameOfCorruptRecord`. To keep
* corrupt records, an user can set a string type field named `columnNameOfCorruptRecord`
* in an user-defined schema. If a schema does not have the field, it drops corrupt records
* during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord`
* field in an output schema.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* <li>`columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to
* date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* <li>`wholeFile` (default `false`): parse one record, which may span multiple lines,
* per file</li>
* </ul>
...

<ul>
  <li> abc </li>
  <li> abc </li>
</ul>

I just tested a wiki-style list ( - ) to double-check - see http://subnormalnumbers.blogspot.kr/2011/08/scaladoc-wiki-syntax.html. It does not render correctly, as shown below (but please go ahead if you know a way compatible with both Scaladoc and Javadoc):

   *  1. item one
   *
   *  1. item two
   *    - sublist
   *    - next item
   *
   *  1. now for broken sub-numbered list, the leading item must be one of
   *     `-`, `1.`, `I.`, `i.`, `A.`, or `a.`. And it must be followed by a space.
   *    1. one
   *    2. two
   *    3. three
   *
   *  1. list types
   *    I. one
   *      i. one
   *      i. two
   *    I. two
   *      A. one
   *      A. two
   *    I. three
   *      a. one
   *      a. two

Scaladoc: (screenshot of the rendered output omitted)

Javadoc: (screenshot of the rendered output omitted)

My worry is that it draws attention with a different format. I believe we have similar instances, but I wonder if it is worth changing only this one. I would not be strongly against it, but {{{ ... }}} basically means code. If we can't find a better way to render this, I would leave it as prose with a list.

Member:

I guess I am not supposed to make the decision call, though. Please let me know, @felixcheung and @yanboliang, if you have any preference.

Contributor Author:

@HyukjinKwon Thanks for the clarification. I don't think a list paints a clear picture here; I would rather keep the table structure.

Member:

According to this, tables are covered: https://wiki.scala-lang.org/display/SW/Syntax

Contributor Author:

@felixcheung @HyukjinKwon The scaladoc compiled, but the javadoc failed... Not sure if there is additional config needed for Java?

(screenshot of the javadoc errors omitted)

Member:

I think javadoc 8 complains about the HTML. It looks like this works:

   * <table summary="abc">
   * <tr>
   *   <th>Firstname</th>
   *   <th>Lastname</th>
   *   <th>Age</th>
   * </tr>
   * <tr>
   *   <td>Jill</td>
   *   <td>Smith</td>
   *   <td>50</td>
   * </tr>
   * <tr>
   *   <td>Eve</td>
   *   <td>Jackson</td>
   *   <td>94</td>
   *  </tr>
   * </table>

Scaladoc: (screenshot of the rendered table omitted)

Javadoc: (screenshot of the rendered table omitted)

Other errors are probably spurious (please refer to https://issues.apache.org/jira/browse/SPARK-20840, which I am fighting with right now).

Contributor Author:

@HyukjinKwon Nice, thanks much. One issue: in the scaladoc, the columns are very close to each other. How can I add spacing between columns in the scaladoc?
I tried <table cellspacing="4"> but it does not seem to take any effect.

Member:

Not sure; it did not work for me either...

   * |      Option     | Category mapped to 0 by StringIndexer |  Category dropped by RFormula    |
   * +-----------------+---------------------------------------+----------------------------------+
   * | 'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')    |
   * | 'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')     |
   * | 'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a')|
   * | 'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c') |
   * +-----------------+---------------------------------------+----------------------------------+
   * }}}
   * Note that this ordering option is NOT used for the label column. When the label column is
   * indexed, it uses the default descending frequency ordering in `StringIndexer`.
   *
   * @group param
   */
@Since("2.3.0")
final val stringIndexerOrderType: Param[String] = new Param(this, "stringIndexerOrderType",
"How to order categories of a string FEATURE column used by StringIndexer. " +
"The last category after ordering is dropped when encoding strings. " +
s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}. " +
"The default value is 'frequencyDesc'. When the ordering is set to 'alphabetDesc', " +
"RFormula drops the same category as R when encoding strings.",
ParamValidators.inArray(StringIndexer.supportedStringOrderType))

/** @group getParam */
@Since("2.3.0")
def getStringIndexerOrderType: String = $(stringIndexerOrderType)

protected def hasLabelCol(schema: StructType): Boolean = {
schema.map(_.name).contains($(labelCol))
}
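For context, here is a minimal usage sketch of the new param (not part of the diff): it assumes a SparkSession named `spark` is in scope, and the toy data and column names are made up for illustration.

    import org.apache.spark.ml.feature.RFormula

    // Toy data: 'b' is the most frequent category, 'c' the least frequent.
    val df = spark.createDataFrame(Seq(
      (1.0, "b"), (2.0, "a"), (3.0, "b"), (4.0, "a"), (5.0, "c"), (6.0, "b")
    )).toDF("id", "category")

    val model = new RFormula()
      .setFormula("id ~ category")
      .setStringIndexerOrderType("alphabetDesc") // default is "frequencyDesc"
      .fit(df)

    // With 'alphabetDesc' the categories are ordered c, b, a; the last one ('a')
    // is dropped when encoding, so 'category' contributes two vector slots.
    model.transform(df).select("category", "features").show()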
@@ -125,6 +161,11 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
@Since("2.1.0")
def setForceIndexLabel(value: Boolean): this.type = set(forceIndexLabel, value)

/** @group setParam */
@Since("2.3.0")
def setStringIndexerOrderType(value: String): this.type = set(stringIndexerOrderType, value)
setDefault(stringIndexerOrderType, StringIndexer.frequencyDesc)

/** Whether the formula specifies fitting an intercept. */
private[ml] def hasIntercept: Boolean = {
require(isDefined(formula), "Formula must be defined first.")
@@ -155,6 +196,7 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
          encoderStages += new StringIndexer()
            .setInputCol(term)
            .setOutputCol(indexCol)
            .setStringOrderType($(stringIndexerOrderType))
          prefixesToRewrite(indexCol + "_") = term + "_"
          (term, indexCol)
        case _ =>
@@ -47,7 +47,7 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
   * @group param
   */
  @Since("1.6.0")
  val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
  val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "How to handle " +
    "invalid data (unseen labels or NULL values). " +
    "Options are 'skip' (filter out rows with invalid data), error (throw an error), " +
    "or 'keep' (put invalid data in a special additional bucket, at index numLabels).",
@@ -73,7 +73,7 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha
   */
  @Since("2.3.0")
  final val stringOrderType: Param[String] = new Param(this, "stringOrderType",
    "how to order labels of string column. " +
    "How to order labels of string column. " +
    "The first label after ordering is assigned an index of 0. " +
    s"Supported options: ${StringIndexer.supportedStringOrderType.mkString(", ")}.",
    ParamValidators.inArray(StringIndexer.supportedStringOrderType))
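For completeness, a quick sketch of `stringOrderType` on `StringIndexer` itself, reusing the `'b', 'a', 'b', 'a', 'c', 'b'` example from the RFormula doc above (again assuming a SparkSession named `spark`; the data and column names are made up):

    import org.apache.spark.ml.feature.StringIndexer

    val df = spark.createDataFrame(
      Seq((0, "b"), (1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "b"))
    ).toDF("id", "value")

    val indexed = new StringIndexer()
      .setInputCol("value")
      .setOutputCol("valueIndex")
      .setStringOrderType("alphabetAsc") // 'a' -> 0.0, 'b' -> 1.0, 'c' -> 2.0
      .fit(df)
      .transform(df)

    // With the default 'frequencyDesc', the most frequent value ('b') would get 0.0 instead.
    indexed.select("value", "valueIndex").distinct().show()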
@@ -129,6 +129,90 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
    assert(result.collect() === expected.collect())
  }

test("encodes string terms with string indexer order type") {
val formula = new RFormula().setFormula("id ~ a + b")
val original = Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "aaz", 5))
.toDF("id", "a", "b")

val expected = Seq(
Seq(
(1, "foo", 4, Vectors.dense(0.0, 0.0, 4.0), 1.0),
(2, "bar", 4, Vectors.dense(1.0, 0.0, 4.0), 2.0),
(3, "bar", 5, Vectors.dense(1.0, 0.0, 5.0), 3.0),
(4, "aaz", 5, Vectors.dense(0.0, 1.0, 5.0), 4.0)
).toDF("id", "a", "b", "features", "label"),
Seq(
(1, "foo", 4, Vectors.dense(0.0, 1.0, 4.0), 1.0),
(2, "bar", 4, Vectors.dense(0.0, 0.0, 4.0), 2.0),
(3, "bar", 5, Vectors.dense(0.0, 0.0, 5.0), 3.0),
(4, "aaz", 5, Vectors.dense(1.0, 0.0, 5.0), 4.0)
).toDF("id", "a", "b", "features", "label"),
Seq(
(1, "foo", 4, Vectors.dense(1.0, 0.0, 4.0), 1.0),
(2, "bar", 4, Vectors.dense(0.0, 1.0, 4.0), 2.0),
(3, "bar", 5, Vectors.dense(0.0, 1.0, 5.0), 3.0),
(4, "aaz", 5, Vectors.dense(0.0, 0.0, 5.0), 4.0)
).toDF("id", "a", "b", "features", "label"),
Seq(
(1, "foo", 4, Vectors.dense(0.0, 0.0, 4.0), 1.0),
(2, "bar", 4, Vectors.dense(0.0, 1.0, 4.0), 2.0),
(3, "bar", 5, Vectors.dense(0.0, 1.0, 5.0), 3.0),
(4, "aaz", 5, Vectors.dense(1.0, 0.0, 5.0), 4.0)
).toDF("id", "a", "b", "features", "label")
)

var idx = 0
for (orderType <- StringIndexer.supportedStringOrderType) {
val model = formula.setStringIndexerOrderType(orderType).fit(original)
val result = model.transform(original)
val resultSchema = model.transformSchema(original.schema)
assert(result.schema.toString == resultSchema.toString)
assert(result.collect() === expected(idx).collect())
idx += 1
}
}

test("test consistency with R when encoding string terms") {
/*
R code:

df <- data.frame(id = c(1, 2, 3, 4),
a = c("foo", "bar", "bar", "aaz"),
b = c(4, 4, 5, 5))
model.matrix(id ~ a + b, df)[, -1]

abar afoo b
0 1 4
1 0 4
1 0 5
0 0 5
*/
val original = Seq((1, "foo", 4), (2, "bar", 4), (3, "bar", 5), (4, "aaz", 5))
.toDF("id", "a", "b")
val formula = new RFormula().setFormula("id ~ a + b")
.setStringIndexerOrderType(StringIndexer.alphabetDesc)

/*
Note that the category dropped after encoding is the same between R and Spark
(i.e., "aaz" is treated as the reference level).
However, the column order is still different:
R renders the columns in ascending alphabetical order ("bar", "foo"), while
RFormula renders the columns in descending alphabetical order ("foo", "bar").
Contributor:

R and RFormula should behave consistently if you fix the issue I mentioned above.

    */
    val expected = Seq(
      (1, "foo", 4, Vectors.dense(1.0, 0.0, 4.0), 1.0),
      (2, "bar", 4, Vectors.dense(0.0, 1.0, 4.0), 2.0),
      (3, "bar", 5, Vectors.dense(0.0, 1.0, 5.0), 3.0),
      (4, "aaz", 5, Vectors.dense(0.0, 0.0, 5.0), 4.0)
    ).toDF("id", "a", "b", "features", "label")

    val model = formula.fit(original)
    val result = model.transform(original)
    val resultSchema = model.transformSchema(original.schema)
    assert(result.schema.toString == resultSchema.toString)
    assert(result.collect() === expected.collect())
  }

test("index string label") {
val formula = new RFormula().setFormula("id ~ a + b")
val original =