Skip to content

Conversation

@holdenk
Copy link
Contributor

@holdenk holdenk commented Jul 7, 2015

As a precursor to adding a public constructor add an option to handle unseen values by skipping rather than throwing an exception (default remains throwing an exception),

@SparkQA
Copy link

SparkQA commented Jul 7, 2015

Test build #36715 has finished for PR 7266 at commit d69ef5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove newline

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the two different exceptions add any new information (the user already knows if the model is configured to skip invalid or not). Also, this branch should never execute because of L152. Perhaps we can collapse this into a single exception?

@SparkQA
Copy link

SparkQA commented Jul 8, 2015

Test build #36848 has finished for PR 7266 at commit 75ffa69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 9, 2015

Test build #36964 has finished for PR 7266 at commit aa5b093.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@feynmanliang
Copy link
Contributor

LGTM

@holdenk
Copy link
Contributor Author

holdenk commented Jul 13, 2015

cc @jkbradley for review since created the jira.

@SparkQA
Copy link

SparkQA commented Jul 16, 2015

Test build #37544 has finished for PR 7266 at commit 100a39b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute
    • abstract class Star extends LeafExpression with NamedExpression
    • case class UnresolvedAlias(child: Expression) extends UnaryExpression with NamedExpression
    • case class SortOrder(child: Expression, direction: SortDirection) extends UnaryExpression
    • trait AggregateExpression extends Expression
    • trait PartialAggregate extends AggregateExpression
    • case class Min(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Max(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Count(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Average(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Sum(child: Expression) extends UnaryExpression with PartialAggregate
    • case class SumDistinct(child: Expression) extends UnaryExpression with PartialAggregate
    • case class First(child: Expression) extends UnaryExpression with PartialAggregate
    • case class Last(child: Expression) extends UnaryExpression with PartialAggregate
    • trait Generator extends Expression
    • case class Explode(child: Expression) extends UnaryExpression with Generator
    • trait NamedExpression extends Expression
    • abstract class Attribute extends LeafExpression with NamedExpression
    • case class PrettyAttribute(name: String) extends Attribute
    • abstract class LeafNode extends LogicalPlan
    • abstract class UnaryNode extends LogicalPlan

@holdenk
Copy link
Contributor Author

holdenk commented Aug 1, 2015

jenkins, retest this please.

@holdenk
Copy link
Contributor Author

holdenk commented Aug 1, 2015

@jkbradley if you have a chance to look at this PR too its in the same class/file as the last one.

@SparkQA
Copy link

SparkQA commented Aug 1, 2015

Test build #186 has finished for PR 7266 at commit 100a39b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 1, 2015

Test build #39348 has finished for PR 7266 at commit 100a39b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

I was taking another look at it, and I like the setup. But one thing I had not thought of was a third option: creating a new label/index which all unseen values are mapped to. That would let users avoid filtering out rows, and instead replace the bad/unseen values with a default value. Rather than putting that in this PR, could you modify the Param to be a String, so that we can specify other options in the future?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #39952 has finished for PR 7266 at commit 414e249.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #39957 has finished for PR 7266 at commit 7f37f6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make docs more explicit. It's very helpful to users. Here, it'd be nice to state the options and what they mean: "Options: skip (filter out rows with bad values) or error (throw an error). Note that skipping rows means the Transformer can output a smaller dataset."

@jkbradley
Copy link
Member

Those 2 small items are the only issues I see.

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #40054 has finished for PR 7266 at commit 045bf22.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #40059 has finished for PR 7266 at commit 38a4de9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor Author

holdenk commented Aug 6, 2015

Catalyst failure seems likely unrelated, jenkins retest this please.

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #258 has finished for PR 7266 at commit 38a4de9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2015

Test build #40068 has finished for PR 7266 at commit 38a4de9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM, merging with master.
Thanks!

@asfgit asfgit closed this in dbd778d Aug 11, 2015
CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
…values

As a precursor to adding a public constructor add an option to handle unseen values by skipping rather than throwing an exception (default remains throwing an exception),

Author: Holden Karau <[email protected]>

Closes apache#7266 from holdenk/SPARK-8764-string-indexer-should-take-option-to-handle-unseen-values and squashes the following commits:

38a4de9 [Holden Karau] fix long line
045bf22 [Holden Karau] Add a second b entry so b gets 0 for sure
81dd312 [Holden Karau] Update the docs for handleInvalid param to be more descriptive
7f37f6e [Holden Karau] remove extra space (scala style)
414e249 [Holden Karau] And switch to using handleInvalid instead of skipInvalid
1e53f9b [Holden Karau] update the param (codegen side)
7a22215 [Holden Karau] fix typo
100a39b [Holden Karau] Merge in master
aa5b093 [Holden Karau] Since we filter we should never go down this code path if getSkipInvalid is true
75ffa69 [Holden Karau] Remove extra newline
d69ef5e [Holden Karau] Add a test
b5734be [Holden Karau] Add support for unseen labels
afecd4e [Holden Karau] Add a param to skip invalid entries.
@miro-balaz
Copy link

For me it would be usefull(when working with trees) if string indexer was mapping unseen labels to maximum known label+1. The row would not be skipped if it contained some non-used feature for classification.

@holdenk
Copy link
Contributor Author

holdenk commented Sep 12, 2016

@miro-balaz : This probably isn't the best place for a new feature request - but if you head over to the ASF JIRA you can create a new ticket and cc the people who worked on this.

@miro-balaz
Copy link

thank you for directions

On Monday, 12 September 2016, Holden Karau [email protected] wrote:

@miro-balaz https://github.com/miro-balaz : This probably isn't the
best place for a new feature request - but if you head over to the ASF JIRA
you can create a new ticket and cc the people who worked on this.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7266 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ALaD0wJm0im_ycl7fyxq-c97bY3RBCXkks5qpKJQgaJpZM4FT5k-
.

@b-g-d
Copy link

b-g-d commented Feb 9, 2017

When "skip" is chosen as the way to handle Unseen labels, is there a way to know which rows were skipped?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants