Skip to content

Conversation

@srowen
Copy link
Member

@srowen srowen commented Sep 21, 2016

What changes were proposed in this pull request?

To match Tokenizer and for compatibility with Word2Vec, output a nullable string array type in NGram

How was this patch tested?

Jenkins tests.

@srowen srowen changed the title [SPARK-1085 [SPARK-10835] [ML] Change Output of NGram to Array(String, True) Sep 21, 2016
@srowen
Copy link
Member Author

srowen commented Sep 21, 2016

This change merely parallels #7414 from @hhbyyh (what do you think?) I'd still sort of like to understand why we didn't 'fix' Word2Vec instead in that change?

@SparkQA
Copy link

SparkQA commented Sep 21, 2016

Test build #65711 has finished for PR 15179 at commit 0c7e552.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor

hhbyyh commented Sep 21, 2016

Hi @srowen. By fixing Word2Vec, do you mean to change its input type to Array(String, false)?

I cannot recall all the details. Right now I think the primary reason is to be consistent with auto schema inference, thus to support usage like https://github.com/apache/spark/blob/180fd3e0a3426db200c97170926afb60751dfd0e/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala

@srowen
Copy link
Member Author

srowen commented Sep 21, 2016

Or really to have it accept an array of strings, whether nullable or not. Right now it can accept a nullable string array type, but will pointlessly reject input that is not nullable. The check is too strict. We could just loosen that and then any string array input would work, which seems like the idea?

@hhbyyh
Copy link
Contributor

hhbyyh commented Sep 21, 2016

I'm not sure what will happen if we actually send an Array(String, false) column to Word2Vec. If nothing goes wrong, I think we can loosen the constraint.

I'll try something locally.

@hhbyyh
Copy link
Contributor

hhbyyh commented Sep 21, 2016

Checked locally and Word2Vec can work with Array(String, false). We can loosen the constraint. Maybe a new UT can be added.

@srowen
Copy link
Member Author

srowen commented Sep 21, 2016

I think the problem is that it won't accept anything that outputs Array(String, true)? it seems like it's looking for exactly the same type. Elsewhere I see a similar check written so as to not have to match nullability.

@hhbyyh
Copy link
Contributor

hhbyyh commented Sep 21, 2016

CountVectorizer can accept both Array(String, true) and Array(String, false). Perhaps we can just change the validation code.

…currently supported nullable string array type
@srowen
Copy link
Member Author

srowen commented Sep 22, 2016

Yes that's better IMHO. It makes this more consistent. Word2Vec was the only case that looked for an exact match on nullability when it was expecting nullable input (i.e. pointlessly disallowed non-null input)

@SparkQA
Copy link

SparkQA commented Sep 22, 2016

Test build #65769 has finished for PR 15179 at commit 4335237.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor

hhbyyh commented Sep 22, 2016

LGTM. How about adding an UT for the input type new ArrayType(StringType, false)?

@srowen
Copy link
Member Author

srowen commented Sep 22, 2016

Sorry, what's a UT here? user defined type?

@hhbyyh
Copy link
Contributor

hhbyyh commented Sep 22, 2016

Sorry, I meant unit test

@jkbradley
Copy link
Member

Accepting more input types SGTM too (with unit tests). The PR title and description (and perhaps the JIRA too) should be updated. Thanks!

@SparkQA
Copy link

SparkQA commented Sep 23, 2016

Test build #65821 has finished for PR 15179 at commit 76af236.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Copy link
Contributor

hhbyyh commented Sep 23, 2016

Thanks Sean. LGTM.

@srowen srowen changed the title [SPARK-10835] [ML] Change Output of NGram to Array(String, True) [SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition to existing null string array Sep 24, 2016
asfgit pushed a commit that referenced this pull request Sep 24, 2016
…dition to existing null string array

## What changes were proposed in this pull request?

To match Tokenizer and for compatibility with Word2Vec, output a nullable string array type in NGram

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <[email protected]>

Closes #15179 from srowen/SPARK-10835.

(cherry picked from commit f3fe554)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member Author

srowen commented Sep 24, 2016

Merged to master/2.0

@asfgit asfgit closed this in f3fe554 Sep 24, 2016
@srowen srowen deleted the SPARK-10835 branch September 28, 2016 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants