[SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition to existing null string array #15179
…able string array type in NGram
|
Test build #65711 has finished for PR 15179 at commit
|
|
Hi @srowen. By fixing I cannot recall all the details. Right now I think the primary reason is to be consistent with auto schema inference, thus to support usage like https://github.com/apache/spark/blob/180fd3e0a3426db200c97170926afb60751dfd0e/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala |
|
Or really to have it accept an array of strings, whether nullable or not. Right now it can accept a nullable string array type, but will pointlessly reject input that is not nullable. The check is too strict. We could just loosen that and then any string array input would work, which seems like the idea? |
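To illustrate the point about the check being too strict, here is a minimal sketch contrasting an exact-match type check with one that ignores element nullability. The object and method names (`StringArrayCheck`, `strictCheck`, `lenientCheck`) are hypothetical and used only for illustration; this is not the code in Word2Vec or in this PR, only the public `ArrayType` and `StringType` classes from `org.apache.spark.sql.types` are assumed.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Hypothetical helper, for illustration only (not Spark's actual validation code).
object StringArrayCheck {

  // Exact match: only Array[String] with containsNull = true passes,
  // because ArrayType equality also compares the containsNull flag.
  def strictCheck(dt: DataType): Boolean =
    dt == ArrayType(StringType, containsNull = true)

  // Loosened check: any array of strings passes, nullable elements or not.
  def lenientCheck(dt: DataType): Boolean = dt match {
    case ArrayType(StringType, _) => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    val nullable = ArrayType(StringType, containsNull = true)
    val nonNullable = ArrayType(StringType, containsNull = false)

    println(s"strict:  nullable=${strictCheck(nullable)}, nonNullable=${strictCheck(nonNullable)}")
    println(s"lenient: nullable=${lenientCheck(nullable)}, nonNullable=${lenientCheck(nonNullable)}")
  }
}
```

Under the strict check, a column whose type differs only in the `containsNull` flag is rejected even though every value in it would be valid input.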
|
I'm not sure what will happen if we actually send an I'll try something locally. |
|
Checked locally and Word2Vec can work with |
|
I think the problem is that it won't accept anything that outputs |
|
|
…currently supported nullable string array type
|
Yes that's better IMHO. It makes this more consistent. Word2Vec was the only case that looked for an exact match on nullability when it was expecting nullable input (i.e. pointlessly disallowed non-null input) |
|
Test build #65769 has finished for PR 15179 at commit
|
|
LGTM. How about adding a UT for the input type new ArrayType(StringType, false)? |
|
Sorry, what's a UT here? user defined type? |
|
Sorry, I meant unit test |
|
Accepting more input types SGTM too (with unit tests). The PR title and description (and perhaps the JIRA too) should be updated. Thanks! |
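A unit test along the lines requested above might look roughly like the following sketch. It is not the test added in this PR; the object name `Word2VecNonNullInputCheck` and the column names `text` and `result` are assumptions made for illustration. It only exercises schema validation through the public `transformSchema` method, so no SparkSession is needed, and it assumes a Spark version that includes this change.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical check, for illustration only (not the test from the PR).
object Word2VecNonNullInputCheck {

  // Schema with a string array column whose elements are declared non-nullable.
  val nonNullSchema: StructType = StructType(Seq(
    StructField("text", ArrayType(StringType, containsNull = false), nullable = false)))

  def main(args: Array[String]): Unit = {
    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")

    // transformSchema validates the input column type and appends the output column.
    // With an exact-match check on a nullable array type, this call would throw
    // an IllegalArgumentException instead.
    val outputSchema = word2Vec.transformSchema(nonNullSchema)
    assert(outputSchema.fieldNames.contains("result"))
  }
}
```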
|
Test build #65821 has finished for PR 15179 at commit
|
|
Thanks Sean. LGTM. |
…dition to existing null string array

## What changes were proposed in this pull request?

To match Tokenizer and for compatibility with Word2Vec, output a nullable string array type in NGram

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <[email protected]>

Closes #15179 from srowen/SPARK-10835.

(cherry picked from commit f3fe554)
Signed-off-by: Sean Owen <[email protected]>
|
Merged to master/2.0 |
What changes were proposed in this pull request?
To match Tokenizer and for compatibility with Word2Vec, output a nullable string array type in NGram
How was this patch tested?
Jenkins tests.