[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures) #18872
Conversation
Better title please?
WeichenXu123 left a comment:
Could you please add a test case for this fix?
@srowen It worked in v2.0, but was probably broken in v2.2.0 by b3d3962. The current unit tests check writing only for dataframes that were previously read from the LibSVM format, not general ones. (And I guess people don't write LibSVM files very often, which may be why nobody has reported it.) @WeichenXu123 Yes, good idea, will do!
To reproduce the bug on v2.2 and v2.3:

```scala
import org.apache.spark.ml.linalg.Vectors

val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))),
  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
```

This causes the `key not found: numFeatures` error.
I added the unit test, please review.
ok to test
Test build #80508 has finished for PR 18872 at commit
```scala
val rawData = new java.util.ArrayList[Row]()
rawData.add(Row(1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))))
rawData.add(Row(4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
```
Subtle: it didn't like the whitespace on this line
Fixed.
Test build #80533 has finished for PR 18872 at commit
@ProtD This needs a JIRA, or else it needs to be linked in the title to whatever one you opened.
@srowen OK, I created and linked a JIRA.
```scala
test("write libsvm data and read it again") {
  val df = spark.read.format("libsvm").load(path)
  val tempDir2 = new File(tempDir, "read_write_test")
  // ...
```
I suggest making the temp dir name Identifiable.randomUID("read_write_test"), to avoid conflicts with other tests running in parallel.
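For illustration, a minimal sketch of this suggestion (assuming the `tempDir` from the surrounding suite; `Identifiable` is `org.apache.spark.ml.util.Identifiable`):

```scala
import java.io.File
import org.apache.spark.ml.util.Identifiable

// randomUID appends an underscore plus the last 12 hex characters of a
// random UUID to the prefix, so concurrent test runs get distinct names.
val tempDir2 = new File(tempDir, Identifiable.randomUID("read_write_test"))
```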
Use Utils.createTempDir
Utils.createTempDir seems to be the nicer way. The directory is automatically deleted when the VM shuts down, so I believe no manual cleanup (cf. comment below) is needed.
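A minimal sketch of how the test could look with that approach (illustrative only, not the final patch; it assumes `org.apache.spark.util.Utils` plus the `spark` and `path` values from the surrounding suite):

```scala
import java.io.File
import org.apache.spark.sql.SaveMode
import org.apache.spark.util.Utils

test("write libsvm data and read it again") {
  val df = spark.read.format("libsvm").load(path)
  // Utils.createTempDir registers a shutdown hook that deletes the
  // directory when the JVM exits, so no manual cleanup is needed.
  val tempDir2: File = Utils.createTempDir()
  val writePath = tempDir2.getPath
  // Overwrite mode, since createTempDir has already created the directory.
  df.coalesce(1).write.format("libsvm").mode(SaveMode.Overwrite).save(writePath)
  val df2 = spark.read.format("libsvm").load(writePath)
  assert(df2.count() == df.count())
}
```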
```diff
  val df = spark.read.format("libsvm").load(path)
  val tempDir2 = new File(tempDir, "read_write_test")
- val writepath = tempDir2.toURI.toString
+ val writePath = tempDir2.toURI.toString
```
Use tempDir2.getPath.
```scala
val row1 = df2.first()
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
Utils.deleteRecursively(tempDir2)
```
You can remove this cleanup, I think; the test framework will clean the temp dir automatically.
Test build #80676 has finished for PR 18872 at commit
Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception; after the change it will ignore the option completely. liancheng HyukjinKwon (Maybe the usage should be forbidden when writing, in a major version change?) Manual test that loading and writing LibSVM files work fine, both with and without the numFeatures option.

Author: Jan Vrsovsky <[email protected]>

Closes #18872 from ProtD/master.

(cherry picked from commit 8321c14)
Signed-off-by: Sean Owen <[email protected]>
Merged to master/2.2 |
What changes were proposed in this pull request?

Check the option "numFeatures" only when reading LibSVM, not when writing. Previously, Spark raised an exception when writing; after this change, the option is ignored completely on the write path. @liancheng @HyukjinKwon

(Maybe using the option should be forbidden when writing, in a major version change?)
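Roughly, the shape of the fix is to tell the schema check whether it is being called from the read or the write path, and only require the numFeatures metadata when reading. A hedged sketch (simplified, not the literal patch; `verifySchema` mirrors the internal LibSVM schema check, and "numFeatures" is the metadata key the reader attaches to the features column):

```scala
import java.io.IOException
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DataTypes, StructType}

// Verify a (label: Double, features: Vector) schema, but only require
// positive numFeatures metadata when reading. A general DataFrame being
// written carries no such metadata, and looking it up unconditionally is
// what used to fail with "key not found: numFeatures".
def verifySchema(dataSchema: StructType, forWriting: Boolean): Unit = {
  if (dataSchema.size != 2 ||
      dataSchema(0).dataType != DataTypes.DoubleType ||
      dataSchema(1).dataType != SQLDataTypes.VectorType ||
      !(forWriting || dataSchema(1).metadata.getLong("numFeatures") > 0)) {
    throw new IOException(s"Illegal schema for libsvm data, schema=$dataSchema")
  }
}
```

The `metadata.getLong("numFeatures")` lookup throws for a missing key, which is consistent with the error in the PR title; short-circuiting on `forWriting` skips it on the write path.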
How was this patch tested?

Manual test that loading and writing LibSVM files work fine, both with and without the numFeatures option.
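For example, after the fix the following should behave consistently (a sketch; the read line mirrors the Spark data source docs, `data/mllib/sample_libsvm_data.txt` ships with the Spark source tree, and the output path is illustrative):

```scala
// numFeatures is honored when reading...
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

// ...and now simply ignored when writing, instead of throwing
// "key not found: numFeatures".
df.coalesce(1).write.format("libsvm")
  .option("numFeatures", "780")
  .save("/tmp/libsvm_out")
```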