[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures) #18872
Conversation
Better title please?
WeichenXu123 left a comment:
Could you please add a test case for this fix?
@srowen It worked in v2.0, but was probably broken in v2.2.0 by b3d3962. The current unit tests check writing only for dataframes that were previously read from the LibSVM format, not general ones. (And I guess people don't write LibSVM files very often, which may be why nobody has reported it.) @WeichenXu123 Yes, good idea, will do!
To reproduce the bug on v2.2 and v2.3:

```scala
import org.apache.spark.ml.linalg.Vectors

val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))),
  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
```

This causes the `key not found: numFeatures` error.
I added the unit test, please review.
ok to test
Test build #80508 has finished for PR 18872 at commit
```scala
val rawData = new java.util.ArrayList[Row]()
rawData.add(Row(1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))))
rawData.add(Row(4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
```
Subtle: it didn't like the whitespace on this line
Fixed.
Test build #80533 has finished for PR 18872 at commit
@ProtD This needs a JIRA, or else it needs to be linked in the title to whatever one you opened.
@srowen OK, I created and linked a JIRA.
```scala
test("write libsvm data and read it again") {
  val df = spark.read.format("libsvm").load(path)
  val tempDir2 = new File(tempDir, "read_write_test")
  // ...
```
I suggest making the temp dir name Identifiable.randomUID("read_write_test"), to avoid conflicts with other tests running in parallel.
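For illustration, a minimal sketch of this suggestion (assuming the `tempDir` from the surrounding suite; `Identifiable` is `org.apache.spark.ml.util.Identifiable`):

```scala
import java.io.File
import org.apache.spark.ml.util.Identifiable

// randomUID appends an underscore plus the last 12 hex characters of a
// random UUID to the prefix, so concurrent test runs get distinct names.
val tempDir2 = new File(tempDir, Identifiable.randomUID("read_write_test"))
```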
Use Utils.createTempDir
Utils.createTempDir seems to be the nicer way. The directory is automatically deleted when the VM shuts down, so I believe no manual cleanup (cf. comment below) is needed.
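A minimal sketch of how the test could look with that approach (illustrative only, not the final patch; it assumes `org.apache.spark.util.Utils` plus the `spark` and `path` values from the surrounding suite):

```scala
import java.io.File
import org.apache.spark.sql.SaveMode
import org.apache.spark.util.Utils

test("write libsvm data and read it again") {
  val df = spark.read.format("libsvm").load(path)
  // Utils.createTempDir registers a shutdown hook that deletes the
  // directory when the JVM exits, so no manual cleanup is needed.
  val tempDir2: File = Utils.createTempDir()
  val writePath = tempDir2.getPath
  // Overwrite mode, since createTempDir has already created the directory.
  df.coalesce(1).write.format("libsvm").mode(SaveMode.Overwrite).save(writePath)
  val df2 = spark.read.format("libsvm").load(writePath)
  assert(df2.count() == df.count())
}
```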
```diff
  val df = spark.read.format("libsvm").load(path)
  val tempDir2 = new File(tempDir, "read_write_test")
- val writepath = tempDir2.toURI.toString
+ val writePath = tempDir2.toURI.toString
```
Use tempDir2.getPath.
```scala
val row1 = df2.first()
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
Utils.deleteRecursively(tempDir2)
```
You can remove this cleanup, I think; the test framework will clean the temp dir automatically.
Test build #80676 has finished for PR 18872 at commit
Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception; after the change it will ignore the option completely. liancheng HyukjinKwon (Maybe the usage should be forbidden when writing, in a major version change?) Manual test that loading and writing LibSVM files work fine, both with and without the numFeatures option.

Author: Jan Vrsovsky <[email protected]>

Closes #18872 from ProtD/master.

(cherry picked from commit 8321c14)
Signed-off-by: Sean Owen <[email protected]>
Merged to master/2.2 |
What changes were proposed in this pull request?

Check the option "numFeatures" only when reading LibSVM, not when writing. Previously, Spark raised an exception when writing; after this change, the option is ignored completely on the write path. @liancheng @HyukjinKwon

(Maybe using the option should be forbidden when writing, in a major version change?)
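Roughly, the shape of the fix is to tell the schema check whether it is being called from the read or the write path, and only require the numFeatures metadata when reading. A hedged sketch (simplified, not the literal patch; `verifySchema` mirrors the internal LibSVM schema check, and "numFeatures" is the metadata key the reader attaches to the features column):

```scala
import java.io.IOException
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DataTypes, StructType}

// Verify a (label: Double, features: Vector) schema, but only require
// positive numFeatures metadata when reading. A general DataFrame being
// written carries no such metadata, and looking it up unconditionally is
// what used to fail with "key not found: numFeatures".
def verifySchema(dataSchema: StructType, forWriting: Boolean): Unit = {
  if (dataSchema.size != 2 ||
      dataSchema(0).dataType != DataTypes.DoubleType ||
      dataSchema(1).dataType != SQLDataTypes.VectorType ||
      !(forWriting || dataSchema(1).metadata.getLong("numFeatures") > 0)) {
    throw new IOException(s"Illegal schema for libsvm data, schema=$dataSchema")
  }
}
```

The `metadata.getLong("numFeatures")` lookup throws for a missing key, which is consistent with the error in the PR title; short-circuiting on `forWriting` skips it on the write path.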
How was this patch tested?

Manual test that loading and writing LibSVM files work fine, both with and without the numFeatures option.
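For example, after the fix the following should behave consistently (a sketch; the read line mirrors the Spark data source docs, `data/mllib/sample_libsvm_data.txt` ships with the Spark source tree, and the output path is illustrative):

```scala
// numFeatures is honored when reading...
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

// ...and now simply ignored when writing, instead of throwing
// "key not found: numFeatures".
df.coalesce(1).write.format("libsvm")
  .option("numFeatures", "780")
  .save("/tmp/libsvm_out")
```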