Conversation

@ProtD ProtD commented Aug 7, 2017

What changes were proposed in this pull request?

Check the "numFeatures" option only when reading LibSVM, not when writing. Previously, Spark raised an exception when writing; after this change the option is ignored completely on the write path. @liancheng @HyukjinKwon

(Maybe using the option when writing should be explicitly forbidden, in a major version change?)

How was this patch tested?

Manually tested that loading and writing LibSVM files work fine, both with and without the numFeatures option.

Please review http://spark.apache.org/contributing.html before opening a pull request.
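
In rough outline, the change looks like this (a minimal sketch with illustrative names, not the actual LibSVMFileFormat internals): the numFeatures option is consulted only on the read path, with a fallback to scanning the data when it is absent, while the write path no longer touches it.

// Illustrative sketch only -- not the actual LibSVMFileFormat code.
// Before the fix, the write path also did options("numFeatures"), which throws
// java.util.NoSuchElementException: key not found: numFeatures
// whenever the option was not set. After the fix, only the reader consults it:
def numFeaturesForRead(options: Map[String, String], scanData: () => Int): Int =
  options.get("numFeatures").map(_.toInt).getOrElse(scanData())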

@srowen srowen (Member) commented Aug 7, 2017

Better title please?
See http://spark.apache.org/contributing.html
How does this work at all now?

@WeichenXu123 WeichenXu123 (Contributor) left a comment

Could you please add a test case for this fix?

@ProtD ProtD (Author) commented Aug 10, 2017

@srowen It worked in v2.0, but was probably broken in v2.2.0 by b3d3962. The current unit tests check writing only for DataFrames that were previously read from the LibSVM format, not for general ones. (And I guess people don't write LibSVM files very often, which may be why nobody has reported it.)

@WeichenXu123 Yes, good idea, will do it!

@ProtD ProtD (Author) commented Aug 10, 2017

To reproduce the bug on v2.2 and v2.3:

// Run in spark-shell, where `spark` and its implicits are already in scope:
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))),
                  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")

This causes java.util.NoSuchElementException: key not found: numFeatures.

@ProtD ProtD changed the title from "[MLlib] Fix writing LibSVM" to "[MLlib] Fix writing LibSVM (key not found: numFeatures)" Aug 10, 2017
@ProtD ProtD (Author) commented Aug 10, 2017

I added the unit test, please review.

@HyukjinKwon HyukjinKwon (Member)

ok to test


SparkQA commented Aug 10, 2017

Test build #80508 has finished for PR 18872 at commit b86bb44.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val rawData = new java.util.ArrayList[Row]()
rawData.add(Row(1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))))
rawData.add(Row(4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))

A Member commented on the diff:

Subtle: it didn't like the whitespace on this line

The Author replied:

Fixed.
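
The java.util.ArrayList[Row] in the hunk above matches the createDataFrame overload that takes a java.util.List[Row] plus an explicit schema. A minimal sketch of how these rows might become a DataFrame (the schema construction here is an assumption, not necessarily the exact test code):

import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumed schema: a double label and a vector column, as in the rows above.
val schema = StructType(Seq(
  StructField("label", DoubleType, nullable = false),
  StructField("features", SQLDataTypes.VectorType, nullable = false)
))
// createDataFrame has a (java.util.List[Row], StructType) overload,
// which is why rawData is built as an ArrayList[Row] rather than a Seq.
val df = spark.createDataFrame(rawData, schema)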


SparkQA commented Aug 11, 2017

Test build #80533 has finished for PR 18872 at commit 38da77d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen (Member) commented Aug 11, 2017

@ProtD this needs a JIRA or else needs to be linked to whatever one you opened, in the title

@ProtD ProtD changed the title from "[MLlib] Fix writing LibSVM (key not found: numFeatures)" to "[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)" Aug 14, 2017
@ProtD ProtD (Author) commented Aug 14, 2017

@srowen Ok, I created and linked a JIRA.


test("write libsvm data and read it again") {
val df = spark.read.format("libsvm").load(path)
val tempDir2 = new File(tempDir, "read_write_test")
A Contributor commented on the diff:

I suggest making the temp dir name Identifiable.randomUID("read_write_test"), to avoid conflicts with other tests running in parallel.
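
For illustration, Identifiable.randomUID appends a random suffix to the given prefix, so each run gets its own directory name (a sketch; tempDir is assumed to be the suite's existing temp root):

import java.io.File
import org.apache.spark.ml.util.Identifiable

// Produces e.g. "read_write_test_b2cbb0a0d1a2" -- unique per run, so
// parallel test runs don't collide on the same directory name.
val tempDir2 = new File(tempDir, Identifiable.randomUID("read_write_test"))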

A Member commented on the diff:

Use Utils.createTempDir

@ProtD ProtD (Author) replied, Aug 15, 2017:

Utils.createTempDir seems to be the nicer way. The directory is automatically deleted when the VM shuts down, so I believe no manual cleanup (cf. comment below) is needed.
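
A sketch of the suggested variant (the namePrefix parameter reflects Utils.createTempDir's signature in Spark at the time; treat the exact call as an assumption):

import org.apache.spark.util.Utils

// Utils.createTempDir registers the directory for deletion on JVM shutdown,
// so no explicit deleteRecursively call should be needed in the test.
val tempDir2 = Utils.createTempDir(namePrefix = "read_write_test")
val writePath = tempDir2.toURI.toString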

  val df = spark.read.format("libsvm").load(path)
  val tempDir2 = new File(tempDir, "read_write_test")
- val writepath = tempDir2.toURI.toString
+ val writePath = tempDir2.toURI.toString
A Contributor commented on the diff:

use tempDir2.getPath

  val row1 = df2.first()
  val v = row1.getAs[SparseVector](1)
  assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
  Utils.deleteRecursively(tempDir2)
A Contributor commented on the diff:

I think you can remove this cleanup; the test framework will clean the temp dir automatically.
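
Pieced together from the hunks above, the final test plausibly reads as follows. The write and re-read lines in the middle are not shown in the hunks, so they are an assumption, as are the suite's imports (SparseVector, Vectors, Utils, SaveMode):

test("write libsvm data and read it again") {
  val df = spark.read.format("libsvm").load(path)
  // createTempDir already creates the directory, so overwrite mode is needed.
  val writePath = Utils.createTempDir().toURI.toString
  // Assumed round trip: write the loaded frame back out, then read it again.
  df.write.format("libsvm").mode(SaveMode.Overwrite).save(writePath)
  val df2 = spark.read.format("libsvm").load(writePath)
  val row1 = df2.first()
  val v = row1.getAs[SparseVector](1)
  assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
}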


SparkQA commented Aug 15, 2017

Test build #80676 has finished for PR 18872 at commit 7530f00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 8321c14 Aug 16, 2017
asfgit pushed a commit that referenced this pull request Aug 16, 2017
[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)

Author: Jan Vrsovsky <[email protected]>

Closes #18872 from ProtD/master.

(cherry picked from commit 8321c14)
Signed-off-by: Sean Owen <[email protected]>
@srowen srowen (Member) commented Aug 16, 2017

Merged to master/2.2

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)

Author: Jan Vrsovsky <[email protected]>

Closes apache#18872 from ProtD/master.

(cherry picked from commit 8321c14)
Signed-off-by: Sean Owen <[email protected]>