Skip to content

Conversation

@yinxusen
Copy link
Contributor

What changes were proposed in this pull request?

The patch adds python API for mllib.stat.test.StreamingTest under JIRA https://issues.apache.org/jira/browse/SPARK-12042.

Note that for StreamingTestResult, unlike other test results in Python, I define it as a normal Python class which doesn't extend from TestResult with a _java_obj in it.

How was this patch tested?

The patch is tested with Python unit test.

@mengxr
Copy link
Contributor

mengxr commented Feb 25, 2016

cc: @feynmanliang

@SparkQA
Copy link

SparkQA commented Feb 25, 2016

Test build #51989 has finished for PR 11374 at commit 770703b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 5, 2016

test it please

@SparkQA
Copy link

SparkQA commented Mar 6, 2016

Test build #52523 has finished for PR 11374 at commit e4e8d5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 6, 2016

cc @feynmanliang

@SparkQA
Copy link

SparkQA commented May 14, 2016

Test build #58598 has finished for PR 11374 at commit e4e8d5e.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@zsxwing
Copy link
Member

zsxwing commented Oct 24, 2016

Any updates to this PR?

@SparkQA
Copy link

SparkQA commented Oct 28, 2016

Test build #67696 has finished for PR 11374 at commit 615fbbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

Ping @mengxr @feynmanliang @yanboliang

@feynmanliang
Copy link
Contributor

I'll review this tonight

@feynmanliang
Copy link
Contributor

Apologies for the delay, I am traveling but I'll get this done this weekend.

Copy link
Contributor

@feynmanliang feynmanliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a first pass.

It's been awhile since I've looked at PySpark so I may be a bit rusty on some things.


"""
Create a DStream that contains several RDDs to show the StreamingTest of PySpark.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like other examples are including a from __future__ import print_function here

sc = SparkContext(appName="PythonStreamingTestExample")
ssc = StreamingContext(sc, 1)

checkpoint_path = tempfile.mkdtemp()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

from pyspark.mllib.stat.test import BinarySample, StreamingTest

if __name__ == "__main__":

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't include newline here


from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.stat.test import BinarySample, StreamingTest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$example on$ and $example off appear to be used in other examples, though I'm not sure why myself

ssc.checkpoint(checkpoint_path)

# Create the queue through which RDDs can be pushed to a QueueInputDStream.
rdd_queue = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use camelCase to be consistent with other examples

"""

checkpoint_path = tempfile.mkdtemp()
self.ssc.checkpoint(checkpoint_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

input_stream = self.ssc.queueStream(rdd_queue)

model = StreamingTest()
model.setPeacePeriod(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we break this into another test just for model params like

def test_model_params(self):
?

}
}

private[python] class StreamingTestResultPickler extends BasePickler[StreamingTestResult] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to test these in PythonMLLibAPISuite?

streamingTest.setTestMethod(self._testMethod)

javaDStream = sc._jvm.SerDe.pythonToJava(data._jdstream, True)
testResult = streamingTest.registerStream(javaDStream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need pythonToJava and javaToPython; its not used for streaming K means

updatedModel = callMLlibFunc(

*/
@Since("1.6.0")
private[stat] class StreamingTestResult @Since("1.6.0") (
class StreamingTestResult @Since("1.6.0") (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be public? Java API doesn't seem to require it

@HyukjinKwon
Copy link
Member

Hi @yinxusen, are you able to proceed this further? If not, it seems it might be better closed for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants