Skip to content

Conversation

@jiangxb1987
Copy link
Contributor

@jiangxb1987 jiangxb1987 commented Aug 6, 2018

What changes were proposed in this pull request?

This PR add python support for barrier execution mode, thus enable launch a job containing barrier stage(s) from PySpark.

We just forked the existing RDDBarrier and RDD.barrier() in Python api.

How was this patch tested?

Manually tested:

>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(iterator): yield sum(iterator)
... 
>>> rdd.barrier().mapPartitions(f).isBarrier() == True
True

Unit tests will be added in a follow-up PR that implements BarrierTaskContext on python side.

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94302 has finished for PR 22011 at commit ec2f668.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JavaRDDBarrier[T: ClassTag](javaRdd: JavaRDD[T])
  • class RDDBarrier(object):

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94308 has finished for PR 22011 at commit b0b2f86.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we expose a package private method to get the annotated RDD with isBarrier=True in RDDBarrier, we can implement mapPartitions easily here:

jBarrierRdd = self._jrdd.rdd.barrier().barrierRdd.javaRdd
pyBarrierRdd = RDD(self._jrdd.rdd.barrier().barrierRdd.javaRdd)
pyBarrierRdd.mapPartitions(f, preservesPartitioning)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary to implement Python support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we didn't mark the version so far here but we really should .. versionadded:: 2.4.0 here or

@since(2.4)
def barrier(self):
    ...

Copy link
Member

@HyukjinKwon HyukjinKwon Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto let's add .. versionadded:: 2.4.0 at the end. I guess optionally add them to each API here exposed as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: RDDBarrier -> RDD barrier

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we match the documentation, or why is it different?

FWIW, for coding block, just `blabla` should be good enough. Nicer if linked properly by like :class:`ClassName`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring?

@jiangxb1987 jiangxb1987 changed the title [WIP][SPARK-24822][PySpark] Python support for barrier execution mode [SPARK-24822][PySpark] Python support for barrier execution mode Aug 9, 2018
@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94512 has finished for PR 22011 at commit 1ee8025.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94514 has finished for PR 22011 at commit d508fc5.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

retest this please

@mengxr
Copy link
Contributor

mengxr commented Aug 10, 2018

test this please

"""
def func(s, iterator):
return f(iterator)
jBarrierRdd = self._jrdd.rdd().barrier().toJavaRDD()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will materialize the java RDD, which means the map functions before and after barrier will be executed by 2 python workers.

We should not materialize the java RDD here, but just set a isBarrier flag in the pythhon PipelinedRDD.

@SparkQA
Copy link

SparkQA commented Aug 10, 2018

Test build #94530 has finished for PR 22011 at commit d508fc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2018

Test build #94549 has finished for PR 22011 at commit ea2330b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 10, 2018

test this please

@mengxr
Copy link
Contributor

mengxr commented Aug 10, 2018

@jiangxb1987 Please mention that tests will be added in a follow-up PR that implements BarrierTaskContext.

"""
return RDDBarrier(self)

def isBarrier(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have this API in the JVM RDD?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In scala RDD there is a private[spark] isBarrier() function, we don't add this to JavaRDD

@SparkQA
Copy link

SparkQA commented Aug 10, 2018

Test build #94565 has finished for PR 22011 at commit ea2330b.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2018

Test build #94575 has finished for PR 22011 at commit cf38531.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 11, 2018

Test build #94590 has finished for PR 22011 at commit cf38531.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 11, 2018

Test build #94600 has finished for PR 22011 at commit cf38531.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 4855d5c Aug 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants