
Conversation


@rdblue rdblue commented Aug 2, 2018

What changes were proposed in this pull request?

This adds spark.executor.pyspark.memory to configure Python's address space limit, resource.RLIMIT_AS. Limiting Python's address space allows Python to participate in memory management. In practice, we see fewer cases of Python taking too much memory because it doesn't know to run garbage collection. This results in YARN killing fewer containers. This also improves error messages so users know that Python is consuming too much memory:

  File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in fe_engineer
    fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
  File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
    comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, []), mat_rec_prep.get(item, []))
  File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in leven_list_compare
    permutations = sorted(permutations, reverse=True)
  MemoryError

The new pyspark memory setting is used to increase requested YARN container memory, instead of sharing overhead memory between python and off-heap JVM activity.
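
For illustration, here is a minimal way to enable the new setting from PySpark. The values below are arbitrary examples, not recommendations; only the config key spark.executor.pyspark.memory comes from this change.

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pyspark-memory-limit-example")
    # JVM heap for each executor.
    .config("spark.executor.memory", "4g")
    # Address-space limit applied to the executor's Python workers via
    # resource.RLIMIT_AS; on YARN this amount is also added to the
    # requested container memory.
    .config("spark.executor.pyspark.memory", "2g")
    .getOrCreate()
)
```

The same setting can also be passed on the command line, e.g. --conf spark.executor.pyspark.memory=2g.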

How was this patch tested?

Tested memory limits in our YARN cluster and verified that MemoryError is thrown.


rdblue commented Aug 2, 2018

@holdenk, can you help review this since it is related to PySpark?

@gatorsmile
Member

cc @ueshin


holdenk commented Aug 2, 2018

I'd be happy to. I've got a live review tomorrow, so I'll take a look at this then.

Member

Should add pysparkWorkerMemory here too.

Contributor

Maybe just switch it to use the total $executorMem instead?

Contributor Author

I like having it broken out so users can see where their allocation is going. Otherwise, users who only know about spark.executor.memory might not understand why their allocation is 1 GB higher when running PySpark. I've updated this to include the worker memory.

Member

Did you forget to output msg here?

Contributor Author

Fixed.


mccheah commented Aug 3, 2018

Does this have applications in the other cluster managers that consider overhead memory, like Kubernetes and Mesos?

Member

Tiny nit: indentation.

Contributor Author

Fixed.


@HyukjinKwon
Member

cc @BryanCutler and @icexelloss too, since we recently discussed memory issues.

@ifilonenko
Contributor

This seems very applicable to Kubernetes as well. We already increased the DEFAULT_MEMORY_OVERHEAD to account for memory issues that arise when users forget to increase the memory overhead. Could this be expanded to that cluster manager as well? I'd be happy to help by adding to this PR (or in a follow-up) to include that.

@holdenk holdenk left a comment

Thank you so much for this PR! It's really exciting to see a possible solution for one of the largest long-standing problems in PySpark!

I have a few questions for clarification, and I'd really love to see the test suite included somehow, even if it's not something for Jenkins usage, so that it can be part of the release verification.

I know we've made some good progress on having integration tests in K8s, so maybe we could have some good integration testing there eventually.

I'd also love @BryanCutler's feedback on anything in Arrow -- I don't think this would impact it, but I am curious how rlimits interplay with native allocations that things like Arrow might do (I think?). I'll do some more reading.

This is only a first pass with less than a full cup of coffee in me, so I'll come back and do a more thorough read-through, but once again I'm really, really excited to see this and hoping we can find a way to get this in for Spark 2.4.

Contributor

So the logic of this block appears to be: the user has requested a memory limit, and Python does not currently have a memory limit set. If the user has requested a different memory limit than the one already set, though, regardless of whether there is a current limit, would it make sense to set it?

It's also possible I've misunderstood the rlimit return values here.

That being said, even if that is the behaviour we want, should we use resource.RLIM_INFINITY to check whether it's unlimited?

Contributor Author

I've updated to use resource.RLIM_INFINITY.

I think this should only set the resource limit if it isn't already set. It is unlikely that it's already set because this is during worker initialization, but the intent is to not cause harm if a higher-level system (e.g., a container provider) has already set the limit.

Contributor

That makes sense. What about only setting the limit if it's lower than the current limit? (e.g., I could see a container system setting a limit based on an assumption that doesn't hold once Spark is in the mix; if we come up with a lower limit, we could apply it.)

Contributor Author

Works for me. I'll update this.
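
A minimal sketch of the behaviour agreed on here, using a hypothetical helper name (the real change lives in the PySpark worker startup): tighten RLIMIT_AS only when no limit is set or the requested limit is lower than the current one.

```
import resource

def maybe_set_address_space_limit(requested_bytes):
    # Sketch only: set RLIMIT_AS if there is no limit yet, or if the requested
    # limit is tighter than the current soft limit; otherwise leave it alone.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if soft == resource.RLIM_INFINITY or requested_bytes < soft:
        # Keep the existing hard limit so we never try to raise it.
        resource.setrlimit(resource.RLIMIT_AS, (requested_bytes, hard))
```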

Contributor

Maybe just switch it to use the total $executorMem instead?

Contributor

This is minor, but this code block is repeated; would it make sense to factor it out?

Contributor Author

The other configuration options are already duplicated, so I was trying to make as few changes as possible.

Since there are several duplicated options, I think it makes more sense to pass the SparkConf through to PythonRunner so it can extract its own configuration.

@holdenk, would you like this refactor done in this PR, or should I do it in a follow-up?

Contributor Author

I went ahead with the refactor.

Contributor

Same repeated code block as mentioned.

Contributor

It's been a while since I spent a lot of time thinking about how we launch our Python worker processes. Maybe it would make sense to add a comment here explaining the logic a bit more? Based on the documentation in PythonWorkerFactory, it appears the fork/no-fork decision is made not based on whether reuseWorker is set, but on whether we're on Windows. Is that the logic this block was attempting to handle?

Contributor Author

I thought the comments below were clear: if a single worker is reused, it gets the entire allocation. If each core starts its own worker, each one gets an equal share.

If reuseWorker is actually ignored, then this needs to be updated.

Contributor

I think there might be a misunderstanding of what reuseWorker means. The workers will be reused, but the decision about whether we fork in Python is based on whether we are on Windows. How about we both go and read the code path there and see if we reach the same understanding? I could be off too.
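
For reference, a sketch in Python of the split being discussed, with hypothetical names (the merged Scala, quoted later in this thread, divides the configured amount by spark.executor.cores):

```
def worker_memory_mb(pyspark_memory_mb, executor_cores, reuse_worker):
    # A single reused worker gets the whole allocation; otherwise each
    # per-core worker gets an equal share.
    if reuse_worker:
        return pyspark_memory_mb
    return pyspark_memory_mb // executor_cores
```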


rdblue commented Aug 3, 2018

@ifilonenko, I opened follow-up SPARK-25021 for adding the PySpark memory allocation to Kubernetes. @mccheah, I opened follow-up SPARK-25022 for Mesos.


rdblue commented Aug 3, 2018

@holdenk, I attempted to write a YARN unit test for this, but evidently the MiniYARNCluster doesn't run python workers. The task is run, but a worker is never started. If you have any idea how to fix this, I think we could have an easy test. Here's what I have so far: https://gist.github.com/rdblue/9848a00f49eaad6126fbbcfa1b039e19



holdenk commented Aug 4, 2018

@rdblue I'll take a look at that test on Monday during one of my streams after I get something else done :)



SparkQA commented Aug 28, 2018

Test build #95313 has finished for PR 21977 at commit 0b275cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// each python worker gets an equal part of the allocation. the worker pool will grow to the
// number of concurrent tasks, which is determined by the number of cores in this executor.
private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
  .map(_ / conf.getInt("spark.executor.cores", 1))
Member

@rdblue, I fixed the site to refer to Databricks's guide. Mind fixing this one if there are more changes to be pushed?

Contributor Author

Sure, thanks for taking the time to clarify it.

Contributor Author

@HyukjinKwon, sorry but it looks like this was merged before I could push a commit to update it.

Member

Oh, it's fine. I meant to fix them together if there are more changes to push. Not a big deal.


rdblue commented Aug 28, 2018

The last couple of commits have failed a test case, but there have been no code changes since a passing test. I think master is just a bit flaky right now and that this PR is fine.


vanzin commented Aug 28, 2018

Test failure seems completely unrelated. Merging to master.

@asfgit asfgit closed this in 7ad18ee Aug 28, 2018

rdblue commented Aug 29, 2018

@vanzin, thanks for merging! And thanks to everyone for the reviews!

@HyukjinKwon
Member

Sorry for the late input. I totally forgot about it and suddenly recalled it. Have you taken a look at the spark.python.worker.memory configuration before? That configuration limits memory by deciding when to spill, for a similar problem.

I think we should consolidate both into one configuration, or at least setting spark.executor.pyspark.memory should also affect spark.python.worker.memory. spark.python.worker.memory's default seems to be 512m.


rdblue commented Jan 18, 2019

@HyukjinKwon, I haven't looked at spark.python.worker.memory before. Thanks for pointing it out.

Looks like this limit controls when data is spilled to disk. Do you know what data is spilled and what accumulates in the Python worker? My understanding was that Python processes groups of rows (either pickled or in Arrow format) and doesn't typically hold data the way the executor JVM does. More information here would help determine the right way to set this.

@HyukjinKwon
Member

I think the actual data is spilled to disk in the Python RDD APIs during shuffle and/or aggregation, for instance partitionBy, sortBy, etc.

To my knowledge, this configuration does not apply to the SQL or Arrow-related APIs in Python, only the RDD APIs. During aggregation and/or batch processing, holding the groups in memory looks inevitable, and that may exceed the memory limit. In that case, data is spilled to disk as configured by spark.python.worker.memory.

I think we should basically just use spark.executor.pyspark.memory for this, if I'm not mistaken, since spark.python.worker.memory essentially means the memory that should be used by the Python worker.
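
To illustrate the RDD-only scope described above (arbitrary example values; spark.python.worker.memory is the pre-existing spill threshold, not the limit added by this PR):

```
from pyspark import SparkConf, SparkContext

# Spill threshold consulted by Python-side sorting/aggregation in the RDD APIs.
conf = SparkConf().set("spark.python.worker.memory", "256m")
sc = SparkContext(conf=conf)

# sortBy runs an external sorter inside the Python worker, which spills to
# disk past the threshold; DataFrame/Arrow code paths do not consult this.
result = sc.parallelize(range(1000000)).sortBy(lambda x: -x).take(5)
```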

@HyukjinKwon
Member

BTW, I don't think many people use spark.python.worker.memory, since arguably the RDD APIs are used less these days, and apparently all reviewers (including me) missed this configuration.

I think we can just remove it and replace it with spark.executor.pyspark.memory, with a migration guide note. If you agree on this approach, I'll make a follow-up PR. I actually have a bit of work done from this investigation.


rdblue commented Jan 21, 2019

@HyukjinKwon, I like the idea of using spark.executor.pyspark.memory to control or bound the other setting, but I don't think that it can be used to replace spark.python.worker.memory.

The problem is that the first setting controls the total size of the address space and the second is a threshold that will cause data to be spilled. If the threshold for spilling is the total size limit, then Python would run out of memory before it started spilling data.

I think it makes sense to have both settings. The JVM has executor memory and spark memory (controlled by spark.memory.fraction), so these settings create something similar: total python memory and the threshold above which PySpark will spill to disk.

I think that means the spill setting should have a better name and should be limited by the total memory. Maybe ensure its max is spark.executor.pyspark.memory - 300MB or something reasonable?

I think we should avoid introducing a property like spark.memory.fraction for Python. That is confusing for users and often ignored, leading to wasted memory. Setting explicit sizes is a better approach.
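
A sketch of the capping idea, using a hypothetical helper (the 300MB reserve is only the example figure from this comment, not an agreed value):

```
def effective_spill_threshold_mb(worker_memory_mb, pyspark_memory_mb=None,
                                 reserve_mb=300):
    # With no total Python limit configured, keep the spill setting as-is;
    # otherwise cap it so spilling starts before the address-space limit.
    if pyspark_memory_mb is None:
        return worker_memory_mb
    return min(worker_memory_mb, pyspark_memory_mb - reserve_mb)
```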

@HyukjinKwon
Member

Yea, I was thinking along those lines and that works for me too. Would you be willing to work on the follow-up?

@HyukjinKwon
Member

Btw, if we're going to keep two separate configurations, let's clarify in the JIRA or PR the practical case where the two would be set differently.

@HyukjinKwon
Member

If the threshold for spilling is the total size limit, then Python would run out of memory before it started spilling data.

Oh, BTW, I don't think that's true, since it already uses less memory than what is set.

spark/python/pyspark/rdd.py, lines 1761 to 1762 in c2d0d70:

    limit = (_parse_memory(self.ctx._conf.get(
        "spark.python.worker.memory", "512m")) / 2)

    sort = ExternalSorter(memory * 0.9, serializer).sorted
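
Working through the quoted code with the 512m default shows why spilling starts well below the configured value, which is the point being made here:

```
worker_memory_mb = 512                     # spark.python.worker.memory default
sorter_budget_mb = worker_memory_mb / 2.0  # limit = parsed value / 2
spill_trigger_mb = sorter_budget_mb * 0.9  # ExternalSorter(memory * 0.9, ...)
print(spill_trigger_mb)                    # 230.4, i.e. ~230 MiB of a 512 MiB setting
```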

@felixcheung
Member

Hey, do you guys want to capture this in a JIRA and broadcast to dev@ for visibility?


rdblue commented Jan 21, 2019

@felixcheung, good idea. I opened https://issues.apache.org/jira/browse/SPARK-26679 for this. I'm not sure that I'll have time to work on it since I'm working on some DSv2 pull requests. Others should feel free to claim it now and I can review. If no one claims it by the time I can work on it, I'll take it on.

@HyukjinKwon
Member

Thanks all, I'll move to the JIRA.

@HyukjinKwon
Member

FWIW, I hope all reviewers here add some input on the JIRA. It's confusing to have both configurations, and I think we should fix that.

<td>
The amount of memory to be allocated to PySpark in each executor, in MiB
unless otherwise specified. If set, PySpark memory for an executor will be
limited to this amount. If not set, Spark will not limit Python's memory use
@HyukjinKwon HyukjinKwon Jan 27, 2019

@rdblue, which OS did you test on?

It doesn't work in my case in non-YARN (local mode) on my Mac, and I suspect it's OS-specific.

$ ./bin/pyspark --conf spark.executor.pyspark.memory=1m
def ff(iter):
    def get_used_memory():
        import os
        import psutil
        process = psutil.Process(os.getpid())
        info = process.memory_info()
        return info.rss
    import numpy
    a = numpy.arange(1024 * 1024 * 1024, dtype="u8")
    return [get_used_memory()]

sc.parallelize([], 1).mapPartitions(ff).collect()
def ff(_):
    import sys, numpy
    a = numpy.arange(1024 * 1024 * 1024, dtype="u8")
    return [sys.getsizeof(a)]

sc.parallelize([], 1).mapPartitions(ff).collect()

Can you clarify how you tested in the PR description?

FYI,

My Mac:

>>> import resource
>>> size = 50 * 1024 * 1024
>>> resource.setrlimit(resource.RLIMIT_AS, (size, size))
>>> a = 'a' * size

at CentOS Linux release 7.5.1804 (Core):

>>> import resource
>>> size = 50 * 1024 * 1024
>>> resource.setrlimit(resource.RLIMIT_AS, (size, size))
>>> a = 'a' * size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

Looks like we should note this for clarification. For instance, we could just document that this feature is dependent on Python's resource module. WDYT?
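
A minimal sketch of the kind of guard such a note might describe, with a hypothetical helper (RLIMIT_AS enforcement clearly differs between macOS and Linux, as shown above, and the resource module is unavailable on some platforms):

```
import warnings

def try_set_pyspark_memory_limit(limit_bytes):
    # Best effort only: 'resource' may be missing (e.g. on Windows), and the
    # OS may not enforce RLIMIT_AS in the same way everywhere.
    try:
        import resource
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    except (ImportError, ValueError, OSError) as exc:
        warnings.warn("could not set RLIMIT_AS (%s); Python memory is not limited" % exc)
```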

Contributor Author

Sounds fine to me. I tested in a Linux environment.

asfgit pushed a commit that referenced this pull request Jan 28, 2019
…set via 'spark.executor.pyspark.memory'

## What changes were proposed in this pull request?

#21977 added a feature to limit Python worker resources.
This PR is a kind of follow-up to it. It proposes to add a test that checks the actual resource limit set by 'spark.executor.pyspark.memory'.

## How was this patch tested?

Unit tests were added.

Closes #23663 from HyukjinKwon/test_rlimit.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
asfgit pushed a commit that referenced this pull request Jan 31, 2019
…ndent on 'resource'

## What changes were proposed in this pull request?

This PR adds a note that `spark.executor.pyspark.memory` is dependent on the `resource` module's behaviour for Python memory usage.

For instance, I at least see some difference at #21977 (comment)

## How was this patch tested?

Manually built the doc.

Closes #23664 from HyukjinKwon/note-resource-dependent.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
This adds `spark.executor.pyspark.memory` to configure Python's address space limit, [`resource.RLIMIT_AS`](https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS). Limiting Python's address space allows Python to participate in memory management. In practice, we see fewer cases of Python taking too much memory because it doesn't know to run garbage collection. This results in YARN killing fewer containers. This also improves error messages so users know that Python is consuming too much memory:

```
  File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in fe_engineer
    fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
  File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
    comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, []), mat_rec_prep.get(item, []))
  File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in leven_list_compare
    permutations = sorted(permutations, reverse=True)
  MemoryError
```

The new pyspark memory setting is used to increase requested YARN container memory, instead of sharing overhead memory between python and off-heap JVM activity.

Tested memory limits in our YARN cluster and verified that MemoryError is thrown.

Author: Ryan Blue <[email protected]>

Closes apache#21977 from rdblue/SPARK-25004-add-python-memory-limit.

(cherry picked from commit 7ad18ee)

Conflicts:
	core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala
	python/pyspark/worker.py
	sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonForeachWriter.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/python/WindowInPandasExec.scala