Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a long-standing bug.

In TaskMemoryManager.acquireExecutionMemory, we may lock TaskMemoryManager and spill MemoryConsumers. UnsafeExternalSorter is a MemoryConsumer and locks its SpillableIterator during spill.

In UnsafeExternalSorter#SpillableIterator.loadNext, we lock SpillableIterator and may call freePage which locks TaskMemoryManager.

If there are 2 threads doing these 2 locking chains together, a deadlock happens:
thread1: lock TaskMemoryManager and then SpillableIterator
thread2: lock SpillableIterator and then TaskMemoryManager

As an example, PythonUDFExec launches 2 threads for one task, which can trigger this deadlock.

The thread dump when the deadlock happens:
thread 1

org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.spill(UnsafeExternalSorter.java:535)
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:201)
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:177) => holding Monitor(org.apache.spark.memory.TaskMemoryManager@49054460})
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:285)
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:117)
org.apache.spark.sql.execution.python.HybridRowQueue.createNewQueue(RowQueue.scala:227)
org.apache.spark.sql.execution.python.HybridRowQueue.add(RowQueue.scala:250)
org.apache.spark.sql.execution.python.EvalPythonExec$

thread 2

org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:334)
org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130)
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.access$1100(UnsafeExternalSorter.java:49)
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:593) => holding Monitor(org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator@717107972})
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:187)
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:174)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)

How was this patch tested?

manual test, by inserting some sync points in the code, to enter to code flow that triggers dead lock.

@cloud-fan
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Apr 2, 2019

Test build #104183 has finished for PR 24269 at commit 3d16c7c.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible lastPage != null on line 550 above when it is set? https://github.com/apache/spark/pull/24269/files#diff-027299fb14327ddcaba457f81ecff32cR550

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible but doesn't matter. It's kind of we delay the page releasing a little bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that both the 2 methods are inside synchronized (this)

Copy link
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@attilapiros attilapiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seams to be solved twice as here is another PR dealing with the same problem (different Jira): #24265

small nit, but LGTM otherwise

numRecords--;
upstream.loadNext();
} finally {
if (pageToFree != null) freePage(pageToFree);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: add braces (according to Oracle java style: if statements always use braces, {})

@cloud-fan
Copy link
Contributor Author

ah didn't see that PR. Since that PR is opened earlier. I'm closing mine.

@cloud-fan cloud-fan closed this Apr 2, 2019
@SparkQA
Copy link

SparkQA commented Apr 2, 2019

Test build #104213 has finished for PR 24269 at commit 20ff8c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants