[SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark #6714
@@ -179,9 +179,12 @@ def test_in_memory_sort(self):
                          list(sorter.sorted(l, key=lambda x: -x, reverse=True)))

     def test_external_sort(self):
+        class CustomizedSorter(ExternalSorter):
+            def _next_limit(self):
Contributor
I think that we should add a comment here to explain why we're mocking out this part of the code; it doesn't seem self-evident to me and I'm worried that it's going to confuse future readers of this code. Also, do you think that it's worth adding a separate test case for this path and keeping the old test? There might be some duplication of the code which does assertions over metrics, but we possibly can factor it out into a shared method.
Contributor
It seems like the intent here is to mock
Contributor
Author
Yes, without the mock it would take a long time to reach the memory limit, which slows down the tests.
+                return self.memory_limit
         l = list(range(1024))
         random.shuffle(l)
-        sorter = ExternalSorter(1)
+        sorter = CustomizedSorter(1)
         self.assertEqual(sorted(l), list(sorter.sorted(l)))
         self.assertGreater(shuffle.DiskBytesSpilled, 0)
         last = shuffle.DiskBytesSpilled
Why is the call to gc.collect being removed?
We won't change `limit` to `_next_limit()` (which calls `get_used_memory()`). This line was here to get a better number for how much memory was used; it is not needed anymore.
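
For readers unfamiliar with the pattern discussed in this review, here is a minimal, self-contained sketch of why overriding `_next_limit()` in a test subclass forces spilling. It is a toy model, not PySpark's `ExternalSorter`: the class `ToyExternalSorter`, the `spills` counter, the helper `spill_count`, measuring "memory" as a count of buffered items, and the doubling growth rule are all assumptions made for illustration. Only the names `_next_limit` and `memory_limit` come from the diff above.

```python
import heapq
import random


class ToyExternalSorter:
    """Hypothetical stand-in for an external sorter (NOT PySpark's class)."""

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit  # toy unit: number of buffered items
        self.spills = 0                   # how many sorted runs were "spilled"
        self._peak = 0                    # toy stand-in for measured memory use

    def _next_limit(self):
        # Assumed growth rule: raise the limit based on observed usage,
        # so spills become rarer as the sort proceeds.
        return max(self.memory_limit, self._peak * 2)

    def sorted(self, iterable):
        runs, buffer = [], []
        limit = self.memory_limit
        for item in iterable:
            buffer.append(item)
            self._peak = max(self._peak, len(buffer))
            if len(buffer) >= limit:
                runs.append(sorted(buffer))  # "spill" one sorted run
                self.spills += 1
                buffer = []
                limit = self._next_limit()
        if buffer:
            runs.append(sorted(buffer))
        # Merge the sorted runs lazily, as an external sort would.
        return heapq.merge(*runs)


class FixedLimitSorter(ToyExternalSorter):
    """Test-only subclass: pin the limit so spilling keeps happening."""

    def _next_limit(self):
        return self.memory_limit


def spill_count(sorter_cls, n=1024):
    """Shared check: sort shuffled data, verify the result, return spill count."""
    data = list(range(n))
    random.shuffle(data)
    sorter = sorter_cls(1)
    assert list(sorter.sorted(data)) == sorted(data)
    return sorter.spills
```

In this toy, the pinned limit makes every buffered item trigger a spill, while the growing limit spills only a handful of times for the same input. A helper like `spill_count` is one way to share the assertions between a test that uses the real limit logic and one that uses the pinned subclass, along the lines the reviewer suggests.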