-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-16931][PYTHON] PySpark APIS for bucketBy and sortBy #14517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
python/pyspark/sql/readwriter.py
Outdated
| self._jwrite = self._jwrite.partitionBy(_to_seq(self._spark._sc, cols)) | ||
| return self | ||
|
|
||
| @since(2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since should be 2.1 since I don't think this will go into branch-2.0
|
ok to test |
|
Test build #63364 has finished for PR 14517 at commit
|
47d9ef7 to
0bc078a
Compare
|
Amended commit with style changes from MLNick. Can someone call the OK to test please |
0bc078a to
96df186
Compare
|
Test build #63411 has finished for PR 14517 at commit
|
|
Test build #63414 has finished for PR 14517 at commit
|
96df186 to
31c43e6
Compare
|
Test build #63422 has finished for PR 14517 at commit
|
31c43e6 to
ce9f9c0
Compare
|
Test build #63738 has finished for PR 14517 at commit
|
|
Test build #63739 has finished for PR 14517 at commit
|
python/pyspark/sql/readwriter.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pep8 is saying this line is too long (over 100 chars) so its failing the style tests. You can run the style tests locally with ./dev/lint-python as well for a faster turn around than Jenkins :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the note, I was getting annoyed at not knowing where to find the tools for such things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing - I'm always happy to help people get up to speed with contributing to PySpark so feel free to reach out to me if you get stuck with something similar.
|
Test build #64097 has finished for PR 14517 at commit
|
|
Test build #64106 has finished for PR 14517 at commit
|
|
Test build #64113 has finished for PR 14517 at commit
|
|
Test build #64227 has finished for PR 14517 at commit
|
|
Test build #64229 has finished for PR 14517 at commit
|
8eb8e71 to
dbb50ad
Compare
|
Test build #64240 has finished for PR 14517 at commit
|
|
What thoughts do people have about merging in? |
| except py4j.protocol.Py4JError: | ||
| spark = SparkSession(sc) | ||
|
|
||
| seed = int(time() * 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to have determistic test, testing with parquet should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GregBowyer ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been really busy with work of late, but I will try to sort this out today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GregBowyer ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GregBowyer Any progress on this? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GregBowyer ping. Let me propose to close this after a week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @zero323, would you maybe be interested in taking over this? I was thinking of taking over this if no one goes for it assuming it looks quite close to be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon By all means. I prepared a bunch of tests (7d911c647f21ada7fb429fd7c1c5f15934ff8847) and extended a bit code provided by @GregBowyer (72c04a3f196da5223ebb44725aa88cffa81036e4). I think we can skip low level tests (direct access to the files) which are already present in Scala test base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zero323, Good to know. Then, please go ahead if you are ready :).
What changes were proposed in this pull request?
API access to allow pyspark to use bucketBy and sortBy in datraframes.