[SPARK-7263] Add new shuffle manager which stores shuffle blocks in Parquet #7265
Conversation
This PR currently only has four new tests to show that:
I'm submitting this PR to solicit feedback about the overall approach. Once reviewers have given this approach their blessing, I'll spend a few days writing targeted tests. Please let me know which parts of this change concern you most, to help inform my test writing.
Test build #36713 has finished for PR 7265 at commit
Is my understanding correct that, with this shuffle manager, we wouldn't be able to do reduce-side spilling that sorts records, or any map-side spilling (because the default shuffle writer involves sorting)?
Map-side aggregation is supported in this pull request.
Having the Parquet shuffle reader follow that pattern seems preferable to me over failing when spilling would be required.
Done. The Parquet shuffle reader behaves identically to the hash shuffle reader now and uses the defined Spark Serializer.
Test build #36723 has started for PR 7265 at commit
Test build #36718 has finished for PR 7265 at commit
Jenkins has just been reconfigured to fix a testing infra bug. I'm going to kill the current build and start a new one. Jenkins, test this please.
Jenkins, test this please.
Test build #36739 has finished for PR 7265 at commit
I'm not sure why Jenkins is calling out changes to those files, since this PR makes no changes to them.
I just rebased on master to fix a pom.xml conflict.
Test build #36863 has finished for PR 7265 at commit
Jenkins, test this please.
Looks like Jenkins was in a bad state. I'll kick it to test again.
Jenkins, test this please.
Test build #36954 has finished for PR 7265 at commit
I've opened PR #7403, which includes only the changes to the Spark shuffle that serialize the key, value, and combiner class names. I'll rebase this PR to contain only the Parquet shuffle implementation work. I'm hoping that separating these will make it easier to review.
Test build #37270 has finished for PR 7265 at commit
Jenkins, test this please.
Test build #37374 has finished for PR 7265 at commit
Once #7403 is merged, I'll rebase this PR on top of it and fix the conflicts.
Test build #42301 has finished for PR 7265 at commit
[SPARK-7263] Add new shuffle manager which stores shuffle blocks in Parquet
This commit adds a new Spark shuffle manager which reads and writes shuffle data to Apache
Parquet files. Parquet exposes a file interface (not a streaming interface) because it is
column-oriented and must seek within a file for metadata such as schemas and statistics.
As such, this implementation fetches remote data into local, temporary blocks before the
data is passed to Parquet for reading.
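A minimal sketch of that fetch-to-local pattern, for illustration only; the helper name and buffer size are assumptions, not taken from this PR:

```scala
import java.io.{File, FileOutputStream, InputStream}

// Hypothetical helper: copy a remote shuffle block stream into a local temp
// file so that Parquet can seek within it (footer metadata, column chunks).
def fetchToTempFile(remote: InputStream): File = {
  val tmp = File.createTempFile("shuffle-parquet-", ".parquet")
  val out = new FileOutputStream(tmp)
  try {
    val buf = new Array[Byte](64 * 1024)
    var n = remote.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = remote.read(buf)
    }
  } finally {
    out.close()
    remote.close()
  }
  tmp // Parquet readers can now open and seek within this local file
}
```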
This manager uses the following Spark configuration parameters to configure Parquet:
spark.shuffle.parquet.{compression, blocksize, pagesize, enabledictionary}.
There is a spark.shuffle.parquet.fallback configuration option which allows users to
specify a fallback shuffle manager. If the Parquet manager finds that the classes
being shuffled have no schema information, and therefore can't be used, it falls
back to the specified manager. With this PR, only Avro IndexedRecords
are supported in the Parquet shuffle; however, it is straightforward to extend
this to other serialization systems that Parquet supports, e.g. Apache Thrift.
If no spark.shuffle.parquet.fallback is defined, any shuffle objects which are
not compatible with Parquet will cause an error to be thrown that lists the
incompatible objects.
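For illustration, a configuration along these lines would exercise the options above; the property keys come from this commit message, while the values and the fully qualified manager class name are assumptions:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Assumed class name for the new manager; the PR may register it differently.
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.parquet.ParquetShuffleManager")
  .set("spark.shuffle.parquet.compression", "snappy") // illustrative value
  .set("spark.shuffle.parquet.blocksize", (64 * 1024 * 1024).toString)
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)
  .set("spark.shuffle.parquet.enabledictionary", "true")
  // Fall back to the sort-based manager for records without schemas.
  .set("spark.shuffle.parquet.fallback", "sort")
```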
Because the ShuffleDependency forwards the key, value and combiner class information,
a full schema can be generated before the first read/write. This allows for fewer
errors (since reflection isn't used) and makes support for null values possible without
complex code.
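As a sketch of how such a schema could be built up front for Avro IndexedRecords, using only standard Avro APIs; the record layout and names here are assumptions, not the PR's actual schema:

```scala
import java.util.Arrays
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.IndexedRecord
import org.apache.avro.specific.SpecificData

// Wrap a schema in a union with null so that null keys/values are representable.
def nullable(s: Schema): Schema =
  Schema.createUnion(Arrays.asList(Schema.create(Schema.Type.NULL), s))

// Build a key/value pair schema from the classes that ShuffleDependency
// now carries, before any record is read or written.
def pairSchema(keyClass: Class[_ <: IndexedRecord],
               valueClass: Class[_ <: IndexedRecord]): Schema =
  SchemaBuilder.record("ShufflePair").namespace("shuffle").fields()
    .name("key").`type`(nullable(SpecificData.get().getSchema(keyClass))).noDefault()
    .name("value").`type`(nullable(SpecificData.get().getSchema(valueClass))).noDefault()
    .endRecord()
```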
The ExternalSorter, if needed, is set up not to spill to disk when Parquet is used. In
the future, an ExternalSorter would need to be created that can read/write Parquet.
Only record-level metrics are supported at this time. Byte-level metrics are not
currently supported and are complicated somewhat by column compression.
Now that #7403 is merged, I've rebased this PR on top of it.
Test build #42302 has finished for PR 7265 at commit
Test build #42306 has finished for PR 7265 at commit
Test build #57552 has finished for PR 7265 at commit
Might this make some 2.0.x release of Spark?
Test build #66134 has finished for PR 7265 at commit
Test build #69069 has finished for PR 7265 at commit
I'm going to close this for now. Next year we might actually come back and revisit this, probably not with the current Parquet implementation since it is not very efficient, but with some sort of columnar format.