Use tar and gzip to compress+archive shipped jars #2
Conversation
val usedFileNames = mutable.HashSet.empty[String]
for (path <- paths) {
  val file = new File(path)
  if (!file.isFile) {
Does leveraging the tar API make it easy to provide a directory, and have it send all the files underneath that directory? I can imagine that might be a nice feature, although it increases the likelihood of accidentally grabbing a lot of data that wasn't intended.
Related question: is it useful to allow a configured hard limit on tarball size?
We could do this, but we'd need to add the logic to recursively add the files in directories. I think what we have here covers enough use cases; I recall that the jars API in general doesn't allow for adding directories.
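(If directory inputs were supported later, the recursive expansion could look something like this hypothetical helper; this is a sketch, not code from the PR:

```scala
import java.io.File

// Hypothetical helper: expands any directories among the inputs into the
// regular files beneath them, depth-first.
def collectFiles(roots: Iterable[String]): Seq[File] = {
  def walk(f: File): Seq[File] =
    if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(walk)
    else Seq(f)
  roots.map(new File(_)).flatMap(walk).toSeq
}
```
)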
I'm not sure about the usefulness of a configured limit, mostly because the size that matters is the size after compression, which could be difficult for application submitters to predict.
def unpackAndWriteCompressedFiles(
    compressedData: TarGzippedData,
    rootOutputDir: File): Seq[String] = {
Is the root output directory a configuration parameter?
They're dynamically created in temp space.
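(Presumably that means something along the lines of the JDK's temp-directory API; an illustrative sketch, with a made-up prefix string:

```scala
import java.nio.file.Files

// Creates a uniquely named directory under java.io.tmpdir; the
// "spark-uploaded-jars-" prefix here is illustrative only.
val rootOutputDir = Files.createTempDirectory("spark-uploaded-jars-").toFile
```
)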
We can wait to see if anybody else has questions, but LGTM.
Force-pushed from aeb4351 to ee07588
ash211 left a comment:
I like it! Some requests for better comments and a warning log line, but nothing significant.
var deduplicationCounter = 1
while (usedFileNames.contains(resolvedFileName)) {
  resolvedFileName = s"$nameWithoutExtension-$deduplicationCounter.$extension"
  deduplicationCounter += 1
I'd be a fan of logging a warning here -- having multiple jars on a classpath with the same name is bad practice anyway, especially if their contents are different...
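(A minimal sketch of what that warning could look like, assuming the loop from the diff runs inside a class that mixes in Spark's Logging trait; the message wording is illustrative:

```scala
var deduplicationCounter = 1
while (usedFileNames.contains(resolvedFileName)) {
  // Assumes org.apache.spark.internal.Logging is mixed in, which provides
  // logWarning; duplicate names on a classpath usually signal a mistake.
  logWarning(s"File name $resolvedFileName is already in use;" +
    " renaming to avoid a clash.")
  resolvedFileName = s"$nameWithoutExtension-$deduplicationCounter.$extension"
  deduplicationCounter += 1
}
```
)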
private[spark] object CompressionUtils {
  private val BLOCK_SIZE = 10240
  private val RECORD_SIZE = 512
add comment that these are the defaults from TarArchiveOutputStream
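(The requested comment might read something like this; 10240-byte blocks and 512-byte records are indeed the defaults of Apache Commons Compress's TarArchiveOutputStream:

```scala
private[spark] object CompressionUtils {
  // 10240 and 512 match the default block and record sizes of Apache
  // Commons Compress's TarArchiveOutputStream; passing them explicitly
  // keeps the choice visible and stable across library upgrades.
  private val BLOCK_SIZE = 10240
  private val RECORD_SIZE = 512
}
```
)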
val paths = mutable.Buffer.empty[String]
val compressedBytes = Base64.decodeBase64(compressedData.dataBase64)
if (!rootOutputDir.exists) {
  rootOutputDir.mkdir
Does this need to be a mkdir -p, i.e. create intermediate directories?
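(In java.io.File terms that would be mkdirs() rather than mkdir(); a sketch of the suggested change, reusing rootOutputDir from the diff above:

```scala
// mkdir() fails when a parent directory is missing; mkdirs() creates any
// missing intermediate directories, i.e. the mkdir -p behavior.
if (!rootOutputDir.exists) {
  if (!rootOutputDir.mkdirs()) {
    throw new IllegalStateException(
      s"Failed to create output directory at ${rootOutputDir.getAbsolutePath}")
  }
}
```
)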
private val RECORD_SIZE = 512
private val ENCODING = CharsetNames.UTF_8

def createTarGzip(paths: Iterable[String]): TarGzippedData = {
add comment to this method that any folder hierarchy on the input paths is flattened and duplicate filenames have a _N suffix added before the extension
Folder hierarchies actually aren't allowed; an exception is thrown instead. Still not sure if that's the right call, however.
Ah, never mind -- we want to note that only the file names are extracted from the full folder paths.
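(Putting those points together, the scaladoc might read roughly as follows; the wording is illustrative, not the committed text:

```scala
/**
 * Compresses the given files into a gzipped tar archive.
 *
 * Directory structure is not preserved: only the file name is taken from
 * each input path, so entries from different directories sit side by side
 * in the archive. Duplicate file names are disambiguated by a numeric
 * suffix inserted before the extension (e.g. app.jar, app-1.jar).
 *
 * @param paths paths of the files to archive; each must point to a
 *              regular file, not a directory
 * @return the gzipped tar archive, base64-encoded
 */
def createTarGzip(paths: Iterable[String]): TarGzippedData = {
  ??? // body as in the diff above
}
```
)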
)
}

def unpackAndWriteCompressedFiles(
scaladoc that the return value is a seq of absolute file paths in their written location
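(And similarly for the unpack side, again as an illustrative sketch:

```scala
/**
 * Decompresses the given base64-encoded gzipped tar archive and writes
 * each contained file into the given directory.
 *
 * @return the absolute paths of the files at their written locations
 */
def unpackAndWriteCompressedFiles(
    compressedData: TarGzippedData,
    rootOutputDir: File): Seq[String] = {
  ??? // body as in the diff above
}
```
)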
 */
package org.apache.spark.deploy.kubernetes.integrationtest;

public class PiHelper {
Is this class primarily pulled out from SparkPiWithInfiniteWait.scala so that there are multiple jars? Would be worth saying that in a comment.
Merge conflicts now after the folder rename.
@mccheah, sanity-check: the merge from …
Generally, it's cleaner to rebase to keep a branch updated than to merge.
Merge branch '…cremental' into compress-jars
Still LGTM.
ash211 left a comment:
LGTM!
* Use tar and gzip to archive shipped jars.
* Address comments
* Move files to resolve merge
Augments #1 by compressing the user's jar uploads before they are sent over the wire. Note that while gzip does the actual compression, there are multiple possible choices besides tar for the format that marks the boundaries between items in the stream. Encoding the file names by hand is tricky, however, so getting this for free from tar is helpful.
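For readers unfamiliar with the combination, here is a minimal self-contained sketch of the tar-then-gzip approach using Apache Commons Compress and Commons Codec. The TarGzippedData name mirrors the PR; everything else is illustrative rather than the PR's actual implementation:

```scala
import java.io.{ByteArrayOutputStream, File, FileInputStream}

import org.apache.commons.codec.binary.Base64
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveOutputStream}
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream
import org.apache.commons.compress.utils.IOUtils

// Illustrative stand-in for the PR's TarGzippedData container.
case class TarGzippedData(dataBase64: String)

def createTarGzip(paths: Iterable[String]): TarGzippedData = {
  val raw = new ByteArrayOutputStream()
  val tar = new TarArchiveOutputStream(new GzipCompressorOutputStream(raw))
  try {
    for (path <- paths) {
      val file = new File(path)
      // Each tar entry records the file name, which is exactly the
      // "boundaries between items" bookkeeping described above.
      tar.putArchiveEntry(new TarArchiveEntry(file, file.getName))
      val in = new FileInputStream(file)
      try {
        IOUtils.copy(in, tar)
      } finally {
        in.close()
      }
      tar.closeArchiveEntry()
    }
  } finally {
    tar.close() // also finishes and closes the underlying gzip stream
  }
  TarGzippedData(Base64.encodeBase64String(raw.toByteArray))
}
```

Gzip alone compresses a single byte stream with no notion of multiple named members, which is why a container format like tar is layered underneath it here.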