[SPARK-17711] Compress rolled executor log #15285
Conversation
ok to test
add to whitelist
Test build #66104 has finished for PR 15285 at commit
import RollingFileAppender._

private val maxRetainedFiles = conf.getInt(RETAINED_FILES_PROPERTY, -1)
private val enableCompression = conf.getBoolean(ENABLE_COMPRESSION, false)
Should we enable this by default?
I don't want to change existing behavior.
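For context, a minimal sketch of the opt-in behavior being discussed, using the config key that appears later in this PR's diff; with the default left at false, existing deployments see no change unless they set the flag:

```scala
import org.apache.spark.SparkConf

// Minimal sketch: compression is opt-in and defaults to false, so existing
// behavior is preserved unless the user explicitly turns it on.
val conf = new SparkConf()
val enableCompression =
  conf.getBoolean("spark.executor.logs.rolling.enableCompression", false)
```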
  activeFile.delete()
} finally {
  IOUtils.closeQuietly(inputStream)
  IOUtils.closeQuietly(gzOutputStream)
What kinds of errors do we expect? If there is an exception, we will lose log data, right? It seems we should at least have a log.
It may throw some kind of IOException, which will be logged in the rollover() method.
var altRolloverFile: File = null
do {
  altRolloverFile = new File(activeFile.getParent,
    s"${activeFile.getName}$rolloverSuffix--$i").getAbsoluteFile
Want to double check: if we enable compression, the suffix will still be .gz even if we use the alternative file name, right?
Yes, it will be whatever we have before + .gz.
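To make the naming concrete, a small illustration with made-up file name, suffix, and counter values (not taken from the patch): the alternative rollover name keeps the original suffix plus the collision counter, and the .gz extension is appended on top of whatever name was chosen.

```scala
// Illustration only: the name, suffix, and counter values below are made up.
val activeFileName = "stdout"
val rolloverSuffix = "--2016-10-01--12-00-00"
val i = 1
val altRolloverName = s"$activeFileName$rolloverSuffix--$i"
val compressedName = altRolloverName + ".gz"
// altRolloverName: stdout--2016-10-01--12-00-00--1
// compressedName:  stdout--2016-10-01--12-00-00--1.gz
```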
val effectiveStart = math.max(0, start)
val buff = new Array[Byte]((effectiveEnd - effectiveStart).toInt)
val stream = new FileInputStream(file)
val stream = if (path.endsWith(".gz")) {
Should we use GZIP_LOG_SUFFIX?
yes
Oh, Utils.scala doesn't depend on FileAppender. It does use string literals in other places. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L463
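For reference, a minimal sketch of the pattern shown in the diff above; the helper name openLogStream is hypothetical, and the real change keeps the string-literal suffix check inside Utils.scala:

```scala
import java.io.{File, FileInputStream, InputStream}
import java.util.zip.GZIPInputStream

// Hypothetical helper illustrating the pattern: wrap the stream in a
// GZIPInputStream only when the file carries the ".gz" suffix.
def openLogStream(path: String): InputStream = {
  val fileStream = new FileInputStream(new File(path))
  if (path.endsWith(".gz")) new GZIPInputStream(fileStream) else fileStream
}
```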
| logInfo("Filtered files: \n" + generatedFiles.mkString("\n")) | ||
| assert(generatedFiles.size > 1) | ||
| if (isCompressed) { | ||
| assert(generatedFiles.filter(_.getName.endsWith(".gz")).size > 0) |
Should we use GZIP_LOG_SUFFIX?
}
val allText = generatedFiles.map { file =>
  Files.toString(file, StandardCharsets.UTF_8)
  if (file.getName.endsWith(".gz")) {
@tdas will also take a look
Test build #66116 has finished for PR 15285 at commit
ping @tdas
tdas left a comment
There are major concerns with this patch. Reading the log files to show in the UI may be completely broken if the files are compressed. Fixing this needs some brainstorming and adding a lot of new tests.
inputStream = new FileInputStream(activeFile)
gzOutputStream = new GZIPOutputStream(new FileOutputStream(gzFile))
IOUtils.copy(inputStream, gzOutputStream)
activeFile.delete()
Are you sure it is a good idea to delete the activeFile before closing the inputStream? I am not sure this is the right thing to do.
In fact, the docs of IOUtils.closeQuietly say that it should not be used as a replacement for normal closing.
See https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/IOUtils.html#closeQuietly(java.io.Closeable...)
So this is not right.
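A minimal sketch of the safer ordering suggested here, with illustrative names rather than the final patch: close both streams explicitly so the gzip trailer is flushed, and only delete the uncompressed file afterwards, keeping closeQuietly out of the primary close path.

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.zip.GZIPOutputStream
import org.apache.commons.io.IOUtils

// Illustrative sketch, not the final patch: compress the active file and delete
// it only after both streams have been closed successfully.
def compressAndDelete(activeFile: File, gzFile: File): Unit = {
  val inputStream = new FileInputStream(activeFile)
  try {
    val gzOutputStream = new GZIPOutputStream(new FileOutputStream(gzFile))
    try {
      IOUtils.copy(inputStream, gzOutputStream)
    } finally {
      gzOutputStream.close() // flushes the gzip trailer before anything is deleted
    }
  } finally {
    inputStream.close()
  }
  activeFile.delete() // reached only if the copy and both closes succeeded
}
```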
val SIZE_DEFAULT = (1024 * 1024).toString
val RETAINED_FILES_PROPERTY = "spark.executor.logs.rolling.maxRetainedFiles"
val DEFAULT_BUFFER_SIZE = 8192
val ENABLE_COMPRESSION = "spark.executor.logs.rolling.enableCompression"
Shouldn't we document this in the Spark docs?
new TimeBasedRollingPolicy(rolloverIntervalMillis, s"--HH-mm-ss-SSSS", false),
  sparkConf, 10)

testRolling(appender, testOutputStream, textToAppend, rolloverIntervalMillis, true)
nit: isCompressed = true
val files = testRolling(appender, testOutputStream, textToAppend, 0, true)
files.foreach { file =>
  logInfo(file.toString + ": " + file.length + " bytes")
  assert(file.length <= rolloverSize)
Maybe we should check that it is indeed gzipped by checking file.length < rolloverSize.
val appender = new RollingFileAppender(testInputStream, testFile,
  new SizeBasedRollingPolicy(rolloverSize, false), sparkConf, 99)

val files = testRolling(appender, testOutputStream, textToAppend, 0, true)
isCompressed = true
// verify whether all the data written to rolled over files is same as expected
val generatedFiles = RollingFileAppender.getSortedRolledOverFiles(
  testFile.getParentFile.toString, testFile.getName)
logInfo("Filtered files: \n" + generatedFiles.mkString("\n"))
nit: Can you change this to "Generated files: \n"? The current message is incorrect and misleading.
ping. any updates?
Test build #66709 has finished for PR 15285 at commit
Test build #66711 has finished for PR 15285 at commit
@tdas Addressed your comments
Test build #66716 has finished for PR 15285 at commit
Test build #66718 has finished for PR 15285 at commit
Test build #66725 has finished for PR 15285 at commit
@tdas Add
Test build #66942 has finished for PR 15285 at commit
Test build #66977 has finished for PR 15285 at commit
tdas left a comment
Overall functionality and tests are good, but I would like to see a bit more code refactoring to make the design cleaner.
.maximumSize(fileUncompressedLengthCacheSize)
.build[String, java.lang.Long](new CacheLoader[String, java.lang.Long]() {
  override def load(path: String): java.lang.Long = {
    Utils.getFileLength(new File(path))
I just learnt that CacheLoader.load has issues with obfuscating exceptions. So I think what may happen is that if Utils.getFileLength throws any exception, it will be wrapped and rethrown by Guava with a different stack trace, making it very confusing. I strongly suggest adding a try/catch here to log the failure (as a warning) before rethrowing the same exception.
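A hedged sketch of that suggestion; the cache size and the length computation are placeholders for the values in the patch, and the print call stands in for Spark's logWarning:

```scala
import java.io.File
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// Sketch only: the cache size and the length computation are placeholders.
// The point is logging before rethrowing, so the original failure stays
// visible even after Guava wraps it in its own exception.
val fileLengthCache: LoadingCache[String, java.lang.Long] =
  CacheBuilder.newBuilder()
    .maximumSize(100)
    .build[String, java.lang.Long](new CacheLoader[String, java.lang.Long]() {
      override def load(path: String): java.lang.Long = {
        try {
          new File(path).length() // placeholder for the patch's Utils.getFileLength
        } catch {
          case e: Throwable =>
            System.err.println(s"Cannot get file length of $path: $e") // stands in for logWarning
            throw e
        }
      }
    })
```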
val DEFAULT_BUFFER_SIZE = 8192
val ENABLE_COMPRESSION = "spark.executor.logs.rolling.enableCompression"
val FILE_UNCOMPRESSED_LENGTH_CACHE_SIZE =
  "spark.executor.logs.rolling.fileUncompressedLengthCacheSize"
This is not a configuration inside the executor; it's inside the worker. So why is it named "spark.executor"?
It has nothing to do with the executor. The worker process (which manages executors) runs this code, and it is independent of the application-specific configuration in the executor.
Spark worker configurations are named "spark.worker.*". See http://spark.apache.org/docs/latest/spark-standalone.html
So how about renaming it to "spark.worker.ui.fileUncompressedLengthCacheSize"?
  CallSite(shortForm, longForm)
}

def getFileLength(file: File): Long = {
Add docs noting that this can handle both non-compressed and gzip-compressed files.
  UIUtils.basicSparkPage(content, logType + " log page for " + pageName)
}

private val fileUncompressedLengthCacheSize = parent.worker.conf.getInt(
I thought about the organization of the code, and I don't like this. LogPage should not be aware of the details of compressing and uncompressing when Utils is doing the heavy lifting of handling gzip files in a special manner. It spreads the gzip support for log viewing across two classes unnecessarily. A cleaner approach is:
- Utils has the cache, and the cache loader calls a private util function to get the gzip file size.
- LogPage calls a public method, Utils.getFileSize(file, conf), which transparently handles compressed and non-compressed files.
In Utils, the cache should be called compressedLogFileLengthCache. It is initialized the first time Utils.getFileSize(file, conf) is called, using the configurations in conf.
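A rough sketch of that layout with hypothetical names; the real implementation would use a Guava LoadingCache sized from the worker conf rather than the simple concurrent map used here:

```scala
import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.collection.concurrent.TrieMap

// Rough sketch with hypothetical names: Utils owns the cache, a private helper
// computes the uncompressed size of a gzip file, and callers such as LogPage
// only ever see one public method.
object LogFileLengths {
  private val compressedLogFileLengthCache = TrieMap.empty[String, Long]

  // Private helper: stream through the gzip file once and count uncompressed bytes.
  private def getCompressedFileLength(file: File): Long = {
    val in = new GZIPInputStream(new FileInputStream(file))
    try {
      val buf = new Array[Byte](8192)
      var total = 0L
      var read = in.read(buf)
      while (read != -1) {
        total += read
        read = in.read(buf)
      }
      total
    } finally {
      in.close()
    }
  }

  // Single public entry point: transparently handles compressed and
  // non-compressed files, caching the expensive gzip case.
  def getFileLength(file: File): Long = {
    if (file.getName.endsWith(".gz")) {
      compressedLogFileLengthCache.getOrElseUpdate(
        file.getAbsolutePath, getCompressedFileLength(file))
    } else {
      file.length()
    }
  }
}
```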
val files = (1 to 3).map(i => new File(tmpDir, i.toString + suffix))
writeLogFile(files(0).getAbsolutePath, "0123456789".getBytes(StandardCharsets.UTF_8))
writeLogFile(files(1).getAbsolutePath, "abcdefghij".getBytes(StandardCharsets.UTF_8))
writeLogFile(files(2).getAbsolutePath, "ABCDEFGHIJ".getBytes(StandardCharsets.UTF_8))
Could you test mixed compressed and uncompressed files? I think that case arises when we compress the rolled files while the active file stays uncompressed.
Just having another file(3) which is always uncompressed should be fine.
Test build #67099 has finished for PR 15285 at commit
Test build #67106 has finished for PR 15285 at commit
@tdas Addressed your comments, please take a look.
tdas left a comment
It's looking better, but still needs some work to improve test coverage. Currently the tests do not seem to test the code path of loading from the cache, as the tests call Utils.getFileLength(), which bypasses the cache.
  CallSite(shortForm, longForm)
}

val UNCOMPRESSED_LOG_FILE_LENGTH_CACHE_SIZE = "spark.worker.ui.compressedLogFileLengthCacheSize"
private
Also, this is not the cache size; it is the cache size conf.
}

val UNCOMPRESSED_LOG_FILE_LENGTH_CACHE_SIZE = "spark.worker.ui.compressedLogFileLengthCacheSize"
val DEFAULT_UNCOMPRESSED_LOG_FILE_LENGTH_CACHE_SIZE = 100
private
  }
} catch {
  case e: Throwable =>
    logWarning(s"Cannot get file length of ${file}", e)
Actually, this is a critical error. Better to make this logError.
| Files.write("abcdefghij", files(1), StandardCharsets.UTF_8) | ||
| Files.write("ABCDEFGHIJ", files(2), StandardCharsets.UTF_8) | ||
| val suffix = getSuffix(isCompressed) | ||
| val files = (1 to 3).map(i => new File(tmpDir, i.toString + suffix)) ++ |
nit: ++ Seq(item) can probably be replaced by :+ item
  compressedLogFileLengthCache
}

def getFileLength(file: File, sparkConf: SparkConf): Long = {
Add docs. Specify why the conf is needed.
Rename sparkConf to the more specific workerConf.
val suffix = getSuffix(isCompressed)
val f1Path = tmpDir2 + "/f1" + suffix
writeLogFile(f1Path, "1\n2\n3\n4\n5\n6\n7\n8\n9\n".getBytes(StandardCharsets.UTF_8))
val f1Length = Utils.getFileLength(new File(f1Path))
Nothing in these tests is testing the cache usage. How about making the internal getFileLength completely private (not private[util]) and having it deal only with compressed files (rename it to getCompressedFileLength)? Then the only publicly visible function would be Utils.getFileLength(), which is used by both LogPage and the test code.
This would ensure that we expose only one new public interface in Utils, which gets tested thoroughly in the tests.
 * Return the file length, if the file is compressed it returns the uncompressed file length.
 * It also caches the uncompressed file size to avoid repeated decompression. The cache size is
 * read from workerConf.
 * */
nit: doc style incorrect
  fileSize
} catch {
  case e: Throwable =>
    logWarning(s"Cannot get file length of ${file}", e)
logError
Test build #67133 has finished for PR 15285 at commit
LGTM. Merging to master and 2.0
## What changes were proposed in this pull request?
This PR adds support for executor log compression.
## How was this patch tested?
Unit tests
cc: yhuai tdas mengxr
Author: Yu Peng <[email protected]>
Closes #15285 from loneknightpy/compress-executor-log.
(cherry picked from commit 231f39e)
Signed-off-by: Tathagata Das <[email protected]>
Test build #67135 has finished for PR 15285 at commit
Test build #67134 has finished for PR 15285 at commit
This one breaks hadoop-2.2 builds: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.2/1951/console
## What changes were proposed in this pull request?
This PR adds support for executor log compression.
## How was this patch tested?
Unit tests
cc: yhuai tdas mengxr
Author: Yu Peng <[email protected]>
Closes apache#15285 from loneknightpy/compress-executor-log.
What changes were proposed in this pull request?
This PR adds support for executor log compression.
How was this patch tested?
Unit tests
cc: @yhuai @tdas @mengxr