[SPARK-3151] [Block Manager] DiskStore.getBytes fails for files larger than 2GB #18855
Conversation
…on to represent a disk backed block data.
…ons in the tested class.
…of the >2gb tests.
@rxin, @JoshRosen, @cloud-fan,
}

override def toByteBuffer(): ByteBuffer = {
  require( size < Int.MaxValue
Is it better to check blockSize since this method refers to blockSize?
@kiszk, I'm not sure I'm following your comment.
This requirement results in an explicit error when one tries to obtain a ByteBuffer beyond java.nio's limitations.
The original code used to fail in line 115 when calling ByteBuffer.allocate with a block size larger than 2GB; the newer code fails explicitly in this case.
...or do you mean referring to the blockSize val rather than the size method? I can't really see a difference in that case.
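For illustration, a minimal standalone sketch of the two failure modes being discussed, assuming a made-up 3GB block size (this is not the PR's code; ByteBuffer.allocate takes an Int, so a Long size has to be checked before narrowing):

import java.nio.ByteBuffer
import scala.util.Try

object TwoGbLimitDemo extends App {
  // A made-up block size just over the 2GB line.
  val blockSize: Long = 3L * 1024 * 1024 * 1024

  // Old behaviour: the Long size is narrowed to Int on the way into
  // ByteBuffer.allocate, which then fails (negative capacity) far from the real cause.
  val oldStyle = Try(ByteBuffer.allocate(blockSize.toInt))
  println(oldStyle) // Failure(java.lang.IllegalArgumentException: ...)

  // New behaviour: an explicit, early check whose message names the actual limitation.
  val newStyle = Try {
    require(blockSize < Int.MaxValue,
      s"can't create a byte buffer of size $blockSize since it exceeds Int.MaxValue")
    ByteBuffer.allocate(blockSize.toInt)
  }
  println(newStyle) // Failure(java.lang.IllegalArgumentException: requirement failed: ...)
}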
Sorry for confusing you. I mean the last line in your comment. I know the size method has the same value as blockSize, since I saw the size method.
Other places in toByteBuffer use blockSize instead of size. For ease of code reading, it would be good to use blockSize instead of size here as well.
What do you think?
override def dispose(): Unit = {}

  private def open() = new FileInputStream(file).getChannel
nit: remove 2 spaces for better indent.
will do 😎
val blockId = BlockId("rdd_1_2")
diskStore.put(blockId) { chan =>
  val arr = new Array[Byte](mb)
  for{
nit: for (
val arr = new Array[Byte](mb)
for{
  _ <- 0 until 2048
}{
nit: } {
val diskBlockManager = new DiskBlockManager(conf, deleteFilesOnStop = true)
val diskStore = new DiskStore(conf, diskBlockManager, new SecurityManager(conf))

val mb = 1024*1024
nit: 1024 * 1024
@kiszk, fixed styling + readability according to your comments.
Yeah, please refer to http://apache-spark-developers-list.1001551.n3.nabble.com/Some-PRs-not-automatically-linked-to-JIRAs-td22067.html. It looks like there are some problems related to it.
Looks reasonable. Can you explain which end-to-end cases this patch fixes, and also which end-to-end cases remain problematic?
Sounds good to me except
|
@HyukjinKwon , interesting reading but I couldn't find a concrete reason or solution to the issue. since we're using a specific commercial distro I couldn't test this patch on my use case (we got rid of the large partitions anyway by tuning number of partitions and tweaking the DAG altogether 😎 ), so I did the next best thing: reproduce the issue in a test case, please see test "blocks larger than 2gb" in |
  }
}

override def toByteBuffer(): ByteBuffer = {
We will still hit the 2GB limitation here; I'm wondering which end-to-end use cases are affected by it.
Indeed.
I chose to postpone the failure from DiskStore.getBytes to this place, as I believe it introduces no regression while still allowing the more common 'streaming'-like use case.
Furthermore, I think this plays well with the comment about the future deprecation of org.apache.spark.network.buffer.ManagedBuffer#nioByteBuffer, which seems to be the main reason for BlockData exposing the toByteBuffer method.
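For illustration, a minimal sketch of the streaming-style consumption that is unaffected by the deferred check; the file path is made up and the stream stands in for what a BlockData-style toInputStream() would return (this is not the PR's code):

import java.io.{BufferedInputStream, FileInputStream, InputStream}

object StreamingConsumerSketch {
  // Reads a (possibly >2GB) block incrementally; never needs a single ByteBuffer.
  def consume(in: InputStream): Long = {
    val buf = new Array[Byte](64 * 1024)
    var total = 0L
    var n = in.read(buf)
    while (n != -1) {
      // process buf(0 until n) here
      total += n
      n = in.read(buf)
    }
    total
  }

  def main(args: Array[String]): Unit = {
    // The path is made up; in Spark this stream would come from something like
    // blockData.toInputStream() rather than from a raw file.
    val in = new BufferedInputStream(new FileInputStream("/tmp/some-large-block"))
    try println(s"read ${consume(in)} bytes") finally in.close()
  }
}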
@cloud-fan
It took me roughly 4 hours, but I looked both at the shuffle code path and at BlockManager.getRemoteBytes:
it seems the former is robust to large blocks thanks to Netty's streaming capabilities,
while the latter seems to be broken, as it does not use Netty's streaming capabilities and actually tries to copy the result buffer into a heap-based buffer. I think this deserves its own JIRA/PR.
I think these two places plus the external shuffle server cover most of the relevant use cases (aside from local caching, which I believe this PR completes in terms of being 2GB-proof).
  }
}

val blockData = diskStore.getBytes(blockId)
@kiszk, this is the test case I was referring to.
I actually introduced it prior to (hopefully) fixing the bug in DiskStore.getBytes.
}

private class DiskBlockData(
    conf: SparkConf,
we can pass in minMemoryMapBytes directly.
  }
}

def testGetOrElseUpdateForLargeBlock(storageLevel : StorageLevel) {
nit: storageLevel: StorageLevel
  implicitly[ClassTag[Array[Byte]]],
  mkBlobs _
)
withClue(res1) {
does res1 have a reasonable string representation?
I think it'd print an Either whose left side is a case class with these members: an iterator (printed as an empty/non-empty iterator), an enum, and a number of bytes.
The right side is an iterator, so again this would print as an empty/non-empty iterator.
withClue(res1) {
  assert(res1.isLeft)
  assert(res1.left.get.data.zipAll(mkBlobs(), null, null).forall {
    case (a, b) =>
just a === b?
You can't compare Arrays that way: you get identity equality, which is usually not what you want. Hence the .seq that forces them to be wrapped in a Seq.
=== is a helper method in ScalaTest and should be able to compare arrays.
Even if === does not work, you have Arrays.equals, which is null-safe.
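For illustration, a small standalone sketch of the equality semantics under discussion (the values are made up; this is not code from the PR):

object ArrayEqualityDemo extends App {
  val a = Array(1, 2, 3)
  val b = Array(1, 2, 3)

  println(a == b)                        // false: Array inherits reference equality
  println(a.toSeq == b.toSeq)            // true: Seq compares element by element
  println(java.util.Arrays.equals(a, b)) // true, and also tolerates null arguments
}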
  }
}

test("getOrElseUpdate > 2gb, storage level = disk only") {
shall we just write a test in DiskStoreSuite?
Oh, we already have one; then why do we have these tests?
These tests cover more than just the DiskOnly storage level; they were crafted when I had bigger ambitions of solving the entire 2GB issue 😎, before seeing some ~100-file pull requests being abandoned or rejected.
Aside from that, these tests also exercise the entire orchestration done by BlockManager when an RDD requests a cached partition; notice that they intentionally make two calls to the BlockManager in order to simulate both code paths (cache hit, cache miss).
}

val blockData = diskStore.getBytes(blockId)
assert(blockData.size == 2 * gb)
test with 3gb to be more explicit that it's larger than 2gb?
Possible, will fix.
I guess I aimed for the lowest possible failing value.
Agreed, though I can't really understand it...
I ran SBT locally with --mem 12000 and provided the magical JVM flag that prints the JVM's CLI args; it seems SBT is indeed running with 12GB and the test is still failing, even when running this test only.
Does SBT fork before executing tests? If so, how does it configure the JVM running the tests?
…On Aug 9, 2017 08:29, "Wenchen Fan" wrote:
since this PR only focus on DiskStore, shall we remove the new tests in BlockManagerSuite? Seems the OOM only happens in BlockManagerSuite
@cloud-fan I think I found the SBT setting that controls the max heap size for forked tests; I've increased it from 3g to 6g.
Test build #80534 has finished for PR 18855 at commit
project/SparkBuild.scala (Outdated)

    .map { case (k,v) => s"-D$k=$v" }.toSeq,
    javaOptions in Test += "-ea",
-   javaOptions in Test ++= "-Xmx3g -Xss4096k"
+   javaOptions in Test ++= "-Xmx6g -Xss4096k"
I'm a little worried about this change. Since the change to BlockManagerSuite is not very related to this PR, can we revert and revisit it in follow-up PR? Then we can unblock this PR.
@cloud-fan, let's wait a few hours and see what the other guys CCed for this (the last ones to edit the build) have to say about it. If they are also worried, or do not comment, I'll revert this.
I must say I'm reluctant to revert these tests, as I personally believe that the lack of such tests contributed to Spark's 2GB issues, including this one.
I am +1 for separating it if this can be done. Let's get the changes we are sure of into the code base first.
}

override def toByteBuffer(): ByteBuffer = {
  require( blockSize < Int.MaxValue
no space after (; comma should be on this line.
val iter = store
  .serializerManager
  .dataDeserializeStream(RDDBlockId(42, 0)
  , inpStrm)(implicitly[ClassTag[Array[Byte]]])
Comma goes in the previous line. inpStrm is kind of an ugly variable name; pick one: is, in, inputStream.
  }
}

def testGetOrElseUpdateForLargeBlock(storageLevel: StorageLevel) {
Have you measured how long these tests take? I've seen this tried before in other changes related to 2g limits, and this kind of test was always ridiculously slow.
You can avoid this kind of test by making the chunk size configurable, e.g. in this line you're adding above:
val chunkSize = math.min(remaining, Int.MaxValue)
Then your test can run fast and not use a lot of memory. You just need to add extra checks that the data is being chunked properly, instead of relying on the JVM not throwing errors at you.
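For illustration, a standalone sketch of the configurable-chunk-size idea (names and signatures are illustrative, not the PR's code): lowering maxChunkBytes to a few kilobytes in a test exercises the chunking logic without allocating gigabytes of heap.

import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel
import scala.collection.mutable.ListBuffer

object ChunkedReadSketch {
  def readInChunks(channel: ReadableByteChannel,
                   totalBytes: Long,
                   maxChunkBytes: Int): Array[ByteBuffer] = {
    require(maxChunkBytes > 0, "chunk size must be positive")
    val chunks = new ListBuffer[ByteBuffer]()
    var remaining = totalBytes
    while (remaining > 0) {
      val chunkSize = math.min(remaining, maxChunkBytes.toLong).toInt
      val chunk = ByteBuffer.allocate(chunkSize)
      while (chunk.hasRemaining && channel.read(chunk) >= 0) {}  // fill this chunk
      chunk.flip()
      chunks += chunk
      remaining -= chunkSize
    }
    chunks.toArray
  }
}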
@vanzin,
I've measured; test case times range from 7 to 25 seconds on my laptop.
Point well taken 😎
7-25 seconds is really a long time for a unit test...
@cloud-fan I know,
it even gets worse when using the === operator.
I'm currently exploring the second direction pointed out by @vanzin: introducing a test-only configuration key to control the max page size.
@cloud-fan, @vanzin,
Taking the 'parameterized' approach, I'd remove most of the tests from BlockManagerSuite, as they'd require propagating this parameter to too many subsystems.
So, I'm going to modify DiskStore and DiskStoreSuite to use such a parameter. I'm not sure about leaving a test case in BlockManagerSuite that tests DISK_ONLY persistence; what do you guys think?
shall we do them together in a follow-up PR? I think the test case in DiskStoreSuite is enough.
Yes, currently working on:
- parameterizing DiskStore and DiskStoreSuite
- reverting the tests in BlockManagerSuite
- reverting the 6gb change in sbt
It would probably be easier to propagate the chunk size as a SparkConf entry that is not documented. But up to you guys.
Funny enough, that's the approach I've chosen.
Test build #80697 has finished for PR 18855 at commit
vanzin left a comment:
Looks ok to me but I'll let Wenchen take a final look.
val chunkedByteBuffer = blockData.toChunkedByteBuffer(ByteBuffer.allocate)
val chunks = chunkedByteBuffer.chunks
assert(chunks.size === 2)
for( chunk <- chunks ) {
nit: for (chunk...) {
  .set("spark.storage.memoryMapLimitForTests", "10k" )
val diskBlockManager = new DiskBlockManager(conf, deleteFilesOnStop = true)
val diskStore = new DiskStore(conf, diskBlockManager, new SecurityManager(conf))

nit: remove this empty line
  blockData.toByteBuffer()
}

assert(e.getMessage ==
nit: ===
Test build #80716 has finished for PR 18855 at commit
@cloud-fan, @vanzin,
retest this please
private val minMemoryMapBytes = conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m")
private val maxMemoryMapBytes = conf.getSizeAsBytes("spark.storage.memoryMapLimitForTests",
  s"${Int.MaxValue}b")
nit: just Int.MaxValue.toString
}

private class DiskBlockData(
    minMemoryMapBytes : Long,
nit: no space before :
// I chose to leave the original error message here
// since users are unfamiliar with the configuration key
// controlling maxMemoryMapBytes for tests
require(blockSize < maxMemoryMapBytes,
Why do we need this? I think we can see the original error message if we don't have this check and just go down the memory-map code path.
oh this is to verify the bug fix.
// controlling maxMemoryMapBytes for tests
require(blockSize < maxMemoryMapBytes,
  s"can't create a byte buffer of size $blockSize" +
  s" since it exceeds Int.MaxValue ${Int.MaxValue}.")
since it exceeds $maxMemoryMapBytes is more accurate.
or call Utils.bytesToString to make it more readable.
LGTM except some minor comments, thanks for working on it!
Test build #80728 has finished for PR 18855 at commit
Test build #80747 has finished for PR 18855 at commit
Thanks, merging to master!
What changes were proposed in this pull request?
Introduced DiskBlockData, a new implementation of BlockData representing a whole file. This is somehow related to SPARK-6236 as well.
This class follows the implementation of EncryptedBlockData, just without the encryption. Hence:
- toInputStream is implemented using a FileInputStream (todo: the encrypted version actually uses Channels.newInputStream, not sure if it's the right choice for this)
- toNetty is implemented in terms of io.netty.channel.DefaultFileRegion
- toByteBuffer fails for files larger than 2GB (same behavior as the original code, just postponed a bit); it also respects the same configuration keys defined by the original code to choose between memory mapping and a plain file read.

How was this patch tested?
Added tests to DiskStoreSuite and MemoryManagerSuite.
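For readers unfamiliar with the shape of such a class, here is a rough, simplified sketch of a file-backed block-data implementation along the lines described above. It is illustrative only: the names and details differ from the PR's actual DiskBlockData, which also handles chunked reads, the memory-map threshold, and Netty transfer.

import java.io.{File, FileInputStream, InputStream}
import java.nio.ByteBuffer

class FileBackedBlockData(file: File, blockSize: Long) {

  def size: Long = blockSize

  // Streaming access works for blocks of any size.
  def toInputStream(): InputStream = new FileInputStream(file)

  // A single ByteBuffer is capped at Int.MaxValue bytes, so fail fast with an
  // explicit message instead of overflowing inside ByteBuffer.allocate.
  def toByteBuffer(): ByteBuffer = {
    require(blockSize < Int.MaxValue,
      s"can't create a byte buffer of size $blockSize since it exceeds Int.MaxValue")
    val channel = new FileInputStream(file).getChannel
    try {
      val buf = ByteBuffer.allocate(blockSize.toInt)
      while (buf.hasRemaining && channel.read(buf) >= 0) {}  // read the whole file
      buf.flip()
      buf
    } finally {
      channel.close()
    }
  }
}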