Conversation

@ConeyLiu
Contributor

What changes were proposed in this pull request?

The code logic of MemoryStore.putIteratorAsValues and MemoryStore.putIteratorAsBytes is almost the same, so we should reduce the duplicated code between them.

How was this patch tested?

Existing UT.

@ConeyLiu
Contributor Author

Hi @cloud-fan @jiangxb1987, would you mind taking a look? Thanks a lot.

@jerryshao
Contributor

ok to test.

@SparkQA

SparkQA commented Sep 20, 2017

Test build #81961 has finished for PR 19285 at commit d2b8ccd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2017

Test build #81971 has finished for PR 19285 at commit 9ea8f49.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Sep 20, 2017

Test build #81981 has finished for PR 19285 at commit 9ea8f49.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

new DeserializedMemoryEntry[T](arrayValues, SizeEstimator.estimate(arrayValues), classTag)
val size = entry.size
// get the precise size
val size = estimateSize(true)
Contributor

Why do we need estimateSize(true)? Isn't this just creating the entry and getting entry.size?

Contributor Author

@ConeyLiu Sep 20, 2017

At this point we have just successfully unrolled the iterator. However, the size of the underlying vector may be greater than unrollMemoryUsedByThisBlock, the amount of memory we requested for unrolling the block, so we need to check again and determine whether we need to request more memory. And we should only call bbos.toChunkedByteBuffer or vector.toArray after we have requested enough memory.

The underlying storage differs here: putIteratorAsValues uses SizeTrackingVector, while putIteratorAsBytes uses ChunkedByteBufferOutputStream.

Contributor

But the previous code just calls entry.size; are you fixing a new bug?

Contributor Author

@ConeyLiu Sep 20, 2017

Previously, putIteratorAsValues seemed fine, but putIteratorAsBytes didn't check again after unrolling the iterator. The new putIterator is copied from the previous putIteratorAsValues. For SizeTrackingVector, we can call arrayValues.toIterator to get an iterator again after calling SizeTrackingVector.toArray. But for ChunkedByteBufferOutputStream, we can't go back to a stream after calling ChunkedByteBufferOutputStream.toChunkedByteBuffer (and PartiallySerializedBlock needs a stream).
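The asymmetry being described can be sketched in isolation. This is a simplified analogue using only the standard library; SizeTrackingVector and ChunkedByteBufferOutputStream themselves are Spark internals, so ArrayBuffer and ByteArrayOutputStream stand in for them here:

```scala
import java.io.ByteArrayOutputStream
import scala.collection.mutable.ArrayBuffer

// Array-backed unrolling: materializing is reversible, the array can
// hand out fresh iterators after it has been built.
val vector = ArrayBuffer(1, 2, 3)       // stand-in for SizeTrackingVector
val arrayValues = vector.toArray
val iterAgain = arrayValues.toIterator  // fine: we can iterate again

// Stream-backed unrolling: once the stream's contents are frozen into a
// buffer, there is no way to keep writing to it as a stream.
val bbos = new ByteArrayOutputStream()  // stand-in for ChunkedByteBufferOutputStream
bbos.write(Array[Byte](1, 2, 3))
val bytes = bbos.toByteArray            // stand-in for toChunkedByteBuffer
// From here on the bytes are fixed; a caller in the style of
// PartiallySerializedBlock, which still needs an open stream, cannot be served.
```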

Member

It seems deserialized values do not have a precise size, even for SizeEstimator.estimate(arrayValues). This would be confusing.

}
// Acquire storage memory if necessary to store this block in memory.
val enoughStorageMemory = {
if (unrollMemoryUsedByThisBlock <= size) {
Contributor Author

Here the size of the underlying vector or byte buffer may be greater than unrollMemoryUsedByThisBlock.

reserveAdditionalMemoryIfNecessary()
def estimateSize(precise: Boolean): Long = {
if (precise) {
serializationStream.flush()
Contributor

I don't see anywhere in the previous code that calls flush.

Contributor Author

@ConeyLiu Sep 21, 2017

Because some data is cached in the serializationStream, we can't get the precise size without calling flush. Previously we didn't check again after unrolling the block, and we directly called serializationStream.close(). But here we may need the serializationStream again if we can't get more unroll memory, so we should only call flush.
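The point about flushing can be seen with any buffered stream. A minimal sketch with java.io classes as an analogue (not Spark's SerializationStream):

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream}

val sink = new ByteArrayOutputStream()
// Buffers writes before they reach the sink, like a serialization stream.
val stream = new BufferedOutputStream(sink, 1024)

stream.write(Array.fill[Byte](100)(1))
val before = sink.size() // 0: the 100 bytes still sit in the buffer
stream.flush()
val after = sink.size()  // 100: flush pushed them through, stream stays usable
// close() would also flush, but afterwards the stream could not be written to again.
```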

Contributor

can you send a PR to fix this issue for putIteratorAsBytes first? It will make this PR easier to review

Contributor Author

OK, I'll do it tomorrow.

Contributor Author

@cloud-fan Sorry for what I said before; I read the code again. Calling serializationStream.close() here seems fine as well, because the iterator has no more values to write, which means the serializationStream isn't needed anymore.

@SparkQA

SparkQA commented Sep 22, 2017

Test build #82084 has started for PR 19285 at commit d0fcf4f.

@jiangxb1987
Contributor

@ConeyLiu Could you rebase this onto the latest master so we can continue reviewing it? Thanks!

@ConeyLiu
Contributor Author

ConeyLiu commented Nov 7, 2017

It's updated. Thanks a lot.

@SparkQA

SparkQA commented Nov 7, 2017

Test build #83527 has finished for PR 19285 at commit 3a90ad1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* original input iterator. The caller must either fully consume this iterator or call
* `close()` on it in order to free the storage memory consumed by the partially-unrolled
* block.
* @param memoryMode The values saved mode.
Contributor

nit: also add param descriptions for blockId, values and classTag.

// We only need the precise size after all values are unrolled.
arrayValues = vector.toArray
preciseSize = SizeEstimator.estimate(arrayValues)
vector = null
Contributor

It looks scary to set vector to null in the function estimateSize.

def createMemoryEntry(): MemoryEntry[T] = {
// We successfully unrolled the entirety of this block
assert(arrayValues != null, "arrayValue shouldn't be null!")
assert(preciseSize != -1, "preciseSize shouldn't be -1")
Contributor

Under which condition would preciseSize be -1?

// We successfully unrolled the entirety of this block
assert(arrayValues != null, "arrayValue shouldn't be null!")
assert(preciseSize != -1, "preciseSize shouldn't be -1")
val entry = new DeserializedMemoryEntry[T](arrayValues, preciseSize, classTag)
Contributor

Why do we need to create the val entry?

@SparkQA

SparkQA commented Nov 8, 2017

Test build #83575 has finished for PR 19285 at commit bc3ad4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

memoryMode: MemoryMode,
storeValue: T => Unit,
estimateSize: Boolean => Long,
createMemoryEntry: () => MemoryEntry[T]): Either[Long, Long] = {
Contributor

@cloud-fan Jan 19, 2018

instead of passing 3 functions, I'd like to introduce

class ValuesHolder {
  def storeValue(value)
  def estimatedSize()
  def buildEntry(): MemoryEntry
}

Member

trait?

* OOM exceptions, this method will gradually unroll the iterator while periodically checking
* whether there is enough free memory. If the block is successfully materialized, then the
* temporary unroll memory used during the materialization is "transferred" to storage memory,
* so we won't acquire more memory than is actually needed to store the block.
Contributor

let's not duplicate this documentation

@cloud-fan
Contributor

overall looks good

@jerryshao
Contributor

Are we targeting this to 2.3 or 2.4?

@cloud-fan
Contributor

It's just a refactor so I'd like to target it for 2.4

@ConeyLiu
Contributor Author

Thanks for your valuable suggestion, the code has been updated.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86557 has finished for PR 19285 at commit c988762.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

it can be a local variable.

@SparkQA

SparkQA commented Jan 24, 2018

Test build #86565 has finished for PR 19285 at commit f392217.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

val valuesHolder = new SerializedValuesHolder[T](blockId, chunkSize, classTag,
memoryMode, serializerManager)

if (keepUnrolling) {
Contributor

is it better to use this code structure?

if (keepUnrolling) {
  // get precise size and reserve extra memory if needed
}
if (keepUnrolling) {
  // create the entry 
}

Contributor Author

I do not understand what you mean, could you explain it more?

Contributor

putIteratorAsValues and putIteratorAsBytes have different code structures for the last step. In the new putIterator method you followed the code structure of putIteratorAsValues; would it be better to follow the one from putIteratorAsBytes?

Contributor Author

Thanks for the detailed explanation. It has been updated; the code looks clearer now.


private trait ValuesHolder[T] {
def storeValue(value: T): Unit
def estimatedSize(roughly: Boolean): Long
Contributor

this is not a good API design; we can do

trait ValuesHolder {
  def putValue(value: T)
  def estimatedSize: Long
  def getBuilder(): ValuesBuilder
}
trait ValuesBuilder {
  def preciseSize: Long
  def build(): MemoryEntry
}

Contributor

an example

class DeserializedValuesHolder extends ValuesHolder {
  ...
  def getBuilder = new ValuesBuilder {
    val valuesArray = vector.toArray
    def preciseSize = SizeEstimator.estimate(valuesArray)
    def build = ...
  }
}



class SerializedValuesHolder extends ValuesHolder {
  ...
  def getBuilder = new ValuesBuilder {
    serializationStream.close()
    def preciseSize = bbos.size
    def build = ...
  }
}

Contributor Author

Many thanks, I'll update it tomorrow.

}
}

if (keepUnrolling) {
Contributor

a little improvement

if (keepUnrolling) {
  val builder = valuesHolder.getBuilder()
  ...
  if (keepUnrolling) {
    val entry = builder.build()
    ...
    Right(entry.size)
  } else {
    ...
    logUnrollFailureMessage(blockId, builder.preciseSize)
    Left(unrollMemoryUsedByThisBlock)
  }
} else {
  ...
  logUnrollFailureMessage(blockId, valuesHolder.estimatedSize)
  Left(unrollMemoryUsedByThisBlock)
}

Contributor Author

updated

// We successfully unrolled the entirety of this block
serializationStream.close()

override val preciseSize: Long = bbos.size
Contributor

this can be a def?

private trait ValuesHolder[T] {
def storeValue(value: T): Unit
def estimatedSize(): Long
def getBuilder(): ValuesBuilder[T]
Contributor

add a comment to say that, after getBuilder is called, this ValuesHolder becomes invalid.
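Putting the review suggestions together, the shape the discussion converged on can be sketched as follows. This is a simplified stand-in, not the exact code that landed in MemoryStore.scala; the names follow the comments above, and MemoryEntry is the Spark-internal type the builder produces:

```scala
// A ValuesHolder accumulates values and tracks a rough size while unrolling.
// Note: after getBuilder() is called, the holder becomes invalid -- its
// underlying storage has been handed off to the builder.
private trait ValuesHolder[T] {
  def storeValue(value: T): Unit
  def estimatedSize(): Long
  def getBuilder(): MemoryEntryBuilder[T]
}

// A MemoryEntryBuilder exposes the precise size of the fully-unrolled data
// and builds the final MemoryEntry once enough storage memory is reserved.
private trait MemoryEntryBuilder[T] {
  def preciseSize: Long
  def build(): MemoryEntry[T]
}
```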

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86620 has finished for PR 19285 at commit b41f1bb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86619 has finished for PR 19285 at commit ded080d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86629 has finished for PR 19285 at commit 9e0759f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86630 has finished for PR 19285 at commit 40bdcac.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

}

private trait ValuesBuilder[T] {
def preciseSize: Long
Member

Hey guys, why not name the trait MemoryEntryBuilder? As I see from the code, it is used to build the MemoryEntry.

Contributor

good idea

@SparkQA

SparkQA commented Jan 25, 2018

Test build #86634 has finished for PR 19285 at commit 40bdcac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 26, 2018

Test build #86674 has finished for PR 19285 at commit 9d1aeef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit closed this in 3e25251 Jan 26, 2018
@ConeyLiu
Contributor Author

thanks all.
