[SPARK-21271][SQL] Ensure Unsafe.sizeInBytes is a multiple of 8 #18503
Conversation
Test build #79042 has finished for PR 18503 at commit
Test build #79048 has finished for PR 18503 at commit
@cloud-fan I believe that this failure occurs because the checkpoint file has a value record whose size is not a multiple of 8 (e.g. 28). We could solve this failure by regenerating these files. However, I think that the real issue is that checkpoint files whose size may not be a multiple of 8 already exist in production environments. Should we resize the value record size? What do you think?
    keyRowId = numRows;
    keyRow.pointTo(base, recordOffset, klen);
-   valueRow.pointTo(base, recordOffset + klen, vlen + 4);
+   valueRow.pointTo(base, recordOffset + klen, vlen + 8);
why add 8 here?
Line 59 puts a long value whose length is 8. In summary, lines 57 and 59 consume vlen + 8 bytes starting from base + recordOffset + klen.
Strictly speaking, the final long value doesn't belong to the value row, so why are we doing this?
Good catch. While the long consumes 8 bytes in the page, it is not part of the UnsafeRow.
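To summarize the layout being discussed, here is a minimal sketch inferred from the diff above (an illustration, not the exact Spark source; names follow the snippet):

    // One record in the page, starting at base + recordOffset:
    //   [0, klen)                    key bytes     -> keyRow
    //   [klen, klen + vlen)          value bytes   -> valueRow
    //   [klen + vlen, klen + vlen+8) trailing long -> page bookkeeping, not part of the UnsafeRow
    def recordLength(klen: Int, vlen: Int): Long = klen + vlen + 8L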
Test build #79214 has finished for PR 18503 at commit
    keyRowId = numRows;
    keyRow.pointTo(base, recordOffset, klen);
    valueRow.pointTo(base, recordOffset + klen, vlen + 4);
I'm wondering why we did this before. Was it a mistake?
I have the same question.
@sameeragarwal had a similar question one year ago. However, there was no response from @ooq.
I recall it being intentional.
See discussion here.
@ooq thank you for pointing out that interesting discussion.
It seems to make sense for page management. The question from @cloud-fan and me is whether valueRow should cover only vlen bytes; I think the +4 is for page management.
-   val keyRowBuffer = new Array[Byte](keySize)
+   // If key size in an existing file is not a multiple of 8, round it to multiple of 8
+   val keyAllocationSize = ((keySize + 7) / 8) * 8
+   val keyRowBuffer = new Array[Byte](keyAllocationSize)
so RowBasedKeyValueBatch is the format for the state store? cc @zsxwing
I think that RowBasedKeyValueBatch is not used by the state store in HDFSBackedStateStoreProvider for now.
Why do we have this logic? What do we write into the state store?
Here is why I added this logic.
I believe that this failure occurs because the checkpoint file has a value record whose size is not a multiple of 8 (e.g. 28). Thus, I always round its size up to a multiple of 8.
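For illustration, the round-up is plain integer arithmetic (a minimal sketch of the expression used in the diff above):

    // Round a byte size up to the next multiple of 8.
    def roundUpTo8(size: Int): Int = ((size + 7) / 8) * 8

    roundUpTo8(28) // 32: a 28-byte record gets a 32-byte buffer
    roundUpTo8(32) // 32: already-aligned sizes are unchanged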
I think all unsafe rows have a size that is a multiple of 8, except those from RowBasedKeyValueBatch in the previous code. So I'm wondering how the state store can have unsafe rows with the wrong size; does the state store use RowBasedKeyValueBatch?
I agree. If we have to regenerate, then we are breaking something, and we must not, since we have been guaranteeing backward compatibility. If @cloud-fan claims that all unsafe rows except RowBasedKeyValueBatch should have a size that is a multiple of 8, then we need to understand what is going on: why does reading the checkpoint files fail?
OK, we need to figure out what's going on; it seems there are other places where we may have a wrong size in UnsafeRow.
We have to know exactly how these checkpoint files were generated. Were these files generated by the method that @kunalkhamar mentioned, or by another tool?
Should we add the program that @kunalkhamar pointed out here as a new test case, to check whether all of the sizes in UnsafeRow are correct?
@cloud-fan @zsxwing @tdas @kunalkhamar
I misunderstood. To store state, HDFSBackedStateStoreProvider is used.
I added a new test suite to check HDFSBackedStateStoreProvider for storing and restoring state, as @kunalkhamar suggested here.
Do you think it makes sense?
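For context, a rough, hypothetical sketch of what such a suite looks like, modeled on the existing streaming aggregation tests (data and names are illustrative, not the exact test):

    testQuietly("store to and recover from a checkpoint") {
      val inputData = MemoryStream[Int]
      val aggregated = inputData.toDF().groupBy($"value").agg(count("*"))

      testStream(aggregated, OutputMode.Complete)(
        AddData(inputData, 1, 2, 2),
        CheckAnswer((1, 1), (2, 2)),
        StopStream,
        // Restarting restores state from the checkpoint files, exercising the
        // UnsafeRow size handling on the read path.
        StartStream(),
        AddData(inputData, 1),
        CheckAnswer((1, 2), (2, 2))
      )
    }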
Test build #79583 has finished for PR 18503 at commit
    } else {
-     val valueRowBuffer = new Array[Byte](valueSize)
+     // If value size in an existing file is not a multiple of 8, round it to multiple of 8
+     val valueAllocationSize = ((valueSize + 7) / 8) * 8
Can this be made into a utility function inside UnsafeRow? It seems like this sort of adjustment should not be a concern of external users of UnsafeRow.
For example, how about something like this:

    class UnsafeRow {
      def readFromStream(byteStream: InputStream, bytes: Int): UnsafeRow = ???
    }
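A minimal, self-contained sketch of such a helper, assuming a standalone utility object rather than a method on UnsafeRow itself (the object name and signature are hypothetical; the round-up mirrors the PR's current approach):

    import java.io.{DataInputStream, InputStream}
    import org.apache.spark.sql.catalyst.expressions.UnsafeRow

    object UnsafeRowIO {
      def readFromStream(in: InputStream, numFields: Int, sizeInBytes: Int): UnsafeRow = {
        // Allocate a buffer rounded up to a multiple of 8, so pointTo always
        // sees an aligned size regardless of what was persisted.
        val allocSize = ((sizeInBytes + 7) / 8) * 8
        val buffer = new Array[Byte](allocSize)
        new DataInputStream(in).readFully(buffer, 0, sizeInBytes)
        val row = new UnsafeRow(numFields)
        row.pointTo(buffer, allocSize)
        row
      }
    }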
@cloud-fan what do you think?
      CheckAnswer((1, 2), (2, 2), (3, 2)))
  }

  testQuietly("store to and recover from a checkpoint") {
I don't think this test is needed. There are existing tests that already cover reading from checkpoints, etc. The critical test was reading 2.1 checkpoint files, which seems to be passing.
This test also checks whether the length of checkpoints is a multiple of 8. Does it make sense? Or is there another test suite that checks the length?
It does not really check it explicitly, does it? It tests it implicitly by creating checkpoints and then restarting. There are other tests that already do the same thing; e.g., this test is effectively the same as
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingAggregationSuite.scala#L88
Ah, you are right. This test currently relies on the internal assert at Unsafe.pointTo for checking a multiple of 8.
Shall we remove this test?
Yes, I will remove this, since the test that @tdas pointed out causes the same assertion failure that my test case expected.
     */
    public void pointTo(Object baseObject, long baseOffset, int sizeInBytes) {
      assert numFields >= 0 : "numFields (" + numFields + ") should >= 0";
+     assert sizeInBytes % 8 == 0 : "sizeInBytes (" + sizeInBytes + ") should be a multiple of 8";
I think we only need the assertion here, in pointTo, and in setTotalSize. Other places are just checking the length of existing unsafe rows, which is unnecessary.
Yes, done.
| s"Error reading delta file $fileToRead of $this: key size cannot be $keySize") | ||
| } else { | ||
| val keyRowBuffer = new Array[Byte](keySize) | ||
| // If key size in an existing file is not a multiple of 8, round it to multiple of 8 |
I don't think we can round up. Assume the actual length of an unsafe row is 8: previously we would append 4 bytes and get an unsafe row of 12 bytes, and save it to the checkpoint. So here, when reading an old checkpoint, we need to read 12 bytes but set the length to 8.
BTW, we only need to do this for the value, not the key.
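A worked sketch of the read path @cloud-fan describes (a hypothetical helper; variable names follow the surrounding diff, and Guava's ByteStreams is assumed available):

    import java.io.DataInputStream
    import com.google.common.io.ByteStreams
    import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    import org.apache.spark.sql.types.StructType

    def readValueRow(input: DataInputStream, valueSize: Int, valueSchema: StructType): UnsafeRow = {
      // valueSize comes from the delta file; in a pre-2.3 checkpoint it may be the
      // real row length plus 4 (e.g. 12 persisted bytes for an 8-byte row).
      val valueRowBuffer = new Array[Byte](valueSize)
      ByteStreams.readFully(input, valueRowBuffer, 0, valueSize) // consume all persisted bytes
      val valueRow = new UnsafeRow(valueSchema.fields.length)
      // Expose only the 8-byte-aligned prefix; the 4 trailing bytes were never row data.
      valueRow.pointTo(valueRowBuffer, (valueSize / 8) * 8)
      valueRow
    }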
Test build #79770 has finished for PR 18503 at commit
    val valueRow = new UnsafeRow(valueSchema.fields.length)
-   valueRow.pointTo(valueRowBuffer, valueSize)
+   // If valueSize in existing file is not multiple of 8, round it down to multiple of 8
+   valueRow.pointTo(valueRowBuffer, (valueSize / 8) * 8)
Nit: this isn't rounding; it essentially floors to a multiple of 8.
@cloud-fan is this safe to do with ANY row generated in earlier Spark 2.0 - 2.2? I want to be 100% sure.
Yes, because the extra bytes that exceed the 8-byte boundary in an UnsafeRow are never read, so it's safe to ignore them. And here we still respect the row length when reading the binary, so no data is missed.
@kiszk can we add more comments to explain why this can happen? We should say that before Spark 2.3 we mistakenly appended 4 bytes to the value row in the aggregate buffer, which got persisted into the checkpoint data, and so on.
Sure, added more comments.
Test build #79924 has finished for PR 18503 at commit
Test build #79929 has finished for PR 18503 at commit
+   // This is work around for the following.
+   // Pre-Spark 2.3 mistakenly append 4 bytes to the value row in
+   // `FixedLengthRowBasedKeyValueBatch`, which gets persisted into the checkpoint data
+   valueRow.pointTo(valueRowBuffer, (valueSize / 8) * 8)
@cloud-fan @kiszk Just to confirm again, are we absolutely sure that the issue is that there are 4 extra bytes in checkpointed rows, and that this truncation is therefore safe to do?
Yes, I'm absolutely sure, as we store aggregate buffers in checkpoint files.
BTW, I ran this test without the workaround here, and it still passes. This means that after this PR, RowBasedKeyValueBatch is fixed and the generated checkpoint files have the correct row length.
I also ran the test without the fix for RowBasedKeyValueBatch, and it fails. I think this proves that RowBasedKeyValueBatch is the reason why we have wrong row lengths in checkpoint files.
    val valueRow = new UnsafeRow(valueSchema.fields.length)
-   valueRow.pointTo(valueRowBuffer, valueSize)
+   // If valueSize in existing file is not multiple of 8, floor it to multiple of 8.
+   // This is work around for the following.
nit: This is a workaround for the following:
-   valueRow.pointTo(valueRowBuffer, valueSize)
+   // If valueSize in existing file is not multiple of 8, floor it to multiple of 8.
+   // This is work around for the following.
+   // Pre-Spark 2.3 mistakenly append 4 bytes to the value row in
nit: Prior to Spark 2.3, we mistakenly ...
+   // If valueSize in existing file is not multiple of 8, floor it to multiple of 8.
+   // This is work around for the following.
+   // Pre-Spark 2.3 mistakenly append 4 bytes to the value row in
+   // `FixedLengthRowBasedKeyValueBatch`, which gets persisted into the checkpoint data
and in VariableLengthRowBasedKeyValueBatch; let's just say RowBasedKeyValueBatch
LGTM, pending tests
Test build #79956 has finished for PR 18503 at commit
retest this please
@cloud-fan @kiszk thanks for confirming. LGTM pending tests.
Test build #79969 has finished for PR 18503 at commit
thanks, merging to master!
What changes were proposed in this pull request?
This PR ensures that Unsafe.sizeInBytes is a multiple of 8. If this is not satisfied, Unsafe.hashCode causes an assertion violation.
How was this patch tested?
Will add test cases