[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows #26108

HeartSaVioR · 2019-10-14T07:26:01Z

What changes were proposed in this pull request?

This patch fixes the edge case of streaming left/right outer join described below:

Suppose query is provided as

select * from A join B on A.id = B.id AND (A.ts <= B.ts AND B.ts <= A.ts + interval 5 seconds)

and there're two rows for L1 (from A) and R1 (from B) which ensures L1.id = R1.id and L1.ts = R1.ts.
(we can simply imagine it from self-join)

Then Spark processes L1 and R1 as below:

row L1 and row R1 are joined at batch 1
row R1 is evicted at batch 2 due to join and watermark condition, whereas row L1 is not evicted
row L1 is evicted at batch 3 due to join and watermark condition

When determining outer rows to match with null, Spark applies some assumption commented in codebase, as below:

Checking whether the current row matches a key in the right side state, and that key	
has any value which satisfies the filter function when joined. If it doesn't,	
we know we can join with null, since there was never (including this batch) a match	
within the watermark period. If it does, there must have been a match at some point, so	
we know we can't join with null.

But as explained the edge-case earlier, the assumption is not correct. As we don't have any good assumption to optimize which doesn't have edge-case, we have to track whether such row is matched with others before, and match with null row only when the row is not matched.

To track the matching of row, the patch adds a new state to streaming join state manager, and mark whether the row is matched to others or not. We leverage the information when dealing with eviction of rows which would be candidates to match with null rows.

This approach introduces new state format which is not compatible with old state format - queries with old state format will be still running but they will still have the issue and be required to discard checkpoint and rerun to take this patch in effect.

Why are the changes needed?

This patch fixes a correctness issue.

Does this PR introduce any user-facing change?

No for compatibility viewpoint, but we'll encourage end users to discard the old checkpoint and rerun the query if they run stream-stream outer join query with old checkpoint, which might be "yes" for the question.

How was this patch tested?

Added UT which fails on current Spark and passes with this patch. Also passed existing streaming join UTs.

…uter nulls for already matched rows

HeartSaVioR · 2019-10-14T07:27:48Z

cc. @tdas @zsxwing @jose-torres @gaborgsomogyi

This is a revised version of the patch #23634 to reduce the diff. Please take a look. Thanks in advance.

HeartSaVioR · 2019-10-14T07:32:48Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala


-  /** A wrapper around a [[StateStore]] that stores [(key, index) -> value]. */
-  private class KeyWithIndexToValueStore extends StateStoreHandler(KeyWithIndexToValueType) {
+  private trait KeyWithIndexToValueRowConverter {


This is unavoidable refactoring, as we need to deal with both schemas depending on the state format version.
I didn't let inner join use state format version 1 (though it could be one of optimizations), as the inner join query could be changed to outer join which Spark will not be able to determine the change - correctness issue will occur then.

HeartSaVioR · 2019-10-14T07:36:22Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+    var key: UnsafeRow = null,
+    var valueIndex: Long = -1,
+    var value: UnsafeRow = null,
+    var matched: Boolean = false) {


Here matched also has a default value for state format version 1 - the main version of state format version is 2, so focused to optimize for state format version 2. (Option[Boolean] vs Boolean) The value is not used in caller if the state format version is 1, so I think this is OK.

…g other approach

HeartSaVioR · 2019-10-14T08:05:59Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

-        //   we know we can join with null, since there was never (including this batch) a match
-        //   within the watermark period. If it does, there must have been a match at some point, so
-        //   we know we can't join with null.
+        // * (state format version 1) Checking whether the current row matches a key in the


I've left origin comment as it is since it is still applied for state format version 1, and added comment for explaining state format version 2 and the reason of making a change. Please let me know if we just want to remove explanation of state format version 1.

SparkQA · 2019-10-14T11:05:25Z

Test build #112006 has finished for PR 26108 at commit 2544f2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* Helper class for representing data key to (value, matched).
case class KeyToValueAndMatched(

SparkQA · 2019-10-14T11:32:27Z

Test build #112009 has finished for PR 26108 at commit 5f0988e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

vanzin

Ok, I think I get what's going on. Things look correct, I just can't comment much on details of the APIs being used (e.g. whether there's a better / cheaper / easier way to turn an UnsafeRow into another row + some extra data, like the new matched state being tracked).

Also, KeyWithIndexAndValueWithMatched is a pretty long and very specific name. Maybe just keep the old name?

vanzin · 2019-10-24T22:35:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+        "State between versions are tend to be incompatible, so state format version shouldn't " +
+        "be modified after running.")
+      .intConf
+      .checkValue(v => Set(1, 2).contains(v), "Valid versions are 1 and 2")


checkValues(Set(1, 2))

Nice suggestion. Btw, the line is actually consistent with the old things, STREAMING_AGGREGATION_STATE_FORMAT_VERSION, FLATMAPGROUPSWITHSTATE_STATE_FORMAT_VERSION. Would we like to change them as well, or just leave it be consistent? Maybe from follow-up PR then to exclude unrelated changes.

If you want to change the others, feel free to open another PR. But I'd like the better API to be used here.

(Ignoring the discussion about removing this, in the other comment below.)

I'd like to leave this as it is, and handle this entirely. We can keep consistency regardless of follow-up PR be accepted or not.

vanzin · 2019-10-24T22:38:15Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

+  }
+
+  if (stateFormatVersion < 2 && joinType != Inner) {
+    logError(s"The query is using stream-stream outer join with state format version" +


Is logging enough? Shouldn't you just fail the query?

Actually this correctness issue is somewhat occurred in "corner case", so I wasn't 100% sure we want to force end users to discard their checkpoint. #24890 is a similar case for "possible" correctness issue and we just log warn message.

If we would like to avoid possible correctness issue in any case, I'm OK to let the query fail. Forcing end users to drop checkpoint might be unhappy one, though.

I see. I would suggest a config option to choose between error or continue, but then I dislike adding more and more config options.

I guess the question is: in which situations will you have the new version of Spark using the old snapshot format?

The way I understand it, it should only happen if you restart an app that was running 2.4, now running with 3.0. The new query will pick up the state store from the previous run and go from there.

In that situation it doesn't seem horrible to fail the query. But anyway, I'll leave it to your judgement.

So the impact matters. They need to discard checkpoint and replay the query with their previous inputs from the scratch. The impact would depend on how much data they need to replay - join may store rows depending on the condition of join, may couple of hours, even days (though it doesn't seem to be realistic in production).

Maybe replaying couple of hours of inputs wouldn't matter too much, and they still be able to run the query with old Spark version, so yes I agree it wouldn't be so horrible if we guide how to discard checkpoint and rerun the query.

Btw, as I described below, we may try to provide offline migration tool - though this patch shouldn't be blocked by that.

vanzin · 2019-10-24T22:43:32Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

+            case 1 => matchesWithRightSideState(
+              new UnsafeRowPair(kvAndMatched.key, kvAndMatched.value))
+            case 2 => kvAndMatched.matched
+            case _ => throw new IllegalStateException("Incorrect state format version! " +


s/Incorrect/Invalid (or Unexpected).

vanzin · 2019-10-24T22:43:40Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

+            case 1 => matchesWithLeftSideState(
+              new UnsafeRowPair(kvAndMatched.key, kvAndMatched.value))
+            case 2 => kvAndMatched.matched
+            case _ => throw new IllegalStateException("Incorrect state format version! " +


vanzin · 2019-10-24T22:54:17Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

+        thisRow: UnsafeRow,
+        subIter: Iterator[JoinedRow])
+      extends CompletionIterator[JoinedRow, Iterator[JoinedRow]](subIter) {
+      private var iteratorNotEmpty: Boolean = false


Can't this just be a val? hasNext is supposed to be cheap, or at the very least idempotent, so should be safe to call here.

I guess yes, if the implementation of iterator is respecting the spec of iterator. Let me change it and see whether it doesn't break anything.

vanzin · 2019-10-24T23:04:52Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+      val joinedRow = generateJoinedRow(keyIdxToValue.value)
+      if (predicate(joinedRow)) {
+        if (!keyIdxToValue.matched) {
+          // only update when matched flag is false


comment just repeats the code.

vanzin · 2019-10-24T23:23:26Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

 }

 object SymmetricHashJoinStateManager {
+  val supportedVersions = Seq(1, 2)


It would be good to just reference the same values as the config entry.

In fact, since the config is internal, you could probably just drop the check there, and rely on the checks in this code to make sure the version if correct.

Same here. Sounds nice, but may want to keep consistency. So I'm in favor of changing all, or leaving it as it is.

Same comment as before. I'd like the better approach to be used. If you want to go later and fix the other spots in the code, that's up to you.

vanzin · 2019-10-24T23:34:46Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

+        expr("leftId = rightId AND leftTime >= rightTime AND " +
+          "leftTime <= rightTime + interval 5 seconds"),
+        joinType = "rightOuter")
+      .select(col("rightId"), col("rightTime").cast("int"),


To make it easier to see / compare with the previous test, could you keep the results from the left stream to the left? (so the unmatched row looks like Row(null, null, 1, 1L) instead)

Ah OK. Actually that was to help reviewers to compare the result with previous test easily (as it would be same), but looks like it doesn't work as intended. I'll make a change.

HeartSaVioR · 2019-10-25T11:10:18Z

Thanks again for reviewing!

Also, KeyWithIndexAndValueWithMatched is a pretty long and very specific name. Maybe just keep the old name?

Yes agreed. May not strictly needed to be very specific. Will keep the old name.

SparkQA · 2019-10-25T15:32:03Z

Test build #112664 has finished for PR 26108 at commit 31ca780.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-10-25T20:19:16Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+    def put(key: UnsafeRow, valueIndex: Long, value: UnsafeRow, matched: Boolean): Unit = {
      val keyWithIndex = keyWithIndexRow(key, valueIndex)
-      stateStore.put(keyWithIndex, value)
+      val valueWithMatched = valueRowConverter.convertToValueRow(value, matched)


So, while replying to your comment, this came to my mind.

The way I see this code, when you start from a v1 state store, you'll also be writing back v1 data.

Shouldn't you write out new data, so that you fix the problem going forward once you start the app with the fixed version of Spark?

Even we write v2 data here, we only update part of state instead of full rewrite of state. It will leave v1 / v2 data co-exist and make state store not possible to read the entire rows - individual row doesn't have schema information. That's why we separate versions and isolate them. Yes that's bad, but that's the current state.

Actually we don't even have schema information for state store as well so we would encounter undefined behavior if we do such unsafe thing. SPARK-27237 (#24173) proposes to introduce schema information per state (not per row) and fail the query if schema is not compatible.

We can still rewrite the state entirely in a batch query (cannot be online modification) - please refer the README of https://github.com/HeartSaVioR/spark-state-tools. I've already implemented the migration of v1 to v2 for streaming aggregation / flatMapGroupsWithState, and I may try to implement this migration as well. I'm also proposing the project as a part of Apache Spark, but not as it is - I'm proposing batch data source for "state" first (SPARK-28190), and once we adopt it, we could also consider to have some tools for helping migration. Flink recently added the state API which denotes the necessity of the data source for "state".

Ok, think I got it. Seems like even if we add some code that would write new data in the v2 format, the state store still may have data in v1 format that can trigger the correctness issue. So it's a small gain in the end, although that old state data would be constrained to state generated before the "fixed" Spark was used.

Given that I think that failing the query when the bad combination is detected is better (I see you already made that change).

One thing is that a "rewrite" of the state in this case, at least, is not a good option. The data is missing from the v1 data, so there's no way to generate correct v2 data from it. The only thing you'd achieve by doing that is to mask errors (since the exception you're adding wouldn't be triggered anymore).

So unless it's possible to have a "mixed" solution here, where the same store can have both v1 and v2 data, erroring out is the best outcome I can think of. I can think of ways to hack it (you could e.g. compare the column count for the UnsafeRow read from the store and see if it matches the count in the expected schema, or if it has the extra fields expected by the v2 data), but haven't thoroughly thought about it.

One thing is that a "rewrite" of the state in this case, at least, is not a good option. The data is missing from the v1 data, so there's no way to generate correct v2 data from it.

Yeah you're right. At least for this case rewrite doesn't work. I forgot what I've already found months ago. Thanks for reminding.

I can think of ways to hack it (you could e.g. compare the column count for the UnsafeRow read from the store and see if it matches the count in the expected schema, or if it has the extra fields expected by the v2 data), but haven't thoroughly thought about it.

I also had thought about it a bit (and that was maybe one of review comment in previous PR) but it can bring side-effect if the query is changed. If end users change the query to let input of join containing one less column, there's a chance Spark may read "v1 row" as "v2 row" incorrectly whereas Spark should just fail the query since schema has been changed. Relying on column count is unsafe.

You could take that approach if you also had schema validation, as in your other PR. The main advantage is that it would simplify some of the versioning-related changes in this PR (i.e. you would only have to deal with the read path).

Anyway, it's more important to fix the correctness issue, so that can be done later.

SparkQA · 2019-10-26T06:07:36Z

Test build #112705 has finished for PR 26108 at commit 3b9c565.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-26T10:43:55Z

Test build #112710 has finished for PR 26108 at commit c4eb26e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2019-10-28T20:22:12Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+   * Helper class for representing data key to (value, matched).
+   * Designed for object reuse.
+   */
+  case class KeyToValueAndMatched(


I commented on the other class name, but the comment also applies to this one. AndMatched is very specific, I'd prefer to keep a more generic name (e.g. KeyToValuePair to kinda follow the old UnsafeRowPair name).

… detail

SparkQA · 2019-10-29T00:20:29Z

Test build #112797 has finished for PR 26108 at commit 1fce902.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class KeyToValuePair(

HeartSaVioR · 2019-10-30T22:33:22Z

Could we go through the next round of review? Thanks in advance!

vanzin · 2019-10-31T21:22:30Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

-  def removeByKeyCondition(removalCondition: UnsafeRow => Boolean): Iterator[UnsafeRowPair] = {
-    new NextIterator[UnsafeRowPair] {
+  def removeByKeyCondition(removalCondition: UnsafeRow => Boolean)
+      : Iterator[KeyToValuePair] = {


Fits in previous line.

vanzin · 2019-10-31T21:23:32Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

-          .map(pair => joinedRow.withLeft(nullLeft).withRight(pair.value))
+        val outerOutputIter = removedRowIter.filterNot { kv =>
+          stateFormatVersion match {
+            case 1 => matchesWithLeftSideState(


I find that if your statement doesn't fit in the same line as the case, it's easier to read when it starts on the next line. Otherwise the indentation of the next lines look weird.

vanzin · 2019-10-31T21:24:16Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

-  def removeByValueCondition(removalCondition: UnsafeRow => Boolean): Iterator[UnsafeRowPair] = {
-    new NextIterator[UnsafeRowPair] {
+  def removeByValueCondition(removalCondition: UnsafeRow => Boolean)
+      : Iterator[KeyToValuePair] = {


Fits in previous line.

vanzin · 2019-10-31T21:31:44Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

          if (hasMoreValuesForCurrentKey) {
            // First search the values for the current key.
-            val currentValue = keyWithIndexToValue.get(currentKey, index)
+            val (currentValue, matched) = keyWithIndexToValue.getWithMatched(currentKey, index)


getWithMatched can return null; is it guaranteed that it won't at this point? Otherwise you'll get an error here.

Maybe KeyWithIndexToValueRowConverterFormatV2 should return (null, false) instead of null? It would at least be closer to the semantics of the previous code.

And now that I noticed it, I kinda dislike getWithMatched for the same reason I disliked the other type names.

Is the extra overhead of valueRowConverter.convertValue allocating a new tuple, and this code "unapplying" that tuple, something that we need to worry about? I'm thinking that maybe if you return a proper type and take the same reuse approach as is done with KeyToValuePair, that would be better?

(But don't know if that can be done safely here.)

Actually, it's more than just the allocation; every time you call KeyWithIndexToValueRowConverterFormatV2.convertValue you will not only allocate that new tuple (and unapply it, depending on the call site), but you'll also be applying a projection to the UnsafeRow from the state store.

Is that overhead acceptable here?

I'm thinking that maybe if you return a proper type and take the same reuse approach as is done with KeyToValuePair, that would be better?

The optimization around reusing instance is based on the fact that setter provides the iterator. Actually current optimization was really odd for me - after some digging I understood the intention but without knowing such detail, it's super easy to mess up. Iterating two getAll() results concurrently would provide interesting result.

Getting an element with applying reuse approach would be really confusing. Maybe it can be acceptable if we invert the approach and let caller provides the instance to fill (say, "out parameter") - caller is responsible to not let the instance be messed up, but easier to do that given they have context, especially if caller calls in a loop. I'll check if it could be done safely.

Regarding projection, we've done similar things in my previous PR, removing duplicated columns from key part to reduce state store size - the patch applied projection to key row - performance didn't hurt much (more clearly, the size of delta significantly matters) - #21733 (comment)

I'd wait for hearing voice on out parameter, as it complicates the logic (all methods in call hierarchy should support out parameter to achieve object reuse). I've changed tuple to case class to deal with null issue.

vanzin · 2019-10-31T21:32:31Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

        if (numValues > 1) {
-          val valueAtMaxIndex = keyWithIndexToValue.get(currentKey, numValues - 1)
-          keyWithIndexToValue.put(currentKey, index, valueAtMaxIndex)
+          val (valueAtMaxIndex, matchedAtMaxIndex) = keyWithIndexToValue.getWithMatched(currentKey,


Same question re: getWithMatched returning null.

SparkQA · 2019-11-01T06:25:56Z

Test build #113064 has finished for PR 26108 at commit 7213d42.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
/** Helper class for representing data (value, matched). */
case class ValueAndMatchPair(value: UnsafeRow, matched: Boolean)

vanzin · 2019-11-04T23:22:52Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

-            val currentValue = keyWithIndexToValue.get(currentKey, index)
-            if (removalCondition(currentValue)) {
-              return currentValue
+            val valuePair = keyWithIndexToValue.get(currentKey, index)


Can valuePair be null here? Just like before, keyWithIndexToValue.get can return null, but I'm not sure whether in this particular spot it would return null.

I think it would only return null when there's corruption on the state store - hasMoreValuesForCurrentKey == true denotes the index is valid (not an out of index), then the return value is not expected to be null.

vanzin · 2019-11-05T22:01:55Z

Well, looks ok to the best of my knowledge (and given all the "can't do this with the current state store" replies). I'll give others one last chance to comment before pushing in a day or two.

HeartSaVioR · 2019-11-11T23:35:54Z

Looks like no comment from others so far; shall we consider merging this?

vanzin · 2019-11-11T23:46:56Z

Yes. Merging to master.

HeartSaVioR · 2019-11-12T01:21:32Z

Thanks for reviewing and merging!

dongjoon-hyun · 2020-01-22T19:54:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .createWithDefault(2)

+  val STREAMING_JOIN_STATE_FORMAT_VERSION =
+    buildConf("spark.sql.streaming.join.stateFormatVersion")


Hi, @tgravescs . The incompatibility of state format was the reason of not having this in branch-2.4.

dongjoon-hyun · 2020-01-22T19:56:44Z

Hi, All.
How do you think about the backporting? Since there was a question on this from @tgravescs in the mailing list, I want to make this sure.

On the 2.4 release I didn't see an explanation on https://issues.apache.org/jira/browse/SPARK-26154 why it can't be back ported

HeartSaVioR · 2020-01-22T22:50:31Z

We fixed the issue with "destructive way" on existing query as the state in existing query cannot be corrected. Please read below migration guide I added for this PR.

Spark 3.0 fixes the correctness issue on Stream-stream outer join, which changes the schema of state. (SPARK-26154 for more details) Spark 3.0 will fail the query if you start your query from checkpoint constructed from Spark 2.x which uses stream-stream outer join. Please discard the checkpoint and replay previous inputs to recalculate outputs.

I'm not sure we can bring this change to "bugfix" version and force end users to follow the guide.

dongjoon-hyun · 2020-01-22T23:13:09Z

As written in the migration guide, the situation is the same in 3.0.0 with old stored data, isn't it?

AFAIK, there is no available correctness fix for the existing stored data (and queries).
What we need in 2.4.5 is the same thing with 3.0.0 (a correct behavior of new queries).
Also, although we backport this with stateFormatVersion=1, it works as a legacy buggy mode, doesn't it?

HeartSaVioR · 2020-01-22T23:21:29Z

As written in the migration guide, the situation is the same in 3.0.0 with old stored data, isn't it?

End users might not be mad if "major" version requires them to lose something. They still have Spark 2.x version line to deny the change and take the risk. If we put this to 2.4.5, they have no way to deny.

Also, although we backport this with stateFormatVersion=1, it works as a legacy buggy mode, doesn't it?

I guess not. Spark will "fail" the query when Spark detects that the query is using state format version 1 and stream-stream outer join is being used. That's what I said "destructive".

HeartSaVioR · 2020-01-22T23:25:39Z

So unlike other states having versions which can be co-existed, for stream-stream outer join, there're only "valid" and "invalid" state format which cannot be co-existed. Why we still keep state format 1 is that stream-stream inner join is not affected by this bug and we don't want to let the case also discard the state. It will affect too many existing queries and I'd like to reduce the impact.

dongjoon-hyun · 2020-01-22T23:42:52Z

Got it. Thank you for the detailed explanation.

tgravescs · 2020-01-22T23:45:34Z

@HeartSaVioR would you mind putting a comment in the jira (like you describe above) describing why it can't be back ported so its easier for us to filter and target the jiras?

HeartSaVioR · 2020-01-22T23:47:49Z

OK I'll migrating my comments to JIRA issue as well. Thanks for guidance.

HeartSaVioR · 2020-01-22T23:53:48Z

https://issues.apache.org/jira/browse/SPARK-26154?focusedCommentId=17021627&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021627

HeartSaVioR · 2020-01-22T23:55:41Z

Btw, if we plan to have Spark 2.5 then makes sense to port back there (as end users still have option to reside on 2.4.x) but I'm not seeing such plan as of now.

tgravescs · 2020-01-23T00:01:40Z

thanks

tdas · 2020-03-27T21:57:45Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

      CheckAnswer(1.to(1000): _*))
  }
+
+  test("SPARK-26187 restore the stream-stream inner join query from Spark 2.4") {


@HeartSaVioR Question on this test. The point of this test is to ensure that Spark 3.x can read join state from Spark 2.4. But since the format was changed for stream-stream outer joins (that is, only outer joins use the new format), shouldn't we do this compatibility test for outer joins?

Please find SPARK-26187 restore the stream-stream outer join query from Spark 2.4 in this suite. (Yeah twos are unintentionally a bit far away, my bad. If moving test will help readability I'll do it as follow-up PR. Please let me know how you think.)

One thing, the format is changed for stream-stream joins instead of only outer joins, to allow changing the type of join inner <-> left/right outer, which is technically possible (with such correctness issue) in previous state and now possible without correctness issue.

Spark will fail the query if end users try to read state version 1 for stream-stream outer join, but will still allow state version 1 for stream-stream inner join to reduce the impact. Once they try to change the join type to outer then they'll have error message as covered in the test I mentioned above.

Thank you for the clarification.

[SPARK-26154][SS] Streaming left/right outer join should not return o…

2544f2c

…uter nulls for already matched rows

HeartSaVioR mentioned this pull request Oct 14, 2019

[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows #23634

Closed

HeartSaVioR commented Oct 14, 2019

View reviewed changes

Update the comment as it is proven to have edge case and we are takin…

5f0988e

…g other approach

HeartSaVioR commented Oct 14, 2019

View reviewed changes

dongjoon-hyun added the STRUCTURED STREAMING label Oct 14, 2019

vanzin reviewed Oct 14, 2019

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Show resolved Hide resolved

vanzin reviewed Oct 24, 2019

View reviewed changes

Address part of review comments

31ca780

vanzin reviewed Oct 25, 2019

View reviewed changes

fail the query instead of logging error

3b9c565

Fix UT and separate UT to inner/outer join case

c4eb26e

vanzin reviewed Oct 28, 2019

View reviewed changes

Rename class to be more generic / less specific to the implementation…

1fce902

… detail

vanzin reviewed Oct 31, 2019

View reviewed changes

HeartSaVioR added 2 commits November 1, 2019 10:27

Reflect part of review comments (easy ones)

cb17333

Introduce individual class instead of tuple

7213d42

vanzin reviewed Nov 4, 2019

View reviewed changes

vanzin closed this in c941362 Nov 11, 2019

HeartSaVioR deleted the SPARK-26154-shorten-alternative branch November 12, 2019 02:44

dongjoon-hyun reviewed Jan 22, 2020

View reviewed changes

tdas reviewed Mar 27, 2020

View reviewed changes

[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows #26108

[SPARK-26154][SS] Streaming left/right outer join should not return outer nulls for already matched rows #26108

Uh oh!

Conversation

HeartSaVioR commented Oct 14, 2019

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

HeartSaVioR commented Oct 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HeartSaVioR Oct 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Oct 14, 2019

Uh oh!

SparkQA commented Oct 14, 2019

Uh oh!

Uh oh!

vanzin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HeartSaVioR Oct 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HeartSaVioR commented Oct 25, 2019

Uh oh!

SparkQA commented Oct 25, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HeartSaVioR Oct 26, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

HeartSaVioR commented Oct 14, 2019 •

edited

Loading

HeartSaVioR Oct 14, 2019 •

edited

Loading

HeartSaVioR Oct 25, 2019 •

edited

Loading

HeartSaVioR Oct 26, 2019 •

edited

Loading

HeartSaVioR Oct 28, 2019 •

edited

Loading

HeartSaVioR Nov 1, 2019 •

edited

Loading