
Conversation

@JoshRosen (Contributor) commented Oct 9, 2019

What changes were proposed in this pull request?

This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads, then `createDataset`'s usage of the encoder may lead to incorrect / corrupt results, because the `Encoder`'s internal mutable state will be updated from multiple threads.

Here is an example demonstrating the problem:

```scala
import org.apache.spark.sql._

val enc = implicitly[Encoder[(Int, Int)]]

val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}

datasets.reduce(_ union _).collect().foreach {
  pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```

Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled.
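To see why shared mutable state co-mingles records even outside Spark, here is a minimal, self-contained sketch in plain Scala. `ReusedBufferEncoder` and its `copy()` method are hypothetical stand-ins for the real `ExpressionEncoder`, not Spark's API:

```scala
// Hypothetical stand-in for an encoder that serializes into one reused
// buffer, the way ExpressionEncoder reuses a single row object.
class ReusedBufferEncoder {
  private val buffer = new Array[Int](2) // internal mutable state

  def toRow(pair: (Int, Int)): Array[Int] = {
    buffer(0) = pair._1
    buffer(1) = pair._2
    buffer // every call hands back the SAME array
  }

  // Defensive copy: a fresh instance with its own buffer.
  def copy(): ReusedBufferEncoder = new ReusedBufferEncoder
}

val shared = new ReusedBufferEncoder
val first = shared.toRow((1, 1))
shared.toRow((2, 2)) // silently clobbers `first` in place

// `first` no longer holds (1, 1); with concurrent callers the clobbering
// interleaves mid-write and fields from different records get mixed.
println(first.mkString(",")) // prints "2,2"

// Giving each caller its own copy (this PR's approach) avoids the sharing:
val isolated = shared.copy().toRow((3, 3))
```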

This bug is similar to SPARK-22355 / #19577, which fixed the same kind of problem in `Dataset.collect()`.

The fix implemented here is based on #24735's updated version of the `Dataset.collect()` bugfix: use `.copy()`. For consistency, I used the same code comment / explanation as that PR.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested manually using the example listed above.

Thanks to @smcnamara-stripe for identifying this bug.

@JoshRosen JoshRosen added the SQL label Oct 9, 2019
@JoshRosen JoshRosen requested review from cloud-fan and zsxwing October 9, 2019 20:44
@JoshRosen (Contributor, Author) commented Oct 9, 2019

I'll submit a separate patch for 2.4.x.

Actually, this is a clean merge with 2.4.x, so we can merge this PR to both branches.

@JoshRosen changed the title from [SPARK-29419][SQL] Fix Encoder thread-safety issue in createDataset(Seq) to [SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) on Oct 9, 2019
@zsxwing (Member) commented Oct 9, 2019

> the caller-supplied Encoder is used in multiple threads

I would say this is a user error, though I agree it's pretty easy to overlook; I made the same mistake myself when I first used Encoder. But if we want to fix all of these problems, every public API accepting an Encoder would need the copy. I did some research on this and found a noticeable performance regression in our internal benchmark. That's why I ultimately just submitted #25209 to make it easy for users to copy an Encoder instead.

@JoshRosen (Contributor, Author) commented Oct 9, 2019

> But if we would like to fix all these problems, all public APIs accepting Encoder will need the copy.

I think that most existing uses of Encoders are de-facto thread-safe, because either (a) the use occurs inside of a Spark task, and each task gets its own fresh copy of the Encoder when the task is deserialized, or (b) the use occurs on the driver, but the code calls resolveAndBind (which internally performs a copy) prior to using the Encoder.

Given this, I suspect that this might be the only non-thread-safe Encoder usage in Spark (excluding code which is only used in Spark's unit tests). I don't think that we need to introduce similar copying in other public APIs.
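Pattern (a) can be illustrated outside Spark: a Java-serialization round-trip (which is effectively how a task receives its closure) always yields a distinct object, so per-task mutable state is never shared. The class names below are made up for illustration; this is not Spark's actual task-deserialization code:

```scala
import java.io._

// Hypothetical stateful, serializable object standing in for an Encoder.
class StatefulThing extends Serializable {
  var scratch: Int = 0 // mutable state
}

// Serialize and immediately deserialize, as task dispatch effectively does.
def roundTrip[T <: Serializable](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}

val driverSide = new StatefulThing
val taskSide = roundTrip(driverSide) // the "fresh copy" each task receives

taskSide.scratch = 42
// driverSide is untouched: deserialization produced an independent object.
println(driverSide.scratch) // prints "0"
```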

> I did some research about this and found some noticeable performance regression in our internal benchmark.

What do you think about improving the performance / reducing the cost of .copy() by refactoring the ExpressionEncoder class such that (a) all of the immutable vals become fields of the case class, (b) the current constructor becomes a .apply() on the companion object and the case class constructor becomes private, and (c) resolveAndBind calls the companion object constructor instead of copy()? Given this, I think copy() could be really cheap, effectively giving us a fresh copy of the internal mutable state but copying all other immutable attributes without performing any re-resolution, analysis, attribute binding, etc.

If we do that, we'd be able to defensively copy at very low cost (e.g. one object allocation) and then could copy-by-default and free users from having to worry about thread-safety.
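As a rough sketch of that shape (illustrative names only; the real `ExpressionEncoder` carries far more state), the idea is that the expensive analysis happens once in the companion's factory, while `copy()` only re-allocates the mutable scratch state:

```scala
// Immutable, already-analyzed state lives in constructor fields; the
// constructor is private so all construction funnels through the factory.
final class CheapToCopyEncoder private (val schema: Seq[String]) {

  // Mutable scratch buffer: created per instance, never shared via copy().
  private val buffer = new Array[Any](schema.length)

  def toRow(values: Seq[Any]): Array[Any] = {
    values.copyToArray(buffer)
    buffer
  }

  // copy() reuses the analyzed schema as-is: one object allocation,
  // with no re-resolution / re-binding work.
  def copy(): CheapToCopyEncoder = new CheapToCopyEncoder(schema)
}

object CheapToCopyEncoder {
  // The expensive work (resolution, binding, codegen in the real class)
  // happens once, here.
  def apply(rawFields: Seq[String]): CheapToCopyEncoder =
    new CheapToCopyEncoder(rawFields.map(_.trim)) // stand-in for analysis
}

val original = CheapToCopyEncoder(Seq(" a ", "b"))
val defensive = original.copy() // cheap: fresh buffer, same schema
```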

I think that's a potentially huge win from a developer productivity point-of-view: the cost / toil of having to worry about thread-unsafe code is a tax placed on end users and creates a developer education / training burden, so I think it's worth attempting to eliminate this entire class of pitfall.

@cloud-fan (Contributor)

> reducing the cost of .copy() by refactoring the ExpressionEncoder class

That sounds like a good idea to me. Can we do that first?

@JoshRosen (Contributor, Author)

> That sounds like a good idea to me. Can we do that first?

I'll prototype this. If I get it working then I'll open a second PR and will ping / link it here.

@dongjoon-hyun (Member)

Hi, @JoshRosen. Is there any update on this PR?

@JoshRosen (Contributor, Author)

Hi @dongjoon-hyun,

I spent some time prototyping a refactoring of ExpressionEncoder which separates the mutable and immutable state (in order to significantly reduce the cost of .copy()). This is doable, but it ends up involving a lot of code movement and potentially some duplication (since some helper logic is needed both during and after construction). I think that's definitely the right long-term approach, but I'll need some more time to figure out a minimally invasive and clean way of making that change.

@dongjoon-hyun (Member)

Thank you for letting me know, @JoshRosen. (I forgot to comment back here.)
I was interested in this because this is a correctness issue.

@HyukjinKwon (Member)

ok to test

@HyukjinKwon (Member)

Shall we merge this one first since 3.0 release is pretty close now?

@HyukjinKwon (Member) left a review comment

Looks okay as a band-aid fix at this moment.

@HyukjinKwon (Member)

WDYT @zsxwing, @JoshRosen, @dongjoon-hyun?

@SparkQA commented Feb 28, 2020

Test build #119064 has finished for PR 26076 at commit 62de678.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Feb 28, 2020

Test build #119081 has finished for PR 26076 at commit 62de678.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Feb 28, 2020

Test build #119088 has finished for PR 26076 at commit 62de678.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Roughly synced with @JoshRosen offline. I am going to merge this.

Merged to master, branch-3.0 and branch-2.4.

HyukjinKwon pushed a commit that referenced this pull request Mar 2, 2020
Closes #26076 from JoshRosen/SPARK-29419.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit f4499f6)
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Mar 2, 2020
Closes #26076 from JoshRosen/SPARK-29419.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit f4499f6)
Signed-off-by: HyukjinKwon <[email protected]>
@dongjoon-hyun (Member) commented Mar 2, 2020

Thank you, @JoshRosen and @HyukjinKwon.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Closes apache#26076 from JoshRosen/SPARK-29419.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>