[SPARK-29419][SQL] Fix Encoder thread-safety bug in createDataset(Seq) #26076
Conversation
Actually, this is a clean merge with 2.4.x, so we can merge this PR to both branches.
I would say this is a user error, but I agree that it's pretty easy to overlook. I made such a mistake myself when I first used
I think that most existing uses of Encoders are de facto thread-safe because the use occurs inside of a Spark task, and each task gets its own fresh copy of the Encoder when the task is deserialized. Given this, I suspect that this might be the only non-thread-safe Encoder usage in Spark (excluding code which is only used in Spark's unit tests). I don't think that we need to introduce similar copying in other public APIs.
What do you think about improving the performance / reducing the cost of copying encoders? If we do that, we'd be able to defensively copy at very low cost (e.g. one object allocation) and could then copy by default, freeing users from having to worry about thread-safety. I think that's a potentially huge win from a developer-productivity point of view: the cost / toil of having to worry about thread-unsafe code is a tax placed on end users and creates a developer education / training burden, so I think it's worth attempting to eliminate this entire class of pitfall.
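For concreteness, here is a minimal sketch of what a cheap-to-copy encoder could look like; the names below are hypothetical, not Spark's actual internals. The key property is that the immutable serialization logic is shared across copies while all mutable state lives behind one small per-instance object, so `copy()` is a single allocation.

```scala
// Hypothetical sketch of a cheap-to-copy encoder: the immutable serialization
// logic is shared across copies, while mutable scratch state is per instance.
final class CheapEncoder[T] private (serialize: T => Seq[Any]) {
  // Per-instance mutable scratch buffer; never shared between copies.
  private val scratch = scala.collection.mutable.ArrayBuffer.empty[Any]

  def toRow(value: T): Seq[Any] = {
    scratch.clear()
    scratch ++= serialize(value)
    scratch.toList // hand back an immutable snapshot
  }

  // copy() is one object allocation: `serialize` is immutable and shared.
  def copy(): CheapEncoder[T] = new CheapEncoder[T](serialize)
}

object CheapEncoder {
  def apply[T](serialize: T => Seq[Any]): CheapEncoder[T] = new CheapEncoder(serialize)
}
```

With copies that cheap, a public API like `createDataset` could call `enc.copy()` on entry unconditionally, and callers would never have to reason about encoder thread-safety.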
That sounds like a good idea to me. Can we do that first?
I'll prototype this. If I get it working then I'll open a second PR and will ping / link it here. |
Hi, @JoshRosen. Is there any update on this PR?
Hi @dongjoon-hyun, I spent some time prototyping a refactoring of |
Thank you for the update, @JoshRosen. (I forgot to comment back here.)
ok to test |
Shall we merge this one first since 3.0 release is pretty close now? |
HyukjinKwon left a comment:
Looks okay as a band-aid fix at the moment.
WDYT @zsxwing, @JoshRosen, @dongjoon-hyun?
Test build #119064 has finished for PR 26076 at commit
retest this please |
Test build #119081 has finished for PR 26076 at commit
retest this please |
Test build #119088 has finished for PR 26076 at commit
Roughly synced with @JoshRosen offline. I am going to merge this. Merged to master, branch-3.0 and branch-2.4. |
### What changes were proposed in this pull request?
This PR fixes a thread-safety bug in `SparkSession.createDataset(Seq)`: if the caller-supplied `Encoder` is used in multiple threads, then `createDataset`'s usage of the encoder may lead to incorrect / corrupt results, because the Encoder's internal mutable state is updated from multiple threads.
Here is an example demonstrating the problem:
```scala
import org.apache.spark.sql._

// Run in spark-shell, where `spark` and its implicit encoders are in scope.
// One encoder instance, shared across the threads of the parallel collection:
val enc = implicitly[Encoder[(Int, Int)]]

// Build Datasets in parallel, all serializing through the same stateful encoder:
val datasets = (1 to 100).par.map { _ =>
  val pairs = (1 to 100).map(x => (x, x))
  spark.createDataset(pairs)(enc)
}

// Every pair was created as (x, x), so any mismatch means record corruption:
datasets.reduce(_ union _).collect().foreach {
  pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
}
```
Before this PR's change, the above example fails because Spark produces corrupted records where different input records' fields have been co-mingled.
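For intuition, here is a minimal, Spark-independent sketch of this hazard class (the class below is hypothetical, not Spark code): a serializer that reuses one internal buffer is safe from a single thread, but interleaved calls from multiple threads overwrite each other's in-flight records.

```scala
// Hypothetical illustration of the hazard class (not Spark code):
// a serializer that reuses one internal buffer across calls.
class ReusingSerializer {
  // Reused on every call, like an encoder's internal row buffer.
  private val buffer = new Array[Int](2)

  def toRow(pair: (Int, Int)): Array[Int] = {
    buffer(0) = pair._1
    // A concurrent call can overwrite `buffer` right here,
    // leaving this record with one field from each caller.
    buffer(1) = pair._2
    buffer // shared mutable state escapes unless the caller clones it
  }
}
```

Calling `toRow` from many threads and retaining the returned array reproduces exactly this kind of field commingling; copying the result (or the serializer) per call avoids it.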
This bug is similar to SPARK-22355 / #19577, which addressed an analogous problem in `Dataset.collect()`.
The fix implemented here is based on #24735's updated version of the `Dataset.collect()` bugfix: use `.copy()`. For consistency, I used the same [code comment](https://github.com/apache/spark/blob/d841b33ba3a9b0504597dbccd4b0d11fa810abf3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3414) / explanation as that PR.
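As a hedged sketch of that pattern (the `StatefulEncoder` trait and `encodeAll` helper below are illustrative stand-ins, not Spark's actual API), the idea is to take a private copy of anything stateful before use and to copy each produced row before retaining it:

```scala
// Hypothetical stand-in for a stateful, non-thread-safe encoder.
trait StatefulEncoder[T] {
  def toRow(value: T): Array[Any] // writes through internal mutable state
  def copy(): StatefulEncoder[T]  // fresh instance with its own state
}

// Sketch of the defensive copy inside a createDataset-like code path.
def encodeAll[T](data: Seq[T], callerEnc: StatefulEncoder[T]): Seq[Array[Any]] = {
  // The caller may share `callerEnc` across threads, so never use it directly:
  // work against a private copy whose mutable state this call exclusively owns.
  val enc = callerEnc.copy()
  // Also copy each produced row, since the encoder reuses its row buffer.
  data.map(d => enc.toRow(d).clone())
}
```

Whether the copy happens at the encoder level or at the produced-row level, the invariant is the same: no mutable state shared with the caller survives past the call.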
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Tested manually using the example listed above.
Thanks to @smcnamara-stripe for identifying this bug.
Closes #26076 from JoshRosen/SPARK-29419.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit f4499f6)
Signed-off-by: HyukjinKwon <[email protected]>
Thank you, @JoshRosen and @HyukjinKwon.