Skip to content

Conversation

@hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

This PR moves the ExpressionEncoder.toRow and ExpressionEncoder.fromRow functions into their own function objects(ExpressionEncoder.Serializer & ExpressionEncoder.Deserializer). This effectively makes the ExpressionEncoder stateless, thread-safe and (more) reusable. The function objects are not thread safe, however they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).

Why are the changes needed?

ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@hvanhovell
Copy link
Contributor Author

Reviewers please first take a look at the ExpressionEncoder class and specifically the naming of the function objects.

* Function that deserializes an [[InternalRow]] into an object of type `T`. Instances of this
* class are not meant to be thread-safe.
*/
abstract class Deserializer[T] extends (InternalRow => T)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation note. I opted to go with abstract classes so we can get monomorphic call sites in many cases.

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121320 has finished for PR 28223 at commit cc8400c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121319 has finished for PR 28223 at commit a7f948c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121313 has finished for PR 28223 at commit e6c3391.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class Deserializer[T] extends (InternalRow => T)
  • abstract class Serializer[T] extends (T => InternalRow)

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121316 has finished for PR 28223 at commit b7699a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121327 has finished for PR 28223 at commit 2c402f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class Deserializer[T] extends (InternalRow => T) with Serializable
  • abstract class Serializer[T] extends (T => InternalRow) with Serializable

} catch {
case e: Exception =>
throw new RuntimeException(s"Error while encoding: $e\n" +
def createSerializer(): Serializer[T] = new Serializer[T] {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently relies on use serializing the enclosing encoder as well. We technically don't need the entire encoder but only a couple fields. I could move this class into the companion object and just use the fields I need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea that would be better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also thinking about if we just need to pass the original (de)serializer expressions and do the optimization inside Serializer and Deserializer lazily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some cost to the optimization. So I would like to do it only once.

@dongjoon-hyun
Copy link
Member

Thank you, @hvanhovell . Do we need a benchmark about this change?

@hvanhovell
Copy link
Contributor Author

@dongjoon-hyun it should be a bit of a lateral move performance wise. The expensive bit is generating the code and compiling it, and we are definitely not avoiding that. It might be a bit quicker because it does not excessively copy the expression encoder (which is not for free because of the work which is done in the constructor). What would you like to see benchmarked?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Apr 15, 2020

Got it. Thank you for confirmation, @hvanhovell . I was just wondering. Your comment is enough for me.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like in a good shape already.

} catch {
case e: Exception =>
throw new RuntimeException(s"Error while decoding: $e\n" +
s"${deserializer.simpleString(SQLConf.get.maxToStringFields)}", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As pointed already, some fields like deserializer are in enclosing encoder, and so currently looks like we will serialize entire encoder? Actually we did serialize entire encoder currently but yea it is better we can get rid of unnecessary.


private def initialize(): Unit = {
inputRow = new GenericInternalRow(1)
extractProjection = GenerateUnsafeProjection.generate(optimizedSerializer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't this two be lazy val? performance concerns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, lazy vals are not free and the extractProjection is on the hot path.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Apr 16, 2020

cc also @JoshRosen and @zsxwing from #26076

Comment on lines 60 to 63
* == Implementation ==
* - Encoders are not required to be thread-safe and thus they do not need to use locks to guard
* against concurrent access if they reuse internal buffers to improve performance.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this means Encoders must be thread-safe? Do we need explicit comment for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be thread-safe.

@srowen
Copy link
Member

srowen commented Apr 16, 2020

I feel like @JoshRosen looked at this too a while ago

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121361 has finished for PR 28223 at commit 5bcff74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121364 has finished for PR 28223 at commit bef4d8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 17, 2020

Test build #121376 has finished for PR 28223 at commit ced396f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Thank you, @hvanhovell and all.
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Apr 17, 2020
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects(`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread safe, however they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).

### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28223 from hvanhovell/SPARK-31450.

Authored-by: herman <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit fab4ca5)
Signed-off-by: Dongjoon Hyun <[email protected]>
@HeartSaVioR
Copy link
Contributor

Late LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants