Skip to content

Conversation

@marmbrus
Copy link
Contributor

@marmbrus marmbrus commented Oct 8, 2015

This PR is a first cut at code generating an encoder that takes a Scala Product type and converts it directly into the tungsten binary format. This is done through the addition of a new set of expression that can be used to invoke methods on raw JVM objects, extracting fields and converting the result into the required format. These can then be used directly in an UnsafeProjection allowing us to leverage the existing encoding logic.

According to some simple benchmarks, this can significantly speed up conversion (~4x). However, replacing CatalystConverters is deferred to a later PR to keep this PR at a reasonable size.

case class SomeInts(a: Int, b: Int, c: Int, d: Int, e: Int)

val data = SomeInts(1, 2, 3, 4, 5)
val encoder = ProductEncoder[SomeInts]
val converter = CatalystTypeConverters.createToCatalystConverter(ScalaReflection.schemaFor[SomeInts].dataType)


(1 to 5).foreach {iter =>
  benchmark(s"converter $iter") {
    var i = 100000000
    while (i > 0) {
      val res = converter(data).asInstanceOf[InternalRow]
      assert(res.getInt(0) == 1)
      assert(res.getInt(1) == 2)
      i -= 1
    }
  }

  benchmark(s"encoder $iter") {
    var i = 100000000
    while (i > 0) {
      val res = encoder.toRow(data)
      assert(res.getInt(0) == 1)
      assert(res.getInt(1) == 2)
      i -= 1
    }
  }
}

Results:

[info] converter 1: 7170ms
[info] encoder 1: 1888ms
[info] converter 2: 6763ms
[info] encoder 2: 1824ms
[info] converter 3: 6912ms
[info] encoder 3: 1802ms
[info] converter 4: 7131ms
[info] encoder 4: 1798ms
[info] converter 5: 7350ms
[info] encoder 5: 1912ms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, probably too late to change this now :)

@SparkQA
Copy link

SparkQA commented Oct 8, 2015

Test build #43376 has finished for PR 9019 at commit 768055d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HasWeightCol(Params):
    • class IsotonicRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
    • class IsotonicRegressionModel(JavaModel):
    • trait Encoder[T]
    • case class ClassEncoder[T](
    • case class Average(child: Expression) extends DeclarativeAggregate
    • case class Count(child: Expression) extends DeclarativeAggregate
    • case class First(child: Expression) extends DeclarativeAggregate
    • case class Last(child: Expression) extends DeclarativeAggregate
    • case class Max(child: Expression) extends DeclarativeAggregate
    • case class Min(child: Expression) extends DeclarativeAggregate
    • abstract class StddevAgg(child: Expression) extends DeclarativeAggregate
    • case class Sum(child: Expression) extends DeclarativeAggregate
    • case class StaticInvoke(
    • case class Invoke(
    • case class NewInstance(
    • case class UnwrapOption(
    • case class LambdaVariable(value: String, isNull: String, dataType: DataType) extends Expression
    • case class MapObjects(
    • case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning
    • case class Coalesce(numPartitions: Int, child: SparkPlan) extends UnaryNode

@SparkQA
Copy link

SparkQA commented Oct 8, 2015

Test build #43378 has finished for PR 9019 at commit 89d35cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait Encoder[T]
    • case class ClassEncoder[T](
    • case class StaticInvoke(
    • case class Invoke(
    • case class NewInstance(
    • case class UnwrapOption(
    • case class LambdaVariable(value: String, isNull: String, dataType: DataType) extends Expression
    • case class MapObjects(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we use a LambdaVariable to let genFunction have a way to access every element of the input data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, this is how we link whatever code the inner expression generates into the loop.

@yhuai
Copy link
Contributor

yhuai commented Oct 8, 2015

Overall looks good! Left a few clarification questions.

@yhuai
Copy link
Contributor

yhuai commented Oct 8, 2015

Since this is the infrastructural work of encoder/decoder, let's merge this and use a follow-up pr to address the comments. So, other people can start to play with it.

@asfgit asfgit closed this in 9e66a53 Oct 8, 2015
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe extend LeafExpression?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah you remind me that we have Unevaluable, which is used for expressions that do not support code-gen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a LeafExpression as there are children (fixed in the followup). Its not Unevaluable, but instead only supports codegen.

@marmbrus marmbrus deleted the productEncoder branch March 8, 2016 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants