[SPARK-19634][ML] Multivariate summarizer - dataframes API #17419
Conversation
Test build #75188 has finished for PR 17419 at commit
Test build #75187 has finished for PR 17419 at commit
Test build #75189 has finished for PR 17419 at commit
@sethah it would have been nice, but I do not think we should merge it this late into the release cycle.
I have added a small perf test to find the performance bottlenecks. Note that this test works on the worst case (vectors of size 1) from the perspective of overhead. Here are the numbers I currently get. I will profile the code to see if there are some obvious targets for optimization:
Test build #75285 has finished for PR 17419 at commit
The numbers vary quite a bit? It looks like the RDD version is about 10 times faster than the DataFrame version...
val rdd1 = sc.parallelize(1 to n).map { idx =>
  OldVectors.dense(idx.toDouble)
}
val trieouts = 10
I think that 10 iterations without warmup is too few for a performance measurement.
Can we use the Benchmark class, or add a warmup run as the Benchmark class does?
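A minimal sketch of the warmup idea in plain Scala (the helper name and structure below are illustrative, not Spark's actual Benchmark utility, which wraps the same idea plus reporting):

```scala
// Illustrative helper (not from the PR): time a body after a few warmup runs,
// so JIT compilation does not dominate the measured iterations.
def timeAfterWarmup(warmups: Int, iters: Int)(body: => Unit): Seq[Double] = {
  (0 until warmups).foreach(_ => body)   // warmup: let the JIT settle
  (0 until iters).map { _ =>
    val t0 = System.nanoTime()
    body
    (System.nanoTime() - t0) / 1e6       // elapsed time in milliseconds
  }
}
```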
I had not tried that class, thanks. It should stabilize the results.
Once the results are stabilized, I think it would be good to keep the benchmark wrapped in ignore("benchmark"), as other benchmarks do.
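For reference, a minimal ScalaTest sketch of that convention (the test body is a placeholder):

```scala
// An ignored test keeps the benchmark in the suite without running it in CI;
// change `ignore` back to `test` locally to run it.
ignore("benchmark") {
  // ... build the input data, run the summarizer, and print the timings ...
}
```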
private[ml]
object SummaryBuilderImpl extends Logging {

  def implementedMetrics: Seq[String] = allMetrics.map(_._1).sorted
lazy val should be enough.
  private def b(x: Array[Double]): Vector = Vectors.dense(x)

  private def l(x: Array[Long]): Vector = b(x.map(_.toDouble))
Same here; it would be better to give these helpers more descriptive names.
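For illustration, one possible renaming (the names below are hypothetical, not from the PR):

```scala
// Hypothetical, more descriptive names for the one-letter helpers.
private def doublesToVector(x: Array[Double]): Vector = Vectors.dense(x)
private def longsToVector(x: Array[Long]): Vector = doublesToVector(x.map(_.toDouble))
```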
    // All the fields that we compute on demand:
    // TODO: the most common case is dense vectors. In that case we should
    // directly use BLAS instructions instead of iterating through a scala iterator.
    v.foreachActive { (index, value) =>
The RDD-based Summarizer doesn't actually have a BLAS optimization either, so this may not be the reason for the performance gap.
Right, it does not. Note that the benchmark below works with vectors of size 1, so as to analyze the overhead of DataFrames vs. RDDs. I will add a more realistic benchmark later.
    override def update(buff: Buffer, row: InternalRow): Buffer = {
      // Unsafe rows do not play well with UDTs, it seems.
      // Directly call the deserializer.
      val v = udt.deserialize(row.getStruct(0, udt.sqlType.size))
This should hurt performance: when deserializing a vector, we copy the (indices and) values arrays out of the unsafe row.
I think
val v = udt.deserialize(row.getStruct(0, udt.sqlType.size))
has some problems.
We cannot directly call a getter method on the row parameter passed in, because the ordinal of the column we want depends on the underlying Catalyst plan. Instead, we should use:
val datum = child.eval(row)
val featureVector = udt.deserialize(datum)
to get the column value.
If we want to use a weight column when summarizing, I think we can define the UDAF as
summary(featureCol, weightCol)
and pass the weight column in through the constructor of MetricsAggregate.
Example code:
case class MetricsAggregate(
    requested: Seq[Metrics],
    featureExpr: Expression,  // feature column expr
    weightExpr: Expression,   // weight column expr
    mutableAggBufferOffset: Int,
    inputAggBufferOffset: Int) extends TypedImperativeAggregate[Buffer] {

  override def children: Seq[Expression] = featureExpr :: weightExpr :: Nil

  override def update(buff: Buffer, row: InternalRow): Buffer = {
    val featureVector = udt.deserialize(featureExpr.eval(row))
    val weight = weightExpr.eval(row)
    Buffer.updateInPlace(buff, featureVector, weight)
    buff
  }
  ....
}

def summary(featureCol: Column, weightCol: Column): Column = {
  val agg = MetricsAggregate(
    requestedMetrics,
    featureCol.expr,
    weightCol.expr,
    mutableAggBufferOffset = 0,
    inputAggBufferOffset = 0)
  new Column(AggregateExpression(agg, mode = Complete, isDistinct = false))
}

// handle the case where the user does not specify a weight column
def summary(featureCol: Column): Column = {
  summary(featureCol, lit(1.0))
}
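A hypothetical usage sketch of the API proposed above (the DataFrame `df` and its column names are assumed for illustration):

```scala
import org.apache.spark.sql.functions.col

// Assuming a DataFrame `df` with a vector column "features" and a double column "weight":
val summarized = df.select(summary(col("features"), col("weight")).as("summary"))

// Without a weight column, the overload defaulting to lit(1.0) applies:
val unweighted = df.select(summary(col("features")).as("summary"))
```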
+1 @WeichenXu123.
Dataframes = [2766.648008567718 ~ 5091.204527768661 ~ 5716.359795809639] records / milli
I looked a bit deeper into the performance aspect. Here are some quick insights:
That benchmark focuses on the overhead of Catalyst. I will do another benchmark with dense vectors to see how it fares in practice with more realistic data.
Test build #75406 has finished for PR 17419 at commit
  private case class MetricsAggregate(
      requested: Seq[Metrics],
      startBuffer: Buffer,
We should not pass around the startBuffer; instead, create an initial buffer in createAggregationBuffer.
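A minimal sketch of that suggestion (TypedImperativeAggregate does declare createAggregationBuffer; the factory name used below is assumed, not necessarily the PR's):

```scala
// Sketch only: `Buffer.fromMetrics` is an assumed factory name.
// The aggregate builds its own initial buffer instead of receiving one through
// the case class constructor.
override def createAggregationBuffer(): Buffer = Buffer.fromMetrics(requested)
```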
  }

  private def updateInPlaceDense(buffer: Buffer, v: DenseVector, w: Double): Unit = {
    val epsi = Double.MinPositiveValue
Is the purpose of using Breeze here to use BLAS to improve performance? In the Breeze implementation, operations between vectors do not use BLAS; Breeze uses cForRange instead.
cc @yanboliang
I think the intention here is to express the sequential operations conveniently and efficiently. AFAIK, cForRange is also very efficient. Thanks.
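For comparison, the dense update could also be written as a plain while loop with neither Breeze nor BLAS; the buffer layout below (separate mean and weightSum arrays) is assumed for illustration only:

```scala
import org.apache.spark.ml.linalg.DenseVector

// Sketch: incremental weighted mean update over a dense vector, using an explicit loop.
private def updateMeanInPlace(
    mean: Array[Double],
    weightSum: Array[Double],
    v: DenseVector,
    w: Double): Unit = {
  val values = v.values
  var i = 0
  while (i < values.length) {
    val newWeight = weightSum(i) + w
    // m' = m + (w / W') * (x - m)
    mean(i) += (w / newWeight) * (values(i) - mean(i))
    weightSum(i) = newWeight
    i += 1
  }
}
```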
      times_df ::= dt
      // scalastyle:off
      print("Dataframes", times_df)
      // scalastyle:on
The print statement should be moved out of the loop.
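That is, something along these lines (a sketch reusing the names from the snippet above; `runDataFrameBenchmark` is an assumed helper returning the elapsed time in milliseconds):

```scala
var times_df: List[Double] = Nil
(0 until trieouts).foreach { _ =>
  val dt = runDataFrameBenchmark()   // assumed helper, not from the PR
  times_df ::= dt
}
// scalastyle:off println
println(s"Dataframes: $times_df")    // printed once, after the loop
// scalastyle:on println
```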
As the DataFrame version is much slower than the RDD version (currently tested against vectors of size 1), this statement
@WeichenXu123 and I did some profiling. @cloud-fan tried to fix this issue in #15082, but that PR didn't work out due to some other concerns (I can't remember all the details now). @cloud-fan, any ideas about improving this?
The copy problem is fixed in #18483; I think we can remove this workaround.
…ash aggregate

## What changes were proposed in this pull request?

In #18483, we fixed the data copy bug when saving into `InternalRow`, and removed all workarounds for this bug in the aggregate code path. However, the object hash aggregate was missed; this PR fixes it. This patch is also a requirement for #17419, which shows that the DataFrame version is slower than the RDD version because of this issue.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <[email protected]>

Closes #18712 from cloud-fan/minor.
  }

  def structureForMetrics(metrics: Seq[Metrics]): StructType = {
    val dct = allMetrics.map { case (n, m, dt, _) => m -> (n, dt) }.toMap
m -> (n, dt) — this expression has a syntax problem.
The Scala compiler turns -> into the .-> method call and treats n, dt as two separate parameters, which causes a compilation error (after rebasing this PR onto master).
We can use
m -> (n -> dt), (m, (n, dt)), or m -> ((n, dt))
instead.
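Concretely, any of the suggested forms avoids the argument auto-tupling problem:

```scala
// Each yields the same Map without relying on argument auto-tupling.
val dct = allMetrics.map { case (n, m, dt, _) => m -> (n -> dt) }.toMap
// or: allMetrics.map { case (n, m, dt, _) => (m, (n, dt)) }.toMap
// or: allMetrics.map { case (n, m, dt, _) => m -> ((n, dt)) }.toMap
```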
I am going to close this PR, since this is being taken over by @WeichenXu123 in #18798.
## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.

## How was this patch tested?

Testcases added.

## Performance

Resolve several performance issues in apache#17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to these features has been resolved in apache#18712, thanks liancheng and cloud-fan.

### Performance data

(Tested on my laptop, using 2 partitions; tries out = 20, warm up = 10.)

The unit of the test results is records/millisecond (higher is better).

Vector size / records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53

Author: WeichenXu <[email protected]>

Closes apache#18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
What changes were proposed in this pull request?
This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics. This should resolve some performance issues related to computing unrequested metrics. Furthermore, it uses the BLAS API to the extent possible, so that the given code should be efficient for the dense case.
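For context, a usage sketch of the DataFrames summarizer as it eventually landed via #18798 (the exact API of this PR's draft may differ; metric and column names below are illustrative):

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Select only the requested metrics; the result is a single struct column.
val summary = df.select(
  Summarizer.metrics("mean", "variance")
    .summary(col("features"), col("weight"))
    .as("summary"))

// Shortcut for a single metric without a weight column.
val means = df.select(Summarizer.mean(col("features")))
```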
How was this patch tested?
This patch includes most of the tests of the RDD-based implementation. It compares results against the existing MultivariateOnlineSummarizer as well as adding more tests. This patch also includes some documentation for some low-level constructs such as TypedImperativeAggregate.

Performance
I have not run tests against the existing implementation. However, this patch uses the recommended low-level SQL APIs, so it should be interesting to compare both implementations in that respect.
Thanks to @hvanhovell and Cheng Liang for suggestions on SparkSQL.