[SPARK-19634][ML] Multivariate summarizer - dataframes API #18798
Conversation
WeichenXu123 left a comment:
Made a pass by myself.
| ("numNonZeros", NumNonZeros, arrayLType, Seq(ComputeNNZ)), | ||
| ("max", Max, arrayDType, Seq(ComputeMax, ComputeNNZ)), | ||
| ("min", Min, arrayDType, Seq(ComputeMin, ComputeNNZ)), | ||
| ("normL2", NormL2, arrayDType, Seq(ComputeM2)), |
Note that the Max/Min computation depends on ComputeNNZ because we use an optimization in SummarizerBuffer.update that only updates non-zero elements, so the NNZ statistics are needed to compute the final Max/Min. This is similar to MultivariateOnlineSummarizer.
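For readers of this thread, a minimal sketch of that dependency (field names like currMax and nnz are illustrative, not the PR's exact code): because update() only visits non-zero entries, a dimension whose column contains zeros may never touch the running max, so the implicit zeros have to be folded back in at the end using the non-zero counts.

```scala
// Hypothetical helper: fold the implicit zeros back into the running max.
def finalMax(currMax: Array[Double], nnz: Array[Long], totalCnt: Long): Array[Double] = {
  val result = currMax.clone()
  var i = 0
  while (i < result.length) {
    // If some rows were skipped for dimension i (they were zero there),
    // the true max is at least 0.0; the symmetric rule applies to min.
    if (nnz(i) < totalCnt && result(i) < 0.0) result(i) = 0.0
    i += 1
  }
  result
}
```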
case object ComputeMin extends ComputeMetrics

class SummarizerBuffer(
    requestedMetrics: Seq[Metrics],
The SummarizerBuffer is similar to MultivariateOnlineSummarizer, but it has some new features (rough sketch below):
- It supports computing only a subset of the metrics, which saves buffer memory.
- It supports an optimized input interface that avoids a data copy.
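The first point works roughly like this sketch (class name and metric-to-buffer mapping are illustrative only, not the PR's code): only the arrays required by the requested metrics get allocated.

```scala
// Illustrative only: a buffer that allocates state based on the requested metrics.
class PartialBuffer(requested: Set[String], n: Int) extends Serializable {
  private val mean: Array[Double] =
    if (requested.exists(Set("mean", "variance"))) new Array[Double](n) else null
  private val m2: Array[Double] =
    if (requested.exists(Set("variance", "normL2"))) new Array[Double](n) else null
  // update()/merge() would only touch the arrays that are non-null.
}
```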
override def size: Int = _size
}
Here I do not use VectorUDT.deserialize but directly manipulate the UnsafeArrayData coming from the InternalRow (in a DataFrame it is actually an UnsafeRow), which avoids a data copy. cc @liancheng @cloud-fan
Please add this comment to the source code itself.
+1 @thunterdb
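For context on the approach discussed above, a hedged sketch (not the PR's exact code): the VectorUDT struct layout is (type, size, indices, values), and the values can be read straight out of the InternalRow instead of materializing a Vector.

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Sketch: visit the vector's active entries in place, without VectorUDT.deserialize.
def foreachActive(vectorStruct: InternalRow)(f: (Int, Double) => Unit): Unit = {
  val tpe = vectorStruct.getByte(0)       // 0 = sparse, 1 = dense
  val values = vectorStruct.getArray(3)   // UnsafeArrayData in practice
  if (tpe == 1) {
    var i = 0
    while (i < values.numElements()) { f(i, values.getDouble(i)); i += 1 }
  } else {
    val indices = vectorStruct.getArray(2)
    var i = 0
    while (i < indices.numElements()) { f(indices.getInt(i), values.getDouble(i)); i += 1 }
  }
}
```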
val ois = new ObjectInputStream(bis)
ois.readObject().asInstanceOf[SummarizerBuffer]
}
I will optimize serialize/deserialize with a ByteBuffer later, though it is not the bottleneck currently.
It seems the serialize/deserialize functions are only called once per partition, so I agree they are not the bottleneck.
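For reference, the ByteBuffer idea mentioned above would look roughly like the sketch below (a made-up, mean-only payload, not the PR's buffer layout), versus the current ObjectOutputStream/ObjectInputStream round trip:

```scala
import java.nio.ByteBuffer

// Hypothetical payload: total weight followed by the mean array.
def toBytes(weightSum: Double, mean: Array[Double]): Array[Byte] = {
  val buf = ByteBuffer.allocate(8 + 4 + mean.length * 8)
  buf.putDouble(weightSum)
  buf.putInt(mean.length)
  mean.foreach(buf.putDouble)
  buf.array()
}

def fromBytes(bytes: Array[Byte]): (Double, Array[Double]) = {
  val buf = ByteBuffer.wrap(bytes)
  val weightSum = buf.getDouble()
  val mean = Array.fill(buf.getInt())(buf.getDouble())
  (weightSum, mean)
}
```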
 */
@Since("2.2.0")
def summary(featuresCol: Column, weightCol: Column): Column
This supports the weightCol parameter in a convenient way.
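For example, a usage sketch of the weighted overload (the DataFrame `df`, the column names, and the chosen metrics are placeholders here):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// df is assumed to have a Vector column "features" and a Double column "weight".
val summaryCol = Summarizer.metrics("mean", "variance")
  .summary(col("features"), col("weight")).as("summary")

val Row(Row(mean: Vector, variance: Vector)) = df.select(summaryCol).first()
```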
def size: Int
}

private[this] val udt = new VectorUDT
Is there a better way to get a VectorUDT instance? cc @cloud-fan
// TODO: this test should not be committed. It is here to isolate some performance hotspots.
test("perf test") {
Performance test on vector sizes from 1 to 10000. (This part of the code should be removed before the PR is merged.)
Do not remove it, we will need it later. I think there is a way to tag the test as a performance harness, but I could not remember how. @liancheng , do you have some suggestions?
We can keep the performance test as ignore("performance test") and update the results when we make improvements. You can refer to what Spark SQL did.
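For anyone reading along, that is simply ScalaTest's way of keeping the harness compiled but skipped:

```scala
// Skipped in CI; flip `ignore` back to `test` to run the benchmark locally.
ignore("performance test") {
  // ... benchmark body ...
}
```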
@WeichenXu123 thanks! Can you post some performance numbers as well?

Test build #80126 has finished for PR 18798 at commit

Performance data attached. cc @thunterdb @jkbradley
Thank you for the performance numbers @WeichenXu123, I have a couple of comments:

If we trust these numbers, the overall conclusion is that the SQL interface currently adds a 2x-3x performance overhead over RDDs. @cloud-fan @liancheng are there still some low-hanging fruits that could be merged into SQL? This state of affairs is of course far from great, but I am in favor of merging this piece and improving it iteratively with the help of the SQL team, as this code is easy to benchmark and representative of the rest of MLlib once we start to rely more on DataFrames and Catalyst, and less on RDDs. @yanboliang @viirya @kiszk what are your thoughts?

cc @hvanhovell as well.
 * Users should not directly create such builders, but instead use one of the methods in
 * [[Summarizer]].
 */
@Since("2.2.0")
This is not going to be 2.2 anymore.
thunterdb left a comment:
@WeichenXu123 thanks a lot, I only have a few comments. Someone else should take a look, as I am the original author of this code.
/**
 * Add a new sample to this summarizer, and update the statistical summary.
 */
def addRaw(instance: TraversableIndexedSeq, weight: Double): this.type = {
please mention the type of the collection: TraversableIndexedSeq[_]
I am actually surprised you do not get a performance drop compared to using a vector or an array. Shouldn't it be private, too?
Here I make the TraversableIndexedSeq directly backed by the UnsafeRow from the DataFrame (it acts like a ByteBuffer in Java), which avoids copying the array data to generate a Vector.
@WeichenXu123 Could you add a comment that we operate on the raw data (UnsafeRow) directly to avoid copying values?
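A hedged sketch of the kind of view and comment being discussed (the trait name and helper are illustrative, not the PR's exact definitions):

```scala
import org.apache.spark.sql.catalyst.util.ArrayData

// A thin indexed view over the ArrayData stored inside the UnsafeRow.
trait IndexedDoubleView {
  def size: Int
  def apply(i: Int): Double
}

def viewOf(values: ArrayData): IndexedDoubleView = new IndexedDoubleView {
  // NOTE: we operate on the raw (Unsafe) data directly to avoid copying values.
  override def size: Int = values.numElements()
  override def apply(i: Int): Double = values.getDouble(i)
}
```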
// For test
def add(sample: Vector, weight: Double): this.type = {
  val v = new TraversableIndexedSeq {
why do you need to do that?
I suspect this is going to cause some boxing on the values, too.
Yes... this method is only used in the test suite (it makes writing tests more convenient).
If it's only used in the test suite, make the comment clearer.
  this
}

def addRaw(instance: TraversableIndexedSeq): this.type = addRaw(instance, 1.0)
private
I would like to ask whether we need to make the whole SummarizerBuffer private?
I'd like to make the whole SummarizerBuffer private, as users will not use it.
@WeichenXu123 @thunterdb
val features = udt.deserialize(featuresDatum)
Changed to use VectorUDT.deserialize, because the previous version of the code here did not improve performance markedly but increased code complexity.
Test build #80359 has finished for PR 18798 at commit

Test build #80363 has finished for PR 18798 at commit
thunterdb left a comment:
@WeichenXu123 thanks, this looks ready except for one comment about performance.
@yanboliang, can you merge this PR after this is addressed, if it looks good to you? Thank you in advance.
 * @return a builder.
 * @throws IllegalArgumentException if one of the metric names is not understood.
 */
@Since("2.3.0")
Let's put a comment about performance to indicate that it is about 3x slower than using the RDD interface.
 * val meanDF = dataframe.select(Summarizer.mean($"features"))
 * val Row(mean_) = meanDF.first()
 * }}}
 */
Put a comment about performance here.
 * val Row(mean_) = meanDF.first()
 * }}}
 *
 * Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD
I will track the performance gap here along with related SQL layer improvements in the future.
Test build #80403 has finished for PR 18798 at commit

Jenkins, test this please.

Test build #80407 has finished for PR 18798 at commit
@yanboliang do you feel comfortable merging this PR? I think that all the questions have been addressed.

@thunterdb I'm traveling these days; I will do a final pass and merge it next Monday/Tuesday. Thanks.

Sorry, can we make the performance data clearer? Currently it doesn't say what the unit of the numbers is.

@viirya Sure! Comment updated.
override def children: Seq[Expression] = featuresExpr :: weightExpr :: Nil

override def update(state: SummarizerBuffer, row: InternalRow): SummarizerBuffer = {
  // val features = udt.deserialize(featuresExpr.eval(row))
Remove this line?
@WeichenXu123 Thanks! Looks good.
// val features = udt.deserialize(featuresExpr.eval(row))
val featuresDatum = featuresExpr.eval(row).asInstanceOf[InternalRow]

val features = udt.deserialize(featuresDatum)
This deserialization of each Vector could be a bottleneck.
/**
 * Add a new sample to this summarizer, and update the statistical summary.
 */
def add(instance: Vector, weight: Double): this.type = {
From the usage of this Vector, I think we can work directly on the serialized Vector data (size, indices, values). It should cut much of the time spent deserializing the Vector.
@viirya I tried your suggestion in a previous version of the code, but it did not bring a performance advantage.
You can check my previous version (in the commit "optimize summarizer buffer") and run the tests.
Oh, I saw it. Thanks.
Working directly on the serialized data (UnsafeArrayData) only avoids the array copy (which saves little time), but it adds extra cost on each UnsafeArrayData.getDouble call and increases code complexity.
May I ask how the performance test runs? Especially for the RDD part.
@viirya Change ignore("performance test") to test("performance test"), then run the SummarizerSuite test.
Thanks. I didn't review the test thoroughly.
  new SummaryBuilderImpl(typedMetrics, computeMetrics)
}

def mean(col: Column): Column = getSingleMetric(col, "mean")
Add a @Since annotation.
 * The metrics that are currently implemented.
 */
sealed trait Metrics extends Serializable
case object Mean extends Metrics
Should we keep these case objects private?
override def children: Seq[Expression] = featuresExpr :: weightExpr :: Nil

override def update(state: SummarizerBuffer, row: InternalRow): SummarizerBuffer = {
  // val features = udt.deserialize(featuresExpr.eval(row))
Wouldn't the commented-out way be more succinct?
}

override def merge(state: SummarizerBuffer,
  other: SummarizerBuffer): SummarizerBuffer = {
Align
    throw new TestFailedException(Some(s"Failure with hint $hint"), Some(tfe), 1)
  }
}
/*
Remove this if it's not used.
// TODO: this test should not be committed. It is here to isolate some performance hotspots.
ignore("performance test") {
Could you paste your performance test results here? Just like here.
@WeichenXu123 I left some minor comments; otherwise, LGTM. Thanks.

@yanboliang I will update ASAP, thanks!
Test build #80668 has finished for PR 18798 at commit

Test build #80669 has finished for PR 18798 at commit

Test build #80671 has finished for PR 18798 at commit
yanboliang left a comment:
LGTM, I'll merge it tomorrow if there is no further discussion. Thanks, all.
Thank you @yanboliang.

Merged into master. Thanks, all.
What changes were proposed in this pull request?
This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.
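For illustration, the new API can be used along these lines (based on the doc example in this PR; the DataFrame `df` and the metric subset are placeholders):

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Request only the metrics you need; unneeded buffers are not computed.
val summaryDF = df.select(
  Summarizer.metrics("mean", "count", "max").summary(col("features")).as("summary"))

// Or use a single-metric shortcut:
val meanDF = df.select(Summarizer.mean(col("features")))
```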
How was this patch tested?
Test cases added.
Performance
This resolves several performance issues in #17419; further optimization is pending the SQL team's work. One of the SQL layer performance issues related to this feature has been resolved in #18712; thanks @liancheng and @cloud-fan.
Performance data
(Tested on my laptop with 2 partitions; trials = 20, warm-up = 10.)
The unit of the test results is records/millisecond (higher is better).