-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18261][Structured Streaming] Add statistics to MemorySink for joining #15786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #68209 has finished for PR 15786 at commit
|
| memStream.crossJoin(memStream.withColumnRenamed("value", "value2")).as[(Int, Int)], | ||
| (1, 1), (1, 2), (2, 1), (2, 2)) | ||
|
|
||
| query.stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this in a try-finally block please?
| query.stop() | ||
| } | ||
|
|
||
| test("MemoryPlan statistics for joining") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would also be nicer to just implement a test on the MemoryPlan itself, without needing to start a stream, purely on statistics calculation
|
Test build #68221 has finished for PR 15786 at commit
|
|
comments addressed; @brkyvz would you take another look |
|
|
||
| private val sizePerRow = sink.schema.toAttributes.map(_.dataType.defaultSize).sum | ||
|
|
||
| override def statistics: Statistics = Statistics(sizePerRow * sink.allData.size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everywhere else, this is defined as a lazy val. Can you make it such that:
override lazy val statistics: Statistics = {
val sizePerRow = sink.schema.toAttributes.map(_.dataType.defaultSize).sum
Statistics(sizeInBytes = sizePerRow * sink.allData.size)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @brkyvz. I'm afraid it should not be lazy val as the statistics should vary from batch to batch -- sink.allData.sizeshould vary from batch to batch. sizePerRow, however, does not vary from batch to batch, thus it's private val as it is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah! you're right! good catch!
| } | ||
|
|
||
| test("MemoryPlan statistics") { | ||
| implicit val schema = new StructType().add(new StructField("value", IntegerType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! This will be much faster than a test where we actually join DataFrames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
happy to do this
|
@zsxwing could you also take a look |
|
LGTM - merging in master/branch-2.1. |
…joining ## What changes were proposed in this pull request? Right now, there is no way to join the output of a memory sink with any table: > UnsupportedOperationException: LeafNode MemoryPlan must implement statistics This patch adds statistics to MemorySink, making joining snapshots of memory streams with tables possible. ## How was this patch tested? Added a test case. Author: Liwei Lin <[email protected]> Closes #15786 from lw-lin/memory-sink-stat. (cherry picked from commit c1a0c66) Signed-off-by: Reynold Xin <[email protected]>
…joining ## What changes were proposed in this pull request? Right now, there is no way to join the output of a memory sink with any table: > UnsupportedOperationException: LeafNode MemoryPlan must implement statistics This patch adds statistics to MemorySink, making joining snapshots of memory streams with tables possible. ## How was this patch tested? Added a test case. Author: Liwei Lin <[email protected]> Closes apache#15786 from lw-lin/memory-sink-stat.
What changes were proposed in this pull request?
Right now, there is no way to join the output of a memory sink with any table:
This patch adds statistics to MemorySink, making joining snapshots of memory streams with tables possible.
How was this patch tested?
Added a test case.