[WIP][SPARK-32847][SS] Add DataStreamWriterV2 API #29715
Conversation
cc. @rdblue @cloud-fan @brkyvz Let me leave some background about the PR in its current state... I focused on adding an API similar to DataFrameWriterV2. So the PR in its current state doesn't touch the logical plan and only addresses the API surface, so that the necessary changes to the logical plan can be discussed and handled later without changing the API introduced here (hopefully). If we want this PR to include the changes to the logical plan, I may need some more time to look into the details of the logical plan and propose an approach (or even better if we've already discussed something).
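To make the "API surface only" point concrete, here is a minimal usage sketch of the kind of fluent builder being proposed. Everything here is an illustrative assumption, not settled API: writeStreamTo, the builder it returns, and the end-of-chain method name are all stand-ins, and df is assumed to be a streaming DataFrame.

```scala
import org.apache.spark.sql.streaming.Trigger

// Hypothetical sketch only: mirrors the batch-side df.writeTo(...) fluent style.
// `writeStreamTo` does not exist; it stands in for whatever entry point is agreed on.
val query = df.writeStreamTo("catalog.db.events")
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .truncateAndAppend() // end-of-chain method starts the StreamingQuery

query.awaitTermination()
```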
@throws[NoSuchTableException]
@throws[TimeoutException]
def truncateAndAppend(): StreamingQuery = {
Going by the name of the output mode, this method should be named "complete", but I feel that's a bit weird, so I picked a name closer to the actual behavior.
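For reference, the correspondence between end-of-chain method names under discussion and today's output modes might look like the following sketch. Only truncateAndAppend appears in the diff above; the other method names and the startQuery helper are assumptions for illustration.

```scala
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

// Sketch only: `startQuery` is a hypothetical helper standing in for the real
// plumbing that resolves the target table and starts the StreamingQuery.
trait EndOfChainMethods {
  protected def startQuery(mode: OutputMode): StreamingQuery

  def append(): StreamingQuery            = startQuery(OutputMode.Append())   // emit new rows only
  def truncateAndAppend(): StreamingQuery = startQuery(OutputMode.Complete()) // truncate, then rewrite the full result
  def update(): StreamingQuery            = startQuery(OutputMode.Update())   // emit rows changed since the last trigger
}
```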
Thanks for the input. My initial goal was to enable reading a catalog table in an SS query, so I didn't touch the other parts of DataStreamWriter. I borrowed the concept of representing the "save mode" as the final method in the chain, but I'm also OK with keeping the existing approach if we'd prefer that. I think there are some points to consider while designing:
We already know about the "update as append" case (the output mode for the result table is update, but the sink does an append) in DSv2, but in reality most built-in sinks do an append for any mode (even complete mode), simply because that's what we did in Spark 2.x. DSv1 is even more problematic: the interface is designed to only append, yet there's no restriction on the output mode for a DSv1 sink. I think we won't support DSv1 in DataStreamWriterV2, but the mismatch still remains in DSv2. Do we want to keep the mismatch forever, or fix it at least in DSv2? (Kafka is one example: the Kafka sink shouldn't allow update and complete mode. I think we made the right fix, but compatibility got in the way. See the sketch after this list for how this plays out at the sink level.)
Given the current status of SS development, I don't think continuous mode will leverage the output mode in the near future. (That is, output mode isn't needed there.) I'm not sure whether that will remain valid; if it does, we may be able to split the builders for micro-batch and continuous mode and remove output mode from the continuous one. (TBH, I wonder whether continuous mode is even used in production: the mode was introduced in Spark 2.3, no one has pushed to graduate it from experimental, and no contributor has been maintaining it. Is that something we might consider retiring to reduce complexity?)
Without clear answers to these considerations, it would be hard to construct a good API.
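As one concrete illustration of the mode/sink mismatch above: in DSv2, complete mode is effectively "truncate, then append the whole result table", and a sink only participates in the truncation by opting into SupportsTruncate. A simplified sketch, with MySinkWriteBuilder and the stubbed MyStreamingWrite as stand-ins:

```scala
import org.apache.spark.sql.connector.write.{PhysicalWriteInfo, SupportsTruncate, WriteBuilder, WriterCommitMessage}
import org.apache.spark.sql.connector.write.streaming.{StreamingDataWriterFactory, StreamingWrite}

// Sketch: a DSv2 sink that opts into truncation. A sink that does NOT implement
// SupportsTruncate degrades to plain appends for every mode, which is the
// mode/sink mismatch described in the list above.
class MySinkWriteBuilder extends WriteBuilder with SupportsTruncate {
  private var truncateFirst = false

  // Spark calls this when the query runs in Complete mode.
  override def truncate(): WriteBuilder = {
    truncateFirst = true
    this
  }

  override def buildForStreaming(): StreamingWrite = new MyStreamingWrite(truncateFirst)
}

// Stub implementation, elided for brevity.
class MyStreamingWrite(truncateFirst: Boolean) extends StreamingWrite {
  override def createStreamingWriterFactory(info: PhysicalWriteInfo): StreamingDataWriterFactory = ???
  override def commit(epochId: Long, messages: Array[WriterCommitMessage]): Unit = ???
  override def abort(epochId: Long, messages: Array[WriterCommitMessage]): Unit = ???
}
```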
So this new API isn't only about catalog support? I think we should at least put something in the PR description to explain what the problems of DataStreamWriter are.
No. That's a side improvement which can be dropped, not a major goal. As I commented, fixing the problems of DataStreamWriter isn't the purpose of introducing DataStreamWriterV2. It's rather about providing a symmetric user experience between batch and streaming: with DataFrameWriterV2, end users can run a batch query against a catalog table on the writer side, whereas a streaming query has nothing to enable this. The problems I described in the previous comment are simply problems in Structured Streaming itself; let me explain them at the end of this comment, as they may be off topic.

I see that DataFrameWriterV2 integrates lots of other benefits (a more fluent API, a logical plan on the write node, etc.) which would be great to have in DataStreamWriterV2, but I don't think they're the key part of *WriterV2. Supporting catalog tables is simply the major reason to have it.

Regarding the problems in Structured Streaming: I kicked the incomplete state support for continuous mode out of Structured Streaming, but I have basic concerns about "continuous mode" itself, as it rather applies hacks to work around an architectural limitation. (Plus, no one in the community cares about it.) And as I initiated in an earlier discussion (and as has been commented in various PRs), I think complete mode should be kicked out as well. The mode addresses some limited cases but is treated as one of the valid modes, which adds much complexity: some operations that basically shouldn't be supported in a streaming query are supported under complete mode, and vice versa, because the mode doesn't fit naturally. It's useful for now only because Spark doesn't support a true update mode on the sink; once Spark supports update mode on the sink, the content in external storage would be equivalent to what complete mode provides, without having to dump all of the outputs. (Or it's just because of a missing feature: queryable state.) We could probably simulate complete mode via a special stateful operator which only works with update mode.

Specific to micro-batch, supporting DSv1 is also a major headache: lots of the pattern matching in MicroBatchExecution exists to support DSv1, and there are even workarounds applied for DSv1 (e.g. #29700). I remember the answer in the discussion thread that DSv1 for streaming data sources is not exposed as public API, which is great news, but I see no action/plan to get rid of it. Is there some functionality possible in DSv1 that DSv2 cannot cover? If so, why not prioritize addressing that problem?
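For reference, the batch-side experience being mirrored is the existing DataFrameWriterV2 API (Spark 3.0+); only the table name below is illustrative:

```scala
import org.apache.spark.sql.DataFrame

// Batch side (existing DataFrameWriterV2 API): end users target a catalog table
// directly and pick the write behavior via the end-of-chain method.
def batchWrite(df: DataFrame): Unit =
  df.writeTo("catalog.db.target").append()

// Streaming side today: DataStreamWriter resolves a format and a path, not a
// catalog table, which is the asymmetry this PR aims to close.
```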
No, the major reason is to get rid of SaveMode.
Oh OK. Thanks for the input. That's a good point, and I agree that since DataFrameWriter can work with catalog tables, DataStreamWriter also can. (Though I still feel the clear benefit of DataFrameWriterV2 is that it "only" needs to deal with DSv2.) Let me try to deal with that first, and revisit if I can think of better ideas for improving the UX of DataStreamWriterV2.
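Under that "extend the existing writer" direction, a sketch of what catalog-table support on DataStreamWriter itself could look like. The toTable-style method is an assumption here, not something settled in this thread, and the table and checkpoint names are illustrative:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Sketch: same builder chain as today's DataStreamWriter, but ending at a
// catalog table instead of format + path. `toTable` is an assumed method name.
def startToCatalogTable(df: DataFrame): StreamingQuery =
  df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/target")
    .trigger(Trigger.ProcessingTime("1 minute"))
    .toTable("catalog.db.target")
```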
NOTE: This covers the proposal to add a DataStreamWriterV2 API, which hasn't been discussed yet. I expect this PR to go through some discussion/review, and those phases may change the PR significantly, so I'm marking it as "WIP" for now. Once we agree on the direction, I'll change the state of the PR and fill in the PR description.
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?