[SPARK-32896][SS] Add DataStreamWriter.saveAsTable API #29767
Conversation
checkAnswer(spark.table(tableIdentifier), Seq.empty)

withTempDir { checkpointDir =>
  val exc = intercept[AnalysisException] {
This is because a file-provider-based table is V1, which doesn't have the streaming write capability. I hope this is OK, rather than struggling to convert it into a Sink and making it work anyway.
This is a new API. I'm OK with not supporting streaming v1 sink at first.
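For readers following the thread: below is a minimal sketch (my own illustration, not the PR's exact code) of the kind of capability gate being discussed - only V2 tables that declare STREAMING_WRITE pass, which is why a V1 file-provider table is rejected.

import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability}

// Accept only V2 tables that declare the STREAMING_WRITE capability.
// V1 tables (e.g. file-provider based ones) don't expose this capability and fail here.
// The PR itself throws AnalysisException from inside the sql package; a plain
// UnsupportedOperationException keeps this sketch compilable anywhere.
def requireStreamingWritable(table: Table): SupportsWrite = table match {
  case t: SupportsWrite if t.capabilities().contains(TableCapability.STREAMING_WRITE) => t
  case other => throw new UnsupportedOperationException(
    s"Table ${other.name} doesn't support streaming write")
}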
Test build #128736 has finished for PR 29767 at commit
} else {
  extraOptions + ("path" -> path.get)
}
val queryName = extraOptions.get("queryName")
The huge diff comes from refactoring - I had to refactor because the statements for StreamingQueryManager.startQuery() were all duplicated and I was about to add one more duplication.
The actual change is only performed for source == SOURCE_NAME_TABLE.
Test build #128739 has finished for PR 29767 at commit
retest this, please

Test build #128751 has finished for PR 29767 at commit

retest this, please

Test build #128767 has finished for PR 29767 at commit
Force-pushed 2284a4d to 6444a1e.
case NonSessionCatalogAndIdentifier(catalog, ident) =>
  catalog.asTableCatalog.loadTable(ident)

case SessionCatalogAndIdentifier(catalog, ident) =>
shall we just use CatalogAndIdentifier?
Is it OK to skip the namespace length check in SessionCatalogAndIdentifier here?
It's OK. V2SessionCatalog.loadTable checks the namespace as well.
  }
}

class DataStreamWriterWithTableSuite extends StreamTest with BeforeAndAfter {
shall we move it to a new file?
  runTestWithStreamAppend(tableIdentifier)
} finally {
  spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key)
In the after block we clear out everything, so this is not needed.
Test build #128801 has finished for PR 29767 at commit
val tableInstance = df.sparkSession.sessionState.sqlParser
  .parseMultipartIdentifier(tableName) match {

  case NonSessionCatalogAndIdentifier(catalog, ident) =>
this is not needed anymore.
case CatalogAndIdentifier(catalog, ident) =>
  catalog.asTableCatalog.loadTable(ident)

case other =>
See DataFrameWriterV2.scala#L52; we can simply write:
val CatalogAndIdentifier(catalog, identifier) = ...parseMultipartIdentifier(tableName)
val table = catalog.asTableCatalog.loadTable(identifier)
xuanyuanking left a comment
Just some small comments.
import df.sparkSession.sessionState.analyzer.CatalogAndIdentifier

import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
val CatalogAndIdentifier(catalog, identifier) = df.sparkSession.sessionState.sqlParser
Is it possible to get a temp view here, and if so, what should the behavior be?
I just checked roughly, and it looks like a temporary view is not loaded by loadTable - it throws NoSuchTableException in V2SessionCatalog.
test("write to temporary view shouldn't be allowed") {
val tableIdentifier = "table_name"
val tempViewIdentifier = "temp_view"
spark.sql(s"CREATE TABLE $tableIdentifier (id bigint, data string) USING parquet")
checkAnswer(spark.table(tableIdentifier), Seq.empty)
spark.sql(s"SELECT id, data FROM $tableIdentifier").createOrReplaceTempView(tempViewIdentifier)
// spark.sql(s"CREATE TEMPORARY VIEW $tempViewIdentifier AS SELECT id, data FROM $tableIdentifier")
withTempDir { checkpointDir =>
val exc = intercept[AnalysisException] {
runStreamQueryAppendMode("default." + tempViewIdentifier, checkpointDir, Seq.empty, Seq.empty)
}
assert(exc.getMessage.contains("doesn't support streaming write"))
}
}
It fails with "Table default.temp_view not found;" did not contain "doesn't support streaming write".
I think this is the desired behavior, as it's a view. Even if it could load the (temp) view, its capabilities shouldn't include write-related flags.
I'm playing a bit more with views; unlike a temporary view, a regular view seems to be loaded via loadTable. Now checking its capabilities.
A temp view doesn't belong to any catalog; it belongs to a session. DataFrameWriter.insertInto can insert into a temp view as well (only if the temp view is a single data source scan node), and probably DataStreamWriter.table should support it as well.
Looks like it requires handling V1Table after loadTable (for views), as well as a pattern match with AsTableIdentifier(tableIdentifier) (for temporary views).
Either way, I see DataFrameWriter leverages UnresolvedRelation to defer resolution, but a streaming query doesn't add a writer node to the logical plan and passes the actual table instance (either SupportsWrite for V2 or Sink for V1) directly, so the situation looks a bit different. Probably another reason to add the writer node before analyzing?
(Btw, this is an interesting one to test even on a batch query. I'd probably test creating a temp view over a V2 table and trying to write to it. If that works for DataFrameWriter.insertInto, it's probably one thing DataFrameWriterV2 may not support as of now, as it doesn't fall back to the V1 path.)
For the temporary view case, this change makes the test work:
spark.table(tableIdentifier).createOrReplaceTempView(tempViewIdentifier)
// or
// spark.read.table(tableIdentifier).createOrReplaceTempView(tempViewIdentifier)
Seq((1, "a"), (2, "b"), (3, "c")).toDF().write.insertInto(tempViewIdentifier)
but I'm not sure about the coverage - it sounds to me like a temp view that is just an alias of a table is only supported for insertInto.
(only if the temp view is a single data source scan node)
As I mentioned before, the temp view must be very simple, like spark.table(name) or CREATE TEMP VIEW v USING parquet OPTIONS(...).
I believe there are tests, but I don't remember where they are. You can update ResolveRelations to drop the support for inserting into temp views and see which tests fail.
For this particular PR, I'm OK with not supporting temp views for now, as we need to refactor a little bit and have a logical plan for streaming write. But for consistency with other places that look up a table, we should still look up temp views, and just fail if a temp view is returned.
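As an aside, here is a rough, hedged sketch of the "look up temp views and fail" idea using only the public Catalog API; the PR itself uses internal analyzer helpers, and this simplification ignores the catalog-qualified name handling discussed further down the thread.

import scala.util.Try
import org.apache.spark.sql.SparkSession

// Returns true if the given (unqualified or db-qualified) name resolves to a temporary view.
// spark.catalog.getTable throws if the name cannot be found, hence the Try.
def resolvesToTempView(spark: SparkSession, name: String): Boolean =
  Try(spark.catalog.getTable(name)).toOption.exists(_.isTemporary)

// A caller could then fail fast before starting the streaming query:
// if (resolvesToTempView(spark, tableName)) {
//   throw new UnsupportedOperationException(s"Streaming into temp view $tableName is not supported")
// }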
Done in e7cd27d - now it looks up (global) temp views directly and provides a slightly clearer error message. Also added relevant tests.
That said, I can't find the fallback logic in DataFrameWriterV2. It simply looks the name up in the catalog, so a temp view will not be found. Do I understand correctly, and if so, is that the desired/expected behavior?
I think we should fix DataFrameWriterV2 as well, to fail if the table name refers to a temp view. cc @rdblue
Filed SPARK-32960 and submitted a PR (#29830). Please take a look and let me know whether it follows your suggestion properly or not. Thanks!
Test build #128804 has finished for PR 29767 at commit

Test build #128809 has finished for PR 29767 at commit

Test build #128814 has finished for PR 29767 at commit
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
val sink = tableInstance match {
  case t: SupportsWrite if t.supports(STREAMING_WRITE) => t
  case t => throw new AnalysisException("Table doesn't support streaming " +
s"Table $tableName doesn't support streaming "?
    useTempCheckpointLocation = true,
    trigger = trigger)
  startQuery(sink, extraOptions)
} else if (source == "foreachBatch") {
SOURCE_NAME_FOREACH_BATCH?
Ah, I missed this after switching to another refactoring approach. I'll fix it. Thanks!
  resultDf.createOrReplaceTempView(query.name)
  query
  startQuery(sink, extraOptions, Some(resultDf), recoverFromCheckpoint = recoverFromChkpoint)
} else if (source == "foreach") {
SOURCE_NAME_FOREACH?
  }

  startQuery(sink, extraOptions)
} else if (source == "memory") {
SOURCE_NAME_MEMORY?
override def overwrite(filters: Array[Filter]): WriteBuilder = {
  assert(writer == Append)
  writer = new Overwrite(filters)
  // streaming writer doesn't have equivalent semantic
Hmm, does this mean that in the streaming case we won't reach here?
Yes, at least for now (if I understand correctly). If we want to be sure, we could assign a dummy writer and throw an error when buildForStreaming() is called. That would probably be much clearer.
Just changed it to explicitly fail for this case.
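For context, a minimal sketch (an assumed shape, not the exact test code in the PR) of what "explicitly fail for the case" can look like: a WriteBuilder that represents an overwrite and refuses to build a streaming write.

import org.apache.spark.sql.connector.write.{BatchWrite, WriteBuilder}
import org.apache.spark.sql.connector.write.streaming.StreamingWrite

// Hypothetical builder for the overwrite path: batch writes go through as usual,
// while a streaming write request fails loudly instead of being silently ignored.
class OverwriteWriteBuilder(batch: BatchWrite) extends WriteBuilder {
  override def buildForBatch(): BatchWrite = batch
  override def buildForStreaming(): StreamingWrite =
    throw new UnsupportedOperationException(
      "Streaming write doesn't have an overwrite-by-filter equivalent")
}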
// Currently we don't create a logical streaming writer node in logical plan, so cannot rely
// on analyzer to resolve it. Directly lookup only for temp view to provide clearer message.
// TODO (SPARK-27484): we should add the writing node before the plan is analyzed.
if (isTempView(df.sparkSession, identifier.asMultipartIdentifier)) {
This is incorrect. The identifier is scoped to a specific catalog, e.g. for cat1.t1 the identifier is t1. cat1.t1 is not a temp view, but t1 might be.
We should check for a temp view using the original table name.
Please correct me if I'm missing something here. The reason I pass all parts of the identifier here is to cover global temp views, which use the global temp database. Dropping the db name (if it isn't the global temp db) is done in isTempView.
I pass all parts in identifier
But you did not... The catalog name is missing, so you may mistakenly treat a table as a temp view, e.g. cat1.t1 if t1 is the name of a temp view.
My bad, thanks for explaining. I see the failing case when a catalog "exists" for the head of the identifier; let me fix it immediately.
Test build #128855 has finished for PR 29767 at commit

Test build #128844 has finished for PR 29767 at commit
  }
}

test("write: write to temporary view isn't allowed yet") {
Thank you for adding this explicitly.
    outputMode,
    useTempCheckpointLocation = true,
    trigger = trigger)
  startQuery(sink, extraOptions)
@HeartSaVioR, for CaseInsensitiveMap, def toMap: Map[String, T] = originalMap. It seems we need to call toMap explicitly here, as we did on line 385. (cc @cloud-fan)
startQuery(sink, optionsWithPath.originalMap)
Previously, @cloud-fan and I hit case-sensitivity issues in other JIRAs due to this. Please make sure this PR doesn't re-introduce them, because the PR as-is silently switches extraOptions.toMap -> extraOptions.
If you already checked that, please add a test case for it. Otherwise, we can just use the old way, extraOptions.toMap, to avoid any side effects.
Ah OK, thanks for pointing that out. Nice find. I'll just call .toMap explicitly, as it was before.
nice catch @dongjoon-hyun !
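To illustrate the concern above (the option key and value here are made up): CaseInsensitiveMap resolves keys case-insensitively, while toMap/originalMap hands back the user's original, case-sensitive keys - so the two are not interchangeable.

import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

val opts = CaseInsensitiveMap(Map("checkpointLocation" -> "/tmp/cp"))

// Lookups against the CaseInsensitiveMap itself ignore case:
assert(opts.get("CHECKPOINTLOCATION").contains("/tmp/cp"))

// toMap returns originalMap, which only knows the original spelling of the key:
assert(opts.toMap.get("checkpointLocation").contains("/tmp/cp"))
assert(opts.toMap.get("CHECKPOINTLOCATION").isEmpty)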
Thanks for reviewing. Addressed review comments. Please take a look again.
    newOptions: CaseInsensitiveMap[String],
    recoverFromCheckpoint: Boolean = true): StreamingQuery = {
  val options = newOptions.originalMap
  val queryName = options.get("queryName")
Previously it was extraOptions.get("queryName"); we should follow that and get the queryName option case-insensitively.
OK, I'll keep it as it was.
    recoverFromCheckpoint: Boolean = true): StreamingQuery = {
  val options = newOptions.originalMap
  val queryName = options.get("queryName")
  val checkpointLocation = options.get("checkpointLocation")
ditto
val checkpointLocation = options.get("checkpointLocation")
val useTempCheckpointLocation = SOURCES_ALLOW_ONE_TIME_QUERY.contains(source)

df.sparkSession.sessionState.streamingQueryManager.startQuery(
We can follow the previous code style:
...startQuery(
  newOptions.get("queryName"),
  newOptions.get("checkpointLocation"),
  df,
  newOptions.originalMap,
  ...
OK, let me keep it as it was.
Kubernetes integration test starting

Test build #129544 has finished for PR 29767 at commit

Test build #129547 has finished for PR 29767 at commit

retest this, please

Kubernetes integration test status success

Kubernetes integration test starting

Kubernetes integration test status success

Kubernetes integration test starting

Kubernetes integration test status success

Test build #129548 has finished for PR 29767 at commit
dongjoon-hyun left a comment
+1, LGTM. Thank you so much, @HeartSaVioR and @cloud-fan .
Merged to master for Apache Spark 3.1.0.
Thanks all for reviewing and merging!

We need to update the PR title; it's saveAsTable, not table.

You're right. Would you like to go with a revert and another PR, or is it just for information? Either is fine for me.

Just updating the title is good enough; it's only about the commit message.

Ah OK. Thanks for the guidance. I've updated the PR title and the description as well, as the usage is a bit different from before.

Sorry for missing that.
What changes were proposed in this pull request?
This PR proposes to add DataStreamWriter.saveAsTable to specify the output "table" to write to from the streaming query.
Why are the changes needed?
For now, there's no way to write to a table (especially a catalog table) even when the table is capable of handling streaming writes, so even with Spark 3, writing to a catalog table via Structured Streaming has to go through DataStreamWriter.format(provider) and hope the provider handles it the same way as the catalog table.
With the new API, we can directly point to a catalog table which supports streaming write. Some usages are covered by tests - simply put, end users can do something like the following:
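(The original usage example did not survive extraction; below is a rough sketch. The source, checkpoint path, and table name are placeholders, and the target table is assumed to live in a catalog whose tables support streaming writes.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-to-table").getOrCreate()

val query = spark.readStream
  .format("rate")                                     // any streaming source
  .load()
  .writeStream
  .option("checkpointLocation", "/tmp/checkpoint")    // placeholder path
  .saveAsTable("my_catalog.default.output_table")     // table assumed to support streaming writes

query.awaitTermination()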
Does this PR introduce any user-facing change?
Yes, as this adds a new public API in DataStreamWriter. It doesn't introduce any backward-incompatible change.
How was this patch tested?
New unit tests.