Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

This pr aims to add a new table API in DataStreamReader, which is similar to the table API in DataFrameReader.

Why are the changes needed?

Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:

Application 1 for initializing and starting the streaming job:

val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"

// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)

// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream
  .format("parquet")
  .option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck")
  .start("/home/yuanjian.li/runtime/to_be_deleted_2")

Application 2 for appending new data:

// Append new data into the path
spark.range(5).write.format("parquet")
  .option("path", "/home/yuanjian.li/runtime/to_be_deleted")
  .mode("append").save()

Check result:

// The destination directory should contain all the written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()

Does this PR introduce any user-facing change?

Yes, a new API added.

How was this patch tested?

New UT added and integrated testing.

case NonSessionCatalogAndIdentifier(catalog, ident) =>
  CatalogV2Util.loadTable(catalog, ident) match {
    case Some(table) =>
      Some(StreamingRelationV2(
Member Author

With the refactoring of StreamingRelationV2 (#29633), we can directly create it in catalyst.

Contributor

Shall we just add an isStreaming flag in lookupV2Relation, to unify the code a bit more?

Member Author

Sure, done in 305c316
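The unification discussed in this thread might be sketched, very loosely and outside Spark's actual catalyst code, as a single lookup helper parameterized by an isStreaming flag. All names below are illustrative stand-ins, not Spark's real classes:

```scala
// Illustrative stand-ins, not Spark's actual relation classes.
sealed trait Relation
case class DataSourceV2Relation(table: String) extends Relation
case class StreamingRelationV2(table: String) extends Relation

object Lookup {
  // Stand-in for CatalogV2Util.loadTable: resolve an identifier to a table.
  private def loadTable(ident: String): Option[String] =
    if (ident.nonEmpty) Some(ident) else None

  // One lookup path for both batch and streaming reads: the flag decides
  // which relation wraps the resolved table.
  def lookupV2Relation(ident: String, isStreaming: Boolean): Option[Relation] =
    loadTable(ident).map { t =>
      if (isStreaming) StreamingRelationV2(t) else DataSourceV2Relation(t)
    }
}
```

The point of the suggestion is that batch and streaming resolution share the catalog lookup and differ only in the wrapper, instead of duplicating the match logic in two places.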

private def getStreamingRelation(
    table: CatalogTable,
    extraOptions: CaseInsensitiveStringMap): StreamingRelation = {
  val dsOptions = DataSourceUtils.generateDatasourceOptions(extraOptions, table)
Member Author

Keep the same behavior as DataFrameReader.table on respecting options (#29712).

@SparkQA

SparkQA commented Sep 15, 2020

Test build #128699 has finished for PR 29756 at commit 43a371d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2020

Test build #128701 has finished for PR 29756 at commit 97c3e0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

@HeartSaVioR
Contributor

Thanks for the patch. I was also about to find the missing spot (as I commented in #29715) and this PR looks to fulfill the need. I'll take a look soon.

@HeartSaVioR (Contributor) left a comment

I'll review once there are enough tests to cover the functionality, as that would save plenty of time on reviewing.

@HeartSaVioR
Contributor

Btw, to expand the tests you may also want to have the changes on InMemoryTable. Would it be good for you to go through my PR first and rebase, or should I extract that part into a separate PR?

child.collect {
  // Disallow creating permanent views based on temporary views.
- case UnresolvedRelation(nameParts, _) if catalog.isTempView(nameParts) =>
+ case UnresolvedRelation(nameParts, _, _) if catalog.isTempView(nameParts) =>
Member

isStreaming = false only?

Member Author

We can create a temp view based on a streaming relation, so it should be kept as a full match?

Contributor

+1

Comment on lines +480 to +482
* Define a Streaming DataFrame on a Table. The DataSource corresponding to the table should
* support streaming mode.
Member

If the data source doesn't support, what will happen?

Member Author

MicroBatchExecution/ContinuousExecution will do the check and fail the query if the data source doesn't support streaming.
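As a rough, self-contained illustration of deferring that check to query start time (the names below are stand-ins, not Spark's actual MicroBatchExecution/ContinuousExecution code):

```scala
// Stand-in for a resolved table and its declared streaming capability.
case class ResolvedTable(name: String, supportsMicroBatchRead: Boolean)

object StartStream {
  // The streaming relation can be created eagerly from the table API; only
  // when the query starts does the execution verify that the underlying
  // source actually supports streaming reads, failing the query otherwise.
  def start(table: ResolvedTable): Either[String, String] =
    if (table.supportsMicroBatchRead) Right(s"query started on ${table.name}")
    else Left(s"Data source for table ${table.name} does not support streamed reading")
}
```

This mirrors the division of responsibility described above: resolution succeeds regardless, and the capability error surfaces at start().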

}
}

class DataStreamTableAPISuite extends StreamTest with BeforeAndAfter {
@zsxwing (Member) Sep 17, 2020

nit: could you move this to a new file when you add more tests?

Member Author

Sure, adding more tests now.

@xuanyuanking
Member Author

Would it be good for you to go through my PR first and rebase, or should I extract that part into a separate PR?

@HeartSaVioR Sure, let's go through the writer's side first. I think it's ok to rebase or resolve conflicts.

def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
- case u @ UnresolvedRelation(ident, _) =>
+ case u @ UnresolvedRelation(ident, _, _) =>
    lookupTempView(ident).getOrElse(u)
Contributor

shall we fail if the temp view is not a streaming plan but the isStreaming flag is true?

Contributor

Probably vice versa.

Contributor

I'm not sure about the other way around. We have SparkSession.table, which can read both batch or streaming temp views. We shouldn't break it.

It's a bit weird that DataFrameReader.table can read streaming temp views, but this is the existing behavior and probably fine.

@HeartSaVioR (Contributor) Sep 17, 2020

Just curious, is it the end user's responsibility to know whether the temp view comes from batch or streaming, so that they can correctly call write or writeStream? Without thinking of SparkSession.table I'd assume it's clear, since the end user matches the reader side and writer side (read/write or readStream/writeStream), but it looks a bit confusing.

If DataFrameReader.table allows streaming temp view, then I guess read/writeStream pair is possible which is a bit confusing. (or does it change the plan to the batch one magically?)

Contributor

I think it's less confusing if DataFrameReader.table fails on reading streaming temp view while SparkSession.table works. But it's a breaking change and we at least shouldn't do it in this PR.

Contributor

OK, I agree it would be better to fix in a different PR. I still think we shouldn't keep supporting confusing things just because we supported them before. The fix would depend on this PR's change (the isStreaming flag), so I'll wait for this PR and try to fix it afterwards.
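The check debated in this thread, failing when a streaming read resolves a batch temp view but not the other way around, could be sketched in isolation like this (Plan is an illustrative stand-in for Spark's LogicalPlan, not the real type):

```scala
// Illustrative stand-in for a logical plan carrying an isStreaming property.
case class Plan(name: String, isStreaming: Boolean)

object TempViewLookup {
  // Fail when a streaming read (wantStreaming = true) resolves a batch temp
  // view; the reverse direction stays permissive, matching the existing
  // behavior of SparkSession.table described in this thread.
  def resolve(view: Plan, wantStreaming: Boolean): Plan = {
    if (wantStreaming && !view.isStreaming) {
      throw new IllegalArgumentException(
        s"Temp view ${view.name} is not a streaming plan")
    }
    view
  }
}
```

Under this sketch, readStream.table on a batch temp view fails fast at resolution, while batch reads of streaming temp views are left as-is for a follow-up PR.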

)
}

test("stream table API support") {
Contributor

can we move DataStreamTableAPISuite to a new file and move this test to there as well?

Member Author

Yes, done in 9004fba

@SparkQA

SparkQA commented Sep 17, 2020

Test build #128820 has finished for PR 29756 at commit 305c316.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 21, 2020

Test build #128931 has finished for PR 29756 at commit 9004fba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 23, 2020

Test build #129019 has finished for PR 29756 at commit 03a91f8.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • trait V2TableWithV1Fallback extends Table

@SparkQA

SparkQA commented Sep 23, 2020

Test build #129020 has finished for PR 29756 at commit fad1976.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait V2TableWithV1Fallback extends Table

@cloud-fan (Contributor) left a comment

LGTM except some minor comments

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129062 has finished for PR 29756 at commit 97761d2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait V2TableWithV1Fallback extends Table

@SparkQA

SparkQA commented Sep 24, 2020

Test build #129069 has finished for PR 29756 at commit d2eb23f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!
