
Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented Apr 10, 2018

What changes were proposed in this pull request?

This API change is inspired by the problems we hit when migrating streaming and file-based data sources to the data source v2 API.

For the streaming side, we need a variant of DataReaderFactory/DataWriterFactory (see an example). This brings a lot of trouble for scanning/writing optimized data formats like InternalRow, ColumnarBatch, etc.

These special scanning/writing interfaces are defined like

interface SupportsScanUnsafeRow {
  List<DataReaderFactory<UnsafeRow>> createUnsafeRowReaderFactories();
}

This can't work with ContinuousDataReaderFactory at all, or we have to do runtime type casts and make the variant extend DataReaderFactory/DataWriterFactory. We have the same problem on the write path too.

For the file-based data sources, we have a problem with code duplication. Let's take the ORC data source as an example. To support both unsafe row and columnar batch scans, we need something like

class OrcUnsafeRowDataReader extends DataReader[UnsafeRow] {
  ...
}

class OrcColumnarBatchDataReader extends DataReader[ColumnarBatch] {
  ...
}

class OrcUnsafeRowFactory(...) extends DataReaderFactory[UnsafeRow] {
  def createDataReader ...
}

class OrcColumnarBatchFactory(...) extends DataReaderFactory[ColumnarBatch] {
  def createDataReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  def createUnsafeRowFactories = ... // logic to prepare the parameters and create factories

  def createColumnarBatchFactories = ... // logic to prepare the parameters and create factories
}

You can see that we have duplicated logic for preparing parameters and defining the factories. After this change, we can simplify the code to

class OrcReaderFactory(...) extends DataReaderFactory {
  def createUnsafeRowReader ...

  def createColumnarBatchReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  def createReadFactories = ... // logic to prepare the parameters and create factories
} 

The proposed change is: remove the type parameter and embed the special scanning/writing format into the factory, e.g.

interface DataReaderFactory {
  DataFormat dataFormat();

  default DataReader<Row> createRowDataReader() {
    throw new IllegalStateException(
      "createRowDataReader must be implemented if the data format is ROW.");
  }

  default DataReader<UnsafeRow> createUnsafeRowDataReader() {
    throw new IllegalStateException(
      "createUnsafeRowDataReader must be implemented if the data format is UNSAFE_ROW.");
  }

  default DataReader<ColumnarBatch> createColumnarBatchDataReader() {
    throw new IllegalStateException(
      "createColumnarBatchDataReader must be implemented if the data format is COLUMNAR_BATCH.");
  }
}

A potential benefit of this change: it is now up to the factory to decide which data format to use (UnsafeRow, ColumnarBatch, etc.), which means different data partitions can be scanned with different formats. Some hybrid storage systems keep realtime data in row format and historical data in columnar format, and they fit the new API well.
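
For illustration, here is a rough sketch of how such a hybrid source could implement the proposed interface and pick the format per partition (this is not part of the PR; HybridPartition, RealtimeUnsafeRowReader and HistoryColumnarBatchReader are made-up names):

class HybridReaderFactory(partition: HybridPartition) extends DataReaderFactory {

  // Realtime partitions are scanned as unsafe rows, historical partitions as columnar batches.
  override def dataFormat(): DataFormat =
    if (partition.isRealtime) DataFormat.UNSAFE_ROW else DataFormat.COLUMNAR_BATCH

  // Only the method matching dataFormat() needs an override; the other create
  // methods keep the default throwing implementation from the interface.
  override def createUnsafeRowDataReader(): DataReader[UnsafeRow] =
    new RealtimeUnsafeRowReader(partition)

  override def createColumnarBatchDataReader(): DataReader[ColumnarBatch] =
    new HistoryColumnarBatchReader(partition)
}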

TODO:
apply this change to the write path (next PR)

How was this patch tested?

existing tests

@cloud-fan cloud-fan changed the title remove type parameter in DataReaderFactory [WIP][SPARK-23952] remove type parameter in DataReaderFactory Apr 10, 2018
@cloud-fan
Contributor Author

cc @jose-torres, do you know what's missing for this?

Contributor

I don't, because I'm not really sure how it works in the batch case. How does it work to do

new DataSourceRDD(sparkContext, batchReaderFactories).asInstanceOf[RDD[InternalRow]]

when the type parameter of batchReaderFactories doesn't match InternalRow?

Contributor Author

We use a type erasure hack and lie to the Scala compiler that we are outputting InternalRow. At runtime, the generated code casts the data back to ColumnarBatch.

Contributor

Then the missing piece is codegen. This is difficult because the continuous stream reader does a lot of auxiliary work, so I don't know if it will happen in the near future.

Contributor

I've thought about this further. Shouldn't it be trivial to write a wrapper that simply converts a DataReader[ColumnarBatch] to a DataReader[InternalRow]? If so then we can easily support it after the current PR.
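
Roughly like this, perhaps (just a sketch of the wrapper idea, not code from this PR; the class name is made up, and the imports assume the Spark 2.3 class locations):

import java.util.Collections
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.DataReader
import org.apache.spark.sql.vectorized.ColumnarBatch

// Flattens each ColumnarBatch into its rows so downstream code can consume
// the reader as a DataReader[InternalRow].
class ColumnarBatchToRowReader(batchReader: DataReader[ColumnarBatch])
  extends DataReader[InternalRow] {

  // Rows of the current batch; starts out empty.
  private var rows: java.util.Iterator[InternalRow] = Collections.emptyIterator[InternalRow]()

  override def next(): Boolean = {
    // Pull the next batch whenever the current one is exhausted.
    while (!rows.hasNext) {
      if (!batchReader.next()) {
        return false
      }
      rows = batchReader.get().rowIterator()
    }
    true
  }

  override def get(): InternalRow = rows.next()

  override def close(): Unit = batchReader.close()
}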

Contributor Author

@cloud-fan cloud-fan Apr 10, 2018

I have seen this pattern many times; the Java List is a little troublesome because it's invariant. Shall we change the interface to use an array?
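
For context, a minimal, self-contained illustration of the invariance problem with toy types (not Spark's): a java.util.List of a concrete factory subtype is not a java.util.List of the interface type, which is what forces casts like the ones in this PR.

object InvarianceDemo {
  import scala.collection.JavaConverters._

  trait ReaderFactory[T]
  class MyUnsafeRowFactory extends ReaderFactory[String]

  val factories: Seq[MyUnsafeRowFactory] = Seq(new MyUnsafeRowFactory)

  // Does not compile: java.util.List[MyUnsafeRowFactory] is not a
  // java.util.List[ReaderFactory[String]], since java.util.List is invariant.
  // val bad: java.util.List[ReaderFactory[String]] = factories.asJava

  // Compiles: up-cast each element before converting (or cast the whole list
  // with asInstanceOf, as the current code does).
  val ok: java.util.List[ReaderFactory[String]] =
    factories.map(f => f: ReaderFactory[String]).asJava
}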

Contributor

I'd like that, but I don't know if that would make things harder for data source implementers working in Java.

Contributor Author

Array is a Java-friendly type.

@SparkQA

SparkQA commented Apr 10, 2018

Test build #89133 has finished for PR 21029 at commit d44105d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

gengliangwang commented Apr 10, 2018

+1
From PR #20933, we can see that there is a lot of common code between DataReaderFactory[ColumnarBatch] and DataReaderFactory[UnsafeRow] with the current factory-method pattern.
This change makes data source implementations easier, and we don't need to do runtime type casts.

Member

@gengliangwang gengliangwang left a comment

In DataReader.java:

/**
 * A data reader returned by {@link DataReaderFactory#createDataReader()} and is responsible for
 * outputting data for a RDD partition.
 *
 * Note that, Currently the type `T` can only be {@link org.apache.spark.sql.Row} for normal data
 * source readers, or {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} for data source
 * readers that mix in {@link SupportsScanUnsafeRow}.
 */

The first and last @link need to be updated.

Contributor Author

cc @jose-torres, it seems this method is never used.

Contributor

Fine to remove this. We've deferred or reworked all of the things that were going to use this method; it makes sense to rethink how to provide this functionality after the rest is polished and stable-ish.

@cloud-fan cloud-fan changed the title [WIP][SPARK-23952] remove type parameter in DataReaderFactory [SPARK-23952] remove type parameter in DataReaderFactory Apr 16, 2018
@SparkQA

SparkQA commented Apr 16, 2018

Test build #89400 has finished for PR 21029 at commit dac302f.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 16, 2018

Test build #89401 has finished for PR 21029 at commit 556eef2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 17, 2018

Test build #89430 has finished for PR 21029 at commit 36031f2.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 17, 2018

Test build #89436 has finished for PR 21029 at commit b5c3b39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 17, 2018

Test build #89458 has finished for PR 21029 at commit c5071b4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 17, 2018

Test build #89461 has finished for PR 21029 at commit 1ae4b6d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 18, 2018

Test build #89482 has finished for PR 21029 at commit 18e391a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 18, 2018

Test build #89490 has finished for PR 21029 at commit 18e391a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 18, 2018

Test build #89505 has finished for PR 21029 at commit 18e391a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* <li>@{@link DataFormat#COLUMNAR_BATCH}: {@link #createColumnarBatchDataReader()}</li>
* </ul>
*/
DataFormat dataFormat();
Contributor

If the data format is determined when the factory is created, then I don't see why it is necessary to change the API. This just makes it more confusing.


package org.apache.spark.sql.sources.v2.reader.streaming;

import java.util.Optional;
Contributor

Nit: this is a cosmetic change that should be reverted before committing.

case DataFormat.COLUMNAR_BATCH =>
new DataReaderIterator(factory.createColumnarBatchDataReader())
// TODO: remove this type erase hack.
.asInstanceOf[DataReaderIterator[UnsafeRow]]
Contributor

Isn't this change intended to avoid these casts?

range, executorKafkaParams, pollTimeoutMs, failOnDataLoss, reuseKafkaConsumer)
}
factories.map(_.asInstanceOf[DataReaderFactory[UnsafeRow]]).asJava
factories.map(_.asInstanceOf[DataReaderFactory]).asJava
Contributor

Why is this cast necessary?

@cloud-fan cloud-fan closed this Jul 6, 2018