[SPARK-23952] remove type parameter in DataReaderFactory #21029
Conversation
cc @jose-torres, do you know what's missing for this?
I don't, because I'm not really sure how it works in the batch case. How does it work to do
new DataSourceRDD(sparkContext, batchReaderFactories).asInstanceOf[RDD[InternalRow]]
when the type parameter of batchReaderFactories doesn't match InternalRow?
We use a type-erasure hack and lie to the Scala compiler that we are outputting InternalRow. At runtime, the generated code casts the data back to ColumnarBatch.
Then the missing piece is codegen. This is difficult because the continuous stream reader does a lot of auxiliary work, so I don't know if it will happen in the near future.
I've thought about this further. Shouldn't it be trivial to write a wrapper that simply converts a DataReader[ColumnarBatch] to a DataReader[InternalRow]? If so then we can easily support it after the current PR.
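A minimal sketch of such a wrapper, assuming the 2.3-era `DataReader` and `ColumnarBatch` APIs; `ColumnarToRowDataReader` is a made-up name for illustration, not code from this PR:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.reader.DataReader;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Adapts a columnar-batch reader to a row-based DataReader by walking the
// rows of each batch in turn.
class ColumnarToRowDataReader implements DataReader<InternalRow> {
  private final DataReader<ColumnarBatch> batchReader;
  private Iterator<InternalRow> currentRows;

  ColumnarToRowDataReader(DataReader<ColumnarBatch> batchReader) {
    this.batchReader = batchReader;
  }

  @Override
  public boolean next() throws IOException {
    // Pull the next batch whenever the current one is exhausted.
    while (currentRows == null || !currentRows.hasNext()) {
      if (!batchReader.next()) {
        return false;
      }
      currentRows = batchReader.get().rowIterator();
    }
    return true;
  }

  @Override
  public InternalRow get() {
    return currentRows.next();
  }

  @Override
  public void close() throws IOException {
    batchReader.close();
  }
}
```

The obvious trade-off is that such a wrapper gives up the columnar fast path, but it would let continuous sources accept columnar factories without new API surface.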
I have seen this pattern many times; the Java List is a bit of trouble because of its invariance. Shall we change the interface to use an array?
I'd like that, but I don't know if that would make things harder for data source implementers working in Java.
Array is a Java-friendly type.
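A small Java illustration of the invariance point, using hypothetical stand-in types rather than the real DataReaderFactory:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for DataReaderFactory and a source-specific implementation.
interface Factory {}
class MyFactory implements Factory {}

class InvarianceExample {
  static void example() {
    List<MyFactory> mine = new ArrayList<>();
    // List<Factory> general = mine;              // does not compile: List<T> is invariant
    List<Factory> copied = new ArrayList<>(mine); // a copy (or an unchecked cast) is needed

    MyFactory[] mineArr = new MyFactory[0];
    Factory[] generalArr = mineArr;               // fine: Java arrays are covariant
  }
}
```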
Test build #89133 has finished for PR 21029 at commit
+1
In DataReader.java:
/**
* A data reader returned by {@link DataReaderFactory#createDataReader()} and is responsible for
* outputting data for a RDD partition.
*
* Note that, Currently the type `T` can only be {@link org.apache.spark.sql.Row} for normal data
* source readers, or {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow} for data source
* readers that mix in {@link SupportsScanUnsafeRow}.
*/
The first and last @link need to be updated.
cc @jose-torres, it seems this method is never used.
Fine to remove this. We've deferred or reworked all of the things that were going to use this method; it makes sense to rethink how to provide this functionality after the rest is polished and stable-ish.
Test build #89400 has finished for PR 21029 at commit
Test build #89401 has finished for PR 21029 at commit
Test build #89430 has finished for PR 21029 at commit
Test build #89436 has finished for PR 21029 at commit
Test build #89458 has finished for PR 21029 at commit
Test build #89461 has finished for PR 21029 at commit
Test build #89482 has finished for PR 21029 at commit
retest this please
Test build #89490 has finished for PR 21029 at commit
retest this please
Test build #89505 has finished for PR 21029 at commit
 * <li>@{@link DataFormat#COLUMNAR_BATCH}: {@link #createColumnarBatchDataReader()}</li>
 * </ul>
 */
DataFormat dataFormat();
If the data format is determined when the factory is created, then I don't see why it is necessary to change the API. This just makes it more confusing.
package org.apache.spark.sql.sources.v2.reader.streaming;

import java.util.Optional;
Nit: this is a cosmetic change that should be reverted before committing.
case DataFormat.COLUMNAR_BATCH =>
  new DataReaderIterator(factory.createColumnarBatchDataReader())
    // TODO: remove this type erase hack.
    .asInstanceOf[DataReaderIterator[UnsafeRow]]
Isn't this change intended to avoid these casts?
    range, executorKafkaParams, pollTimeoutMs, failOnDataLoss, reuseKafkaConsumer)
}
factories.map(_.asInstanceOf[DataReaderFactory[UnsafeRow]]).asJava
factories.map(_.asInstanceOf[DataReaderFactory]).asJava
Why is this cast necessary?
What changes were proposed in this pull request?
This API change is inspired by the problems we met when migrating streaming and file-based data sources to the data source v2 API.

For the streaming side, we need a variant of the DataReader/WriterFactory (see an example). This brings a lot of trouble for scanning/writing optimized data formats like InternalRow, ColumnarBatch, etc. These special scanning/writing interfaces are defined like the sketch below.
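As a rough, from-memory sketch of those mix-in interfaces (not the exact snippet from the PR description, and the method names may not be verbatim):

```java
import java.util.List;

import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.sources.v2.reader.DataReaderFactory;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// A reader that wants to produce UnsafeRow mixes this in...
interface SupportsScanUnsafeRow extends DataSourceReader {
  List<DataReaderFactory<UnsafeRow>> createUnsafeRowReaderFactories();
}

// ...and a reader that wants to produce ColumnarBatch mixes this one in instead,
// each with its own factory type parameter.
interface SupportsScanColumnarBatch extends DataSourceReader {
  List<DataReaderFactory<ColumnarBatch>> createBatchDataReaderFactories();
}
```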
This can't work with ContinuousDataReaderFactory at all, or we have to do a runtime type cast and make the variant extend DataReader/WriterFactory. We have the same problem on the write path too.

For the file-based data source side, we have a problem with code duplication. Let's take the ORC data source as an example. To support both unsafe-row and columnar-batch scans, we need something like two nearly identical factory-creation methods, one per output type.
You can see that we have duplicated logic for preparing parameters and defining the factory. After this change, we can simplify the code to a single factory-creation method.

The proposed change is: remove the type parameter and embed the special scanning/writing format into the factory, e.g. as sketched below.
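A rough sketch of that shape, pieced together from the diff fragments quoted in the review above; only dataFormat() and createColumnarBatchDataReader() appear there directly, so the other method names and the enum values are assumptions:

```java
import java.io.Serializable;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.sources.v2.reader.DataReader;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// The factory is no longer parameterized; it declares its output format instead.
enum DataFormat { ROW, UNSAFE_ROW, COLUMNAR_BATCH }

interface DataReaderFactory extends Serializable {
  // Tells the engine which create*DataReader method to call for this partition.
  DataFormat dataFormat();

  default DataReader<Row> createDataReader() {
    throw new UnsupportedOperationException("ROW format is not supported by this factory");
  }

  default DataReader<UnsafeRow> createUnsafeRowDataReader() {
    throw new UnsupportedOperationException("UNSAFE_ROW format is not supported by this factory");
  }

  default DataReader<ColumnarBatch> createColumnarBatchDataReader() {
    throw new UnsupportedOperationException("COLUMNAR_BATCH format is not supported by this factory");
  }
}
```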
A potential benefit after this change: now it's up to the factory to decide which data format to use (UnsafeRow, ColumnarBatch, etc.). This means it's possible to allow different data partitions to be scanned in different formats. Some hybrid storage systems keep realtime data in row format and historical data in columnar format, and this fits the new API well.
TODO:
apply this change to the write path (next PR)
How was this patch tested?
existing tests