Add support for reading parquet file thanks to arrow-dataset #576 #577
base: master
@@ -6,6 +6,11 @@ import kotlinx.datetime.LocalTime
  import kotlinx.datetime.toKotlinLocalDate
  import kotlinx.datetime.toKotlinLocalDateTime
  import kotlinx.datetime.toKotlinLocalTime
+ import org.apache.arrow.dataset.file.FileFormat
+ import org.apache.arrow.dataset.file.FileSystemDatasetFactory
+ import org.apache.arrow.dataset.jni.DirectReservationListener
+ import org.apache.arrow.dataset.jni.NativeMemoryPool
+ import org.apache.arrow.dataset.scanner.ScanOptions
  import org.apache.arrow.memory.RootAllocator
  import org.apache.arrow.vector.BigIntVector
  import org.apache.arrow.vector.BitVector
@@ -414,3 +419,27 @@ internal fun DataFrame.Companion.readArrowImpl(
          return flattened.concatKeepingSchema()
      }
  }
+
+ internal fun DataFrame.Companion.readArrowDataset(
+     fileUri: String,
+     fileFormat: FileFormat,
+     nullability: NullabilityOptions = NullabilityOptions.Infer,
+ ): AnyFrame {
+     val scanOptions = ScanOptions(32768)
+     RootAllocator().use { allocator ->
+         FileSystemDatasetFactory(
+             allocator,
+             NativeMemoryPool.createListenable(DirectReservationListener.instance()),
+             fileFormat,
+             fileUri,
+         ).use { datasetFactory ->
+             datasetFactory.finish().use { dataset ->
+                 dataset.newScan(scanOptions).use { scanner ->
+                     scanner.scanBatches().use { reader ->
+                         return readArrow(reader, nullability)
+                     }
+                 }
+             }
+         }
+     }
+ }

Review comment (on `internal fun DataFrame.Companion.readArrowDataset`):
Would it be worth it to expose this one to the public API as well? That way people could also supply .orc files, for instance. We could also keep it internal and add a […]

Reply:
*readArrowDatasetImpl

Review comment (on `val scanOptions = ScanOptions(32768)`):
I saw this 32,768 number in the Arrow docs as well. Do you have any idea what this does and if we should allow the user to change it?

Reply:
According to the documentation, batchSize is the […]

Review comment (on `return readArrow(reader, nullability)`):
can better call […]
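The diff above relies on Kotlin's `AutoCloseable.use` to guarantee that each Arrow resource (allocator, factory, dataset, scanner, reader) is closed in reverse order of acquisition, even when the body returns early with the result. A minimal stdlib-only sketch of that nesting pattern (the `Resource` class and names here are illustrative stand-ins, not Arrow APIs):

```kotlin
// Illustrative stand-in for an Arrow resource (allocator, factory, reader, ...).
class Resource(val name: String, private val log: MutableList<String>) : AutoCloseable {
    override fun close() { log.add("closed ${name}") }
}

// Mirrors the nesting in readArrowDataset: the non-local `return` escapes all
// three `use` blocks, yet each close() still runs, innermost first.
fun readWithNestedUse(log: MutableList<String>): String {
    Resource("allocator", log).use {
        Resource("factory", log).use {
            Resource("reader", log).use {
                return "data"
            }
        }
    }
}
```

Because `use` is an inline function with a `finally` block, the reader closes before the factory, and the allocator closes last — the same unwinding order the PR's nested `.use` chain depends on.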
Review comment:
Like other `readXxx()` functions in the library, we probably need overloads for providing the path/url in the types `Path`, `File`, `URL`, `String` (as url).

I also noticed Parquet files can be partitioned, so they consist of multiple files. It seems like `FileSystemDatasetFactory` can also take an array of URIs. If this is common, we should probably allow multiple URIs, so perhaps make it `vararg`? What do you think about that?

Reply:
As far as I know, arrow-dataset does not support HTTP URLs at the moment, but it supports many file systems like s3, gcs, hdfs, ...

I propose to add methods with `Path` and `File` for local files. For HTTP URLs it's possible to download the file to a tmp file, load the dataframe, and delete the file. For other URLs we can let arrow-dataset try to load them and throw an error if it doesn't work.

I think I can easily add a `vararg` for `URL`, `File`, or `Path`, using the `FileSystemDatasetFactory` constructor that you mention with multiple URIs.
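The `vararg` overloads discussed above would need to turn `File`/`Path` arguments into the URI strings that `FileSystemDatasetFactory` accepts. A sketch of that conversion step (the helper names are hypothetical, not part of the PR; only the `toURI()`/`toUri()` conversions are standard JDK behavior):

```kotlin
import java.io.File
import java.nio.file.Path

// Hypothetical helpers: map vararg File/Path overloads down to the
// String-URI array form that a multi-uri FileSystemDatasetFactory
// constructor could consume.
fun filesToUris(vararg files: File): Array<String> =
    files.map { it.toURI().toString() }.toTypedArray()

fun pathsToUris(vararg paths: Path): Array<String> =
    paths.map { it.toUri().toString() }.toTypedArray()
```

A `readArrowDataset(vararg files: File, ...)` overload could then delegate to the String-based variant without duplicating the scanner logic; partitioned Parquet datasets would just pass every part file in one call.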