
Adjustable capacity/batch to create a dataset with IOTensor #445

@yongtang

Description


In IOTensor's C++ implementation, data records are read as chunks. This greatly improves performance for small record types (e.g., where each record is a single integer or float32 value).

At the moment the chunk size is hard-coded as 4096 (an arbitrary value). See comment #437 (comment)

However, this number could be made adjustable. It also does not have to be uniform: it could be a sequence that specifies the exact chunk size for each step.

One approach is to create a C++ kernel op that takes the output of the Init function (a resource type) as input and queries it for the chunk sizes. The result could be a 1-D tensor specifying the chunk size for each step.
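A minimal Python sketch of the proposed op's contract, assuming a resource that knows its total record count (all names here are illustrative stand-ins, not the actual kernel API):

```python
class IOResource:
    """Stand-in for the resource handle returned by the Init kernel op."""
    def __init__(self, num_records):
        self.num_records = num_records


def chunk_sizes(resource, default_chunk=4096):
    """Return a 1-D sequence of chunk sizes covering all records.

    For a format with natural chunk boundaries (Avro sync marks,
    Parquet RowGroups) these sizes would instead come from the file
    metadata rather than a fixed default.
    """
    sizes = []
    remaining = resource.num_records
    while remaining > 0:
        step = min(default_chunk, remaining)
        sizes.append(step)
        remaining -= step
    return sizes
```

For example, a resource with 10000 records and the current default of 4096 would yield chunk sizes `[4096, 4096, 1808]`.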

Then, on the Python side, the 1-D chunk-size tensor could be used to generate a sequence as a dataset, and a map function over that dataset could pass each chunk size to the GetItem kernel op.
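The Python-side step can be sketched in plain Python: turn the chunk sizes into `(start, stop)` ranges and map a GetItem-style callable over them (`get_item` here is a hypothetical placeholder for the real kernel op):

```python
def chunk_ranges(sizes):
    """Turn a 1-D list of chunk sizes into (start, stop) index pairs,
    the per-step arguments a GetItem-style kernel op would receive."""
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


def read_in_chunks(get_item, sizes):
    """Apply get_item over each chunk range, mimicking a dataset built
    from the chunk-size tensor with a map function applied."""
    for start, stop in chunk_ranges(sizes):
        yield get_item(start, stop)
```

In the real implementation the iteration would be a tf.data pipeline rather than a Python generator, but the mapping from chunk sizes to GetItem calls is the same.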

This is especially helpful for file formats that have natural chunks. For example, Avro files split naturally at sync marks, and Parquet splits naturally at RowGroup boundaries. ArrowBatch could likely be handled the same way.

/cc @BryanCutler
