
Adjustable capacity/batch to create a dataset with IOTensor #445

@yongtang

Description


In IOTensor's C++ implementation, data records are read as chunks. This greatly improves performance for small record types (e.g., where each record is a single integer or float32 value).

At the moment the chunk size is hard-coded as 4096 (an arbitrary value). See comment #437 (comment)

However, this number could be made adjustable. It also does not have to be uniform: it could be a sequence that specifies the exact chunk size for each step.

One approach is to create a C++ kernel op that takes the output of the Init function (a resource type) as input and queries it for the chunk sizes. The result could be a 1-D tensor specifying the chunk size for each step.
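A minimal Python sketch of the proposed op's contract, assuming a resource that knows its total record count (all names here are illustrative stand-ins, not the actual kernel API):

```python
class IOResource:
    """Stand-in for the resource handle returned by the Init kernel op."""
    def __init__(self, num_records):
        self.num_records = num_records


def chunk_sizes(resource, default_chunk=4096):
    """Return a 1-D sequence of chunk sizes covering all records.

    For a format with natural chunk boundaries (Avro sync marks,
    Parquet RowGroups) these sizes would instead come from the file
    metadata rather than a fixed default.
    """
    sizes = []
    remaining = resource.num_records
    while remaining > 0:
        step = min(default_chunk, remaining)
        sizes.append(step)
        remaining -= step
    return sizes
```

For example, a resource with 10000 records and the current default of 4096 would yield chunk sizes `[4096, 4096, 1808]`.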

Then, on the Python side, the 1-D chunk-size tensor could be used to generate a sequence as a dataset, and a map function over that dataset could pass each chunk size to the GetItem kernel op.
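The Python-side step can be sketched in plain Python: turn the chunk sizes into `(start, stop)` ranges and map a GetItem-style callable over them (`get_item` here is a hypothetical placeholder for the real kernel op):

```python
def chunk_ranges(sizes):
    """Turn a 1-D list of chunk sizes into (start, stop) index pairs,
    the per-step arguments a GetItem-style kernel op would receive."""
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


def read_in_chunks(get_item, sizes):
    """Apply get_item over each chunk range, mimicking a dataset built
    from the chunk-size tensor with a map function applied."""
    for start, stop in chunk_ranges(sizes):
        yield get_item(start, stop)
```

In the real implementation the iteration would be a tf.data pipeline rather than a Python generator, but the mapping from chunk sizes to GetItem calls is the same.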

This is especially helpful for file formats that have natural chunks. For example, Avro files split naturally at sync marks, and Parquet splits naturally at RowGroup boundaries. ArrowBatch could likely be handled the same way.

/cc @BryanCutler
