@yongtang
Member

This PR tries to expose a tfio.IOTensor class that can be applied to any IO-related data that is indexable (supports __getitem__ and __len__).

The idea is to bind __getitem__ and __len__ to kernel ops at run time, so that it is not necessary to read everything into memory.

The first supported file format is WAV. With tfio.IOTensor, dtype and shape are exposed along with __getitem__ and __len__.

Further, a rate property has been exposed specifically for Audio/WAV files, which gives the sample rate.

This tfio.IOTensor only works in eager mode.
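
To illustrate the lazy-indexing idea described above, here is a minimal pure-Python sketch. It is not the tfio implementation (which binds to C++ kernel ops); it uses the standard-library `wave` module instead, and the class name `LazyWavTensor` is hypothetical. It assumes 16-bit PCM input.

```python
import struct
import wave


class LazyWavTensor:
    """Hypothetical indexable view over a 16-bit PCM WAV file.

    Only the header is read up front; samples are read when indexed.
    """

    dtype = "int16"

    def __init__(self, path):
        self._path = path
        with wave.open(path, "rb") as f:
            if f.getsampwidth() != 2:
                raise ValueError("this sketch only handles 16-bit PCM")
            self._frames = f.getnframes()
            self._channels = f.getnchannels()
            self.rate = f.getframerate()  # sample rate in Hz

    @property
    def shape(self):
        return (self._frames, self._channels)

    def __len__(self):
        return self._frames

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(self._frames)
            if step != 1:
                raise ValueError("only contiguous slices are supported")
        else:
            if key < 0:
                key += self._frames
            if not 0 <= key < self._frames:
                raise IndexError(key)
            start, stop = key, key + 1
        count = max(stop - start, 0)
        # Seek to the requested frame and read only what was asked for.
        with wave.open(self._path, "rb") as f:
            f.setpos(start)
            raw = f.readframes(count)
        values = struct.unpack("<%dh" % (count * self._channels), raw)
        rows = [list(values[i:i + self._channels])
                for i in range(0, len(values), self._channels)]
        return rows if isinstance(key, slice) else rows[0]
```

Each frame comes back as a list with one value per channel, so a mono file yields single-element rows.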

In addition, this PR also converts WavDataset to use IOTensor (instead of a direct C++ implementation).

This PR also carries #420.

Note: as discussed, rebatch has been dropped. Instead, a PR to the core TensorFlow repo will be opened.

Signed-off-by: Yong Tang [email protected]

@yongtang
Member Author

@BryanCutler I reorganized the python class as was suggested, please take a look.

@yongtang
Member Author

@BryanCutler @terrytangyuan Some notes about the changes in this PR:

  1. This PR adds JSONIOTensor, which is actually built on Apache Arrow C++ (I didn't realize Arrow already supports so many formats). We could probably take another look and see what else in Arrow could be built into tfio.
  2. This PR updates KafkaDataset, which keeps the old API but uses the C++ implementation of KafkaIterable. So KafkaDataset is used for iteration and for passing to tf.keras directly (if the data has already been preprocessed).
  3. This PR adds KafkaIOTensor, which uses the same C++ implementation of KafkaIterable. It adds a thin layer to store data in memory. This is used for indexing and slicing, and for any complicated feature engineering (that could not be done with just an iterable).
  4. You can convert an IOTensor to a dataset (to_dataset()); this is for people who have already done heavy feature engineering such as normalizing over a summation, full-range shuffling, etc.
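
The split between an iterable and an in-memory indexable layer in points 2 to 4 can be sketched in plain Python. This is only an illustration of the idea, not the tfio or Kafka code; `InMemoryIndexable` and `normalize_over_sum` are hypothetical names, and a generator stands in for the message stream.

```python
class InMemoryIndexable:
    """Hypothetical thin layer that materializes an iterable once,
    so that len(), indexing and slicing become available."""

    def __init__(self, iterable):
        self._items = list(iterable)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, key):
        return self._items[key]

    def __iter__(self):
        return iter(self._items)

    def to_dataset(self):
        # Hand the (possibly preprocessed) data back as a plain iterator.
        return iter(self._items)


def normalize_over_sum(values):
    """Needs the full sum first, so a single pass over a stream is not enough."""
    total = sum(values)
    return [v / total for v in values]


stream = (x for x in [1.0, 1.0, 2.0])   # stands in for a message stream
tensor = InMemoryIndexable(stream)
normalized = InMemoryIndexable(normalize_over_sum(tensor))
```

Once the normalized values are wrapped again, `normalized.to_dataset()` gives an iterator that can be consumed like any other stream.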

One final note is that KafkaIterable is about 200 lines of C++, whereas the old handcrafted KafkaDataset is 450+ lines of C++. I think this is a nice code reduction.

Please take a look and see if this is OK. If it is fine, I am planning to roll out the new implementation to most of the remaining ops.

@yongtang
Member Author

@BryanCutler @terrytangyuan One final note is that all internal implementations batch and cache a large chunk automatically, so I would expect a slight improvement in performance. This is especially the case for non-image files where each element is very small (such as 4 bytes for an integer).
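
A rough sketch of the chunk-caching idea, in plain Python: element reads are served from the most recently fetched chunk, so many small reads turn into a few large ones. The class name and the `fetch(start, stop)` callback are assumptions made for illustration; the actual caching lives in the C++ kernels and may work differently.

```python
class ChunkCachedReader:
    """Hypothetical reader that fetches elements one chunk at a time."""

    def __init__(self, fetch, length, chunk_size=4096):
        self._fetch = fetch            # fetch(start, stop) -> list of elements
        self._length = length
        self._chunk_size = chunk_size
        self._chunk_start = None       # start offset of the cached chunk
        self._chunk = []

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError(index)
        # Align the read to a chunk boundary and reuse the cache if possible.
        start = (index // self._chunk_size) * self._chunk_size
        if start != self._chunk_start:
            stop = min(start + self._chunk_size, self._length)
            self._chunk = self._fetch(start, stop)
            self._chunk_start = start
        return self._chunk[index - start]
```

With a chunk size of 4, reading 10 elements in order issues only three fetches instead of ten.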

Member

@BryanCutler BryanCutler left a comment

I just had some general questions, but overall this looks really good! There is a lot to digest here, but we can discuss it later so as not to block other pending PRs.

return _BaseIOTensorDataset(
self.spec, self._resource, self._function)

class _ColumnIOTensor(_BaseIOTensor):
Member

So does a ColumnIOTensor have a relationship with TableIOTensor?

Member Author

@BryanCutler ColumnIOTensor is essentially a single tensor/array with one data type.

I could not find a better name. Maybe some suggestions?

See some of the discussions in #315 (comment)

shared_name="%s/%s" % (subscription, uuid.uuid4().hex))

capacity = 4096
dataset = tf.compat.v2.data.Dataset.range(0, sys.maxsize, capacity)
Member

So this is to make a continuous stream with chunks the size of capacity? Is capacity going to be configurable?

Member Author

@BryanCutler Yes, it could easily be made adjustable, and it could even be a 1-d array (rather than a constant). Added an issue #445 for that. Will try to write an example once I find some time.
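
As a rough illustration of the two options mentioned here (a constant capacity versus a 1-d array of capacities), the following plain-Python sketch computes the chunk boundaries. The function name `chunk_bounds` is hypothetical, and a finite `total` is used instead of `sys.maxsize` so the result can be listed; the snippet under review builds the offsets with `Dataset.range(0, sys.maxsize, capacity)`.

```python
def chunk_bounds(total, capacity):
    """Yield (start, stop) pairs covering [0, total).

    `capacity` is either one positive int (constant chunk size)
    or a sequence of positive ints (one size per chunk).
    """
    if isinstance(capacity, int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        for start in range(0, total, capacity):
            yield start, min(start + capacity, total)
    else:
        start = 0
        for size in capacity:
            if start >= total:
                break
            stop = min(start + size, total)
            yield start, stop
            start = stop


constant_chunks = list(chunk_bounds(10, 4))
variable_chunks = list(chunk_bounds(10, [2, 3, 5]))
```

Here `constant_chunks` is `[(0, 4), (4, 8), (8, 10)]` and `variable_chunks` is `[(0, 2), (2, 5), (5, 10)]`.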

subscription, metadata=metadata,
container=scope,
shared_name="%s/%s" % (subscription, uuid.uuid4().hex))
print("VVV: ", dtypes, shapes)
Member

is this a leftover print statement?

Member Author

@BryanCutler Thanks. Removed.

int column_index = columns_index_[i];
::tensorflow::DataType dtype;
switch (table_->column(column_index)->type()->id()) {
case ::arrow::Type::BOOL:
Member

could you use arrow::adapters::tensorflow::GetTensorFlowType here?

Member Author

@BryanCutler It is a little complicated, as GetTensorFlowType is defined in a header file in the Arrow library, so directly including the header in two .cc files will not work. I created a wrapper instead to avoid linking issues.

@jiachengxu
Contributor

Hi @yongtang, it is so great to have from_json! Here are some of my thoughts:

  • As you mentioned in #420 (Expose tfio.IOTensor class and from_audio and tfio.IOTensor.to_dataset()) and also in this implementation of from_json, Arrow is a good fit for handling splittable JSON (ndjson). I am wondering whether I should also switch to Arrow for the list_json_columns and read_json ops. Since Arrow uses rapidjson underneath, and according to some experiments (https://github.com/mloskot/json_benchmark), Arrow could give better performance.
  • Pure JSON is kind of special: it is neither splittable nor indexable, so I am wondering whether it is even possible to implement from_json for pure JSON, because I think the current from_json is really from_ndjson.

@yongtang
Member Author

@BryanCutler I plan to merge this PR shortly. It may not be perfect, though I think we could just move forward and polish in follow-up PRs (there might be many).

Created issue #445 to track making capacity configurable and translating the batch into an array of capacities (instead of one constant).

Also added a comment in #315 (comment) to expand the discussion.

@yongtang yongtang merged commit 26442dc into tensorflow:master Aug 24, 2019
@yongtang yongtang deleted the io_tensor branch August 24, 2019 17:46
@yongtang
Member Author

@jiachengxu I think list_json_columns could be deprecated, as it was meant to be a workaround to give users an easy way to check for columns. Instead, it could be integrated into IOTensor so that it automatically gives you the columns (and more metadata).

JSON and ndjson could be consolidated, with one flag passed to from_json to control whether a root element is to be parsed or not (ndjson does not have a root element; JSON has exactly one root element).
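
A minimal sketch of the single-flag idea using only the standard-library `json` module. The function `read_records` and its `has_root` parameter are hypothetical and do not reflect the actual from_json signature or its Arrow-based implementation.

```python
import json


def read_records(text, has_root=True):
    """Return a list of records from JSON or ndjson text.

    has_root=True:  the text is one JSON document with a single root element.
    has_root=False: the text is ndjson, one JSON value per non-empty line.
    """
    if has_root:
        root = json.loads(text)
        # A root array holds many records; any other root is a single record.
        return root if isinstance(root, list) else [root]
    return [json.loads(line) for line in text.splitlines() if line.strip()]


from_json = read_records('[{"a": 1}, {"a": 2}]', has_root=True)
from_ndjson = read_records('{"a": 1}\n{"a": 2}\n', has_root=False)
```

Both calls produce the same two records, which is what makes a single entry point with a flag workable.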

i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
…t() (tensorflow#437)

* Expose tfio.IOTensor class and from_audio and tfio.IOTensor.to_dataset()

Signed-off-by: Yong Tang <[email protected]>

* Remove Iterable from reference

Signed-off-by: Yong Tang <[email protected]>

* Pylint fix

Signed-off-by: Yong Tang <[email protected]>

* Add a decorator so that it could be picked up by __repr__ automatically

Signed-off-by: Yong Tang <[email protected]>

* Fix python 3 issue

Signed-off-by: Yong Tang <[email protected]>

* Add KafkaDataset to tensorflow_io.core.python.ops.kafka_ops.KafkaDataset

intend to deprecate old KafkaDataset soon.

Signed-off-by: Yong Tang <[email protected]>

* Add KafkaIOTensor which stores data in memory (so that it is indexable)

This is built around the same code base as KafkaDataset C++.

Signed-off-by: Yong Tang <[email protected]>

* Deprecate WAVDataset, and pylint fix

Signed-off-by: Yong Tang <[email protected]>

* Remove leftover print

Signed-off-by: Yong Tang <[email protected]>

* Import GetTensorFlowType and GetArrowType

Signed-off-by: Yong Tang <[email protected]>

* Fix kokoro version

Signed-off-by: Yong Tang <[email protected]>