Expose tfio.IOTensor class and from_audio and tfio.IOTensor.to_dataset() #437

yongtang · 2019-08-22T02:24:43Z

This PR tries to expose a tfio.IOTensor class that can be applied to any I/O-related data that is indexable (supports __getitem__ and __len__).

The idea is to bind __getitem__ and __len__ to kernel ops at run time, so that it is not necessary to read everything into memory.

The first supported file format is WAV. With tfio.IOTensor, dtype and shape are exposed along with __getitem__ and __len__.

In addition, a rate property that gives the sample rate is exposed specifically for audio/WAV files.

This tfio.IOTensor only works in eager mode.
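
A minimal sketch of the lazy-indexing idea described above, using only Python's standard `wave` module rather than tfio kernel ops. The names `LazyWav` and `write_demo_wav` are hypothetical and exist only for illustration; the sketch assumes 16-bit PCM and is not the implementation in this PR. Only the header is read up front, and each `__getitem__` call reads just the requested frames.

```python
import os
import struct
import tempfile
import wave


def write_demo_wav(path, samples, rate=16000):
    """Write mono 16-bit PCM samples to `path` (helper for the demo only)."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(struct.pack("<%dh" % len(samples), *samples))


class LazyWav:
    """Hypothetical indexable view over a WAV file; not the tfio class."""

    def __init__(self, path):
        self._path = path
        # Only the header is parsed here; sample data stays on disk.
        with wave.open(path, "rb") as f:
            self.rate = f.getframerate()
            self._channels = f.getnchannels()
            self._frames = f.getnframes()
            if f.getsampwidth() != 2:
                raise ValueError("this sketch only handles 16-bit PCM")

    @property
    def shape(self):
        return (self._frames, self._channels)

    def __len__(self):
        return self._frames

    def __getitem__(self, key):
        # Normalize an int or a slice into a [start, stop) frame range.
        if isinstance(key, slice):
            start, stop, step = key.indices(self._frames)
            if step != 1:
                raise ValueError("this sketch only supports step 1")
        else:
            index = key + self._frames if key < 0 else key
            if not 0 <= index < self._frames:
                raise IndexError(key)
            start, stop = index, index + 1
        count = max(stop - start, 0)
        # Seek to the first requested frame and read only what is needed.
        with wave.open(self._path, "rb") as f:
            f.setpos(start)
            raw = f.readframes(count)
        values = struct.unpack("<%dh" % (count * self._channels), raw)
        frames = [list(values[i:i + self._channels])
                  for i in range(0, len(values), self._channels)]
        return frames if isinstance(key, slice) else frames[0]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    write_demo_wav(demo_path, list(range(100)))
    audio = LazyWav(demo_path)
    print(len(audio), audio.rate, audio.shape, audio[10:13])
```

Reopening the file on every access keeps the sketch short; a real implementation would presumably hold a resource handle instead.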

In addition, this PR converts WavDataset to use IOTensor (instead of a direct C++ implementation).

This PR also carries #420.

Note: as discussed, rebatch has been dropped. Instead, a PR to the core TensorFlow repo will be opened.

Signed-off-by: Yong Tang [email protected]

yongtang · 2019-08-22T02:25:19Z

@BryanCutler I reorganized the python class as was suggested, please take a look.

yongtang · 2019-08-22T19:28:26Z

@BryanCutler @terrytangyuan Some notes about the changes in this PR:

This PR adds JSONIOTensor, which is built on Apache Arrow C++ (I didn't realize Arrow already supports so many formats). We could probably take another look and see what else in Arrow could be built into tfio.
This PR updates KafkaDataset, which keeps the old API but uses the C++ implementation of KafkaIterable. KafkaDataset is used for iteration and for passing to tf.keras directly (if the data has already been preprocessed).
This PR adds KafkaIOTensor, which uses the same C++ implementation of KafkaIterable and adds a thin layer to store data in memory. It is used for indexing and slicing, and for any complicated feature engineering that could not be done with just an iterable.
You can convert an IOTensor to a dataset with to_dataset(). This is for people who have already done heavy feature engineering, such as normalizing over a sum, full-range shuffling, etc.
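
To illustrate the split between indexable data and an iterable dataset described in the notes above, here is a small pure-Python sketch. `InMemoryColumn`, `map_all`, and `normalize_by_sum` are hypothetical names; only `to_dataset` echoes a name from this PR, and its signature here (a `batch_size` argument) is an assumption. The point is that whole-column operations such as normalizing over a sum need every element at once, after which the result can be handed off as a stream of batches.

```python
import random


class InMemoryColumn:
    """Hypothetical in-memory, indexable column; illustrates the idea only."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, key):
        # Supports both integer indexing and slicing.
        return self._values[key]

    def map_all(self, fn):
        # Whole-column transform: fn receives every element at once,
        # which a one-pass iterable cannot offer.
        return InMemoryColumn(fn(self._values))

    def to_dataset(self, batch_size):
        # Hand the (already engineered) data off as a stream of batches.
        for start in range(0, len(self._values), batch_size):
            yield self._values[start:start + batch_size]


def normalize_by_sum(values):
    """Divide each element by the sum of the whole column."""
    total = sum(values)
    return [v / total for v in values]


def full_shuffle(values, seed=0):
    """Shuffle across the full range, not within a bounded buffer."""
    return random.Random(seed).sample(values, len(values))


if __name__ == "__main__":
    column = InMemoryColumn([1, 2, 3, 4])
    engineered = column.map_all(normalize_by_sum).map_all(full_shuffle)
    for batch in engineered.to_dataset(batch_size=3):
        print(batch)
```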

One final note: KafkaIterable is about 200 lines of C++, while the old handcrafted KafkaDataset is 450+ lines of C++. I think this is a nice code reduction.

Please take a look and see if this is OK. If it is fine, I am planning to roll out the new implementation to most of the remaining ops.

yongtang · 2019-08-22T20:06:56Z

@BryanCutler @terrytangyuan One more note: all internal implementations automatically batch and cache a large chunk, so I would expect a slight improvement in performance. This is especially the case for non-image files where each element is very small (such as 4 bytes for an integer).
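
A rough sketch of why chunked reads help when elements are tiny, assuming a stream of little-endian 4-byte integers. `ChunkedInt32Reader` and its `reads` counter are hypothetical and only illustrate the batching idea; the actual C++ kernels are not shown in this thread.

```python
import io
import struct


class ChunkedInt32Reader:
    """Hypothetical reader: yields int32 values one at a time, but pulls
    them from the underlying stream in large chunks."""

    def __init__(self, stream, chunk_elements=1024):
        self._stream = stream
        self._chunk_bytes = chunk_elements * 4  # 4 bytes per int32
        self.reads = 0  # number of calls made to the underlying stream

    def __iter__(self):
        while True:
            chunk = self._stream.read(self._chunk_bytes)
            self.reads += 1
            if not chunk:
                return  # end of stream
            count = len(chunk) // 4
            # Decode the whole chunk in one call, then serve from memory.
            yield from struct.unpack("<%di" % count, chunk[:count * 4])


if __name__ == "__main__":
    values = list(range(10))
    payload = struct.pack("<%di" % len(values), *values)
    reader = ChunkedInt32Reader(io.BytesIO(payload), chunk_elements=4)
    print(list(reader), "underlying reads:", reader.reads)
```

With one read per element, the same ten integers would need eleven calls to the stream (ten values plus the end-of-stream check); the chunked version amortizes that cost across each chunk.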

BryanCutler

I just had some general questions, but overall this looks really good! There is a lot to digest here, but we can discuss later so as not to block other pending PRs.

BryanCutler · 2019-08-23T17:54:49Z

tensorflow_io/core/python/ops/io_tensor_ops.py

+    return _BaseIOTensorDataset(
+        self.spec, self._resource, self._function)
+
+class _ColumnIOTensor(_BaseIOTensor):


So does a ColumnIOTensor have a relationship with TableIOTensor?

@BryanCutler ColumnIOTensor is essentially a single tensor/array with one data type.

I could not find a better name. Maybe some suggestions?

See some of the discussions in #315 (comment)

BryanCutler · 2019-08-23T17:57:38Z

tensorflow_io/core/python/ops/kafka_dataset_ops.py

+          shared_name="%s/%s" % (subscription, uuid.uuid4().hex))
+
+      capacity = 4096
+      dataset = tf.compat.v2.data.Dataset.range(0, sys.maxsize, capacity)


So this is to make a continuous stream with chunks the size of capacity? Is capacity going to be configurable?

@BryanCutler Yes, it could easily be made adjustable, and it could even be a 1-D array (rather than a constant). Added issue #445 for that. Will try to write an example once I find some time.
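
A pure-Python sketch of the chunking pattern discussed in this thread. The diff uses `Dataset.range(0, sys.maxsize, capacity)` to generate chunk start offsets; how each offset is mapped to a read, and the rule that an empty chunk ends the stream, are assumptions made here for illustration. `read_chunks` and `read_variable_chunks` are hypothetical names, and the second function sketches the 1-D array of capacities idea from #445.

```python
import itertools


def chunk_starts(capacity):
    # Unbounded offsets 0, capacity, 2 * capacity, ... (stands in for
    # Dataset.range(0, sys.maxsize, capacity) in the diff).
    return itertools.count(0, capacity)


def read_chunks(source, capacity):
    """Yield consecutive chunks of at most `capacity` items.

    `source` is any sliceable sequence standing in for the real reader.
    """
    for start in chunk_starts(capacity):
        chunk = source[start:start + capacity]
        if not chunk:
            return  # assumption: an empty chunk means end of stream
        yield chunk


def read_variable_chunks(source, capacities):
    """Variant where each chunk has its own capacity (a 1-D array)."""
    start = 0
    for capacity in capacities:
        chunk = source[start:start + capacity]
        if not chunk:
            return
        yield chunk
        start += len(chunk)


if __name__ == "__main__":
    data = list(range(10))
    print(list(read_chunks(data, capacity=4)))
    print(list(read_variable_chunks(data, capacities=[2, 3, 5, 4])))
```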

BryanCutler · 2019-08-23T17:58:21Z

tensorflow_io/core/python/ops/kafka_io_tensor_ops.py

+          subscription, metadata=metadata,
+          container=scope,
+          shared_name="%s/%s" % (subscription, uuid.uuid4().hex))
+      print("VVV: ", dtypes, shapes)


is this a leftover print statement?

@BryanCutler Thanks. Removed.

BryanCutler · 2019-08-23T18:02:11Z

tensorflow_io/json/kernels/json_kernels.cc

+      int column_index = columns_index_[i];
+      ::tensorflow::DataType dtype;
+      switch (table_->column(column_index)->type()->id()) {
+      case ::arrow::Type::BOOL:


could you use arrow::adapters::tensorflow::GetTensorFlowType here?

@BryanCutler It is a little complicated, as GetTensorFlowType is defined in a header file in the Arrow library, so directly including the header in two .cc files will not work. I created a wrapper instead to avoid linking issues.

jiachengxu · 2019-08-24T06:00:44Z

Hi @yongtang, it is great to have from_json! Here are some of my thoughts:

As you mentioned in #420 and in this implementation of from_json, Arrow is a good fit for handling splittable JSON (ndjson). I am wondering whether I should also switch to Arrow for the list_json_columns and read_json ops. Since Arrow uses rapidjson underneath, and according to some experiments (https://github.com/mloskot/json_benchmark), Arrow could maybe give better performance.
Pure JSON is kind of special: it is not splittable or indexable, so I am thinking that it may be impossible to implement from_json for pure JSON. I think the current from_json is really from_ndjson.
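
To make the "splittable and indexable" point concrete, here is a small sketch of how ndjson allows random access to a record without parsing the whole file: since every record sits on its own line, an index of byte offsets is enough. `build_line_index` and `record_at` are hypothetical helpers, not part of tfio or Arrow.

```python
import json


def build_line_index(data):
    """Return (offset, length) pairs for each non-empty line in `data` (bytes)."""
    index = []
    position = 0
    for line in data.splitlines(keepends=True):
        if line.strip():
            index.append((position, len(line)))
        position += len(line)
    return index


def record_at(data, index, i):
    """Parse only the i-th record, using the precomputed byte offsets."""
    start, length = index[i]
    return json.loads(data[start:start + length])


if __name__ == "__main__":
    ndjson = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
    offsets = build_line_index(ndjson)
    print(len(offsets), record_at(ndjson, offsets, 2))
```

A single pure-JSON document has no such record boundaries, which is why the same trick does not carry over directly.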

yongtang · 2019-08-24T17:44:45Z

@BryanCutler I plan to merge this PR shortly. It may not be perfect, though I think we could just move forward and polish in follow-up PRs (there might be many).

Created issue #445 to track capacity and the translation of batch into an array of capacities (instead of one constant).

Also added a comment in #315 (comment) to expand the discussion.

yongtang · 2019-08-24T17:49:10Z

@jiachengxu I think list_json_columns could be deprecated, as it was meant to be a workaround that gives users an easy way to check for columns. Instead, it could be integrated into IOTensor, which would automatically give you the columns (and more metadata).

json and ndjson could be consolidated, with one flag passed to from_json to control whether the root element is parsed or not (ndjson does not have a root element; json has one root element).
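
A minimal sketch of the consolidation idea, using only the standard `json` module. The flag name `ndjson` and the helpers `load_records` and `column_names` are hypothetical; the actual from_json signature is not shown in this thread. The sketch also shows how column names could fall out of the parsed records without a separate list_json_columns call.

```python
import json


def load_records(text, ndjson=True):
    """Return a list of records from either ndjson or a single JSON document.

    ndjson=True:  no root element; each non-empty line is one record.
    ndjson=False: one root element is parsed; a root list is treated as
                  the list of records (an assumption of this sketch).
    """
    if ndjson:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    root = json.loads(text)
    return root if isinstance(root, list) else [root]


def column_names(records):
    """Collect column names in first-seen order across all records."""
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    return names


if __name__ == "__main__":
    as_ndjson = '{"a": 1, "b": 2.5}\n{"a": 3, "b": 4.5}\n'
    as_json = '[{"a": 1, "b": 2.5}, {"a": 3, "b": 4.5}]'
    print(load_records(as_ndjson) == load_records(as_json, ndjson=False))
    print(column_names(load_records(as_ndjson)))
```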

yongtang requested review from BryanCutler and terrytangyuan August 22, 2019 02:24

yongtang mentioned this pull request Aug 22, 2019

Expose tfio.IOTensor class and from_audio and tfio.IOTensor.to_dataset() #420

Closed

yongtang force-pushed the io_tensor branch from 9d157b0 to 2458835 Compare August 22, 2019 15:42

yongtang force-pushed the io_tensor branch from 30bc3b3 to 525d55f Compare August 22, 2019 23:57

This was referenced Aug 23, 2019

Add tfio.IOTensor.from_prometheus support #438

Merged

Add tfio.IOTensor.from_parquet support #439

Merged

[WIP] Add FileDataset to read whole content of file into tf.data pipeline #366

Closed

Add tfio.IOTensor.from_avro support #440

Merged

BryanCutler approved these changes Aug 23, 2019

View reviewed changes

This was referenced Aug 24, 2019

Add tfio.IOTensor.from_hdf5 support #441

Merged

Add tfio.IOTensor.from_feather support #442

Merged

Add tfio.IOTensor.from_csv support (experimental with Apache Arrow's CSV parser) #443

Merged

yongtang mentioned this pull request Aug 24, 2019

Adjustable capacity/batch to create a dataset with IOTensor #445

Open

yongtang force-pushed the io_tensor branch from 525d55f to 2c547c6 Compare August 24, 2019 15:39

yongtang added 10 commits August 24, 2019 15:49

Remove Iterable from reference

1527f0e

Signed-off-by: Yong Tang <[email protected]>

Pylint fix

83450ac

Signed-off-by: Yong Tang <[email protected]>

Add a decorator so that it could be picked up by __repr__ automatically

eb19c0c

Signed-off-by: Yong Tang <[email protected]>

Fix python 3 issue

3a4870a

Signed-off-by: Yong Tang <[email protected]>

Add KafkaDataset to tensorflow_io.core.python.ops.kafka_ops.KafkaDataset

c1323ed

intend to deprecate old KafkaDataset soon. Signed-off-by: Yong Tang <[email protected]>

Add KafkaIOTensor which stores data in memory (so that it is indexable)

5df11e9

This is build around the same code base as KafkaDataset C++. Signed-off-by: Yong Tang <[email protected]>

Deprecate WAVDataset, and pylint fix

4051010

Signed-off-by: Yong Tang <[email protected]>

Remove leftover print

5eb166d

Signed-off-by: Yong Tang <[email protected]>

Import GetTensorFlowType and GetArrowType

a8af64a

Signed-off-by: Yong Tang <[email protected]>

yongtang force-pushed the io_tensor branch from 2c547c6 to a8af64a Compare August 24, 2019 15:49

Fix kokoro version

b1f5508

Signed-off-by: Yong Tang <[email protected]>

yongtang mentioned this pull request Aug 24, 2019

Standardize columnized dataset? #315

Open

yongtang merged commit 26442dc into tensorflow:master Aug 24, 2019

yongtang deleted the io_tensor branch August 24, 2019 17:46

3 participants