-
Notifications
You must be signed in to change notification settings - Fork 307
Expose tfio.IOTensor class and from_audio and tfio.IOTensor.to_dataset() #437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@BryanCutler I reorganized the python class as was suggested, please take a look. |
|
@BryanCutler @terrytangyuan Some note about the changes in this PR:
One final note, is that Please take a look and see if this is OK. If it is fine, I am planning to roll out the new implementation to most of the remaining ops. |
|
@BryanCutler @terrytangyuan one final note is that, all internal implementation batches and caches a large chunk automatically so I would assume there will be a slight improvement in performance. This is especially the case for non-image files where each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just had some general questions but overall looks really good! There is a lot to digest here, but we can discuss later so not to block other pending PRs.
| return _BaseIOTensorDataset( | ||
| self.spec, self._resource, self._function) | ||
|
|
||
| class _ColumnIOTensor(_BaseIOTensor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So does a ColumnIOTensor have a relationship with TableIOTensor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler ColumnIOTensor is essentially a one datatype single tensor/array.
I could not find a better name. Maybe some suggestions?
See some of the discussions in #315 (comment)
| shared_name="%s/%s" % (subscription, uuid.uuid4().hex)) | ||
|
|
||
| capacity = 4096 | ||
| dataset = tf.compat.v2.data.Dataset.range(0, sys.maxsize, capacity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is to make a continuous stream with chunks the size of capacity? Is capacity going to be configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler Yes it could be easily adjustable, and it could even be a 1-d array (than a constant). Added an issue #445 for that. Will try to write an example once I find some time.
| subscription, metadata=metadata, | ||
| container=scope, | ||
| shared_name="%s/%s" % (subscription, uuid.uuid4().hex)) | ||
| print("VVV: ", dtypes, shapes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a leftover print statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler Thanks. Removed.
| int column_index = columns_index_[i]; | ||
| ::tensorflow::DataType dtype; | ||
| switch (table_->column(column_index)->type()->id()) { | ||
| case ::arrow::Type::BOOL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you use arrow::adapters::tensorflow::GetTensorFlowType here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler It is a little complicated as GetTensorFlowType is in a header file in arrow library. So directly include the header in two .cc files will not work. I created a wrapper instead to avoid linking issues.
|
Hi @yongtang, It is so great to have
|
This PR tries to expose a tfio.IOTensor which could be applied to and io related data which are indexable (__getitem__ and __len__) The idea is to bind __getitem__ and __len__ to kernel ops in run time, so that is is not necessarily to read everything in memory. The first file format is the WAV file. With tfio.IOTensor dtype and shape are exposed with __getitem__ and __len__. Further, a rate property has been exposed specifically for Audio/WAV file which gives sample rate. This tfio.IOTensor only works in eager mode. In additional this PR also converts WavDataset to use IOTensor (instead of direct C++ implementation). This PR also carries 420. Note as was discussed, rebatch has been dropped. Instead, a PR to core tensorflow repo will be opened. Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
intend to deprecate old KafkaDataset soon. Signed-off-by: Yong Tang <[email protected]>
This is build around the same code base as KafkaDataset C++. Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
Signed-off-by: Yong Tang <[email protected]>
|
@BryanCutler I plan to merge this PR shortly. It may not be perfect through I think we could just move forward and polish in follow up PRs (might be many). Created one issue #445 to track capacity and translation of batch into an array of capacities (instead of one constant). Also added a comment in #315 (comment) to expand the discussion. |
|
@jiachengxu I think The json and ndjson could be consolidated with one flag passed to |
…t() (tensorflow#437) * Expose tfio.IOTensor class and from_audio and tfio.IOTensor.to_dataset() This PR tries to expose a tfio.IOTensor which could be applied to and io related data which are indexable (__getitem__ and __len__) The idea is to bind __getitem__ and __len__ to kernel ops in run time, so that is is not necessarily to read everything in memory. The first file format is the WAV file. With tfio.IOTensor dtype and shape are exposed with __getitem__ and __len__. Further, a rate property has been exposed specifically for Audio/WAV file which gives sample rate. This tfio.IOTensor only works in eager mode. In additional this PR also converts WavDataset to use IOTensor (instead of direct C++ implementation). This PR also carries 420. Note as was discussed, rebatch has been dropped. Instead, a PR to core tensorflow repo will be opened. Signed-off-by: Yong Tang <[email protected]> * Remove Iterable from reference Signed-off-by: Yong Tang <[email protected]> * Pylint fix Signed-off-by: Yong Tang <[email protected]> * Add a decorator so that it could be picked up by __repr__ automatically Signed-off-by: Yong Tang <[email protected]> * Fix python 3 issue Signed-off-by: Yong Tang <[email protected]> * Add KafkaDataset to tensorflow_io.core.python.ops.kafka_ops.KafkaDataset intend to deprecate old KafkaDataset soon. Signed-off-by: Yong Tang <[email protected]> * Add KafkaIOTensor which stores data in memory (so that it is indexable) This is build around the same code base as KafkaDataset C++. Signed-off-by: Yong Tang <[email protected]> * Deprecate WAVDataset, and pylint fix Signed-off-by: Yong Tang <[email protected]> * Remove leftover print Signed-off-by: Yong Tang <[email protected]> * Import GetTensorFlowType and GetArrowType Signed-off-by: Yong Tang <[email protected]> * Fix kokoro version Signed-off-by: Yong Tang <[email protected]>
This PR tries to expose a tfio.IOTensor which could be applied to and io related data which are indexable (getitem and len)
The idea is to bind
__getitem__and__len__to kernel ops in run time, so that is is not necessarily to read everything in memory.The first file format is the WAV file. With tfio.IOTensor dtype and shape are exposed with
__getitem__and__len__.Further, a rate property has been exposed specifically for Audio/WAV file which gives sample rate.
This tfio.IOTensor only works in eager mode.
In additional this PR also converts WavDataset to use IOTensor (instead of direct C++ implementation).
This PR also carries #420.
Note as was discussed, rebatch has been dropped. Instead, a PR to core tensorflow repo will be opened.
Signed-off-by: Yong Tang [email protected]