
Commit 2ad1337

Add some documentation
Signed-off-by: Yong Tang <[email protected]>
1 parent e966c20 commit 2ad1337

File tree

1 file changed: +183 -1 lines changed


tensorflow_io/core/python/ops/io_tensor.py

Lines changed: 183 additions & 1 deletion
@@ -100,7 +100,189 @@ def __repr__(self):
class IOTensor(_IOBaseTensor):
"""IOTensor

An `IOTensor` is a tensor with data backed by IO operations. For example,
an `AudioIOTensor` is a tensor with data from an audio file, and a
`KafkaIOTensor` is a tensor with data read from the messages of a Kafka
stream server.

There are two types of `IOTensor`: a normal `IOTensor`, which is itself
indexable, and a degenerate `IOIterableTensor`, which only supports
accessing the tensor iteratively.

Since `IOTensor` is indexable, it supports the `__getitem__()` and
`__len__()` methods in Python. In other words, it is a subclass of
`collections.abc.Sequence`.

Example:

```python
>>> import tensorflow_io as tfio
>>>
>>> samples = tfio.IOTensor.from_audio("sample.wav")
>>> print(samples[1000:1005])
... tf.Tensor(
... [[-3]
...  [-7]
...  [-6]
...  [-6]
...  [-5]], shape=(5, 1), dtype=int16)
```

An `IOIterableTensor` is a subclass of `collections.abc.Iterable`. It
provides an `__iter__()` method that can be used (indirectly, through
`iter()`) to access data in an iterative fashion.

Example:

```python
>>> import tensorflow_io as tfio
>>>
>>> kafka = tfio.IOTensor.from_kafka("test", eof=True)
>>> for message in kafka:
...     print(message)
... tf.Tensor(['D0'], shape=(1,), dtype=string)
... tf.Tensor(['D1'], shape=(1,), dtype=string)
... tf.Tensor(['D2'], shape=(1,), dtype=string)
... tf.Tensor(['D3'], shape=(1,), dtype=string)
... tf.Tensor(['D4'], shape=(1,), dtype=string)
```

### Indexable vs. Iterable

While many IO formats are naturally considered iterable only, in most
situations they could still be accessed by indexing through certain
workarounds. For example, a Kafka stream is not directly indexable, yet
the stream could be saved in memory or on disk to allow indexing. Another
example is the packet capture (PCAP) file format in networking. The packets
inside a PCAP file are concatenated sequentially. Since each packet could
have a variable length, the only way to access a packet is to read one
packet at a time. If the PCAP file is huge (e.g., hundreds of GBs or even
TBs), it may not be realistic (or necessary) to save the index of every
packet in memory. We could consider the PCAP format as iterable only.

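The in-memory caching workaround described above can be sketched in plain
Python. Note this is an illustrative `CachedStream` helper, not part of the
tensorflow-io API:

```python
# A plain-Python sketch of the caching workaround: an iterable-only source
# becomes indexable once its items are materialized in memory.
from collections.abc import Sequence


class CachedStream(Sequence):
    # Wraps a one-shot iterable and caches its items to support indexing.
    def __init__(self, iterable):
        self._items = list(iterable)  # save the stream in memory

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


# A generator is iterable only; the wrapper makes it indexable.
stream = (f"D{i}" for i in range(5))
cached = CachedStream(stream)
print(cached[2])    # D2
print(len(cached))  # 5
```

The same idea underlies saving a Kafka stream to memory or disk so that it
can be indexed afterwards.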
As we can see, the amount of available memory is one factor in deciding
whether a format is indexable. However, this factor becomes blurred in
distributed computing. One common case is a file format that is splittable,
where a file could be split into multiple chunks (without reading the whole
file) with no data overlapping between those chunks. For example, a text
file could be reliably split into multiple chunks with the line feed (LF)
as the boundary. Processing of the chunks could then be distributed across
a group of compute nodes to speed things up (by reading small chunks into
memory). From that standpoint, we could still consider splittable formats
as indexable.

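The LF-boundary splitting just described can be sketched as follows. The
helper name and the in-memory file are made up for this illustration:

```python
# A minimal sketch of LF-boundary splitting: pick tentative offsets at fixed
# intervals, then advance each to the next line feed so that no line straddles
# two chunks.
import io


def chunk_offsets(f, chunk_size):
    # Return byte offsets where each chunk starts, aligned to LF boundaries.
    f.seek(0, io.SEEK_END)
    size = f.tell()
    offsets = [0]
    pos = chunk_size
    while pos < size:
        f.seek(pos)
        f.readline()  # skip forward to the end of the current line
        pos = f.tell()
        if pos < size:
            offsets.append(pos)
        pos += chunk_size
    return offsets


data = io.BytesIO(b"alpha\nbeta\ngamma\ndelta\n")
print(chunk_offsets(data, 8))  # [0, 11] -- each offset starts a new line
```

Each returned offset begins a fresh line, so the chunks can be read and
processed independently on different nodes.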
For that reason, our focus is on `IOTensor` with convenient indexing and
slicing through the `__getitem__()` method.

### Lazy Read

One useful feature of `IOTensor` is lazy read: data inside a file is not
read into memory until needed. This could be convenient when only a small
segment of the data is needed. For example, a WAV file could be as big as
several GBs, but in many cases only a few seconds of samples are used for
training or inference purposes.

While CPU memory is cheap nowadays, GPU memory is still considered an
expensive resource. It is also imperative to fit data in GPU memory for
performance. From that perspective, lazy read could be very helpful.

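The lazy-read idea can be sketched in plain Python. The `LazyBytes` class
below is illustrative, not the actual tensorflow-io implementation:

```python
# A plain-Python sketch of lazy read: only the file size is determined up
# front; bytes are read from the underlying file only when a slice is
# requested through __getitem__().
import io


class LazyBytes:
    def __init__(self, fileobj):
        self._f = fileobj
        self._f.seek(0, io.SEEK_END)
        self._size = self._f.tell()  # only the size is read eagerly

    def __len__(self):
        return self._size

    def __getitem__(self, s):
        # Seek to the requested range and read just those bytes.
        start, stop, step = s.indices(self._size)
        self._f.seek(start)
        return self._f.read(stop - start)[::step]


data = LazyBytes(io.BytesIO(b"0123456789" * 1000))
print(len(data))        # 10000
print(data[1000:1005])  # b'01234'
```

Only the five requested bytes are read in the last line, regardless of how
large the underlying file is.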
### Association of Metadata

While a file format could consist of mostly numeric data, in many
situations the metadata is important as well. For example, in audio file
formats the sample rate is a number that is necessary for almost
everything. Associating the sample rate with the samples in the int16
Tensor is helpful, especially in eager mode.

Example:

```python
>>> import tensorflow_io as tfio
>>>
>>> samples = tfio.IOTensor.from_audio("sample.wav")
>>> print(samples.rate)
... 44100
```

### Nested Element Structure

The concept of `IOTensor` is not limited to a Tensor of a single data type.
It supports a nested element structure, which could consist of many
components and complex structures. The exposed API such as `shape()` or
`dtype()` will display the shape and data type of an individual Tensor,
or a nested structure of shapes and data types for the components of a
composite Tensor.

Example:

```python
>>> import tensorflow_io as tfio
>>>
>>> samples = tfio.IOTensor.from_audio("sample.wav")
>>> print(samples.shape)
... (22050, 2)
>>> print(samples.dtype)
... <dtype: 'int32'>
>>>
>>> features = tfio.IOTensor.from_json("feature.json")
>>> print(features.shape)
... (TensorShape([Dimension(2)]), TensorShape([Dimension(2)]))
>>> print(features.dtype)
... (tf.float64, tf.int64)
```

### Access Columns of Tabular Data Formats

Many file formats such as Parquet or JSON are considered tabular because
they consist of columns in a table. With `IOTensor` it is possible to
access individual columns through `__call__()`.

Example:

```python
>>> import tensorflow_io as tfio
>>>
>>> features = tfio.IOTensor.from_json("feature.json")
>>> print(features.shape)
... (TensorShape([Dimension(2)]), TensorShape([Dimension(2)]))
>>> print(features.dtype)
... (tf.float64, tf.int64)
>>>
>>> print(features("floatfeature").shape)
... (2,)
>>> print(features("floatfeature").dtype)
... <dtype: 'float64'>
```

### Conversion from and to Tensor and Dataset

When needed, an `IOTensor` could be converted into a `Tensor` (through
`to_tensor()`) or a `tf.data.Dataset` (through `to_dataset()`) to support
operations that are only available through `Tensor` or `tf.data.Dataset`.

Example:

```python
>>> import tensorflow as tf
>>> import tensorflow_io as tfio
>>>
>>> features = tfio.IOTensor.from_json("feature.json")
>>>
>>> features_tensor = features.to_tensor()
>>> print(features_tensor())
... (<tf.Tensor: id=21, shape=(2,), dtype=float64, numpy=array([1.1, 2.1])>, <tf.Tensor: id=22, shape=(2,), dtype=int64, numpy=array([2, 3])>)
>>>
>>> features_dataset = features.to_dataset()
>>> print(features_dataset)
... <_IOTensorDataset shapes: ((), ()), types: (tf.float64, tf.int64)>
>>>
>>> dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))
```

"""

#=============================================================================
# Constructor (private)
