@@ -100,7 +100,189 @@ def __repr__(self):
 
 
 class IOTensor(_IOBaseTensor):
-  """IOTensor"""
+  """IOTensor
+
+  An `IOTensor` is a tensor with data backed by IO operations. For example,
+  an `AudioIOTensor` is a tensor with data from an audio file, and a
+  `KafkaIOTensor` is a tensor with data read from the messages of a Kafka
+  stream server.
+
+  There are two types of `IOTensor`: a normal `IOTensor`, which is itself
+  indexable, and a degenerate `IOIterableTensor`, which only supports
+  accessing the tensor iteratively.
+
+  Since `IOTensor` is indexable, it supports the `__getitem__()` and
+  `__len__()` methods in Python. In other words, it is a subclass of
+  `collections.abc.Sequence`.
+
+  Example:
+
+  ```python
+  >>> import tensorflow_io as tfio
+  >>>
+  >>> samples = tfio.IOTensor.from_audio("sample.wav")
+  >>> print(samples[1000:1005])
+  ... tf.Tensor(
+  ... [[-3]
+  ...  [-7]
+  ...  [-6]
+  ...  [-6]
+  ...  [-5]], shape=(5, 1), dtype=int16)
+  ```
+
+  An `IOIterableTensor` is a subclass of `collections.abc.Iterable`.
+  It provides an `__iter__()` method that can be used (indirectly through
+  `iter()`) to access the data in an iterative fashion.
+
+  Example:
+
+  ```python
+  >>> import tensorflow_io as tfio
+  >>>
+  >>> kafka = tfio.IOTensor.from_kafka("test", eof=True)
+  >>> for message in kafka:
+  >>>   print(message)
+  ... tf.Tensor(['D0'], shape=(1,), dtype=string)
+  ... tf.Tensor(['D1'], shape=(1,), dtype=string)
+  ... tf.Tensor(['D2'], shape=(1,), dtype=string)
+  ... tf.Tensor(['D3'], shape=(1,), dtype=string)
+  ... tf.Tensor(['D4'], shape=(1,), dtype=string)
+  ```
+
+  ### Indexable vs. Iterable
+
+  While many IO formats are naturally considered iterable only, in most
+  situations they can still be accessed by indexing through certain
+  workarounds. For example, a Kafka stream is not directly indexable, yet
+  the stream could be saved in memory or on disk to allow indexing. Another
+  example is the packet capture (PCAP) file used in networking. The packets
+  inside a PCAP file are concatenated sequentially. Since each packet can
+  have a variable length, the only way to access a packet is to read one
+  packet at a time. If the PCAP file is huge (e.g., hundreds of GBs or even
+  TBs), it may not be realistic (or necessary) to keep an index of every
+  packet in memory. We could consider the PCAP format as iterable only.
+
+  As we can see, the amount of available memory is one factor in deciding
+  whether a format is indexable or not. However, this factor is also blurred
+  in distributed computing. One common case is a splittable file format,
+  where a file could be split into multiple chunks (without reading the
+  whole file) with no data overlapping between those chunks. For example, a
+  text file could be reliably split into multiple chunks with the line feed
+  (LF) character as the boundary. Processing of the chunks could then be
+  distributed across a group of compute nodes to speed things up (by reading
+  small chunks into memory). From that standpoint, we could still consider
+  splittable formats as indexable.
+
+  For that reason, our focus is `IOTensor` with convenient indexing and
+  slicing through the `__getitem__()` method.
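+
+  For illustration, the sketch below shows the sequence-style access an
+  indexable `IOTensor` provides. It reuses the hypothetical `sample.wav`
+  file from the examples above and relies only on the `__len__()` and
+  `__getitem__()` behavior described in this docstring.
+
+  ```python
+  import collections.abc
+
+  import tensorflow_io as tfio
+
+  # An indexable IOTensor behaves like a Python sequence.
+  samples = tfio.IOTensor.from_audio("sample.wav")
+  assert isinstance(samples, collections.abc.Sequence)
+
+  print(len(samples))      # total number of indexable entries
+  print(samples[100:200])  # random access through slicing
+  ```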
+
+  ### Lazy Read
+
+  One useful feature of `IOTensor` is lazy reading: data inside a file is
+  not read into memory until it is needed. This is convenient when only a
+  small segment of the data is needed. For example, a WAV file could be
+  several GBs in size, but in many cases only a few seconds of samples are
+  used for training or inference.
+
+  While CPU memory is cheap nowadays, GPU memory is still considered an
+  expensive resource. It is also imperative to fit data in GPU memory for
+  speed. From that perspective, lazy reads could be very helpful.
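+
+  The following minimal sketch illustrates the lazy-read pattern described
+  above. It assumes a (potentially large) `sample.wav` file as in the other
+  examples, and a 44100 Hz sample rate purely to size the one-second window.
+
+  ```python
+  import tensorflow_io as tfio
+
+  # Constructing the IOTensor does not read the audio samples into memory.
+  samples = tfio.IOTensor.from_audio("sample.wav")
+
+  # Only the requested slice is read when it is accessed, e.g. roughly one
+  # second of audio at a 44100 Hz sample rate.
+  window = samples[0:44100]
+  print(window.shape, window.dtype)
+  ```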
+
+  ### Association of Meta Data
+
+  While a file format may consist of mostly numeric data, in many situations
+  the metadata is important as well. For example, in audio file formats the
+  sample rate is a number that is needed for almost everything. Associating
+  the sample rate with the samples of the int16 Tensor is more helpful,
+  especially in eager mode.
+
+  Example:
+
+  ```python
+  >>> import tensorflow_io as tfio
+  >>>
+  >>> samples = tfio.IOTensor.from_audio("sample.wav")
+  >>> print(samples.rate)
+  ... 44100
+  ```
+
+  ### Nested Element Structure
+
+  The concept of `IOTensor` is not limited to a Tensor of a single data
+  type. It supports nested element structures, which could consist of many
+  components and complex structures. The exposed APIs such as `shape` or
+  `dtype` will return the shape and data type of an individual Tensor, or a
+  nested structure of shapes and data types for the components of a
+  composite Tensor.
+
+  Example:
+
+  ```python
+  >>> import tensorflow_io as tfio
+  >>>
+  >>> samples = tfio.IOTensor.from_audio("sample.wav")
+  >>> print(samples.shape)
+  ... (22050, 2)
+  >>> print(samples.dtype)
+  ... <dtype: 'int32'>
+  >>>
+  >>> features = tfio.IOTensor.from_json("feature.json")
+  >>> print(features.shape)
+  ... (TensorShape([Dimension(2)]), TensorShape([Dimension(2)]))
+  >>> print(features.dtype)
+  ... (tf.float64, tf.int64)
+  ```
+
+  ### Access Columns of Tabular Data Formats
+
+  Many file formats, such as Parquet or JSON, are considered tabular because
+  they consist of columns in a table. With `IOTensor` it is possible to
+  access individual columns through `__call__()`.
+
+  Example:
+
+  ```python
+  >>> import tensorflow_io as tfio
+  >>>
+  >>> features = tfio.IOTensor.from_json("feature.json")
+  >>> print(features.shape)
+  ... (TensorShape([Dimension(2)]), TensorShape([Dimension(2)]))
+  >>> print(features.dtype)
+  ... (tf.float64, tf.int64)
+  >>>
+  >>> print(features("floatfeature").shape)
+  ... (2,)
+  >>> print(features("floatfeature").dtype)
+  ... <dtype: 'float64'>
+  ```
+
+  ### Conversion from and to Tensor and Dataset
+
+  When needed, an `IOTensor` can be converted into a `Tensor` (through
+  `to_tensor()`) or a `tf.data.Dataset` (through `to_dataset()`), to
+  support operations that are only available through `Tensor` or
+  `tf.data.Dataset`.
+
+  Example:
+
+  ```python
+  >>> import tensorflow as tf
+  >>> import tensorflow_io as tfio
+  >>>
+  >>> features = tfio.IOTensor.from_json("feature.json")
+  >>>
+  >>> features_tensor = features.to_tensor()
+  >>> print(features_tensor())
+  ... (<tf.Tensor: id=21, shape=(2,), dtype=float64, numpy=array([1.1, 2.1])>, <tf.Tensor: id=22, shape=(2,), dtype=int64, numpy=array([2, 3])>)
+  >>>
+  >>> features_dataset = features.to_dataset()
+  >>> print(features_dataset)
+  ... <_IOTensorDataset shapes: ((), ()), types: (tf.float64, tf.int64)>
+  >>>
+  >>> # `labels_dataset` is assumed to be a tf.data.Dataset of matching labels.
+  >>> dataset = tf.data.Dataset.zip((features_dataset, labels_dataset))
+  ```
+
+  """
 
   #=============================================================================
   # Constructor (private)