
Commit 1642da1

Rework on ParquetDataset for easy access and better cache size in eager mode (#384)
* Rework on ParquetDataset for easy access and better cache size in eager mode

This fix is part of the effort to improve the overall Dataset for easy access and better cache size in eager mode. See 382 and 366 for related discussions.

In order to read a file either by filename or from memory, this PR adds a SizedRandomAccessFile which accepts an optional memory buffer as the file content. This could be useful when processing compressed files or archives, where we could simply read the uncompressed file content into memory.

The previous limitation in Dataset was that a Dataset is an iterable, so the sequence length is unknown until graph runtime. In this PR we provide a helper function that reads the specs of a parquet file, so the length is known up front. This also opens other avenues, such as mapping a parquet file with __getitem__ and __len__. Further, a parquet file could be read into a Tensor and processed easily (with a pandas-like API, for example). The read_parquet_specs approach could similarly be applied to HDF5, where it matters even more: an HDF5 file could contain datasets with different sizes.

Summary:
1) Two basic C++ kernel ops are implemented: read_parquet_specs and read_parquet.
2) One ParquetDataset that is a Python-only implementation (no C++ anymore).
3) ParquetDataset supports eager and graph mode. In graph mode, dtype and shape are provided explicitly by the user; in eager mode, only the column name is needed.
4) read_parquet works in eager and graph mode, and can read records either in full or in slices.
5) read_parquet_specs works in eager mode only (a limitation).

For cache batch vs. batch in tf.keras:
1) Added a hidden `capacity` to adjust the cache batch size.
2) The batch size passed to tf.keras is unrelated to `capacity`, but we could use `rebatch` to change it at the end of the pipeline.
3) `capacity` could be padded so that `rebatch` only cuts a slice out of one chunk. If it is not padded to the `batch_size` used in tf.keras, `rebatch` will likely copy across chunk boundaries.

Signed-off-by: Yong Tang <[email protected]>

* Fix build failures

Signed-off-by: Yong Tang <[email protected]>

* Rename read_parquet_columns => list_parquet_columns

Signed-off-by: Yong Tang <[email protected]>

* Remove batch args, and add test in graph mode

Signed-off-by: Yong Tang <[email protected]>
1 parent 16169c1 commit 1642da1
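
To make the reworked surface concrete, below is a minimal usage sketch that follows the summary above. The exact call signatures, the sample filename test.parquet, and the column name "x" are illustrative assumptions, not taken from this commit's diff:

    import tensorflow as tf
    import tensorflow_io.parquet as parquet

    # Hypothetical sample file and column name, used only for illustration.
    filename = "test.parquet"

    # Summary item 5 -- eager mode only: inspect the file up front so the
    # column specs (and hence the sequence length) are known before any
    # graph runs. Assumed here to return a mapping from column name to spec.
    columns = parquet.list_parquet_columns(filename)
    for name in columns:
      print(name, columns[name])

    # Summary item 4: read one column into a Tensor, in full here; the
    # notes above say slices of records are also supported. Whether the
    # second argument is the spec or the bare column name is an assumption.
    values = parquet.read_parquet(filename, columns["x"])

    # Summary item 3, eager mode: only the column name is needed, since
    # the spec is resolved by inspecting the file.
    dataset = parquet.ParquetDataset(filename, "x")
    for batch in dataset:
      print(batch)

    # Summary item 3, graph mode (sketch only): dtype and shape would be
    # supplied explicitly by the user, e.g. something along the lines of
    #   parquet.ParquetDataset(filename, "x", dtype=tf.int64, shape=[None])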

File tree

9 files changed: 504 additions & 507 deletions


tensorflow_io/core/BUILD

Lines changed: 1 addition & 0 deletions
@@ -135,6 +135,7 @@ cc_binary(
         "//tensorflow_io/json:json_ops",
         "//tensorflow_io/lmdb:lmdb_ops",
         "//tensorflow_io/mnist:mnist_ops",
+        "//tensorflow_io/parquet:parquet_ops",
         "//tensorflow_io/prometheus:prometheus_ops",
         "//tensorflow_io/text:text_ops",
         "@libarchive",

tensorflow_io/parquet/BUILD

Lines changed: 4 additions & 6 deletions
@@ -7,18 +7,16 @@ load(
     "tf_io_copts",
 )

-cc_binary(
-    name = "python/ops/_parquet_ops.so",
+cc_library(
+    name = "parquet_ops",
     srcs = [
-        "kernels/parquet_input.cc",
+        "kernels/parquet_kernels.cc",
         "ops/parquet_ops.cc",
     ],
     copts = tf_io_copts(),
-    linkshared = 1,
+    linkstatic = True,
     deps = [
         "//tensorflow_io/core:dataset_ops",
         "@arrow",
-        "@local_config_tf//:libtensorflow_framework",
-        "@local_config_tf//:tf_header_lib",
     ],
 )

tensorflow_io/parquet/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -15,18 +15,24 @@
 """Parquet Dataset.

 @@ParquetDataset
+@@read_parquet
+@@list_parquet_columns
 """

 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function

 from tensorflow_io.parquet.python.ops.parquet_ops import ParquetDataset
+from tensorflow_io.parquet.python.ops.parquet_ops import read_parquet
+from tensorflow_io.parquet.python.ops.parquet_ops import list_parquet_columns

 from tensorflow.python.util.all_util import remove_undocumented

 _allowed_symbols = [
     "ParquetDataset",
+    "read_parquet",
+    "list_parquet_columns",
 ]

 remove_undocumented(__name__, allowed_exception_list=_allowed_symbols)

tensorflow_io/parquet/kernels/parquet_input.cc

Lines changed: 0 additions & 315 deletions
This file was deleted.
