Commit d82f8f7

Add Alibaba OSS protocol to PyArrowFileIO (#1392)
* Added force virtual addressing configuration for S3. Also added oss and r2 protocol.
* Rewrote force virtual addressing as written in PyArrow documentation
* Added the missing r2 key value in schema_to_file_io
* Removed R2 protocol for now
* Linter fix
* Updated documentation for OSS support
* Another linter fix
1 parent 2c972fa commit d82f8f7

3 files changed: 25 additions, 1 deletion

mkdocs/docs/configuration.md

Lines changed: 18 additions & 0 deletions
@@ -88,6 +88,7 @@ Iceberg works with the concept of a FileIO which is a pluggable module for readi
 - **file**: `PyArrowFileIO`
 - **hdfs**: `PyArrowFileIO`
 - **abfs**, **abfss**: `FsspecFileIO`
+- **oss**: `PyArrowFileIO`

 You can also set the FileIO explicitly:

@@ -115,6 +116,7 @@ For the FileIO there are several configuration options available:
 | s3.region | us-west-2 | Sets the region of the bucket |
 | s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
 | s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
+| s3.force-virtual-addressing | False | Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access. |

 <!-- markdown-link-check-enable-->

@@ -167,6 +169,22 @@ For the FileIO there are several configuration options available:

 <!-- markdown-link-check-enable-->

+### Alibaba Cloud Object Storage Service (OSS)
+
+<!-- markdown-link-check-disable -->
+
+PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html) class to connect to OSS bucket as the service is [compatible with S3 SDK](https://www.alibabacloud.com/help/en/oss/developer-reference/use-amazon-s3-sdks-to-access-oss) as long as the endpoint is addressed with virtual hosted style.
+
+| Key | Example | Description |
+| -------------------- | ------------------- | ------------------------------------------------ |
+| s3.endpoint | <https://s3.oss-your-bucket-region.aliyuncs.com/> | Configure an endpoint of the OSS service for the FileIO to access. Be sure to use S3 compatible endpoint as given in the example. |
+| s3.access-key-id | admin | Configure the static access key id used to access the FileIO. |
+| s3.secret-access-key | password | Configure the static secret access key used to access the FileIO. |
+| s3.session-token | AQoDYXdzEJr... | Configure the static session token used to access the FileIO. |
+| s3.force-virtual-addressing | True | Whether to use virtual addressing of buckets. This must be set to True as OSS can only be accessed with virtual hosted style address. |
+
+<!-- markdown-link-check-enable-->
+
 ### PyArrow

 <!-- markdown-link-check-disable -->
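
The OSS table above can be turned into a small properties builder. This is a minimal sketch using only the standard library: the helper names `oss_properties` and `virtual_hosted_host`, the region id `cn-hangzhou`, and the bucket name are illustrative assumptions, and the endpoint pattern is taken from the example in the table rather than verified against every OSS region.

```python
from urllib.parse import urlsplit


def oss_properties(region: str, access_key_id: str, secret_access_key: str) -> dict:
    """Build FileIO properties for an OSS bucket in `region` (hypothetical helper)."""
    return {
        # S3-compatible OSS endpoint, following the pattern shown in the table.
        "s3.endpoint": f"https://s3.oss-{region}.aliyuncs.com",
        "s3.access-key-id": access_key_id,
        "s3.secret-access-key": secret_access_key,
        # The documentation states OSS requires virtual-hosted-style addressing.
        "s3.force-virtual-addressing": "True",
    }


def virtual_hosted_host(endpoint: str, bucket: str) -> str:
    """Return the host a virtual-hosted-style request for `bucket` is sent to."""
    # Virtual-hosted style puts the bucket in front of the endpoint host,
    # instead of using it as the first path segment (path style).
    return f"{bucket}.{urlsplit(endpoint).netloc}"


props = oss_properties("cn-hangzhou", "admin", "password")
print(props["s3.endpoint"])
print(virtual_hosted_host(props["s3.endpoint"], "my-bucket"))
```

A dictionary like `props` could then be supplied as catalog properties or passed to `PyArrowFileIO(properties=props)`, assuming the catalog location uses the `oss://` scheme.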

pyiceberg/io/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,7 @@
 S3_SIGNER_ENDPOINT_DEFAULT = "v1/aws/s3/sign"
 S3_ROLE_ARN = "s3.role-arn"
 S3_ROLE_SESSION_NAME = "s3.role-session-name"
+S3_FORCE_VIRTUAL_ADDRESSING = "s3.force-virtual-addressing"
 HDFS_HOST = "hdfs.host"
 HDFS_PORT = "hdfs.port"
 HDFS_USER = "hdfs.user"
@@ -304,6 +305,7 @@ def delete(self, location: Union[str, InputFile, OutputFile]) -> None:
     "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
     "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
     "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
+    "oss": [ARROW_FILE_IO],
     "gs": [ARROW_FILE_IO],
     "file": [ARROW_FILE_IO, FSSPEC_FILE_IO],
     "hdfs": [ARROW_FILE_IO],
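
The mapping in this hunk ties a URI scheme to the FileIO implementations that may serve it. The sketch below shows how such a lookup could work; the names `SCHEME_TO_FILE_IO` and `candidate_file_ios` and the string values of the two constants are assumptions for illustration, and only the schemes visible in the hunk are included.

```python
from urllib.parse import urlparse

# Assumed values; the real constants are defined in pyiceberg/io/__init__.py.
ARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"
FSSPEC_FILE_IO = "pyiceberg.io.fsspec.FsspecFileIO"

# Subset of the scheme mapping shown in the hunk above.
SCHEME_TO_FILE_IO = {
    "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "oss": [ARROW_FILE_IO],
    "gs": [ARROW_FILE_IO],
    "file": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "hdfs": [ARROW_FILE_IO],
}


def candidate_file_ios(location: str) -> list:
    """Return the FileIO implementations registered for a location's scheme."""
    scheme = urlparse(location).scheme
    # Unknown schemes yield no candidates rather than raising.
    return SCHEME_TO_FILE_IO.get(scheme, [])


print(candidate_file_ios("oss://my-bucket/warehouse/table"))
```

Because `oss` lists only the PyArrow implementation, an `oss://` location has a single candidate in this sketch, unlike `s3://`, which has two.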

pyiceberg/io/pyarrow.py

Lines changed: 5 additions & 1 deletion
@@ -102,6 +102,7 @@
     S3_ACCESS_KEY_ID,
     S3_CONNECT_TIMEOUT,
     S3_ENDPOINT,
+    S3_FORCE_VIRTUAL_ADDRESSING,
     S3_PROXY_URI,
     S3_REGION,
     S3_ROLE_ARN,
@@ -350,7 +351,7 @@ def parse_location(location: str) -> Tuple[str, str, str]:
             return uri.scheme, uri.netloc, f"{uri.netloc}{uri.path}"

     def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
-        if scheme in {"s3", "s3a", "s3n"}:
+        if scheme in {"s3", "s3a", "s3n", "oss"}:
             from pyarrow.fs import S3FileSystem

             client_kwargs: Dict[str, Any] = {
@@ -373,6 +374,9 @@ def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSyste
             if session_name := get_first_property_value(self.properties, S3_ROLE_SESSION_NAME, AWS_ROLE_SESSION_NAME):
                 client_kwargs["session_name"] = session_name

+            if force_virtual_addressing := self.properties.get(S3_FORCE_VIRTUAL_ADDRESSING):
+                client_kwargs["force_virtual_addressing"] = property_as_bool(self.properties, force_virtual_addressing, False)
+
             return S3FileSystem(**client_kwargs)
         elif scheme in ("hdfs", "viewfs"):
             from pyarrow.fs import HadoopFileSystem
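
The hunk above forwards the string property `s3.force-virtual-addressing` to the `force_virtual_addressing` argument of `S3FileSystem` as a boolean. Below is a standalone sketch of that translation, with `as_bool` and `s3_client_kwargs` as hypothetical stand-ins rather than PyIceberg functions. The sketch looks the flag up by its key name. If `property_as_bool` likewise expects a property name as its second argument (an assumption about its signature), passing the retrieved value, as the hunk does, would fall back to the default.

```python
from typing import Any, Dict

S3_FORCE_VIRTUAL_ADDRESSING = "s3.force-virtual-addressing"

_TRUE = {"true", "t", "yes", "y", "on", "1"}
_FALSE = {"false", "f", "no", "n", "off", "0"}


def as_bool(properties: Dict[str, str], key: str, default: bool) -> bool:
    """Parse properties[key] as a boolean (hypothetical stand-in helper)."""
    value = properties.get(key)
    if value is None:
        return default
    normalized = str(value).strip().lower()
    if normalized in _TRUE:
        return True
    if normalized in _FALSE:
        return False
    raise ValueError(f"Could not parse {key}={value!r} as a boolean")


def s3_client_kwargs(properties: Dict[str, str]) -> Dict[str, Any]:
    """Collect keyword arguments for an S3-style filesystem client."""
    kwargs: Dict[str, Any] = {}
    # Only set the flag when the user configured it, so the client's own
    # default applies otherwise.
    if S3_FORCE_VIRTUAL_ADDRESSING in properties:
        kwargs["force_virtual_addressing"] = as_bool(properties, S3_FORCE_VIRTUAL_ADDRESSING, False)
    return kwargs


print(s3_client_kwargs({"s3.force-virtual-addressing": "True"}))
```

Leaving the key out of `kwargs` when the property is absent keeps the client's documented default, where virtual addressing is enabled only if no endpoint override is set.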
