Commit 0fab7b1

Author: Sreesh Maheshwar
Commit message: Merge branch 'main' into try-fix-keyword-thing
Parents: 9edb166, ad8263b

30 files changed: +1707 −862 lines

.github/workflows/pypi-build-artifacts.yml (1 addition, 1 deletion)

```diff
@@ -62,7 +62,7 @@ jobs:
         if: startsWith(matrix.os, 'ubuntu')
 
       - name: Build wheels
-        uses: pypa/[email protected].0
+        uses: pypa/[email protected].1
         with:
           output-dir: wheelhouse
           config-file: "pyproject.toml"
```
.github/workflows/svn-build-artifacts.yml (1 addition, 1 deletion)

```diff
@@ -57,7 +57,7 @@ jobs:
         if: startsWith(matrix.os, 'ubuntu')
 
       - name: Build wheels
-        uses: pypa/[email protected].0
+        uses: pypa/[email protected].1
         with:
           output-dir: wheelhouse
           config-file: "pyproject.toml"
```

dev/Dockerfile (2 additions, 2 deletions)

```diff
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-FROM python:3.9-bullseye
+FROM python:3.12-bullseye
 
 RUN apt-get -qq update && \
     apt-get -qq install -y --no-install-recommends \
@@ -63,7 +63,7 @@ RUN chmod u+x /opt/spark/sbin/* && \
 
 RUN pip3 install -q ipython
 
-RUN pip3 install "pyiceberg[s3fs,hive]==${PYICEBERG_VERSION}"
+RUN pip3 install "pyiceberg[s3fs,hive,pyarrow]==${PYICEBERG_VERSION}"
 
 COPY entrypoint.sh .
 COPY provision.py .
```

mkdocs/docs/api.md (47 additions, 1 deletion)

````diff
@@ -1523,9 +1523,55 @@ print(ray_dataset.take(2))
 ]
 ```
 
+### Bodo
+
+PyIceberg interfaces closely with Bodo Dataframes (see [Bodo Iceberg Quick Start](https://docs.bodo.ai/latest/quick_start/quickstart_local_iceberg/)),
+which provides a drop-in replacement for Pandas that applies query, compiler and HPC optimizations automatically.
+Bodo accelerates and scales Python code from single laptops to large clusters without code rewrites.
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+    This requires [`bodo` to be installed](index.md).
+
+    ```python
+    pip install pyiceberg['bodo']
+    ```
+<!-- prettier-ignore-end -->
+
+A table can be read easily into a Bodo Dataframe to perform Pandas operations:
+
+```python
+df = table.to_bodo()  # equivalent to `bodo.pandas.read_iceberg_table(table)`
+df = df[df["trip_distance"] >= 10.0]
+df = df[["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]]
+print(df)
+```
+
+This creates a lazy query, optimizes it, and runs it on all available cores (print triggers execution):
+
+```python
+        VendorID tpep_pickup_datetime tpep_dropoff_datetime
+0              2  2023-01-01 00:27:12   2023-01-01 00:49:56
+1              2  2023-01-01 00:09:29   2023-01-01 00:29:23
+2              1  2023-01-01 00:13:30   2023-01-01 00:44:00
+3              2  2023-01-01 00:41:41   2023-01-01 01:19:32
+4              2  2023-01-01 00:22:39   2023-01-01 01:30:45
+...          ...                  ...                   ...
+245478         2  2023-01-31 22:32:57   2023-01-31 23:01:48
+245479         2  2023-01-31 22:03:26   2023-01-31 22:46:13
+245480         2  2023-01-31 23:25:56   2023-02-01 00:05:42
+245481         2  2023-01-31 23:18:00   2023-01-31 23:46:00
+245482         2  2023-01-31 23:18:00   2023-01-31 23:41:00
+
+[245483 rows x 3 columns]
+```
+
+Bodo is optimized to take advantage of Iceberg features such as hidden partitioning and various statistics for efficient reads.
+
 ### Daft
 
-PyIceberg interfaces closely with Daft Dataframes (see also: [Daft integration with Iceberg](https://www.getdaft.io/projects/docs/en/stable/integrations/iceberg/)) which provides a full lazily optimized query engine interface on top of PyIceberg tables.
+PyIceberg interfaces closely with Daft Dataframes (see also: [Daft integration with Iceberg](https://docs.daft.ai/en/stable/io/iceberg/)) which provides a full lazily optimized query engine interface on top of PyIceberg tables.
 
 <!-- prettier-ignore-start -->
````
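Since Bodo is billed as a drop-in replacement for Pandas, the filter-and-projection query added in the docs above can be prototyped eagerly with plain pandas before switching to `table.to_bodo()`. A minimal sketch with invented sample rows (the column names follow the taxi-trip example in the diff; nothing here is real data):

```python
import pandas as pd

# Invented sample rows mirroring the taxi-trip columns used in the docs.
trips = pd.DataFrame(
    {
        "VendorID": [2, 1, 2],
        "trip_distance": [3.2, 12.5, 18.0],
        "tpep_pickup_datetime": [
            "2023-01-01 00:27:12",
            "2023-01-01 00:13:30",
            "2023-01-01 00:41:41",
        ],
        "tpep_dropoff_datetime": [
            "2023-01-01 00:49:56",
            "2023-01-01 00:44:00",
            "2023-01-01 01:19:32",
        ],
    }
)

# Same query shape as the Bodo example: keep long trips, project three columns.
df = trips[trips["trip_distance"] >= 10.0]
df = df[["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"]]
print(df)
```

The same chain of operations on a Bodo dataframe builds a lazy plan and only executes (in parallel) when the result is consumed; pandas runs each step eagerly.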

mkdocs/docs/configuration.md (90 additions, 18 deletions)

````diff
@@ -339,40 +339,111 @@ catalog:
 
 | Key | Example | Description |
 | ------------------- | -------------------------------- | -------------------------------------------------------------------------------------------------- |
-| uri | <https://rest-catalog/ws> | URI identifying the REST Server |
-| ugi | t-1234:secret | Hadoop UGI for Hive client. |
-| credential | t-1234:secret | Credential to use for OAuth2 credential flow when initializing the catalog |
-| token | FEW23.DFSDF.FSDF | Bearer token value to use for `Authorization` header |
+| uri | <https://rest-catalog/ws> | URI identifying the REST Server |
+| warehouse | myWarehouse | Warehouse location or identifier to request from the catalog service. May be used to determine server-side overrides, such as the warehouse location. |
+| snapshot-loading-mode | refs | The snapshots to return in the body of the metadata. Setting the value to `all` would return the full set of snapshots currently valid for the table. Setting the value to `refs` would load all snapshots referenced by branches or tags. |
+| `header.X-Iceberg-Access-Delegation` | `vended-credentials` | Signal to the server that the client supports delegated access via a comma-separated list of access mechanisms. The server may choose to supply access via any or none of the requested mechanisms. When using `vended-credentials`, the server provides temporary credentials to the client. When using `remote-signing`, the server signs requests on behalf of the client. (default: `vended-credentials`) |
+
+#### Headers in REST Catalog
+
+To configure custom headers in REST Catalog, include them in the catalog properties with `header.<Header-Name>`. This
+ensures that all HTTP requests to the REST service include the specified headers.
+
+```yaml
+catalog:
+  default:
+    uri: http://rest-catalog/ws/
+    credential: t-1234:secret
+    header.content-type: application/vnd.api+json
+```
+
+#### Authentication Options
+
+##### OAuth2
+
+| Key | Example | Description |
+| ------------------- | -------------------------------- | -------------------------------------------------------------------------------------------------- |
+| oauth2-server-uri | <https://auth-service/cc> | Authentication URL to use for client credentials authentication (default: uri + 'v1/oauth/tokens') |
+| token | FEW23.DFSDF.FSDF | Bearer token value to use for `Authorization` header |
+| credential | client_id:client_secret | Credential to use for OAuth2 credential flow when initializing the catalog |
 | scope | openid offline corpds:ds:profile | Desired scope of the requested security token (default: catalog) |
 | resource | rest_catalog.iceberg.com | URI for the target resource or service |
 | audience | rest_catalog | Logical name of target resource or service |
+
+##### SigV4
+
+| Key | Example | Description |
+| ------------------- | -------------------------------- | -------------------------------------------------------------------------------------------------- |
 | rest.sigv4-enabled | true | Sign requests to the REST Server using AWS SigV4 protocol |
 | rest.signing-region | us-east-1 | The region to use when SigV4 signing a request |
 | rest.signing-name | execute-api | The service signing name to use when SigV4 signing a request |
-| oauth2-server-uri | <https://auth-service/cc> | Authentication URL to use for client credentials authentication (default: uri + 'v1/oauth/tokens') |
-| snapshot-loading-mode | refs | The snapshots to return in the body of the metadata. Setting the value to `all` would return the full set of snapshots currently valid for the table. Setting the value to `refs` would load all snapshots referenced by branches or tags. |
-| warehouse | myWarehouse | Warehouse location or identifier to request from the catalog service. May be used to determine server-side overrides, such as the warehouse location. |
 
 <!-- markdown-link-check-enable-->
 
-#### Headers in RESTCatalog
+#### Common Integrations & Examples
 
-To configure custom headers in RESTCatalog, include them in the catalog properties with the prefix `header.`. This
-ensures that all HTTP requests to the REST service include the specified headers.
+##### AWS Glue
 
 ```yaml
 catalog:
-  default:
-    uri: http://rest-catalog/ws/
-    credential: t-1234:secret
-    header.content-type: application/vnd.api+json
+  s3_tables_catalog:
+    type: rest
+    uri: https://glue.<region>.amazonaws.com/iceberg
+    warehouse: <account-id>:s3tablescatalog/<table-bucket-name>
+    rest.sigv4-enabled: true
+    rest.signing-name: glue
+    rest.signing-region: <region>
+```
+
+##### Unity Catalog
+
+```yaml
+catalog:
+  unity_catalog:
+    type: rest
+    uri: https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest
+    warehouse: <uc-catalog-name>
+    token: <databricks-pat-token>
+```
+
+##### R2 Data Catalog
+
+```yaml
+catalog:
+  r2_catalog:
+    type: rest
+    uri: <r2-catalog-uri>
+    warehouse: <r2-warehouse-name>
+    token: <r2-token>
 ```
 
-Specific headers defined by the RESTCatalog spec include:
+##### Lakekeeper
+
+```yaml
+catalog:
+  lakekeeper_catalog:
+    type: rest
+    uri: <lakekeeper-catalog-uri>
+    warehouse: <lakekeeper-warehouse-name>
+    credential: <client-id>:<client-secret>
+    oauth2-server-uri: http://localhost:30080/realms/<keycloak-realm-name>/protocol/openid-connect/token
+    scope: lakekeeper
+```
 
-| Key | Options | Default | Description |
-| ------------------------------------ | ------------------------------------- | -------------------- | -------------------------------------------------------------------------------------------------- |
-| `header.X-Iceberg-Access-Delegation` | `{vended-credentials,remote-signing}` | `vended-credentials` | Signal to the server that the client supports delegated access via a comma-separated list of access mechanisms. The server may choose to supply access via any or none of the requested mechanisms |
+##### Apache Polaris
+
+```yaml
+catalog:
+  polaris_catalog:
+    type: rest
+    uri: https://<account>.snowflakecomputing.com/polaris/api/catalog
+    warehouse: <polaris-catalog-name>
+    credential: <client-id>:<client-secret>
+    header.X-Iceberg-Access-Delegation: vended-credentials
+    scope: PRINCIPAL_ROLE:ALL
+    token-refresh-enabled: true
+    py-io-impl: pyiceberg.io.fsspec.FsspecFileIO
+```
 
 ### SQL Catalog
 
@@ -444,6 +515,7 @@ catalog:
 | hive.hive2-compatible | true | Using Hive 2.x compatibility mode |
 | hive.kerberos-authentication | true | Using authentication via Kerberos |
 | hive.kerberos-service-name | hive | Kerberos service name (default hive) |
+| ugi | t-1234:secret | Hadoop UGI for Hive client. |
 
 When using Hive 2.x, make sure to set the compatibility flag:
````
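To make the `header.<Header-Name>` convention documented above concrete, here is a small illustrative helper (this is a sketch, not pyiceberg's actual implementation): every catalog property whose key carries the `header.` prefix becomes an HTTP header on each REST request.

```python
def extract_headers(properties: dict) -> dict:
    """Illustrative only: pull `header.`-prefixed catalog properties out as
    HTTP headers, mirroring the documented `header.<Header-Name>` convention."""
    prefix = "header."
    return {
        key[len(prefix):]: value
        for key, value in properties.items()
        if key.startswith(prefix)
    }


# The same properties as the YAML example under "Headers in REST Catalog".
props = {
    "uri": "http://rest-catalog/ws/",
    "credential": "t-1234:secret",
    "header.content-type": "application/vnd.api+json",
}
print(extract_headers(props))  # {'content-type': 'application/vnd.api+json'}
```

Non-`header.` keys such as `uri` and `credential` are left untouched; only the prefixed entries are forwarded as headers.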

mkdocs/docs/index.md (1 addition, 0 deletions)

```diff
@@ -52,6 +52,7 @@ You can mix and match optional dependencies depending on your needs:
 | pandas | Installs both PyArrow and Pandas |
 | duckdb | Installs both PyArrow and DuckDB |
 | ray | Installs PyArrow, Pandas, and Ray |
+| bodo | Installs Bodo |
 | daft | Installs Daft |
 | polars | Installs Polars |
 | s3fs | S3FS as a FileIO implementation to interact with the object store |
```
