
Commit 2836c4a

Small getting started guide on writes (#311)
1 parent 4f62d43

File tree: 5 files changed, +159 −30 lines

mkdocs/docs/SUMMARY.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 
 <!-- prettier-ignore-start -->
 
-- [Home](index.md)
+- [Getting started](index.md)
 - [Configuration](configuration.md)
 - [CLI](cli.md)
 - [API](api.md)

mkdocs/docs/contributing.md

Lines changed: 16 additions & 0 deletions
@@ -58,6 +58,22 @@ For IDEA ≤2021 you need to install the [Poetry integration as a plugin](https:
 
 Now you're set using Poetry, and all the tests will run in Poetry, and you'll have syntax highlighting in the pyproject.toml to indicate stale dependencies.
 
+## Installation from source
+
+Clone the repository for local development:
+
+```sh
+git clone https://github.com/apache/iceberg-python.git
+cd iceberg-python
+pip3 install -e ".[s3fs,hive]"
+```
+
+You can also install it directly from GitHub (not recommended, but sometimes handy):
+
+```
+pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+```
+
 ## Linting
 
 `pre-commit` is used for autoformatting and linting:

mkdocs/docs/index.md

Lines changed: 117 additions & 27 deletions
@@ -20,11 +20,11 @@ hide:
   - limitations under the License.
 -->
 
-# PyIceberg
+# Getting started with PyIceberg
 
 PyIceberg is a Python implementation for accessing Iceberg tables, without the need of a JVM.
 
-## Install
+## Installation
 
 Before installing PyIceberg, make sure that you're on an up-to-date version of `pip`:
 
@@ -38,36 +38,126 @@ You can install the latest release version from pypi:
 pip install "pyiceberg[s3fs,hive]"
 ```
 
-Install it directly for GitHub (not recommended), but sometimes handy:
+You can mix and match optional dependencies depending on your needs:
+
+| Key          | Description                                                          |
+| ------------ | -------------------------------------------------------------------- |
+| hive         | Support for the Hive metastore                                        |
+| glue         | Support for AWS Glue                                                  |
+| dynamodb     | Support for AWS DynamoDB                                              |
+| sql-postgres | Support for a SQL catalog backed by PostgreSQL                        |
+| sql-sqlite   | Support for a SQL catalog backed by SQLite                            |
+| pyarrow      | PyArrow as a FileIO implementation to interact with the object store  |
+| pandas       | Installs both PyArrow and Pandas                                      |
+| duckdb       | Installs both PyArrow and DuckDB                                      |
+| ray          | Installs PyArrow, Pandas, and Ray                                     |
+| s3fs         | S3FS as a FileIO implementation to interact with the object store     |
+| adlfs        | ADLFS as a FileIO implementation to interact with the object store    |
+| snappy       | Support for Snappy Avro compression                                   |
+| gcs          | GCS as the FileIO implementation to interact with the object store    |
+
+You need to install at least one of `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store.
+
+## Connecting to a catalog
+
+Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page for all the configuration details.
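As an illustration of what such a configuration can look like, a catalog named `default` can be defined in `~/.pyiceberg.yaml`; the REST URI and S3 credentials below are placeholders for your own deployment, not values from this commit:

```yaml
# Hypothetical example: a REST catalog named "default" with a local
# S3-compatible object store. Replace the endpoints and credentials
# with the ones for your environment.
catalog:
  default:
    uri: http://localhost:8181/
    s3.endpoint: http://localhost:9000
    s3.access-key-id: admin
    s3.secret-access-key: password
```

With this file in place, `load_catalog("default")` resolves the catalog by name from the configuration.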
+
+## Write a PyArrow dataframe
+
+Let's take the Taxi dataset, and write this to an Iceberg table.
+
+First download one month of data:
+
+```shell
+curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
+```
+
+Load it into your PyArrow dataframe:
+
+```python
+import pyarrow.parquet as pq
 
+df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
 ```
-pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+
+Create a new Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog("default")
+
+table = catalog.create_table(
+    "default.taxi_dataset",
+    schema=df.schema,
+)
 ```
 
-Or clone the repository for local development:
+Append the dataframe to the table:
 
-```sh
-git clone https://github.com/apache/iceberg-python.git
-cd iceberg-python
-pip3 install -e ".[s3fs,hive]"
+```python
+table.append(df)
+len(table.scan().to_arrow())
 ```
 
-You can mix and match optional dependencies depending on your needs:
+3066766 rows have been written to the table.
+
+Now generate a tip-per-mile feature to train the model on:
+
+```python
+import pyarrow.compute as pc
+
+df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))
+```
+
+Evolve the schema of the table with the new column:
+
+```python
+with table.update_schema() as update_schema:
+    update_schema.union_by_name(df.schema)
+```
+
+And now we can write the new dataframe to the Iceberg table:
+
+```python
+table.overwrite(df)
+print(table.scan().to_arrow())
+```
+
+And the new column is there:
+
+```
+taxi_dataset(
+  1: VendorID: optional long,
+  2: tpep_pickup_datetime: optional timestamp,
+  3: tpep_dropoff_datetime: optional timestamp,
+  4: passenger_count: optional double,
+  5: trip_distance: optional double,
+  6: RatecodeID: optional double,
+  7: store_and_fwd_flag: optional string,
+  8: PULocationID: optional long,
+  9: DOLocationID: optional long,
+  10: payment_type: optional long,
+  11: fare_amount: optional double,
+  12: extra: optional double,
+  13: mta_tax: optional double,
+  14: tip_amount: optional double,
+  15: tolls_amount: optional double,
+  16: improvement_surcharge: optional double,
+  17: total_amount: optional double,
+  18: congestion_surcharge: optional double,
+  19: airport_fee: optional double,
+  20: tip_per_mile: optional double
+),
+```
+
+And we can see that 2371784 rows have a tip-per-mile:
+
+```python
+df = table.scan(row_filter="tip_per_mile > 0").to_arrow()
+len(df)
+```
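The `tip_per_mile > 0` predicate keeps only rows where the value is present and strictly positive, which is why the filtered count is lower than the table's total row count. A plain-Python sketch of that filter semantics (an illustration with made-up rows, not PyIceberg's actual expression evaluator):

```python
# Keep rows whose tip_per_mile is present and strictly positive.
# A null value (e.g. produced by a zero or missing trip_distance)
# is dropped by the predicate.
rows = [
    {"tip_per_mile": 2.5},
    {"tip_per_mile": 0.0},
    {"tip_per_mile": None},
]
kept = [r for r in rows if r["tip_per_mile"] is not None and r["tip_per_mile"] > 0]
print(len(kept))  # 1
```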
+
+## More details
 
-| Key | Description: |
-| -------- | -------------------------------------------------------------------- |
-| hive | Support for the Hive metastore |
-| glue | Support for AWS Glue |
-| dynamodb | Support for AWS DynamoDB |
-| pyarrow | PyArrow as a FileIO implementation to interact with the object store |
-| pandas | Installs both PyArrow and Pandas |
-| duckdb | Installs both PyArrow and DuckDB |
-| ray | Installs PyArrow, Pandas, and Ray |
-| s3fs | S3FS as a FileIO implementation to interact with the object store |
-| adlfs | ADLFS as a FileIO implementation to interact with the object store |
-| snappy | Support for snappy Avro compression |
-| gcs | GCS as the FileIO implementation to interact with the object store |
-
-You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` for fetching files.
-
-There is both a [CLI](cli.md) and [Python API](api.md) available.
+For the details, please check the [CLI](cli.md) or [Python API](api.md) page.

pyiceberg/table/__init__.py

Lines changed: 4 additions & 2 deletions
@@ -1477,9 +1477,11 @@ def case_sensitive(self, case_sensitive: bool) -> UpdateSchema:
         self._case_sensitive = case_sensitive
         return self
 
-    def union_by_name(self, new_schema: Schema) -> UpdateSchema:
+    def union_by_name(self, new_schema: Union[Schema, "pa.Schema"]) -> UpdateSchema:
+        from pyiceberg.catalog import Catalog
+
         visit_with_partner(
-            new_schema,
+            Catalog._convert_schema_if_needed(new_schema),
             -1,
             UnionByNameVisitor(update_schema=self, existing_schema=self._schema, case_sensitive=self._case_sensitive),  # type: ignore
             PartnerIdByNameAccessor(partner_schema=self._schema, case_sensitive=self._case_sensitive),
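With this change, `union_by_name` accepts either an Iceberg `Schema` or a raw `pa.Schema`, converting the latter before visiting. The core union-by-name idea can be sketched in plain Python (a simplified illustration over flat `(name, type)` tuples, not the real `UnionByNameVisitor`, which also recurses into nested types and handles type promotion):

```python
# Simplified sketch of union-by-name over flat (name, type) schemas:
# fields from the new schema that are missing from the existing one
# are appended, preserving the existing fields and their order.
def union_by_name(existing, new):
    result = list(existing)
    existing_names = {name for name, _ in existing}
    for name, typ in new:
        if name not in existing_names:
            result.append((name, typ))
    return result

base = [("foo", "string")]
incoming = [("foo", "string"), ("bar", "int"), ("baz", "boolean")]
print(union_by_name(base, incoming))
# [('foo', 'string'), ('bar', 'int'), ('baz', 'boolean')]
```

This mirrors the shape of the new test below: `foo` already exists and keeps its field, while `bar` and `baz` are appended with fresh field ids.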

tests/test_schema.py

Lines changed: 21 additions & 0 deletions
@@ -18,6 +18,7 @@
 from textwrap import dedent
 from typing import Any, Dict, List
 
+import pyarrow as pa
 import pytest
 
 from pyiceberg import schema
@@ -1579,3 +1580,23 @@ def test_append_nested_lists() -> None:
     )
 
     assert union.as_struct() == expected.as_struct()
+
+
+def test_union_with_pa_schema(primitive_fields: NestedField) -> None:
+    base_schema = Schema(NestedField(field_id=1, name="foo", field_type=StringType(), required=True))
+
+    pa_schema = pa.schema([
+        pa.field("foo", pa.string(), nullable=False),
+        pa.field("bar", pa.int32(), nullable=True),
+        pa.field("baz", pa.bool_(), nullable=True),
+    ])
+
+    new_schema = UpdateSchema(None, schema=base_schema).union_by_name(pa_schema)._apply()
+
+    expected_schema = Schema(
+        NestedField(field_id=1, name="foo", field_type=StringType(), required=True),
+        NestedField(field_id=2, name="bar", field_type=IntegerType(), required=False),
+        NestedField(field_id=3, name="baz", field_type=BooleanType(), required=False),
+    )
+
+    assert new_schema == expected_schema
