Small getting started guide on writes

Fokko · Fokko · commit 683b70007c32 · 2024-01-27T20:36:26.000+01:00
diff --git a/mkdocs/docs/SUMMARY.md b/mkdocs/docs/SUMMARY.md
@@ -17,7 +17,7 @@
 
 <!-- prettier-ignore-start -->
 
-- [Home](index.md)
+- [Getting started](index.md)
 - [Configuration](configuration.md)
 - [CLI](cli.md)
 - [API](api.md)
diff --git a/mkdocs/docs/contributing.md b/mkdocs/docs/contributing.md
@@ -58,6 +58,22 @@ For IDEA ≤2021 you need to install the [Poetry integration as a plugin](https:
 
 Now you're set using Poetry, and all the tests will run in Poetry, and you'll have syntax highlighting in the pyproject.toml to indicate stale dependencies.
 
+## Installation from source
+
+Clone the repository for local development:
+
+```sh
+git clone https://github.com/apache/iceberg-python.git
+cd iceberg-python
+pip3 install -e ".[s3fs,hive]"
+```
+
+Install it directly for GitHub (not recommended), but sometimes handy:
+
+```
+pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+```
+
 ## Linting
 
 `pre-commit` is used for autoformatting and linting:
diff --git a/mkdocs/docs/index.md b/mkdocs/docs/index.md
@@ -20,11 +20,11 @@ hide:
  - limitations under the License.
  -->
 
-# PyIceberg
+# Getting started with PyIceberg
 
 PyIceberg is a Python implementation for accessing Iceberg tables, without the need of a JVM.
 
-## Install
+## Installation
 
 Before installing PyIceberg, make sure that you're on an up-to-date version of `pip`:
 
@@ -38,36 +38,129 @@ You can install the latest release version from pypi:
 pip install "pyiceberg[s3fs,hive]"
 ```
 
-Install it directly for GitHub (not recommended), but sometimes handy:
+You can mix and match optional dependencies depending on your needs:
+
+| Key          | Description:                                                         |
+| ------------ | -------------------------------------------------------------------- |
+| hive         | Support for the Hive metastore                                       |
+| glue         | Support for AWS Glue                                                 |
+| dynamodb     | Support for AWS DynamoDB                                             |
+| sql-postgres | Support for SQL Catalog backed by Postgresql                         |
+| sql-sqlite   | Support for SQL Catalog backed by SQLite                             |
+| pyarrow      | PyArrow as a FileIO implementation to interact with the object store |
+| pandas       | Installs both PyArrow and Pandas                                     |
+| duckdb       | Installs both PyArrow and DuckDB                                     |
+| ray          | Installs PyArrow, Pandas, and Ray                                    |
+| s3fs         | S3FS as a FileIO implementation to interact with the object store    |
+| adlfs        | ADLFS as a FileIO implementation to interact with the object store   |
+| snappy       | Support for snappy Avro compression                                  |
+| gcs          | GCS as the FileIO implementation to interact with the object store   |
+
+You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store.
+
+## Connecting to a catalog
+
+Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Icebergs' own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Checkout the [configuration](configuration.md) page to find all the configuration details.
 
+## Write a PyArrow dataframe
+
+Let's take the Taxi dataset, and write this to an Iceberg table.
+
+First download one month of data:
+
+```shell
+curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
 ```
-pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+
+Load it into your PyArrow dataframe:
+
+```python
+import pyarrow.parquet as pq
+
+df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
 ```
 
-Or clone the repository for local development:
+Create a new Iceberg table:
 
-```sh
-git clone https://github.com/apache/iceberg-python.git
-cd iceberg-python
-pip3 install -e ".[s3fs,hive]"
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog("default")
+
+table = catalog.create_table(
+    "default.taxi_dataset",
+    schema=df.schema,  # Blocked by https://github.com/apache/iceberg-python/pull/305
+)
 ```
 
-You can mix and match optional dependencies depending on your needs:
+Append the dataframe to the table:
+
+```python
+table.append(df)
+len(table.scan().to_arrow())
+```
+
+3066766 rows have been written to the table.
+
+Now generate a tip-per-mile feature to train the model on:
+
+```python
+import pyarrow.compute as pc
+
+df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))
+```
+
+Evolve the schema of the table with the new column:
+
+```python
+from pyiceberg.catalog import Catalog
+
+with table.update_schema() as update_schema:
+    # Blocked by https://github.com/apache/iceberg-python/pull/305
+    update_schema.union_by_name(Catalog._convert_schema_if_needed(df.schema))
+```
+
+And now we can write the new dataframe to the Iceberg table:
+
+```python
+table.overwrite(df)
+print(table.scan().to_arrow())
+```
+
+And the new column is there:
+
+```
+taxi_dataset(
+  1: VendorID: optional long,
+  2: tpep_pickup_datetime: optional timestamp,
+  3: tpep_dropoff_datetime: optional timestamp,
+  4: passenger_count: optional double,
+  5: trip_distance: optional double,
+  6: RatecodeID: optional double,
+  7: store_and_fwd_flag: optional string,
+  8: PULocationID: optional long,
+  9: DOLocationID: optional long,
+  10: payment_type: optional long,
+  11: fare_amount: optional double,
+  12: extra: optional double,
+  13: mta_tax: optional double,
+  14: tip_amount: optional double,
+  15: tolls_amount: optional double,
+  16: improvement_surcharge: optional double,
+  17: total_amount: optional double,
+  18: congestion_surcharge: optional double,
+  19: airport_fee: optional double,
+  20: tip_per_mile: optional double
+),
+```
+
+And we can see that 2371784 rows have a tip-per-mile:
+
+```python
+df = table.scan(row_filter="tip_per_mile > 0").to_arrow()
+len(df)
+```
+
+## More details
 
-| Key      | Description:                                                         |
-| -------- | -------------------------------------------------------------------- |
-| hive     | Support for the Hive metastore                                       |
-| glue     | Support for AWS Glue                                                 |
-| dynamodb | Support for AWS DynamoDB                                             |
-| pyarrow  | PyArrow as a FileIO implementation to interact with the object store |
-| pandas   | Installs both PyArrow and Pandas                                     |
-| duckdb   | Installs both PyArrow and DuckDB                                     |
-| ray      | Installs PyArrow, Pandas, and Ray                                    |
-| s3fs     | S3FS as a FileIO implementation to interact with the object store    |
-| adlfs    | ADLFS as a FileIO implementation to interact with the object store   |
-| snappy   | Support for snappy Avro compression                                  |
-| gcs      | GCS as the FileIO implementation to interact with the object store   |
-
-You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` for fetching files.
-
-There is both a [CLI](cli.md) and [Python API](api.md) available.
+For the details, please check the [CLI](cli.md) or [Python API](api.md) page.