diff --git a/mkdocs/docs/SUMMARY.md b/mkdocs/docs/SUMMARY.md
index 5504d2ac7b..40ba0bffd7 100644
--- a/mkdocs/docs/SUMMARY.md
+++ b/mkdocs/docs/SUMMARY.md
@@ -17,7 +17,7 @@
-- [Home](index.md)
+- [Getting started](index.md)
 - [Configuration](configuration.md)
 - [CLI](cli.md)
 - [API](api.md)
diff --git a/mkdocs/docs/contributing.md b/mkdocs/docs/contributing.md
index 8ec6dcb2d2..7411382d2c 100644
--- a/mkdocs/docs/contributing.md
+++ b/mkdocs/docs/contributing.md
@@ -58,6 +58,22 @@ For IDEA ≤2021 you need to install the [Poetry integration as a plugin](https:
 Now you're set using Poetry, and all the tests will run in Poetry, and you'll have syntax highlighting in the pyproject.toml to indicate stale dependencies.
 
+## Installation from source
+
+Clone the repository for local development:
+
+```sh
+git clone https://github.com/apache/iceberg-python.git
+cd iceberg-python
+pip3 install -e ".[s3fs,hive]"
+```
+
+You can also install it directly from GitHub (not recommended, but sometimes handy):
+
+```sh
+pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+```
+
 ## Linting
 
 `pre-commit` is used for autoformatting and linting:
diff --git a/mkdocs/docs/index.md b/mkdocs/docs/index.md
index 628f4f7dd4..d82aa658eb 100644
--- a/mkdocs/docs/index.md
+++ b/mkdocs/docs/index.md
@@ -20,11 +20,11 @@ hide:
 - limitations under the License.
 -->
 
-# PyIceberg
+# Getting started with PyIceberg
 
 PyIceberg is a Python implementation for accessing Iceberg tables, without the need of a JVM.
 
-## Install
+## Installation
 
 Before installing PyIceberg, make sure that you're on an up-to-date version of `pip`:
@@ -38,36 +38,126 @@ You can install the latest release version from pypi:
 pip install "pyiceberg[s3fs,hive]"
 ```
 
-Install it directly for GitHub (not recommended), but sometimes handy:
+You can mix and match optional dependencies depending on your needs:
+
+| Key          | Description                                                            |
+| ------------ | ---------------------------------------------------------------------- |
+| hive         | Support for the Hive metastore                                         |
+| glue         | Support for AWS Glue                                                   |
+| dynamodb     | Support for AWS DynamoDB                                               |
+| sql-postgres | Support for SQL Catalog backed by PostgreSQL                           |
+| sql-sqlite   | Support for SQL Catalog backed by SQLite                               |
+| pyarrow      | PyArrow as a FileIO implementation to interact with the object store   |
+| pandas       | Installs both PyArrow and Pandas                                       |
+| duckdb       | Installs both PyArrow and DuckDB                                       |
+| ray          | Installs PyArrow, Pandas, and Ray                                      |
+| s3fs         | S3FS as a FileIO implementation to interact with the object store      |
+| adlfs        | ADLFS as a FileIO implementation to interact with the object store     |
+| snappy       | Support for snappy Avro compression                                    |
+| gcs          | GCS as the FileIO implementation to interact with the object store     |
+
+You need to install at least one of `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store.
+
+## Connecting to a catalog
+
+Iceberg leverages the [catalog as one centralized place to organize your tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog that stores your Iceberg tables next to the rest, a vendor solution such as the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page to find all the configuration details.
+
+## Write a PyArrow dataframe
+
+Let's take the Taxi dataset and write it to an Iceberg table.
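+
+The walkthrough below uses `load_catalog("default")`, so it assumes a catalog named `default` has been configured (see the [configuration](configuration.md) page). As a minimal sketch, a local SQL catalog backed by SQLite (requires the `sql-sqlite` extra; the paths are illustrative) could be defined in `~/.pyiceberg.yaml`:
+
+```yaml
+catalog:
+  default:
+    type: sql
+    uri: sqlite:////tmp/pyiceberg_catalog.db
+    warehouse: file:///tmp/warehouse
+```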
+
+First download one month of data:
+
+```shell
+curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
+```
+
+Load it into your PyArrow dataframe:
+
+```python
+import pyarrow.parquet as pq
+df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
 ```
-pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+
+Create a new Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog("default")
+
+table = catalog.create_table(
+    "default.taxi_dataset",
+    schema=df.schema,
+)
 ```
-Or clone the repository for local development:
+Append the dataframe to the table:
 
-```sh
-git clone https://github.com/apache/iceberg-python.git
-cd iceberg-python
-pip3 install -e ".[s3fs,hive]"
+```python
+table.append(df)
+len(table.scan().to_arrow())
 ```
 
-You can mix and match optional dependencies depending on your needs:
+3,066,766 rows have been written to the table.
+
+Now generate a tip-per-mile feature, for example to train a model on:
+
+```python
+import pyarrow.compute as pc
+
+df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))
+```
+
+Evolve the schema of the table with the new column:
+
+```python
+with table.update_schema() as update_schema:
+    update_schema.union_by_name(df.schema)
+```
+
+And now we can overwrite the Iceberg table with the new dataframe:
+
+```python
+table.overwrite(df)
+print(table.scan().to_arrow())
+```
+
+And the new column is there:
+
+```
+taxi_dataset(
+  1: VendorID: optional long,
+  2: tpep_pickup_datetime: optional timestamp,
+  3: tpep_dropoff_datetime: optional timestamp,
+  4: passenger_count: optional double,
+  5: trip_distance: optional double,
+  6: RatecodeID: optional double,
+  7: store_and_fwd_flag: optional string,
+  8: PULocationID: optional long,
+  9: DOLocationID: optional long,
+  10: payment_type: optional long,
+  11: fare_amount: optional double,
+  12: extra: optional double,
+  13: mta_tax: optional double,
+  14: tip_amount: optional double,
+  15: tolls_amount: optional double,
+  16: improvement_surcharge: optional double,
+  17: total_amount: optional double,
+  18: congestion_surcharge: optional double,
+  19: airport_fee: optional double,
+  20: tip_per_mile: optional double
+),
+```
+
+And we can see that 2,371,784 rows have a tip-per-mile value:
+
+```python
+df = table.scan(row_filter="tip_per_mile > 0").to_arrow()
+len(df)
+```
+
+## More details
 
-| Key      | Description:                                                          |
-| -------- | --------------------------------------------------------------------- |
-| hive     | Support for the Hive metastore                                        |
-| glue     | Support for AWS Glue                                                  |
-| dynamodb | Support for AWS DynamoDB                                              |
-| pyarrow  | PyArrow as a FileIO implementation to interact with the object store |
-| pandas   | Installs both PyArrow and Pandas                                      |
-| duckdb   | Installs both PyArrow and DuckDB                                      |
-| ray      | Installs PyArrow, Pandas, and Ray                                     |
-| s3fs     | S3FS as a FileIO implementation to interact with the object store    |
-| adlfs    | ADLFS as a FileIO implementation to interact with the object store   |
-| snappy   | Support for snappy Avro compression                                   |
-| gcs      | GCS as the FileIO implementation to interact with the object store   |
-
-You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` for fetching files.
-
-There is both a [CLI](cli.md) and [Python API](api.md) available.
+For more details, please check the [CLI](cli.md) and [Python API](api.md) pages.
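+
+For example, once the table exists, you can inspect it from the command line with the CLI's `describe` command (a small sketch; this assumes the same `default` catalog is configured):
+
+```sh
+pyiceberg describe default.taxi_dataset
+```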
diff --git a/pyiceberg/table/__init__.py b/pyiceberg/table/__init__.py
index 26eecefd0f..16ed9ed292 100644
--- a/pyiceberg/table/__init__.py
+++ b/pyiceberg/table/__init__.py
@@ -1477,9 +1477,11 @@ def case_sensitive(self, case_sensitive: bool) -> UpdateSchema:
         self._case_sensitive = case_sensitive
         return self
 
-    def union_by_name(self, new_schema: Schema) -> UpdateSchema:
+    def union_by_name(self, new_schema: Union[Schema, "pa.Schema"]) -> UpdateSchema:
+        from pyiceberg.catalog import Catalog  # imported here to avoid a circular import
+
         visit_with_partner(
-            new_schema,
+            Catalog._convert_schema_if_needed(new_schema),
             -1,
             UnionByNameVisitor(update_schema=self, existing_schema=self._schema, case_sensitive=self._case_sensitive),  # type: ignore
             PartnerIdByNameAccessor(partner_schema=self._schema, case_sensitive=self._case_sensitive),
diff --git a/tests/test_schema.py b/tests/test_schema.py
index a5487b7fd9..cfee6e7f14 100644
--- a/tests/test_schema.py
+++ b/tests/test_schema.py
@@ -18,6 +18,7 @@
 from textwrap import dedent
 from typing import Any, Dict, List
 
+import pyarrow as pa
 import pytest
 
 from pyiceberg import schema
@@ -1579,3 +1580,23 @@ def test_append_nested_lists() -> None:
     )
 
     assert union.as_struct() == expected.as_struct()
+
+
+def test_union_with_pa_schema() -> None:
+    base_schema = Schema(NestedField(field_id=1, name="foo", field_type=StringType(), required=True))
+
+    pa_schema = pa.schema([
+        pa.field("foo", pa.string(), nullable=False),
+        pa.field("bar", pa.int32(), nullable=True),
+        pa.field("baz", pa.bool_(), nullable=True),
+    ])
+
+    new_schema = UpdateSchema(None, schema=base_schema).union_by_name(pa_schema)._apply()
+
+    expected_schema = Schema(
+        NestedField(field_id=1, name="foo", field_type=StringType(), required=True),
+        NestedField(field_id=2, name="bar", field_type=IntegerType(), required=False),
+        NestedField(field_id=3, name="baz", field_type=BooleanType(), required=False),
+    )
+
+    assert new_schema == expected_schema
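
With this change, `union_by_name` accepts a `pyarrow.Schema` directly, which is what lets the getting-started guide above call `update_schema.union_by_name(df.schema)` with a PyArrow dataframe's schema. A minimal usage sketch (assuming the `default` catalog and the `default.taxi_dataset` table from the docs example):

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

# Load the table created in the getting-started walkthrough.
table = load_catalog("default").load_table("default.taxi_dataset")

# A PyArrow schema is now accepted as-is; it is converted to an Iceberg
# schema internally before the union is applied.
with table.update_schema() as update:
    update.union_by_name(pa.schema([pa.field("tip_per_mile", pa.float64())]))
```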