Skip to content

Conversation

@gh-yzou
Copy link
Contributor

@gh-yzou gh-yzou commented May 7, 2025

Add a base-location keyword for GenericTable create request and GenericTable load response.

flyrain
flyrain previously approved these changes May 7, 2025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we move it above the field properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, updated

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board May 7, 2025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be optional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this location mean? Is it supposed to point to a particular file? Is it supposed to be the common prefix of the locations of all files within the table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dimas-b i want to resume the discussion. The locations refers to the table root location in an URI format, I also updated the comment.

I am also copy some of the discussion point over from the email thread

1) Shall we introduce an explicit definition of location, and what is `location`?
The location refers to the root location of the table.
The table root location is a required information for different engines to access the table with formats like Delta, CSV etc. It is important that we explicitly define this information to provide robust cross engine interpolation. Furthermore, it is also an important information that is needed for credential vending.

2) Do we support single table root location or multiple root location ?
Today, only the Iceberg table allows multiple root locations, other table formats including Delta, Hudi and Hive style tables (CSV, Parquet) only support single table root location.  Since Generic Table is not designed to serve Iceberg functionality today, there is no use case for multiple table root locations, and starting with single location should be sufficient.

If in the future, we want to repurpose Generic table to also support Iceberg capabilities, and encounters the following use case:
"people may want to move their data to some other location. As an option users may want to write new
files into another location but keep old files in place"

One option we can do is to introduce an extra `additional location` to record the 
old data locations, and it can make a clear separation there about which is the current location, which are the
old data locations. Similar as what glue has been introducing. Or another option is to move on for V2 spec. 

Glue Table Catalog: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table

3) Should the location allow all URI schemas and special characters, especially s3a, s3n?
There are various issues raised in the Iceberg community when dealing with all those S3 schemas during path matching. However, since the root table location is not an absolute path, and Generic Table has restricted support in short and mid term (Polaris is still promoting for native Iceberg support), it will not do any complicated matching operation with the path, allowing all schemas shouldn't cause those issues like iceberg community. Furthermore, people may uses other schema with specific reasons such as performance or engine limitation.

4) Should the location be an explicit field or a reserved property key?
Given that table root location is an important information for most of the non-iceberg table formats. Having an explicit field could make things more clear when sharing across engines. 

@dimas-b Could you take a look ? please let me know if you have other concern points

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very hesitant to accept that all table files always have a common base URI 🤔

From my POV it is preferable to resolve these concerns on the dev ML, then resume this PR.

Copy link
Contributor

@dimas-b dimas-b May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of an Iceberg table where files are outside the "location" URI: projectnessie/nessie#10817 (comment)

I'm pretty sure Generic Tables will run into similar situations eventually, so we need to be prepared to deal with them.

If the "location" property is singular, we should put clean language in the spec to indicate that no table files should be out the location. Even in that case, I think we ought to expect user demand for multiple locations at some point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dimas-b Thanks a lot for the information! Is that for Iceberg tables, from the description, it does seems for Iceberg tables. I think for iceberg tables, we all agree that it can have multiple table locations, in fact, @dennishuo is looking into proposing new "locations" fields for Iceberg. Since generic table is not designed for iceberg usage today, so far i don't see it is necessary.

I am also open to have locations instead of location to accommodate future possibilities, I am little bit worried that no one will need it in such way, and it could introduce potential confusion to users.
An alternative we can do is if in the future if we encounter table formats that could allow multiple locations, we can introduce extra reserved property key to record extra locations for those table formats.

WDYT? i will also sent it on the mailing thread as a record

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue this on the email thread.

@gh-yzou gh-yzou marked this pull request as draft May 8, 2025 17:40
@gh-yzou gh-yzou force-pushed the yzou-generic-table-location branch from 5deacb4 to 774cf34 Compare May 14, 2025 18:55
@gh-yzou gh-yzou force-pushed the yzou-generic-table-location branch from f04b230 to 0d5c5ae Compare May 21, 2025 23:36
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b i will follow up with a dedicated webpage for generic table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx!

@gh-yzou gh-yzou marked this pull request as ready for review June 5, 2025 20:14
@gh-yzou gh-yzou requested a review from ajantha-bhat as a code owner June 5, 2025 20:14
@gh-yzou gh-yzou requested a review from singhpk234 as a code owner June 5, 2025 20:14
@gh-yzou gh-yzou force-pushed the yzou-generic-table-location branch from 0d5c5ae to 1a9168d Compare June 5, 2025 20:34
Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR LGTM personally, but I'm not sure whether the dev ML discussion reached a consensus... @gh-yzou , could you revive the dev thread to be certain?

@gh-yzou
Copy link
Contributor Author

gh-yzou commented Jun 6, 2025

@dimas-b yes, will do

@gh-yzou gh-yzou force-pushed the yzou-generic-table-location branch from 1a9168d to dd39e6d Compare June 10, 2025 21:52
@gh-yzou gh-yzou changed the title [SPEC] Add location keyword for GenericTable API [SPEC] Add base-location keyword for GenericTable API Jun 10, 2025
@flyrain flyrain merged commit 6cd6558 into apache:main Jun 11, 2025
9 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jun 11, 2025
@dimas-b
Copy link
Contributor

dimas-b commented Jun 11, 2025

@flyrain : why was it merged? As I commented above, I do not think there was a consensus on the related dev ML discussion. Moreover, new replies were posted today and this merge did not allow the discussion to contribute to this SPEC change.

@flyrain
Copy link
Contributor

flyrain commented Jun 11, 2025

@dimas-b, Sorry, I didn't notice the discussion thread today. I saw your LGTM here, #1543 (review) and Eric's approval. I'm OK to revert it while waiting for the consensus from Dev ML.

@dimas-b
Copy link
Contributor

dimas-b commented Jun 11, 2025

My LGTM message also has a reference to the dev email thread 🤷

I do not think a revert is necessary, but I suppose Generic Tables may be need some follow-up changes... depending on how the discussion evolves.

Specifically, I'd like to highlight the concern about having this API change released in 1.0. I believe this concern needs to be addressed before 1.0 is released. Please reply in the dev thread.

snazy added a commit to snazy/polaris that referenced this pull request Nov 20, 2025
* main: Update dependency io.projectreactor.netty:reactor-netty-http to v1.2.7 (apache#1845)

* docs: fix broken 'Polaris Overview' link in README.md (apache#1846)

docs: fix broken 'Polaris Overview' link in README.md

Co-authored-by: Joy Haldar <[email protected]>

* Update spark client license (apache#1839)

* Update LICENCE (apache#1851)

* Add Yun Zou as the new committer

Co-authored-by: Yufei Gu <yufei.apache.org>

* main: Update docker.io/jaegertracing/all-in-one Docker tag to v1.70.0 (apache#1853)

* main: Update dependency io.opentelemetry.semconv:opentelemetry-semconv to v1.34.0 (apache#1850)

* main: Update dependency boto3 to v1.38.34 (apache#1852)

* main: Update dependency org.postgresql:postgresql to v42.7.7 (apache#1859)

* [SPEC] Add base-location keyword for GenericTable API (apache#1543)

* Update LICENSE for binary distributions (apache#1855)

* Add integration tests to Python Client (apache#1856)

* Update JDBC Getting-started example's README.md to use localhost for curl commands (apache#1872)

* Remove PolarisConfiguration.loadConfig (v2) (apache#1858)

* Enhance EclipseLink quickstart (apache#1870)

1. Update the name catalog name used to `quickstart_catalog` from `polaris` to make it consistent
2. Update the name of `eclipselink-trino-1` to `polaris-trino-1`
3. make the polaris server url consistent with using `localhost` to avoid failures

* Remove generated Python client from git tracking (apache#1810)

After apache#1675, we now generate the Python client on-demand when it's needed by tests or the CLI. With this, we can safely remove the Python client from being tracked in the repo and add its files to .gitignore

* Fix telemetry quickstart example for 1.0 release (apache#1873)

* main: Update gradle/actions digest to ac638b0 (apache#1877)

* main: Update dependency boto3 to v1.38.35 (apache#1874)

* Removing star import and adding errorprone rule (apache#1831)

Fixes apache#1100

* NoSQL: merge related changes

* Last merged commit 0faf948

---------

Co-authored-by: Mend Renovate <[email protected]>
Co-authored-by: Joy Haldar <[email protected]>
Co-authored-by: Joy Haldar <[email protected]>
Co-authored-by: gh-yzou <[email protected]>
Co-authored-by: Yufei Gu <[email protected]>
Co-authored-by: Yun Zou <[email protected]>
Co-authored-by: Honah (Jonas) J. <[email protected]>
Co-authored-by: Eric Maynard <[email protected]>
Co-authored-by: William Hyun <[email protected]>
Co-authored-by: gfakbar20 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants