Add policies for metadata compaction, orphan file removal and snapshot retention #969

flyrain · 2025-02-08T01:54:09Z

No description provided.

leangjonathan · 2025-02-10T18:59:45Z

polaris-core/src/main/resources/schemas/policies/system/orphan-file-removal/2025-02-03.json

+    },
+    "location": {
+      "type": "string",
+      "description": "Customized directory other than table location to look for files in."


I think this should carry a warning: if you specify a location other than the table base location, for example s3://my-bucket instead of s3://my-bucket/my-table-location, all files not referenced by the table will be purged, potentially including other tables' files if those are stored under the specified path. I think this aligns with the best-practice note that one shouldn't store tables under the same location.

Added a warning message.
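
The risk described above can be made concrete with a small check. This is only a sketch, not code from the PR: the helper name `is_broader_than_table` is hypothetical, and it assumes locations are URIs whose path segments are separated by `/`. It compares whole path segments rather than raw string prefixes, so `s3://my-bucket/my-table` is not mistaken for a parent of `s3://my-bucket/my-table-location`.

```python
import posixpath
from urllib.parse import urlparse


def _parts(uri):
    """Split a URI into (scheme, bucket/authority, tuple of path segments)."""
    parsed = urlparse(uri)
    path = posixpath.normpath("/" + parsed.path.strip("/"))
    segments = tuple(s for s in path.split("/") if s)
    return parsed.scheme, parsed.netloc, segments


def is_broader_than_table(location, table_location):
    """Return True if `location` is a strict ancestor of `table_location`.

    Hypothetical helper: a strict ancestor would expose files of sibling
    tables to orphan file removal.
    """
    l_scheme, l_host, l_segs = _parts(location)
    t_scheme, t_host, t_segs = _parts(table_location)
    if (l_scheme, l_host) != (t_scheme, t_host):
        return False
    return len(l_segs) < len(t_segs) and t_segs[: len(l_segs)] == l_segs
```

A caller could refuse to run, or at least log a warning, when the check returns True for a configured location.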

gurukaraje · 2025-02-10T19:40:39Z

polaris-core/src/main/resources/schemas/policies/system/orphan-file-removal/2025-02-03.json

+      "type": "boolean",
+      "description": "Enable or disable orphan file removal."
+    },
+    "older_than": {


It might be easier to specify the maximum age of orphaned files in days rather than as a timestamp, i.e.
"max_orphan_file_age_in_days" : 30
instead of
"older_than": 1707315296

+1. Timestamp probably only makes sense for engines during operation runtime

Good point, made the change.
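
To illustrate why a relative age suits a stored policy while a timestamp suits a single run, here is a minimal sketch of how an engine might derive the cutoff at execution time. The function name `orphan_cutoff_epoch_seconds` is an assumption for illustration, as is the choice of epoch seconds in UTC; the PR does not define how engines perform this conversion.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def orphan_cutoff_epoch_seconds(max_age_days: int,
                                now: Optional[datetime] = None) -> int:
    """Translate a policy's max age in days into an absolute cutoff.

    Files last modified before the returned epoch second would be
    candidates for removal. `now` is injectable so the result is testable.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return int((now - timedelta(days=max_age_days)).timestamp())


# The policy stores only the relative age; the engine computes the
# timestamp each time the maintenance job runs.
policy = {"enable": True, "max_orphan_file_age_in_days": 30}
cutoff = orphan_cutoff_epoch_seconds(policy["max_orphan_file_age_in_days"])
```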

gurukaraje · 2025-02-10T19:41:43Z

polaris-core/src/main/resources/schemas/policies/system/orphan-file-removal/2025-02-03.json

+    {
+      "version": "2025-02-03",
+      "enable": true,
+      "older_than": 1707315296,


How about keeping all three policies consistent in format? Metadata compaction and snapshot expiry seem to follow this format:

    "enable": true,
    "config": {
      "key1": value1,
      "key2": value2
    }

But orphan file deletion has some properties, older_than and location, outside of config:

    "enable": true,
    "older_than": 1707315296,
    "location": "s3://my-bucket/my-table-location",
    "config": {
      "prefix_mismatch_mode": "ignore",
      "my_key": "my_value"
    }

The config map is used for customized properties. For example, a TMS may provide an optional feature that only a subset of tables needs.
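
The split between typed top-level fields and the free-form config map can be sketched as follows. This is an illustration only: the helper `split_policy` and the `KNOWN_FIELDS` set are hypothetical, and the field names are taken from the example quoted in this thread rather than from the final schema.

```python
# Typed fields named in the example above; this set is an assumption and
# would differ for the final schema.
KNOWN_FIELDS = {"version", "enable", "older_than", "location"}


def split_policy(policy):
    """Separate schema-defined fields from customized config properties.

    Anything at the top level that is neither a known field nor `config`
    is rejected, so custom keys must live inside the config map.
    """
    unknown = sorted(set(policy) - KNOWN_FIELDS - {"config"})
    if unknown:
        raise ValueError(f"unexpected top-level keys: {unknown}")
    typed = {k: v for k, v in policy.items() if k in KNOWN_FIELDS}
    custom = dict(policy.get("config", {}))
    return typed, custom


POLICY = {
    "enable": True,
    "older_than": 1707315296,
    "location": "s3://my-bucket/my-table-location",
    "config": {"prefix_mismatch_mode": "ignore", "my_key": "my_value"},
}
typed, custom = split_policy(POLICY)
```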

flyrain · 2025-02-11T01:11:45Z

polaris-core/src/main/resources/schemas/policies/system/orphan-file-removal/2025-02-03.json

+    },
+    "location": {
+      "type": "string",
+      "description": "Specifies a custom directory to search for files instead of the default table location. Use with caution—if set to a broad location (e.g., s3://my-bucket instead of s3://my-bucket/my-table-location), all unreferenced files in that path may be permanently deleted, including files from other tables. Following best practices, tables should be stored in separate locations to avoid accidental data loss."


This is hard to read in an IDE like IntelliJ because it is a single long line. JSON doesn't support breaking one string across multiple lines. This makes me think we might use YAML instead of JSON.

The alternative is just to put line breaks in the string, and then preprocess it anywhere we want to strip out the whitespace

put line breaks in the string

One of my versions was that :). It didn't help much, esp. in the IDE.
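
The preprocessing alternative mentioned above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the line break is written as a `\n` escape followed by spaces, since a JSON string cannot contain a raw newline (which is also why this approach does little for readability in an IDE).

```python
import json
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_descriptions(node):
    """Recursively collapse whitespace in every string-valued "description"."""
    if isinstance(node, dict):
        return {
            key: collapse_whitespace(value)
            if key == "description" and isinstance(value, str)
            else normalize_descriptions(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [normalize_descriptions(item) for item in node]
    return node


# Raw string so that \n reaches the JSON parser as an escape sequence.
raw = r'''
{"location": {"type": "string",
  "description": "Specifies a custom directory to search for files\n      instead of the default table location."}}
'''
schema = normalize_descriptions(json.loads(raw))
```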

ashvina · 2025-02-14T07:03:13Z

polaris-core/src/main/resources/schemas/policies/system/orphan-file-removal/2025-02-03.json

+      "type": "number",
+      "description": "Specifies the maximum age (in days) for orphaned files before they are eligible for removal."
+    },
+    "location": {


What do you think about making this property multi-value? This way, a user could support adding paths for multiple "namespaces".

I guess you don't mean namespaces but just "locations"?

I think having one policy map to one location is probably okay for now; I'm not sure if/how we plan to handle overlapping locations though.

We could support that. One use case is that table files might be stored in different locations based on the write.data.path and/or write.metadata.path settings. This is generally not recommended, though, because among other issues it makes credential vending harder. Are there any other use cases you have in mind, @ashvina?

Overlapping locations are very dangerous. No orphan file removal should happen in that case.

Made it a string array type in the new commit.
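
With an array of locations, the overlap concern raised above could be guarded against before any removal runs. The following is a sketch only: `overlapping_pairs` is a hypothetical helper, and treating two locations as overlapping when one is equal to or an ancestor of the other is an assumption, since the thread does not define overlap precisely.

```python
from itertools import combinations
from urllib.parse import urlparse


def _key(uri):
    """Represent a location as (scheme, authority, *path segments)."""
    parsed = urlparse(uri)
    return (parsed.scheme, parsed.netloc) + tuple(
        s for s in parsed.path.split("/") if s
    )


def overlapping_pairs(locations):
    """Return every pair where one location contains or equals the other."""
    keyed = [(loc, _key(loc)) for loc in locations]
    pairs = []
    for (a, key_a), (b, key_b) in combinations(keyed, 2):
        shorter, longer = sorted((key_a, key_b), key=len)
        if longer[: len(shorter)] == shorter:
            pairs.append((a, b))
    return pairs


locations = ["s3://b/t1/data", "s3://b/t1", "s3://b/t2"]
conflicts = overlapping_pairs(locations)
# A cautious engine would skip orphan file removal entirely if
# `conflicts` is non-empty.
```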

eric-maynard · 2025-02-14T08:37:54Z

polaris-core/src/main/resources/schemas/policies/system/orphan-file-removal/2025-02-03.json

+  "license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",
+  "$id": "https://polaris.apache.org/schemas/policies/system/orphan-file-removal/2025-02-03.json",
+  "title": "Orphan File Removal Policy",
+  "description": "Inheritable Polaris policy schema for Iceberg table orphan file removal.",


What does Inheritable mean here?

Polaris seems redundant, all of these are going to be Polaris schemas right?

An inheritable policy is one that also applies to the entities beneath the one it is assigned to. For example, all tables under a namespace get the policy if it is assigned to the namespace.
I'm OK with removing it or keeping it. Keeping it gives a more complete picture to anyone who reads the schema without much Polaris context.

Are all policies inheritable?

No. For example, column masking policies are not inheritable, which makes more sense and is also what most engines do.
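
The inheritance behavior described above can be sketched as a lookup that walks from a table up through its parents. Everything here is hypothetical: the `attachments` mapping, the `effective_policy` function, and in particular the rule that the nearest attachment wins, which is an assumption and not something this thread specifies.

```python
def effective_policy(entity_path, attachments, inheritable=True):
    """Resolve the policy that applies to an entity.

    entity_path: tuple such as ("ns1", "tbl"), from outermost to innermost.
    attachments: dict mapping entity paths to policy dicts.
    For a non-inheritable policy type only a direct attachment counts.
    """
    path = tuple(entity_path)
    if not inheritable:
        return attachments.get(path)
    # Walk from the entity itself up to its outermost parent.
    for depth in range(len(path), 0, -1):
        policy = attachments.get(path[:depth])
        if policy is not None:
            return policy
    return None


attachments = {
    ("ns1",): {"enable": True, "max_orphan_file_age_in_days": 30},
    ("ns1", "tbl2"): {"enable": False},
}
inherited = effective_policy(("ns1", "tbl"), attachments)
overridden = effective_policy(("ns1", "tbl2"), attachments)
direct_only = effective_policy(("ns1", "tbl"), attachments, inheritable=False)
```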

polaris-core/src/main/resources/schemas/policies/system/snapshot-retention/2025-02-03.json

eric-maynard · 2025-02-14T08:40:03Z

polaris-core/src/main/resources/schemas/policies/system/metadata-compaction/2025-02-03.json

@@ -0,0 +1,37 @@
+{
+  "license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",
+  "$id": "https://polaris.apache.org/schemas/policies/system/metadata-compaction/2025-02-03.json",


Does the JSON magically land at the location specified in the ID somehow? Or do we always need a followup PR?

Also, it looks a little funny to use dates here given that the date in the PR may not align with the date the schema actually becomes effective. In the worst case, we could merge two versions in one day. Maybe just an incrementing number is easier?

Unfortunately no. I hope I can publish them once and for all based on the directory structure.

I want to keep the same date for all of them since these are the first batch; that should be fine as nobody is using them yet. Once we release them at 1.0, we should follow the date scheme strictly.
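
For context on the date-versus-counter question, here is a small sketch of how a consumer might choose the newest schema when versions are ISO dates. The function name `latest_version` is an assumption for illustration. The sketch also makes the limitation raised above visible: two versions published on the same day would need the same name, so a date alone cannot distinguish them.

```python
from datetime import date


def latest_version(versions):
    """Return the most recent version among ISO date strings (YYYY-MM-DD)."""
    return max(versions, key=date.fromisoformat)


available = ["2024-12-31", "2025-02-03", "2025-01-15"]
newest = latest_version(available)
```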

snazy

Adding a note here as we might need a couple more days to review this one. Please do not merge yet.

flyrain · 2025-02-18T07:42:25Z

Adding a note here as we might need a couple more days to review this one. Please do not merge yet.

@snazy, this PR was filed 10 days ago, and it's a small one. Can you review it in a timely manner?

snazy · 2025-02-19T11:58:54Z

My concern with this change, and its quickly merged predecessor #945, is that AFAIK we don't have a consensus on policies in general, and specifically on where these files should live, how they are accessed by users and how they are used by implementations.

I don't object to draft code and draft PRs, but I think that we should have a consensus first on those rather big topics before merging anything. Sure, this change is rather "small" or "innocent", but the topic is big.

flyrain · 2025-02-20T01:05:30Z

I believe we have reached a general consensus on policy management:

1. Design Document: The design document [1] was published over two months ago, triggering a lot of discussion. Stakeholders from various companies, including representatives from Dremio, actively contributed, and all feedback has been incorporated without any outstanding blockers.
2. Review Sessions: We conducted multiple review sessions with key stakeholders such as @jbonofre, @omarsmak, @RussellSpitzer and several others. Through these discussions, we aligned on the overall approach.
3. PR "Add data compaction policy schema" #945 received strong support, not only from committer @eric-maynard but also from @omarsmak, one of the stakeholders from Dremio, and @HonahX, confirming broad agreement.

[1] https://docs.google.com/document/d/1kIiVkFFg9tPa5SH70b9WwzbmclrzH3qWHKfCKXw5lbs/edit?tab=t.0

omarsmak · 2025-02-20T17:31:37Z

polaris-core/src/main/resources/schemas/policies/system/orphan-file-removal/2025-02-03.json

+      "type": "boolean",
+      "description": "Enable or disable orphan file removal."
+    },
+    "max_orphan_file_age_in_days": {


@flyrain I'm also struggling with this a bit: an orphan removal policy can be expressed in more than just file age. Why don't we opt for the config key, as in the other policies?

an orphan removal policy can be expressed in more than just file age.

Can you name them? We can put them into the schema if they are commonly used; otherwise, the config map would be the best place for them.

I'd vote for the config map

If no extra field is suggested, we could keep it as is.

snazy · 2025-02-21T12:51:19Z

I believe we have reached a general consensus on policy management:

1. Design Document: The design document [1] was published over two months ago, triggering a lot of discussions. Stakeholders from various companies—including representatives from Dremio—actively contributed, and all feedback has been incorporated without any outstanding blockers.

2. Review Sessions: We conducted multiple review sessions with key stakeholders such as @jbonofre, @omarsmak, @RussellSpitzer and several others. Through these discussions, we aligned on the overall approach.

3. PR [Add data compaction policy schema #945](https://github.com/apache/polaris/pull/945) received strong support, not only from committer @eric-maynard but also from @omarsmak, one of the stakeholders from Dremio, and @HonahX, confirming broad agreement.

[1] https://docs.google.com/document/d/1Vuhw5b9-6KAol2vU3HUs9FJwcgWtiVVXMYhLtGmz53s/edit?tab=t.0

(Not sure how the linked google doc is related to this PR)

The points I raised are about:

1. The changes committed to the production code base are not used: there is no code that uses them, and there won't be soon. They belong to a topic that is still WIP overall, hence I object to merging this into main at this point.
2. These files land in a place that makes it extremely hard for arbitrary consumers to consume them. I do not think they should live there.
3. "Add data compaction policy schema" #945 was merged without giving all contributors enough time to review.

I propose to move the work to a feature branch and go from there.

flyrain · 2025-02-22T18:15:53Z

Updated the doc link.

The changes committed to the production code base are not used: there is no code that uses them, and there won't be soon. They belong to a topic that is still WIP overall, hence I object to merging this into main at this point.

Breaking a big feature into multiple small PRs is a common and recommended practice: it not only makes iteration faster but also improves review and code quality. A separate branch is only necessary when a feature introduces breaking changes, which is not the case for this PR.

These files land in a place that makes it extremely hard for arbitrary consumers to consume them. I do not think they should live there.

I'm surprised by this claim. It's common practice to put non-code files in the resources directory, where they are easy for any Java code to consume. For example, #938 introduced a way to consume these schemas. We also plan to publish them on the website for other tools to consume.

Add policies for metadata compaction, orphan file removal and snapsho…

0ffbf92

…t retention

flyrain marked this pull request as ready for review February 8, 2025 01:54

flyrain requested review from MonkeyCanCode, RussellSpitzer, adutra, ashvina, collado-mike, dennishuo, dimas-b, ebyhr, eric-maynard, jackye1995, jbonofre, snazy, takidau and vvcephei as code owners February 8, 2025 01:54

Fix typo

561cfb7

leangjonathan reviewed Feb 10, 2025

gurukaraje reviewed Feb 10, 2025

Resolve comments

931094e

flyrain commented Feb 11, 2025

Resolve comments

02f77cb

ashvina reviewed Feb 14, 2025

eric-maynard reviewed Feb 14, 2025

polaris-core/src/main/resources/schemas/policies/system/snapshot-retention/2025-02-03.json (outdated, resolved)

eric-maynard approved these changes Feb 14, 2025

eric-maynard reviewed Feb 14, 2025

Resolve comments

efc4767

snazy reviewed Feb 17, 2025

omarsmak reviewed Feb 20, 2025

eric-maynard approved these changes Feb 25, 2025

flyrain merged commit f3d9141 into apache:main Feb 25, 2025
5 checks passed

github-project-automation bot moved this from In Progress to Done in Policy management Feb 25, 2025

github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Feb 25, 2025

github-project-automation bot moved this from In Progress to Done in Table Maintenance Feb 25, 2025

HonahX mentioned this pull request Mar 13, 2025

Policy Store: Add PolicyEntity and PolicyTypes #1133

Merged
