Add data compaction policy schema #945

flyrain · 2025-02-05T05:16:20Z

This PR adds a policy schema for data compaction. Please refer to the design doc for more context.
Check out #938 if you want to know more about how a policy is validated.

The policy schemas will be shipped within the polaris-core jar file, so that engines or applications can reuse the polaris-core for related functionallities, like schema validation.

The policy will also be pushed in the Apache Polaris website via a link like https://polaris.apache.org/schemas/policies/system/data-compaction/2025-02-03.json, so that it is easy to refer to.

cc @HonahX @jackye1995 @jbonofre @eric-maynard @dennishuo @collado-mike @RussellSpitzer

flyrain · 2025-02-05T05:17:00Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

@@ -0,0 +1,41 @@
+{
+  "license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",
+  "$id": "https://polaris.apache.org/schemas/policies/system/data-compaction/2025-02-03.json",


Will publish it to web site in a followup PR.

flyrain · 2025-02-05T05:18:48Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

@@ -0,0 +1,41 @@
+{
+  "license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",


Json doesn't support comments. Using a field to indicate Apache License instead of a header.

flyrain · 2025-02-05T06:32:38Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

+      "additionalProperties": {}
+    }
+  },
+  "required": ["enable"],


The schema version is optional. Polaris will pick up the latest version as the default one in case it's missing, and persist it in the metastore.

omarsmak

Thanks @flyrain for the PR! I have the concerns about the target file size, let me know what you think about my suggestion :)

omarsmak · 2025-02-05T08:36:13Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

+      "type": "boolean",
+      "description": "Enable or disable data compaction."
+    },
+    "target_file_size_bytes": {


I still have reservations about including target_file_size_bytes in the main schema due to the potential issues that I highlighted in the design doc earlier. Here, I’d like to reiterate my reasoning:

Iceberg’s Source-of-Truth Property:
Iceberg already exposes a write.target-file-size-bytes property, which is the most appropriate configuration to define target file size across operations. Engines are expected to comply with this property as the source of truth.
By introducing target_file_size_bytes in Polaris as a separate Policy configuration, we risk creating disparity between the two settings.

Conflicting Policies Between Polaris and Iceberg:
If a user sets a different value for target_file_size_bytes in Polaris (compared to write.target-file-size-bytes in Iceberg), it introduces the following risks:

In cases where files are compacted manually (i.e., not through TMS), the manual operation would likely respect Iceberg's write.target-file-size-bytes setting.

This mismatch could result in unintended behavior, such as a complete rewrite of the table (critical for large tables with TBs of data).

I understand the potential reasons for adding this configuration in Polaris. Given how the config map is fully optional and extensible, maybe it might make sense to rely on this config field instead to set the target file size there and allow the engine to interpret and manage its policies as needed. Wdyt?

The target_file_size_bytes in the policy will be used exclusively for TMS (Table Maintenance System) or similar applications, and I believe it's reasonable to keep it as is.

It's optional—TMS or users can ignore it if they prefer table properties instead.

It has been in place for a while. For example, Iceberg's Spark rewrite procedure could use a different target_file_size_bytes.

That said, moving it to the config map is also a viable option. However, the main drawback would be losing interoperability. I'm also looking forward to others' thoughts.

1- it is not about being optional, having defined as policy schema, implicitly adds the expectations to the users that any TMS would support this and can open up a lot of false/positives.

2- Yes I am aware of Spark rewrite data files procedure call, but there is a difference here, the procedure call is a manual one time thing same as a DDL query, that takes the target file size as input if the user wishes to do so, but it doesn't save that as policy, so that implicitly put it as engine implementation details. For example if I have the same procedure written in Flink, I am not obligated to have the target file size as an input as long as I respect the iceberg table property.

I understand this might add flexibility, but also we need to think about the consequences further ahead. I personally think, if this is really needed and TMS users know their engine would support that, then really using the config is enough here.

I'm OK to use config for that. Removed in the new commit.

HonahX

@flyrain Thanks for the policy schema, it looks great! I left some comments/questions. Please let me know what you think!

HonahX · 2025-02-05T21:57:46Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

+    "enable": {
+      "type": "boolean",
+      "description": "Enable or disable data compaction."


Will this be a common field for all policy types? I guess the general question is whether we want to extend the concept of enabling/disabling policies to all types or keep it limited to TMS-related policies like compaction and snapshot expiry.

I'd consider it is common for any inheritable policies, including all TMS policies, but not necessarily for access control policies, like row filtering.
In case of inheritable policies, users will need to disable a certain type of policies at a lower level, as it's able to override the upper-level polices. While it isn't mandatory for non-inheritable policies. In case of disabling, we could remove the mapping.

HonahX · 2025-02-05T22:03:27Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

+    "version": {
+      "type": "string",
+      "const": "2025-02-03",
+      "description": "Schema version."


Since we use date instead of package version, does this mean the pre-defined policy type schema (which evolves through community-driven changes) will be released separately from polaris-core, and we'll need to track which Polaris version supports which schema version? If that's the case, would it make more sense to move this outside of polaris-core?

I think it should still release alongside Polaris but it can be helpful to have it versioned separately. Just like the API endpoints have their own versioning #

+1 on release along with Polaris. Much like endpoint evolution, the Polaris version upgrades should cover policy schema version upgrades, as the later is usually much slower. It requires both servers and clients to be ready.
I'm OK to move it out of Polaris core, but it doesn't seem necessary at this moment. We could do that if things changed.

HonahX · 2025-02-05T22:12:24Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

+      "config": {
+        "compaction_strategy": "bin-pack",
+        "max-concurrent-file-group-rewrites": 5,


Could these fields be pre-defined in the config schema? For example

"config": { "type": "object", "description": "A map containing custom configuration properties. Please note that interoperability is not guaranteed.", "properties": { "compaction-strategy": {"type" : "string"}, "max-concurrent-file-group-rewrites": {"type": "number"} }, "additionalProperties": {} }

I think it will be good to "reserve" these names for reference and standardization.

We could reserve certain names, but each reserved name would become part of the spec, making it very hard to remove later. This would lead to prolonged debate about which names to reserve. On the upside, these reserved keys would ensure interoperability. I prefer to keep things simple first, adding keys if they are proved truly necessary. I'm also open to any suggestion of reserved keys.

eric-maynard

The schema seems reasonable to me, though I'm still a little confused about interoperability

eric-maynard · 2025-02-05T22:28:43Z

polaris-core/src/main/resources/schemas/policies/system/data-compaction/2025-02-03.json

+    },
+    "config": {
+      "type": "object",
+      "description": "A map containing custom configuration properties. Please note that interoperability is not guaranteed.",


To me this is where the whole thing sort of falls apart. If TMS X and TMS Y will have different interpretations of the same policy, what's the point of defining a policy's schema so strictly?

And what if I want to implement TMS Z, which has some new policy type that's not known by the service. Do I have to fork Polaris and add a new policy schema? What will other TMSs do when they see this unknown policy?

I agreed that we should promote interoperability as much as possible, that's the point of the policy schema. However, there are use cases to support optional features of each TMS, e.g., TMS X has a optional feature f1, only a small set of tables need it, users can select f1 via config map. TMS Y should ignore f1 and is OK to ignore it.

In case of a new policy type, a new schema and a new validator have to be introduced. Check design doc section "Policy Content Schema" and "Policy Validation" for more details. Other TMSs or engines need to upgrade to recognize it. It's like adding a new endpoint in a web service. To be clear, we are talking about new schemas, which is supposed to be rare, engines can always recognize new policies with old schemas.

omarsmak

LGTM! Thanks @flyrain for being flexible on the schema, I appreciate it 😊

HonahX

LGTM!

flyrain · 2025-02-06T19:02:01Z

Thanks @omarsmak @HonahX @eric-maynard for the review!

Yufei Gu added 2 commits February 4, 2025 21:06

Add data compaction policy schema

6a3ead3

Add license

8462c6c

flyrain commented Feb 5, 2025

View reviewed changes

flyrain marked this pull request as ready for review February 5, 2025 05:32

flyrain requested review from MonkeyCanCode, RussellSpitzer, adutra, ashvina, collado-mike, dennishuo, dimas-b, ebyhr, eric-maynard, jackye1995, jbonofre, snazy, takidau and vvcephei as code owners February 5, 2025 05:32

flyrain commented Feb 5, 2025

View reviewed changes

omarsmak suggested changes Feb 5, 2025

View reviewed changes

HonahX reviewed Feb 5, 2025

View reviewed changes

eric-maynard approved these changes Feb 5, 2025

View reviewed changes

eric-maynard reviewed Feb 5, 2025

View reviewed changes

Resolve comments

6708393

omarsmak approved these changes Feb 6, 2025

View reviewed changes

HonahX approved these changes Feb 6, 2025

View reviewed changes

eric-maynard approved these changes Feb 6, 2025

View reviewed changes

flyrain merged commit cc041c9 into apache:main Feb 6, 2025
5 checks passed

snazy mentioned this pull request Feb 19, 2025

Add policies for metadata compaction, orphan file removal and snapshot retention #969

Merged

		@@ -0,0 +1,41 @@
		{
		"license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)",

Add data compaction policy schema #945

Add data compaction policy schema #945

Uh oh!

Conversation

flyrain commented Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

flyrain Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

flyrain Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

omarsmak left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

flyrain Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

flyrain Feb 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HonahX left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

flyrain Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

flyrain Feb 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-maynard left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

flyrain Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

omarsmak left a comment

Choose a reason for hiding this comment

Uh oh!

HonahX left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

flyrain commented Feb 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

flyrain commented Feb 5, 2025 •

edited

Loading

flyrain Feb 5, 2025 •

edited

Loading

flyrain Feb 5, 2025 •

edited

Loading

flyrain Feb 5, 2025 •

edited

Loading

flyrain Feb 6, 2025 •

edited

Loading

flyrain Feb 5, 2025 •

edited

Loading

flyrain Feb 6, 2025 •

edited

Loading

flyrain Feb 5, 2025 •

edited

Loading