-
Notifications
You must be signed in to change notification settings - Fork 330
Add data compaction policy schema #945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @@ -0,0 +1,41 @@ | |||
| { | |||
| "license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)", | |||
| "$id": "https://polaris.apache.org/schemas/policies/system/data-compaction/2025-02-03.json", | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will publish it to web site in a followup PR.
| @@ -0,0 +1,41 @@ | |||
| { | |||
| "license": "Licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0)", | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Json doesn't support comments. Using a field to indicate Apache License instead of a header.
| "additionalProperties": {} | ||
| } | ||
| }, | ||
| "required": ["enable"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schema version is optional. Polaris will pick up the latest version as the default one in case it's missing, and persist it in the metastore.
omarsmak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @flyrain for the PR! I have the concerns about the target file size, let me know what you think about my suggestion :)
| "type": "boolean", | ||
| "description": "Enable or disable data compaction." | ||
| }, | ||
| "target_file_size_bytes": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have reservations about including target_file_size_bytes in the main schema due to the potential issues that I highlighted in the design doc earlier. Here, I’d like to reiterate my reasoning:
- Iceberg’s Source-of-Truth Property:
Iceberg already exposes awrite.target-file-size-bytesproperty, which is the most appropriate configuration to define target file size across operations. Engines are expected to comply with this property as the source of truth.
By introducingtarget_file_size_bytesin Polaris as a separate Policy configuration, we risk creating disparity between the two settings. - Conflicting Policies Between Polaris and Iceberg:
If a user sets a different value fortarget_file_size_bytesin Polaris (compared towrite.target-file-size-bytesin Iceberg), it introduces the following risks:- In cases where files are compacted manually (i.e., not through TMS), the manual operation would likely respect Iceberg's
write.target-file-size-bytessetting. - This mismatch could result in unintended behavior, such as a complete rewrite of the table (critical for large tables with TBs of data).
- In cases where files are compacted manually (i.e., not through TMS), the manual operation would likely respect Iceberg's
I understand the potential reasons for adding this configuration in Polaris. Given how the config map is fully optional and extensible, maybe it might make sense to rely on this config field instead to set the target file size there and allow the engine to interpret and manage its policies as needed. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The target_file_size_bytes in the policy will be used exclusively for TMS (Table Maintenance System) or similar applications, and I believe it's reasonable to keep it as is.
- It's optional—TMS or users can ignore it if they prefer table properties instead.
- It has been in place for a while. For example, Iceberg's Spark rewrite procedure could use a different
target_file_size_bytes.
That said, moving it to the config map is also a viable option. However, the main drawback would be losing interoperability. I'm also looking forward to others' thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1- it is not about being optional, having defined as policy schema, implicitly adds the expectations to the users that any TMS would support this and can open up a lot of false/positives.
2- Yes I am aware of Spark rewrite data files procedure call, but there is a difference here, the procedure call is a manual one time thing same as a DDL query, that takes the target file size as input if the user wishes to do so, but it doesn't save that as policy, so that implicitly put it as engine implementation details. For example if I have the same procedure written in Flink, I am not obligated to have the target file size as an input as long as I respect the iceberg table property.
I understand this might add flexibility, but also we need to think about the consequences further ahead. I personally think, if this is really needed and TMS users know their engine would support that, then really using the config is enough here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK to use config for that. Removed in the new commit.
HonahX
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@flyrain Thanks for the policy schema, it looks great! I left some comments/questions. Please let me know what you think!
| "enable": { | ||
| "type": "boolean", | ||
| "description": "Enable or disable data compaction." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be a common field for all policy types? I guess the general question is whether we want to extend the concept of enabling/disabling policies to all types or keep it limited to TMS-related policies like compaction and snapshot expiry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd consider it is common for any inheritable policies, including all TMS policies, but not necessarily for access control policies, like row filtering.
In case of inheritable policies, users will need to disable a certain type of policies at a lower level, as it's able to override the upper-level polices. While it isn't mandatory for non-inheritable policies. In case of disabling, we could remove the mapping.
| "version": { | ||
| "type": "string", | ||
| "const": "2025-02-03", | ||
| "description": "Schema version." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we use date instead of package version, does this mean the pre-defined policy type schema (which evolves through community-driven changes) will be released separately from polaris-core, and we'll need to track which Polaris version supports which schema version? If that's the case, would it make more sense to move this outside of polaris-core?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should still release alongside Polaris but it can be helpful to have it versioned separately. Just like the API endpoints have their own versioning #
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on release along with Polaris. Much like endpoint evolution, the Polaris version upgrades should cover policy schema version upgrades, as the later is usually much slower. It requires both servers and clients to be ready.
I'm OK to move it out of Polaris core, but it doesn't seem necessary at this moment. We could do that if things changed.
| "config": { | ||
| "compaction_strategy": "bin-pack", | ||
| "max-concurrent-file-group-rewrites": 5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these fields be pre-defined in the config schema? For example
"config": {
"type": "object",
"description": "A map containing custom configuration properties. Please note that interoperability is not guaranteed.",
"properties": {
"compaction-strategy": {"type" : "string"},
"max-concurrent-file-group-rewrites": {"type": "number"}
},
"additionalProperties": {}
}I think it will be good to "reserve" these names for reference and standardization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could reserve certain names, but each reserved name would become part of the spec, making it very hard to remove later. This would lead to prolonged debate about which names to reserve. On the upside, these reserved keys would ensure interoperability. I prefer to keep things simple first, adding keys if they are proved truly necessary. I'm also open to any suggestion of reserved keys.
eric-maynard
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schema seems reasonable to me, though I'm still a little confused about interoperability
| }, | ||
| "config": { | ||
| "type": "object", | ||
| "description": "A map containing custom configuration properties. Please note that interoperability is not guaranteed.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this is where the whole thing sort of falls apart. If TMS X and TMS Y will have different interpretations of the same policy, what's the point of defining a policy's schema so strictly?
And what if I want to implement TMS Z, which has some new policy type that's not known by the service. Do I have to fork Polaris and add a new policy schema? What will other TMSs do when they see this unknown policy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agreed that we should promote interoperability as much as possible, that's the point of the policy schema. However, there are use cases to support optional features of each TMS, e.g., TMS X has a optional feature f1, only a small set of tables need it, users can select f1 via config map. TMS Y should ignore f1 and is OK to ignore it.
In case of a new policy type, a new schema and a new validator have to be introduced. Check design doc section "Policy Content Schema" and "Policy Validation" for more details. Other TMSs or engines need to upgrade to recognize it. It's like adding a new endpoint in a web service. To be clear, we are talking about new schemas, which is supposed to be rare, engines can always recognize new policies with old schemas.
omarsmak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @flyrain for being flexible on the schema, I appreciate it 😊
HonahX
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
Thanks @omarsmak @HonahX @eric-maynard for the review! |
This PR adds a policy schema for data compaction. Please refer to the design doc for more context.
Check out #938 if you want to know more about how a policy is validated.
The policy schemas will be shipped within the
polaris-corejar file, so that engines or applications can reuse thepolaris-corefor related functionallities, like schema validation.The policy will also be pushed in the Apache Polaris website via a link like https://polaris.apache.org/schemas/policies/system/data-compaction/2025-02-03.json, so that it is easy to refer to.
cc @HonahX @jackye1995 @jbonofre @eric-maynard @dennishuo @collado-mike @RussellSpitzer