Skip to content

Conversation

@collado-mike
Copy link
Contributor

Description

Custom write.data.path and write.metadata.path are currently blocked, preventing users from following the guidelines in https://iceberg.apache.org/docs/1.5.1/aws/#object-store-file-layout . This updates the blocking to allow for customized write paths with the existing ALLOW_UNSTRUCTURED_TABLE_LOCATION configuration flag. This also adds a test using the object store location provider feature.

Fixes #113

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

New pytest for S3

  • Test A
  • Test B

Test Configuration:

  • Firmware version:
  • Hardware:
  • Toolchain:
  • SDK:

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

@snazy
Copy link
Member

snazy commented Aug 23, 2024

We had to tackle the same problem (object-storage layout) in Nessie, but also wanted to retain the ability to control the table's location for data security reasons and generating IAM policies.

Technically, there's sadly not only write.(meta)data.path, but also the (deprecated) write.object-storage.path and write.folder-storage.path.

The solution in Nessie to the "object-storage layout problem" is: If write.object-storage.enabled == true, then we return the "right" write.data.path value for the table's location (and remove the deprecated properties). Users only have to set write.object-storage.enabled = true and the catalog's doing the "right" thing. This way we can still generate IAM policies that are compatible to the object-storage layout.

@collado-mike
Copy link
Contributor Author

The solution in Nessie to the "object-storage layout problem" is: If write.object-storage.enabled == true, then we return the "right" write.data.path value for the table's location (and remove the deprecated properties).

By "right" write.data.path, you mean still under catalog_root/namespace/table_dir? That's specifically what this change is avoiding. If a catalog specifically declares ALLOW_UNSTRUCTURED_TABLE_LOCATION, we can allow any table location under the catalog root so that if there are many tables, writes and reads can be distributed (somewhat) evenly across the bucket. But setting that configuration is the admin's way of telling us they care more about the even distribution than they do scoped credential vending.

A user should be able to set the write.data.path to something within the table's subdirectory without setting ALLOW_UNSTRUCTURED_TABLE_LOCATION and I think they'll get what you described - object-storage layout scoped to within the table's subdirectory so that credential vending is still sub-scoped to the table. I can add a test to verify that use case.

@collado-mike
Copy link
Contributor Author

@snazy I added an extra test verifying that object-store layout already works under the table's default location. I think this should work for anyone who really wants credential-vending to be specifically scoped to the table location, while the extra configuration allows for a less strict solution for use cases that want to trade credential vending for performance/scale

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good but I wonder if we need a way to add even more locations, the basic issue is this only represents the "latest" location that data or metadata could be written to and a user could have set a different value to these properties in the past.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the purpose of the allowedLocations field of the storageConfiguration. Currently, we can't set table-level StorageConfiguration, but part of the reason for implementing the StorageConfigurationOverride is to help pave the way toward that.

Copy link
Member

@RussellSpitzer RussellSpitzer Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computeIfPresent May work here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if Map had a ifPresent(Consumer<T>) method...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check Optional.ofNullable to simplify this ternary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scratch that, l is a list here, ... hmm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like to avoid always doing a storage configuration override we are using the Optional path below, It just feels like it may be simpler to just always add Locs to the Config?

Or we could just do

if (userSpecifiedWriteLocations.notEmpty)
return Storage Override
else
configInfo

I'm not sure the functional style here makes it much more readable, One of those things that I think would be much cleaner in Scala.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah I think we can just always add the StorageConfigurationOverride with the empty locs if not set

RussellSpitzer
RussellSpitzer previously approved these changes Aug 26, 2024
Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a High level I think this is is a great idea, I do think we should probably reduce some of the functional approach here which I think is a little complicated for what we are trying to accomplish. If we were in Scala I feel like it would be very simple but in Java we have to do the List -> Option swap which I don't think is very easy to read.

I don't have a strong feeling about that though if other engineers want to go with this style everywhere in the project.

There are some references to "snowman" in the test file which I think we may need to change now, but I see there are other references there already so this isn't a deal breaker for me.

Also for future problems, we probably will need to consider arbitrary additional paths but i'm not sure how we can do that at the moment.

@collado-mike collado-mike force-pushed the mcollado-test-objectstore-locationprovider branch from 64e0624 to 0c32a7a Compare August 27, 2024 17:54
@RussellSpitzer RussellSpitzer merged commit 1cdac7b into apache:main Aug 29, 2024
@liamzwbao liamzwbao mentioned this pull request Mar 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE REQUEST] add support for random prefixes (MURMUR3 S3 hash)

5 participants