add blog: apache doris and polaris integration #2571
dimas-b
left a comment
Thanks for your contribution, @morningman! The guide looks pretty useful to me! I have some thoughts about the location for this page (below).
I do not see any easy links to existing blog posts from the main project site 😅
PR to add it, #2575
With the continuous evolution of data lake technologies, efficiently and securely managing massive datasets stored on object storage (such as AWS S3) while providing unified access endpoints for upstream analytics engines (like [Apache Doris](https://doris.apache.org)) has become a core challenge in modern data architectures. [Apache Polaris](https://polaris.apache.org/), as an open and standardized REST Catalog service for Iceberg, provides an ideal solution to this challenge. It not only handles centralized metadata management but also significantly enhances data lake security and manageability through fine-grained access control and flexible credential management mechanisms.

This document will provide a detailed guide on integrating Apache Doris with Polaris to achieve efficient querying and management of Iceberg data on S3. We'll guide you through the complete process from environment preparation to final data querying step by step
I wonder if it may be worth placing this page under the docs section > Getting Started... WDYT?
-- Enable credential vending
'iceberg.rest.vended-credentials-enabled' = 'true',
-- S3 basic configuration (no keys required)
's3.endpoint' = 'https://s3.us-west-2.amazonaws.com',
side note: Starting with 1.1.0 Polaris can provide endpoints to clients automatically. Cf. #1913
Thanks for pointing this out! This is actually a limitation on the Doris side — we currently need to recognize the storage type through an explicit parameter. We’ll look into improving this in the future.
flyrain
left a comment
+1 Thanks a lot for working on it, @morningman! This blog not only showcases how Apache Doris works together with Polaris, but also demonstrates a detailed end-to-end setup. It'd be super helpful for anyone who wants to try a similar deployment.
### 2. Polaris Deployment and Catalog Creation

With the environment ready, we'll now deploy the Polaris service and configure the Iceberg Catalog.
[not a blocker] Wondering if you had a chance to look at this script in the repo:
https://github.com/apache/polaris/blob/main/getting-started/assets/cloud_providers/deploy-aws.sh
It automatically sets up the Polaris environment, including bucket creation etc. Wondering if that is something we can leverage.
adnanhemani
left a comment
Hi @morningman, thanks for this contribution - but I'm against the blog as it stands in the PR now for a few different reasons:
- You've rewritten the instructions for most of what is already available here and here. For maintenance of the instructions in this PR, we should really put the Getting Started instructions for Doris within this section - similar to how we have it for Spark and Trino. That way we can reuse the existing flows for all cloud providers as well.
- There are no instructions currently for how to start Apache Doris. Personally, I find it important to see the instructions on how to start Apache Doris included somewhere as well - not all users will have instance(s) of it already. We should either add them to the existing Kubernetes Docker Compose files (if Doris has a prebuilt image) or describe how to install it locally within this section here.
I'm excited to see Apache Doris' Getting Started flow added to our documentation - but would prefer if we conform to the formats we've already created to maintain uniformity :)
adnanhemani
left a comment
Chatting with @flyrain offline - I still believe we should be augmenting the Getting Started documentation rather than publishing a stand-alone blog. IMO it is a much better ROI to do that than to release a blog stating that we've updated the documentation with that information. (Or to publish the blog in addition to adding it to the Getting Started documentation.) I strongly believe that keeping Apache Doris congruent with other query engines would be a win-win for both Doris and Polaris.
But I understand that this is a PR solely for blog purposes. Approving here to remove the "Request Changes", but I still highly recommend making the changes for the Getting Started documentation as a higher priority.
I’m fine with following the same rules as other engines. For now, this blog is mainly meant to showcase an integration case (not a "document"). Later on, I can also improve the current “Getting Started” guide to include Doris as another query engine, similar to Spark or Trino.
Awesome! I'm excited to see Doris as another engine in the doc. We could make it happen in another PR. Glad @adnanhemani pointed it out. Thanks a lot @morningman for adding the blog! Thanks @dimas-b @singhpk234 @adnanhemani for the review!
Add blog about how to integrate Apache Doris with Polaris