Skip to content

Conversation

@sdd
Copy link
Contributor

@sdd sdd commented Apr 4, 2024

This trait is used for BoundPredicate visitors that evaluate the predicate, and associated tests. It has been broken out of the #241 PR as that was getting too large.

@sdd sdd force-pushed the add-bound-predicate-evaluator branch 4 times, most recently from 05e6a54 to 8c6d461 Compare April 4, 2024 22:31
@sdd sdd force-pushed the add-bound-predicate-evaluator branch 3 times, most recently from 7fbf2a3 to 1b95080 Compare April 4, 2024 23:07
@sdd sdd changed the title Add BoundPredicateEvaluator trait Add BoundPredicateVtor tisirait Apr 4, 2024
@sdd sdd changed the title Add BoundPredicateVtor tisirait Add BoundPredicateVisitor trait Apr 4, 2024
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd for this pr! I've left some small suggestions to improve.

@sdd sdd force-pushed the add-bound-predicate-evaluator branch from 1b95080 to 77653b1 Compare April 5, 2024 07:42
fn or(&mut self, lhs: Self::T, rhs: Self::T) -> Result<Self::T>;
fn not(&mut self, inner: Self::T) -> Result<Self::T>;

fn is_null(&mut self, reference: &BoundReference) -> Result<Self::T>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this approach is that when we add an operator, we need to modify the trait. We can still keep it compatible with default function, but the actual behavior may break when user upgrades library. cc @viirya @Xuanwo @marvinlanhenke What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add default implementations that return something like Err(OperatorNotImplemented), then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add default implementations that return something like Err(OperatorNotImplemented), then if we add an operator, wouldn't existing code work for existing operators and only break if users then try to use the new operator without having updated any visitors they have to use the new operator?

Yes, I agree with that.

Another option is to use following method:

trait BoundPredicateVisitor {
  fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {}
}

This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.

I think the two approaches have different pros and cons, let's wait to see others's comments.

Copy link
Contributor

@marvinlanhenke marvinlanhenke Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to use following method:

trait BoundPredicateVisitor {
  fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, results: Vec<Self::T>) -> Result<Self::T> {}
}

I think I'm in favor of keeping the trait as generic as possible - so we don't have to modify it, if we have to support a new operator.

When implementing this trait (no matter which variant) I'd most likely have to overwrite the default behaviour anyway? So I wouldn't bother providing any in the first place and keep the trait as generic and future proof as possible? If that makes any sense to you?

This way we can force user to think about added new operator, otherwise it will throw compiler error, but this may break things when we add new operator.

...wouldn't I have match op anyway, so having a default match arm in the implementation would solve the issue for the downstream user?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...wouldn't I have match op anyway, so having a default match arm in the implementation would solve the issue for the downstream user?

I think this is the advantage of the op approach, it's left to user to decide what to do. If they want new op to be ignored, they could use a default match arm, otherwise they could choose to do pattern match against each operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how this will work with just visitor_op.

The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference) would work for a unary operator.

A binary operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literal: &Datum), and a set operator would need fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &FnvHashSet<Datum>).

To resolve this, you'd need the visitor to have fn unary_op(...), fn binary_op(...), and fn set_op(...).

Is that what we want?

Personally I think this is another example of where the current type hierarchy feels awkward - there are other places where we bump into these type hierarchy mismatches, such as needing a default leg in a match on operator to handle operators that will never show up, such as binary ops in a unary expression.

Something more like this feels better to me:

pub enum Predicate {
    // By not having Unary / Binary / Set here, and removing
    // that layer in favour of having the ops themselves
    // as part of this enum, there is no need for code that
    // raises errors when the wrong type of `Op` is used, since those
    // issues are prevented by the type system at compile time

    // Unary ops now only contain one param, the term, and so can be
    // implemented more simply
    IsNull(Reference),
    // ... other Unary ops here,

    // Binary / Set ops ditch the `BinaryExpression`, with the literal
    // and reference being available directly
    GreaterThan {
        literal: Datum,
        reference: Reference,
    },
    // ... other Binary ops here,
    
    In {
        literals: HashSet<Datum>,
        reference: Reference,
    },
    // ... other Set ops here
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the idea of going further and having an Expression enum that contains the logical operators plus Predicate above, with Reference from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.

pub enum Expression {
    // The top level would be Expression rather than Predicate.
    // The logical operators would live at this level.

    // Since logical operators such as And / Or / Not
    // are common between bound and unbound, some visitors
    // such as rewrite_not can be applied to both bound and
    // unbound expressions, preventing duplication and increasing
    // flexibility

    /// An expression always evaluates to true.
    AlwaysTrue,

    /// An expression always evaluates to false.
    AlwaysFalse,

    // Here we use a struct variant to eliminate the need
    // for a generic `LogicalExpression`, which did not
    // add much, also allowing us to refer to lhs and rhs
    // by name

    /// An expression combined by `AND`, for example, `a > 10 AND b < 20`.
    And {
        lhs: Box<Expression>,
        rhs: Box<Expression>
    },

    /// An expression combined by `OR`, for example, `a > 10 OR b < 20`.
    Or {
        lhs: Box<Expression>,
        rhs: Box<Expression>
    },

    /// An expression combined by `NOT`, for example, `NOT (a > 10)`.
    Not(Box<Expression>),

    /// A Predicate
    Predicate(Predicate),
}

pub enum Reference {
    // Only at the leaf level do we distinguish between
    // bound and unbound. This allows things like bind
    // and unbind to be implemented as a transforming visitor,
    // and avoids duplication of logic compared to having
    // Bound / Unbound being distinguished at a higher level
    // in the type hierarchy

    Bound(BoundReference),
    Unbound(UnboundReference),
}

pub struct UnboundReference {
    name: String,
}

pub struct BoundReference {
    column_name: String,
    field: NestedFieldRef,
    accessor: StructAccessor,
}

We could use type alises such as type BoundExpression = Expression and have bind return a BoundExpression to prevent bound expressions being used where unbound ones are required and vice versa. But I'm digressing very far from the original comment now, sorry! 😆

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed function signature fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference) would work for a unary operator.

I think following signature would be applied to all operators:

 fn visitor_op(&mut self, op: &PredicateOperator, reference: &BoundReference, literals: &[Datum])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the idea of going further and having an Expression enum that contains the logical operators plus Predicate above, with Reference from the above being an enum over bound or unbound. This allows visitors to potentially work over both bound and unbound expressions.

Hi, @sdd Would you mind to open an issue to raise a discussion about this idea? I think it looks promsing to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still wouldn't work without collecting the set into a vec first for set ops. I've updated the PR with an alternative based on an enum, see what you think.

@sdd sdd force-pushed the add-bound-predicate-evaluator branch 2 times, most recently from 9ae2e80 to 9c4bef4 Compare April 6, 2024 11:18
Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sdd for this PR. I took a look and left some minor comments.

Regarding your design idea about Predicate and Expression -
I also think it looks promising, but I still have some doubts about removing the 'layer' of Unary, Binary, SetExpression and have the PredicateOperator directly inside the Predicate.

In #264 for example, I projected the transform based on the 'group of ops' e.g. Unary, Binary or Set. If we remove this layer I'd have to match each op and handle them individually. The extra layer of indirection was helping me in this use-case. Also what about adding, removing changing ops? With an extra layer we should have an easier time modifying the layer below without affecting the downstream consumers, that otherwise match the ops directly and could be breaking?

However, I still like your approach and I think an extra issue where we can discuss this would be very useful. I think more and different views will definitely help (since I have a very limited view and experience on the codebase after all).

@sdd sdd force-pushed the add-bound-predicate-evaluator branch 3 times, most recently from 9b491ea to 3716d7b Compare April 7, 2024 09:48
@sdd
Copy link
Contributor Author

sdd commented Apr 7, 2024

I've added a test for visit_op, as well as some doc comments.

Also, based on our discussion in the comments, it seemed prudent to add the non_exhaustive marker to PredicateOperator? PTAL @liurenjie1024, @marvinlanhenke, @ZENOTME as I think this is pretty much ready now

@sdd sdd force-pushed the add-bound-predicate-evaluator branch 2 times, most recently from a295f83 to 033994a Compare April 8, 2024 06:47
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @sdd 🙌 Would be great to get this in, since it is blocking a few follow-up PRs. Thanks for breaking them out!


visitor.not(inner_result)
}
BoundPredicate::Unary(expr) => visitor.op(expr.op(), expr.term(), None, predicate),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PyIceberg we've split out the operator in a per-operator basis: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L256

The InclusiveMetricsEvaluator is already quite a beefy visitor, and by having it on this level, 95% will fall in a the op method: https://github.com/apache/iceberg-python/blob/4b911057f13491f30f89f133544c063133133fa5/pyiceberg/expressions/visitors.py#L1137
I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set} level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the nicest way since it leverages the visitor pattern as much as possible, but I also saw the discussion of having this on the {unary,binary,set} level.

If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.

With Python it is easy to have a cache-all where if nothing matches, it just goes to that one, and then we raise an exception that we've encountered an unknown predicate. This way it is possible to add new expressions, but I don't think that would happen too often.

With the approach in the PR, if a new Op gets added, you have three options available when implementing BoundPredicateVisitor for a struct (if I remove the #[non_exhaustive] macro from PredicateOperator):

  1. Exhaustively specify a match arm for every Op. If another Op gets added, your BoundPredicateVisitor will no longer compile
  2. Specify a default match arm, and call panic!() in it. When a new Op gets added, your code still compiles but you get a runtime crash if the visitor encounters the new operator
  3. Specify a default match arm, and return Err(NotImplemented) in it. When a new Op gets added, your code still compiles but you will get a handleable error if you encounter the new operator - potentially failing an invocation but not taking the service down.

That provides the capability that you mention from Python, as well as other options.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdd Sorry for late reply, I think we are quite close to the final version. I found another way to write op method with generics, what do you think? Others LGTM, thanks!

/// or SetPredicate. Passes the predicate's operator in all cases,
/// as well as the term and literals in the case of binary and aet
/// predicates.
fn op(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @sdd How about we try rewrite it like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Renjie! Thanks for the comment. I'm not totally convinced with your suggestion - we're trying to get the best performance, but by passing Set literals through as an iterator, rather than passing a reference to the HashSet, it requires either collecting the iterator, resulting in an allocation, or iterating over potentially the whole iterator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generating an iterator for hashset will not introduce allocation, but you are right that user may need to iterating the whole iterator to implement their logic. Method in this pr seems kind of overdesign and counterintutive for user to implement. In this case I think your original design seems more intuitive to me, as @Fokko said adding operator maybe not too frequent. Sorry for the back and forth, I didn't expect this approach to be so complicated and introduce so much overhead.

cc @Fokko @marvinlanhenke @Xuanwo What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - here's another PR with the original design. I'll let you guys choose which one to merge :-) #334.

Here's how the inclusive projection would end up looking with this design too: #335

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is the original design, which is (confusingly) in the "alternative" PRs 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sold on the iterator idea either, it looks very fancy, but iterating over a set feels awkward to me.

If you're not happy with this approach @Fokko, I'll wait until the three of you are in agreement and fix up to whatever you all decide on. I'm happy with the approach as-is after switching my original design to the approach proposed by @liurenjie1024 and @marvinlanhenke.

Well, I'm not the one to decide here, just weighing in my opinion :D I have a strong preference for the original approach as suggested in #334 I think this leverages the visitor as much as possible, and also makes sure that all the operations are covered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also +1 for the original approach after comparing these two approaches.

@sdd sdd force-pushed the add-bound-predicate-evaluator branch from 033994a to c86c768 Compare April 16, 2024 06:45
@liurenjie1024
Copy link
Contributor

Close by #334

@sdd sdd deleted the add-bound-predicate-evaluator branch August 2, 2024 06:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants