Skip to content

Enhance string concat coercion to support castable types #16510

@osipovartem

Description

@osipovartem

Is your feature request related to a problem or challenge?

Related to #12709

Currently, in BinaryTypeCoercer, we have the following logic for the StringConcat operator:

  StringConcat => {
      string_concat_coercion(self.lhs, self.rhs).map(Signature::uniform).ok_or_else(|| {
          plan_datafusion_err!(
              "Cannot infer common string type for string concat operation {} {} {}", self.lhs, self.op, self.rhs
          )
      })
  }

The internal string_concat_coercion function requires at least one side to be a string-like type (Utf8, LargeUtf8, Utf8View) and the other side to be castable to string:

/// Coercion rules for string concat.
/// This is a union of string coercion rules and specified rules:
/// 1. At least one side of lhs and rhs should be string type (Utf8 / LargeUtf8)
/// 2. Data type of the other side should be able to cast to string type
fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
    use arrow::datatypes::DataType::*;
    string_coercion(lhs_type, rhs_type).or(match (lhs_type, rhs_type) {
        (Utf8View, from_type) | (from_type, Utf8View) => {
            string_concat_internal_coercion(from_type, &Utf8View)
        }
        (Utf8, from_type) | (from_type, Utf8) => {
            string_concat_internal_coercion(from_type, &Utf8)
        }
        (LargeUtf8, from_type) | (from_type, LargeUtf8) => {
            string_concat_internal_coercion(from_type, &LargeUtf8)
        }
        (Dictionary(_, lhs_value_type), Dictionary(_, rhs_value_type)) => {
            string_coercion(lhs_value_type, rhs_value_type).or(None)
        }
        _ => None,
    })
}

However, this implementation fails if both operands are non-string types that can still be casted to string, for example:

select 1||2;
+----------------------+
| Int64(1) || Int64(2) |
|----------------------|
| 12                   |
+----------------------+

This use case could be safely supported by leveraging existing TypeCoercionRewriter, which is already responsible for inserting casts during logical plan rewriting:

  Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
      let (left, right) =
          self.coerce_binary_op(*left, self.schema, op, *right, self.schema)?;
      Ok(Transformed::yes(Expr::BinaryExpr(BinaryExpr::new(
          Box::new(left),
          op,
          Box::new(right),
      ))))
  }

Describe the solution you'd like

Proposal
We should update string_concat_coercion to also accept types where both sides are castable to string, not just one side being a string already.
This would allow expressions like 1 || 2 to succeed by coercing both operands to strings prior to concatenation.
This change aligns with the behavior in SQL engines like PostgreSQL and makes coercion rules for || more intuitive and consistent with general type coercion strategy.

fn string_concat_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
    use arrow::datatypes::DataType::*;
    string_coercion(lhs_type, rhs_type).or(match (lhs_type, rhs_type) {
        (Utf8View, from_type) | (from_type, Utf8View) => {
            string_concat_internal_coercion(from_type, &Utf8View)
        }
        (Utf8, from_type) | (from_type, Utf8) => {
            string_concat_internal_coercion(from_type, &Utf8)
        }
        (LargeUtf8, from_type) | (from_type, LargeUtf8) => {
            string_concat_internal_coercion(from_type, &LargeUtf8)
        }
        (Dictionary(_, lhs_value_type), Dictionary(_, rhs_value_type)) => {
            string_coercion(lhs_value_type, rhs_value_type).or(None)
        }
        _ =>
            if can_cast_types(lhs_type, &Utf8) && can_cast_types(rhs_type, &Utf8) {
                Some(Utf8)
            } else if can_cast_types(lhs_type, &LargeUtf8) && can_cast_types(rhs_type, &LargeUtf8) {
                Some(LargeUtf8)
            } else {
                None
            }
    })
}

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions