@@ -404,21 +404,26 @@ abstract class HashExpression[E] extends Expression {
       input: String,
       result: String,
       fields: Array[StructField]): String = {
+    val tmpInput = ctx.freshName("input")
     val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
-      nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
+      nullSafeElementHash(tmpInput, index.toString, field.nullable, field.dataType, result, ctx)
     }
     val hashResultType = CodeGenerator.javaType(dataType)
-    ctx.splitExpressions(
+    val code = ctx.splitExpressions(
       expressions = fieldsHash,
       funcName = "computeHashForStruct",
-      arguments = Seq("InternalRow" -> input, hashResultType -> result),
+      arguments = Seq("InternalRow" -> tmpInput, hashResultType -> result),
       returnType = hashResultType,
       makeSplitFunction = body =>
         s"""
            |$body
            |return $result;
          """.stripMargin,
       foldFunctions = _.map(funcCall => s"$result = $funcCall;").mkString("\n"))
+    s"""
+       |final InternalRow $tmpInput = $input;
Contributor

Note: we can avoid creating a new variable if the input is already a variable. I think this can be done after we fully adopt the new codegen infra from @viirya.

Contributor Author

Yes, I agree; we can improve this in the future.
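
The follow-up discussed in this thread is not part of this PR. As a rough sketch of the idea only: if `input` is already a plain Java identifier, it could be reused directly, and a fresh local would be declared only otherwise. `isJavaIdentifier` and `bindInput` are hypothetical helpers made up for illustration (they are not Spark APIs), and the example expression text is an assumption.

```scala
object BindInputSketch {
  // Hypothetical helper: true when `s` looks like a plain Java identifier.
  // Reserved words are not excluded; this is only a sketch.
  def isJavaIdentifier(s: String): Boolean =
    s.nonEmpty &&
      Character.isJavaIdentifierStart(s.charAt(0)) &&
      s.drop(1).forall(c => Character.isJavaIdentifierPart(c))

  // Hypothetical helper: returns the name to use in the generated code and
  // the declaration (possibly empty) that must be emitted before it.
  def bindInput(input: String, freshName: String): (String, String) =
    if (isJavaIdentifier(input)) (input, "")
    else (freshName, s"final InternalRow $freshName = $input;")
}
```

With this shape, a caller would emit the returned declaration ahead of the split-function calls and pass the returned name in `arguments`.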

+       |$code
+     """.stripMargin
   }

   @tailrec
@@ -778,21 +783,22 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
       input: String,
       result: String,
       fields: Array[StructField]): String = {
+    val tmpInput = ctx.freshName("input")
Member

@wangyum wangyum Aug 10, 2018

Does HiveHash have this issue? Can you add a test?

Contributor Author

It seems HiveHash cannot be triggered in the normal way, because Spark uses Murmur3Hash.
But this function does have the same issue. You can hack Spark to test it this way.
In HashPartitioning, change:

  def partitionIdExpression: Expression = Pmod(new Murmur3Hash(expressions), Literal(numPartitions))

to

  def partitionIdExpression: Expression = Pmod(new HiveHash(expressions), Literal(numPartitions))

Then run this test:

  val df = spark.range(1000)
  val columns = (0 until 400).map{ i => s"id as id$i" }
  val distributeExprs = (0 until 100).map(c => s"id$c").mkString(",")
  df.selectExpr(columns : _*).createTempView("test")
  spark.sql(s"select * from test distribute by ($distributeExprs)").count()

     val childResult = ctx.freshName("childResult")
     val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
       val computeFieldHash = nullSafeElementHash(
-        input, index.toString, field.nullable, field.dataType, childResult, ctx)
+        tmpInput, index.toString, field.nullable, field.dataType, childResult, ctx)
       s"""
          |$childResult = 0;
          |$computeFieldHash
          |$result = (31 * $result) + $childResult;
        """.stripMargin
     }

-    s"${CodeGenerator.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
+    val code = ctx.splitExpressions(
       expressions = fieldsHash,
       funcName = "computeHashForStruct",
-      arguments = Seq("InternalRow" -> input, CodeGenerator.JAVA_INT -> result),
+      arguments = Seq("InternalRow" -> tmpInput, CodeGenerator.JAVA_INT -> result),
       returnType = CodeGenerator.JAVA_INT,
       makeSplitFunction = body =>
         s"""
@@ -801,6 +807,11 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
            |return $result;
          """.stripMargin,
       foldFunctions = _.map(funcCall => s"$result = $funcCall;").mkString("\n"))
+    s"""
+       |final InternalRow $tmpInput = $input;
+       |${CodeGenerator.JAVA_INT} $childResult = 0;
+       |$code
+     """.stripMargin
   }
 }
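
To see why binding `input` to a fresh local matters, here is a minimal model of how a split function's signature could be assembled from the `arguments` pairs. `splitFunctionSignature` is a hypothetical stand-in written for this illustration, not Spark's actual `splitExpressions`, and the expression text `row.getStruct(0, 2)` is an assumption about what a non-variable `input` might look like.

```scala
object SplitSignatureSketch {
  // Hypothetical stand-in: declares each (javaType -> name) pair as a
  // parameter of the generated helper function.
  def splitFunctionSignature(
      funcName: String,
      returnType: String,
      arguments: Seq[(String, String)]): String = {
    val params = arguments.map { case (tpe, name) => s"$tpe $name" }.mkString(", ")
    s"private $returnType $funcName($params)"
  }

  def main(args: Array[String]): Unit = {
    val input = "row.getStruct(0, 2)" // assumed: an expression, not a variable name

    // Without the fresh local, the expression text lands where a parameter
    // name is required, which is not valid Java.
    println(splitFunctionSignature("computeHashForStruct", "int",
      Seq("InternalRow" -> input, "int" -> "result")))

    // With the fresh local, the expression is evaluated once and only the
    // variable name is passed on to the split functions.
    val tmpInput = "input_0" // stands in for ctx.freshName("input")
    println(s"final InternalRow $tmpInput = $input;")
    println(splitFunctionSignature("computeHashForStruct", "int",
      Seq("InternalRow" -> tmpInput, "int" -> "result")))
  }
}
```

In this model the first signature cannot compile as Java, while the second can, which is the effect the `tmpInput` change aims for in both `HashExpression` and `HiveHash`.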

12 changes: 12 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -2831,4 +2831,16 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-25084: 'distribute by' on multiple columns may lead to codegen issue") {
+    withView("spark_25084") {
+      val count = 1000
Contributor

nit: we can probably inline this in the next line

Contributor Author

How would we inline it? We still use it in the assert.

      assert(
        spark.sql(s"select * from spark_25084 distribute by ($distributeExprs)").count()
          === count)

Contributor

Sorry, disregard the comment. Thanks.

+      val df = spark.range(count)
+      val columns = (0 until 400).map{ i => s"id as id$i" }
+      val distributeExprs = (0 until 100).map(c => s"id$c").mkString(",")
+      df.selectExpr(columns : _*).createTempView("spark_25084")
+      assert(
+        spark.sql(s"select * from spark_25084 distribute by ($distributeExprs)").count === count)
+    }
+  }
 }
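
For readers who want to see what the new test feeds to Spark, the string-building steps can be run without a Spark session. This is only a sketch: `DistributeBySketch` and its parameterized sizes are made up for illustration, while the expressions themselves mirror the test (which uses 400 projected columns and 100 `distribute by` columns).

```scala
object DistributeBySketch {
  // Projection expressions passed to selectExpr, e.g. "id as id0".
  def columns(n: Int): Seq[String] = (0 until n).map(i => s"id as id$i")

  // Comma-separated column list placed inside the parentheses.
  def distributeExprs(n: Int): String = (0 until n).map(c => s"id$c").mkString(",")

  // Full query text, with the column list wrapped in parentheses as in the test.
  def query(view: String, n: Int): String =
    s"select * from $view distribute by (${distributeExprs(n)})"

  def main(args: Array[String]): Unit = {
    println(columns(2).mkString("; "))
    println(query("spark_25084", 3))
  }
}
```

The parenthesized column list is the detail the test title points at; the changes above are in `computeHashForStruct`, which suggests the list is hashed as a single struct value.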