[SPARK-8850] [SQL] Enable Unsafe mode by default #7564

JoshRosen · 2015-07-21T07:00:17Z

This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:

List of fixed blockers:

Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly ([SPARK-9411] [SQL] Make Tungsten page sizes configurable #7741).
Memory leak on grouped aggregation of empty input (fixed by [SPARK-8357] Fix unsafe memory leak on empty inputs in GeneratedAggregate #7560 to fix this)
Update planner to also check whether codegen is enabled before planning unsafe operators.
Investigate failing HiveThriftBinaryServerSuite test. This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD. This could be fixed by [SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use newOrdering in more places #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by [SPARK-9266] Prevent "managed memory leak detected" exception from masking original exception #7603).
https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF. This is necessary for UDFSuite to pass. For now, I've just ignored this test in order to try to find other problems while we wait for a fix. This is no longer necessary as of [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow. #7682.
Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
Tests in AggregationQuerySuite are failing due to NaN-handling issues in UnsafeRow, which were fixed in [SPARK-9421] Fix null-handling bugs in UnsafeRow.getDouble, getFloat(), and get(ordinal, dataType) #7736.
org.apache.spark.sql.ColumnExpressionSuite.rand needs to be updated so that the planner check also matches TungstenProject.
After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
- Wrong answer in join_1to1 (fixed by [SPARK-9364] Fix array out of bounds and use-after-free bugs in UnsafeExternalSorter #7680)
- Wrong answer in join_nulls (fixed by [SPARK-9364] Fix array out of bounds and use-after-free bugs in UnsafeExternalSorter #7680)
- Managed memory OOM / leak in lateral_view
- Seems to hang indefinitely in partcols1. This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by [SPARK-9393] [SQL] Fix several error-handling bugs in ScriptTransform operator #7710.
- Error while freeing memory in partcols1: will be fixed by [SPARK-9419] ShuffleMemoryManager and MemoryStore should track memory on a per-task, not per-thread, basis #7734.
After fixing the partcols1 hang, it appears that a number of later tests have issues as well.
Fix thread-safety bug in codegen fallback expression evaluation ([SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances. #7759).

JoshRosen · 2015-07-21T07:17:32Z

Found an interesting problem where the "exception on memory leak" configuration can lead to task failure exceptions being masked. Going to push a fix to this branch, which I may end up spinning off into a separate PR depending on how things go.

SparkQA · 2015-07-21T07:21:42Z

Test build #37925 has finished for PR 7564 at commit ce55b32.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-21T07:35:02Z

Looks like there’s a problem with using ScalaUDFs in the unsafe GeneratedAggregate path. It looks like there’s a case where we end up using a BoundReference expression to extract fields before passing them to the UDF. This ends up calling the generic Row.apply(), which returns data of the wrong type when called on an UnsafeRow (specifically, the failing test tries to get a Long column but ends up getting a byte array back).

One way to fix this would be to implement code generation for ScalaUDF so that the field access expressions are code generated. Edit: this is covered by https://issues.apache.org/jira/browse/SPARK-9162.

SparkQA · 2015-07-21T10:28:52Z

Test build #37928 timed out for PR 7564 at commit a49ca2a after a configured wait of 175m.

SparkQA · 2015-07-21T10:43:52Z

Test build #37931 timed out for PR 7564 at commit e872840 after a configured wait of 175m.

JoshRosen · 2015-07-21T17:27:46Z

In this latest round of tests, it looks like HiveThriftBinaryServerSuite failed a test:

[info] - test multiple session *** FAILED *** (2 seconds, 488 milliseconds)
[info]   java.sql.SQLException: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
[info] UnsafeExternalSort [KEY#270 DESC], true, 0
[info]  ConvertToSafe
[info]   Exchange rangepartitioning(KEY#270 DESC)
[info]    UnsafeExternalSort [key#270 DESC], true, 0
[info]     Exchange rangepartitioning(key#270 DESC)
[info]      HiveTableScan [key#270], (MetastoreRelation default, test_map, None)
[info]   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:275)
[info]   at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:355)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite$$anonfun$8$$anonfun$apply$mcV$sp$13.apply(HiveThriftServer2Suites.scala:327)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite$$anonfun$8$$anonfun$apply$mcV$sp$13.apply(HiveThriftServer2Suites.scala:324)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest$$anonfun$withMultipleConnectionJdbcStatement$1.apply(HiveThriftServer2Suites.scala:438)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest$$anonfun$withMultipleConnectionJdbcStatement$1.apply(HiveThriftServer2Suites.scala:438)
[info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.withMultipleConnectionJdbcStatement(HiveThriftServer2Suites.scala:438)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite$$anonfun$8.apply$mcV$sp(HiveThriftServer2Suites.scala:209)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite$$anonfun$8.apply(HiveThriftServer2Suites.scala:204)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite$$anonfun$8.apply(HiveThriftServer2Suites.scala:204)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
[info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
[info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
[info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Test.org$scalatest$BeforeAndAfterAll$$super$run(HiveThriftServer2Suites.scala:450)
[info]   at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[info]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
[info]   at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Test.run(HiveThriftServer2Suites.scala:450)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[info]   at java.lang.Thread.run(Thread.java:745)

It looks like there may also be a planner bug that occurs when unsafe is enabled while codegen is disabled:

org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite. testMultipleRangeWindows
 Failed to execute query using catalyst: Error: assertion failed: UnsafeExternalSort requires code generation to be enabled java.lang.AssertionError: assertion failed: UnsafeExternalSort requires code generation to be enabled  at scala.Predef$.assert(Predef.scala:179)  at org.apache.spark.sql.execution.UnsafeExternalSort$$anonfun$doExecute$7.apply(basicOperators.scala:281)  at org.apache.spark.sql.execution.UnsafeExternalSort$$anonfun$doExecute$7.apply(basicOperators.scala:280)  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)  at

This is probably easy to fix by adding configuration validation which does not allow unsafe to be enabled unless codegen is also enabled, then updating the relevant non-code-gen suites to also disable unsafe. We can also modify the planner to guard against this by also checking whether codegen is enabled when figuring out whether an unsafe sort operator can be used.

Finally, it looks like a number of the HiveCompatibilitySuite tests are failing due to the minimum buffer size issue, which I'll try to fix today.

SparkQA · 2015-07-21T22:01:41Z

Test build #37973 timed out for PR 7564 at commit e71ddbc after a configured wait of 175m.

JoshRosen · 2015-07-21T23:50:17Z

I lowered some buffers to 4MB so that the majority of HiveCompatibilitySuite runs and have found a few other test failures; see updated ticket description for more details.

SparkQA · 2015-07-22T01:36:40Z

Test build #37987 timed out for PR 7564 at commit e42fb7d after a configured wait of 175m.

SparkQA · 2015-07-22T02:11:25Z

Test build #37990 timed out for PR 7564 at commit 5464206 after a configured wait of 175m.

SparkQA · 2015-07-27T00:11:33Z

Test build #38482 timed out for PR 7564 at commit 4fcae4a after a configured wait of 175m.

SparkQA · 2015-07-27T04:58:27Z

Test build #38496 has finished for PR 7564 at commit f4ac642.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-27T08:42:03Z

Test build #38514 timed out for PR 7564 at commit 88aa54a after a configured wait of 175m.

SparkQA · 2015-07-27T09:41:44Z

Test build #38519 timed out for PR 7564 at commit c38bcdd after a configured wait of 175m.

JoshRosen · 2015-07-27T16:41:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala

This may be unnecessary now that we support generic getters in UnsafeRows.

SparkQA · 2015-07-27T19:36:47Z

Test build #38563 timed out for PR 7564 at commit 393a3eb after a configured wait of 175m.

SparkQA · 2015-07-28T04:35:28Z

Test build #38638 has finished for PR 7564 at commit fa1c3e2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-28T06:03:15Z

Lots of changes are temporarily merged in here while I test on top of Reynold's struct type patch.

SparkQA · 2015-07-28T08:11:40Z

Test build #38667 has finished for PR 7564 at commit 8946cb9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public static class StructWriter
- abstract class InternalRow extends Serializable with SpecializedGetters
- case class CreateStructUnsafe(children: Seq[Expression]) extends Expression
- case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression
- case class LastDay(startDate: Expression) extends UnaryExpression with ImplicitCastInputTypes
- case class NextDay(startDate: Expression, dayOfWeek: Expression)
- case class TungstenProject(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode

SparkQA · 2015-07-28T08:25:49Z

Test build #38677 has finished for PR 7564 at commit 7f463f8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase
- public static class StructWriter
- abstract class InternalRow extends Serializable with SpecializedGetters
- case class CreateStructUnsafe(children: Seq[Expression]) extends Expression
- case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression
- case class LastDay(startDate: Expression) extends UnaryExpression with ImplicitCastInputTypes
- case class NextDay(startDate: Expression, dayOfWeek: Expression)
- case class TungstenProject(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode

SparkQA · 2015-07-28T19:58:41Z

Test build #38757 has finished for PR 7564 at commit 445b4db.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

shaneknapp · 2015-07-28T20:11:58Z

jenkins, test this please

SparkQA · 2015-07-28T20:28:48Z

Test build #38761 has finished for PR 7564 at commit 445b4db.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T00:19:22Z

Test build #38896 has finished for PR 7564 at commit 7652648.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-30T00:24:41Z

Nooo! This failed a PySpark test because I need to configure a lower default page size in those tests.

SparkQA · 2015-07-30T01:38:09Z

Test build #38910 has finished for PR 7564 at commit 4f92106.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-30T03:23:31Z

Jenkins, retest this please.

JoshRosen · 2015-07-30T03:23:51Z

(Re-testing because a flaky MLlib test may have caused a spurious failure and I want to have the Python tests run...)

SparkQA · 2015-07-30T03:29:41Z

Test build #38940 has finished for PR 7564 at commit d6986de.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T03:40:36Z

Test build #38945 has finished for PR 7564 at commit 963f567.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T05:15:35Z

Test build #1231 has finished for PR 7564 at commit 963f567.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T05:28:31Z

Test build #38965 has finished for PR 7564 at commit 963f567.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T07:08:05Z

Test build #38992 has finished for PR 7564 at commit f4cc859.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class TungstenSort(

JoshRosen · 2015-07-30T16:18:24Z

Jenkins, retest this please.

(Just preemptively in case the first build hits a flaky streaming test)

SparkQA · 2015-07-30T17:15:32Z

Test build #39056 has finished for PR 7564 at commit 83c0c56.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-07-30T17:16:50Z

Jenkins retest this please

rxin · 2015-07-30T17:45:43Z

I've merged this!

SparkQA · 2015-07-30T18:11:28Z

Test build #1236 has finished for PR 7564 at commit 83c0c56.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-30T19:15:07Z

Test build #39073 has finished for PR 7564 at commit 83c0c56.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public static final class FloatPrefixComparator extends PrefixComparator
- case class QRDecomposition[UType, VType](Q: UType, R: VType)
- case class UnsafeExternalSort(

JoshRosen force-pushed the unsafe-by-default branch from e872840 to e71ddbc Compare July 21, 2015 19:00

JoshRosen force-pushed the unsafe-by-default branch from 5464206 to 4fcae4a Compare July 26, 2015 21:11

JoshRosen force-pushed the unsafe-by-default branch 2 times, most recently from 88aa54a to c38bcdd Compare July 27, 2015 06:41

JoshRosen force-pushed the unsafe-by-default branch from c38bcdd to 393a3eb Compare July 27, 2015 16:36

JoshRosen reviewed Jul 27, 2015
View reviewed changes

JoshRosen force-pushed the unsafe-by-default branch from a2ccefa to 8946cb9 Compare July 28, 2015 06:09

JoshRosen force-pushed the unsafe-by-default branch from b4ab1c0 to 445b4db Compare July 28, 2015 19:52

JoshRosen added 3 commits July 29, 2015 16:49

Enable Unsafe by default

715517b

Disable unsafe Exchange path when RangePartitioning is used

ea250da

Add task completion callback to avoid leak in limit after sort

5d0b2d3

JoshRosen force-pushed the unsafe-by-default branch from 4f92106 to 5d0b2d3 Compare July 29, 2015 23:50

Also match TungstenProject in checkNumProjects

013b9da

JoshRosen changed the title ~~[SPARK-8850] [SQL] [WIP] Enable Unsafe mode by default~~ [SPARK-8850] [SQL] Enable Unsafe mode by default Jul 29, 2015

Lower page size in PySpark tests

d6986de

Reduce buffer size for R tests

963f567

Merge remote-tracking branch 'origin/master' into unsafe-by-default

f4cc859

Merge remote-tracking branch 'origin/master' into unsafe-by-default

83c0c56

asfgit closed this in 520ec0f Jul 30, 2015

JoshRosen deleted the unsafe-by-default branch October 16, 2015 21:48

[SPARK-8850] [SQL] Enable Unsafe mode by default #7564

[SPARK-8850] [SQL] Enable Unsafe mode by default #7564

Uh oh!

Conversation

JoshRosen commented Jul 21, 2015

Uh oh!

JoshRosen commented Jul 21, 2015

Uh oh!

SparkQA commented Jul 21, 2015

Uh oh!

JoshRosen commented Jul 21, 2015

Uh oh!

SparkQA commented Jul 21, 2015

Uh oh!

SparkQA commented Jul 21, 2015

Uh oh!

JoshRosen commented Jul 21, 2015

Uh oh!

SparkQA commented Jul 21, 2015

Uh oh!

JoshRosen commented Jul 21, 2015

Uh oh!

SparkQA commented Jul 22, 2015

Uh oh!

SparkQA commented Jul 22, 2015

Uh oh!

SparkQA commented Jul 27, 2015

Uh oh!

SparkQA commented Jul 27, 2015

Uh oh!

SparkQA commented Jul 27, 2015

Uh oh!

SparkQA commented Jul 27, 2015

Uh oh!

JoshRosen Jul 27, 2015

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 27, 2015

Uh oh!

SparkQA commented Jul 28, 2015

Uh oh!

JoshRosen commented Jul 28, 2015

Uh oh!

SparkQA commented Jul 28, 2015

Uh oh!

SparkQA commented Jul 28, 2015

Uh oh!

SparkQA commented Jul 28, 2015

Uh oh!

shaneknapp commented Jul 28, 2015

Uh oh!

SparkQA commented Jul 28, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

JoshRosen commented Jul 30, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

JoshRosen commented Jul 30, 2015

Uh oh!

JoshRosen commented Jul 30, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

JoshRosen commented Jul 30, 2015

Uh oh!

SparkQA commented Jul 30, 2015

Uh oh!

JoshRosen commented Jul 30, 2015

Uh oh!

rxin commented Jul 30, 2015