This repository was archived by the owner on May 9, 2024. It is now read-only.

Conversation

@corruptmemory commented Apr 26, 2016

This is a backport of the upstream fix for SPARK-12583

It's initially based on the patch from this commit: apache@310981d

After cherry-picking that commit, I updated the source by selectively applying code from master.

How was this patch tested?

Not tested yet. Only builds.

@corruptmemory (Author)

Rebased on the latest upstream branch-1.6.

@dragos commented Apr 27, 2016

It looks like there are Scala style check errors. You can run the exact same suite as the validator with dev/run-tests, or just the style checks with dev/lint-scala.

Scalastyle checks failed at following occurrences:
[error] /home/ubuntu/workspace/ghprb-spark-multi-conf/label/mesos-spark-docker/scala_version/2.11/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:331:44: No space after token :
[error] (core/compile:scalastyle) errors exist
[error] Total time: 10 s, completed Apr 26, 2016 4:18:18 PM
[error] /home/ubuntu/workspace/ghprb-spark-multi-conf/label/mesos-spark-docker/scala_version/2.11/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:331:44: No space after token :
[error] (core/compile:scalastyle) errors exist
[error] Total time: 10 s, completed Apr 26, 2016 4:18:50 PM
[error] running /home/ubuntu/workspace/ghprb-spark-multi-conf/label/mesos-spark-docker/scala_version/2.11/dev/lint-scala ; received return code 1
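
For reference, the style rule being violated looks like this (a hypothetical illustration, not the actual line from MesosSchedulerUtils.scala):

```scala
// Fails scalastyle ("No space after token :"): no space after the ':' in the type annotation.
// val attributes:Map[String, String] = Map.empty

// Passes scalastyle: a single space after ':'.
val attributes: Map[String, String] = Map.empty
```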


I think this whole file shouldn't be here. Note that there's been a lot of shuffling around this class. In 1.6.0 there are two RPC implementations: Netty-based and Akka-based. In 2.0 the Akka one is removed.

This class already exists in your fork under org.apache.spark.rpc.netty. You should probably modify the heartbeat logic to work with both RPC implementations and remove this copy. This commit is related: 4f5a24d
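
A rough sketch of what coding against the transport-agnostic RPC layer could look like (class and message names are hypothetical, and this assumes the code lives inside Spark's own source tree, since `org.apache.spark.rpc` is `private[spark]`):

```scala
package org.apache.spark.deploy.mesos

import org.apache.spark.rpc.RpcEndpointRef

// Hypothetical heartbeat message and sender; not the actual classes in this PR.
private[spark] case class ShuffleServiceHeartbeat(appId: String)

private[spark] class ShuffleServiceHeartbeater(shuffleService: RpcEndpointRef, appId: String) {
  // RpcEndpointRef is provided by both the Netty- and Akka-based RpcEnvs in 1.6,
  // so this code does not depend on either transport directly.
  def beat(): Unit = shuffleService.send(ShuffleServiceHeartbeat(appId))
}
```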

@corruptmemory force-pushed the SPARK-12583 branch 3 times, most recently from ea7419a to 0434e87 on May 2, 2016 at 16:43
srowen and others added 15 commits May 3, 2016 13:15
Update Jetty 8.1 to the latest 2016/02 release, from a 2013/10 release, for security and bug fixes. This does not necessarily resolve the JIRA, as it's still worth considering an update to 9.3.

Jenkins tests

Author: Sean Owen <[email protected]>

Closes apache#12842 from srowen/SPARK-14897.

(cherry picked from commit 57ac7c1)
Signed-off-by: Sean Owen <[email protected]>
…ady succeeded

Don't re-queue a task if another attempt has already succeeded.  This currently happens when a speculative task is denied from committing the result due to another copy of the task already having succeeded.

I'm running a job which has a fair bit of skew in the processing time across the tasks for speculation to trigger in the last quarter (default settings), causing many commit denied exceptions to be thrown.  Previously, these tasks were then being retried over and over again until the stage possibly completes (despite using compute resources on these superfluous tasks).  With this change (applied to the 1.6 branch), they no longer retry and the stage completes successfully without these extra task attempts.

Author: Jason Moore <[email protected]>

Closes apache#12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
…3c060

## What changes were proposed in this pull request?

I botched the back-port of SPARK-14915 to branch-1.6 in apache@bf3c060, resulting in a code block being added twice. This simply removes it, so that the net change is the intended one.

## How was this patch tested?

Jenkins tests. (This in theory has already been tested.)

Author: Sean Owen <[email protected]>

Closes apache#12950 from srowen/SPARK-14915.2.
…Thread

Temporary patch for branch-1.6 to avoid a deadlock between the BlockManager and the executor thread.

Author: cenyuhai <[email protected]>

Closes apache#11546 from cenyuhai/SPARK-13566.
## What changes were proposed in this pull request?

The configuration setting `spark.executor.logs.rolling.size.maxBytes` was changed to `spark.executor.logs.rolling.maxSize` in 1.4 or so.

This commit fixes a remaining reference to the old name in the documentation.

Also the description for `spark.executor.logs.rolling.maxSize` was edited to clearly state that the unit for the size is bytes.
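
For example, size-based rolling with the current property name would look roughly like this (illustrative values only):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")
  // The value is interpreted in bytes; roll the executor logs at 128 MiB here.
  .set("spark.executor.logs.rolling.maxSize", (128L * 1024 * 1024).toString)
  .set("spark.executor.logs.rolling.maxRetainedFiles", "5")
```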

## How was this patch tested?

no tests

Author: Philipp Hoffmann <[email protected]>

Closes apache#13001 from philipphoffmann/patch-3.
…eb UI timeline

## What changes were proposed in this pull request?

This patch fixes an escaping bug in the Web UI's event timeline that caused Javascript errors when displaying timeline entries whose descriptions include single quotes.

The original bug can be reproduced by running

```scala
sc.setJobDescription("double quote: \" ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("single quote: ' ")
sc.parallelize(1 to 10).count()
```

and then browsing to the driver UI. Previously, this resulted in an "Uncaught SyntaxError" because the single quote from the description was not escaped and ended up closing a Javascript string literal too early.

The fix implemented here is to change the relevant Javascript to define its string literals using double-quotes. Our escaping logic already properly escapes double quotes in the description, so this is safe to do.

## How was this patch tested?

Tested manually in `spark-shell` using the following cases:

```scala
sc.setJobDescription("double quote: \" ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("single quote: ' ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("ampersand: &")
sc.parallelize(1 to 10).count()

sc.setJobDescription("newline: \n text after newline ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("carriage return: \r text after return ")
sc.parallelize(1 to 10).count()
```

/cc sarutak for review.

Author: Josh Rosen <[email protected]>

Closes apache#12995 from JoshRosen/SPARK-15209.

(cherry picked from commit 3323d0f)
Signed-off-by: Kousuke Saruta <[email protected]>
…distinct aggregate function

#### Symptom:
In the latest **branch 1.6**, when a `DISTINCT` aggregate function is used in the `HAVING` clause, the Analyzer throws an `AnalysisException` with a message like the following:
```
resolved attribute(s) gid#558,id#559 missing from date#554,id#555 in operator !Expand [List(date#554, null, 0, if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)], [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
```
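
For illustration, a query of the shape that hits this path looks roughly like the following (table and column names are made up):

```scala
// Hypothetical repro shape: a DISTINCT aggregate referenced in the HAVING clause.
sqlContext.sql(
  """SELECT date, count(DISTINCT id)
    |FROM events
    |GROUP BY date
    |HAVING count(DISTINCT id) > 1""".stripMargin).show()
```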
#### Root cause:
The problem is that a distinct aggregate in the `HAVING` condition is resolved by the `DistinctAggregationRewriter` rule twice, which messes up the resulting `EXPAND` operator.

In the `ResolveAggregateFunctions` rule, when resolving ```Filter(havingCondition, _: Aggregate)```, the `havingCondition` is resolved as an `Aggregate` in a nested loop of analyzer rule execution (by invoking `RuleExecutor.execute`). At this nested level of analysis, the `DistinctAggregationRewriter` rule rewrites the distinct aggregate clause into an expanded two-layer aggregation, where the `aggregateExpressions` of the final `Aggregate` contain the resolved `gid` and the aggregate expression attributes (in the above case, `gid#558` and `id#559`).

After the nested analyzer rule execution completes, the resulting `aggregateExpressions` in the `havingCondition` are pushed down into the underlying `Aggregate` operator, and the `DistinctAggregationRewriter` rule is executed again. The `projections` field of the `EXPAND` operator is populated with the `aggregateExpressions` of the `havingCondition` mentioned above. However, the attributes (in the above case, `gid#558` and `id#559`) in the projection list of the `EXPAND` operator cannot be found in the underlying relation.

#### Solution:
This PR retrofits part of [apache#11579](apache#11579), which moves the `DistinctAggregationRewriter` to the beginning of the Optimizer, so that the rewrite is guaranteed to happen only after all the aggregate functions have been resolved. This avoids the resolution failure.

#### How was this patch tested?
New [test cases](https://github.com/xwu0226/spark/blob/f73428f94746d6d074baf6702589545bdbd11cad/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala#L927-L988) are added to drive `DistinctAggregationRewriter` rewrites for multi-distinct aggregations involving a `HAVING` clause.

A follow-up PR will be submitted to add these test cases to the master (2.0) branch.

Author: xin Wu <[email protected]>

Closes apache#12974 from xwu0226/SPARK-14495_review.
…leaning executor's state

## What changes were proposed in this pull request?

When the driver removes an executor's state, the connection between the driver and the executor may still be alive, so the executor cannot exit automatically (e.g., the Master will send RemoveExecutor when a worker is lost but the executor is still alive). The driver should therefore try to tell the executor to stop itself; otherwise, we will leak an executor.

This PR modified the driver to send `StopExecutor` to the executor when it's removed.
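
A minimal sketch of the idea (stand-in types and names, not the actual `CoarseGrainedSchedulerBackend` diff):

```scala
import scala.collection.concurrent.TrieMap

// Stand-ins for the driver-side pieces involved.
case object StopExecutor
trait ExecutorEndpointRef { def send(msg: Any): Unit }
case class ExecutorData(endpoint: ExecutorEndpointRef)

object DriverSketch {
  val executorDataMap = TrieMap.empty[String, ExecutorData]

  // When the driver drops an executor's state, also tell the (possibly still
  // connected) executor to stop itself so it is not leaked.
  def removeExecutor(executorId: String, reason: String): Unit = {
    executorDataMap.remove(executorId).foreach(_.endpoint.send(StopExecutor))
    // ... the existing cleanup (block manager, listener bus, etc.) would follow.
  }
}
```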

## How was this patch tested?

Manual test: increased the worker heartbeat interval to force it to always time out, and verified that the leaked executors are gone.

Author: Shixiong Zhu <[email protected]>

Closes apache#11399 from zsxwing/SPARK-13519.
…eartbeat to driver more than N times

## What changes were proposed in this pull request?

Sometimes the network disconnection event won't be triggered, because of other potential race conditions that we may not have thought of, and then the executor will keep sending heartbeats to the driver and won't exit.

This PR adds a new configuration, `spark.executor.heartbeat.maxFailures`, to kill the executor when it is unable to heartbeat to the driver more than `spark.executor.heartbeat.maxFailures` times.
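
For example (illustrative values, not the defaults):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Heartbeats are sent at this pre-existing interval.
  .set("spark.executor.heartbeatInterval", "10s")
  // New in this PR: kill the executor after this many consecutive failed heartbeats.
  .set("spark.executor.heartbeat.maxFailures", "10")
```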

## How was this patch tested?

unit tests

Author: Shixiong Zhu <[email protected]>

Closes apache#11401 from zsxwing/SPARK-13522.
## What changes were proposed in this pull request?

Just fixed the placement of a log message introduced by apache#11401.

## How was this patch tested?

unit tests.

Author: Shixiong Zhu <[email protected]>

Closes apache#11432 from zsxwing/SPARK-13522-follow-up.
## What changes were proposed in this pull request?

If an executor is still alive even after the scheduler has removed its metadata, we may receive a heartbeat from that executor and tell its block manager to reregister itself. If that happens, the block manager master will know about the executor, but the scheduler will not.

That is a dangerous situation, because when the executor does get disconnected later, the scheduler will not ask the block manager to also remove metadata for that executor. Later, when we try to clean up an RDD or a broadcast variable, we may try to send a message to that executor, triggering an exception.

## How was this patch tested?

Jenkins.

Author: Andrew Or <[email protected]>

Closes apache#13055 from andrewor14/block-manager-remove.

(cherry picked from commit 40a949a)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?

(This is the branch-1.6 version of apache#13039)

When we acquire execution memory, we do a lot of things between shrinking the storage memory pool and enlarging the execution memory pool. In particular, we call memoryStore.evictBlocksToFreeSpace, which may do a lot of I/O and can throw exceptions. If an exception is thrown, the pool sizes on that executor will be in a bad state.

This patch minimizes the things we do between the two calls to make the resizing more atomic.
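
A generic sketch of the pattern (illustrative only, not Spark's actual memory manager code): do the failure-prone eviction first, then adjust the two pool sizes back to back.

```scala
object PoolResizeSketch {
  private var storagePoolSize: Long = 512L * 1024 * 1024
  private var executionPoolSize: Long = 512L * 1024 * 1024

  // Stand-in for memoryStore.evictBlocksToFreeSpace: may do a lot of I/O and may throw.
  private def evictBlocksToFreeSpace(bytes: Long): Long = bytes

  def transferStorageToExecution(bytes: Long): Unit = synchronized {
    // The potentially failing work happens before either pool is touched...
    val freed = evictBlocksToFreeSpace(bytes)
    // ...so an exception cannot leave one pool shrunk without the other grown.
    storagePoolSize -= freed
    executionPoolSize += freed
  }
}
```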

## How was this patch tested?

Jenkins.

Author: Andrew Or <[email protected]>

Closes apache#13058 from andrewor14/safer-pool-1.6.
Fixed memory leak (HiveConf in the CommandProcessorFactory)

Author: Oleg Danilov <[email protected]>

Closes apache#12932 from dosoft/SPARK-14261.

(cherry picked from commit e384c7f)
Signed-off-by: Reynold Xin <[email protected]>
…for 1.6)

## What changes were proposed in this pull request?

Backport apache#13185 to branch 1.6.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <[email protected]>

Closes apache#13196 from zsxwing/host-string-1.6.
… in generated code (branch-1.6)

## What changes were proposed in this pull request?

This PR introduces placeholders for comments in generated code; the purpose is the same as apache#12939, but this approach is much safer.

The generated code that gets compiled doesn't include the actual comments; it contains placeholders instead.

Placeholders in the generated code are replaced with the actual comments only at logging time.
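
An illustrative sketch of the placeholder mechanism (hypothetical helper, not the actual codegen API):

```scala
import scala.collection.mutable

object CommentPlaceholders {
  private var counter = 0
  private val comments = mutable.Map.empty[String, String]

  // Register a comment and return the short token embedded in the generated code.
  def register(text: String): String = {
    val token = s"/*c$counter*/"
    counter += 1
    comments(token) = s"/* $text */"
    token
  }

  // Expand the tokens into the real comments only when the code is logged.
  def forLogging(generatedCode: String): String =
    comments.foldLeft(generatedCode) { case (code, (token, comment)) =>
      code.replace(token, comment)
    }
}
```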

Also, this PR can resolve SPARK-15205.

## How was this patch tested?


Added new test cases.

Author: Kousuke Saruta <[email protected]>

Closes apache#13230 from sarutak/SPARK-15165-branch-1.6.
@corruptmemory (Author)

retest this please

…s before application has stopped

The Mesos shuffle service has been completely unusable since Spark 1.6.0. The problem seems to have appeared with the move from Akka to Netty in the networking layer. Until now, a connection from the driver to each shuffle service was used as the signal for the shuffle service to determine whether the driver is still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) because it is idle. The shuffle service interprets this as a signal that the driver has stopped, despite the driver still being alive. Thus, shuffle files are deleted before the application has stopped.

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159

This is a follow-up to apache#11207.

This PR adds a heartbeat signal from the driver (in MesosExternalShuffleClient) to all registered external Mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and, if so, cleans up the application's shuffle files.
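
A rough sketch of the server-side timeout check described above (hypothetical names, not the actual MesosExternalShuffleBlockHandler code):

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import scala.collection.JavaConverters._

class AppHeartbeatMonitor(timeoutMs: Long, cleanupApp: String => Unit) {
  private val lastHeartbeat = new ConcurrentHashMap[String, java.lang.Long]()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Called on registration and on every heartbeat received from a driver.
  def heartbeat(appId: String): Unit =
    lastHeartbeat.put(appId, System.currentTimeMillis())

  // Periodically drop applications whose driver has not heartbeated within the
  // timeout and delete their shuffle files.
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      val now = System.currentTimeMillis()
      lastHeartbeat.asScala.foreach { case (appId, last) =>
        if (now - last.longValue > timeoutMs) {
          lastHeartbeat.remove(appId)
          cleanupApp(appId)
        }
      }
    }
  }, timeoutMs, timeoutMs, TimeUnit.MILLISECONDS)
}
```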

This patch has been tested on a small Mesos test cluster using the spark-shell. Log output from the Mesos shuffle service:
```
16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
```
Note: there are 2 executors running on this slave.

Author: Bertrand Bossy <[email protected]>

Closes apache#11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.

Initial backport of apache#11272

* No new test failures introduced.
* Provisional backport complete
@corruptmemory (Author)

retest this please

@skonto posted six similar "retest this please" comments on May 24, 2016.

@corruptmemory (Author)

Closing to send upstream

@skonto posted 19 similar "retest this please" comments: 18 on May 27, 2016 and one on May 30, 2016.

@typesafe-tools (Collaborator)

Can one of the admins verify this patch?

@skonto closed this on Nov 20, 2017