Commit 0bb716b
committed
Revert [SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the finished partitions
## What changes were proposed in this pull request?
Our customer has a very complicated job. Sometimes it successes and sometimes it fails with
```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 4 has failed the maximum allowable number of times: 4.
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException
```
However, with the patch #23871 , the job hangs forever.
When I investigated it, I found that `DAGScheduler` and `TaskSchedulerImpl` define stage completion differently. `DAGScheduler` thinks a stage is completed if all its partitions are marked as completed ([result stage](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1362-L1368) and [shuffle stage](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1400)). `TaskSchedulerImpl` thinks a stage's task set is completed when all tasks finish (see the [code](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L779-L784)).
Ideally this two definition should be consistent, but #23871 breaks it. In our customer's Spark log, I found that, a stage's task set completes, but the stage never completes. More specifically, `DAGScheduler` submits a task set for stage 4.1 with 1000 tasks, but the `TaskSetManager` skips to run the first 100 tasks. Later on, `TaskSetManager` finishes 900 tasks and marks the task set as completed. However, `DAGScheduler` doesn't agree with it and hangs forever, waiting for more task completion events of stage 4.1.
With hindsight, I think `TaskSchedulerIImpl.stageIdToFinishedPartitions` is fragile. We need to pay more effort to make sure this is consistent with `DAGScheduler`'s knowledge. When `DAGScheduler` marks some partitions from finished to unfinished, `TaskSchedulerIImpl.stageIdToFinishedPartitions` should be updated as well.
This PR reverts #23871, let's think of a more robust idea later.
## How was this patch tested?
N/A
Closes #24359 from cloud-fan/revert.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>1 parent eea3f55 commit 0bb716b
File tree
3 files changed
+20
-79
lines changed- core/src
- main/scala/org/apache/spark/scheduler
- test/scala/org/apache/spark/scheduler
3 files changed
+20
-79
lines changedLines changed: 4 additions & 32 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
22 | 22 | | |
23 | 23 | | |
24 | 24 | | |
25 | | - | |
| 25 | + | |
26 | 26 | | |
27 | 27 | | |
28 | 28 | | |
| |||
101 | 101 | | |
102 | 102 | | |
103 | 103 | | |
104 | | - | |
105 | | - | |
106 | | - | |
107 | 104 | | |
108 | 105 | | |
109 | 106 | | |
| |||
252 | 249 | | |
253 | 250 | | |
254 | 251 | | |
255 | | - | |
256 | | - | |
257 | | - | |
258 | | - | |
259 | | - | |
260 | | - | |
261 | | - | |
262 | | - | |
263 | | - | |
264 | | - | |
265 | | - | |
266 | | - | |
267 | | - | |
268 | | - | |
| 252 | + | |
269 | 253 | | |
270 | 254 | | |
271 | 255 | | |
| |||
886 | 870 | | |
887 | 871 | | |
888 | 872 | | |
889 | | - | |
| 873 | + | |
890 | 874 | | |
891 | 875 | | |
892 | 876 | | |
893 | 877 | | |
894 | 878 | | |
895 | | - | |
896 | | - | |
897 | | - | |
898 | | - | |
899 | | - | |
900 | | - | |
901 | | - | |
902 | 879 | | |
903 | 880 | | |
904 | 881 | | |
905 | 882 | | |
906 | 883 | | |
907 | | - | |
908 | | - | |
909 | | - | |
910 | | - | |
911 | | - | |
912 | 884 | | |
913 | | - | |
| 885 | + | |
914 | 886 | | |
915 | 887 | | |
916 | 888 | | |
| |||
Lines changed: 5 additions & 14 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
24 | | - | |
| 24 | + | |
25 | 25 | | |
26 | 26 | | |
27 | 27 | | |
| |||
800 | 800 | | |
801 | 801 | | |
802 | 802 | | |
803 | | - | |
804 | | - | |
805 | | - | |
806 | | - | |
807 | | - | |
| 803 | + | |
808 | 804 | | |
809 | 805 | | |
810 | 806 | | |
| |||
823 | 819 | | |
824 | 820 | | |
825 | 821 | | |
826 | | - | |
827 | | - | |
828 | | - | |
| 822 | + | |
829 | 823 | | |
830 | 824 | | |
831 | 825 | | |
832 | | - | |
| 826 | + | |
833 | 827 | | |
834 | 828 | | |
835 | 829 | | |
836 | 830 | | |
837 | | - | |
838 | | - | |
839 | | - | |
840 | | - | |
| 831 | + | |
841 | 832 | | |
842 | 833 | | |
843 | 834 | | |
| |||
Lines changed: 11 additions & 33 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1121 | 1121 | | |
1122 | 1122 | | |
1123 | 1123 | | |
1124 | | - | |
| 1124 | + | |
1125 | 1125 | | |
1126 | 1126 | | |
1127 | 1127 | | |
| |||
1133 | 1133 | | |
1134 | 1134 | | |
1135 | 1135 | | |
1136 | | - | |
1137 | | - | |
1138 | | - | |
| 1136 | + | |
| 1137 | + | |
| 1138 | + | |
1139 | 1139 | | |
1140 | 1140 | | |
1141 | 1141 | | |
| |||
1152 | 1152 | | |
1153 | 1153 | | |
1154 | 1154 | | |
1155 | | - | |
1156 | | - | |
1157 | | - | |
1158 | | - | |
1159 | | - | |
1160 | | - | |
1161 | | - | |
1162 | | - | |
1163 | | - | |
1164 | | - | |
1165 | | - | |
1166 | 1155 | | |
1167 | | - | |
1168 | | - | |
1169 | | - | |
1170 | | - | |
| 1156 | + | |
| 1157 | + | |
| 1158 | + | |
1171 | 1159 | | |
1172 | 1160 | | |
1173 | 1161 | | |
1174 | | - | |
1175 | | - | |
1176 | | - | |
1177 | | - | |
1178 | | - | |
1179 | | - | |
1180 | | - | |
1181 | | - | |
1182 | 1162 | | |
1183 | 1163 | | |
1184 | 1164 | | |
1185 | 1165 | | |
1186 | 1166 | | |
1187 | 1167 | | |
1188 | 1168 | | |
1189 | | - | |
1190 | | - | |
1191 | | - | |
| 1169 | + | |
| 1170 | + | |
| 1171 | + | |
1192 | 1172 | | |
1193 | 1173 | | |
1194 | 1174 | | |
1195 | | - | |
1196 | 1175 | | |
1197 | 1176 | | |
1198 | 1177 | | |
1199 | | - | |
| 1178 | + | |
1200 | 1179 | | |
1201 | 1180 | | |
1202 | 1181 | | |
| |||
1244 | 1223 | | |
1245 | 1224 | | |
1246 | 1225 | | |
1247 | | - | |
1248 | 1226 | | |
1249 | 1227 | | |
1250 | 1228 | | |
| |||
0 commit comments