
Conversation

@tdas
Contributor

@tdas tdas commented Dec 2, 2015

The JobConf object created in DStream.saveAsHadoopFiles is used concurrently in multiple places:

  • The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched.
  • The JobConf is serialized as part of the DStream checkpoints.

These concurrent accesses (updating in one thread while another thread serializes it) can lead to a `ConcurrentModificationException` in the underlying Java HashMap used inside the internal Hadoop Configuration object.

The solution is to create a new JobConf in every batch; that fresh JobConf is updated by `RDD.saveAsHadoopFile()`, while checkpointing serializes the original, untouched JobConf.

Tests to be added in #9988 will fail reliably without this patch. Keeping this patch really small to make sure that it can be added to previous branches.
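For illustration, here is a minimal, self-contained sketch of the failure mode and the fix, using a plain `java.util.HashMap` to stand in for the Hadoop `Configuration`'s internal map. The class and method names below are hypothetical, not from the actual patch:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class JobConfSharingSketch {

    // Stand-in for checkpoint serialization: iterating every entry of the
    // conf's backing map, which is roughly what writing it out requires.
    static int checkpointSerialize(Map<String, String> conf) {
        int bytes = 0;
        for (Map.Entry<String, String> e : conf.entrySet()) {
            bytes += e.getKey().length() + e.getValue().length();
        }
        return bytes;
    }

    // Buggy pattern: mutate the same map that is being iterated, as happens
    // when saveAsHadoopFile() updates the shared JobConf while checkpointing
    // serializes it. HashMap's fail-fast iterator throws deterministically here.
    static boolean sharedConfFails() {
        Map<String, String> shared = new HashMap<>();
        shared.put("mapreduce.output.dir", "/out");
        shared.put("mapreduce.task.timeout", "600000");
        try {
            for (String key : shared.keySet()) {
                shared.put("mapreduce.job.id", "batch-1"); // structural update mid-iteration
            }
        } catch (ConcurrentModificationException ex) {
            return true;
        }
        return false;
    }

    // Fixed pattern, mirroring the patch: each batch gets its own copy of the
    // conf to mutate, so the original that checkpointing serializes is untouched.
    static int fixedSerializedBytes() {
        Map<String, String> original = new HashMap<>();
        original.put("mapreduce.output.dir", "/out");
        original.put("mapreduce.task.timeout", "600000");

        Map<String, String> batchConf = new HashMap<>(original); // fresh copy per batch
        batchConf.put("mapreduce.job.id", "batch-1");            // mutates the copy only

        return checkpointSerialize(original); // original iterates cleanly
    }

    public static void main(String[] args) {
        System.out.println("shared conf throws CME: " + sharedConfFails());
        System.out.println("serialized bytes of original: " + fixedSerializedBytes());
    }
}
```

The per-batch copy costs one map clone per batch but removes the shared mutable state entirely, which is why the patch can stay small enough to backport.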

@tdas
Contributor Author

tdas commented Dec 2, 2015

@zsxwing Please take a look. This should be merged to older branches if possible. And it blocks #9988 .

@SparkQA

SparkQA commented Dec 2, 2015

Test build #47032 has finished for PR 10088 at commit 7ff8174.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 2, 2015

Test build #2147 has finished for PR 10088 at commit 7ff8174.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Dec 2, 2015

LGTM

@zsxwing
Member

zsxwing commented Dec 2, 2015

merging it

asfgit pushed a commit that referenced this pull request Dec 2, 2015
…HadoopFiles

The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched.
* The JobConf is serialized as part of the DStream checkpoints.

These concurrent accesses (updating in one thread while another thread serializes it) can lead to a `ConcurrentModificationException` in the underlying Java HashMap used inside the internal Hadoop Configuration object.

The solution is to create a new JobConf in every batch; that fresh JobConf is updated by `RDD.saveAsHadoopFile()`, while checkpointing serializes the original, untouched JobConf.

Tests to be added in #9988 will fail reliably without this patch. Keeping this patch really small to make sure that it can be added to previous branches.

Author: Tathagata Das <[email protected]>

Closes #10088 from tdas/SPARK-12087.

(cherry picked from commit 8a75a30)
Signed-off-by: Shixiong Zhu <[email protected]>
@asfgit asfgit closed this in 8a75a30 Dec 2, 2015
@zsxwing
Member

zsxwing commented Dec 2, 2015

Merged to master, 1.6, 1.5 and 1.4.
