Conversation

@dongjoon-hyun dongjoon-hyun commented Feb 12, 2018

What changes were proposed in this pull request?

This PR aims to resolve an open file leakage issue reported at SPARK-23390 by moving the listener registration position. Currently, the sequence is as follows:

  1. Create batchReader
  2. batchReader.initialize opens an ORC file.
  3. batchReader.initBatch may take a long time to allocate memory in some environments and cause errors.
  4. Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))

This PR moves 4 before 2 and 3. To sum up, the new sequence is 1 -> 4 -> 2 -> 3.
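The reordering can be sketched as follows (a simplified sketch based on the snippets in this PR; the `initBatch` arguments are elided, so this is pseudocode rather than the exact merged change):

```scala
// Before this PR (leak-prone ordering): if initialize() or initBatch()
// throws, no completion listener exists yet, so the opened ORC file is
// never closed.
val batchReader = new OrcColumnarBatchReader(
  enableOffHeapColumnVector && taskContext.isDefined, copyToSpark, capacity)
val iter = new RecordReaderIterator(batchReader)
batchReader.initialize(fileSplit, taskAttemptContext)  // 2. opens the ORC file
batchReader.initBatch(...)                             // 3. may OOM or throw
Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))  // 4.

// After this PR (1 -> 4 -> 2 -> 3): the listener is registered before any
// step that can fail, so task completion always closes the reader.
val iter = new RecordReaderIterator(batchReader)
Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))
batchReader.initialize(fileSplit, taskAttemptContext)
batchReader.initBatch(...)
```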

How was this patch tested?

Manual. The following test case triggers an OOM intentionally to cause a leaked filesystem connection in the current code base. With this patch, the leakage does not occur.

  // This should be tested manually because it raises OOM intentionally
  // in order to cause `Leaked filesystem connection`.
  test("SPARK-23399 Register a task completion listener first for OrcColumnarBatchReader") {
    withSQLConf(SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE.key -> s"${Int.MaxValue}") {
      withTempDir { dir =>
        val basePath = dir.getCanonicalPath
        Seq(0).toDF("a").write.format("orc").save(new Path(basePath, "first").toString)
        Seq(1).toDF("a").write.format("orc").save(new Path(basePath, "second").toString)
        val df = spark.read.orc(
          new Path(basePath, "first").toString,
          new Path(basePath, "second").toString)
        val e = intercept[SparkException] {
          df.collect()
        }
        assert(e.getCause.isInstanceOf[OutOfMemoryError])
      }
    }
  }

val iter = new RecordReaderIterator(batchReader)
Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))

batchReader.initialize(fileSplit, taskAttemptContext)
Member Author

@dongjoon-hyun dongjoon-hyun Feb 12, 2018

According to the reported case, the ORC file is opened here.
But it seems the task is killed (TaskKilled (Stage cancelled)) during initBatch, before its listener is registered. For cases that throw an exception in initBatch, this PR prevents the open file leakage.

Member Author

@cloud-fan and @gatorsmile, could you take a look at this?
For the ORC library, it looks okay as long as we call close correctly.

Member

@viirya viirya Feb 13, 2018

@dongjoon-hyun Thanks for this fix! My question is: how do we know that close was not called before and is called now? Have you verified it?

Member

I tried to verify it manually on my local machine, and it seems close is called even before this change. Maybe I missed something, or this is environment-dependent.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23399][SQL] Register a task completion listner first for OrcColumnarBatchReader [SPARK-23399][SQL] Register a task completion listener first for OrcColumnarBatchReader Feb 12, 2018
@SparkQA

SparkQA commented Feb 13, 2018

Test build #87349 has finished for PR 20590 at commit 198f186.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Feb 13, 2018

Umm...

@kiszk
Member

kiszk commented Feb 13, 2018

@dongjoon-hyun One question. Do you know which test case in FileBasedDataSourceSuite may cause this failure? Now, we are seeing a stack trace at afterEach.

@cloud-fan
Contributor

looks reasonable.

batchReader.initBatch throws FileNotFoundException, then we enter afterEach, detect the file stream leak, and fail.

@cloud-fan
Contributor

I know it's hard to add a test; we need a malformed ORC file to make the reader fail midway. @dongjoon-hyun, do you think it's possible to generate such an ORC file?

@dongjoon-hyun
Member Author

Thank you for the review, @viirya, @kiszk, @cloud-fan.
Yep. I'm still trying to reproduce it with a test case. I'll let you know later.

assert(e.getCause.isInstanceOf[OutOfMemoryError])
}
}
}
Member Author

@dongjoon-hyun dongjoon-hyun Feb 13, 2018

Hi, all.
The test case above generates the same leakage reported in JIRA, and this PR fixes it. Please try this test case in IntelliJ with the master branch.

}

// This should be tested manually because it raises OOM intentionally
// in order to cause `Leaked filesystem connection`. The test suite dies, too.
Contributor

ah, nice trick to fail the reader midway!

But it's a little weird to have it as a unit test; shall we just put it in the PR description and say it's manually tested? This test needs to be run manually anyway...

Member Author

Sure!

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87384 has finished for PR 20590 at commit 3b8cb0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87385 has finished for PR 20590 at commit d4cc32e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Feb 13, 2018

retest this please

val batchReader = new OrcColumnarBatchReader(
enableOffHeapColumnVector && taskContext.isDefined, copyToSpark, capacity)
val iter = new RecordReaderIterator(batchReader)
Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))
Member

Could you please add a comment explaining why we put this registration here, referencing SPARK-23399? Since we may forget this investigation in the future :), the comment will help us and will remind us to run the test case manually.
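Such a comment might look like the following (a sketch only; the wording actually merged may differ):

```scala
// SPARK-23399 Register a task completion listener before calling
// `initialize` and `initBatch`, because either may throw (e.g. OOM or
// FileNotFoundException) before returning, and without a listener the
// opened ORC file would leak.
val iter = new RecordReaderIterator(batchReader)
Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))
```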

Member Author

Sure!

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87393 has finished for PR 20590 at commit d4cc32e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87397 has finished for PR 20590 at commit 5a9fa0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2018

Test build #87402 has finished for PR 20590 at commit 18c2485.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@SparkQA

SparkQA commented Feb 14, 2018

Test build #87425 has finished for PR 20590 at commit 18c2485.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Hi, @cloud-fan and @gatorsmile.
It's ready for review again. I think this will definitely reduce the flakiness of that suite.
But I'm still looking at some other places, too.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Feb 14, 2018
…olumnarBatchReader

This PR aims to resolve an open file leakage issue reported at [SPARK-23390](https://issues.apache.org/jira/browse/SPARK-23390) by moving the listener registration position. Currently, the sequence is as follows:

1. Create `batchReader`
2. `batchReader.initialize` opens an ORC file.
3. `batchReader.initBatch` may take a long time to allocate memory in some environments and cause errors.
4. `Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))`

This PR moves 4 before 2 and 3. To sum up, the new sequence is 1 -> 4 -> 2 -> 3.

Manual. The following test case triggers an OOM intentionally to cause a leaked filesystem connection in the current code base. With this patch, the leakage does not occur.

```scala
  // This should be tested manually because it raises OOM intentionally
  // in order to cause `Leaked filesystem connection`.
  test("SPARK-23399 Register a task completion listener first for OrcColumnarBatchReader") {
    withSQLConf(SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE.key -> s"${Int.MaxValue}") {
      withTempDir { dir =>
        val basePath = dir.getCanonicalPath
        Seq(0).toDF("a").write.format("orc").save(new Path(basePath, "first").toString)
        Seq(1).toDF("a").write.format("orc").save(new Path(basePath, "second").toString)
        val df = spark.read.orc(
          new Path(basePath, "first").toString,
          new Path(basePath, "second").toString)
        val e = intercept[SparkException] {
          df.collect()
        }
        assert(e.getCause.isInstanceOf[OutOfMemoryError])
      }
    }
  }
```

Author: Dongjoon Hyun <[email protected]>

Closes #20590 from dongjoon-hyun/SPARK-23399.

(cherry picked from commit 357babd)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 357babd Feb 14, 2018
@dongjoon-hyun dongjoon-hyun deleted the SPARK-23399 branch February 14, 2018 16:28