[SPARK-18726][SQL]resolveRelation for FileFormat DataSource don't need to listFiles twice #17081

windpiger · 2017-02-27T08:17:06Z

What changes were proposed in this pull request?

Currently, when resolveRelation is called for a FileFormat DataSource without a user-provided schema, listFiles is executed twice in InMemoryFileIndex.

This PR adds a FileStatusCache to DataSource so that the files are listed only once.

However, there is a bug in InMemoryFileIndex, see:
SPARK-19748
SPARK-19761
so this PR should go in after SPARK-19748 and SPARK-19761.
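
To illustrate the idea, here is a minimal, self-contained sketch of how a shared cache can turn two listings into one. The names `CountingFileSystem`, `ToyStatusCache`, `ToyFileIndex` and `ListOnceDemo` are hypothetical stand-ins, not Spark's actual classes, and the sketch assumes the two listings come from two index objects built over the same path (presumably one for schema inference and one for the resolved relation).

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration only; these are NOT Spark's
// FileStatusCache or InMemoryFileIndex.
final class CountingFileSystem(files: Map[String, Seq[String]]) {
  var listCalls = 0
  // Counts every listing so the demo can observe how often it happens.
  def listFiles(dir: String): Seq[String] = {
    listCalls += 1
    files.getOrElse(dir, Seq.empty)
  }
}

final class ToyStatusCache {
  private val entries = mutable.Map.empty[String, Seq[String]]
  // Returns the cached listing, or runs `list` once and remembers the result.
  def getOrList(dir: String)(list: String => Seq[String]): Seq[String] =
    entries.getOrElseUpdate(dir, list(dir))
}

// A toy index that lists its directory eagerly when it is constructed.
final class ToyFileIndex(fs: CountingFileSystem, dir: String, cache: Option[ToyStatusCache]) {
  val allFiles: Seq[String] = cache match {
    case Some(c) => c.getOrList(dir)(fs.listFiles)
    case None => fs.listFiles(dir)
  }
}

object ListOnceDemo {
  // Builds two indexes over the same directory and reports the number of listings.
  def run(useCache: Boolean): Int = {
    val fs = new CountingFileSystem(Map("/data" -> Seq("part-0", "part-1")))
    val cache = if (useCache) Some(new ToyStatusCache) else None
    val forInference = new ToyFileIndex(fs, "/data", cache)
    val forRelation = new ToyFileIndex(fs, "/data", cache)
    require(forInference.allFiles == forRelation.allFiles)
    fs.listCalls
  }
}
```

Without the shared cache each index lists the directory itself; with it, the second index reuses the first listing.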

How was this patch tested?

unit test added

SparkQA · 2017-02-27T16:45:50Z

Test build #73500 has finished for PR 17081 at commit 6b5454a.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-28T00:02:37Z

Test build #73539 has started for PR 17081 at commit f1da0a4.

windpiger · 2017-03-01T22:37:33Z

retest this please

windpiger · 2017-03-01T22:52:00Z

retest this please

SparkQA · 2017-03-01T23:58:52Z

Test build #73715 has finished for PR 17081 at commit f1da0a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-02T00:13:09Z

Test build #73716 has finished for PR 17081 at commit f79f12c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-02T04:51:28Z

Test build #73724 has finished for PR 17081 at commit a8c1dea.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ResolvedDataSourceSuite extends SparkFunSuite with SharedSQLContext

windpiger · 2017-03-02T04:54:27Z

cc @cloud-fan @gatorsmile @ericl could you help review this? Thanks :)

gatorsmile · 2017-03-02T06:36:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

        } else {
-          new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
+          new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema),
+            fileStatusCache)


Nit: indent issue

SparkQA · 2017-03-02T06:50:56Z

Test build #73726 has finished for PR 17081 at commit 60fa037.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-02T06:52:30Z

Test build #73733 has started for PR 17081 at commit 9a73947.

gatorsmile · 2017-03-02T06:53:55Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

+            globbedPaths,
+            options,
+            Some(partitionSchema),
+            fileStatusCache)


new InMemoryFileIndex(
  sparkSession, globbedPaths, options, Some(partitionSchema), fileStatusCache)

This is also valid

gatorsmile · 2017-03-02T06:59:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

        SparkHadoopUtil.get.globPathIfNecessary(qualified)
      }.toArray
-      new InMemoryFileIndex(sparkSession, globbedPaths, options, None)
+      new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)


This also impacts the streaming code path. If it is fine for streaming, the code changes look good to me.

I have made it local to the non-streaming FileFormat match case only~

cloud-fan · 2017-03-02T07:03:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

  lazy val sourceInfo: SourceInfo = sourceSchema()
  private val caseInsensitiveOptions = CaseInsensitiveMap(options)
-
+  private lazy val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)


what's the lifetime of this cache?

cloud-fan · 2017-03-02T07:06:49Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

            catalogTable.get,
            catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
        } else {
-          new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))


I'd like to create the file status cache as a local variable, pass it to getOrInferFileFormatSchema, then use it here. It's much easier to reason about the lifetime of this cache this way.

ok, I think it is more reasonable~ thanks~
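
A rough sketch of the suggestion above, under stated assumptions: the cache is created as a local value inside the resolving method and handed to the inference helper, which falls back to a no-op cache when the caller passes nothing. `ToyCache`, `ToyNoopCache`, `ToyMapCache` and `LocalCacheDemo` are hypothetical names for illustration and do not mirror Spark's real FileStatusCache API.

```scala
import scala.collection.mutable

// Hypothetical toy types for illustration; not Spark's actual FileStatusCache API.
trait ToyCache {
  def get(path: String): Option[Seq[String]]
  def put(path: String, files: Seq[String]): Unit
}

// Remembers nothing, so every lookup misses.
object ToyNoopCache extends ToyCache {
  def get(path: String): Option[Seq[String]] = None
  def put(path: String, files: Seq[String]): Unit = ()
}

final class ToyMapCache extends ToyCache {
  private val entries = mutable.Map.empty[String, Seq[String]]
  def get(path: String): Option[Seq[String]] = entries.get(path)
  def put(path: String, files: Seq[String]): Unit = entries.update(path, files)
}

object LocalCacheDemo {
  var listings = 0 // counts simulated file system listings

  private def list(path: String, cache: ToyCache): Seq[String] =
    cache.get(path).getOrElse {
      listings += 1
      val files = Seq(s"$path/part-0")
      cache.put(path, files)
      files
    }

  // Callers that pass no cache get the no-op one, so behaviour is unchanged for them.
  def inferSchema(path: String, cache: ToyCache = ToyNoopCache): Int =
    list(path, cache).size

  def resolve(path: String): Seq[String] = {
    val cache = new ToyMapCache // local: it cannot outlive this call
    val inferredFrom = inferSchema(path, cache)
    val files = list(path, cache) // served from the cache filled above
    require(files.size == inferredFrom)
    files
  }
}
```

Because the cache is a local value, its lifetime ends with `resolve`, and callers that rely on the default argument still list on every call.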

SparkQA · 2017-03-02T13:12:56Z

Test build #73758 has finished for PR 17081 at commit 28c8158.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-02T13:18:36Z

Test build #73759 has finished for PR 17081 at commit 92618b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-02T16:18:47Z

Why did you close it?

windpiger · 2017-03-02T22:58:37Z

Oh... sorry, I don't know when I closed it...

gatorsmile · 2017-03-03T00:18:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

-  private def getOrInferFileFormatSchema(format: FileFormat): (StructType, StructType) = {
+  private def getOrInferFileFormatSchema(
+      format: FileFormat,
+      fileStatusCache: FileStatusCache = NoopCache): (StructType, StructType) = {


Please update the function description with a new @param

ok, thanks~

gatorsmile · 2017-03-03T00:19:15Z

Please remove [FOLLOW-UP] from the PR title. Thanks!

SparkQA · 2017-03-03T00:58:00Z

Test build #73791 has finished for PR 17081 at commit 92618b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-03T01:47:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

   *     be any further inference in any triggers.
   *
   * @param format the file format object for this DataSource
+   * @param fileStatusCache fileStatusCache for InMemoryFileIndex


@param fileStatusCache the shared cache for file statuses to speed up listing

SparkQA · 2017-03-03T02:47:02Z

Test build #73793 has finished for PR 17081 at commit f6ec4fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-03T04:02:30Z

Test build #73798 has finished for PR 17081 at commit 3e495a7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-03T05:07:48Z

LGTM

cloud-fan · 2017-03-03T07:55:03Z

thanks, merging to master!

windpiger added 2 commits February 27, 2017 16:04

[SPAKR-18726][SQL]resolveRelation for FileFormate DataSource don't ne…

0082b76

…ed to listFiles twice

add test case

6b5454a

windpiger changed the title ~~[SPAKR-18726][SQL][WIP]resolveRelation for FileFormat DataSource don't need to listFiles twice~~ [SPAKR-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice Feb 27, 2017

windpiger changed the title ~~[SPAKR-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice~~ [SPARK-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice Feb 27, 2017

fix a style

f1da0a4

windpiger mentioned this pull request Feb 28, 2017

[SPARK-19748][SQL]refresh function has a wrong order to do cache invalidate and regenerate the inmemory var for InMemoryFileIndex with FileStatusCache #17079

Closed

Merge branch 'master' into resolveDataSourceScanFilesTwice

f79f12c

windpiger added 2 commits March 2, 2017 09:50

fix test failed

a8c1dea

add a lazy

60fa037

gatorsmile reviewed Mar 2, 2017

fix code style

9a73947

gatorsmile reviewed Mar 2, 2017

Merge branch 'master' of github.com:apache/spark into resolveDataSour…

850094c

…ceScanFilesTwice

gatorsmile reviewed Mar 2, 2017

cloud-fan reviewed Mar 2, 2017

windpiger added 4 commits March 2, 2017 19:03

make filestatuscache local var

c39eb26

modify a test case

f3332cb

modify a test case

9cadd41

modify a test case

28c8158

remove an empty line

92618b3

windpiger closed this Mar 2, 2017

windpiger reopened this Mar 2, 2017

gatorsmile reviewed Mar 3, 2017

add param comment

f6ec4fe

windpiger changed the title ~~[SPARK-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice~~ [SPARK-18726][SQL]resolveRelation for FileFormat DataSource don't need to listFiles twice Mar 3, 2017

gatorsmile reviewed Mar 3, 2017

fix a comment

3e495a7

asfgit closed this in 982f322 Mar 3, 2017
