Skip to content

Conversation

@windpiger
Copy link
Contributor

@windpiger windpiger commented Feb 27, 2017

What changes were proposed in this pull request?

Currently when we resolveRelation for a FileFormat DataSource without providing user schema, it will execute listFiles twice in InMemoryFileIndex during resolveRelation.

This PR add a FileStatusCache for DataSource, this can avoid listFiles twice.

But there is a bug in InMemoryFileIndex see:
SPARK-19748
SPARK-19761,
so this pr should be after SPARK-19748/ SPARK-19761.

How was this patch tested?

unit test added

@windpiger windpiger changed the title [SPAKR-18726][SQL][WIP]resolveRelation for FileFormat DataSource don't need to listFiles twice [SPAKR-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice Feb 27, 2017
@windpiger windpiger changed the title [SPAKR-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice [SPARK-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice Feb 27, 2017
@SparkQA
Copy link

SparkQA commented Feb 27, 2017

Test build #73500 has finished for PR 17081 at commit 6b5454a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 28, 2017

Test build #73539 has started for PR 17081 at commit f1da0a4.

@windpiger
Copy link
Contributor Author

retest this please

@windpiger
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 1, 2017

Test build #73715 has finished for PR 17081 at commit f1da0a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73716 has finished for PR 17081 at commit f79f12c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73724 has finished for PR 17081 at commit a8c1dea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ResolvedDataSourceSuite extends SparkFunSuite with SharedSQLContext

@windpiger
Copy link
Contributor Author

windpiger commented Mar 2, 2017

cc @cloud-fan @gatorsmile @ericl could you help to review this?thanks :)

} else {
new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema),
fileStatusCache)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indent issue

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73726 has finished for PR 17081 at commit 60fa037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73733 has started for PR 17081 at commit 9a73947.

globbedPaths,
options,
Some(partitionSchema),
fileStatusCache)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

          new InMemoryFileIndex(
            sparkSession, globbedPaths, options, Some(partitionSchema), fileStatusCache)

This is also valid

SparkHadoopUtil.get.globPathIfNecessary(qualified)
}.toArray
new InMemoryFileIndex(sparkSession, globbedPaths, options, None)
new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also impacts the streaming code path. If it is fine to streaming, the code changes look good to me.

Copy link
Contributor Author

@windpiger windpiger Mar 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have make it local only in the no streaming FileFormat match case~

lazy val sourceInfo: SourceInfo = sourceSchema()
private val caseInsensitiveOptions = CaseInsensitiveMap(options)

private lazy val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
Copy link
Contributor

@cloud-fan cloud-fan Mar 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the lifetime of this cache?

catalogTable.get,
catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
} else {
new InMemoryFileIndex(sparkSession, globbedPaths, options, Some(partitionSchema))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to create file status cache as a local variable, pass it to getOrInferFileFormatSchema, then use it here. It's much easier to reason about the lifetime of this cache by this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think it is more reasonable~ thanks~

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73758 has finished for PR 17081 at commit 28c8158.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 2, 2017

Test build #73759 has finished for PR 17081 at commit 92618b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger windpiger closed this Mar 2, 2017
@gatorsmile
Copy link
Member

Why you closed it?

@windpiger windpiger reopened this Mar 2, 2017
@windpiger
Copy link
Contributor Author

oh...sorry , I don't know when I close it...

private def getOrInferFileFormatSchema(format: FileFormat): (StructType, StructType) = {
private def getOrInferFileFormatSchema(
format: FileFormat,
fileStatusCache: FileStatusCache = NoopCache): (StructType, StructType) = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the function description with a new @parm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks~

@gatorsmile
Copy link
Member

Please remove [FOLLOW-UP] from the PR title. Thanks!

@windpiger windpiger changed the title [SPARK-18726][SQL][FOLLOW-UP]resolveRelation for FileFormat DataSource don't need to listFiles twice [SPARK-18726][SQL]resolveRelation for FileFormat DataSource don't need to listFiles twice Mar 3, 2017
@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73791 has finished for PR 17081 at commit 92618b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* be any further inference in any triggers.
*
* @param format the file format object for this DataSource
* @param fileStatusCache fileStatusCache for InMemoryFileIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@param fileStatusCache the shared cache for file statuses to speed up listing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks!

@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73793 has finished for PR 17081 at commit f6ec4fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73798 has finished for PR 17081 at commit 3e495a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 982f322 Mar 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants