Conversation

@ericl (Contributor) commented Oct 17, 2016

What changes were proposed in this pull request?

There was a bug introduced in #14690 which broke refreshByPath with converted hive tables (though, it turns out, it was very difficult to refresh converted hive tables anyway, since you had to specify the exact path of one of the partitions).

This changes refreshByPath to invalidate by prefix instead of exact match, and fixes the issue.
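The exact-match vs. prefix-match distinction can be sketched as follows. This is a minimal standalone illustration, not the actual Spark implementation: the `cachedPlans` map, the plan names, and the path strings are all made up for the example (the real cache lives in Spark's CacheManager and holds qualified Hadoop paths).

```scala
// Sketch: a cache registers only the root paths of each cached plan.
// For a partitioned table those are the partition directories, so an
// exact match on the table path finds nothing.
object InvalidationSketch {
  val cachedPlans: Map[String, Seq[String]] = Map(
    "plan1" -> Seq("/warehouse/table/part=1", "/warehouse/table/part=2"),
    "plan2" -> Seq("/other/table")
  )

  // Old behavior: invalidate only plans whose root paths exactly equal
  // the given path.
  def invalidateExact(path: String): Set[String] =
    cachedPlans.filter { case (_, paths) => paths.contains(path) }.keySet

  // New behavior: invalidate any plan with a root path under the given
  // prefix, so refreshing the table path catches all partitions.
  def invalidateByPrefix(prefix: String): Set[String] =
    cachedPlans.filter { case (_, paths) =>
      paths.exists(_.startsWith(prefix))
    }.keySet
}
```

With exact matching, refreshing "/warehouse/table" finds no cached plan, because only the partition paths were registered; with prefix matching, the table path invalidates the plan covering both partitions.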

cc @sameeragarwal for refreshByPath changes
@mallman

How was this patch tested?

Extended unit test.

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67090 has finished for PR 15521 at commit 7415666.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Invalidate and refresh all the cached data (and the associated metadata) for any dataframe that
 * contains the given data source path. Path matching is by prefix, i.e. "/" would invalidate
Contributor:

So we are changing the semantic of REFRESH PATH right?

@ericl (Contributor Author):

Yeah

@cloud-fan (Contributor):

Before #14690, users could refresh the whole table (including all partitions) by running REFRESH on the table path, right? Even when a partition path is not under the table path.

@ericl (Contributor Author) commented Oct 18, 2016

That's correct, refresh table has always worked. There was just a bug introduced that broke refreshByPath, since it doesn't invalidate the lazy val.
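The "doesn't invalidate the lazy val" point is the crux: a Scala `lazy val` is computed once and can never be recomputed, so a refresh has no way to clear it. The populate-on-demand `var` pattern (as in the `cachedAllPartitions` snippet discussed below) avoids this. A minimal sketch, with `listPartitions` standing in for the expensive metastore/file listing:

```scala
// Sketch: a resettable on-demand cache instead of a lazy val.
// A `lazy val allPartitions = listPartitions()` would be stuck with
// its first result forever; a nullable var can be cleared on refresh.
class PartitionCacheSketch(listPartitions: () => Seq[String]) {
  private var cachedAllPartitions: Seq[String] = null

  def allPartitions: Seq[String] = synchronized {
    if (cachedAllPartitions == null) {
      cachedAllPartitions = listPartitions()  // populate on first access
    }
    cachedAllPartitions
  }

  // Refresh simply drops the cached listing; the next access recomputes.
  def refresh(): Unit = synchronized { cachedAllPartitions = null }
}
```

After `refresh()`, the next call to `allPartitions` re-runs the listing and picks up any new files, which is exactly what the broken lazy val could not do.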

@sameeragarwal (Member):

The new prefix matching semantics make sense to me.

private val baseLocation = catalogTable.storage.locationUri

// Populated on-demand by calls to cachedAllPartitions
private var allPartitions: ListingFileCatalog = null
Contributor:

nit: to match the existing naming style, we should name this var cachedAllPartitions, and name the public method allPartitions

@ericl (Contributor Author):

Done

sql("select * from test").count()
}
assert(e2.getMessage.contains("FileNotFoundException"))
spark.catalog.refreshByPath(dir.getAbsolutePath)
Contributor:

Note: before #14690, users needed to refresh one of the partition paths to invalidate the cache, but now they have to refresh the table path, because TableFileCatalog.rootPaths only contains the table path while ListingFileCatalog.rootPaths only contains partition paths.

I think it's better than before, but it's still a breaking change, so should we document it in the 2.1 release notes?

@ericl (Contributor Author):

Makes sense. To get the old behavior, they can also disable the feature flag.

.map(_.makeQualified(fs.getUri, fs.getWorkingDirectory))
.contains(qualifiedPath)
.map(_.makeQualified(fs.getUri, fs.getWorkingDirectory).toString)
.exists(_.startsWith(prefixToInvalidate))
@cloud-fan (Contributor) commented Oct 18, 2016

Why do we need the prefix resolution? I think it's useful, so that users can refresh the table path to invalidate the cache for a partitioned data source table, but it's not related to this PR, right?

@ericl (Contributor Author) commented Oct 18, 2016

You actually need this when metastore partition pruning is disabled for converted hive tables. Otherwise, the unit test below would fail in that case.

(but yeah, we could also leave that alone)
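The diff in this thread compares fully qualified path strings, mapping through makeQualified before the startsWith check. A rough standalone analog (without Hadoop's Path.makeQualified, which resolves the filesystem scheme, authority, and working directory; the `qualify` helper here is a made-up stand-in) looks like:

```scala
// Sketch: qualify both sides before the prefix comparison, so that a
// bare "/warehouse/t" and "file:///warehouse/t" compare consistently.
object PrefixMatchSketch {
  // Stand-in for Path.makeQualified: prepend a scheme if missing.
  def qualify(path: String): String =
    if (path.contains("://")) path else "file://" + path

  def shouldInvalidate(rootPaths: Seq[String], refreshedPath: String): Boolean = {
    val prefix = qualify(refreshedPath)
    rootPaths.map(qualify).exists(_.startsWith(prefix))
  }
}
```

One caveat of raw string prefixes is that "/warehouse/t" is also a prefix of "/warehouse/t2", so a sibling directory with a shared name prefix could be invalidated as well; that is a conservative over-invalidation (an unnecessary cache miss) rather than a correctness problem.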

@SparkQA

SparkQA commented Oct 18, 2016

Test build #67137 has finished for PR 15521 at commit 5088ae0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

thanks, merging to master!

@asfgit asfgit closed this in 5f20ae0 Oct 19, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
Author: Eric Liang <[email protected]>

Closes apache#15521 from ericl/fix-caching.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017