[SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary #25899
Conversation
Merge master
SPARK-29089 Parallelize DataSource#checkAndGlobPathIfNecessary
@cozos Do you have some benchmark numbers?

ok to test

@wangyum Not yet, I will look into running a comparison on the S3 Landsat dataset. Not sure what I'm going to do about HDFS or any other filesystems.
Here are some measurements on S3 Landsat files, from my laptop in San Francisco to a us-west-2 S3 bucket: a recursive list of 30 glob paths (subdirectories of s3a://landsat-pds/L8/001/003), resulting in a total of 1206 paths. The 30 glob paths:
```scala
object TestPaths {
  val hadoopConf = new Configuration()
  hadoopConf.set("fs.mockFs.impl", classOf[MockFileSystem].getName)
```
It's not needed here, but it is worth knowing that there is a package-scoped static method FileSystem.addFileSystemForTesting that lets you add a specific instance to the file system cache. I use it sometimes, as it lets me register mock file systems for existing schemes (s3, etc.).
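For context, a minimal mock filesystem of the kind registered under `fs.mockFs.impl` could look like the sketch below. This is illustrative only; the class body and glob behavior are assumptions, not the suite's actual MockFileSystem.

```scala
import java.net.URI

import org.apache.hadoop.fs.{FileStatus, Path, RawLocalFileSystem}

// Hypothetical mock: any FileSystem subclass with a no-arg constructor can be
// registered via "fs.<scheme>.impl" and picked up by Path#getFileSystem.
class MockFileSystem extends RawLocalFileSystem {
  override def getUri: URI = URI.create("mockFs://mock/")

  // Assumed behavior for illustration: pretend every glob pattern matches itself.
  override def globStatus(pathPattern: Path): Array[FileStatus] =
    Array(new FileStatus(0L, false, 1, 0L, 0L, pathPattern))
}
```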
```scala
val path1: Path = new Path("mockFs:///somepath1")
```
I think it is safest to actually have an authority here, e.g. mockFs://mock/somepath1. FileSystem uses the scheme and authority to cache things, and some of the FS operations work on the path part of the URI, ignoring the authority.
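As a quick illustration of the caching point (assuming the mockFs scheme is registered via configuration, as in the test above):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// FileSystem.get caches instances keyed by (scheme, authority, user), so two
// paths sharing "mockFs://mock" resolve to the same cached FileSystem object.
val conf = new Configuration()
conf.set("fs.mockFs.impl", classOf[MockFileSystem].getName)

val fs1 = new Path("mockFs://mock/somepath1").getFileSystem(conf)
val fs2 = new Path("mockFs://mock/somepath2").getFileSystem(conf)
assert(fs1 eq fs2) // same instance, served from the FileSystem cache
```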
```diff
-    if (checkEmptyGlobPath && globPath.isEmpty) {
-      throw new AnalysisException(s"Path does not exist: $qualified")
+    val qualifiedPaths = pathStrings
+      .map{pathString =>
```
Nit: `.map { pathString =>`
```diff
-      throw new AnalysisException(s"Path does not exist: ${globPath.head}")
+    // Split the paths into glob and non glob paths, because we don't need to do an existence check
+    // for globbed paths.
+    val globPaths = qualifiedPaths
```
I think you can get both at once with `val (globPaths, nonGlobPaths) = qualifiedPaths.partition(SparkHadoopUtil.get.isGlobPath)`.
```scala
assert(
  resultPaths.toSet == Set(
```
Small nit: use `===` in asserts.
```scala
val nonGlobPaths = qualifiedPaths
  .filter(path => !SparkHadoopUtil.get.isGlobPath(path))

val globbedPaths = globPaths.par.flatMap { globPath =>
```
`globPaths.par.flatMap` -> `ThreadUtils.parmap(globPaths, "globPath", 8)`?
spark/core/src/test/scala/org/apache/spark/util/ThreadUtilsSuite.scala, lines 131 to 162 at d4420b4:
| test("parmap should be interruptible") { | |
| val t = new Thread() { | |
| setDaemon(true) | |
| override def run() { | |
| try { | |
| // "par" is uninterruptible. The following will keep running even if the thread is | |
| // interrupted. We should prefer to use "ThreadUtils.parmap". | |
| // | |
| // (1 to 10).par.flatMap { i => | |
| // Thread.sleep(100000) | |
| // 1 to i | |
| // } | |
| // | |
| ThreadUtils.parmap(1 to 10, "test", 2) { i => | |
| Thread.sleep(100000) | |
| 1 to i | |
| }.flatten | |
| } catch { | |
| case _: InterruptedException => // excepted | |
| } | |
| } | |
| } | |
| t.start() | |
| eventually(timeout(10.seconds)) { | |
| assert(t.isAlive) | |
| } | |
| t.interrupt() | |
| eventually(timeout(10.seconds)) { | |
| assert(!t.isAlive) | |
| } | |
| } |
Yeah, good point; we want to use ThreadUtils to avoid reusing the global thread pool.
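A rough, self-contained sketch of what the parmap-based version could look like (ThreadUtils and SparkHadoopUtil are Spark-internal helpers; the glob pattern here is made up, and the exact merged code may differ):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.util.ThreadUtils

val hadoopConf = new Configuration()
val globPaths = Seq(new Path("file:///tmp/data/part-*")) // illustrative pattern

// Expand each glob pattern on its own worker thread; parmap runs the closures
// on a dedicated, interruptible pool rather than Scala's global ForkJoinPool.
val globbedPaths =
  ThreadUtils.parmap(globPaths, "globPath", maxThreads = 8) { globPath =>
    val fs = globPath.getFileSystem(hadoopConf)
    SparkHadoopUtil.get.globPathIfNecessary(fs, globPath)
  }.flatten
```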
@wangyum Any thoughts on the number of threads? #25899 (comment)
8 looks fine to me.
Test build #111213 has finished for PR 25899 at commit
@cozos. Thank you for making a PR. In the Apache Spark community, we use the `[SPARK-XXXXX][COMPONENT] Title` convention for PR titles.
@dongjoon-hyun Ok done, how does that sound? Should I update the JIRA as well?
Nice experiment! I guess in-EC2 you're limited by the number of cores, but latency is also nice and low. Remotely, latency is worse, so if there is anything we can do in parallel threads, there are some tangible benefits. In both local and remote S3 interaction, rename() is faked with a COPY, which runs at 6-10 MB/s; that can be done via the thread pool too if you can configure the AWS SDK to split up a large copy into parallel parts. That shares the same pools, so it's useful to have some capacity there in any process renaming things.
@steveloughran How should we proceed? Does 40 threads sound OK? What else does the PR need?
I'd say 40 sounds good; people can tune it.
```scala
    globResult
  }.flatten
} catch {
  case e: SparkException => throw e.getCause
```
Is there ever a case where the cause is null?
SparkException comes from ThreadUtils#parmap and ThreadUtils#awaitResult, which always seem to wrap another exception, i.e. the cause is never null.
This silently drops the caller's stack trace; I opened SPARK-47833 (#46028) to fix it.
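To illustrate the unwrapping being discussed, here is a toy sketch; the FileNotFoundException and bucket path are assumed examples, not the PR's code:

```scala
import java.io.FileNotFoundException

import org.apache.spark.SparkException
import org.apache.spark.util.ThreadUtils

// parmap wraps a failing task's exception in a SparkException, so rethrowing
// e.getCause surfaces the original FileNotFoundException to the caller.
try {
  ThreadUtils.parmap(Seq("s3a://bucket/missing"), "check", 2) { p =>
    throw new FileNotFoundException(p)
  }
} catch {
  case e: SparkException => throw e.getCause
}
```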
This code LGTM: it skips needless probes on the globbed paths and runs parallel checks on the others.
Test build #113603 has finished for PR 25899 at commit
Any next steps for me? Or just need 👀 from committers?
Quick follow-up to the "how many connections" discussion. It turns out that a sufficiently large number of S3 DNS lookups can trigger 503 throttling on DNS, and there isn't anything in the AWS library to react to this. The bigger the connection pool, the more connections you are likely to see on worker start-up, probably workers * S3A URLs * "fs.s3a.connection.maximum". Don't go overboard. And if you do see the problem, file a HADOOP JIRA with stack traces, configs, and anything else which could help implement resilience here. That doesn't make any difference to this PR, just something to know.
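To make that scale concrete with assumed numbers: a 100-executor cluster reading from 2 distinct S3A bucket URLs, with fs.s3a.connection.maximum set to 48, could open up to 100 × 2 × 48 = 9,600 connections at start-up, with each freshly instantiated client doing its own DNS lookups.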
Looks fairly reasonable to me, modulo a few comments. Normally I don't like the extra complexity but this seems like a common bottleneck.
```diff
     }

-    allGlobPath
+    allPaths.seq
```
Nit: can you call `.toSeq`? I'm working on Scala 2.13 support and I think this is deprecated.
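For what it's worth, a self-contained sketch of the distinction (the path values are made up):

```scala
// .seq is a 2.12-era bridge between sequential and parallel collections and is
// deprecated in Scala 2.13; .toSeq expresses the same conversion and compiles
// cleanly on both versions.
val globbedPaths = Seq("s3a://bucket/a/part-0000", "s3a://bucket/a/part-0001")
val nonGlobPaths = Seq("s3a://bucket/b/data.csv")
val allPaths: Seq[String] = (globbedPaths ++ nonGlobPaths).toSeq
```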
```scala
val hadoopConf = new Configuration()
hadoopConf.set("fs.mockFs.impl", classOf[MockFileSystem].getName)

val path1: Path = new Path("mockFs://mockFs/somepath1")
```
Small nit: no need for `: Path` everywhere here.
Test build #115136 has finished for PR 25899 at commit
This reverts commit 8cbc28a.
Test build #115166 has finished for PR 25899 at commit
This looks pretty good to me. I don't see a behavior change; it just adds parallelism. While that has implications of its own, I think it'll be a win for a real pain point.
Hi all, how can I get this PR accepted? Anything I can do to help with the process?
@steveloughran @wangyum you OK with it too? It seems OK for 3.1. Let me retrigger tests to see if there is still a failure though.
Jenkins retest this please.
LGTM. There's one thing to be aware of, which is that instantiating the relevant FileSystem client can itself do blocking network IO; hence apache/hadoop#1838, removing some network IO in instantiation. Does that mean there's anything wrong with this PR? No, only that performance is best if the relevant FS instances have already been preloaded into the FS cache. And those of us implementing filesystem connectors should do a better job at low-latency instantiation, even if it means async network startup threads and moving the blocking to the first FS API call instead.
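A small sketch of the preloading idea (the bucket URL is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Touching the filesystem once at startup pays the instantiation cost (client
// setup, any network IO) up front; later parallel glob/exists calls then hit
// the FileSystem cache instead of each racing to create a client.
val hadoopConf = new Configuration()
new Path("s3a://landsat-pds/").getFileSystem(hadoopConf)
```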
Test build #118372 has finished for PR 25899 at commit
Merged to master |
Thank you everybody! |
What changes were proposed in this pull request?
See JIRA: https://issues.apache.org/jira/browse/SPARK-29089
Mailing List: http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrameReader-bottleneck-in-DataSource-checkAndGlobPathIfNecessary-when-reading-S3-files-td27828.html
When using DataFrameReader#csv to read many files on S3, globbing and fs.exists on DataSource#checkAndGlobPathIfNecessary becomes a bottleneck.
From the mailing list discussions, an improvement that can be made is to parallelize the blocking FS calls:

> - have SparkHadoopUtils differentiate between files returned by globStatus(), and which therefore exist, and those which it didn't glob for - it will only need to check those.
> - add parallel execution to the glob and existence checks
Why are the changes needed?
Verifying/globbing files happens on the driver, and if these operations take a long time (for example, against S3), then the entire cluster has to wait, potentially sitting idle. This change hopes to make this process faster.
Does this PR introduce any user-facing change?
No
How was this patch tested?
I added a test suite `DataSourceSuite` (open to suggestions for better naming). See here and here for some measurements.
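For reference, a hedged sketch of how the parallelized helper might be invoked after this change; the exact signature, including the thread-count parameter, is an assumption based on the review discussion, and the paths are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.execution.datasources.DataSource

// Glob patterns are expanded and non-glob paths existence-checked in parallel
// on the driver, rather than one blocking FS call at a time.
val hadoopConf = new Configuration()
val resolved = DataSource.checkAndGlobPathIfNecessary(
  Seq("s3a://landsat-pds/L8/001/003/*", "s3a://landsat-pds/scene_list.gz"),
  hadoopConf,
  checkEmptyGlobPath = true,
  checkFilesExist = true,
  numThreads = 40)
```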