
Conversation

@pan3793 (Member) commented Apr 30, 2025

What changes were proposed in this pull request?

On a busy Hadoop cluster, GetFileInfo and GetBlockLocations contribute the most RPC load to the HDFS NameNode. After investigating the Spark Parquet vectorized reader, I believe 3 of the 4 RPCs issued per file can be eliminated.

(screenshot: NameNode RPC metrics by method)

Currently, the Parquet vectorized reader issues 4 NameNode RPCs when reading each file (or split):

  1. Read the footer - one GetFileInfo and one GetBlockLocations
  2. Read the data (row groups) - one GetFileInfo and one GetBlockLocations

The key ideas of this PR are:

  1. The driver already knows the FileStatus of each Parquet file during the planning phase. We can ship that FileStatus from the driver to the executors via PartitionedFile, so a task doesn't need to ask the NameNode again; this saves two GetFileInfo RPCs.
  2. Reuse the SeekableInputStream between reading the footer and the row groups; this saves one GetBlockLocations RPC.

This PR first requires some changes on the Parquet side (already included in Parquet 1.16.0).
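A minimal sketch of idea (1) in Scala, assuming an abbreviated PartitionedFile (the class name here is hypothetical, and the real PartitionedFile carries more fields, e.g. partition values and block locations):

  import org.apache.hadoop.fs.FileStatus
  import org.apache.spark.paths.SparkPath

  // Ship the FileStatus obtained during planning to the executors so tasks
  // never re-issue GetFileInfo for the same file.
  case class PartitionedFileSketch(
      fileStatus: FileStatus,
      start: Long,
      length: Long) {
    // Derived from the status on demand; @transient so it is not serialized.
    @transient lazy val filePath: SparkPath = SparkPath.fromFileStatus(fileStatus)
  }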

Why are the changes needed?

Reduce unnecessary NameNode RPCs to improve performance and stability for large Hadoop clusters.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Passes existing UTs; a few UTs were tuned to adapt to the change.

Manually tested on a small Hadoop cluster using TPC-H Q4 against sf3000 Parquet tables.

HDFS NameNode metrics (master vs. this PR)

(metrics screenshots omitted)

Production Verification

The patch has also been deployed to a production cluster for 4 months, where 95% of workloads are Spark jobs.

Before: (NameNode metrics screenshot)

After: (NameNode metrics screenshot)

Was this patch authored or co-authored using generative AI tooling?

No.

Member Author

This constructor internally calls HadoopInputFile.fromPath(file, configuration), which issues an unnecessary GetFileInfo RPC:

  public static HadoopInputFile fromPath(Path path, Configuration conf) throws IOException {
    FileSystem fs = path.getFileSystem(conf);
    // fs.getFileStatus is a GetFileInfo RPC to the NameNode
    return new HadoopInputFile(fs, fs.getFileStatus(path), conf);
  }
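By contrast, the fromStatus factory, which this PR switches to, reuses a FileStatus the caller already has and issues no NameNode RPC. A minimal usage sketch (the open helper is illustrative):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.FileStatus
  import org.apache.parquet.hadoop.util.HadoopInputFile

  // The status was already fetched on the driver, so opening the file here
  // costs no GetFileInfo RPC.
  def open(fileStatus: FileStatus, conf: Configuration): HadoopInputFile =
    HadoopInputFile.fromStatus(fileStatus, conf)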

@pan3793 (Member Author) commented May 6, 2025

cc @sunchao @wangyum @wgtmac I marked this PR as draft because it requires changes on the Parquet side first. It would be great if you could take a look at this idea. Thank you in advance.

@pan3793 pan3793 marked this pull request as draft May 6, 2025 02:13
@pan3793 pan3793 changed the title [WIP] Reduce HDFS NameNode RPC on vectorized Parquet reader [WIP][SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader May 6, 2025
@wangyum (Member) commented May 6, 2025

also cc @turboFei

@pan3793 pan3793 changed the title [WIP][SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader [SPARK-52011][SQL] Reduce HDFS NameNode RPC on vectorized Parquet reader Sep 4, 2025
@pan3793 pan3793 marked this pull request as ready for review September 4, 2025 13:38
@github-actions github-actions bot removed the BUILD label Sep 4, 2025
@pan3793 (Member Author) commented Sep 5, 2025

ping @wangyum @sunchao @LuciferYang @yaooqinn

This PR is ready for review. Looking forward to your feedback!

@pan3793 (Member Author) commented Sep 8, 2025

cc @viirya @peter-toth @cloud-fan

filePath: SparkPath,
start: Long,
length: Long,
fileStatus: FileStatus,
Contributor

Similarly, due to the addition of fileStatus, the constructor of PartitionedFile can also be further simplified, right?

Contributor

In addition, since fileStatus will hold more state and also participate in serialization, will this lead to additional memory overhead and serialization pressure?

@pan3793 (Member Author) Sep 9, 2025

The fileStatus should occupy a little more memory, but I haven't seen OOM issues during the rollout of this change to the online cluster.

Contributor

@cloud-fan Are there also risks of breaking internal APIs with modifications similar to those made here and in FileFormat.createMetadataInternalRow?

try {
vectorizedReader.initialize(split, hadoopAttemptContext, Option.apply(fileFooter))
vectorizedReader.initialize(
split, hadoopAttemptContext, Some(inputFile), Some(inputStream), Some(fileFooter))
Contributor

nit: Although fileFooter should not be null, it's still advisable to use Option(fileFooter) just to be safe

Member Author

addressed

Member Author

@LuciferYang On second thought, I went back to using Some.

Generally, if something goes wrong, we should fail immediately rather than let the illegal state propagate. Here, we should avoid NULL propagation as much as possible.

fileSize: Long = 0L,
otherConstantMetadataColumnValues: Map[String, Any] = Map.empty) {

@transient lazy val filePath: SparkPath = SparkPath.fromFileStatus(fileStatus)
Member Author

If SparkPath.fromFileStatus is cheap enough, we don't need to materialize filePath, which saves some memory.

.withMetadataFilter(metadataFilter).build

val inputFile = HadoopInputFile.fromStatus(file.fileStatus, sharedConf)
val inputStream = inputFile.newStream()
Member

Do we need to ensure this is properly closed if something goes wrong in the following code?

Member Author

The risk is low but still possible before ownership of the inputStream is transferred to the vectorizedReader, so I wrapped the code block in a try-finally to ensure the inputStream won't leak.

This causes an indentation change; please view the diff with "Hide whitespace".
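A generic sketch of this guard (the handOff helper and its parameters are hypothetical):

  import java.io.Closeable

  // Close the stream in `finally` unless ownership was successfully handed
  // over, e.g. via vectorizedReader.initialize(..., Some(stream), ...).
  def handOff[S <: Closeable](open: () => S)(initReader: S => Unit): Unit = {
    val stream = open()
    var transferred = false
    try {
      initReader(stream)  // may throw before the reader takes ownership
      transferred = true  // from here on, the reader is responsible for closing
    } finally {
      if (!transferred) stream.close()  // close only on failure
    }
  }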

(None, None, footer)
} else {
ParquetFooterReader.readFooter(conf, file, ParquetFooterReader.SKIP_ROW_GROUPS)
// When there are vectorized reads, we can avoid
Member

Can we extract this into a new util method in ParquetFooterReader that returns both the footer and the input stream? We could then avoid duplicating this code in two places.

@pan3793 (Member Author) Sep 11, 2025

Addressed, please check the updated ParquetFooterReader.

BTW, I don't see a special reason to keep this file in Java; since I'm going to use Scala data structures (Tuple, Option) in this class, I converted it to Scala.
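For reference, an approximate sketch of the helper's shape, inferred from its call sites in this PR ("(None, None, footer)" on the plain path vs. handing back the open stream on the vectorized path); the readFooter parameter stands in for the stream-accepting footer API added on the Parquet side, so the merged signatures may differ:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.FileStatus
  import org.apache.parquet.hadoop.metadata.ParquetMetadata
  import org.apache.parquet.hadoop.util.HadoopInputFile
  import org.apache.parquet.io.{InputFile, SeekableInputStream}

  def openFileAndReadFooter(
      fileStatus: FileStatus,
      conf: Configuration,
      detachFileInputStream: Boolean,
      readFooter: (InputFile, SeekableInputStream, Boolean) => ParquetMetadata)
      : (Option[InputFile], Option[SeekableInputStream], ParquetMetadata) = {
    val inputFile = HadoopInputFile.fromStatus(fileStatus, conf) // no GetFileInfo
    val stream = inputFile.newStream()
    if (detachFileInputStream) {
      // Vectorized path: read the footer with row groups and hand the open
      // stream back so the reader reuses it for the row groups.
      val footer = readFooter(inputFile, stream, /* skipRowGroups = */ false)
      (Some(inputFile), Some(stream), footer)
    } else {
      // Non-vectorized path: skip row groups and close the stream right away.
      try (None, None, readFooter(inputFile, stream, /* skipRowGroups = */ true))
      finally stream.close()
    }
  }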

.getMetadataFilter();
}
return readFooter(configuration, file.toPath(), filter);
return readFooter(HadoopInputFile.fromStatus(file.fileStatus(), configuration), filter);
Member

I think we can remove SKIP_ROW_GROUPS and WITH_ROW_GROUPS now, as they are no longer used.

Member Author

SKIP_ROW_GROUPS is now unused, but WITH_ROW_GROUPS is still used by the aggregate push-down case in ParquetPartitionReaderFactory.openFileAndReadFooter. I removed both constants and replaced WITH_ROW_GROUPS with the literal false.

private var hasNext = true
private lazy val row: InternalRow = {
val footer = getFooter(file)
val (_, _, footer) = openFileAndReadFooter(file)
Member

If I read it correctly, previously for the non-vectorized case, getFooter skipped row groups (SKIP_ROW_GROUPS). But now openFileAndReadFooter always reads the footer with row groups?

Member Author

For the non-vectorized case, detachFileInputStream is false, so it still skips row groups (SKIP_ROW_GROUPS).

(None, None, footer)
} else {
ParquetFooterReader.readFooter(conf, file, ParquetFooterReader.SKIP_ROW_GROUPS)
// When there are vectorized reads, we can avoid
Member

The comment said "When there are vectorized reads..." because getFooter had already checked enableVectorizedReader. I think this else block now handles both the non-vectorized and vectorized cases?

Member Author

Yes, the logic is controlled by enableVectorizedReader; the comment only describes the vectorized-reading optimization. If you find the current comment confusing, I can update it to also mention the behavior of the non-vectorized reading path.

@pan3793 pan3793 requested review from sunchao and viirya September 11, 2025 16:38
@pan3793 (Member Author) commented Sep 15, 2025

kindly ping @sunchao and @viirya, could you take another look when you have time?

filePath: SparkPath,
start: Long,
length: Long,
fileStatus: FileStatus,
Contributor

What's the cost of serializing the file status?

@pan3793 (Member Author) Sep 15, 2025

@cloud-fan I think the path contributes the majority of the size.

public class FileStatus implements Writable, Comparable<Object>,
    Serializable, ObjectInputValidation {
  ...
  private Path path;           // backed by URI
  private long length;
  private Boolean isdir;
  private short block_replication;
  private long blocksize;
  private long modification_time;
  private long access_time;
  private FsPermission permission;
  private String owner;
  private String group;
  private Path symlink;        // likely be NULL
  private Set<AttrFlags> attr; // AttrFlags is enum
  ...
}

public class FsPermission implements Writable, Serializable,
    ObjectInputValidation {
  ...
  private FsAction useraction = null;  // FsAction is enum
  private FsAction groupaction = null;
  private FsAction otheraction = null;
  private Boolean stickyBit = false;
  ...
}

https://github.com/apache/hadoop/blob/branch-3.4.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileStatus.java

https://github.com/apache/hadoop/blob/branch-3.4.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/permission/FsPermission.java
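If one wants to check the actual cost, here is a quick Java-serialization measurement sketch (the path is made up for illustration; real HDFS statuses also carry owner/group/permission strings):

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}
  import org.apache.hadoop.fs.{FileStatus, Path}

  def serializedSize(obj: AnyRef): Int = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    buf.size()
  }

  val status = new FileStatus(1024L, false, 3, 128L * 1024 * 1024, 0L,
    new Path("hdfs://ns1/warehouse/db/tbl/part-00000.parquet"))
  println(serializedSize(status)) // dominated by the Path's URI string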

Contributor

Is it possible to have a custom serde for it that only sends the path string, along the lines of the sketch below? This reminds me of SerializableConfiguration, as these Hadoop classes are usually not optimized for serialization and transport.
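A hypothetical wrapper in the spirit of SerializableConfiguration (not part of this PR), writing only the fields Spark needs:

  import org.apache.hadoop.fs.{FileStatus, Path}

  class SerializableFileStatus(@transient var status: FileStatus)
      extends Serializable {
    private def writeObject(out: java.io.ObjectOutputStream): Unit = {
      out.writeUTF(status.getPath.toString)
      out.writeLong(status.getLen)
      out.writeLong(status.getModificationTime)
    }
    private def readObject(in: java.io.ObjectInputStream): Unit = {
      val path = new Path(in.readUTF())
      val len = in.readLong()
      val mtime = in.readLong()
      // Rebuilds a plain FileStatus; subclass-specific state
      // (e.g. S3AFileStatus) is lost, which is the concern raised below.
      status = new FileStatus(len, false, 0, 0, mtime, path)
    }
  }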

Member Author

That seems infeasible, because FileStatus has many subclasses (e.g. S3AFileStatus, ViewFsFileStatus) whose extra state a plain rebuilt FileStatus would lose.

Member Author

@cloud-fan The change basically moves the RPC cost from executor => storage service to driver => executors. In my env (HDFS with RBF), the latter is much cheaper than the former. I don't have a cloud env, so I can't give numbers for object storage services like S3.

Contributor

Hmm, then this may cause a regression for short queries?

Member

Hmm, not sure how much difference this will make in terms of driver memory usage. Is it easy to make the FileStatus optional in PartitionedFile and control it via a flag?

It seems in Parquet Java the file status is only used in one case: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java#L109-L132

Mostly we just need the file path and length, but yeah, this one use case seems critical for avoiding a duplicated NN call to fetch the file status again.

@pan3793 (Member Author) Sep 18, 2025

@sunchao Thanks for your suggestion. After an offline discussion with @cloud-fan, I understand his concerns about the overhead of FileStatus. Let me summarize the conclusion and my thoughts:

  1. There are many different Hadoop FileSystem implementations; getFileStatus might be cheap or backed by an executor-side cache in some of them, but in our case (HDFS with RBF) it's relatively heavy.
  2. There is an upcoming optimization to replace FileStatusCache with PathCache (carrying only the necessary metadata) on the driver side to reduce driver memory.
  3. @cloud-fan suggests constructing the FileStatus directly on the executor side.

So, I'm going to split this PR into two parts:

  1. I will experiment with (3), though I can only do it for HDFS cases (w/ and w/o RBF, w/ and w/o EC).
  2. Spin off the rest of the executor-side changes into a dedicated PR.

@pan3793 pan3793 marked this pull request as draft September 23, 2025 05:35
dongjoon-hyun pushed a commit that referenced this pull request Sep 23, 2025
### What changes were proposed in this pull request?

Reuse the InputStream in the vectorized Parquet reader between reading the footer and the row groups, on the executor side.

This PR is part of SPARK-52011; you can find more details at #50765.

### Why are the changes needed?

Reduce unnecessary NameNode RPCs to improve performance and stability for large Hadoop clusters.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

See #50765

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52384 from pan3793/SPARK-53633.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>