12 changes: 6 additions & 6 deletions dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -201,12 +201,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
oro/2.0.8//oro-2.0.8.jar
osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
protobuf-java/2.5.0//protobuf-java-2.5.0.jar
py4j/0.10.9.2//py4j-0.10.9.2.jar
pyrolite/4.30//pyrolite-4.30.jar
12 changes: 6 additions & 6 deletions dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -172,12 +172,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
oro/2.0.8//oro-2.0.8.jar
osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
protobuf-java/2.5.0//protobuf-java-2.5.0.jar
py4j/0.10.9.2//py4j-0.10.9.2.jar
pyrolite/4.30//pyrolite-4.30.jar
2 changes: 1 addition & 1 deletion pom.xml
@@ -136,7 +136,7 @@
<kafka.version>2.8.0</kafka.version>
<!-- After 10.15.1.3, the minimum required version is JDK9 -->
<derby.version>10.14.2.0</derby.version>
-<parquet.version>1.12.0</parquet.version>
+<parquet.version>1.12.1</parquet.version>
<orc.version>1.6.10</orc.version>
<jetty.version>9.4.43.v20210629</jetty.version>
<jakartaservlet.version>4.0.3</jakartaservlet.version>
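As a side note (not part of this diff): a minimal way to confirm which Parquet build actually ends up on a Spark classpath after this bump, e.g. from a spark-shell session. org.apache.parquet.Version ships in parquet-common; the check below is purely illustrative and is not something added by this PR.

  // Illustrative check only: print the Parquet build string bundled on the classpath.
  // After this upgrade it should report a 1.12.1 build.
  println(org.apache.parquet.Version.FULL_VERSION)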
Binary file not shown (the new test-data/malformed-file-offset.parquet fixture read by the test below).
@@ -855,6 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
}
}

test("SPARK-36726: test incorrect Parquet row group file offset") {
readParquetFile(testFile("test-data/malformed-file-offset.parquet")) { df =>
assert(df.count() == 3650)
Member: should we test for both the vec and non-vec code paths?

Member (Author): I think readParquetFile already takes care of that; see FileBasedDataSourceTest.readFile.

Member: I thought vectorized is enabled by default and the non-vectorized path will not be tested (protected def readFile(path: String, testVectorized: Boolean = true)), so we would need:

  Seq(true, false).foreach { vec =>
    readParquetFile(testFile("file.parquet"), vec) { df =>
      ...
    }
  }

Member (Author): This is how FileBasedDataSourceTest.readFile is defined:

  protected def readFile(path: String, testVectorized: Boolean = true)
      (f: DataFrame => Unit): Unit = {
    withSQLConf(vectorizedReaderEnabledKey -> "false") {
      f(spark.read.format(dataSourceName).load(path.toString))
    }
    if (testVectorized) {
      withSQLConf(vectorizedReaderEnabledKey -> "true") {
        f(spark.read.format(dataSourceName).load(path.toString))
      }
    }
  }

so to me it tests both paths. Did I miss anything?

Member: I am surprised that the non-vectorized reader is always tested, with or without testVectorized enabled...

Member (Author): Yes, I guess it makes sense since in most cases we'd want to test both code paths and this is convenient for that.

Member: You are right. I'm surprised that the non-vectorized path is always tested. Many tests explicitly test both paths, which is unnecessary given that the non-vectorized path is always tested.

+    }
+  }

test("VectorizedParquetRecordReader - direct path read") {
val data = (0 to 10).map(i => (i, (i + 'a').toChar.toString))
withTempPath { dir =>
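Putting the review thread together: with FileBasedDataSourceTest.readFile defined as quoted above, the new SPARK-36726 test already runs once with the vectorized Parquet reader disabled and once with it enabled. A rough sketch of that effective expansion, assuming the suite helpers withSQLConf and testFile and the config key spark.sql.parquet.enableVectorizedReader (this is not code from the PR):

  // Roughly what readParquetFile expands to for the new test, spelled out per code path.
  Seq("false", "true").foreach { vectorized =>
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> vectorized) {
      val df = spark.read.parquet(testFile("test-data/malformed-file-offset.parquet"))
      assert(df.count() == 3650)
    }
  }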