
Conversation

@dujunling

…tatistics to update the inputMetrics

What changes were proposed in this pull request?

In FileScanRDD, we update inputMetrics's bytesRead via updateBytesRead every 1000 rows and again when the iterator is closed.

However, when the iterator is closed we also invoke updateBytesReadWithFileSize, which adds the whole file's length to bytesRead.

As a result, bytesRead is wrong for queries with a limit, such as select * from table limit 1: only a few rows are read, but the full file size is still added.

Since Spark no longer supports Hadoop 2.5 and earlier, we can always obtain bytesRead from the Hadoop FileSystem statistics rather than from the file's length.
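
To make the double counting concrete, here is a minimal, self-contained sketch (assumed names and shape, not the actual FileScanRDD source): the statistics-based update already reflects what was read, and the file-size update adds the whole file length on top when the iterator is closed.

// A hedged sketch, not Spark's real code: SketchedFileIterator, its fields, and the
// callback parameter are illustrative stand-ins for FileScanRDD's internals.
class SketchedFileIterator(fileLength: Long, bytesReadFromFsStats: () => Long) {
  private var bytesRead = 0L

  private def updateBytesRead(): Unit = {
    // Accurate: Hadoop FileSystem statistics report what was actually read so far.
    bytesRead = bytesReadFromFsStats()
  }

  private def updateBytesReadWithFileSize(): Unit = {
    // Overcounts: adds the full file length even if only a few rows were read (e.g. LIMIT 1).
    bytesRead += fileLength
  }

  def close(): Unit = {
    updateBytesRead()
    updateBytesReadWithFileSize() // the call this PR removes
  }

  def currentBytesRead: Long = bytesRead
}

For a large file scanned with LIMIT 1 that actually reads only a few KB, the old close() reports roughly the full file length plus those few KB instead of just the few KB.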

How was this patch tested?

manual test

@dujunling
Author

@wzhfy

@wzhfy
Contributor

wzhfy commented Aug 25, 2018

ok to test

@wzhfy
Contributor

wzhfy commented Aug 25, 2018

this seems to be caused by removing support for Hadoop 2.5 and earlier? cc original authors @cloud-fan @srowen to make sure

@wzhfy
Contributor

wzhfy commented Aug 25, 2018

test this please


override def close(): Unit = {
  updateBytesRead()
  updateBytesReadWithFileSize()
Member

If we just remove this updateBytesReadWithFileSize, does that solve the issue in the description? Do we need to remove the updateBytesReadWithFileSize call on line 142, too?

Author

Yes. Before SPARK-19464, only one of updateBytesRead and updateBytesReadWithFileSize took effect: with Hadoop 2.5 or earlier, updateBytesReadWithFileSize was used; with Hadoop 2.6 or later, updateBytesRead was used.
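
For reference, a rough sketch of how that old gating presumably looked (an assumed shape, not the exact pre-SPARK-19464 source): the FileSystem-statistics callback was only available on Hadoop 2.6+, and both methods checked it, so exactly one of them ever updated the metric.

// Illustrative only; names and signatures are assumptions, not Spark's actual code.
class OldMetricsUpdater(
    getBytesReadCallback: Option[() => Long], // assumed: None on Hadoop 2.5 and earlier
    currentFileLength: Long) {

  private var bytesRead = 0L

  // Used when the Hadoop FileSystem statistics callback is available (Hadoop 2.6+).
  def updateBytesRead(): Unit =
    getBytesReadCallback.foreach(cb => bytesRead = cb())

  // Fallback used only when the callback is unavailable (Hadoop 2.5 and earlier).
  def updateBytesReadWithFileSize(): Unit =
    if (getBytesReadCallback.isEmpty) bytesRead += currentFileLength

  def current: Long = bytesRead
}

Once SPARK-19464 dropped Hadoop 2.5 support, the callback is always present, so a guard of this kind no longer applies and both updates run on close(), which is the double counting this PR fixes.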

Author

When there is more than one file in the partition, the inputMetrics are also wrong if the updateBytesReadWithFileSize call on line 142 remains.

Member

aha, I see.

@maropu
Member

maropu commented Aug 25, 2018

Is it difficult to add a test that checks the metric for the case `select * from t limit 1`?

@maropu
Member

maropu commented Aug 25, 2018

btw, can you clean up the title and the description..?

@dujunling
Author

The metrics suites are in the core module's tests, while FileScanRDD is in the sql module, so it is difficult to add a test there that checks the input metrics.

@maropu
Member

maropu commented Aug 25, 2018

I'm not sure we can test the case though, for example, how about the sequence below?


import org.apache.spark.TaskContext
spark.range(10).selectExpr("id AS c0", "rand() AS c1").write.parquet("/tmp/t1")
val df = spark.read.parquet("/tmp/t1")

// Navigate two levels down the executed plan to reach the file scan and get its RDD
val fileScanRdd = df.repartition(1).queryExecution.executedPlan.children(0).children(0).execute()

fileScanRdd.mapPartitions { part =>
  println(s"Initial bytesRead=${TaskContext.get.taskMetrics().inputMetrics.bytesRead}")

  TaskContext.get.addTaskCompletionListener[Unit] { taskCtx =>
    // Check if the metric is correct?
    println(s"Total bytesRead=${TaskContext.get.taskMetrics().inputMetrics.bytesRead}")
  }
  part
}.collect

@srowen
Member

srowen commented Aug 25, 2018

It's OK to assume Hadoop 2.6+ only. In fact 2.6 is quite old anyway.

dujunling added 2 commits August 27, 2018 11:26
@dujunling
Author

@maropu I have added a unit test to check the inputMetrics.

spark.sparkContext.listenerBus.waitUntilEmpty(500)
spark.sparkContext.removeSparkListener(bytesReadListener)

assert(bytesReads.sum < 3000)
Member

The data above could be made deterministic so that you can assert the bytes read more exactly. I wonder if it's important to make sure the bytes read are exact, rather than just close, given that the change above would change the metric only a little I think.

You can just track the sum rather than all values written, but it doesn't matter.

test("[SPARK-25237] remove updateBytesReadWithFileSize in FileScanRdd") {
withTempPath { p =>
val path = p.getAbsolutePath
spark.range(1000).selectExpr("id AS c0", "rand() AS c1").repartition(10).write.csv(path)
Member

I think a single partition is ok for this test.
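
For context, here is a self-contained sketch of what such a test can look like (an assumed shape that mirrors the fragments quoted above; the merged test in FileBasedDataSourceSuite may differ in details): a SparkListener collects bytesRead from each finished task, a LIMIT 1 read runs over a small CSV file, and the total is asserted to stay well below the file size.

import scala.collection.mutable

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch only: `spark`, `withTempPath`, and the 3000-byte threshold follow the snippets above.
test("SPARK-25237: input metrics stay small for a LIMIT 1 scan") {
  withTempPath { dir =>
    val path = dir.getAbsolutePath
    // Deterministic data (no rand()) keeps the expected byte count stable across runs.
    spark.range(1000).selectExpr("id AS c0", "id % 3 AS c1").repartition(1).write.csv(path)

    val bytesReads = mutable.ArrayBuffer[Long]()
    val bytesReadListener = new SparkListener() {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
      }
    }
    spark.sparkContext.addSparkListener(bytesReadListener)
    try {
      spark.read.csv(path).limit(1).collect()
      spark.sparkContext.listenerBus.waitUntilEmpty(500)
      // Without the fix, close() adds the whole file length and the sum blows past this bound.
      assert(bytesReads.sum < 3000)
    } finally {
      spark.sparkContext.removeSparkListener(bytesReadListener)
    }
  }
}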

@wzhfy
Contributor

wzhfy commented Aug 31, 2018

ok to test

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95508 has finished for PR 22232 at commit 1c32646.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Sep 3, 2018

@dujunling I personally think this can be merged, but only if the test is tightened up

@maropu
Member

maropu commented Sep 3, 2018

@srowen I could take this over or do follow-up if the author is still inactive.

@srowen
Member

srowen commented Sep 3, 2018

I think that's fine @maropu ; we can always apportion credit appropriately later.

asfgit pushed a commit that referenced this pull request Sep 7, 2018
## What changes were proposed in this pull request?
This PR removed the method `updateBytesReadWithFileSize` in `FileScanRDD`: it computed input metrics from the file size, which was only needed for Hadoop 2.5 and earlier. Current Spark no longer supports those versions, so the method caused wrong input metric numbers.

This is rework from #22232.

Closes #22232

## How was this patch tested?
Added tests in `FileBasedDataSourceSuite`.

Closes #22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit ed249db)
Signed-off-by: Sean Owen <[email protected]>
asfgit closed this in ed249db Sep 7, 2018
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023