[SPARK-25237][SQL]remove updateBytesReadWithFileSize because we use Hadoop FileSystem s… #22232
Conversation
…tatistics to update the inputMetrics
ok to test

this seems to be caused by removing support for Hadoop 2.5 and earlier? cc original authors @cloud-fan @srowen to make sure

test this please
override def close(): Unit = {
  updateBytesRead()
  updateBytesReadWithFileSize()
If we just remove this updateBytesReadWithFileSize call, does that solve the issue in the description? Do we also need to remove updateBytesReadWithFileSize on line 142?
Yes. Before SPARK-19464, only one of updateBytesRead and updateBytesReadWithFileSize took effect: with Hadoop 2.5 or earlier, updateBytesReadWithFileSize was used; with Hadoop 2.6 or later, updateBytesRead was used.
When there is more than one file in the partition, the inputMetrics are wrong as long as the updateBytesReadWithFileSize call on line 142 exists.
aha, I see.
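For illustration, here is a minimal, self-contained sketch of the double counting discussed above. Only the method names `updateBytesRead` and `updateBytesReadWithFileSize` come from this PR; everything else is simplified for illustration and is not the actual FileScanRDD source.

```scala
// Minimal sketch: the statistics-based update already covers the bytes read,
// and the file-size update then adds each file's length on top of it.
object InputMetricsSketch {
  private var bytesRead = 0L

  // Hadoop 2.6+ path: the value comes from Hadoop FileSystem statistics.
  private def updateBytesRead(fromFsStatistics: Long): Unit =
    bytesRead += fromFsStatistics

  // Old fallback for Hadoop 2.5 and earlier: add the whole file length.
  // Before SPARK-19464 only one of the two paths ever fired; afterwards this
  // one also ran unconditionally when a file was closed.
  private def updateBytesReadWithFileSize(fileLength: Long): Unit =
    bytesRead += fileLength

  def main(args: Array[String]): Unit = {
    // Two 100-byte files in one partition, fully read:
    updateBytesRead(200L)               // correct value from FS statistics
    updateBytesReadWithFileSize(100L)   // closing file 1 adds its length again
    updateBytesReadWithFileSize(100L)   // closing file 2 does the same
    println(bytesRead)                  // 400 instead of 200
  }
}
```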
Is it difficult to add tests for checking the metric in this case?

btw, can you clean up the title and the description?

Since the metrics suites are in the core tests while FileScanRDD lives in the sql module, it is difficult to add tests that check the input metrics in the sql module.

I'm not sure we can test this case, though; for example, how about the sequence below?

It's OK to assume Hadoop 2.6+ only. In fact, 2.6 is quite old anyway.
@maropu I have added a unit test to check the inputMetrics.
spark.sparkContext.listenerBus.waitUntilEmpty(500)
spark.sparkContext.removeSparkListener(bytesReadListener)

assert(bytesReads.sum < 3000)
The data above could be made deterministic so that you can assert the bytes read more exactly. I wonder if it's important to make sure the bytes read are exact rather than just close, given that the change above would, I think, only change the metric a little.
You can just track the sum rather than all values written, but it doesn't matter.
test("[SPARK-25237] remove updateBytesReadWithFileSize in FileScanRdd") {
  withTempPath { p =>
    val path = p.getAbsolutePath
    spark.range(1000).selectExpr("id AS c0", "rand() AS c1").repartition(10).write.csv(path)
I think a single partition is ok for this test.
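Putting the review comments together, here is a rough sketch of how the tightened-up test might look (single partition, deterministic data, tracking only the sum). The listener wiring is inferred from the snippets above; the test that was eventually merged into `FileBasedDataSourceSuite` may differ in its details.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

test("[SPARK-25237] remove updateBytesReadWithFileSize in FileScanRdd") {
  withTempPath { p =>
    val path = p.getAbsolutePath
    // Deterministic data in a single small file, so the expected size is stable.
    spark.range(100).selectExpr("id AS c0").repartition(1).write.csv(path)

    var totalBytesRead = 0L
    val bytesReadListener = new SparkListener() {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        totalBytesRead += taskEnd.taskMetrics.inputMetrics.bytesRead
      }
    }
    spark.sparkContext.addSparkListener(bytesReadListener)
    try {
      spark.read.csv(path).limit(1).collect()
      spark.sparkContext.listenerBus.waitUntilEmpty(500)
      // Without the fix, close() adds the whole file length on top of the bytes
      // actually read, inflating the metric. With fully deterministic data the
      // loose upper bound below could be tightened to an exact expected size.
      assert(totalBytesRead < 3000)
    } finally {
      spark.sparkContext.removeSparkListener(bytesReadListener)
    }
  }
}
```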
ok to test

Test build #95508 has finished for PR 22232 at commit

@dujunling I personally think this can be merged, but only if the test is tightened up.

@srowen I could take this over or do follow-up if the author is still inactive.

I think that's fine @maropu; we can always apportion credit appropriately later.
## What changes were proposed in this pull request?

This pr removed the method `updateBytesReadWithFileSize` in `FileScanRDD` because it computes input metrics by file size, which was only needed for Hadoop 2.5 and earlier. The current Spark does not support those versions, so it causes wrong input metric numbers. This is rework from #22232. Closes #22232

## How was this patch tested?

Added tests in `FileBasedDataSourceSuite`.

Closes #22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit ed249db)
Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
In FileScanRDD, we update inputMetrics's bytesRead using updateBytesRead every 1000 rows, or when the iterator is closed.
But when the iterator is closed, we also invoke updateBytesReadWithFileSize, which increases inputMetrics's bytesRead by the file's length.
This results in a wrong bytesRead value when running a query with a limit, such as `select * from table limit 1`.
Because we no longer support Hadoop 2.5 and earlier, we always get bytesRead from Hadoop FileSystem statistics rather than from the file's length.
How was this patch tested?
manual test
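To make the cadence in the description concrete, here is a simplified, self-contained sketch of an iterator that refreshes bytesRead every 1000 records and once more on close. The structure and the `readSoFarFromFs` callback are assumptions for illustration, not the actual FileScanRDD code.

```scala
// Simplified sketch of the metric cadence described above; `readSoFarFromFs`
// stands in for the Hadoop FileSystem statistics callback.
class MetricsIterator[T](underlying: Iterator[T], readSoFarFromFs: () => Long)
  extends Iterator[T] {

  private var recordsRead = 0L
  var bytesRead = 0L

  // Statistics-based update, as on Hadoop 2.6+.
  private def updateBytesRead(): Unit = bytesRead = readSoFarFromFs()

  override def hasNext: Boolean = underlying.hasNext

  override def next(): T = {
    val element = underlying.next()
    recordsRead += 1
    if (recordsRead % 1000 == 0) updateBytesRead()  // periodic refresh
    element
  }

  def close(): Unit = {
    updateBytesRead()  // final, statistics-based value
    // updateBytesReadWithFileSize() used to be called here as well; it added the
    // whole file length on top, so e.g. `select * from table limit 1` reported
    // far more bytes than were actually read. This PR removes that call.
  }
}
```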