
Conversation

@alismess-db (Contributor) commented May 15, 2020

What changes were proposed in this pull request?

This is a recreation of #28155, which was reverted due to causing test failures.

The update methods in HiveThriftServer2Listener now check whether the operation or session ID passed as a parameter actually exists in the executionList and sessionList, respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log.
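
The guard can be illustrated with a minimal Java sketch (the class, method, and field names here are hypothetical stand-ins, not Spark's actual Scala listener API): look the entry up first, and log a warning instead of dereferencing a missing one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lookup-before-use pattern described above.
public class SessionRegistry {
    private final Map<String, Integer> sessionExecutions = new ConcurrentHashMap<>();

    public void open(String sessionId) {
        sessionExecutions.put(sessionId, 0);
    }

    public void onOperationStart(String sessionId) {
        Integer count = sessionExecutions.get(sessionId);
        if (count == null) {
            // Previously an unconditional dereference here would have thrown an NPE.
            System.err.println("onOperationStart called with unknown session id: " + sessionId);
            return;
        }
        sessionExecutions.put(sessionId, count + 1);
    }

    public Integer executions(String sessionId) {
        return sessionExecutions.get(sessionId);
    }
}
```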

To improve robustness, we also make the following changes in HiveSessionImpl.close():

  • Catch any exception thrown by operationManager.closeOperation, so that a failure to close one operation does not prevent the remaining operations from being closed.
  • Handle the scratch directory being inaccessible. On close, all .pipeout files are removed from the scratch directory, which would have resulted in an NPE if the directory does not exist.
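
The per-operation exception handling in the first bullet can be sketched as follows. This is a simplified, hypothetical stand-in (the real HiveSessionImpl iterates operation handles and uses a logger, not System.err): a failure to close one operation must not stop the loop.

```java
import java.util.List;

// Sketch: close every operation, counting (and logging) failures instead of aborting.
public class SafeCloser {
    public static int closeAll(List<Runnable> closers) {
        int failures = 0;
        for (Runnable close : closers) {
            try {
                close.run();
            } catch (Exception e) {
                failures++;
                System.err.println("Failed to close operation: " + e.getMessage());
            }
        }
        return failures;
    }
}
```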

Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this changes the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown while closing one operation, the subsequent operations are not closed.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

…2Listener

### What changes were proposed in this pull request?

The update methods in HiveThriftServer2Listener now check whether the operation or session ID passed as a parameter actually exists in the `executionList` and `sessionList`, respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log.

Also, in HiveSessionImpl.close(), we catch any exception thrown by `operationManager.closeOperation`, so that a failure to close one operation does not prevent the remaining operations from being closed.

### Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this disrupts the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown while closing one operation, the subsequent operations are not closed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes apache#28155 from alismess-db/hive-thriftserver-listener-update-safer.

Authored-by: Ali Smesseim <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 6994c64)
Signed-off-by: Ali Smesseim <[email protected]>
@alismess-db (Contributor, Author):

@juliuszsompolski

@alismess-db alismess-db changed the title [SPARK-31387] Handle unknown operation/session ID in HiveThriftServer2Listener [SPARK-31387][CHERRY-PICK] Handle unknown operation/session ID in HiveThriftServer2Listener May 15, 2020
@juliuszsompolski (Contributor):

ok to test

@alismess-db alismess-db changed the title [SPARK-31387][CHERRY-PICK] Handle unknown operation/session ID in HiveThriftServer2Listener [SPARK-31387][test-maven] Handle unknown operation/session ID in HiveThriftServer2Listener May 15, 2020
@hvanhovell (Contributor):

ok to test

@hvanhovell (Contributor):

add to whitelist

@SparkQA commented May 15, 2020

Test build #122680 has finished for PR 28544 at commit d5c946b.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Retest this please

@SparkQA commented May 18, 2020

Test build #122763 has finished for PR 28544 at commit d5c946b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
private def onSessionClosed(e: SparkListenerThriftServerSessionClosed): Unit =
Option(sessionList.get(e.sessionId)) match {
case None => logWarning(s"onSessionClosed called with unknown session id: ${e.sessionId}")
Member:

Could you move the None case after the Some case?

}
private def onOperationStart(e: SparkListenerThriftServerOperationStart): Unit =
Option(sessionList.get(e.sessionId)) match {
case None => logWarning(s"onOperationStart called with unknown session id: ${e.sessionId}")
Member:

Can we keep processing queries even in this case?

Contributor (Author):

`sessionList.get(e.sessionId).totalExecution += 1` would result in an NPE if `sessionList` did not have the `sessionId`, so it was not previously possible to continue processing.

Contributor:

@maropu this code is in a listener that executes on the listener bus. The listener bus can drop events if it becomes too busy.
So it might happen that events are dropped and an ID is unavailable. It will not affect query execution.

Member:

> @maropu this code is in a listener that executes on the listener bus. The listener bus can drop events if it becomes too busy.
> So it might happen that events are dropped and an ID is unavailable. It will not affect query execution.

Does the server get back to a normal state by just ignoring this case? What worries me a bit is that users could get the same warning message repeatedly if a session ID gets lost (I'm not sure whether that can happen in event sequences).

Contributor:

@maropu the server does get to a normal state. This is used only for displaying the Spark UI, so it wouldn't affect the server's functioning.
Before this change, a NullPointerException would be thrown on `sessionList.get(e.sessionId).totalExecution += 1`, so I think a WARN message is strictly better than an NPE in the logs.
But good catch that it could be further improved: on a missing session, don't fail; just log the query with that sessionId and skip the `sessionList.get(e.sessionId).totalExecution += 1` update.
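
The refinement suggested here can be sketched in Java (hypothetical names, not Spark's actual listener types): record the operation even when its session is unknown, and only skip the per-session counter update.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: track the operation for display regardless of whether the session is known.
public class OperationTracker {
    private final Map<String, Integer> sessionExecutions = new HashMap<>();
    private final List<String> operations = new ArrayList<>();

    public void openSession(String sessionId) {
        sessionExecutions.put(sessionId, 0);
    }

    public void onOperationStart(String operationId, String sessionId) {
        operations.add(operationId);  // always record the operation
        Integer count = sessionExecutions.get(sessionId);
        if (count != null) {
            sessionExecutions.put(sessionId, count + 1);
        } else {
            // Unknown session: warn and skip the counter, but keep the operation.
            System.err.println("Unknown session id: " + sessionId + "; counter not updated");
        }
    }

    public int operationCount() { return operations.size(); }
    public Integer executions(String sessionId) { return sessionExecutions.get(sessionId); }
}
```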

Member:

Ah, I see. Thanks for the explanation. Looks okay.

@maropu maropu changed the title [SPARK-31387][test-maven] Handle unknown operation/session ID in HiveThriftServer2Listener [SPARK-31387][SQL][test-maven] Handle unknown operation/session ID in HiveThriftServer2Listener May 18, 2020
@juliuszsompolski (Contributor):

@dongjoon-hyun this PR passes "SPARK-29604 external listeners should be initialized with Spark classloader" in https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122763/testReport that was failing before in #28155. I have no idea how this change could have been related to that failure...

20/05/17 20:19:47 INFO SparkSQLEnvSuite: 

===== TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.SparkSQLEnvSuite: 'SPARK-29604 external listeners should be initialized with Spark classloader' =====

SparkSQLEnvSuite:
20/05/17 20:19:47 INFO SparkContext: Running Spark version 3.1.0-SNAPSHOT
20/05/17 20:19:47 INFO ResourceUtils: ==============================================================
20/05/17 20:19:47 INFO ResourceUtils: No custom resources configured for spark.driver.
20/05/17 20:19:47 INFO ResourceUtils: ==============================================================
20/05/17 20:19:47 INFO SparkContext: Submitted application: testApp
20/05/17 20:19:47 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
20/05/17 20:19:47 INFO ResourceProfile: Limiting resource is cpu
20/05/17 20:19:47 INFO ResourceProfileManager: Added ResourceProfile id: 0
20/05/17 20:19:47 INFO SecurityManager: Changing view acls to: jenkins
20/05/17 20:19:47 INFO SecurityManager: Changing modify acls to: jenkins
20/05/17 20:19:47 INFO SecurityManager: Changing view acls groups to: 
20/05/17 20:19:47 INFO SecurityManager: Changing modify acls groups to: 
20/05/17 20:19:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jenkins); groups with view permissions: Set(); users  with modify permissions: Set(jenkins); groups with modify permissions: Set()
20/05/17 20:19:47 INFO Utils: Successfully started service 'sparkDriver' on port 37165.
20/05/17 20:19:47 INFO SparkEnv: Registering MapOutputTracker
20/05/17 20:19:47 INFO SparkEnv: Registering BlockManagerMaster
20/05/17 20:19:47 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/05/17 20:19:47 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/05/17 20:19:47 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/05/17 20:19:47 INFO DiskBlockManager: Created local directory at /home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/target/tmp/blockmgr-1d9a2431-0019-41bb-9fbe-c8ca6dd6bc1f
20/05/17 20:19:47 INFO MemoryStore: MemoryStore started with capacity 2.1 GiB
20/05/17 20:19:47 INFO SparkEnv: Registering OutputCommitCoordinator
20/05/17 20:19:47 INFO Executor: Starting executor ID driver on host amp-jenkins-worker-03.amp
20/05/17 20:19:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40300.
20/05/17 20:19:47 INFO NettyBlockTransferService: Server created on amp-jenkins-worker-03.amp:40300
20/05/17 20:19:47 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/05/17 20:19:47 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, amp-jenkins-worker-03.amp, 40300, None)
20/05/17 20:19:47 INFO BlockManagerMasterEndpoint: Registering block manager amp-jenkins-worker-03.amp:40300 with 2.1 GiB RAM, BlockManagerId(driver, amp-jenkins-worker-03.amp, 40300, None)
20/05/17 20:19:47 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, amp-jenkins-worker-03.amp, 40300, None)
20/05/17 20:19:47 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, amp-jenkins-worker-03.amp, 40300, None)
20/05/17 20:19:47 INFO SharedState: loading hive config file: file:/home/jenkins/workspace/SparkPullRequestBuilder/sql/hive/target/scala-2.12/test-classes/hive-site.xml
20/05/17 20:19:47 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('/home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/target/tmp/warehouse-49d39f38-dc7f-4a52-93cd-782c0d1231cd').
20/05/17 20:19:47 INFO SharedState: Warehouse path is '/home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/target/tmp/warehouse-49d39f38-dc7f-4a52-93cd-782c0d1231cd'.
20/05/17 20:19:47 WARN SharedState: Not allowing to set spark.sql.warehouse.dir or hive.metastore.warehouse.dir in SparkSession's options, it should be set statically for cross-session usages
20/05/17 20:19:47 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.7 using maven.
20/05/17 20:19:53 INFO IsolatedClientLoader: Downloaded metastore jars to /home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/target/tmp/hive-v2_3-a7e1202d-698a-456d-8609-2859a43dfba9
20/05/17 20:19:54 INFO HiveConf: Found configuration file null
20/05/17 20:19:54 INFO SessionState: Created HDFS directory: /tmp/hive/jenkins/eb4b5224-dc13-48e3-8b2b-e1fa57f042e0
20/05/17 20:19:54 INFO SessionState: Created local directory: /home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/target/tmp/jenkins/eb4b5224-dc13-48e3-8b2b-e1fa57f042e0
20/05/17 20:19:54 INFO SessionState: Created HDFS directory: /tmp/hive/jenkins/eb4b5224-dc13-48e3-8b2b-e1fa57f042e0/_tmp_space.db
20/05/17 20:19:54 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.7) is /home/jenkins/workspace/SparkPullRequestBuilder/sql/hive-thriftserver/target/tmp/warehouse-49d39f38-dc7f-4a52-93cd-782c0d1231cd
20/05/17 20:19:55 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/05/17 20:19:55 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/05/17 20:19:55 INFO HiveMetaStore: 0: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
20/05/17 20:19:55 INFO ObjectStore: ObjectStore, initialize called
20/05/17 20:19:55 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
20/05/17 20:19:55 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
20/05/17 20:19:57 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
20/05/17 20:19:59 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
20/05/17 20:20:00 INFO ObjectStore: Initialized ObjectStore
20/05/17 20:20:00 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
20/05/17 20:20:00 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
20/05/17 20:20:00 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
20/05/17 20:20:00 INFO HiveMetaStore: Added admin role in metastore
20/05/17 20:20:00 INFO HiveMetaStore: Added public role in metastore
20/05/17 20:20:00 INFO HiveMetaStore: No user is added in admin role, since config is empty
20/05/17 20:20:00 INFO HiveMetaStore: 0: get_all_functions
20/05/17 20:20:00 INFO audit: ugi=jenkins       ip=unknown-ip-addr      cmd=get_all_functions   
20/05/17 20:20:00 INFO HiveMetaStore: 0: get_database: default
20/05/17 20:20:00 INFO audit: ugi=jenkins       ip=unknown-ip-addr      cmd=get_database: default       
20/05/17 20:20:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/05/17 20:20:00 INFO StreamingQueryManager: Registered listener test.custom.listener.DummyStreamingQueryListener
20/05/17 20:20:00 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/05/17 20:20:00 INFO MemoryStore: MemoryStore cleared
20/05/17 20:20:00 INFO BlockManager: BlockManager stopped
20/05/17 20:20:00 INFO BlockManagerMaster: BlockManagerMaster stopped
20/05/17 20:20:00 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/05/17 20:20:00 INFO SparkContext: Successfully stopped SparkContext
20/05/17 20:20:00 INFO SparkSQLEnvSuite: 

===== FINISHED o.a.s.sql.hive.thriftserver.SparkSQLEnvSuite: 'SPARK-29604 external listeners should be initialized with Spark classloader' =====

- SPARK-29604 external listeners should be initialized with Spark classloader

@SparkQA commented May 18, 2020

Test build #122800 has finished for PR 28544 at commit aa82589.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@juliuszsompolski (Contributor):

Retest this please

@SparkQA commented May 18, 2020

Test build #122815 has finished for PR 28544 at commit aa82589.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

It seems to have finally recovered. I'll trigger once more.

@dongjoon-hyun (Member):

Retest this please.


test("SPARK-31387 - listener update methods should not throw exception with unknown input") {
val (statusStore: HiveThriftServer2AppStatusStore,
listener: HiveThriftServer2Listener) = createAppStatusStore(true)
Member:

indentation?

try {
    FileUtils.forceDelete(file);
} catch (Exception e) {
    LOG.error("Failed to cleanup pipeout file: " + file, e);
}
Member:

This is new from the original commit. If you don't mind, could you add some explanation about this into the PR description, @alismess-db ?

Contributor:

We found it failing with a NullPointerException in some of the testing environments that we use. It was related to how HiveSessionImplSuite mocked the session it created. We found it a harmless additional check to guard against an NPE when searching for these pipeout files.
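
The NPE being guarded against here comes from a standard Java pitfall: `File.listFiles()` returns null, not an empty array, when the directory does not exist or cannot be read. A minimal sketch of the null-safe scan (names are illustrative, not Hive's actual code):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: find a session's .pipeout files without NPE-ing on a missing scratch directory.
public class PipeoutCleaner {
    public static List<File> findPipeoutFiles(File scratchDir, String sessionId) {
        File[] files = scratchDir.listFiles(
            (dir, name) -> name.startsWith(sessionId) && name.endsWith(".pipeout"));
        if (files == null) {
            // Directory missing or unreadable: nothing to clean up.
            return new ArrayList<>();
        }
        List<File> result = new ArrayList<>();
        Collections.addAll(result, files);
        return result;
    }
}
```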

Member:

Thank you for the explanation, @juliuszsompolski .

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31387][SQL][test-maven] Handle unknown operation/session ID in HiveThriftServer2Listener [SPARK-31387][SQL][test-maven][test-hive1.2] Handle unknown operation/session ID in HiveThriftServer2Listener May 19, 2020
@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented May 20, 2020

Test build #122861 has finished for PR 28544 at commit aa82589.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 20, 2020

Test build #122858 has finished for PR 28544 at commit aa82589.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 20, 2020

Test build #122892 has finished for PR 28544 at commit ac3a881.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-31387][SQL][test-maven][test-hive1.2] Handle unknown operation/session ID in HiveThriftServer2Listener [SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener May 20, 2020
@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you, all!
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request May 20, 2020
…erver2Listener

### What changes were proposed in this pull request?

This is a recreation of #28155, which was reverted due to causing test failures.

The update methods in HiveThriftServer2Listener now check whether the operation or session ID passed as a parameter actually exists in the `executionList` and `sessionList`, respectively. This prevents NullPointerExceptions if the operation or session ID is unknown. Instead, a warning is written to the log.

To improve robustness, we also make the following changes in HiveSessionImpl.close():

- Catch any exception thrown by `operationManager.closeOperation`, so that a failure to close one operation does not prevent the remaining operations from being closed.
- Handle the scratch directory being inaccessible. On close, all `.pipeout` files are removed from the scratch directory, which would have resulted in an NPE if the directory does not exist.

### Why are the changes needed?

The listener's update methods would throw an exception if the operation or session ID is unknown. In Spark 2, where the listener is called directly, this changes the caller's control flow. In Spark 3, the exception is caught by the ListenerBus but results in an uninformative NullPointerException.

In HiveSessionImpl.close(), if an exception is thrown while closing one operation, the subsequent operations are not closed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests

Closes #28544 from alismess-db/hive-thriftserver-listener-update-safer-2.

Authored-by: Ali Smesseim <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d40ecfa)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun (Member):

Ur, @juliuszsompolski . In #28544 (comment), did you mean this PR is still failing in some environment?

> We found it failing with a NullPointerException in some of the testing environments that we use. It was related to how HiveSessionImplSuite mocked the session it created. We found it a harmless additional check to guard against an NPE when searching for these pipeout files.

@dongjoon-hyun (Member):

It seems that this causes consistent failures in SBT in branch-3.0 with Hive 2.7.

This is the opposite situation. Sorry again, my bad. I had wrongly assumed that this PR had no problem in SBT because it passed SBT before. For now, the new failure is only in branch-3.0. To stabilize branch-3.0, I'll revert this from branch-3.0 only.

@juliuszsompolski (Contributor):

> Ur, @juliuszsompolski . In #28544 (comment), did you mean this PR is still failing in some environment?

No, with this PR we didn't see any failures. With the previous PR #28155, we experienced an NPE on these pipeout files in some environments specific to Databricks. We added this null check for pipeout files as an additional safety check there.

It runs for me, and in all testing envs at Databricks.

$ git checkout 5198b6853b2e3bc69fc013c653aa163c79168366
HEAD is now at 5198b6853b2 [SPARK-31387][SQL] Handle unknown operation/session ID in HiveThriftServer2Listener
$ build/sbt -Phive2.3 -Phadoop2.7 -Phive-thriftserver
> project hive-thriftserver
[info] Set current project to spark-hive-thriftserver (in build file:/home/julek/dev/spark3/)
> test-only *HiveSessionImplSuite
[info] HiveSessionImplSuite:
[info] - SPARK-31387 - session.close() closes all sessions regardless of thrown exceptions (2 seconds, 787 milliseconds)
[info] ScalaTest
[info] Run completed in 7 seconds, 164 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

If the change causes trouble in branch-3.0, let it stay in master only...
