Skip to content

Conversation

@jkotalik
Copy link
Contributor

Fixes #24485

Today, IIS will get into an unrecoverable state if the startup timeout limit is hit, until someone redeploys the site. This usually isn't a concern, however many people have multiple w3wp sites starting up at the same time, which occasionally causes w3wp process to timeout on the startup time limit, especially if there is a lot going on before startup.

This change makes it so instead of being in an unrecoverable state, ANCM will queue to recycle the worker process instead.

There are some drawbacks that should be noted with this change. By now restarting the process, more resources will be consumed by IIS. I think this drawback is fine for normal scenarios where the app just failed to start, restarting the app is preferable here as a restart most likely will fix the timeout hit.

Let's think of the worst case scenario. Let's say Program.Main has a Thread.Sleep, causing the process to always fail to start. If the process fails to start, we recycle the worker process. From what I can tell, the rapidFailureProtectionModule will not trigger as we are enqueuing for recycling the worker process rather than the worker process crashing (need to confirm this, but #25163 is blocking validation). So, if the startup time limit is set to 1 second, I'm concerned about IIS constantly needing to recycle the process. I'll chat with some IIS folk about that as well.

Other options besides this are to either increase the startup timeout, which requires a schema update (which are hard to deploy) and/or shim changes, or disabling the limit with in-process, which is concerning because people may want to use the limit itself.

@jkotalik jkotalik added this to the 5.0.0-rc2 milestone Aug 27, 2020
@jkotalik jkotalik requested a review from halter73 as a code owner August 27, 2020 18:49
@Tratcher
Copy link
Member

From what I can tell, the rapidFailureProtectionModule will not trigger as we are enqueuing for recycling the worker process rather than the worker process crashing (need to confirm this, but #25163 is blocking validation). So, if the startup time limit is set to 1 second, I'm concerned about IIS constantly needing to recycle the process.

This is really concerning because you don't want a site stuck in an infinite restart loop. That would be especially bad in a cloud environment where you're paying for cpu time.

How bad would it be to forcibly crash so that the rapid failure protection would kick in?

@jkotalik
Copy link
Contributor Author

jkotalik commented Aug 27, 2020

This is really concerning because you don't want a site stuck in an infinite restart loop. That would be especially bad in a cloud environment where you're paying for cpu time.

I disagree with this being "really concerning" considering the tradeoff is you won't have a site in an unrecoverable state where there isn't a clear work around besides redeploying. It's a tradeoff we need to evaluate, and I currently think the pros outweigh the cons.

How bad would it be to forcibly crash so that the rapid failure protection would kick in?

I'll get back to you on that.

@jkotalik
Copy link
Contributor Author

I pinged people on app services and IIS, waiting for responses now 😄

@jkotalik
Copy link
Contributor Author

Chatted with @Tratcher, fine with change but let's add an opt-out flag to ANCM to add a work around if people hit issues.

@jkotalik jkotalik force-pushed the jkotalik/startupTimeLimit branch from 76ecacc to bc058fd Compare September 1, 2020 16:52
@jkotalik jkotalik changed the base branch from release/5.0 to release/5.0-rc2 September 1, 2020 16:52
@jkotalik
Copy link
Contributor Author

jkotalik commented Sep 1, 2020

@Tratcher updated. Let me know if you like the name of the config section.

@jkotalik jkotalik added the ask-mode This issue / PR is a patch candidate which we will bar-check internally before patching it. label Sep 1, 2020
@jkotalik
Copy link
Contributor Author

jkotalik commented Sep 1, 2020

@Pilchie please merge when green 😄

@Pilchie Pilchie added the Servicing-approved Shiproom has approved the issue label Sep 1, 2020
@Pilchie
Copy link
Member

Pilchie commented Sep 1, 2020

Approved for RC2.

@Pilchie Pilchie merged commit ee71226 into release/5.0-rc2 Sep 1, 2020
@Pilchie Pilchie deleted the jkotalik/startupTimeLimit branch September 1, 2020 22:30
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions ask-mode This issue / PR is a patch candidate which we will bar-check internally before patching it. Servicing-approved Shiproom has approved the issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants