-
Notifications
You must be signed in to change notification settings - Fork 40
Add StartupOptions for time-based retries on startup #488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationOptions.cs
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/ConfigurationClientManager.cs
Outdated
Show resolved
Hide resolved
| { | ||
| private const int MaxRetries = 2; | ||
| private static readonly TimeSpan MaxRetryDelay = TimeSpan.FromMinutes(1); | ||
| internal static readonly TimeSpan MaxRetryDelay = TimeSpan.FromMinutes(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referenced it in StartupOptions to try and keep this value consistent for both files, but I could just use a constant there too. Also, not entirely sure about the MaxRetryDelay property being necessary anyway, so open to removing that/replacing it with others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to support a different MaxRetryDelay for startup vs refresh?
@jimmyca15
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we would.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like my second comment got swallowed. Although I expect we would use the same time, I don't think this should be shared in that manner. A separate field to reference where needed seems appropriate.
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
Is this only triggered during refresh? If so, it means the startup is already done (though initial configuration load had failed). We shouldn't use the startup logic here, because if anything fails here it will be tried again in the next refresh. #Closed Refers to: src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs:211 in c8ada35. [](commit_id = c8ada35, deletion_comment = False) |
| }, | ||
| cancellationToken) | ||
| .ConfigureAwait(false); | ||
| await Task.Delay(1000).ConfigureAwait(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this 1s a const? This is the minimum delay between retries. Not sure if we should have a maximal delay and slowly grow the delay between retries. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual server calls are already retried with some exponential backoff strategy. I dont see why we would need to slow down the overall startup retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless the concern is that this can be misused by customers if they set SDK retries to 0. If SDK is not retrying, we'll keep retrying every 1 second which is not good. So having this second layer of retries with exponential backoff might be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imagining there is only one client, we don't want to beat that client in a tight loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Amer will be reworking this a bit. I agree that we should backoff on the top level. I suggested a flow like below.
public async Task LoadAsync()
{
var cts = new CancellationTokenSource(_startupOptions.Timeout);
var errors = new List<Exception>();
int attempts = 0;
while (true)
{
attempts++;
IEnumerable<ConfigurationClient> clients = _clientManager.GetClients(ignoreBackoff: true);
if (await TryInitializeAsync(clients, errors, cts.Token))
{
break;
}
try
{
await Task.Delay(Timespan.FromSeconds(1).Backoff(attempts), cts.Token);
}
catch (OperationCanceledException)
{
throw new TimeoutException(
new AggregateException(errors));
}
}
}
//
// TryInitializeAsync
// It swallows (returns false for) operation canceled exception
// It swallows failoverable exceptions
// It propagates non-failoverable exceptions
// It fills the 'errors' list with swallowed exceptions
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/StartupOptions.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationOptions.cs
Outdated
Show resolved
Hide resolved
| /// <summary> | ||
| /// The maximum delay before timing out when loading data from Azure App Configuration on startup. | ||
| /// </summary> | ||
| public TimeSpan Timeout { get; set; } = TimeSpan.FromMinutes(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default timeout of one minute seems a bit low. I'd go with 10 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned that we may get complaints like Azure/AppConfiguration#255 all over again if we set the default startup timeout to anything more than a minute.
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/StartupOptions.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
| internal interface IConfigurationClientManager | ||
| { | ||
| IEnumerable<ConfigurationClient> GetAvailableClients(DateTimeOffset time); | ||
| IEnumerable<ConfigurationClient> GetAvailableClients(DateTimeOffset time, bool ignoreBackoff = false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't make sense to pass the time and the ignore backoff parameter.
Perhaps a separate method is warranted. GetClients, which returns all clients. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about GetAllClients() to better differentiate from GetAvailableClients?
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
|
pr example usage needs an update. |
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationOptions.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationSource.cs
Outdated
Show resolved
Hide resolved
| // | ||
| using Azure; | ||
| using Azure.Data.AppConfiguration; | ||
| using Microsoft.Extensions.Configuration.AzureAppConfiguration.Constants; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have a Constants namespace. it's not useful.
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
|
|
||
| if (exception.InnerExceptions?.Any(e => e is RequestFailedException) ?? false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (exception.InnerExceptions?.Any(e => e is RequestFailedException) ?? false) | |
| if (exception.InnerExceptions != null && | |
| exception.InnerExceptions.Any(e => e is RequestFailedException)) |
A suggestion, I feel the ?? false is a bit weird to read.
| { | ||
| if (startupStopwatch.Elapsed < FailOverConstants.StartupFixedBackoffDuration) | ||
| { | ||
| await Task.Delay(startupStopwatch.Elapsed.CalculateFixedStartupBackoffDuration(), cancellationToken).ConfigureAwait(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the timespan you are passing to the method has a different value than the one you have checked in the condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the problem, you could cache elapsed in a variable, but I'd recommend an approach like
TimeSpan delay;
if (startupStopwatch.TryGetFixedBackoff(out Timespan backoff))
{
delay = backoff;
}
else
{
postFixedWindowAttempts++;
delay = FailoverConstants.MaxFixedStartupBackoff.CalculateBackoff(
attempts: postFixedWindowAttempts,
maxDuration: FailOverConstants.MaxBackoffDuration
);
}
await Task.Delay(delay, ct);
a5f4c2a to
1fd54a0
Compare
|
|
||
| double offset = interval * rand.NextDouble(); | ||
|
|
||
| return TimeSpan.FromMilliseconds(lowest + offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a ratio of 0.5, do we expect a jitted time of 10s falls in [5, 15]? If so, above implementation is incorrect as it generates the range of [7.5, 12.5]. I hope something like below is easier to understand:
double jitter = ratio * (rand.NextDouble() * 2 - 1);
double interval = timeSpan.TotalMilliseconds * (1 + jitter);
return TimeSpan.FromMilliseconds(interval);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jittering a 10s duration by a ratio of .5 introduces a 5 second (10 x .5) window. So the expectation is [7.5, 12.5]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formula can be updated easily to achieve the desired results, but that's not intuitive to me. The 50% jitter doesn't give me 50% more (or less) than what I had. So if I want 100% more (or less) than what I had, I actually need a 200% jitter. Isn't that weird?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed separately with @zhenlan. Seems more people interpret the ratio here as the delta from the extreme end of the possible duration window to the original duration. In that case, we would benefit to change it.
So jitter(10, .5) yields [5, 15] rather than [7.5, 12.5]
|
|
||
| var startupExceptions = new List<Exception>(); | ||
|
|
||
| IEnumerable<ConfigurationClient> clients = _configClientManager.GetAllClients(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this into the loop so you get new clients when replica auto discovery is enabled.
| public static readonly TimeSpan MinBackoffDuration = TimeSpan.FromSeconds(30); | ||
| public static readonly TimeSpan MaxBackoffDuration = TimeSpan.FromMinutes(10); | ||
| public static readonly TimeSpan StartupFixedBackoffDuration = TimeSpan.FromMinutes(10); | ||
| public static readonly TimeSpan MaxFixedStartupBackoff = TimeSpan.FromSeconds(30); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MinStartupBackoffDuration with comments
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Outdated
Show resolved
Hide resolved
| calculatedMilliseconds = maxDuration.TotalMilliseconds; | ||
| } | ||
|
|
||
| TimeSpan maxBackoffDuration = CalculateMaxStartupBackoffDuration(startupTimeElapsed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the variable shouldn't even be needed.
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/Extensions/TimeSpanExtensions.cs
Outdated
Show resolved
Hide resolved
| calculatedMilliseconds = maxDuration.TotalMilliseconds; | ||
| } | ||
|
|
||
| TimeSpan maxBackoffDuration = CalculateMaxStartupBackoffDuration(startupTimeElapsed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The maxBackoffDuration variable shouldn't be needed. The max backoff in this case should always be our 10 min max backoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of the maxDuration parameter then? I don't see it used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I missed some things here. Zhenlan suggested just using CalculateBackoffDuration and adjusting that method to use jitter instead of the randomizing it does in the last line. At this point we're using the same backoff duration values, so unless there's a reason refresh is different from startup retries then I think it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single method will be better. That's your suggestion right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, just one method and removing the CalculateExponentialStartupBackoffDuration method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, let's do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just update that implementation with this one should be good right?
| return TimeSpan.FromMilliseconds(lowest + offset); | ||
| } | ||
|
|
||
| private static TimeSpan CalculateMaxStartupBackoffDuration(TimeSpan startupTimeElapsed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really only see a need for this logic in TryGetFixedBackoff, and in that case the default return isn't needed since TryGetFixedBackoff would return false. I recommend moving it in there.
| internal static class TimeSpanExtensions | ||
| { | ||
| private const int MaxAttempts = 63; | ||
| private const double JitterRatio = 0.5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe @jimmyca15 prefers this to be 0.25.
zhenlan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs
Show resolved
Hide resolved
| rfe.Status >= (int)HttpStatusCode.InternalServerError; | ||
| } | ||
|
|
||
| if (innerException is HttpRequestException hre && hre.InnerException != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems solid, but the multiassignment to innerException threw me off a bit. I think it would help if the assignment to inner exception only happened once because it's really behaving in a branched manner.
I'd suggest
if (rfe.Status == HttpStatusCodes.TooManyRequests ||
rfe.Status == (int)HttpStatusCode.RequestTimeout ||
rfe.Status >= (int)HttpStatusCode.InternalServerError)
{
return true;
}
Exception innerException;
if (rfe.InnerException is HttpRequestException hre)
{
innerException = hre.InnerException;
}
else
{
innerException = rfe.InnerException;
}
// The InnerException could be SocketException or WebException when an endpoint is invalid and IOException if it's a network issue.
return innerException is WebException ||
innerException is SocketException ||
innerException is IOException;
StartupOptions and the Timeout property
This PR adds the new
Startupproperty toAzureAppConfigurationOptions. You can set theStartup.Timeoutproperty equal to aTimeSpanthat represents how long the provider should continue retrying the store and replicas passed to the provider before failing.Example usage: