Conversation

Contributor
@amerjusupovic amerjusupovic commented Nov 2, 2023

StartupOptions and the Timeout property

This PR adds a new Startup property to AzureAppConfigurationOptions. Set Startup.Timeout to a TimeSpan that controls how long the provider keeps retrying the store and any replicas passed to the provider before startup fails.

Example usage:

config.AddAzureAppConfiguration(options =>
{
   // . . .
    options.ConfigureStartupOptions(startupOptions =>
    {
        startupOptions.Timeout = TimeSpan.FromMinutes(5);
    });
});

@amerjusupovic amerjusupovic linked an issue Nov 2, 2023 that may be closed by this pull request
@amerjusupovic amerjusupovic marked this pull request as ready for review November 6, 2023 19:51
{
private const int MaxRetries = 2;
- private static readonly TimeSpan MaxRetryDelay = TimeSpan.FromMinutes(1);
+ internal static readonly TimeSpan MaxRetryDelay = TimeSpan.FromMinutes(1);
Member

Is this change necessary?

Contributor Author
@amerjusupovic amerjusupovic Nov 7, 2023

Referenced it in StartupOptions to try and keep this value consistent for both files, but I could just use a constant there too. Also, not entirely sure about the MaxRetryDelay property being necessary anyway, so open to removing that/replacing it with others.

Member

Do we want to support a different MaxRetryDelay for startup vs refresh?
@jimmyca15

Member

I don't see why we would.

Member

Looks like my second comment got swallowed. Although I expect we would use the same time, I don't think this should be shared in that manner. A separate field to reference where needed seems appropriate.

@zhenlan
Member

zhenlan commented Nov 8, 2023

                if (_mappedData == null)

Is this only triggered during refresh? If so, it means startup is already done (though the initial configuration load had failed). We shouldn't use the startup logic here, because if anything fails here it will be retried on the next refresh. #Closed


Refers to: src/Microsoft.Extensions.Configuration.AzureAppConfiguration/AzureAppConfigurationProvider.cs:211 in c8ada35. [](commit_id = c8ada35, deletion_comment = False)

},
cancellationToken)
.ConfigureAwait(false);
await Task.Delay(1000).ConfigureAwait(false);
Member

Can we make this 1s a const? This is the minimum delay between retries. Not sure if we should have a maximal delay and slowly grow the delay between retries. Thoughts?

Member

The actual server calls are already retried with some exponential backoff strategy. I don't see why we would need to slow down the overall startup retries.

Member

Unless the concern is that this can be misused by customers if they set SDK retries to 0. If SDK is not retrying, we'll keep retrying every 1 second which is not good. So having this second layer of retries with exponential backoff might be good.

Member

Imagining there is only one client, we don't want to beat that client in a tight loop.

Member

I think Amer will be reworking this a bit. I agree that we should backoff on the top level. I suggested a flow like below.

public async Task LoadAsync()
{
    var cts = new CancellationTokenSource(_startupOptions.Timeout);

    var errors = new List<Exception>();

    int attempts = 0;

    while (true)
    {
        attempts++;

        IEnumerable<ConfigurationClient> clients = _clientManager.GetClients(ignoreBackoff: true);

        if (await TryInitializeAsync(clients, errors, cts.Token))
        {
            break;
        }

        try
        {
        await Task.Delay(TimeSpan.FromSeconds(1).Backoff(attempts), cts.Token);
        }
        catch (OperationCanceledException)
        {
            throw new TimeoutException(
                "The startup timeout expired.",
                new AggregateException(errors));
        }
    }
}

//
// TryInitializeAsync
// It swallows (returns false for) operation canceled exception
// It swallows failoverable exceptions
// It propagates non-failoverable exceptions
// It fills the 'errors' list with swallowed exceptions
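The flow sketched above — retry until TryInitializeAsync succeeds, back off between attempts, and surface the collected errors when the startup timeout expires — can be modeled language-agnostically. This is only an illustrative Python sketch of the control flow; the names (`load`, `try_initialize`) are hypothetical, not the provider's actual API.

```python
import time

def load(try_initialize, timeout_seconds, base_delay=1.0, max_delay=600.0):
    """Retry try_initialize until it succeeds or the startup timeout expires.

    try_initialize(errors) returns True on success; on a failoverable failure
    it appends the exception to `errors` and returns False (mirroring the
    'swallows failoverable exceptions' contract described above).
    """
    deadline = time.monotonic() + timeout_seconds
    errors = []
    attempts = 0
    while True:
        attempts += 1
        if try_initialize(errors):
            return attempts, errors
        # Exponential backoff with a cap, analogous to the
        # TimeSpan.FromSeconds(1).Backoff(attempts) call in the C# sketch.
        delay = min(base_delay * (2 ** (attempts - 1)), max_delay)
        if time.monotonic() + delay > deadline:
            # On timeout, surface every swallowed error at once.
            raise TimeoutError(
                f"startup timed out after {attempts} attempts; "
                f"{len(errors)} errors collected")
        time.sleep(delay)
```

The key design point is that errors from failed attempts are accumulated rather than thrown immediately, so the eventual timeout exception carries the full failure history.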

/// <summary>
/// The maximum delay before timing out when loading data from Azure App Configuration on startup.
/// </summary>
public TimeSpan Timeout { get; set; } = TimeSpan.FromMinutes(1);
Member

Default timeout of one minute seems a bit low. I'd go with 10 minutes.

Member

I'm a bit concerned that we may get complaints like Azure/AppConfiguration#255 all over again if we set the default startup timeout to anything more than a minute.

internal interface IConfigurationClientManager
{
- IEnumerable<ConfigurationClient> GetAvailableClients(DateTimeOffset time);
+ IEnumerable<ConfigurationClient> GetAvailableClients(DateTimeOffset time, bool ignoreBackoff = false);
Member
@jimmyca15 jimmyca15 Nov 8, 2023

It doesn't make sense to pass the time and the ignore backoff parameter.

Perhaps a separate method is warranted. GetClients, which returns all clients. #Closed

Member

How about GetAllClients() to better differentiate from GetAvailableClients?

@jimmyca15
Member

The PR's example usage needs an update.

//
using Azure;
using Azure.Data.AppConfiguration;
using Microsoft.Extensions.Configuration.AzureAppConfiguration.Constants;
Member
@jimmyca15 jimmyca15 Nov 8, 2023

We shouldn't have a Constants namespace. It's not useful.

}
}

if (exception.InnerExceptions?.Any(e => e is RequestFailedException) ?? false)
Member

Suggested change:
- if (exception.InnerExceptions?.Any(e => e is RequestFailedException) ?? false)
+ if (exception.InnerExceptions != null &&
+     exception.InnerExceptions.Any(e => e is RequestFailedException))

Just a suggestion; I feel the ?? false is a bit weird to read.

{
if (startupStopwatch.Elapsed < FailOverConstants.StartupFixedBackoffDuration)
{
await Task.Delay(startupStopwatch.Elapsed.CalculateFixedStartupBackoffDuration(), cancellationToken).ConfigureAwait(false);
Member

The TimeSpan you are passing to the method has a different value than the one you checked in the condition.

Member

To avoid the problem, you could cache elapsed in a variable, but I'd recommend an approach like

TimeSpan delay;
if (startupStopwatch.TryGetFixedBackoff(out TimeSpan backoff))
{
    delay = backoff;
}
else
{
    postFixedWindowAttempts++;

    delay = FailOverConstants.MaxFixedStartupBackoff.CalculateBackoff(
        attempts: postFixedWindowAttempts,
        maxDuration: FailOverConstants.MaxBackoffDuration
    );
}

await Task.Delay(delay, ct);
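The shape suggested above — a short, fixed backoff while still inside the initial startup window, then exponential growth capped at the maximum backoff — can be checked numerically. This Python sketch borrows the constants discussed in this PR (30 s fixed delay, 10 min window and cap); the function name is hypothetical.

```python
FIXED_WINDOW_SECONDS = 600      # StartupFixedBackoffDuration (10 min)
MAX_FIXED_BACKOFF_SECONDS = 30  # MaxFixedStartupBackoff
MAX_BACKOFF_SECONDS = 600       # MaxBackoffDuration (10 min)

def startup_delay(elapsed_seconds, post_window_attempts):
    """Delay before the next startup retry.

    Inside the fixed window: a constant short delay.
    After the window: exponential growth from the fixed delay, capped.
    """
    if elapsed_seconds < FIXED_WINDOW_SECONDS:
        return MAX_FIXED_BACKOFF_SECONDS
    return min(MAX_FIXED_BACKOFF_SECONDS * (2 ** post_window_attempts),
               MAX_BACKOFF_SECONDS)
```

Reading `Stopwatch.Elapsed` once per iteration (here, the `elapsed_seconds` argument) is exactly what avoids the condition/argument mismatch flagged in the comment above.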

@amerjusupovic amerjusupovic force-pushed the ajusupovic/time-based-retries branch from a5f4c2a to 1fd54a0 on November 15, 2023 21:45
@amerjusupovic amerjusupovic changed the base branch from main to preview November 15, 2023 21:45

double offset = interval * rand.NextDouble();

return TimeSpan.FromMilliseconds(lowest + offset);
Member

For a ratio of 0.5, do we expect a jittered time of 10s to fall in [5, 15]? If so, the above implementation is incorrect, as it generates the range [7.5, 12.5]. I hope something like the below is easier to understand:

double jitter = ratio * (rand.NextDouble() * 2 - 1);
double interval = timeSpan.TotalMilliseconds * (1 + jitter);
return TimeSpan.FromMilliseconds(interval);

Member

Jittering a 10s duration by a ratio of 0.5 introduces a 5-second (10 × 0.5) window, so the expectation is [7.5, 12.5].

Member

The formula can be updated easily to achieve the desired results, but that's not intuitive to me. The 50% jitter doesn't give me 50% more (or less) than what I had. So if I want 100% more (or less) than what I had, I actually need a 200% jitter. Isn't that weird?

Member

Discussed separately with @zhenlan. It seems more people interpret the ratio as the delta from the original duration to the extreme end of the possible window. In that case, we would benefit from changing it.

So jitter(10, .5) yields [5, 15] rather than [7.5, 12.5]
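The two interpretations debated above differ only in how the ratio maps onto the jitter window; a quick numeric check makes the ranges concrete. This is an illustrative Python sketch with hypothetical names, not the library's code.

```python
import random

def jitter_half_window(duration, ratio, rng=random):
    # Original interpretation: ratio spans the *whole* window,
    # so jitter_half_window(10, 0.5) yields values in [7.5, 12.5].
    lowest = duration * (1 - ratio / 2)
    return lowest + duration * ratio * rng.random()

def jitter_full_delta(duration, ratio, rng=random):
    # Agreed interpretation: ratio is the delta to *either* extreme,
    # so jitter_full_delta(10, 0.5) yields values in [5, 15].
    return duration * (1 + ratio * (rng.random() * 2 - 1))
```

`jitter_half_window` mirrors the `lowest + offset` implementation quoted earlier in the thread; `jitter_full_delta` mirrors @zhenlan's proposed formula.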


var startupExceptions = new List<Exception>();

IEnumerable<ConfigurationClient> clients = _configClientManager.GetAllClients();
Member

Move this into the loop so you get new clients when replica auto discovery is enabled.

public static readonly TimeSpan MinBackoffDuration = TimeSpan.FromSeconds(30);
public static readonly TimeSpan MaxBackoffDuration = TimeSpan.FromMinutes(10);
public static readonly TimeSpan StartupFixedBackoffDuration = TimeSpan.FromMinutes(10);
public static readonly TimeSpan MaxFixedStartupBackoff = TimeSpan.FromSeconds(30);
Member

Rename this to MinStartupBackoffDuration, with comments.

calculatedMilliseconds = maxDuration.TotalMilliseconds;
}

TimeSpan maxBackoffDuration = CalculateMaxStartupBackoffDuration(startupTimeElapsed);
Member

Actually, the variable shouldn't even be needed.

calculatedMilliseconds = maxDuration.TotalMilliseconds;
}

TimeSpan maxBackoffDuration = CalculateMaxStartupBackoffDuration(startupTimeElapsed);
Member

The maxBackoffDuration variable shouldn't be needed. The max backoff in this case should always be our 10 min max backoff.

Member
@jimmyca15 jimmyca15 Nov 15, 2023

What's the point of the maxDuration parameter then? I don't see it used.

Contributor Author

I agree, I missed some things here. Zhenlan suggested just using CalculateBackoffDuration and adjusting that method to use jitter instead of the randomization it does in the last line. At this point we're using the same backoff duration values, so unless there's a reason refresh should differ from startup retries, I think that makes sense.

Member

A single method will be better. That's your suggestion right?

Contributor Author

Right, just one method, and removing the CalculateExponentialStartupBackoffDuration method.

Member

Perfect, let's do it.

Member

Just updating that implementation with this one should be good, right?

return TimeSpan.FromMilliseconds(lowest + offset);
}

private static TimeSpan CalculateMaxStartupBackoffDuration(TimeSpan startupTimeElapsed)
Member

I really only see a need for this logic in TryGetFixedBackoff, and in that case the default return isn't needed since TryGetFixedBackoff would return false. I recommend moving it in there.

internal static class TimeSpanExtensions
{
private const int MaxAttempts = 63;
private const double JitterRatio = 0.5;
Member

I believe @jimmyca15 prefers this to be 0.25.

Member
@zhenlan zhenlan left a comment

:shipit:

rfe.Status >= (int)HttpStatusCode.InternalServerError;
}

if (innerException is HttpRequestException hre && hre.InnerException != null)
Member

Seems solid, but the multi-assignment to innerException threw me off a bit. I think it would help if the assignment to innerException happened only once, because it's really behaving in a branched manner.

I'd suggest

            if (rfe.Status == HttpStatusCodes.TooManyRequests ||
                rfe.Status == (int)HttpStatusCode.RequestTimeout ||
                rfe.Status >= (int)HttpStatusCode.InternalServerError)
            {
                return true;
            }

            Exception innerException;

            if (rfe.InnerException is HttpRequestException hre)
            {
                innerException = hre.InnerException;
            }
            else
            {
                innerException = rfe.InnerException;
            }

            // The InnerException could be SocketException or WebException when an endpoint is invalid and IOException if it's a network issue.
            return innerException is WebException ||
                   innerException is SocketException ||
                   innerException is IOException;

Successfully merging this pull request may close these issues.

Use time-based retry for startup

5 participants