DistributedCache Redis Lock Timeouts #8836

kfrancis@clinicalsupportsystems.com created 26 days ago

ABP Framework version: v8.3.0
UI Type: MVC
Database System: EF Core (SQL Server)
Tiered (for MVC) or Auth Server Separated (for Angular): yes

We are getting this exception very often and it's causing the system to be unusable. We had a failed go-live because of this and we've not been successful at locating the reason:

[2025-02-20 15:11:51.633] [Error] wn0mdwk000176 (77)  A task was canceled.
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Volo.Abp.Threading.SemaphoreSlimExtensions.LockAsync(SemaphoreSlim semaphoreSlim, CancellationToken cancellationToken)
   at Volo.Abp.Caching.DistributedCache`2.GetOrAddAsync(TCacheKey key, Func`1 factory, Func`1 optionsFactory, Nullable`1 hideErrors, Boolean considerUow, CancellationToken token)
   at Volo.Abp.LanguageManagement.DynamicResourceLocalizer.FillAsync(LocalizationResourceBase resource, String cultureName, Dictionary`2 dictionary)
   at Volo.Abp.Localization.LocalizationResourceContributorList.FillAsync(String cultureName, Dictionary`2 dictionary, Boolean includeDynamicContributors)
   at Volo.Abp.Localization.AbpDictionaryBasedStringLocalizer.GetAllStringsAsync(String cultureName, Boolean includeParentCultures, Boolean includeBaseLocalizers, Boolean includeDynamicContributors)
   at Volo.Abp.Localization.AbpDictionaryBasedStringLocalizer.GetAllStringsAsync(Boolean includeParentCultures, Boolean includeBaseLocalizers, Boolean includeDynamicContributors)
   at Volo.Abp.Localization.AbpStringLocalizerExtensions.GetAllStringsAsync(IStringLocalizer stringLocalizer, Boolean includeParentCultures, Boolean includeBaseLocalizers, Boolean includeDynamicContributors)
   at Volo.Abp.AspNetCore.Mvc.ApplicationConfigurations.AbpApplicationLocalizationAppService.GetAsync(ApplicationLocalizationRequestDto input)
   at Castle.DynamicProxy.AsyncInterceptorBase.ProceedAsynchronous[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo)
   at Volo.Abp.Castle.DynamicProxy.CastleAbpMethodInvocationAdapterWithReturnValue`1.ProceedAsync()
   at Volo.Abp.GlobalFeatures.GlobalFeatureInterceptor.InterceptAsync(IAbpMethodInvocation invocation)
   at Volo.Abp.Castle.DynamicProxy.CastleAsyncAbpInterceptorAdapter`1.InterceptAsync[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo, Func`3 proceed)
   at Castle.DynamicProxy.AsyncInterceptorBase.ProceedAsynchronous[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo)
   at Volo.Abp.Castle.DynamicProxy.CastleAbpMethodInvocationAdapterWithReturnValue`1.ProceedAsync()
   at Volo.Abp.Auditing.AuditingInterceptor.ProceedByLoggingAsync(IAbpMethodInvocation invocation, AbpAuditingOptions options, IAuditingHelper auditingHelper, IAuditLogScope auditLogScope)
   at Volo.Abp.Auditing.AuditingInterceptor.ProcessWithNewAuditingScopeAsync(IAbpMethodInvocation invocation, AbpAuditingOptions options, ICurrentUser currentUser, IAuditingManager auditingManager, IAuditingHelper auditingHelper, IUnitOfWorkManager unitOfWorkManager)
   at Volo.Abp.Auditing.AuditingInterceptor.ProcessWithNewAuditingScopeAsync(IAbpMethodInvocation invocation, AbpAuditingOptions options, ICurrentUser currentUser, IAuditingManager auditingManager, IAuditingHelper auditingHelper, IUnitOfWorkManager unitOfWorkManager)
   at Volo.Abp.Auditing.AuditingInterceptor.InterceptAsync(IAbpMethodInvocation invocation)
   at Volo.Abp.Castle.DynamicProxy.CastleAsyncAbpInterceptorAdapter`1.InterceptAsync[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo, Func`3 proceed)
   at Castle.DynamicProxy.AsyncInterceptorBase.ProceedAsynchronous[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo)
   at Volo.Abp.Castle.DynamicProxy.CastleAbpMethodInvocationAdapterWithReturnValue`1.ProceedAsync()
   at Volo.Abp.Validation.ValidationInterceptor.InterceptAsync(IAbpMethodInvocation invocation)
   at Volo.Abp.Castle.DynamicProxy.CastleAsyncAbpInterceptorAdapter`1.InterceptAsync[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo, Func`3 proceed)
   at Castle.DynamicProxy.AsyncInterceptorBase.ProceedAsynchronous[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo)
   at Volo.Abp.Castle.DynamicProxy.CastleAbpMethodInvocationAdapterWithReturnValue`1.ProceedAsync()
   at Volo.Abp.Uow.UnitOfWorkInterceptor.InterceptAsync(IAbpMethodInvocation invocation)
   at Volo.Abp.Castle.DynamicProxy.CastleAsyncAbpInterceptorAdapter`1.InterceptAsync[TResult](IInvocation invocation, IInvocationProceedInfo proceedInfo, Func`3 proceed)
   at Volo.Abp.AspNetCore.Mvc.ApplicationConfigurations.AbpApplicationLocalizationController.GetAsync(ApplicationLocalizationRequestDto input)
   at lambda_method3566(Closure, Object)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.AwaitableObjectResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.g__Logged|12_1(ControllerActionInvoker invoker)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.g__Awaited|26_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)

Pretty standard deployment: web, auth, and API all pointing to the same Redis instance on Azure.

Redis health check is good:

Redis latency seems ok:

Help!

kfrancis@clinicalsupportsystems.com created 26 days ago

I've spent the morning trying to produce a simple project to show the issue, but it's hard. I never see this issue locally, only when deployed.

https://github.com/kfrancis/abp-cache-issue-repo

I've tried to get something similar (run TestConcurrentLocks.ps1), but so far I've not been able to reproduce it locally.

It's curious that all of the similar exceptions are identical: DynamicResourceLocalizer, with LockAsync being canceled. And generally, it feels like something is wrong with caching. We regularly see issues that we associate with caching, like cached results quickly getting thrown out, or permissions that seem to flip-flop (sometimes menu items that should be there are present, sometimes not), etc.

enisn created 25 days ago

Support Team .NET Developer

Hi,

I'm forwarding this issue to our core team and they'll start investigating. I just created an internal issue for this.

While that investigation proceeds at the framework level, I'd like to ask for some more details.

Is your Redis instance under high load?
If yes, can you try increasing maxmemory for the Redis instance? If you have a custom redis.conf, it might be limited.

kfrancis@clinicalsupportsystems.com created 25 days ago

No, there's barely any usage yet because the system can't handle it.

It feels like a race condition, IMHO.

In SemaphoreSlimExtensions.LockAsync:

There's a small window between WaitAsync completing and GetDispose being called where cancellation could occur
If cancellation happens in this window, the semaphore would be acquired but the IDisposable might not be returned, potentially leaving the semaphore locked
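
To make the pattern under discussion concrete, here is a minimal sketch of an "acquire, then hand back an IDisposable" helper. It is not ABP's source; the names SemaphoreLockSketch and Releaser are hypothetical. In this sketch nothing observes the token after WaitAsync returns, so whether the window described above really exists depends on what the framework's own code does between those two steps.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only, not ABP's implementation.
public static class SemaphoreLockSketch
{
    public static async Task<IDisposable> LockAsync(
        this SemaphoreSlim semaphore, CancellationToken token = default)
    {
        // If the token is canceled while waiting, this throws an
        // OperationCanceledException (TaskCanceledException derives from it)
        // and the semaphore has NOT been acquired by this caller.
        await semaphore.WaitAsync(token);

        // In this sketch there is no cancellation check between the
        // acquisition above and returning the releaser below.
        return new Releaser(semaphore);
    }

    private sealed class Releaser : IDisposable
    {
        private readonly SemaphoreSlim _semaphore;
        private int _disposed;

        public Releaser(SemaphoreSlim semaphore) => _semaphore = semaphore;

        // Release at most once, even if Dispose is called repeatedly.
        public void Dispose()
        {
            if (Interlocked.Exchange(ref _disposed, 1) == 0)
            {
                _semaphore.Release();
            }
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        var semaphore = new SemaphoreSlim(1, 1);

        using (await semaphore.LockAsync())
        {
            // A second caller whose token is canceled while it waits.
            using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(100));
            try
            {
                await semaphore.LockAsync(cts.Token);
            }
            catch (OperationCanceledException ex)
            {
                Console.WriteLine($"Waiter canceled: {ex.GetType().Name}");
            }
        }

        Console.WriteLine($"CurrentCount after release: {semaphore.CurrentCount}");
    }
}
```

The second caller here is canceled while it waits on a semaphore that someone else holds. That is one way to get a cancellation exception with LockAsync at the top of the stack without the semaphore itself being left locked.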

In GetOrAddAsync:

The double-check pattern used here assumes the first GetAsync result remains valid when entering the lock
Between the first GetAsync and obtaining the lock, another thread could have modified or removed the cache value
Within the lock, after the factory() call and before SetAsync, an exception or cancellation could leave the cache in an inconsistent state

The most concerning race condition is in GetOrAddAsync where cache reads aren't transactional with respect to the lock. The code assumes cache state observed before taking the lock remains valid inside the lock, which may not be true in a distributed system. That might explain why this issue isn't happening in dev, where the instances are running on the same system.
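
As a point of comparison, here is a minimal sketch of a double-checked get-or-add that re-reads the cache after taking the lock instead of trusting the pre-lock read. Everything here is an assumption for illustration: GetOrAddSketch is a hypothetical class, the ConcurrentDictionary merely stands in for a distributed cache, and the key "localization:en" is made up. It does not reproduce ABP's DistributedCache.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch, not ABP's DistributedCache.
public sealed class GetOrAddSketch
{
    // In-memory stand-in for a distributed cache such as Redis.
    private readonly ConcurrentDictionary<string, string> _store = new();

    // In-process lock: it serializes callers inside ONE process only.
    private readonly SemaphoreSlim _lock = new(1, 1);

    public async Task<string> GetOrAddAsync(
        string key, Func<Task<string>> factory, CancellationToken token = default)
    {
        // First check, outside the lock (fast path).
        if (_store.TryGetValue(key, out var cached))
        {
            return cached;
        }

        await _lock.WaitAsync(token);
        try
        {
            // Second check, inside the lock: another caller may have
            // filled the entry while this one was waiting.
            if (_store.TryGetValue(key, out cached))
            {
                return cached;
            }

            var value = await factory();
            _store[key] = value;
            return value;
        }
        finally
        {
            // Always release, including when the factory throws.
            _lock.Release();
        }
    }
}

public static class Program
{
    public static async Task Main()
    {
        var cache = new GetOrAddSketch();
        var factoryCalls = 0;

        // Twenty concurrent requests for the same missing key.
        var tasks = new Task<string>[20];
        for (var i = 0; i < tasks.Length; i++)
        {
            tasks[i] = cache.GetOrAddAsync("localization:en", async () =>
            {
                Interlocked.Increment(ref factoryCalls);
                await Task.Delay(50); // simulate a slow load
                return "loaded";
            });
        }

        await Task.WhenAll(tasks);
        Console.WriteLine($"factory calls: {factoryCalls}");
    }
}
```

Within a single process the second check lets the slow factory run once for all twenty callers. A SemaphoreSlim is process-local, though, so separate web, auth, and API processes sharing one Redis instance would each hold their own lock. That fits the observation that behavior differs between a single dev machine and a deployed environment, but it remains a hypothesis.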

kfrancis@clinicalsupportsystems.com created 21 days ago

Just a heads up while you look into this, we are working on a bit of an "abp caching playground" to assist from our side: https://github.com/Clinical-Support-Systems/abp-caching-playground

It's meant to help us determine how changes in the caching implementation change the overall caching health/throughput, but also (hopefully) expose possible issues with the caching implementation.

Cool things:

We've figured out how to support k6 for load testing in aspire, this being how we wanted to "stress" the caching implementation to make sure it's working.
We've added redis caching metrics to the aspire dashboard even though the aspire documentation says that metrics aren't possible.
This approach is also helping us do more representative load/burst testing, as one issue we're struggling with is that load testing locally produces wildly different results than any deployed release

berkansasmaz created 18 days ago

Support Team .NET Developer

Sorry for the late reply. We missed this question because the colleague handling this topic was on vacation. They will get back to you next Monday. Thank you for your patience.

kfrancis@clinicalsupportsystems.com created 13 days ago

Please don't close this ticket. It is unresolved. Any news?

I'll note, though, while I'm responding, that I see very similar caching issues on a completely separate project. It feels like a cache stampede issue causing a race condition on the lock internal to the implementation, which then causes the cache to need to be loaded again, etc.

I've been sidetracked on the AbpCachingPlayground while we separated out the Aspire k6 components (https://github.com/kfrancis/k6-aspire-hosting), but I'll be getting back to it shortly to test my hypothesis.

enisn created 12 days ago

Support Team .NET Developer

There may be a race condition while acquiring the lock, but it's hard to determine.

Can you reproduce it with a newly created solution? Can you share minimal reproduction steps, or a minimal application that reproduces the problem, so we can check whether there is a problem at the framework level?