-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Reliability fixes for the Thread Pools #121887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
I think this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses critical reliability issues in three thread pool implementations by reverting problematic changes from #100506 that led to deadlocks in .NET 9. The fix restores a simpler and safer enqueuer/worker handshake protocol that guarantees work items will always have a worker thread available to process them.
Key Changes:
- Simplified the
QueueProcessingStageenum by removing theDeterminingstate, leaving onlyNotScheduledandScheduled - Changed the worker thread protocol to reset the processing stage to
NotScheduledbefore checking for work items (preventing a race condition window) - Removed complex retry/dequeue logic and the
_nextWorkItemToProcessoptimization in favor of always requesting an additional worker when processing an item
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs | Simplified general-purpose ThreadPool enqueuer/worker handshake by removing Determining state, _nextWorkItemToProcess field, and complex retry logic; streamlined Dispatch() to always request a worker after dequeuing an item; |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs | Applied same handshake simplification to Unix socket async engine; removed UpdateEventQueueProcessingStage() method; simplified Execute() to use consistent pattern of resetting state before checking queue |
src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs
Outdated
Show resolved
Hide resolved
…ToHighPriorityGlobalQueue. them
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
When inserting workitems into threadpool queues, we must always guarantee that for every workitem there will be some worker at some point in the future that will certainly notice the presence of the workitem and executes it.
There was an attempt to relax the requirements in #100506.
Sadly, it leads to occasional deadlocks when items are present in the work queues and no workers are coming to pick them up.
The same change was made in all 3 threadpools - IO completion, Sockets and the general purpose ThreadPool. The fix is applied to all three threadpools.
We have seen reports about deadlocks when running on net9 or later releases:
The fix will need to be ported to net10 and net9. Thus this PR tries to restore just the part which changed the enqueuers/workers handshake algorithm.
More stuff was piled up into threadpool since the change, so doing the minimal fix without disturbing the rest is somewhat tricky.
Fixes: #121608 (definitely, I have tried with the repro)
Fixes: #119043 (likely, I do not have a repro to try, but symptoms seem like from the same issue)