
Yesterday I encountered a very strange error, and after a day I had barely made any progress, so I guess it's a good candidate for asking the community. I will ask for some patience because I think it's a tough one.

I have a C# WinForms app which hangs after a few clicks in production. This never happens in the development environment, only in production. When the hang occurs there are no error messages; the GUI simply becomes unresponsive and Task Manager shows the task as "not responding". I tried it myself on the production environment and I can confirm the behavior.

Unfortunately it is not possible to install the development tools and debug the application in the prod environment. The best I could do was to take memory dumps of the application when it stopped. The problem is that I don't understand what I see in the dump at all: my main thread (the GUI thread) seems to be stuck in an instruction, and I cannot find any reason for it.

Here is the stack trace of my main thread:

KERNELBASE.dll!_RaiseException@16()  + 0x54 bytes    
[External Code]    
CFAPControlLibrary.dll!CFAPControlLibrary.Communication.Base.GetSetting(string settingName) Line 850 + 0x10 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.ConfigHelper.Get<CFAPControlLibrary.DataTypes.ActionSortingOption>(string settingName) Line 25 + 0x35 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.ConfigHelper.Get<CFAPControlLibrary.DataTypes.ActionSortingOption>(string settingName, CFAPControlLibrary.DataTypes.ActionSortingOption defaultVal) Line 15 + 0x9 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.DataTypes.ActionStorage.Sort(System.Collections.Generic.List<CFAPControlLibrary.DataTypes.ActionClass> subject) Line 167 + 0xe bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.DataTypes.ActionStorage.GetByStatus(string pStatus) Line 162 + 0x46 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.ActionSelector.FillNodes() Line 48 + 0x26 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.CFAPMain.OnActionDetailsArrived(CFAPControlLibrary.CFAPMain.RawActionDetails bwr) Line 371 + 0x10 bytes    C#
CFAPControlLibrary.dll!CFAPControlLibrary.CFAPMain.OnGetDetailsCompleted(object sender, System.ComponentModel.RunWorkerCompletedEventArgs e) Line 337 + 0xb bytes    C#
user32.dll!_InternalCallWinProc@20()  + 0x23 bytes    
user32.dll!_UserCallWinProcCheckWow@32()  + 0xb3 bytes    
user32.dll!_DispatchMessageWorker@8()  + 0xe6 bytes    
user32.dll!_DispatchMessageW@4()  + 0xf bytes    
[External Code]    
CFAPHost.exe!CFAPHost.Program.Main(string[] args) Line 50 + 0x1d bytes    C#
[External Code]    
mscoreei.dll!__CorExeMain@0()  + 0x38 bytes    
mscoree.dll!_ShellShim__CorExeMain@0()  + 0x227 bytes    
mscoree.dll!__CorExeMain_Exported@0()  + 0x8 bytes    
kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes    
ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes    
ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes

And here is my source code for the top stack frames. The disassembly from KernelBase.dll: [image: Frame from KernelBase.dll]

Then the last frame from my code; m_SettingCache is a Dictionary and it does not contain the requested key: [image: Base.GetSetting]

The next couple of frames: [three images, each captioned "Frame from KernelBase.dll"]

I think the code is pretty straightforward: it's just generic setting reading with a default value. If something goes wrong (the setting name is undefined or conversion is not possible), the default value is returned. The code surely works. What I see from the dump is that the read from the dictionary never returns. It should throw a KeyNotFoundException, but that never happens. Any suggestions?

Note: the main thread is indeed stopped in the state captured by the dump: every time I make a dump the result is the same.

Note 2: the hang never happens on the first execution of this code path. In every scenario this very same code path had already been executed before the hang (deduced from the app log).

I will provide more details on request. Thanks in advance.

Edit:

CFAPControlLibrary.dll is the main assembly of the application. It contains the Windows Forms and their corresponding logic. Communication with the server is achieved with WCF, and the bigger requests are made on a parallel thread using a BackgroundWorker. The execution path you see in the call stack is invoked by the completion event of such a BackgroundWorker.
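
For readers unfamiliar with the pattern, here is a minimal sketch of that kind of wiring. It is not the application's actual code: `DetailsLoaderSketch`, `fetch` and `onCompleted` are hypothetical names, and the WCF call is replaced by a plain delegate. When the worker is created on a WinForms UI thread, `RunWorkerCompleted` is raised on that UI thread, which matches the call stack above.

```csharp
using System;
using System.ComponentModel;

// Hypothetical illustration; names are not taken from the real application.
public static class DetailsLoaderSketch
{
    // fetch stands in for the server request; onCompleted for the UI update.
    public static BackgroundWorker Create(Func<string> fetch, Action<string> onCompleted)
    {
        var worker = new BackgroundWorker();

        // DoWork runs on a thread-pool thread.
        worker.DoWork += (sender, e) => { e.Result = fetch(); };

        // RunWorkerCompleted runs after DoWork has finished.
        worker.RunWorkerCompleted += (sender, e) =>
        {
            // Reading e.Result when e.Error is set would rethrow, so check first.
            if (e.Error != null) return;
            onCompleted((string)e.Result);
        };

        return worker;
    }
}
```

The caller starts the request with `RunWorkerAsync()` on the returned worker.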

I pasted the requested code bits here

My AppDomain.CurrentDomain.UnhandledException handler is here

The part of the stack which I first considered irrelevant but which later proved to be important (sensitive string literals are deleted from the image):

[image: Evidence for Application.Run] This shows that Application.Run was called; I have no idea why it is not shown in the call stack.

Update

After spending three days without finding the cause of the problem, I decided to try a workaround. The memory dumps showed that the application always hangs at the very same point: when a KeyNotFoundException should have been thrown. So the most straightforward workaround was to refactor that code so that it does not throw where possible. That version passed the tests and never hung. This is not a solution at all, but we couldn't spend any more time on this. So basically I cross my fingers, ship the code and hope I never see this hang again.
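
A minimal sketch of what such a non-throwing lookup could look like, using `Dictionary.TryGetValue` instead of the indexer. This is an illustration only, not the refactored production code: `SettingCacheSketch`, `SortingOptionSketch` and the sample cache contents are hypothetical, and it assumes the setting type is an enum.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical enum standing in for ActionSortingOption.
public enum SortingOptionSketch { ByName, ByDate }

// Hypothetical stand-in for the settings cache described in the question.
public static class SettingCacheSketch
{
    private static readonly Dictionary<string, string> m_SettingCache =
        new Dictionary<string, string> { { "ActionSorting", "ByDate" } };

    // Returns the parsed setting, or defaultVal if the key is missing or
    // the stored text cannot be parsed. No exception is thrown for a
    // missing key. T is assumed to be an enum type.
    public static T Get<T>(string settingName, T defaultVal) where T : struct
    {
        string raw;
        if (!m_SettingCache.TryGetValue(settingName, out raw))
        {
            return defaultVal; // missing key: no KeyNotFoundException
        }

        T parsed;
        return Enum.TryParse(raw, out parsed) ? parsed : defaultVal;
    }
}
```

For example, `SettingCacheSketch.Get("Missing", SortingOptionSketch.ByName)` returns the default without any exception being raised or caught.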

Thank you for all the suggestions

  • Can you post the code which you think is causing the issue? Commented Sep 25, 2013 at 11:03
  • @AccessDenied I think the relevant code can be seen in the images. If you would like to see it in textual form, then write back and I will post the most important bits somewhere. If you mean the full project, I am afraid I cannot upload it without a very good reason, since my company considers it a business secret. Commented Sep 25, 2013 at 11:07
  • Some additional info might help. What is CFAPControlLibrary, and do you use any threading or async I/O? Commented Sep 25, 2013 at 11:11
  • @HenkHolterman Question edited to include that information. Commented Sep 25, 2013 at 11:19
  • Can you post (shortened versions of) the Completed handler and the code that starts the Bgw? Any Control.Invoke() inside DoWork()? Commented Sep 25, 2013 at 11:22

1 Answer

user32.dll!_DispatchMessageW@4()  + 0xf bytes    
[External Code]    
CFAPHost.exe!CFAPHost.Program.Main(string[] args) Line 50 + 0x1d bytes    C#

Rewrite. There is something seriously wrong with this part of the stack trace. The Main() method should always call Application.Run() to start pumping the message loop, or a ShowDialog() call should be present; those are the two normal ways in which messages can be dispatched. Neither is present, yet the DispatchMessage() winapi function is getting called anyway.

There is a very obscure other way in which messages can get pumped in the CLR. It happens when an application uses the lock statement on an [STAThread] thread, like the main thread of a GUI app. WaitHandle.WaitOne() and Thread.Join(), the other common methods that block, do the same. Blocking an STA thread is illegal since it is so likely to cause deadlock, so the CLR pumps to avoid trouble. The code that does that would be hidden in the [External Code] section.

There's certainly evidence for that in the posted code: it uses lock in very inappropriate places. Using lock in UI code is never correct.

Seeing deadlock when the app crashes is then also easily explained.

This is a serious structural problem in the code, you'll need to fix it. Start from the Main() method, this goes wrong very early. Easy to check on your dev machine as well, just look at the call stack.


3 Comments

The strange thing is that I have registered a handler for the AppDomain.CurrentDomain.UnhandledException event and it never gets executed. Well, at least as far as I can tell; maybe the handler itself is too complicated and crashes? I will add a pastebin link with my handler to the OP. I can compile a 64-bit debug version to test this scenario; do you think it would help to see more details?
Thanks for the tip. I will try to utilize what I can learn from that question. But only tomorrow since today I don't have access anymore to the prod environment. Concerning your locking advice: I revised the code and all the locks guarding that collection were on the GUI thread, so the locking was completely pointless. I removed all of them, but again I cannot test them on the failing environment (well no big expectations here after finding out that all the locks were on the same thread). Even if I cannot track the problem down, removing the locks is a big gain, many thanks!
Your reasoning based on the stack trace is absolutely correct. I wish I had provided all the important information at the beginning. It turned out that Application.Run was called; it's just not shown in the stack trace. I edited my question to show the call in the code window. This is still the old dump; today I will run a couple of tests with the new version (refactored locks).
