I am working on a game (shameless plug: Cosmoteer) that is written in a custom game engine on top of Direct3D 11. (It's written in C# using SharpDX, though I think that's immaterial to the problem at hand.)
The problem I'm having is that a small but understandably frustrated percentage of my players (about 1.5% of about 10K players/day) are getting frequent device hangs. Specifically, the call to IDXGISwapChain::Present() is failing with DXGI_ERROR_DEVICE_REMOVED, and calling GetDeviceRemovedReason() returns DXGI_ERROR_DEVICE_HUNG. I'm not ready to dismiss the errors as unsolvable driver issues because these players claim not to be having problems with any other games, and there are more complaints on my own forums about this issue than there are for games with orders of magnitude more players.
My first debugging step was, of course, to turn on the Direct3D debug layer and look for any errors/warnings in the output. Locally, the game runs 100% free of any errors or warnings. (And yes, I verified that I'm actually getting debug output by deliberately causing a warning.) I've also had several players run the game with the debug layer turned on, and they are also 100% free of errors/warnings, except for the actual hung device:
[MessageIdDeviceRemovalProcessAtFault] [Error] [Execution] : ID3D11Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware).
So something my game is doing is causing the device to hang and the TDR to be triggered for a small percentage of players. The latest update of my game measures the time spent in IDXGISwapChain::Present(), and indeed in every case of a hung device, it spends more than 2 seconds in Present() before returning the error. AFAIK my game isn't doing anything particularly "aggressive" with the display hardware, and logs report that average FPS for the few seconds before the hang is usually 60+.
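A minimal sketch of how this kind of measurement could look, not the engine's actual code. `CallTimer` and `TimedResult` are hypothetical names introduced for illustration, and the helper deliberately takes a plain delegate so it does not depend on any SharpDX types:

```csharp
using System;
using System.Diagnostics;

// Hypothetical result type: how long the call took and whether it threw.
public sealed class TimedResult
{
    public TimeSpan Elapsed;
    public Exception Error; // null when the call completed without throwing

    public bool ExceededThreshold(TimeSpan threshold) => Elapsed >= threshold;
}

// Hypothetical helper: times an arbitrary call, such as a lambda wrapping Present().
public static class CallTimer
{
    public static TimedResult Measure(Action call)
    {
        var stopwatch = Stopwatch.StartNew();
        Exception error = null;
        try
        {
            call();
        }
        catch (Exception e)
        {
            // Keep the exception so the caller can log it together with the timing.
            error = e;
        }
        stopwatch.Stop();
        return new TimedResult { Elapsed = stopwatch.Elapsed, Error = error };
    }
}
```

It could be used as `CallTimer.Measure(() => SwapChain.Present(_vsyncIn ? 1 : 0, PresentFlags.None))`, with the result logged whenever `ExceededThreshold(TimeSpan.FromSeconds(2))` is true. The 2-second figure is taken from the observation above.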
So now I'm pretty stumped! I have zero clues about what specifically could be causing the hung device for these players, and I can only debug post-mortem since I can't reproduce the issue locally. Are there any additional ways to figure out what could be causing a hung device? Are there any common causes of this?
Here's my remarkably un-interesting Present() call:
SwapChain.Present(_vsyncIn ? 1 : 0, PresentFlags.None);
I'd be happy to share any other code that might be relevant, though I don't myself know what that might be. (And if anyone is feeling especially generous with their time and wants to look at my full code, I can give you read access to my Git repo on Bitbucket.)
Some additional clues and things I've already investigated:
1. The errors happen on all OSes my game supports (Windows 7, 8, 10, both 32-bit and 64-bit), GPU vendors (Intel, Nvidia, AMD), and driver versions. I've been unable to discern any patterns with the game hanging on specific hardware or drivers.
2. For the most part, the hang seems to happen at random. Some individual players report it crashes in somewhat consistent places (such as on startup or when doing a certain action in the game), but there is no consistency between players.
3. Many players have reported that turning on V-Sync significantly reduces (but does not eliminate) the errors.
4. I have ensured that my code never makes calls to the immediate context or DXGI on multiple threads at the same time by wrapping literally every call to the immediate context and DXGI in a mutex region (C# lock statement). (My code *does* sometimes make calls to the immediate context off the main thread to create resources, but these calls are always synchronized with the main thread.) I also tried synchronizing all calls to the D3D device as well, even though that's supposed to be thread-safe. (Which did not solve *this* problem, but did, curiously, fix another crash a few players were having.)
5. The handful of places where my game accesses memory through pointers (it's written in C#, so it's pretty rare to use raw pointers) are done through a special SafePtr that guards against out-of-bounds access and checks to make sure the memory hasn't been deallocated/unmapped. So I'm 99% sure I'm not writing to memory I shouldn't be writing to.
6. None of my shaders use any loops.
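To make item 4 concrete, here is a minimal sketch of the locking pattern it describes. It assumes a single shared gate object; `GraphicsGate`, `Run`, and `Query` are hypothetical names and not the engine's actual API:

```csharp
using System;

// Hypothetical gate: every call that touches the immediate context or DXGI
// is routed through Run() or Query(), so only one thread at a time is inside.
public sealed class GraphicsGate
{
    private readonly object _sync = new object();

    // For calls that return nothing, e.g. issuing a draw or updating a resource.
    public void Run(Action work)
    {
        lock (_sync)
        {
            work();
        }
    }

    // For calls that return a value, e.g. reading back a result code.
    public T Query<T>(Func<T> work)
    {
        lock (_sync)
        {
            return work();
        }
    }
}
```

Because the C# lock statement is reentrant on the same thread, a gated call that itself makes another gated call does not deadlock. The lock only serializes access between different threads.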
Thanks for any clues or insights you can provide. I know there's not a lot to go on here, which is part of my problem. I'm coming to you all because I'm out of ideas for what to investigate next, and I'm hoping someone else here has ideas for possible causes I can investigate.
Thanks again!