Hi
Just a simple question about compute shaders (CS5, DX11).
Should atomic operations (InterlockedAdd in my case) work without any issues on a RWByteAddressBuffer and be globally coherent?
I've come back from the CUDA world and written a fairly simple kernel that does some work. The pseudocode is as follows
(both kernels use the same RWByteAddressBuffer):
The first kernel does some work and sets Result[0] = 0
(using Result.Store(0, 0)).
I've checked with the debugger, and the value stored at dword 0 is indeed 0.
Now my second kernel:
    RWByteAddressBuffer Result;

    [numthreads(8, 8, 8)]
    void main()
    {
        for (int i = 0; i < 5; i++)
        {
            uint4 v0 = DoSomeCalculations1();
            uint4 v1 = DoSomeCalculations2();
            uint4 v2 = DoSomeCalculations3();

            if (v0.w == 0 && v1.w == 0 && v2.w)
                continue;

            // Increment the counter by 3 and get its previous value.
            // This should basically allocate space for 3 uint4 values in the buffer.
            uint prev;
            Result.InterlockedAdd(0, 3, prev);

            // Fill the buffer with the 3 uint4 values (the +1 is there because
            // the first 16 bytes are occupied by the DrawInstancedIndirect data).
            Result.Store4((prev + 0 + 1) * 16, v0);
            Result.Store4((prev + 1 + 1) * 16, v1);
            Result.Store4((prev + 2 + 1) * 16, v2);
        }
    }
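To make the intended behaviour concrete, here is a minimal CPU-side sketch in C++ of the allocation pattern the kernel relies on: each writer atomically bumps a shared counter by 3 and then fills the three slots it reserved. The names SlotBuffer, emit_triangle and run_allocation_demo are hypothetical and not part of the original shader, and std::atomic only stands in for InterlockedAdd. The sketch says nothing about GPU memory visibility between the dispatch and the draw; it only shows the invariant the counter is expected to hold.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical CPU stand-in for the buffer: `counter` plays the role of dword 0,
// `owner` plays the role of the uint4 slots that follow the indirect args.
struct SlotBuffer {
    std::atomic<uint32_t> counter{0};
    std::vector<uint32_t> owner;  // 0 = never written, otherwise writer id + 1
    explicit SlotBuffer(std::size_t capacity) : owner(capacity, 0) {}
};

// Reserve three consecutive slots (like InterlockedAdd(0, 3, prev)) and fill them.
inline void emit_triangle(SlotBuffer& buf, uint32_t writer_id) {
    const uint32_t prev = buf.counter.fetch_add(3, std::memory_order_relaxed);
    for (uint32_t k = 0; k < 3; ++k) {
        buf.owner[prev + k] = writer_id + 1;  // slots are disjoint per reservation
    }
}

// Runs `writers` threads, each emitting `triangles_per_writer` triangles.
// Returns true if the counter matches the slot count and every aligned triple
// of slots was written by exactly one writer.
inline bool run_allocation_demo(uint32_t writers, uint32_t triangles_per_writer) {
    const std::size_t capacity = std::size_t(writers) * triangles_per_writer * 3;
    SlotBuffer buf(capacity);

    std::vector<std::thread> pool;
    for (uint32_t id = 0; id < writers; ++id) {
        pool.emplace_back([&buf, id, triangles_per_writer] {
            for (uint32_t t = 0; t < triangles_per_writer; ++t) {
                emit_triangle(buf, id);
            }
        });
    }
    for (auto& th : pool) th.join();

    if (buf.counter.load() != capacity) return false;
    for (std::size_t i = 0; i < capacity; i += 3) {
        const uint32_t o = buf.owner[i];
        if (o == 0 || o != buf.owner[i + 1] || o != buf.owner[i + 2]) return false;
    }
    return true;
}
```

Calling run_allocation_demo(8, 1000) checks that no reservation overlaps another and that the final counter equals the number of slots written, which is what the shader expects from InterlockedAdd as well.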
I invoke it with Dispatch(4, 4, 4).
Then I use DrawInstancedIndirect to draw the buffer, but occasionally a triangle is missing here and there for a frame, as if the atomic counter does not work as expected.
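For clarity, the buffer layout described above can be sketched in C++ as follows. DrawInstancedIndirect reads four 32-bit values (vertex count per instance, instance count, start vertex location, start instance location), which is where the 16-byte header comes from. The struct and helper names below are hypothetical, and the sketch assumes the counter at dword 0 doubles as the vertex count, with one uint4 (16 bytes) per vertex.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the 16-byte argument block read by DrawInstancedIndirect.
struct IndirectArgs {
    uint32_t vertexCountPerInstance;  // dword 0: assumed to be the atomic counter
    uint32_t instanceCount;           // dword 1
    uint32_t startVertexLocation;     // dword 2
    uint32_t startInstanceLocation;   // dword 3
};
static_assert(sizeof(IndirectArgs) == 16, "indirect args occupy the first 16 bytes");

// Size of one stored vertex (a uint4) in bytes.
constexpr uint32_t kVertexStride = 16;

// Byte offset of vertex k (0..2) of the triangle reserved at counter value `prev`.
// The +1 skips the 16-byte argument block at the start of the buffer.
constexpr uint32_t vertex_byte_offset(uint32_t prev, uint32_t k) {
    return (prev + k + 1) * kVertexStride;
}

// Number of whole triangles described by a given counter value.
constexpr uint32_t triangle_count(uint32_t counter) {
    return counter / 3;
}
```

With this layout, the first triangle (prev = 0) lands at byte offsets 16, 32 and 48, directly after the argument block.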
Do I need any additional synchronization there?
I've tried AllMemoryBarrierWithGroupSync at the end of the kernel, but it had no effect.
If I do not use the atomic counter and instead just output empty vertices (which turn into degenerate triangles), everything is OK. It looks as if I'm missing some form of synchronization, but I do not see such a thing in DX11.
I've tested on both old and new NVIDIA hardware (680M and 1080); the behaviour is the same.