Channel: GameDev.net

DirectCompute: sync within warp

In countless CUDA-related sources I've found that, when operating within a single warp, one can skip syncthreads because all threads of a warp execute instructions in lockstep. I followed that advice and applied it in DirectCompute (I use an NVIDIA GPU). I wrote this code, which does nothing more than a good old prefix-sum of 64 elements (64 is the size of my thread group):

```hlsl
groupshared float errs1_shared[64];
groupshared float errs2_shared[64];
groupshared float errs4_shared[64];
groupshared float errs8_shared[64];
groupshared float errs16_shared[64];
groupshared float errs32_shared[64];
groupshared float errs64_shared[64];

void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[2*threadIdx] + errs1_shared[2*threadIdx + 1];
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[2*threadIdx] + errs2_shared[2*threadIdx + 1];
    if (threadIdx < 8)  errs8_shared[threadIdx] = errs4_shared[2*threadIdx] + errs4_shared[2*threadIdx + 1];
    if (threadIdx < 4)  errs16_shared[threadIdx] = errs8_shared[2*threadIdx] + errs8_shared[2*threadIdx + 1];
    if (threadIdx < 2)  errs32_shared[threadIdx] = errs16_shared[2*threadIdx] + errs16_shared[2*threadIdx + 1];
    if (threadIdx < 1)  errs64_shared[threadIdx] = errs32_shared[2*threadIdx] + errs32_shared[2*threadIdx + 1];
}
```

This works flawlessly.
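For context, a function like this would be called from a 64-thread compute shader entry point. The sketch below is an assumption about the surrounding kernel, not code from the post (`CSMain` and `InputBuf` are hypothetical names); note the barrier after the initial store, which is needed regardless of warp-sync assumptions because `errs1_shared` is written by both warps of the group:

```hlsl
// Hypothetical surrounding kernel -- a minimal sketch, not the poster's code.
// InputBuf and CSMain are assumed names.
StructuredBuffer<float> InputBuf : register(t0);

[numthreads(64, 1, 1)]
void CSMain(uint3 GTid : SV_GroupThreadID, uint3 Gid : SV_GroupID)
{
    uint threadIdx = GTid.x;

    // Each of the 64 threads loads one element into shared memory.
    errs1_shared[threadIdx] = InputBuf[Gid.x * 64 + threadIdx];

    // Required in any case: errs1_shared is written by both warps
    // of the group before threads 0..31 read all 64 elements.
    GroupMemoryBarrierWithGroupSync();

    CalculateErrs(threadIdx);
}
```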
I noticed that I have bank conflicts in there, so I changed the code to this:

```hlsl
void CalculateErrs(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
    if (threadIdx < 8)  errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
    if (threadIdx < 4)  errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
    if (threadIdx < 2)  errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
    if (threadIdx < 1)  errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
}
```

To my surprise, this version causes race conditions. Is it because I should not rely on implicit synchronization within a warp when working with DirectCompute instead of CUDA? If so, that hurts my performance by a measurable margin: the first version, bank conflicts and all, is still around 15-20% faster than the second, which is conflict-free but requires a GroupMemoryBarrierWithGroupSync between each assignment.
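Spelled out, the conflict-free variant with the barriers the question mentions would look like the following. This is a sketch of what the post describes, not code taken from it; note that each barrier sits outside the `if`, since HLSL requires group-sync barriers to be reached by all threads in the group:

```hlsl
// Conflict-free reduction with explicit group barriers between levels,
// as described above -- a sketch, not the original code.
void CalculateErrsSynced(uint threadIdx)
{
    if (threadIdx < 32) errs2_shared[threadIdx] = errs1_shared[threadIdx] + errs1_shared[threadIdx + 32];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 16) errs4_shared[threadIdx] = errs2_shared[threadIdx] + errs2_shared[threadIdx + 16];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 8)  errs8_shared[threadIdx] = errs4_shared[threadIdx] + errs4_shared[threadIdx + 8];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 4)  errs16_shared[threadIdx] = errs8_shared[threadIdx] + errs8_shared[threadIdx + 4];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 2)  errs32_shared[threadIdx] = errs16_shared[threadIdx] + errs16_shared[threadIdx + 2];
    GroupMemoryBarrierWithGroupSync();
    if (threadIdx < 1)  errs64_shared[threadIdx] = errs32_shared[threadIdx] + errs32_shared[threadIdx + 1];
}
```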

