Sub-Question 1:
`reduce6` supposedly adds more work - each thread sums up several elements. However, the total number of threads decreases by the same factor, so the total amount of work remains O(n). As for the 'step complexity' - that should be the number of kernel invocations. It is indeed O(log(n)): each invocation reduces the data vector by a larger factor, so fewer invocations are needed overall. I think it's actually Theta(log(n)), if I remember correctly how the grid dimensions are set.
Sub-Question 2:
Well, the kernel assumes each warp is 'full', in the sense that there is enough input data for all of its threads. A warp is 32 threads, and each thread reads at least 2 input elements, so the kernel assumes there are at least 64 input elements (occupying `64 * sizeof(T)` bytes). The same applies to shared memory: the single-warp part of the kernel takes its input from shared memory, reading `32 * 2` elements of `sizeof(T)` bytes each from it.
Sub-Question 3:
When `nIsPow2` is set to true - and note this is not a runtime parameter the kernel receives, but rather a template parameter, so a separate version of the code is compiled than for `nIsPow2` being false - the condition `nIsPow2 || i + blockSize < n` always holds, and the compiler (assuming it's smart enough) will elide the bounds check altogether. Thus every thread will (safely) perform `mySum += g_idata[i + blockSize]`.
Sub-Question 4:
`mySum` is the per-warp sum. After we finish calculating it, some thread from each warp needs to share it with the other warps - and this is done via shared memory (which all warps in the block can access). Now, note that the 'double assignments' - both to `mySum` and to a shared memory location - are due to further use by different threads: each thread calculates its own `mySum` value, repeatedly adding to it. After the final assignment of `mySum` to a shared memory location, the thread exits - and a lower-index thread uses that assignment. But all threads execute the same code, so they perform 'unnecessary' assignments from `mySum` to shared memory.