../../bin/linux/release/parallel_sort check **===-------------------------------------------------===** Processing 16000000 elements... Host CPU Processing time: 5410.590820 (ms) G80 CUDA Processing time: 495.165009 (ms) Speedup: 10.926844X Test PASSED For this implimentation I used a fairly straight bitonic sort implimentation. Each step of the algorithm requires a swap network with N/2 swaps, so I create N/2 threads, each responsible for one swap at each step. The threads are arranged in blocks such that subsequent threads load subsequent spots in memory (from both end of the swap). This allows the memory accesses to be more efficiently coalesced. The difficulty comes when the comparisons span blocks, because then all the threads in all the blocks must by synchronized after each swap. Once all the comparisons take place in a single block (which must happen with block sizes that are powers of two) then the blocks can seperate and work on their individual portions of the complete array in shared memory. I used a block size of 256 threads, which covers 512 peices of data (since each thread is a swap, so needs two points to compare). With a larger block size like 512 then the sort can spend longer in the faster single block merge, but each SM will handle fewer threads so this proved slower. With a smaller block size of 128 threads then we need 65536 linear blocks to handle the array, and I started to run into errors when attempting this. This may be a problem in my code or CUDA, I am not sure. Since each thread (for both version of merge) uses at 10 or less registers and each block uses at most 2*BLOCK_SIZE*sizeof(float) or 2KB of shared memory, we can fit 3 blocks of 256 threads onto each SM, giving the max of 768 threads. This gives 24 warps a block, so should be able to cover 200-cycle memory latency. There will be some bank conflicts in the single block merge function, but otherwise it seems the execution throughput is the main bottleneck.