MP 5 Answers Alan Kaatz 1. Set #define DEFAULT_NUM_ELEMENTS to 16000000 (16 million). Done. 2. Copy and paste into your report the performance results when run without arguments, make sure that the timing measurements do not include host/cuda memory allocation, or cudaMemcpy times. Only include the device processing time ("device processing time" includes the execution time for host function which may call kernels functions multiple times). In your report include this device processing time, host time, and the speedup. Make sure you're using the release configuration for this test. **===-------------------------------------------------===** Processing 16000000 elements... Host CPU Processing time: 6268.899902 (ms) G80 CUDA Processing time: 311.648010 (ms) Speedup: 20.115321X Test PASSED 3. Briefly discuss your implementation. How did you partition the data, sort the partitions, and merge the data? Power-of-two bitonic sort was used. Padding (negative infinity) was added if the array to sort was not a power-of-two. The array was first partitioned into chunks that would fit in shared memory so that a shared-memory kernel sort could be performed. A host loop performed the sort and called the necessary global memory merge or shared memory merge kernel. 4. Is your implementation clearly bottlenecked by either memory bandwidth or execution throughput? Show your calculations to justify your answer. The total amount of global memory loaded for a sort of 16 million elements (padded to 2^24) (factoring in the reduced number of global loads from the shared memory kernels) can be expressed by: memory loaded = (15(15 + 1)/2 + 1) * (2^24) * 4B = 8120172544B (derived by knowing that the sum of numbers can be found with the formula n(n + 1)/2) which is roughly 8 GB. Because of swaps (writes back to memory), an estimated 16 GB of global memory data transfer is assumed (less writes, but less coalescing). Assuming a transfer rate about 70 GB/s, this accounts for about 2/3 of the total G80 processing time. Roughly speaking... Another way to look at it is the FLOPs (in this case, float comparisons) rate. The total number of comparisons is: n = total num of elements = 2^24 g = log2(n) = 24 (n / 2) * (g(g + 1) / 2) = 2516582400 FLOP comparisons 2.5 / .311 = 8 GFLOP/second, yielding far below the theoretical maximum. Of course, the other code consisting of array indexing, value checking, etc. is not factored in, but it's clear both calculations indicate that the bottleneck is memory bandwidth, and not execution throughput.