Efficient Parallel Merge Sort For Fixed and Variable Length Keys

ABSTRACT

We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use more register communication (not shared memory), and does not suffer from over-segmentation as opposed to previous comparison-based sorts. Using these techniques we are able to achieve a sorting rate of 250 MKeys/sec, which is about 2.5 times faster than Thrust merge sort performance, and 70% faster than non-stable state-of-the-art GPU merge sorts.

Building on this sorting algorithm, we develop a scheme for sorting variable-length key/value pairs, with a special emphasis on string keys. Sorting non-uniform, unaligned data such as strings is a fundamental step in a variety of algorithms, yet it has received comparatively little attention. To our knowledge, our system is the first published description of an efficient string sort for GPUs. We are able to sort strings at a rate of 70 MStrings/s on one dataset and up to 1.25 GB/s on another dataset using a GTX 580.

1. INTRODUCTION

Sorting is a widely-studied fundamental computing primitive that is found in a plethora of algorithms and methods. Sort is useful for organizing data structures in applications such as sparse matrix-vector multiplication [3], the Burrows-Wheeler transform [1, 15], and Bounding Volume Hierarchies (LBVH) [11]. While CPU-based algorithms for sort have been thoroughly studied, with the shift in modern computing to highly parallel systems in recent years, there has been a resurgence of interest in mapping sorting algorithms onto these architectures.

For fixed key lengths where direct manipulation of keys is allowed, radix sort on the GPU has proven to be very efficient, with recent implementations achieving over 1 GKeys/sec [13]. However, for long or variable-length keys (such as strings), radix sort is not as appealing an approach: the cost of radix sort scales with key length. Rather, comparison-based sorts such as merge sort are more appealing since one can modify the comparison operator to handle variable-length keys. The current state of the art in comparison sorts on the GPU includes a bitonic sort by Peters et al. [16], a bitonic-based merge sort (named Warpsort) by Ye et al. [26], a Quicksort by Cederman and Tsigas [5], and sample sorts by Leischner et al. [12] and Dehne and Zaboli [8].

In this work we implement a merge-sort-based comparison sort that is well-suited for massively parallel architectures like the GPU. Since a GPU requires hundreds or thousands of threads to reach bandwidth saturation, an efficient GPU comparison sort must select a sorting implementation that has ample independent work at every stage. Merge sort is therefore well-suited for the GPU, as any two pre-sorted blocks can be merged independently. We focus on designing an efficient stable merge sort (order preserved on ties) that reduces warp divergence, avoids over-segmenting blocks of data, and increases register utilization when possible. We extend our techniques to also implement an efficient variable-length key sort (string sort). Our two major contributions are a fast stable merge sort that is the fastest current comparison sort on GPUs, and the first GPU-based string sort of which we are aware.

© 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media.

2. RELATED WORK

Sorting has been widely studied on a broad range of architectures. Here we concentrate on GPU sorts, which can generally be classified as radix or comparison sorts.

Radix sorts rely on a binary representation of the sort key. Each iteration of a radix sort processes b bits of the key, partitioning its output into 2^b parts. The complexity of the sort is proportional to b, the number of bits, and n, the size of the input (O(bn)), and fast scan-based split routines that efficiently perform these partitions have made the radix sort the sort of choice for key types that are suitable for the radix approach, such as integers and floating-point numbers. Merrill and Grimshaw's radix sort [13] is integrated into the Thrust library and is representative of the fastest GPU-based radix sorts today. However, as keys become longer, radix sort becomes proportionally more expensive from a computational perspective, and radix sort is not suitable for all key types/comparisons (consider sorting integers in Morton order [14], for instance).

Comparison sorts can sort any sequence using only a user-specified comparison function between two elements and can thus sort sequences that are unsuitable for a radix sort. Sorting networks stipulate a set of comparisons between elements that result in a sorted sequence, traditionally with O(n log^2 n) complexity. Because those comparisons have ample parallelism and are oblivious to the input, they have been used for sorting since the earliest days of GPU computing [17]. Recent sorting-network successes include an implementation of Batcher's bitonic sort [16].

The classic Quicksort is also a comparison sort that lacks the
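The scan-based split that makes radix sort fast on GPUs can be illustrated sequentially. Below is a minimal single-bit sketch for non-negative integer keys; the GPU version computes the scan of the bit flags in parallel, and this is an illustration of the idea, not Merrill and Grimshaw's implementation.

```python
def split_by_bit(keys, bit):
    """Stable split: keys whose `bit` is 0 precede keys whose `bit` is 1,
    preserving relative order, exactly as one radix pass requires."""
    zeros_total = sum(1 for k in keys if not (k >> bit) & 1)
    out = [0] * len(keys)
    zero_pos, one_pos = 0, zeros_total  # exclusive-scan results
    for k in keys:
        if (k >> bit) & 1:
            out[one_pos] = k
            one_pos += 1
        else:
            out[zero_pos] = k
            zero_pos += 1
    return out

def radix_sort(keys, bits=32):
    """b passes of the stable split give an O(bn) sort on b-bit keys."""
    for b in range(bits):
        keys = split_by_bit(keys, b)
    return keys
```

Note that the cost is b full passes over the data regardless of key content, which is why the text observes that radix sort becomes proportionally more expensive as keys grow longer.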
[Figure 2 diagram: unsorted data → (a) per-thread block sort in registers → (b) simple merge → (c) multi merge → sorted data.]

Figure 2: Our block sort consists of two steps. First each thread performs an eight-way bitonic sort in registers (as seen in a). Then each thread is responsible for merging those eight-element sequences over the log(numElements/8) stages.
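The structure in Figure 2 can be sketched sequentially. In this hedged Python sketch, the built-in sort stands in for the per-thread eight-way bitonic network, and a stable two-way merge stands in for the simple/multi merge stages:

```python
def merge_runs(a, b):
    """Stable two-way merge: `<=` sends ties to the earlier run."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def block_sort(data, t=8):
    """Two-step block sort sketch: each 'thread' sorts t elements
    (done in registers on the GPU), then the sorted runs are merged
    pairwise over log2(len(data)/t) stages."""
    runs = [sorted(data[i:i + t]) for i in range(0, len(data), t)]
    while len(runs) > 1:
        runs = [merge_runs(runs[i], runs[i + 1]) if i + 1 < len(runs)
                else runs[i] for i in range(0, len(runs), 2)]
    return runs[0] if runs else []
```

On the GPU the first step has no cross-thread communication at all, which is why it can live entirely in registers; only the later stages need shared or global memory.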
[Figure 7 plots: MKeys/sec vs. number of elements (×1024) for our key-only and key-value sorts, the Satish key-value sort, and the Thrust key-only and key-value sorts; MStrings/sec and MB/s vs. number of elements for the sentences dataset (0.98–1.25 GB/s) and the words dataset (443.2–568.88 MB/s).]

Figure 7: Performance in MKeys/sec of our merge sort for string, key, and key-value sorts on a GTX 580. We also compare the performance ratio between our key-value sort and string sort. Though our string sort has a slower strings/sec sorting rate for our sentences database, since each string is much longer, the overall MB/s sorting rate is higher.
this performance degradation, and discuss future optimizations.

Though merge sort is an n log n method, we require a certain number of elements to reach peak performance. We expect our sorting rate to degrade due to the asymptotic O(1/log n) factor, which begins to take effect after about 4 million elements. The Thrust sort pays a much higher initial overhead, leading to a flatter degradation over time.

We test our implementation on a GTX 580, which has up to 1.58 TFlop/s and a theoretical bandwidth of 192.4 GB/s. Since sorts are traditionally not computationally intensive, the theoretical bandwidth bounds our performance. Our next section will analyze our performance, give a rough estimate and compare our performance to the theoretical bounds of a merge sort algorithm, and discuss factors limiting our performance. We will also discuss the performance and limiting factors of our string sort implementation, as well as future work that may improve performance.

6. PERFORMANCE ANALYSIS

In this section we will analyze our performance and determine (1) how well our implementation compares to a theoretical bound of our key and key-value sorts; (2) where our time is going; and (3) where efforts for future improvements lie.

Theoretical Upper Bound.
First we will derive a loose upper bound on the possible sorting rate for a key-only and key-value merge sort (blocks of size p) using our strategy. We do this to show that our method is reasonably efficient when compared to this bound, and to provide a comparison for future merge sort implementations. We will use as a limiting factor our global memory bandwidth B. If we assume that for our blocksort stage and each merge stage we must read elements in at least twice (once for the register window, and once for the shared memory window) and write out elements at least once, we have 3(1 + log(n/p)) global memory accesses, for a key-only sort.

As an example, if we assume our blocksize p is 1024, and we are sorting four million elements, we will have a blocksort stage and twelve merge stages, each requiring at minimum two global reads (each element is placed once in registers to search, and once in a shared memory search space) and one write (totaling 38n). Under these conditions (four million elements) our theoretical sorting rate cap is at 1.26 GKeys/s, which is about 5x faster than what we are able to achieve. Similarly, we can show that our cap for key-value pairs is 941 Mpairs/s, which is also about 5x faster than our achieved rate.

We can attempt to show a tighter theoretical bound by including the minimum shared memory communication required at every stage. Under our conditions:

- Each thread is responsible for k elements;
- Each thread performs a binary search into a shared memory window of size p;
- For k − 1 stages we perform a linear search; and
- The sum of all search spaces loaded into shared memory is at least n.

Therefore we can get a lower bound on the minimum shared memory communication needed by calculating the lower bound per thread. Each thread requires log(p) + (k − 1) shared memory reads to search, and all threads combined will load the entire input set. Since there are again log(n/p) merge stages, the number of shared memory element loads necessary is at least n log(n/p)(1 + (log(p) + k − 1)/k). Since the theoretical maximum bandwidth of shared memory is about 1.2 TB/s, plugging in the same p and n and choosing k as four, we add an extra 0.885 ms to sort four million elements on an NVIDIA GTX 580. This reduces the theoretical sorting rate to 997.4 MKeys/s for key-only sort and 785.45 MPairs/s for key-value sort. Therefore our sort performance is about 4x and 4.2x away from the theoretical bound for key-only and key-value pairs respectively.

Though it is unlikely for an algorithm that requires both synchronization and moving memory windows to be close to the global
[Figure 8 plots: (a) number of ties (×1000) vs. merge step for the sentences and words datasets; (b) log-log histograms of the number of shared memory windows vs. length of concurrent ties (first 4 characters) for the sentences and Wiki words datasets.]

Figure 8: Figure 8a shows the total number of global memory accesses needed to resolve ties per merge stage for a million strings for both of our datasets. Figure 8b measures the number of shared memory windows with duplicate keys, and the length of these windows after our sort is complete. As the size of our partition grows (while our window size remains fixed), the variance between the largest and smallest key in our window shrinks. This leads to ties becoming more probable, forcing global memory reads to resolve the ties, and degrading performance. For our dataset involving book sentences, this variance is even smaller, leading to more global memory lookups and a lower Strings/sec sorting rate. We also test the number of key ties in a row once the sort is finished, and report the number of ties. Since our shared window size is relatively small (we select as a heuristic 1024 elements), performing a binary or linear search within long blocks with the same key will be relatively useless, and require a large number of global memory tie-breaks.
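The loose global-memory bound in Section 6 reduces to simple arithmetic. Below is a quick check of the key-only numbers quoted there; the 38n access total, the four-million-element input, and the 192.4 GB/s bandwidth come from the text, while 4-byte keys are an assumption.

```python
B = 192.4e9             # GTX 580 theoretical global bandwidth, bytes/s
n = 4 * 1024 * 1024     # "four million" elements in the text's example
key_bytes = 4           # assumed 32-bit keys
accesses_per_elem = 38  # the text's 38n total global memory accesses

seconds = accesses_per_elem * key_bytes * n / B
rate = n / seconds      # bandwidth-capped sorting rate, keys/s
# rate / 1e9 ≈ 1.266, i.e. the ~1.26 GKeys/s cap quoted in the text
```

Note that the cap depends only on bytes moved per element, not on n; the key-value cap is lower simply because the values must move too.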
[Figure 9 plots: histograms of string length and of the number of character ties after sorting, for the sentences dataset and the Wikipedia words dataset.]

Figure 9: Statistics regarding our two string datasets. All strings are stored concurrently with null-termination signifying the end of a string and the beginning of a new string. Our words dataset has strings of length 8 characters on average, while our sentences dataset has strings 98 characters long on average. As strings from our sentences are much longer on average, they will run into more lengthy tie-break scenarios as we perform our merge sort. Our sentences dataset has many ties ranging from 10–20 characters long, and quite a number that are even greater than 100 (we clipped the range).
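Figure 9's layout (strings stored back to back, null-terminated) pairs naturally with summarizing each string by its first four characters, the granularity Figure 8's tie statistics are reported at. A hedged sketch of such a comparator follows; the function names are illustrative, not the paper's API.

```python
def make_key(chars, start):
    """Pack the first four bytes of the null-terminated string at
    `start` into one comparable 32-bit integer (zero-padded)."""
    k = 0
    for i in range(4):
        c = chars[start + i] if start + i < len(chars) else 0
        k = (k << 8) | c
        if c == 0:
            k <<= 8 * (3 - i)   # pad the remaining bytes with zeros
            break
    return k

def string_less(chars, a, b):
    """Comparator sketch: compare packed 4-byte keys first; only on a
    tie scan further characters, which is the costly global-memory
    tie-break on the GPU."""
    ka, kb = make_key(chars, a), make_key(chars, b)
    if ka != kb:
        return ka < kb
    i = 4
    while True:
        ca = chars[a + i] if a + i < len(chars) else 0
        cb = chars[b + i] if b + i < len(chars) else 0
        if ca != cb:
            return ca < cb
        if ca == 0:             # both strings ended: equal, not less
            return False
        i += 1
```

The fast path compares one integer per pair; the slow path, taken exactly when the first four characters tie, is what the following section identifies as the dominant cost on the sentences dataset.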
memory theoretical cap, this bound gives us an idea of how much time is being spent on the actual sorting itself (instead of global memory data movement). We now must analyze what else may be causing this performance difference. From this analysis, we may be able to learn where we should focus our efforts to optimize merge sort in future work. Next we will discuss factors that affect our sorting performance.

6.1 Fixed-Length Keys Performance

The two factors that have the largest effect on our merge sort performance are divergence and shared memory bank conflicts. Though we can streamline our windowed loads to be free of divergence or bank conflicts, it is difficult to do so for both the binary search stage and the linear search stage.

Bank Conflicts.
Bank conflicts occur most frequently in our binary search stage. To illustrate, consider a SIMD warp with each thread performing a binary search in a shared memory window. Each thread will query the middle value; this results in an efficient broadcast read. However, given two evenly distributed sequences, in the next query half of the threads will search to the left and the other half will search to the right. This will create a number of 2-way bank conflicts. Similarly, subsequent searches will double the number of possible shared memory bank conflicts. We attempted to resolve this problem by implementing an address translator that would avoid bank conflicts. However, this modification did not improve performance.

Divergence.
Though our linear search stage does not suffer heavily from bank conflicts, it does suffer from divergence problems. When each thread performs its linear search, it begins independently in a shared memory location. It will then incrementally check the subsequent values until it finds the first value larger than its key. Since a SIMD warp must operate in lockstep, a thread cannot progress past this linear search stage until all threads are finished. Therefore a warp will have to wait for the thread which has the largest gap to search through.

6.2 Variable-Length Keys Performance

Since our string sort is built on top of our fixed-length key merge sort, it suffers from the same divergence and bank conflicts mentioned previously. However, it is also affected by other causes of divergence as well as load imbalance.

Divergence.
Since warps operate in lockstep, any threads that break ties will stall all other threads within a warp. Since these divergent forks must be resolved in high-latency global memory, they are inherently expensive for SIMD machines. This isn't an issue that can easily be rectified either. However, since we keep our original string locations static and in order, if there are multiple ties when comparing two strings, the majority of consecutive tie-breaks should be cached reads. Although this doesn't directly address the divergence cost, it helps to mitigate the effect.

When comparing our two datasets it becomes apparent that the average number of global memory ties that must be resolved begins to dominate performance. Not only are our sentences much longer on average than our words, but they require much more work to resolve ties. Figure 9b compares the number of shared concurrent characters between two concurrent strings after being sorted for our two datasets. After every step of our merge sort, comparisons will be between more similar strings (as illustrated in Figure 9b and Figure 8a); this gives us an idea of how many worst-case comparisons will be required.

For authors, it is very common to begin sentences in similar ways (e.g., "Then", "And", "But", etc.), which results in many string ties of about 10–20 of the same characters in a row. In Figure 9b we even see a set of very similar strings greater than 100 characters long (we capped our histogram). Since all threads in a warp must wait for a tie to resolve before continuing, such occurrences are very costly.

We could expect a database of names and addresses to have somewhat similar behavior, where ties among sets of common names must be broken. On the other hand, our Wikipedia word list dataset has far fewer ties and none that exceed 20 characters in a row. As we can see from Figure 7b, our sentences dataset has an over-5x slower sorting rate (lower MStrings/sec) than our words dataset. However, since each sentence is much longer (about 10x), we achieve a higher GB/s sorting rate with sentences.

Long Sets of Similar Strings.
As sequences become very large in comparison to the memory windows we can handle, the distribution of the values (variance) decreases. Since our shared memory and register sizes are limited, we cannot directly scale our windows to keep pace. Therefore, some threads in our linear search stage are more likely to run into long sets of ties before calculating their correct indexes, while others resolve their location immediately. Figure 8a illustrates this effect. As we begin to merge larger and larger blocks, the number of total ties within a merge step grows. Figure 8b shows the number of keys that share the same value after our string sort is complete. This effect is a data-dependent load imbalance. Though it was more efficient in our uniform-length key sort to perform linear searches in every merge stage (after an initial binary search) as described in Section 3, this change in distribution makes the worst case for linear searches more probable. Therefore, we limit the number of linear searches, and have threads perform more binary searches (worse average case, but better worst case) when locating their insertion indexes. When comparing our two datasets, the effect is much more pronounced in our sentences database (again, since authors have common ways of beginning sentences).

We could also attempt to mitigate the amount of variance within a window using the following strategy: since each thread knows the largest and smallest value in a memory window, a simple AND operation can determine the set of most significant bits shared by all values within that window. Then a block can decide whether it is worthwhile to shift those shared bits out and load new bits to increase the variance within that block. We think this can help reduce the number of ties, and we plan to implement it in future work.

7. CONCLUSION

We have presented an efficient hierarchical approach to merge sort. We harness more register communication, handle arbitrarily large partitions, create just enough work to fill the machine, and limit unnecessary binary searches.

Our merge sort attains best-of-class performance through four main techniques: (1) in our initial step, sorting 8 elements within each thread, which leverages register bandwidth; (2) a novel binary-then-linear searching approach in our merge within a thread block; (3) avoiding over-segmentation with moving shared memory and register windows; and (4) a modular, three-step formulation of merge sort that is well-suited to the GPU computational and memory hierarchy (and possibly suitable for tuning to other parallel architectures).

From this merge sort we are able to specialize a string sort, which we believe is the first general string sort on GPUs. The performance of the string sort highlights the large cost of thread divergence when comparisons between strings must break ties with non-coalesced, long-latency global memory reads. We see this as the most critical issue for optimizing string sort on future GPU-like architectures.

There are a number of possible directions we would like to take future work in, both in comparison sorts and in string sorts in general. We have focused on implementing an efficient merge sort. However, we would like to explore comparison-based techniques for handling very large sequences across multiple GPUs. For example, hybrid techniques that combine merge sort with sample sort appear promising for handling hundreds of gigabytes worth of key-value pairs.

We would also like to develop methods for avoiding thread divergence and global memory tie-breaks in our current string sort, and explore hybrid string sorting techniques that might combine radix sorts with comparison sorts (such as our merge sort).

Acknowledgments

We appreciate the support of the National Science Foundation (grants OCI-1032859 and CCF-1017399). We would like to thank Anjul Patney for some assistance in creating figures, and Ritesh Patel for testing our string sort and reporting bugs. We would also like to thank the reviewers for their valuable comments and feedback.

8. REFERENCES

[1] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Publishing Company, Incorporated, 1st edition, 2008.
[2] K. E. Batcher. Sorting networks and their applications. In Proceedings of the AFIPS Spring Joint Computing Conference, volume 32, pages 307–314, Apr. 1968.
[3] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09: Proceedings of the 2009 ACM/IEEE Conference on Supercomputing, pages 18:1–18:11, Nov. 2009.
[4] N. Bell and J. Hoberock. Thrust: A productivity-oriented library for CUDA. In W. W. Hwu, editor, GPU Computing Gems, volume 2, chapter 4, pages 359–372. Morgan Kaufmann, Oct. 2011.
[5] D. Cederman and P. Tsigas. GPU-Quicksort: A practical quicksort algorithm for graphics processors. Journal of Experimental Algorithmics, 14:4:1.4–4:1.24, Jan. 2010.
[6] A. Davidson and J. D. Owens. Register packing for cyclic reduction: A case study. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 4:1–4:6, Mar. 2011.
[7] A. Davidson, Y. Zhang, and J. D. Owens. An auto-tuned method for solving large tridiagonal systems on the GPU. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium, pages 956–965, May 2011.
[8] F. Dehne and H. Zaboli. Deterministic sample sort for GPUs. CoRR, 2010. http://arxiv.org/abs/1002.4464.
[9] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High performance graphics coprocessor sorting for large database management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325–336, June 2006.
[10] Project Gutenberg. Free ebooks by Project Gutenberg, 2010. http://www.gutenberg.org/.
[11] C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha. Fast BVH construction on GPUs. Computer Graphics Forum, 28(2):375–384, 2009.
[12] N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, Apr. 2010.
[13] D. Merrill and A. Grimshaw. Revisiting sorting for GPGPU stream architectures. Technical Report CS2010-03, Department of Computer Science, University of Virginia, Feb. 2010.
[14] G. Morton. A Computer Oriented Geodetic Data Base and A New Technique In File Sequencing. International Business Machines Co., 1966.
[15] R. A. Patel, Y. Zhang, J. Mak, and J. D. Owens. Parallel lossless data compression on the GPU. In Proceedings of Innovative Parallel Computing (InPar '12), May 2012.
[16] H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger. Fast in-place, comparison-based sorting with CUDA: a study with bitonic sort. Concurrency and Computation: Practice and Experience, 23(7):681–693, 2011.
[17] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In Graphics Hardware 2003, pages 41–50, July 2003.
[18] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.
[19] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Graphics Hardware 2007, pages 97–106, Aug. 2007.
[20] R. Sinha, J. Zobel, and D. Ring. Cache-efficient string sorting using copying. Journal of Experimental Algorithmics, 11, Feb. 2007.
[21] E. Sintorn and U. Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 68(10):1381–1388, 2008.
[22] V. Volkov and J. Demmel. LU, QR, and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-2008-49, Electrical Engineering and Computer Sciences, University of California at Berkeley, 13 May 2008.
[23] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 31:1–31:11, Nov. 2008.
[24] V. Volkov and J. W. Demmel. Using GPUs to accelerate the bisection algorithm for finding eigenvalues of symmetric tridiagonal matrices. LAPACK Working Note 197, Department of Computer Science, University of Tennessee, Knoxville, Jan. 2008.
[25] Wikipedia. Wikimedia downloads, 2010. http://dumps.wikimedia.org.
[26] X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne. High performance comparison-based sorting algorithm on many-core GPUs. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–10, Apr. 2010.