
Efficient Parallel Merge Sort for Fixed and Variable Length Keys

Andrew Davidson, University of California, Davis (aaldavidson@ucdavis.edu)
David Tarjan, NVIDIA Corporation (dtarjan@nvidia.com)
Michael Garland, NVIDIA Corporation (mgarland@nvidia.com)
John D. Owens, University of California, Davis (jowens@ece.ucdavis.edu)

ABSTRACT
We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use more register communication (not shared memory), and unlike previous comparison-based sorts it does not suffer from over-segmentation. Using these techniques we are able to achieve a sorting rate of 250 MKeys/sec, which is about 2.5 times faster than Thrust merge sort performance, and 70% faster than non-stable state-of-the-art GPU merge sorts.

Building on this sorting algorithm, we develop a scheme for sorting variable-length key/value pairs, with a special emphasis on string keys. Sorting non-uniform, unaligned data such as strings is a fundamental step in a variety of algorithms, yet it has received comparatively little attention. To our knowledge, our system is the first published description of an efficient string sort for GPUs. We are able to sort strings at a rate of 70 MStrings/s on one dataset and at up to 1.25 GB/s on another dataset using a GTX 580.

1. INTRODUCTION
Sorting is a widely studied, fundamental computing primitive found in a plethora of algorithms and methods. Sort is useful for organizing data structures in applications such as sparse matrix-vector multiplication [3], the Burrows-Wheeler transform [1, 15], and Bounding Volume Hierarchies (LBVH) [11]. While CPU-based algorithms for sort have been thoroughly studied, with the shift in modern computing to highly parallel systems in recent years there has been a resurgence of interest in mapping sorting algorithms onto these architectures.

For fixed key lengths where direct manipulation of keys is allowed, radix sort on the GPU has proven to be very efficient, with recent implementations achieving over 1 GKeys/sec [13]. However, for long or variable-length keys (such as strings), radix sort is not as appealing an approach: the cost of radix sort scales with key length. Rather, comparison-based sorts such as merge sort are more appealing, since one can modify the comparison operator to handle variable-length keys. The current state of the art in comparison sorts on the GPU includes a bitonic sort by Peters et al. [16], a bitonic-based merge sort (named Warpsort) by Ye et al. [26], a Quicksort by Cederman and Tsigas [5], and sample sorts by Leischner et al. [12] and Dehne and Zaboli [8].

In this work we implement a merge-sort-based comparison sort that is well-suited for massively parallel architectures like the GPU. Since a GPU requires hundreds or thousands of threads to reach bandwidth saturation, an efficient GPU comparison sort must select a sorting implementation that has ample independent work at every stage. Merge sort is therefore well-suited for the GPU, as any two pre-sorted blocks can be merged independently. We focus on designing an efficient stable merge sort (order preserved on ties) that reduces warp divergence, avoids over-segmenting blocks of data, and increases register utilization when possible. We extend our techniques to also implement an efficient variable-length-key sort (string sort). Our two major contributions are a fast stable merge sort that is the fastest current comparison sort on GPUs, and the first GPU-based string sort of which we are aware.

2. RELATED WORK
Sorting has been widely studied on a broad range of architectures. Here we concentrate on GPU sorts, which can generally be classified as radix or comparison sorts.

Radix sorts rely on a binary representation of the sort key. Each iteration of a radix sort processes b bits of the key, partitioning its output into 2^b parts. The complexity of the sort is proportional to b, the number of bits, and n, the size of the input (O(bn)), and fast scan-based split routines that efficiently perform these partitions have made radix sort the sort of choice for key types that are suitable for the radix approach, such as integers and floating-point numbers. Merrill and Grimshaw's radix sort [13] is integrated into the Thrust library and is representative of the fastest GPU-based radix sorts today. However, as keys become longer, radix sort becomes proportionally more expensive from a computational perspective, and radix sort is not suitable for all key types/comparisons (consider sorting integers in Morton order [14], for instance).
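For intuition, the scan-based split described above behaves, per pass, like a stable counting sort over one b-bit digit. Below is a minimal scalar sketch (our own illustration, not the GPU formulation), fixing b = 8:

    #include <string.h>

    /* One stable pass of a least-significant-digit radix sort over an 8-bit
     * digit. Sorting 32-bit keys takes four such passes (shift = 0, 8, 16, 24),
     * so total work grows with the number of key bits rather than with
     * comparisons, which is why the cost scales with key length. */
    void radixPass8(const unsigned *in, unsigned *out, int n, int shift) {
        int count[256];
        memset(count, 0, sizeof(count));
        for (int i = 0; i < n; i++) count[(in[i] >> shift) & 0xFF]++;
        for (int d = 1; d < 256; d++) count[d] += count[d - 1];  /* scan: bucket ends */
        for (int i = n - 1; i >= 0; i--)                         /* reverse pass keeps ties in order */
            out[--count[(in[i] >> shift) & 0xFF]] = in[i];
    }

On the GPU, the counting and scan steps are parallelized with the split primitives cited above.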
Comparison sorts can sort any sequence using only a user-specified comparison function between two elements, and can thus sort sequences that are unsuitable for a radix sort. Sorting networks stipulate a set of comparisons between elements that result in a sorted sequence, traditionally with O(n log^2 n) complexity. Because those comparisons have ample parallelism and are oblivious to the input, they have been used for sorting since the earliest days of GPU computing [17]. Recent sorting-network successes include an implementation of Batcher's bitonic sort [16].
[Figure 1: A high-level overview of our hierarchical merge sort. In (a), each CUDA block performs an independent sort on a set of elements. In the second stage (b), CUDA blocks work independently to merge two sequences of data. In the final step (c), CUDA blocks cooperate to merge sequences.]

[Figure 2: Our block sort consists of two steps. First each thread performs an eight-way bitonic sort in registers (as seen in (a)). Then each thread is responsible for merging those eight elements over the log(numElements/8) stages.]

The classic Quicksort is also a comparison sort; it lacks the obliviousness of a sorting network, instead relying on a divide-and-conquer approach. It has superior complexity (O(n log n)) to sorting networks, but is more complex from a parallel implementation perspective. Quicksort was first demonstrated on the GPU by Sengupta et al. [19] and was further addressed by Cederman and Tsigas [5]. Sample sorts generalize quicksorts: rather than splitting the input into 2 or 3 parts as in quicksort, they choose representative or random splitter elements to divide the input elements into many buckets, typically computing histograms to derive element offsets, then sort each bucket independently. Leischner et al. [12] use random splitters, and Dehne and Zaboli [8] deterministic splitters, in their GPU implementations.

[Figure 3: In order to merge partitions much larger than a GPU's shared memory and register capacity, we must manage two moving memory windows. From two sorted sequences we load a portion of one set into registers (A), and a portion of the second into shared memory (B). Then we update each memory window according to the largest elements.]

Merge sorts are O(n log n) comparison sorts that recursively
merge multiple sorted subsequences into a single sorted sequence.
Typical mergesort approaches on GPUs use any sorting method to
sort small chunks of data within a single GPU core, then recursively
merge the resulting sorted subsequences [18].

Hybrid approaches to sorting, typically used to best exploit different levels of the GPU's computational or memory hierarchies during computation, are common; examples include the highly successful radix-bitonic sort of Govindaraju et al. [9] and the quicksort-mergesort of Sintorn and Assarsson [21].

The GPU methods above do not focus on complex key types like strings; in fact, most of the above comparison-based work simply uses fixed-length integers and does not address complex comparison functions like string-compare. The above work also does not address the complexities of storing variable-length keys like strings. In the CPU world, string keys merit special attention; a representative example (among many) is the burstsort-based string-sorting technique of Sinha et al. [20], which optimizes cache management of string keys to achieve better locality and performance.

3. MERGE SORT
In this section we describe our merge sort implementation and compare our design to the previous state-of-the-art GPU merge sort. We organize our sort into a hierarchical three-stage system. We design a merge sort implementation that attempts to avoid shared memory communication when possible (favoring registers), uses a persistent-thread model for merging, and reduces the number of binary searches. Due to these choices, we utilize more register communication and handle larger partitions at a time to avoid load imbalance. Our merge sort consists of three stages, designed to handle different amounts of parallelism. We next highlight our design choices for each stage.

Block Sort.
Since local register communication has much higher throughput than standard shared memory communication on GPUs (such as NVIDIA CUDA-capable cards), using registers for sorting when possible is preferred. A common strategy for achieving higher register utilization on GPUs is to turn fine-grained threads (that handle one element) into fatter, coarse threads (that handle multiple elements).
Work in GPU linear algebra kernels by Volkov et al. [22-24] has shown that quite frequently, sacrificing occupancy (the ratio of active warps to possible warps) for the sake of better register utilization leads to higher throughput. Though in these linear algebra kernels the communication pattern is quite different from our merge sort's downsweep pattern, more recent work by Davidson et al. has shown that similar divide-and-conquer methods can benefit from fewer threads with higher register utilization (register packing) [6, 7]. In order to achieve more register communication, we split our block sort into two stages. First, each thread loads eight consecutive elements into registers and sorts them locally using a bitonic sort, as illustrated in Figure 2. Though bitonic sorters are in general non-stable, since we are only sorting eight elements we can carefully construct our sorting network to maintain stability.
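To sketch this first stage, the following device function sorts one thread's eight register-resident keys. This is a stand-in, not the authors' network: the paper carefully constructs a stable bitonic network, whereas we show an odd-even transposition network, which is stable by construction because it only compare-exchanges adjacent elements on strict inequality:

    /* Each thread sorts its eight keys held in registers. Full unrolling makes
     * every index a compile-time constant, so k[] stays in registers rather
     * than spilling to local memory. For a key-value sort, swap the values
     * alongside the keys. */
    __device__ void sort8InRegisters(unsigned int k[8]) {
    #pragma unroll
        for (int round = 0; round < 8; ++round) {
    #pragma unroll
            for (int i = (round & 1); i < 7; i += 2) {  /* alternate even/odd phases */
                if (k[i] > k[i + 1]) {                  /* strict '>' never reorders ties */
                    unsigned int t = k[i];
                    k[i] = k[i + 1];
                    k[i + 1] = t;
                }
            }
        }
    }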
[Figure 4: Example of the state of the two memory windows of A and B.]

Now that we have a set of sorted sequences (eight elements each), we can begin merging them. A common strategy for merging two sorted sequences A and B is to search for the intersection index of each element of A in B, and vice versa. The final index of an element A_i after the merge is i + j, where j satisfies B_j < A_i <= B_{j+1}. After each element's output index (the sum of its own index and the intersecting index) has been calculated, the resulting list C = merge(A, B) is a new sorted sequence of size sizeof(A) + sizeof(B). Since each thread operates independently, in order to locate the correct intersection index, previous parallel merge sorts have used a binary search for each element in both partitions.
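In code, the index rule above amounts to two binary searches with asymmetric tie handling (a sketch in our own naming; for stability, an element of A counts the B elements strictly smaller than it, while an element of B counts the A elements less than or equal to it, so equal keys keep A first):

    /* First index j with b[j] >= key, i.e., the number of B elements < key. */
    __device__ int countLess(const unsigned int *b, int len, unsigned int key) {
        int lo = 0, hi = len;
        while (lo < hi) {
            int mid = (lo + hi) >> 1;
            if (b[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /* First index i with a[i] > key, i.e., the number of A elements <= key. */
    __device__ int countLessEqual(const unsigned int *a, int len, unsigned int key) {
        int lo = 0, hi = len;
        while (lo < hi) {
            int mid = (lo + hi) >> 1;
            if (a[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /* C[i + countLess(B, lenB, A[i])]      = A[i];   A_i's output position */
    /* C[j + countLessEqual(A, lenA, B[j])] = B[j];   B_j's output position */

As described below, each of our threads performs only the first of its eight searches as a binary search.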
Since, in our case, each block is sorting m elements (m for us is heuristically selected to be 1024), we now have m/8 threads per block, each with eight sorted elements. Our second stage involves log(m/8) stages to merge all these sorted sequences into a final sorted block. In each merge stage, every thread (still responsible for eight elements) changes its search space in order to calculate the correct indices for its elements. In the first merge stage the search space is the neighboring eight elements, in the second stage it becomes sixteen, and so on. Once a thread has calculated the correct index at which to insert its element, it dumps the sorted values into shared memory. After each merge step a thread synchronizes (as intermediate values are dumped into shared memory), and then loads eight new consecutive elements for the next merge stage.

Instead of each thread performing eight binary searches to find each insertion index, we perform a binary search for the first element and use a linear search for the other seven elements, using the result of our initial binary search as a starting point. Since each thread is handling a sorted subsequence, a search using the first element will most likely land close to a search using the next element. This is illustrated in Figure 2. Though it is possible that in the worst case these linear searches will require more trips to memory (O(n) vs. O(log n)), we find in our experiments that in the average case these secondary linear searches require only a fraction of the accesses that a binary search would need.

Merge Sort: Simple.
After our block sort stage, we have t blocks of size m. In our next stage we want each CUDA block to work independently on merging two of these sorted sequences together. At every step we halve the number of blocks and double the size of each sequence we are merging. Our design goal is to create a merge which utilizes shared memory at every stage for our binary and linear searches, even though our hardware shared memory size remains fixed. In order for our algorithm to be able to handle sequences of arbitrary size and still use shared memory effectively, we design two moving memory windows: one in registers, and one in shared memory.

Our moving memory windows work as follows. Each thread loads a set of k values it will merge into registers (again using our register packing strategy), starting at the beginning of one sequence (sequence A). We now have k elements and q threads; the product of k and q gives us the number of elements in registers that a block can handle at a time. This sets the size of the moving window in partition A. Next we load into shared memory a set of values from the other sequence (sequence B). The size of this moving window is set by hardware limitations on the shared memory allowed per block. We will refer to the size of the register window as a_s and the size of the shared memory window as b_s. By keeping track of the largest and smallest values in each moving window (register and shared memory), after each thread finishes updating its current values, a block decides whether to load new values into registers, new values into shared memory, or both. This process is illustrated in Figure 3. We find that performance is improved if our shared memory windows are larger than our register windows.

[Figure 5: Optimization to reduce load imbalance. We can reload elements for the next binary search in order to get more overlap (the 6000-9000 range in this example).]

Now we can step through and merge partitions of arbitrary size. However, implementing our merge as just described has one major disadvantage. Since we are updating blocks based on the status of two moving windows, a block cannot update its register window until all other threads are merged. Similarly, a block cannot update shared memory values until all register values in A have had an opportunity to check whether their insertion index is within the range of the maximum and minimum values in this window. Therefore the amount of work a block will do at any given time is the union of elements that share the same range in the register and shared memory windows, as illustrated in Figure 4.
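To make the window-advance policy concrete, here is a minimal single-threaded emulation of the loop (our own sketch; in the kernel the A window lives in registers across a block's threads and the B window in shared memory):

    #include <algorithm>
    #include <climits>
    #include <vector>

    /* Emulates merging with two bounded windows of sizes a_s and b_s. Only keys
     * no greater than the smaller of the two window maxima are safe to emit;
     * afterwards at least one window is exhausted and is reloaded. */
    std::vector<unsigned> windowedMerge(const std::vector<unsigned> &A,
                                        const std::vector<unsigned> &B,
                                        size_t a_s, size_t b_s) {
        std::vector<unsigned> C;
        size_t ai = 0, bi = 0;
        while (ai < A.size() || bi < B.size()) {
            size_t aHi = std::min(ai + a_s, A.size());   /* register window A[ai, aHi) */
            size_t bHi = std::min(bi + b_s, B.size());   /* shared window   B[bi, bHi) */
            unsigned maxA = (ai < aHi) ? A[aHi - 1] : UINT_MAX;
            unsigned maxB = (bi < bHi) ? B[bHi - 1] : UINT_MAX;
            unsigned bound = std::min(maxA, maxB);
            while ((ai < aHi && A[ai] <= bound) || (bi < bHi && B[bi] <= bound)) {
                bool takeA = (ai < aHi && A[ai] <= bound) &&
                             !(bi < bHi && B[bi] < A[ai]);   /* ties favor A: stable */
                C.push_back(takeA ? A[ai++] : B[bi++]);
            }
            /* Whichever window held the smaller maximum is now empty and refills. */
        }
        return C;
    }

The real kernel additionally reloads registers and shared memory cooperatively across threads, and applies the shift-load optimization described below.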
This leads to some load imbalance between blocks, since threads are responsible for consecutive values. However, if all the values a warp is responsible for lie outside this union, no useless work will be done. Therefore, for every sequence a block handles there can be at most one divergent warp (a portion of the threads searching while others stand idle) at any given time.

However, we still have a possible load-balance issue. If the overlap in our valid window ranges is very small, some blocks will have a lot of work while others may have little to no work. We can help address this issue by modifying our window update step in the following way. Before we update our register window, we first check the insertion index associated with the last value within that window (the last value in the last thread). We can broadcast that index to all threads, and reload the entire shared memory window with this additional offset.

Since all threads need to calculate their insertion index regardless, this step adds no extra computation. An example is illustrated in Figure 5. A disadvantage of this modification is that we will have to reload a portion of values from B into shared memory (b_s - a_i values must be reloaded). Therefore we heuristically choose a switching threshold k on b_s - a_i to decide when to perform this shift load.

Unfortunately, applying this same optimization to the register window (A) is not as easy. We would require all threads to vote in shared memory on whether they lie inside or outside of their current shared memory window; the sum of these votes would then be the offset for our shift load. However, this requires extra work, atomic operations, and the cost of loading a whole new set of register values. Therefore, in general, this shift-load optimization only makes sense for our shared memory moving window.

As in our block sort, each thread will be responsible for multiple values. Every thread checks to see if its value is in the search window's valid range, performs a binary search to find the correct index for its first value, and then performs linear searches for the consecutive elements in that thread.

Merge Sort: Multiple.
Though we can now merge very large sequences within a single block, the final steps of our merge sort will only have a few independent sequences to sort (and the last merge only two independent blocks). Modern GPUs require dozens of blocks to maintain optimal throughput, so we require a method that allows CUDA blocks to cooperate in merging two sequences. Therefore the final stage in our hierarchical sort splits sequences in A into multiple partitions, which are then mapped to appropriate partitions in B.

However, unlike previous GPU merge sorts, we do not bound the number of elements in a partition. Given a set of l sequences, we define the number of blocks p needed to fill our GPU. We then select the number of partitions per sequence s such that s * l = p. Therefore, we still do not suffer from over-segmentation, yet keep the machine as occupied as possible. Davidson et al. used a similar strategy for solving large tridiagonal systems, where multiple CUDA blocks cooperate on a single large system [6].

Otherwise, we utilize the same principles as in our previous step. Each thread is responsible for multiple elements; subsets from partitions in sequences A and B are loaded into register and shared memory windows respectively, and then we perform a binary search for the first element, followed by linear searches.

Now we have a modular hierarchical merge sort that can handle an arbitrary number of partitions and sequence sizes. An overview of our merge sort hierarchy can be seen in Figure 1. We will now compare our sort with the previous state-of-the-art merge sort on the GPU.

3.1 Previous Merge Sorts
Satish et al. developed a comparison-based merge sort implemented in a divide-and-conquer approach [18]. The sorts they created outperformed carefully optimized multicore CPU sorts, which helped dispel the myth that GPUs were ill-suited for sorting.

Their algorithm is designed in the following way. First they divide and locally sort a set of t tiles using Batcher's odd-even merge sort [2]. Next, each block merges two adjacent sequences at a time (log2 t stages) until the entire sequence is sorted. In order to maintain parallelism, and to ensure they are able to use shared memory at every stage, their method splits each sub-partition to make sure each sequence being merged is of a relatively small size (256 elements in their example). In a merge stage, each element in A is assigned to a thread; this thread then performs a binary search into the B partition. Since each of these partitions is sorted, the sum of its index with the binary search index gives the output index in C. This merge process is repeated until the entire dataset is sorted.

This is only a brief summary of the sort implemented by Satish et al. Though the merge sort they created was quite impressive and revolutionary, we must highlight some disadvantages of their approach which our method attempts to address.

Block Sort Disadvantages.
In Satish et al.'s implementation, each thread handles one element, performing an odd-even merge sort to calculate the correct index at any given stage. Though this results in a great deal of parallelism, unfortunately it also results in limited register work (all values are stored in shared memory). In this case, parallelism is being distributed at too fine-grained a level: each thread is responsible for only one element, and each element must perform a binary search.

The block sort they implement is also unstable. A stable sort requires all elements which have equal keys to preserve their ordering (i.e., if a_i = a_j and i < j, a_i will appear before a_j after the sort is complete). Due to its predictability, stability is often a desired trait for algorithms that use sorts.

Over-Segmentation.
Each sorted sequence is divided into small chunks (256 elements in their case) and each thread handles only one element; this again ensures a high level of parallelism. However, it also leads to over-segmentation: the number of chunks being processed is a function of the size of the input, not of the machine. We prefer instead to have p chunks, where p gives the machine just enough parallelism to saturate it. Given the limited shared memory in a machine, achieving this requires extra bookkeeping and software challenges.

Limited Register Work.
Within the merge stage, each element locates its new index through a binary search. This means n binary searches are needed at every merge stage, requiring n log2 t queries into shared memory at each stage. Since each thread is only handling one element, the amount of storage needed in registers (its value and key) is minimal, and most of the information needed to progress resides in shared or global memory. This leads to kernels that are bounded heavily by shared memory usage.
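This input-sized chunking is exactly what our Merge Sort: Multiple stage avoids by sizing partitions from the machine (s * l = p). A minimal host-side sketch of that rule (function and parameter names are ours):

    /* Split l sorted sequences into s partitions each, so that the total number
     * of cooperating CUDA blocks is just enough to fill the GPU (s * l >= p). */
    int partitionsPerSequence(int l /* sequences */, int p /* blocks to fill GPU */) {
        return (p + l - 1) / l;   /* s = ceil(p / l) */
    }

For example, with l = 2 sequences left in the final merge and p = 64 blocks needed to saturate the machine, each sequence is split into s = 32 partitions.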
Load Imbalance.
In a similar vein to our over-segmentation argument: if we limit the maximum size of a chunk to some standard size t, then when t is small there can be cases where the associated chunk it must merge with is either much larger than t (requiring a re-partitioning to meet the minimum chunk size criterion), or so small that the merge stage is trivial. This causes both load imbalance and extra repartitioning work that we want to avoid. Satish et al. attempt to mitigate this by using splitters to help normalize chunk sizes. However, since we are able to handle larger sequences, we suffer much less from general load-balancing problems.

4. VARIABLE LENGTH KEY SORTS
Now that we have an operational stable key-value merge sort, we can start building a string sort. Our strings in this case are of arbitrary size, with a null-terminating character. We build our sort off of our key-value pair sort. We begin by storing the start of each string (up to the first four characters) as the key in our key-value sort, and the address of each string's origin as the value. Now when performing our merge sort, if two of our shared memory keys are different, we can perform a compare-and-swap just as we would with our key-value sort. However, when a tie occurs in our keys, we must use the addresses (stored in the values) to go out into global memory and resolve the tie. This allows us to break ties on the fly, rather than performing a key-value sort and then searching for and processing consecutive sequences of ties as a post-processing step.

[Figure 6: Basic setup of our string sort representation. We can store the first four characters within an unsigned integer as the Key, and use the starting character address as the Value. Then we can perform our sort as described in the previous section by modifying the compare operators to handle ties in our Keys.]

We must also be able to process ties when deciding how to move our shared memory and register windows. Though this now allows us to sort variable-length strings, dealing with this type of data leads to three performance issues. First, if a thread runs across a tie that must be broken, all threads within its warp must wait for the tie to resolve (divergence issues). Second, as we merge larger partitions, the ratio of our memory windows to the size of a partition becomes quite small; the probability that an element will have to resolve a tie therefore becomes higher as we get into later merge steps. Finally, since the variance in our memory windows decreases as our partition sizes increase, long sets of equivalent keys (chains of ties) become more probable, making a worst-case linear search more likely.

We will discuss these drawbacks, and possible ways to mitigate some of them, in Section 6. First, though, we will present the results of our fixed-length sorts and variable-length sorts.

5. RESULTS
We will now demonstrate our experimental performance on three variations of our merge sort: 32-bit key-only sort, key-value sort, and variable-length string sort. In our test cases for fixed-length keys, our data consists of uniformly distributed integers. We test our string sort on two types of datasets. First we have a dataset gathered from Wikipedia including over a million words (12 MB) [25]. Our second dataset is a list of sentences gathered from around 20 books and novels from Project Gutenberg [10]. Sentences from novels have a number of attributes which affect performance. First, each sentence is much longer than a Wikipedia word-list entry. Second, authors often begin sentences in the same way (e.g., Then, And, etc.), which leads to a larger number of at-least-one tie-breaks. Finally, since authors often use repetition in their literature (e.g., poetry), there will also be cases where two strings have a long series of the same characters. Using both of these sets we can try to quantify how much of an effect these characteristics have on performance. We also chose both datasets due to their accessibility, range in data, and authenticity (not synthetic). A histogram showing the data distribution of both datasets can be seen in Figure 9a.

We report the sorting rates for our merge sorts and string sorts as a function of size. Since our strings are variable length, we also report the throughput in MB/s of sorted string data. We do not include transfer time, as sorting is often a single step in a much more involved algorithm; we therefore expect users to re-use the sorted data on the GPU. Since our string sort is a specialization of our key-value sort, we report the ratio in performance between the two for both datasets (lower is better). This gives us an idea of the extra cost of load imbalance and divergence caused by global memory tie-breaking for each dataset.

We compare our key-only and key-value sorts with the Thrust library comparison sort [4]. Thrust bases its comparison sort on a variant of Satish et al.'s merge sort. In order to make the sort stable, the library uses a different block-level sort, and is coded for generality. Though this reduces the performance of Thrust's merge sort, it is still widely used due to its accessibility and ease of use. It is therefore the de facto standard for comparing GPU-based sorts [5, 12, 16, 26].

We also compare our sort with a non-stable optimized key-value version of Satish et al.'s sort. Though this sort uses the same merge techniques, it utilizes a different (but faster) block sort that is not stable [2].

In comparison to the Thrust sort, we are about 3x faster than their key-only implementation and about 2x faster than their key-value implementation. Figure 7a shows that Thrust's merge sort performance for key-only and key-value are nearly identical. We believe this is due to their key-only sort implementation being a variant of key-value (with the values being a duplicate of the keys). We are also about 70% faster than the non-stable optimized key-value sort.

The performance of our string sort can be seen in Figure 7b. Since our string sort is a variation on our key-value sort, we compare the performance ratio between the two in terms of key-value pairs vs. strings sorted. This gives us an idea of the performance impact when dealing with global-memory ties (on average a 2.5x performance impact for words, and 14-15x for sentences). We also report the rate of our string sort performance in terms of MB/s for both datasets. In Section 6 we will analyze the causes of this performance degradation, and discuss future optimizations.
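As a concrete companion to the string representation of Section 4 and to the tie-break costs analyzed in Section 6, here is a hedged sketch of the key packing and the tie-breaking comparator (helper names are ours, not the paper's API):

    /* Pack a string's first four characters big-endian into a 32-bit key, so
     * unsigned integer comparison matches lexicographic order on ASCII; the
     * value stores the string's starting offset in the character pool. */
    __host__ __device__ unsigned int packStringKey(const char *s) {
        unsigned int key = 0;
        for (int i = 0; i < 4; ++i) {
            key <<= 8;
            if (*s) key |= (unsigned char)(*s++);  /* '\0' pad: short strings first */
        }
        return key;
    }

    /* Compare with on-the-fly tie-breaking: equal keys fall back to the full
     * strings in global memory (the long-latency reads measured in Section 6). */
    __device__ bool stringLess(unsigned int keyA, unsigned int offA,
                               unsigned int keyB, unsigned int offB,
                               const char *pool) {
        if (keyA != keyB) return keyA < keyB;
        const char *a = pool + offA;
        const char *b = pool + offB;
        while (*a && *a == *b) { ++a; ++b; }       /* walk the shared prefix */
        return (unsigned char)*a < (unsigned char)*b;
    }

Returning false for fully equal strings is what lets the surrounding merge remain stable.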
[Figure 7: Performance of our merge sort on a GTX 580. (a) Fixed-Key Sort Performance: MKeys/sec vs. number of elements (x1024) for our key-only sort, our key-value sort, the Satish key-value sort, and the Thrust key-only and key-value sorts. (b) String Sort Performance: MStrings/sec and the key-value/string performance ratio vs. number of elements (x1024), with the Sentences dataset reaching up to about 1.25 GB/s and the Words dataset up to about 569 MB/s. Though our string sort has a slower strings/sec sorting rate on our sentences dataset, since each string is much longer the overall MB/s sorting rate is higher.]
Though merge sort is an n log n method, we require a certain number of elements to reach peak performance. We expect our sorting rate to degrade due to the asymptotic O(1/log n) factor, which begins to take effect after about 4 million elements. The Thrust sort pays a much higher initial overhead, leading to a flatter degradation over time.

We test our implementation on a GTX 580, which has up to 1.58 TFlop/s and a theoretical bandwidth of 192.4 GB/s. Since sorts are traditionally not computationally intensive, the theoretical bandwidth bounds our performance. Our next section will analyze our performance, give a rough estimate and comparison of our performance against the theoretical bounds of a merge sort algorithm, and discuss factors limiting our performance. We will also discuss the performance and limiting factors of our string sort implementation, as well as future work that may improve performance.

6. PERFORMANCE ANALYSIS
In this section we will analyze our performance and determine (1) how well our implementation compares to a theoretical bound for our key and key-value sorts; (2) where our time is going; and (3) where efforts for future improvements lie.

Theoretical Upper Bound.
First we will derive a loose upper bound on the possible sorting rate for a key-only and key-value merge sort (blocks of size p) using our strategy. We do this to show that our method is reasonably efficient when compared to this bound, and to provide a comparison point for future merge sort implementations. We will use as a limiting factor our global memory bandwidth B. If we assume that for our blocksort stage and each merge stage we must read elements in at least twice (once for the register window, and once for the shared memory window) and write out elements at least once, we have 2(1 + log(n/p)) global memory accesses for a key-only sort.

As an example, if we assume our blocksize p is 1024 and we are sorting four million elements, we will have a blocksort stage and twelve merge stages, each merge stage requiring at minimum two global reads (each element is placed once in registers to search, and once in a shared memory search space) and one write, totaling 38n accesses. Under these conditions (four million elements) our theoretical sorting rate cap is 1.26 GKeys/s, which is about 5x faster than what we are able to achieve. Similarly, we can show that our cap for key-value pairs is 941 MPairs/s, which is also about 5x faster than our achieved rate.

We can attempt to show a tighter theoretical bound by including the minimum shared memory communication required at every stage. Under our conditions:

- Each thread is responsible for k elements;
- Each thread performs a binary search into a shared memory window of size p;
- For k - 1 elements we perform a linear search; and
- The sum of all search spaces loaded into shared memory is at least n.

Therefore we can get a lower bound on the minimum shared memory communication needed by calculating the lower bound per thread. Each thread requires log(p) + (k - 1) shared memory reads to search, and all threads combined will load the entire input set. Since there are again log(n/p) merge stages, the number of shared element loads necessary is at least n log(n/p)(1 + (log(p) + k - 1)/k). Since the theoretical maximum bandwidth of shared memory is about 1.2 TB/s, we can plug in the same p and n and, choosing k as four, we add an extra 0.885 ms to sort four million elements on an NVIDIA GTX 580. This reduces the theoretical sorting rate to 997.4 MKeys/s for key-only sort and 785.45 MPairs/s for key-value sort. Therefore our sort performance is about 4x and 4.2x away from the theoretical bound for key-only and key-value pairs respectively.

Though it is unlikely for an algorithm that requires both synchronization and moving memory windows to be close to the global memory theoretical cap, this bound gives us an idea how much time is being spent on the actual sorting itself (instead of on global memory data movement).
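As a sanity check on the numbers above (our arithmetic, using one read and one write for the block sort plus two reads and one write for each of the twelve merge stages):

    accesses  = 2n (block sort)  +  3n * log2(n/p) (merge stages)
              = 2n + 3n * 12  =  38n
    rate cap <= B / (38 * 4 bytes/key)
              = 192.4 GB/s / 152 bytes  ~  1.26 GKeys/s

which reproduces the quoted cap for n = 4M keys and p = 1024.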
[Figure 8: (a) Ties Per Merge Step: number of ties (x1000) and the partition-size/window-size ratio vs. merge step. (b) Post-Sort Window Ties: number of shared memory windows vs. length of concurrent ties (log scale), for the Sentences and Wiki Words datasets. Figure 8a shows the total number of global memory accesses needed to resolve ties per merge stage for a million strings from both of our datasets. Figure 8b measures the number of shared memory windows with duplicate keys, and the length of these runs after our sort is complete. As the size of our partition grows (while our window size remains fixed), the variance between the largest and smallest key in our window shrinks. This leads to ties becoming more probable, forcing global memory reads to resolve the ties and degrading performance. For our dataset involving book sentences, this variance is even smaller, leading to more global memory lookups and a lower Strings/sec sorting rate. We also test the number of key ties in a row once the sort is finished, and report the number of ties. Since our shared window size is relatively small (we select 1024 elements as a heuristic), performing a binary or linear search within long blocks with the same key will be relatively useless, and will require a large number of global memory tie breaks.]
[Figure 9: (a) String Length Distribution and (b) Sorted Ties Distribution: fraction of strings vs. string length and vs. number of character ties, for the Sentences and Wikipedia Words datasets. Statistics regarding our two string datasets. All strings are stored contiguously, with null termination signifying the end of one string and the beginning of the next. Our Words dataset has strings of 8 characters on average, while our Sentences dataset has strings of 98 characters on average. As strings from our sentences are much longer on average, they run into lengthier tie-break scenarios as we perform our merge sort. Our sentences dataset has many ties ranging from 10-20 characters long, and quite a number that are even greater than 100 (we clipped the range).]
We now must analyze what else may be causing this performance difference. From this analysis, we may be able to learn where we should focus our efforts to optimize merge sort in future work. Next we will discuss factors that affect our sorting performance.

6.1 Fixed-Length Keys Performance
The two factors that have the largest effect on our merge sort performance are divergence and shared memory bank conflicts. Though we can streamline our windowed loads to be free of divergence or bank conflicts, it is difficult to do so for both the binary search stage and the linear search stage.

Bank Conflicts.
Bank conflicts occur most frequently in our binary search stage. To illustrate, consider a SIMD warp with each thread performing a binary search in a shared memory window. Each thread will query the middle value; this results in an efficient broadcast read. However, given two evenly distributed sequences, in the next query half of the threads will search to the left and the other half will search to the right. This will create a number of 2-way bank conflicts. Similarly, subsequent searches will double the number of possible shared memory bank conflicts. We attempted to resolve this problem by implementing an address translator that would avoid bank conflicts. However, this modification did not improve performance.

Divergence.
Though our linear search stage does not suffer heavily from bank conflicts, it does suffer from divergence problems. When each thread performs its linear search, it begins independently at a shared memory location. It then incrementally checks the subsequent values until it finds the first value larger than its key. Since a SIMD warp must operate in lockstep, a thread cannot progress past this linear search stage until all threads are finished. Therefore a warp will have to wait for the thread which has the largest gap to search through.

6.2 Variable-Length Keys Performance
Since our string sort is built on top of our fixed-length key merge sort, it suffers from the same divergence and bank conflicts mentioned previously. However, it is also affected by other causes of divergence, as well as load imbalance.

Divergence.
Since warps operate in lockstep, any threads that break ties will stall all other threads within a warp. Since these divergent forks must be resolved in high-latency global memory, they are inherently expensive for SIMD machines. This isn't an issue that can easily be rectified, either. However, since we keep our original string locations static and in-order, if there are multiple ties when comparing two strings, the majority of consecutive tie-breaks should be cached reads. Although this doesn't directly address the divergence cost, it helps to mitigate the effect.

When comparing our two datasets, it becomes apparent that the average number of global memory ties that must be resolved begins to dominate performance. Not only are our sentences much longer on average than our words, they also require much more work to resolve ties. Figure 9b compares, for our two datasets, the number of shared leading characters between two consecutive strings after sorting. After every step of our merge sort, comparisons are between more similar strings (as illustrated in Figure 9b and Figure 8a); this gives us an idea of how many worst-case comparisons will be required.

For authors, it is very common to begin sentences in similar ways (e.g., Then, And, But, etc.), which results in many string ties of about 10-20 of the same characters in a row. In Figure 9b we even see a set of very similar strings greater than 100 characters long (we capped our histogram). Since all threads in a warp must wait for a tie to resolve before continuing, such occurrences are very costly.

We could expect a database of names and addresses to have somewhat similar behavior, where ties among sets of common names must be broken. On the other hand, our Wikipedia word list dataset has far fewer ties, and none that exceed 20 characters in a row. As we can see from Figure 7b, our sentences dataset has an over 5x slower (lower MStrings/sec) sorting rate than our words dataset. However, since each sentence is much longer (about 10x), we achieve a higher GB/s sorting rate with sentences.

Long Sets of Similar Strings.
As sequences become very large in comparison to the memory windows we can handle, the distribution of the values (variance) decreases. Since our shared memory and register sizes are limited, we cannot directly scale our windows to keep pace. Therefore, some threads in our linear search stage are more likely to run into long sets of ties before calculating their correct indexes, while others resolve their location immediately. Figure 8a illustrates this effect. As we begin to merge larger and larger blocks, the number of total ties within a merge step grows. Figure 8b shows the number of keys that share the same value after our string sort is complete. This effect is a data-dependent load imbalance. Though it was more efficient in our uniform-length key sort to perform linear searches in every merge stage (after an initial binary search), as described in Section 3, this change in distribution makes the worst case for linear searches more probable. Therefore, we limit the number of linear searches and have threads perform more binary searches (worse average case, but better worst case) when locating their insertion indexes. When comparing our two datasets, the effect is much more pronounced in our sentences dataset (again, since authors have common ways of beginning sentences).

We could also attempt to mitigate the amount of variance within a window using the following strategy: since each thread knows the largest and smallest value in a memory window, a simple AND operation can determine the set of most significant bits shared by all values within that window. A block can then decide whether it is worth shifting those shared bits out and loading new bits to increase the variance within that block. We think this can help reduce the number of ties, and we plan to implement it in future work.

7. CONCLUSION
We have presented an efficient hierarchical approach to merge sort. We harness more register communication, handle arbitrarily large partitions, create just enough work to fill the machine, and limit unnecessary binary searches.

Our merge sort attains best-of-class performance through four main techniques: (1) in our initial step, sorting 8 elements within each thread, which leverages register bandwidth; (2) a novel binary-then-linear searching approach in our merge within a thread block; (3) avoiding over-segmentation with moving shared memory and register windows; and (4) a modular, three-step formulation of merge sort that is well-suited to the GPU computational and memory hierarchy (and possibly suitable for tuning to other parallel architectures).
From this merge sort we are able to specialize a string sort, which we believe is the first general string sort on GPUs. The performance of the string sort highlights the large cost of thread divergence when comparisons between strings must break ties with non-coalesced, long-latency global memory reads. We see this as the most critical issue for optimizing string sort on future GPU-like architectures.

There are a number of possible directions we would like to take future work in both comparison sorts and string sorts in general. We have focused on implementing an efficient merge sort. However, we would like to explore comparison-based techniques for handling very large sequences across multiple GPUs. For example, hybrid techniques that combine merge sort with sample sort appear promising for handling hundreds of gigabytes worth of key-value pairs.

We would also like to develop methods for avoiding thread divergence and global memory tie breaks in our current string sort, and to explore hybrid string sorting techniques that might combine radix sorts with comparison sorts (such as our merge sort).

Acknowledgments
We appreciate the support of the National Science Foundation (grants OCI-1032859 and CCF-1017399). We would like to thank Anjul Patney for assistance in creating figures, and Ritesh Patel for testing our string sort and reporting bugs. We would also like to thank the reviewers for their valuable comments and feedback.

8. REFERENCES
[1] D. Adjeroh, T. Bell, and A. Mukherjee. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Publishing Company, Incorporated, 1st edition, 2008.
[2] K. E. Batcher. Sorting networks and their applications. In Proceedings of the AFIPS Spring Joint Computing Conference, volume 32, pages 307-314, Apr. 1968.
[3] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09: Proceedings of the 2009 ACM/IEEE Conference on Supercomputing, pages 18:1-18:11, Nov. 2009.
[4] N. Bell and J. Hoberock. Thrust: A productivity-oriented library for CUDA. In W. W. Hwu, editor, GPU Computing Gems, volume 2, chapter 4, pages 359-372. Morgan Kaufmann, Oct. 2011.
[5] D. Cederman and P. Tsigas. GPU-Quicksort: A practical quicksort algorithm for graphics processors. Journal of Experimental Algorithmics, 14:4:1.4-4:1.24, Jan. 2010.
[6] A. Davidson and J. D. Owens. Register packing for cyclic reduction: A case study. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 4:1-4:6, Mar. 2011.
[7] A. Davidson, Y. Zhang, and J. D. Owens. An auto-tuned method for solving large tridiagonal systems on the GPU. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium, pages 956-965, May 2011.
[8] F. Dehne and H. Zaboli. Deterministic sample sort for GPUs. CoRR, 2010. http://arxiv.org/abs/1002.4464.
[9] N. K. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High performance graphics coprocessor sorting for large database management. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 325-336, June 2006.
[10] Project Gutenberg. Free ebooks by Project Gutenberg, 2010. http://www.gutenberg.org/.
[11] C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha. Fast BVH construction on GPUs. Computer Graphics Forum, 28(2):375-384, 2009.
[12] N. Leischner, V. Osipov, and P. Sanders. GPU sample sort. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, Apr. 2010.
[13] D. Merrill and A. Grimshaw. Revisiting sorting for GPGPU stream architectures. Technical Report CS2010-03, Department of Computer Science, University of Virginia, Feb. 2010.
[14] G. Morton. A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing. International Business Machines Co., 1966.
[15] R. A. Patel, Y. Zhang, J. Mak, and J. D. Owens. Parallel lossless data compression on the GPU. In Proceedings of Innovative Parallel Computing (InPar '12), May 2012.
[16] H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger. Fast in-place, comparison-based sorting with CUDA: A study with bitonic sort. Concurrency and Computation: Practice and Experience, 23(7):681-693, 2011.
[17] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In Graphics Hardware 2003, pages 41-50, July 2003.
[18] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, May 2009.
[19] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Graphics Hardware 2007, pages 97-106, Aug. 2007.
[20] R. Sinha, J. Zobel, and D. Ring. Cache-efficient string sorting using copying. Journal of Experimental Algorithmics, 11, Feb. 2007.
[21] E. Sintorn and U. Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 68(10):1381-1388, 2008.
[22] V. Volkov and J. Demmel. LU, QR, and Cholesky factorizations using vector capabilities of GPUs. Technical Report UCB/EECS-2008-49, Electrical Engineering and Computer Sciences, University of California at Berkeley, May 2008.
[23] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 31:1-31:11, Nov. 2008.
[24] V. Volkov and J. W. Demmel. Using GPUs to accelerate the bisection algorithm for finding eigenvalues of symmetric tridiagonal matrices. LAPACK Working Note 197, Department of Computer Science, University of Tennessee, Knoxville, Jan. 2008.
[25] Wikipedia. Wikimedia downloads, 2010. http://dumps.wikimedia.org.
[26] X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne. High performance comparison-based sorting algorithm on many-core GPUs. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1-10, Apr. 2010.
