
3. PARTITIONING

3.1 In-Cache

We start by considering the versions that best operate when the table fits in the cache. The non-in-place version (Algorithm 1) uses a separate array from the input to store the output, while the in-place version (Algorithm 2) uses one array for both input and output. Each partition is generated in a single segment. In-cache partitioning can be run in parallel, if the threads operate in a shared-nothing fashion.
Algorithm 1 Non-in-place in-cache partitioning
i ← 0                                  // P: the number of partitions
for p ← 0 to P-1 do
    offset[p] ← i                      // point at the start of each partition
    i ← i + histogram[p]
end for
for i_in ← 0 to |T_in|-1 do
    t ← T_in[i_in]                     // T_in: the input table
    i_out ← offset[f(t.key)]++         // f: the partition function
    T_out[i_out] ← t                   // T_out: the output table
end for
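For concreteness, a minimal C rendering of Algorithm 1 for 32-bit keys follows. It is a sketch, not the paper's code: the fanout P, the radix partition function part_of, and all identifiers are our illustrative assumptions, and the histogram is taken as precomputed.

#include <stddef.h>
#include <stdint.h>

#define P 64                               /* partitioning fanout (assumed) */

static inline uint32_t part_of(uint32_t key)
{
    return key & (P - 1);                  /* radix function on the low bits */
}

/* Algorithm 1: histogram[p] holds the number of input keys in partition p. */
void partition(const uint32_t *in, uint32_t *out, size_t n,
               const size_t *histogram)
{
    size_t offset[P];
    size_t i = 0;
    for (int p = 0; p < P; p++) {          /* point at the start of each partition */
        offset[p] = i;
        i += histogram[p];
    }
    for (size_t j = 0; j < n; j++) {       /* two random accesses per key */
        uint32_t t = in[j];
        out[offset[part_of(t)]++] = t;
    }
}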
The simplest non-in-place version does only two random accesses per item. When operating in the cache, we need the output and the offset array to be cache-resident. A slightly more complicated version of the algorithm allows the partitioning to happen in-place, by swapping items across locations. In short, we start by reading an item, find the correct partition and the output destination through the offset array, swap it with the item stored there, and continue for the new item until the cycle is closed. Each item is moved exactly once and we stop when the whole array is covered. Item swaps are performed in cycles of transfers, defined as swap cycles. When the items are processed low-to-high [1], the cycle starts by doing a read and then swaps until it reaches the same location it initially read from, to write back. This case occurs 1/P of the time on average but requires branching. In Algorithm 2 below, the partitions are written high-to-low and swap cycles close when all items of a partition have been placed, avoiding branching for every tuple.
Algorithm 2 In-place in-cache partitioning
i ← 0                                  // P: the number of partitions
for p ← 0 to P-1 do
    i ← i + histogram[p]
    offset[p] ← i                      // point at the end of each partition
end for
p ← i_end ← 0
while histogram[p] = 0 do
    p++                                // skip initial empty partitions
end while
repeat
    t ← T[i_end]                       // T: the input & output table
    repeat
        p ← f(t.key)                   // f: the partition function
        i ← --offset[p]
        T[i] ↔ t                       // swap
    until i = i_end
    repeat
        i_end ← i_end + histogram[p++] // skip if empty
    until p = P or i_end ≠ offset[p]
until p = P
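For contrast with Algorithm 2, here is a C sketch of the simpler low-to-high variant [1] described in the text, which branches on every tuple because a cycle closes, with probability 1/P, by writing back to the location it started from. It reuses part_of and P from the previous sketch; all other identifiers are assumptions of ours.

/* In-place swap cycles, low-to-high (the [1] variant, not Algorithm 2).
 * offset[p] is the next unfilled slot of partition p; end[p] is its end. */
void partition_inplace(uint32_t *T, const size_t *histogram)
{
    size_t offset[P], end[P];
    size_t i = 0;
    for (int p = 0; p < P; p++) {
        offset[p] = i;                     /* start of partition p */
        i += histogram[p];
        end[p] = i;                        /* one past its last slot */
    }
    for (int p = 0; p < P; p++) {
        while (offset[p] < end[p]) {
            uint32_t t = T[offset[p]];     /* open a swap cycle */
            uint32_t d;
            while ((d = part_of(t)) != p) {/* branch taken (P-1)/P of the time */
                uint32_t u = T[offset[d]];
                T[offset[d]++] = t;        /* place t, pick up the displaced item */
                t = u;
            }
            T[offset[p]++] = t;            /* cycle closed: write back */
        }
    }
}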
3.2 Out-of-Cache

Out-of-cache performance is throttled by increased cache conflicts [14] and cache pollution with output tuples [15]. TLB thrashing occurs when the number of partitions exceeds the TLB capacity [11], unless the entire dataset can be placed in equally few large OS pages to be TLB resident.
3.2.1 Non-in-place

To mitigate these problems, prior work [14] proposed using the cache as an intermediate buffer before writing back to memory. Also, when write backs occur, they bypass the higher cache levels entirely and avoid polluting the cache [15]. Recent work [2] uses the same basic technique for out-of-cache radix partitioning during hash join execution.

Buffering data for each partition reduces the working set size and eliminates the TLB problem when operating in the buffer. TLB misses still occur, but 1/L of the time, if L is the number of tuples buffered for each partition before writing to output. If the buffer for each partition is exactly as big as a cache line, writing the full cache line to memory is accelerated by write-combining and avoids polluting the higher cache levels with output data. The partitioning fanout is now bounded by the number of cache lines in the fast core-private cache, rather than the TLB entries. Buffer flushing is optimally done using wider registers [15]. To maximize the cache use, we use the last buffer slot to save the output offset and access one cache line per iteration (Algorithm 3).
To extend the above method to multiple columns stored in separate arrays, the standard case in RAM-resident database data, we use one cache line per column in the buffer of each partition. A generic implementation can use one cache line per column and flush it separately depending on the column width. We can also interleave the columns in a single tuple and de-interleave the columns when the buffer is flushed. For example, when partitioning arrays of 32-bit keys and 32-bit payloads, we store 64-bit tuples in the cached buffer. Tuple (de-)interleaving can be accelerated using SIMD.
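To illustrate the last point, the following SSE2 sketch (ours, not necessarily the paper's kernel) interleaves four 32-bit keys with four 32-bit payloads into four 64-bit tuples, placing the key in the low half of each tuple on a little-endian machine; de-interleaving on a buffer flush uses the inverse shuffles.

#include <stdint.h>
#include <emmintrin.h>                     /* SSE2 intrinsics */

/* Pack keys[0..3] and pays[0..3] into tuples[i] = (pay_i << 32) | key_i. */
void interleave4(const uint32_t *keys, const uint32_t *pays, uint64_t *tuples)
{
    __m128i k = _mm_loadu_si128((const __m128i *)keys);
    __m128i r = _mm_loadu_si128((const __m128i *)pays);
    /* unpack 32-bit lanes pairwise: (k0,r0,k1,r1) and (k2,r2,k3,r3) */
    _mm_storeu_si128((__m128i *)&tuples[0], _mm_unpacklo_epi32(k, r));
    _mm_storeu_si128((__m128i *)&tuples[2], _mm_unpackhi_epi32(k, r));
}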
Parallel execution of the non-in-place out-of-cache partitioning is trivial. The input can be split into equal pieces, one for each thread. By executing a prefix sum of all individual histograms, one can ensure that each partition output is written in a distinct location. Threads are only synchronized after individual histograms are built. This is the only known technique for parallel partitioning on shared segments.
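One way to lay out that prefix sum, under assumed names: hist[t][p] holds thread t's count for partition p, and an exclusive prefix sum in partition-major order gives each thread a disjoint output range inside every partition, so partition p still comes out as one contiguous segment.

#include <stddef.h>

#define P 64                               /* fanout, as before (assumed) */

/* offs[t][p]: output position where thread t starts writing partition p.
 * Inside partition p, thread 0's tuples precede thread 1's, and so on. */
void thread_offsets(size_t nthreads, size_t hist[][P], size_t offs[][P])
{
    size_t base = 0;
    for (int p = 0; p < P; p++)
        for (size_t t = 0; t < nthreads; t++) {
            offs[t][p] = base;             /* exclusive prefix sum */
            base += hist[t][p];
        }
}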
Algorithm 3 Non-in-place out-of-cache partitioning
i_out ← 0                              // P: the number of partitions
for p ← 0 to P-1 do
    buffer[p][L-1] ← i_out             // L: # of tuples per cache line
    i_out ← i_out + histogram[p]
end for
for i_in ← 0 to |T_in|-1 do
    t ← T_in[i_in]                     // T_in/T_out: the input/output table
    p ← f(t.key)                       // f: the partition function
    i_out ← buffer[p][L-1]++
    buffer[p][i_out mod L] ← t
    if i_out mod L = L-1 then
        for i_buf ← 0 to L-1 do
            T_out[i_out + i_buf - L + 1] ← buffer[p][i_buf]   // no cache
        end for
        buffer[p][L-1] ← i_out + 1
    end if
end for
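A C rendering of Algorithm 3 for 64-bit tuples might look as follows, with SSE2 streaming stores standing in for the wide-register flush of [15]. It is a sketch under strong assumptions of ours: the fanout P, part_of, and all names are illustrative; out is 64-byte aligned; and every partition's starting offset is a multiple of L, so the general case must additionally handle each partition's partial first and last cache lines.

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>                     /* SSE2: _mm_stream_si128 */

#define P 64                               /* partitioning fanout (assumed) */
#define L 8                                /* 64-bit tuples per 64-byte line */

static inline uint32_t part_of(uint64_t t) { return (uint32_t)t & (P - 1); }

void partition_oc(const uint64_t *in, uint64_t *out, size_t n,
                  const size_t *histogram)
{
    /* one cache line per partition; the last slot doubles as its offset */
    uint64_t buffer[P][L] __attribute__((aligned(64)));
    size_t i_out = 0;
    for (int p = 0; p < P; p++) {
        buffer[p][L - 1] = i_out;
        i_out += histogram[p];             /* assumed to be a multiple of L */
    }
    for (size_t i = 0; i < n; i++) {
        uint64_t t = in[i];
        uint32_t p = part_of(t);
        size_t o = buffer[p][L - 1]++;
        buffer[p][o % L] = t;              /* the line fills in slot order */
        if (o % L == L - 1) {              /* line full: flush past the cache */
            const __m128i *src = (const __m128i *)buffer[p];
            __m128i *dst = (__m128i *)&out[o + 1 - L];
            for (int v = 0; v < L / 2; v++)
                _mm_stream_si128(dst + v, _mm_load_si128(src + v));
            buffer[p][L - 1] = o + 1;      /* restore the saved offset */
        }
    }
    /* leftover tuples in partially filled lines are flushed here (omitted) */
}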
