
What is the WHT anyway, and why are there so many ways to compute it?

Jeremy Johnson

1, 2, 6, 24, 112, 568, 3032, 16768, …


Walsh-Hadamard Transform
• y = WHT_N x, N = 2^n

  WHT_N = WHT_2 ⊗ … ⊗ WHT_2   (n factors),    WHT_2 = [ 1  1 ]
                                                      [ 1 -1 ]

  WHT_4 = WHT_2 ⊗ WHT_2

        = [ 1  1 ] ⊗ [ 1  1 ]  =  [ 1  1  1  1 ]
          [ 1 -1 ]   [ 1 -1 ]     [ 1 -1  1 -1 ]
                                  [ 1  1 -1 -1 ]
                                  [ 1 -1 -1  1 ]
WHT Algorithms
• Factor WHT_N into a product of sparse structured matrices
• Compute y = (M_1 M_2 … M_t) x in t stages, right to left:

  y_t = M_t x
  …
  y_2 = M_2 y_3
  y   = M_1 y_2
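The staged evaluation above can be sketched in code (the helper names here are illustrative, not part of any package): the factored form is applied right to left, so only matrix-vector products are ever formed.

```python
# Hypothetical sketch: evaluate y = (M1 M2 ... Mt) x right to left,
# one matrix-vector product per stage.
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def apply_factored(factors, x):
    y = x
    for M in reversed(factors):   # y_t = Mt x, ..., y = M1 y_2
        y = matvec(M, y)
    return y

# WHT4 factored as on the next slide: (WHT2 (x) I2)(I2 (x) WHT2)
M1 = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]]  # WHT2 (x) I2
M2 = [[1, 1, 0, 0], [1, -1, 0, 0], [0, 0, 1, 1], [0, 0, 1, -1]]  # I2 (x) WHT2
WHT4 = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]

x = [1, 2, 3, 4]
assert apply_factored([M1, M2], x) == matvec(WHT4, x)
```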
Factoring the WHT Matrix
• AC ⊗ BD = (A ⊗ B)(C ⊗ D)
• A ⊗ B = (A ⊗ I)(I ⊗ B)
• A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C
• I_m ⊗ I_n = I_mn

  WHT_4 = [ 1  1  1  1 ]   [ 1  0  1  0 ] [ 1  1  0  0 ]
          [ 1 -1  1 -1 ] = [ 0  1  0  1 ] [ 1 -1  0  0 ]
          [ 1  1 -1 -1 ]   [ 1  0 -1  0 ] [ 0  0  1  1 ]
          [ 1 -1 -1  1 ]   [ 0  1  0 -1 ] [ 0  0  1 -1 ]

  WHT_2 ⊗ WHT_2 = (WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2)
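The identities are easy to check numerically. The sketch below (helper names are illustrative) verifies the WHT_4 factorization with a plain-Python Kronecker product.

```python
# Hypothetical helpers for checking the Kronecker identities numerically.
def kron(A, B):
    """Kronecker product of matrices given as lists of rows."""
    return [[a * b for a in rowA for b in rowB] for rowA in A for rowB in B]

def matmul(A, B):
    """Ordinary matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

WHT2 = [[1, 1], [1, -1]]
I2 = [[1, 0], [0, 1]]

# A (x) B = (A (x) I)(I (x) B), instantiated for WHT4:
assert kron(WHT2, WHT2) == matmul(kron(WHT2, I2), kron(I2, WHT2))
```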


Recursive and Iterative Factorization

WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_4)
      = (WHT_2 ⊗ I_4)(I_2 ⊗ ((WHT_2 ⊗ I_2)(I_2 ⊗ WHT_2)))
      = (WHT_2 ⊗ I_4)(I_2 ⊗ (WHT_2 ⊗ I_2))(I_2 ⊗ (I_2 ⊗ WHT_2))
      = (WHT_2 ⊗ I_4)(I_2 ⊗ (WHT_2 ⊗ I_2))((I_2 ⊗ I_2) ⊗ WHT_2)
      = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)((I_2 ⊗ I_2) ⊗ WHT_2)
      = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_4 ⊗ WHT_2)


WHT_8 = (WHT_2 ⊗ I_4)(I_2 ⊗ WHT_2 ⊗ I_2)(I_4 ⊗ WHT_2)

  [ 1  1  1  1  1  1  1  1 ]
  [ 1 -1  1 -1  1 -1  1 -1 ]
  [ 1  1 -1 -1  1  1 -1 -1 ]
  [ 1 -1 -1  1  1 -1 -1  1 ]
  [ 1  1  1  1 -1 -1 -1 -1 ]
  [ 1 -1  1 -1 -1  1 -1  1 ]
  [ 1  1 -1 -1 -1 -1  1  1 ]
  [ 1 -1 -1  1 -1  1  1 -1 ]

where the three sparse factors are, in block form,

  WHT_2 ⊗ I_4 = [ I_4  I_4 ]      I_2 ⊗ WHT_2 ⊗ I_2 = [ I_2  I_2           ]
                [ I_4 -I_4 ]                          [ I_2 -I_2           ]
                                                      [           I_2  I_2 ]
                                                      [           I_2 -I_2 ]

  I_4 ⊗ WHT_2 = diag(WHT_2, WHT_2, WHT_2, WHT_2)
WHT Algorithms
• Recursive

    WHT_N = (WHT_2 ⊗ I_{N/2})(I_2 ⊗ WHT_{N/2})

• Iterative

    WHT_N = ∏_{i=1}^{n} (I_{2^{i-1}} ⊗ WHT_2 ⊗ I_{2^{n-i}})

• General

    WHT_{2^n} = ∏_{i=1}^{t} (I_{2^{n_1+…+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+…+n_t}}),

  where n = n_1 + … + n_t
WHT Implementation
• Definition/formula
  – N = N_1 N_2 … N_t, N_i = 2^{n_i}

      WHT_{2^n} = ∏_{i=1}^{t} (I_{2^{n_1+…+n_{i-1}}} ⊗ WHT_{2^{n_i}} ⊗ I_{2^{n_{i+1}+…+n_t}})

  – x = WHT_N · x, where x^M_{b,s} = (x(b), x(b+s), …, x(b+(M-1)s))
• Implementation (nested loop)

    R = N; S = 1;
    for i = t, …, 1
      R = R / N_i
      for j = 0, …, R-1
        for k = 0, …, S-1
          x^{N_i}_{jN_iS+k, S} = WHT_{N_i} · x^{N_i}_{jN_iS+k, S}
      S = S * N_i
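Specializing the nested loop above to radix-2 factors (N_i = 2 for every i) gives the following in-place sketch; the function name is illustrative and the code is not taken from the WHT package.

```python
# Sketch of the nested loop with N_i = 2 for all i: each pass applies
# I_{2^{i-1}} (x) WHT_2 (x) I_{2^{n-i}} as strided 2-point butterflies.
def wht(x):
    N = len(x)                        # N = 2^n
    S = 1                             # stride of the current factor
    while S < N:
        for j in range(N // (2 * S)):     # outer loop over blocks (R = N/(2S))
            for k in range(S):            # inner loop over strides
                b = 2 * j * S + k
                a0, a1 = x[b], x[b + S]
                x[b], x[b + S] = a0 + a1, a0 - a1   # WHT_2 butterfly
        S *= 2
    return x

assert wht([1, 2, 3, 4, 5, 6, 7, 8]) == [36, -4, -8, 0, -16, 0, 0, 0]
```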
Partition Trees
[Figure: a partition tree labels its root with n and each internal node's
children with an ordered partition of that node's label; the example shows
a tree for 9 with children 3, 4, 2. For n = 4:

  Left recursive:  4 → (3 → (2 → (1, 1), 1), 1)
  Right recursive: 4 → (1, 3 → (1, 2 → (1, 1)))
  Balanced:        4 → (2 → (1, 1), 2 → (1, 1))
  Iterative:       4 → (1, 1, 1, 1)]
1 1 1 1 1 1 1 1
Ordered Partitions
• There is a 1-1 mapping from ordered partitions of n onto (n-1)-bit
  binary numbers.
  ⇒ There are 2^{n-1} ordered partitions of n.

  162 = 1 0 1 0 0 0 1 0
  1 | 1 1 | 1 1 1 1 | 1 1  →  1 + 2 + 4 + 2 = 9
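The bijection can be sketched directly (the helper name is illustrative): each of the n-1 bits marks, in a row of n ones read left to right, whether a bar is placed in the corresponding gap.

```python
# Sketch of the bijection between (n-1)-bit numbers and ordered partitions
# of n: bit = 1 in a gap closes the current part and starts a new one.
def partition_from_bits(bits, n):
    parts, size = [], 1
    for i in range(n - 1):
        if (bits >> (n - 2 - i)) & 1:   # read bits left to right (MSB first)
            parts.append(size)
            size = 1
        else:
            size += 1
    parts.append(size)
    return parts

# 162 = 10100010 in binary gives the partition 1 + 2 + 4 + 2 of n = 9
assert partition_from_bits(162, 9) == [1, 2, 4, 2]
```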
Enumerating Partition Trees
[Figure: the six partition trees of n = 3, indexed by the 2-bit codes of
their root partitions — 00: the leaf 3; 01 (twice): the split 2+1 with the
2 either a leaf or split into 1+1; 10 (twice): the split 1+2, likewise;
11: the split 1+1+1.]
Counting Partition Trees

  T_n = 1 + Σ_{t>1, n_1+…+n_t=n} T_{n_1} … T_{n_t},  n > 1;   T_1 = 1

  T(z) = Σ_{n≥1} T_n z^n = z + 2z^2 + 6z^3 + 24z^4 + …

  T(z) = z/(1-z) + T(z)^2/(1-T(z))
       ⇒ T(z) = (1 - √(1 - 8z + 8z^2)) / (4(1-z))

  ⇒ T_n = Θ(α^n / n^{3/2}),  α ≈ 6.8
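The recurrence can be evaluated directly; the sketch below (helper names are illustrative) reproduces the sequence from the title slide.

```python
# Sketch: count partition trees T_n from the recurrence. f[m] is the sum,
# over all ordered partitions of m (t >= 1 parts), of the product of the
# T-values of the parts; a split with t >= 2 parts is "first part k,
# followed by any ordered partition of n - k".
def count_trees(nmax):
    T = [0] * (nmax + 1)
    f = [0] * (nmax + 1)
    f[0] = 1
    T[1] = f[1] = 1
    for n in range(2, nmax + 1):
        T[n] = 1 + sum(T[k] * f[n - k] for k in range(1, n))      # leaf + splits
        f[n] = sum(T[k] * f[n - k] for k in range(1, n + 1))
    return T[1:]

assert count_trees(8) == [1, 2, 6, 24, 112, 568, 3032, 16768]
```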
WHT Package
Püschel & Johnson (ICASSP ’00)
• Allows easy implementation of any of the possible WHT algorithms
• Partition tree representation
  W(n) = small[n] | split[W(n1), …, W(nt)]
• Tools
  – Measure runtime of any algorithm
  – Measure hardware events (coupled with PCL)
  – Search for good implementations
    • Dynamic programming
    • Evolutionary algorithm
Histogram (n = 16, 10,000 samples)
• Wide range in performance despite equal number of arithmetic
  operations (n·2^n flops)
• Pentium III consumes more run time (more pipeline stages)
• UltraSPARC II spans a larger range
Operation Count
Theorem. Let W_N be a WHT algorithm of size N. Then the number of
floating point operations (flops) used by W_N is N lg(N).

Proof. By induction on the partition tree:

  flops(W_N) = Σ_{i=1}^{t} 2^{n-n_i} flops(W_{N_i})
             = Σ_{i=1}^{t} 2^{n-n_i} n_i 2^{n_i}
             = 2^n Σ_{i=1}^{t} n_i = n 2^n
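The inductive step can be exercised mechanically: the sketch below (the tree encoding is illustrative) evaluates the flops recurrence on several partition trees of size 4 and checks that each yields n·2^n.

```python
# Sketch: a tree is either an int n (a leaf of size n, costing n * 2^n
# flops) or a list of subtrees whose sizes sum to n. The recurrence
# flops(W_N) = sum_i 2^(n - n_i) * flops(W_{N_i}) is evaluated directly.
def size(tree):
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def flops(tree):
    if isinstance(tree, int):
        return tree * 2 ** tree
    n = size(tree)
    return sum(2 ** (n - size(c)) * flops(c) for c in tree)

# iterative, balanced, and right-recursive trees of size 4 all give 4 * 2^4
for t in ([1, 1, 1, 1], [[1, 1], [1, 1]], [1, [1, [1, 1]]]):
    assert flops(t) == 4 * 2 ** 4
```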
Instruction Count Model

  IC(n) = α·A(n) + Σ_{i=1}^{3} β_i L_i(n) + Σ_{l} α_l A_l(n)

A(n)   = number of calls to WHT procedure
α      = number of instructions outside loops
A_l(n) = number of calls to base case of size l
α_l    = number of instructions in base case of size l
L_i(n) = number of iterations of outer (i=1), middle (i=2), and
         inner (i=3) loop
β_i    = number of instructions in outer (i=1), middle (i=2), and
         inner (i=3) loop body
Small[1]
.file "s_1.c"
.version "01.01"
gcc2_compiled.:
.text
.align 4
.globl apply_small1
.type apply_small1,@function
apply_small1:
movl 8(%esp),%edx    // load stride S into EDX
movl 12(%esp),%eax   // load x array's base address into EAX
fldl (%eax)          // st(0)=R7=x[0]
fldl (%eax,%edx,8)   // st(0)=R6=x[S]
fld %st(1)           // st(0)=R5=x[0]
fadd %st(1),%st      // R5=x[0]+x[S]
fxch %st(2)          // st(0)=R5=x[0], st(2)=R7=x[0]+x[S]
fsubp %st,%st(1)     // st(0)=R6=x[0]-x[S] (gas swaps the fsub/fsubr operand order)
fxch %st(1)          // st(0)=R7=x[0]+x[S], st(1)=R6=x[0]-x[S]
fstpl (%eax)         // store x[0]=x[0]+x[S]
fstpl (%eax,%edx,8)  // store x[S]=x[0]-x[S]
ret
Recurrences

  A(n) = 1 + Σ_{i=1}^{t} 2^{n-n_i} A(n_i),  n = n_1 + … + n_t

  A(n) = 0,  n a leaf

  A_l(n) = ν_l 2^{n-l},  where ν_l = number of leaves of size l
Recurrences

  L_1(n) = t + Σ_{i=1}^{t} 2^{n-n_i} L_1(n_i),  n = n_1 + … + n_t

  L_2(n) = Σ_{i=1}^{t} (2^{n-n_i} L_2(n_i) + 2^{n_1+…+n_{i-1}}),  n = n_1 + … + n_t

  L_3(n) = Σ_{i=1}^{t} (2^{n-n_i} L_3(n_i) + 2^{n-n_i}),  n = n_1 + … + n_t

  L_i(n) = 0,  n a leaf
Histogram using Instruction Model (P3)

  α = 27
  α_1 = 12, α_2 = 34, and α_3 = 106
  β_1 = 18, β_2 = 18, and β_3 = 20
Algorithm Comparison
[Plots of cost ratios vs. WHT size 2^n, n = 1…22:
 Recursive/Iterative runtime and Recursive & Balanced/Iterative
 instruction count (series r1/i1, rr1/i1, lr1/i1, bal1/i1);
 Recursive & Iterative/Best runtime (series r1/b, r3/b, i1/b, i3/b, b/b);
 Small/Iterative runtime (series I_1/rt, r_1/rt).]
Dynamic Programming

  T_n = min_{n_1+…+n_t=n} Cost(T_{n_1}, …, T_{n_t}),

where T_n is the optimal tree of size n.

This depends on the assumption that Cost only depends on the size of a
tree and not on where it is located (true for IC, but false for runtime).

For IC, the optimal tree is iterative with appropriate leaves.

For runtime, DP is a good heuristic (used with binary trees).
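Under that size-only assumption, the DP search can be sketched as follows; the cost parameters and function names are hypothetical stand-ins for a measured cost (IC model or runtime), not the WHT package's actual interface.

```python
# Hypothetical sketch of dynamic programming over partition trees.
def compositions(n):
    """Ordered partitions of n into t >= 2 parts, via (n-1)-bit numbers."""
    for bits in range(1, 2 ** (n - 1)):   # bits = 0 would be the single part n
        parts, size = [], 1
        for i in range(n - 1):
            if (bits >> i) & 1:
                parts.append(size)
                size = 1
            else:
                size += 1
        parts.append(size)
        yield parts

def search(nmax, leaf_cost, split_overhead):
    """best[n] = (cost, tree); assumes cost depends only on subtree size."""
    best = {}
    for n in range(1, nmax + 1):
        cands = []
        if n in leaf_cost:                    # small[n] candidate
            cands.append((leaf_cost[n], n))
        for parts in compositions(n):         # split[...] candidates
            c = split_overhead + sum(2 ** (n - p) * best[p][0] for p in parts)
            cands.append((c, [best[p][1] for p in parts]))
        best[n] = min(cands, key=lambda t: t[0])
    return best

# Sanity check: with leaf cost 2 flops for small[1] and no split overhead,
# every tree costs exactly n * 2^n (the operation count theorem), so the
# minimum must equal n * 2^n as well.
best = search(6, {1: 2}, 0)
assert all(best[n][0] == n * 2 ** n for n in range(1, 7))
```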
