Abstract: A dedicated hardware architecture for the digital baseband processing of minimum mean square error interference alignment is presented. The computationally intensive task of calculating the precoding and decoding matrices has been implemented and the underlying algorithm has been optimized for real-time capability, efficiency and flexibility. The required number of iterations has been optimized and appropriate low-latency algorithms for the computation of basic operations have been identified to meet a real-time constraint of 1 ms processing latency. The architecture has been verified and synthesized for a Xilinx Virtex-6 LX550T FPGA. The maximum number of antennas, users and data streams is configurable at synthesis time. The actual parameters are configurable at runtime. Different degrees of parallelism allow a trade-off between resource requirements, latency and throughput. The target FPGA resources are sufficient for real-time system configurations up to 5 users with 3 antennas.

Index Terms: Interference Alignment; Hardware Accelerator; Testbed

I. INTRODUCTION

Channel capacity in multi-user wireless communication systems is limited by interference. A well-known approach towards exploiting the available channel capacity is Interference Alignment (IA) [1]. Depending on the context and application, different approaches to IA are feasible. A multitude of algorithms for IA under a variety of applications and constraints exist to leverage system channel capacity in multi-user (MU) multiple-input-multiple-output (MIMO) systems [2]. The focus of this paper is on minimum mean square error interference alignment as presented in [3]. Here, linear precoding and decoding matrices are applied at the transmitter and receiver, respectively. From the hardware point of view, current general-purpose and Software-Defined Radio (SDR) systems are unable to compute these precoding and decoding matrices in real time.

Due to its iterative nature and the required type of operations, the computational complexity of this algorithm is demanding, especially for mobile devices [4]. Thus, for high data rate systems, a high-throughput real-time Minimum Mean Square Error (MMSE) IA architecture is required. Furthermore, the acceptable latency depends on the varying channel. In this paper, the acceptable latency is assumed to be 1 ms for fast-changing channels. The precoding matrices have to be adapted to the time-varying channel sufficiently fast to achieve real-time operation. In particular, outdated precoding matrices result in insufficiently aligned interference at the receivers. Thus, the channel has to be tracked and the precoding matrices have to be updated at a sufficiently low latency. The energy budget for computing one set of precoding and decoding matrices is limited, especially in mobile OFDM systems with a high subcarrier count.

From the hardware perspective, previous testbeds focused on demonstrating the real-world feasibility of Interference Alignment, addressing questions of system-level integration and algorithmics [5] [6] [7] [8] [9]. These testbeds have provided the proof of concept for IA by leveraging rapid-prototyping approaches to hardware implementation. However, real-time capability for fast-changing channels and hardware efficiency are not within the scope of these papers. Processing latency and hardware resource requirements have usually not been published. For real-world usage, highly efficient hardware architectures with sufficiently low latency and power consumption and sufficient throughput are required. Very little has been published about resource-efficient digital hardware implementations feasible for low-power and mobile devices. In this contribution, we focus on efficient digital hardware architectures for the computation of the linear precoding and decoding matrices according to the Minimum Mean Square Error criterion. An estimate of the computational complexity of MMSE IA has been previously published in [4].

The rest of the paper is organized as follows. The system model and MMSE IA algorithm are recapitulated in Sections II and III. Algorithm optimizations aiming at hardware implementation are presented in Section IV. The hardware architecture and synthesis results are discussed in Sections V and VI, respectively.
[Fig. 2. MMSE IA algorithm flowchart. Recoverable labels: MSE computation, decision "MSE < [threshold]?" (yes/no), "finished".]
Now consider the inner loop. An explicit solution of $\|V_k(\lambda_k)\|_F^2 = 1$ cannot be given. Therefore, the equation has to be solved for $\lambda_k$ numerically. However, it can be shown that a unique solution exists [3]. Several standard root-finding approaches have been evaluated in this work, including Newton iterations, the secant method and Brent's method. It must be noted that the standard algorithms do not always find the desired root with $\lambda_k \geq 0$, but instead sometimes get stuck at negative $\lambda_k$.

Knowledge about the function can be used to guarantee convergence and to improve convergence speed, i.e. to reduce the number of iterations in the numerical root-finding loop. We propose a modified secant method. Based on the curvature of the function, simple heuristic rules have been established for a correction of the root estimated by the secant method. Depending on the system parameters, this reduces the number of required iterations by a factor between 3 and 10 compared to the unmodified secant method. The second modification is the choice of the initial value for $\lambda$ in step (3). If the last accepted $\lambda$ from the previous global iteration is reused as the initial value, the average number of root-finding function evaluations can be reduced to about 3.9. Fig. 3 shows the number of iterations for the standard and optimized root-finding algorithms. Representative system setups from the space of proper setups as discussed above have been chosen for simulation. The discrete setups are plotted on the x-axis.

[Fig. 3. Number of root-finding iterations (y-axis) per system setup (x-axis). Legend: modified secant method, secant method, Newton's iterations, binary search.]

[...] only the delay of one real MAC operation and thus have a small latency compared to matrix inversions.

Computing a decoding matrix $U_k$ in step (1) depends on all matrices $V_k$ from the previous iteration. As there is no data dependency between the computations of the individual $U_k$, they can be computed in parallel for all users. Steps (2) through (7) update the precoding matrices $V_k$, including a numerical root-finding in steps (4) through (7). These computations depend on all $U_k$ from step (1). Computing $V_k$ can likewise be parallelized across all users.

V. HARDWARE ARCHITECTURE

A dedicated fixed-point hardware accelerator has been implemented for use in an OCP- or AXI-based System-on-Chip. Its structure is shown in Fig. 4. All processing units and the data flow are controlled by a top-level controller. The outer loop from Fig. 2 and the communication with external memory are handled here. In the initialization phase, the channel matrices $H$ and initial values for $V$ are loaded from external memory into an on-chip BRAM cache. Then, the matrices $V$ and $U$ are updated iteratively according to Eqs. (2) and (3) by processing elements (PEs). Each PE handles the update of either $V$ or $U$ for one user at a time, and up to $K$ PEs work in parallel. The number of instantiated PEs is configurable at synthesis time. The MSE unit computes the total mean square error for the current set of matrices in parallel to the next matrix update. When a predefined threshold has been reached, the outer iteration loop is stopped.

Fig. 5 shows the block structure of a PE. It can be configured to compute an update of either the decoding matrix $U$ according to step (1) or the precoding matrix $V$ according to steps (2) to (7) for one user. Computing $U$ requires a subset of the hardware resources needed to compute $V$.
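The inner-loop root search discussed in Section IV can be illustrated with a small sketch. The curvature-based correction rules of the modified secant method are not spelled out in this section, so the code below does not reproduce them. It swaps in a different, standard safeguard (a secant step that falls back to bisection inside a bracket restricted to non-negative values) and only illustrates two ideas from the text: keeping the search away from negative roots and warm-starting from the previously accepted value. The constraint function `make_constraint`, with hypothetical coefficients `d` and `c`, is an assumed stand-in shaped like the squared Frobenius norm of a regularized inverse; it is not taken from the paper.

```python
from typing import Callable, Sequence, Tuple


def make_constraint(d: Sequence[float], c: Sequence[float]) -> Callable[[float], float]:
    """Assumed stand-in: g(lam) = sum_i c_i / (d_i + lam)^2 - 1, with d_i > 0, c_i >= 0.

    For lam >= 0 this function is strictly decreasing and convex.
    """
    def g(lam: float) -> float:
        return sum(ci / (di + lam) ** 2 for di, ci in zip(d, c)) - 1.0
    return g


def safeguarded_secant(g: Callable[[float], float], lam0: float = 0.0,
                       tol: float = 1e-10, max_iter: int = 100) -> Tuple[float, int]:
    """Return (root, number of evaluations of g), with the root restricted to lam >= 0.

    lam0 is an optional warm start, e.g. the value accepted in the previous
    outer iteration. This is NOT the paper's modified secant method.
    """
    lo, g_lo = 0.0, g(0.0)
    evals = 1
    if g_lo <= 0.0:
        # No positive root: the sketch simply returns 0 in this case.
        return 0.0, evals

    # Grow the upper end until the sign changes, starting at the warm start.
    hi = lam0 if lam0 > 0.0 else 1.0
    g_hi = g(hi)
    evals += 1
    while g_hi > 0.0:
        lo, g_lo = hi, g_hi
        hi *= 2.0
        g_hi = g(hi)
        evals += 1

    x_prev, g_prev, x_cur, g_cur = lo, g_lo, hi, g_hi
    for _ in range(max_iter):
        if abs(g_cur) < tol:
            break
        denom = g_cur - g_prev
        # Secant step through the two most recent iterates.
        x_new = x_cur - g_cur * (x_cur - x_prev) / denom if denom != 0.0 else lo
        # Safeguard: fall back to bisection if the step leaves the bracket.
        if not (lo < x_new < hi):
            x_new = 0.5 * (lo + hi)
        g_new = g(x_new)
        evals += 1
        # Shrink the bracket so that g(lo) > 0 >= g(hi) always holds.
        if g_new > 0.0:
            lo, g_lo = x_new, g_new
        else:
            hi, g_hi = x_new, g_new
        x_prev, g_prev, x_cur, g_cur = x_cur, g_cur, x_new, g_new
    return x_cur, evals


if __name__ == "__main__":
    g = make_constraint([0.5, 1.0, 2.0], [1.0, 2.0, 3.0])
    cold_root, cold_evals = safeguarded_secant(g)
    warm_root, warm_evals = safeguarded_secant(g, lam0=cold_root)
    print(cold_root, cold_evals, warm_root, warm_evals)
```

Because the bracket never extends below zero, the iteration cannot settle on a negative value, which is the failure mode reported for the standard algorithms. Passing the previous result as `lam0` mimics the warm start described in the text; how many evaluations this saves depends on how much the function changes between outer iterations.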
[...] elimination loop, the elimination is achieved with multiply and add operations. Only one final division is required per result variable, and all divisions can be computed in parallel. In this work, a two-step Bareiss algorithm systolic array processor has been implemented as a trade-off between resource requirements, stability and latency. Two variables are eliminated per step, requiring two clock cycles. To achieve sufficient numerical stability, a row-wise renormalization step using shifts has been inserted after each elimination, requiring one additional clock cycle. Table I summarizes the required number of clock cycles per operation. $M$ is the data word length in bits. The worst-case evaluated system setup ($N_t = N_r = 11$, $K = 19$) requires a total of 26022 clock cycles including overhead, leading to a minimum clock frequency of 26.02 MHz for 1 ms latency. The hardware implementation has been verified against the MATLAB floating-point reference model.

TABLE I
CLOCK CYCLES PER OPERATION (PE MODULE)
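The idea of eliminating with multiply and add operations only, followed by a single division per result variable, can be sketched in software. The code below is a minimal illustration under stated assumptions, not the two-step systolic array described above: it eliminates one variable per step by cross-multiplication, does no row swapping, and relies on Python's arbitrary-precision integers instead of the shift-based renormalization. It also omits the exact division by the previous pivot that the Bareiss scheme [10] uses to limit the growth of intermediate values. The function name `fraction_free_solve` is hypothetical.

```python
from fractions import Fraction
from typing import List


def fraction_free_solve(a: List[List[int]], b: List[int]) -> List[Fraction]:
    """Solve a x = b for a square integer matrix a without intermediate divisions."""
    n = len(a)
    # Augmented matrix [a | b]; copies keep the caller's data untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        pivot = m[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no row swapping")
        for i in range(n):
            if i == k:
                continue
            factor = m[i][k]
            # row_i <- pivot * row_i - factor * row_k (multiply and add only),
            # which zeroes column k in every row except the pivot row.
            m[i] = [pivot * x - factor * y for x, y in zip(m[i], m[k])]
    # The matrix is now diagonal: one division per unknown, all independent.
    return [Fraction(m[i][n], m[i][i]) for i in range(n)]


if __name__ == "__main__":
    print(fraction_free_solve([[2, 1], [1, 3]], [3, 5]))
```

The final divisions do not depend on one another, which matches the remark that they can all be computed in parallel. The intermediate integers grow with every elimination step; with a fixed word length this growth has to be contained, which the renormalization by shifts described in the text takes care of in the hardware.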
[Fig. 4. MMSE IA hardware accelerator top level block diagram]

VI. SYNTHESIS RESULTS

The system has been synthesized for a Xilinx XC6VLX550T-2 FPGA using Xilinx ISE 14.7. A 50 MHz clock constraint was met for all configurations. This clock frequency was chosen to enable the sequential computation of two sets of matrices within 1 ms. Table II summarizes the resource requirements for certain system configurations and degrees [...]

VII. CONCLUSION

The minimum mean square error interference alignment algorithm has been optimized for low-latency hardware implementation. The overall required operation count has been reduced and the algorithm has been implemented on an FPGA. It has been shown that the computation of the precoding and decoding matrices is possible in hardware under a real-time latency constraint of one millisecond. The synthesis results show high DSP resource requirements even for small system configurations.
REFERENCES
[1] V. Cadambe and S. Jafar, "Interference alignment and degrees of freedom of the K-user interference channel," Information Theory, IEEE Transactions on, vol. 54, no. 8, pp. 3425–3441, Aug. 2008.
[2] D. Schmidt, C. Shi, R. Berry, M. Honig, and W. Utschick, "Comparison of distributed beamforming algorithms for MIMO interference networks," Signal Processing, IEEE Transactions on, vol. 61, no. 13, pp. 3476–3489, July 2013.
[3] D. Schmidt, C. Shi, R. Berry, M. Honig, and W. Utschick, "Minimum mean squared error interference alignment," in Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on, Nov. 2009, pp. 1106–1110.
[4] M. Kock, S. Hesselbarth, M. Pfitzner, and H. Blume, "Hardware-accelerated design space exploration framework for communication systems," Analog Integrated Circuits and Signal Processing, vol. 78, no. 3, pp. 557–571, 2014. [Online]. Available: http://dx.doi.org/10.1007/s10470-013-0127-6
[5] J. A. García-Naya, L. Castedo, Ó. González, D. Ramírez, and I. Santamaría, "Experimental evaluation of interference alignment under imperfect channel state information," in 19th European Signal Processing Conference (EUSIPCO 2011), Barcelona, Spain, Aug. 2011.
[6] O. González, D. Ramírez, I. Santamaría, J. García-Naya, and L. Castedo, "Experimental validation of interference alignment techniques using a multiuser MIMO testbed," in Smart Antennas (WSA), 2011 International ITG Workshop on, Feb. 2011, pp. 1–8.
[7] P. Greisen, S. Haene, and A. Burg, "Simulation and emulation of MIMO wireless baseband transceivers," EURASIP Journal on Wireless Communications and Networking, vol. 2010, no. 1, 2010.
[8] J. Massey, J. Starr, S. Lee, D. Lee, A. Gerstlauer, and R. Heath, "Implementation of a real-time wireless interference alignment network," in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, 2012, pp. 104–108.
[9] P. Zetterberg and N. N. Moghadam, "An experimental investigation of SIMO, MIMO, interference-alignment (IA) and coordinated multi-point (CoMP)," CoRR, vol. abs/1111.3616, 2011.
[10] E. H. Bareiss, "Sylvester's identity and multistep integer-preserving Gaussian elimination," Math. Comp., vol. 22, pp. 565–578, 1968.