Escolar Documentos
Profissional Documentos
Cultura Documentos
T e l e c o m m u n i c a t i o n
ITU-T
U n i o n
G.723.1
TELECOMMUNICATION
STANDARDIZATION SECTOR
OF ITU
(05/2006)
G.100G.199
G.200G.299
G.300G.399
G.400G.449
G.450G.499
G.600G.699
G.700G.799
G.700G.709
G.710G.719
G.720G.729
G.730G.739
G.740G.749
G.750G.759
G.760G.769
G.770G.779
G.780G.789
G.790G.799
G.800G.899
G.900G.999
G.1000G.1999
G.6000G.6999
G.7000G.7999
G.8000G.8999
G.9000G.9999
Summary
This Recommendation specifies a coded representation that can be used for compressing the speech
or other audio signal component of multimedia services at a very low bit rate as part of the overall
H.324 family of standards. This coder has two bit rates associated with it: 5.3 and 6.3 kbit/s. The
higher bit rate has greater quality. The lower bit rate gives good quality and provides system
designers with additional flexibility. Both rates are a mandatory part of the encoder and decoder. It is
possible to switch between the two rates at any frame boundary. An option for variable rate
operation using discontinuous transmission and noise fill during non-speech intervals is also
possible.
This coder was optimized to represent speech with a high quality at the above rates using a limited
amount of complexity. It encodes speech or other audio signals in frames using linear predictive
analysis-by-synthesis coding. The excitation signal for the high rate coder is Multipulse Maximum
Likelihood Quantization (MP-MLQ) and for the low rate coder is Algebraic-Code-Excited
Linear-Prediction (ACELP). The frame size is 30 ms and there is an additional look ahead of 7.5 ms,
resulting in a total algorithmic delay of 37.5 ms. All additional delays in this coder are due to
processing delays of the implementation, transmission delays in the communication link and
buffering delays of the multiplexing protocol.
The description of this Recommendation is made in terms of bit-exact, fixed-point mathematical
operations. The ANSI C code indicated in clause 5 constitutes an integral part of this
Recommendation and shall take precedence over the mathematical descriptions in this text if
discrepancies are found. A non-exhaustive set of test sequences which can be used in conjunction
with the C code are available in the electronic attachment.
This revision integrates the correction of defects identified either in Implementors' Guides or in
SG 16 meeting reports.
Source
ITU-T Recommendation G.723.1 was approved on 29 May 2006 by ITU-T Study Group 16
(2005-2008) under the ITU-T Recommendation A.8 procedure.
FOREWORD
The International Telecommunication Union (ITU) is the United Nations specialized agency in the field of
telecommunications. The ITU Telecommunication Standardization Sector (ITU-T) is a permanent organ of
ITU. ITU-T is responsible for studying technical, operating and tariff questions and issuing
Recommendations on them with a view to standardizing telecommunications on a worldwide basis.
The World Telecommunication Standardization Assembly (WTSA), which meets every four years,
establishes the topics for study by the ITU-T study groups which, in turn, produce Recommendations on
these topics.
The approval of ITU-T Recommendations is covered by the procedure laid down in WTSA Resolution 1.
In some areas of information technology which fall within ITU-T's purview, the necessary standards are
prepared on a collaborative basis with ISO and IEC.
NOTE
In this Recommendation, the expression "Administration" is used for conciseness to indicate both a
telecommunication administration and a recognized operating agency.
Compliance with this Recommendation is voluntary. However, the Recommendation may contain certain
mandatory provisions (to ensure e.g. interoperability or applicability) and compliance with the
Recommendation is achieved when all of these mandatory provisions are met. The words "shall" or some
other obligatory language such as "must" and the negative equivalents are used to express requirements. The
use of such words does not suggest that compliance with the Recommendation is required of any party.
ITU 2007
All rights reserved. No part of this publication may be reproduced, by any means whatsoever, without the
prior written permission of ITU.
ii
CONTENTS
Page
1
Introduction ..................................................................................................................
1.1
Scope ..............................................................................................................
1.2
Bit rates...........................................................................................................
1.3
Possible input signals .....................................................................................
1.4
Delay...............................................................................................................
1.5
Speech coder description................................................................................
1
1
1
1
1
1
2
2
3
4
4
4
6
6
7
7
7
7
8
8
9
9
10
13
13
14
14
15
16
16
16
16
16
17
17
18
19
19
iii
Page
19
20
21
ANSI C code.................................................................................................................
22
Glossary ........................................................................................................................
23
26
26
27
29
31
36
39
39
40
41
41
41
41
42
42
43
52
56
3.10
3.11
Electronic attachment.
iv
Introduction
1.1
Scope
This Recommendation specifies a coded representation that can be used for compressing the speech
or other audio signal component of multimedia services at a very low bit rate. In the design of this
coder, the principal application considered was very low bit rate visual telephony as part of the
overall H.324 family of standards.
1.2
Bit rates
This coder has two bit rates associated with it. These are 5.3 and 6.3 kbit/s. The higher bit rate has
greater quality. The lower bit rate gives good quality and provides system designers with additional
flexibility. Both rates are a mandatory part of the encoder and decoder. It is possible to switch
between the two rates at any 30-ms frame boundary. An option for variable rate operation using
discontinuous transmission and noise fill during non-speech intervals is also possible.
1.3
This coder was optimized to represent speech with a high quality at the above rates using a limited
amount of complexity. Music and other audio signals are not represented as faithfully as speech, but
can be compressed and decompressed using this coder.
1.4
Delay
This coder encodes speech or other audio signals in 30-ms frames. In addition, there is a look ahead
of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. All additional delays in the
implementation and operation of this coder are due to:
i)
actual time spent processing the data in the encoder and decoder;
ii)
transmission time on the communication link;
iii)
additional buffering delay for the multiplexing protocol.
1.5
The description of the speech coding algorithm of this Recommendation is made in terms of
bit-exact, fixed-point mathematical operations. The ANSI C code indicated in clause 5, which
constitutes an integral part of this Recommendation, reflects this bit-exact, fixed-point description
approach. The mathematical descriptions of the encoder and decoder, given respectively in clauses
2 and 3, can be implemented in several other fashions, possibly leading to a codec implementation
not complying with this Recommendation. Therefore, the algorithm description of the C code of
clause 5 shall take precedence over the mathematical descriptions of clauses 2 and 3 whenever
discrepancies are found. A non-exhaustive set of test sequences which can be used in conjunction
with the C code are available from the ITU.
Encoder principles
2.1
General description
This coder is designed to operate with a digital signal obtained by first performing telephone
bandwidth filtering (ITU-T Rec. G.712) of the analogue input, then sampling at 8000 Hz and then
converting to 16-bit linear PCM for the input to the encoder. The output of the decoder should be
converted back to analogue by similar means. Other input/output characteristics, such as those
specified by ITU-T Rec. G.711 for 64 kbit/s PCM data, should be converted to 16-bit linear PCM
before encoding or from 16-bit linear PCM to the appropriate format after decoding. The bitstream
from the encoder to the decoder is defined within this Recommendation.
The coder is based on the principles of linear prediction analysis-by-synthesis coding and attempts
to minimize a perceptually weighted error signal. The encoder operates on blocks (frames) of
240 samples each. That is equal to 30 ms at an 8-kHz sampling rate. Each block is first high pass
filtered to remove the DC component and then divided into four subframes of 60 samples each. For
every subframe, a 10th order Linear Prediction Coder (LPC) filter is computed using the
unprocessed input signal. The LPC filter for the last subframe is quantized using a Predictive Split
Vector Quantizer (PSVQ). The unquantized LPC coefficients are used to construct the short-term
perceptual weighting filter, which is used to filter the entire frame and to obtain the perceptually
weighted speech signal.
For every two subframes (120 samples), the open-loop pitch period, LOL, is computed using the
weighted speech signal. This pitch estimation is performed on blocks of 120 samples. The pitch
period is searched in the range from 18 to 142 samples.
From this point the speech is processed on a 60 samples per subframe basis.
Using the estimated pitch period computed previously, a harmonic noise shaping filter is
constructed. The combination of the LPC synthesis filter, the formant perceptual weighting filter,
and the harmonic noise shaping filter is used to create an impulse response. The impulse response is
then used for further computations.
Using the pitch period estimation, LOL, and the impulse response, a closed-loop pitch predictor is
computed. A fifth order pitch predictor is used. The pitch period is computed as a small differential
value around the open-loop pitch estimate. The contribution of the pitch predictor is then subtracted
from the initial target vector. Both the pitch period and the differential value are transmitted to the
decoder.
Finally the non-periodic component of the excitation is approximated. For the high bit rate,
Multipulse Maximum Likelihood Quantization (MP-MLQ) excitation is used, and for the low bit
rate, an algebraic-code-excitation (ACELP) is used.
The block diagram of the encoder is shown in Figure 1.
Framer
File : LBCCODEC.C
Procedure : main()
File : CODER.C
Procedure : Coder()
The coder processes the speech by buffering consecutive speech samples, y[n], into frames of
240 samples, s[n]. Each frame is divided into two parts of 120 samples for pitch estimation
computation. Each part is divided by two again, so that each frame is finally divided into four
subframes of 60 samples each.
2.3
File : UTIL_LBC.C
Procedure : Rem_Dc()
This block removes the DC element from the input speech, s[n]. The filter transfer function is:
H (z ) =
1 z 1
127
1 128 z 1
(1)
LPC analysis
File : LPC.C
Procedure : Comp_Lpc()
File : LPC.C
Procedure : Durbin()
Levinson-Durbin recursion
The LPC analysis is performed on signal x[n] in the following way. Tenth order Linear Predictive
(LP) analysis is used. For each subframe, a window of 180 samples is centred on the subframe.
A Hamming window is applied to these samples. Eleven autocorrelation coefficients are computed
from the windowed signal. A white noise correction factor of (1025/1024) is applied by using the
formula R[0] = R[0](1 + 1/1024). The other 10 autocorrelation coefficients are multiplied by the
binomial window coefficients table. (The values for this table and all others are given in the
C code.) The Linear Predictive Coefficients (LPC) are computed using the conventional
Levinson-Durbin recursion. For every input frame, four LPC sets are computed, one for every
subframe. These LPC sets are used to construct the short-term perceptual weighting filter. The
LPC synthesis filter is defined as:
Ai ( z ) =
1
10
1 aij z
0 i 3
(2)
j =1
LSP quantizer
File : LSP.C
Procedure : AtoLsp()
File : LSP.C
Procedure : LspQnt()
File : LSP.C
Procedure : Lsp_Svq()
First, a small additional bandwidth expansion (7.5 Hz) is performed. Then, the resulting A3(z)
LP filter is quantized using a predictive split vector quantizer. The quantization is performed in the
following way:
1)
The LP coefficients, {aj}j = 1..10, are converted to LSP coefficients, { P'j }j = 1..10, by searching
along the unit circle and interpolating for zero crossings.
2)
The long-term DC component, pDC, is removed from the LSP coefficients, p, and a new
DC removed LSP vector, p, is obtained.
3)
4)
A first-order fixed predictor, b = (12/32), is applied to the previously decoded LSP vector
~
pn 1 , to obtain the DC removed predicted LSP vector, pn , and the residual LSP error
vector, en, at time (frame) n.
p nT = [ p1,n p 2 ,n L p10 ,n ]
(3.1)
p nT = [ p1,n p 2 ,n L p10 ,n ]
(3.2)
pn = b[ ~
pn 1 pDC ]
(3.3)
en = pn pn
(3.4)
3, m = 0
K m = 3, m = 1
4, m = 2
~
pl ,Tm = ~
p1, l , m ~
p2, l , m L ~
pK m ,l ,m
0m2
1 l 256
p = p + p DC
(4.2)
(4.3)
0m2
~
pl ,m = pm + pDCm + el ,m
T
pl,m ) W m ( pm ~
pl , m
El ,m = ( p m ~
(4.1)
1 l 256
0m2
1 l 256
(4.4)
(4.5)
where:
el, m is the lth entry of the mth split residual LSP codebook, and
Wn is a diagonal weighting matrix, determined from the unquantized LSP
coefficients vector p, with weights defined by:
wj , j =
w1,1
min {p j p j 1 , p j + 1 p j }
1
=
p 2 p1
w10 ,10 =
5)
2 j9
(5)
1
p9
p10
2.6
LSP decoder
File : LSP.C
Procedure: Lsp_Inq()
First, the three sub-vectors, {em,n}m = 0..2, are decoded to form a tenth order vector, e~n .
The predicted vector, pn , is added to the decoded vector, e~n , and DC vector, pDC, to form
the decoded LSP vector, ~p .
n
3)
A stability check is performed on the decoded LSP vector, ~pn , to ensure that the decoded
LSP vector is ordered according to the following condition:
~
p j + 1,n ~
p j,n min
1 j 9
(6)
min is equal to 31.25 Hz. If this stability check (6) fails for ~
pi and ~
pi + 1 , then ~
p j and
~
p
are modified in the following way:
j +1
~p = ( ~p + ~p ) / 2
avg
j
j +1
(7.1)
~
pj = ~
p avg min / 2
(7.2)
~
p j +1 = ~
p avg + min / 2
(7.3)
The modification is performed until condition (6) is met. If after 10 iterations the
condition of stability is not met, the previous LSP vector is used.
2.7
LSP interpolation
File : LSP.C
Procedure : Lsp_Int()
LSP interpolator
File : LSP.C
Procedure : LsptoA()
Linear interpolation is performed between the decoded LSP vector, ~pn , and the previous
LSP vector, ~
pn 1 , for each subframe. Four interpolated LSP vectors, {~
pi }i = 0..3 , are converted to
.
LPC vectors, {a~ }
i i = 0..3
p n 1 + 0.25 ~
pn
0.75 ~
0.5 ~
~
p n 1 + 0. 5 p n
~
p ni =
~
~
0.25 p n 1 + 0.75 p n
~
pn
i=0
T
a~i = [a~i1a~i 2 L a~i10 ] T
0i3
i =1
i=2
(8)
i= 3
(9)
~
The quantized LPC synthesis filter, Ai ( z ) , is used for generating the decoded speech signal and is
defined as:
~
Ai ( z ) =
1
10
1 a~ij z j
j =1
0 i 3
(10)
2.8
File : LPC.C
Procedure : Wght_Lpc()
File : LPC.C
Procedure : Error_Wght()
For each subframe a formant perceptual weighting filter is constructed, using the unquantized LPC
coefficients{aij}j = 1,..,10. The filter has a transfer function:
10
1
Wi ( z ) =
aij z
j =1
10
1 a ij z
j 1j
0i3
j 2j
(11)
j =1
where 1 = 0.9 and 2 = 0.5. The input speech frame, {x[n]}n = 0..239, is then divided to four
subframes and each subframe is filtered using the Wi(z) filter, and the weighted output speech
signal, {f [n]}n = 0..239 is obtained.
2.9
Pitch estimation
File : EXC_LBC.C
Procedure : Estim_Pitch()
Two pitch estimates are computed for every frame, one for the first two subframes and one for the
last two. The open-loop pitch period estimate, LOL, is computed using the perceptually weighted
speech f [n]. A cross-correlation criterion, COL( j), maximization method is used to determine the
pitch period, using the following expression:
2
119
f [n] f [n j ]
n=0
C OL ( j ) = 119
f [n j ] f [n j ]
18 j 142
(12)
n=0
The index j which maximizes the cross-correlation, COL ( j), is selected as the open-loop pitch
estimation for the appropriate two subframes. While searching for the best index, some preference
is given to smaller pitch periods to avoid choosing pitch multiples. Maximums of COL ( j) are
searched for beginning with j = 18. For every maximum COL ( j) found, its value is compared to the
best previous maximum found, COL ( j). If the difference between indices j and j is less than 18 and
COL ( j) > COL ( j), the new maximum is selected. If the difference between the indices is greater
than or equal to 18, the new maximum is selected only if COL (j ) is greater than COL ( j) by 1.25 dB.
2.10
Subframe processing
From this point on, all the computational blocks are performed on a once per subframe basis.
2.11
File : EXC_LBC.C
Procedure : Comp_Pw()
File : EXC_LBC.C
Procedure : Filt_Pw()
In order to improve the quality of the encoded speech, a harmonic noise shaping filter is
constructed. The filter is:
Pi (z ) = 1 z L
(13)
The optimal lag, L, for this filter is the lag which maximizes the criterion, CPW ( j), while considering
only positive correlation values for the numerator, N( j), before squaring:
N ( j) =
C PW ( j ) =
59
f [n] f [n j ]
(14.1)
n=0
(N ( j ))2
L1 j L2
59
f [n j ] f [n j ]
(14.2)
n=0
where L1 = LOL 3 and L2 = LOL + 3. The maximum value will be defined as CL. The optimal filter
gain, Gopt, is:
59
f [n] f [n L]
Gopt =
n=0
59
f [n L] f [n L]
(15)
n=0
Gopt is limited to the range [0,1]. The energy, E, of the weighted speech signal, {f [n]}n = 0..59 is given
by:
E=
59
f [n]
2
(16)
n=0
Then the coefficient of the harmonic noise shaping filter, P(z), is given by:
0.3125G
opt
=
0.0
C
if 10 log10 1 EL 2.0
otherwise
(17)
After computing the harmonic noise filter coefficients, the formant perceptually weighted speech,
f [n], is filtered using P(z) to obtain the target vector, w[n].
w[n] = f [n] f [n L]
2.12
0 n 59
(18)
File : LPC.C
Procedure : Comp_Ir()
(19)
where the components of Si(z) are defined in Formulas 10, 11 and 13. The impulse response of this
filter is computed and will be referred as {hi[n]}n = 0..59, i = 0..3.
2.13
File : LPC.C
Procedure : Sub_Ring ()
The zero input response of the combined filter, Si(z), is obtained by computing the output of that
filter when the input signal is all zero-valued samples. The zero input response is denoted
{z[n]}n = 0..59. The ringing subtraction is performed by subtracting the zero input response from the
harmonic weighted speech vector, {w[n]}n = 0..59. The resulting vector is defined as
t[n] = w[n] z[n].
8
2.14
Pitch predictor
File : EXC_LBC.C
Procedure : Find_Acbk()
File : EXC_LBC.C
Procedure : Get_Rez()
File : EXC_LBC.C
Procedure : Decod_Acbk()
The pitch prediction contribution is treated as a conventional adaptive codebook contribution. The
pitch predictor is a fifth order pitch predictor (see Equation 41.2). For subframes 0 and 2, the
closed-loop pitch lag is selected from around the appropriate open-loop pitch lag in the range 1
and coded using 7 bits. (Note that the open-loop pitch lag is never transmitted.) For subframes 1 and
3 the closed-loop pitch lag is coded differentially using 2 bits and may differ from the previous
subframe lag only by 1, 0, +1 or +2. The quantized and decoded pitch lag values will be referred to
as Li from this point on. The pitch predictor gains are vector quantized using two codebooks with 85
or 170 entries for the high bit rate and 170 entries for the low bit rate. The 170 entry codebook is the
same for both rates. For the high rate if L0 is less than 58 for subframes 0 and 1 or if L2 is less than
58 for subframes 2 and 3, then the 85 entry codebook is used for the pitch gain quantization.
Otherwise, the pitch gain is quantized using the 170 entry codebook. The contribution of the pitch
predictor, {p[n]}n = 0..59, is subtracted from the target vector {t[n]}n = 0..59, to obtain the residual
signal {r[n]}n = 0..59.
r [n] = t [n] p[n]
2.15
(20)
File : EXC_LBC.C
Procedure : Find_Fcbk()
File : EXC_LBC.C
Procedure : Find_Best()
File : EXC_LBC.C
Procedure : Gen_Trn()
File : EXC_LBC.C
Procedure : Fcbk_Pack()
The residual signal {r[n]}n = 0..59, is transferred as a new target vector to the MP-MLQ block. This
block performs the quantization of this vector. The quantization process is approximating the target
vector r[n] by r[n]:
r [n ] =
h[ j ] v[n j ]
0 n 59
(21)
j=0
where v[n] is the excitation to the combined filter S(z) with impulse response h[n] and defined as:
M 1
v[n ] = G k d [n mk ]
0 n 59
(22)
k =0
where G is the gain factor, [n] is a Dirac function, {k} k = 0..M 1 and {mk}k = 0..M 1 are the
signs (1) and the positions of Dirac functions respectively and M is the number of pulses, which is
6 for even subframes and is 5 for odd subframes. There is a restriction on pulses positions. The
positions can be either all odd or all even. This will be indicated by a grid bit. So, the problem is to
estimate the unknown parameters, G, {k}k = 0..M 1, and {mk}k = 0..M 1, that minimize the mean
square of the error signal err[n]:
M 1
(23)
k =0
d [ j ] = r [n] h[n j ]
0 j 59
(24)
n= j
max { d [ j ] }j = 0..59
(25)
59
h[n] h[n]
n=0
Then the estimated gain Gmax is quantized by a logarithmic quantizer. This scalar gain quantizer is
~
common to both rates and consists of 24 steps, of 3.2 dB each. Around this quantized value, G max ,
~
~
additional gain values are selected within the range Gmax 3.2, Gmax + 6.4 . For each of these
gain values, the signs and locations of the pulses are sequentially optimized. This procedure is
repeated for both the odd and even grids. Finally, the combination of the quantized parameters that
yields the minimum mean square of err[n] is selected. The optimal combination of pulse locations
and gain is transmitted. 30
combinatorial coding is used to transmit the pulse locations.
M
Furthermore, using the fact that the number of codewords in the fixed codebooks is not a power of
2, three additional bits are saved by combining the four most significant bits of the combinatorial
codes for the four subframes to form a 13-bit index. The C code provides the details on how this
information is packed.
( )
To improve the quality of speech with a short pitch period, the following additional procedure is
used. If L0 is less than 58 for subframes 0 and 1 or if L2 is less than 58 for subframes 2 and 3, a train
of Dirac functions with the period of the pitch index, L0 or L2, is used for each location mk instead of
a single Dirac function in the above quantization procedure. The choice between a train of Dirac
functions or a single Dirac function to represent the residual signal is made based on the mean
square error computation. The configuration which yields the lowest mean square error is selected
and its parameter indices are transmitted.
2.16
File : EXC_LBC.C
Procedure : search_T0()
File : EXC_LBC.C
Procedure :
ACELP_LBC_code()
File : EXC_LBC.C
Procedure : Cor_h()
File : EXC_LBC.C
Procedure : Cor_h_X()
File : EXC_LBC.C
Procedure : D4i64_LBC()
File : EXC_LBC.C
Procedure : G_code()
10
A 17-bit algebraic codebook is used for the fixed codebook excitation v[n]. Each fixed codevector
contains, at most, four non-zero pulses. The four pulses can assume the signs and positions given in
Table 1:
Positions
The positions of all pulses can be simultaneously shifted by one (to occupy odd positions) which
needs one extra bit. Note that the last position of each of the last two pulses falls outside the
subframe boundary, which signifies that the pulse is not present.
Each pulse position is encoded with 3 bits and each pulse sign is encoded in 1 bit. This gives a total
of 16 bits for the 4 pulses. Further, an extra bit is used to encode the shift resulting in a 17-bit
codebook.
The codebook is searched by minimizing the mean square error between the weighted speech
signal, r[n], and the weighted synthesis speech given by:
E = r GHv
(26)
where r is the target vector consisting of the weighted speech after subtracting the zero-input
response of the weighted synthesis filter and the pitch contribution, G is the codebook gain; v is the
algebraic codeword at index ; and H is a lower triangular Toeplitz convolution matrix with
diagonal h(0) and lower diagonals h(1),..., h(L 1), with h(n) being the impulse response of the
weighted synthesis filter Si(z).
It can be shown that the optimum codeword is the one which maximizes the term:
=
(d
v )
(27)
v v
T
where d = HT r is the correlation between the target vector signal, r[n], and the impulse response,
h(n), and = HT H is the covariance matrix of the impulse response. The vector d and the matrix
are computed prior to the codebook search. The elements of the vector d are computed by:
d ( j) =
59
r[n] h[n j ]
0 j 59
(28)
n= j
and the elements of the symmetric matrix (i, j) are computed by:
(i , j ) =
59
h[n i ] h[n j ]
n= j
ji
0 i 59
(29)
NOTE Only the elements actually needed are computed and an efficient storage has been designed that
speeds the search procedure up.
The algebraic structure of the codebook allows for very fast search procedures since the excitation
vector v contains only 4 non-zero pulses. The search is performed in 4 nested loops, corresponding
to each pulse positions, where in each loop the contribution of a new pulse is added. The correlation
in Equation 27 is given by:
ITU-T Rec. G.723.1 (05/2006)
11
(30)
where mk is the position of the kth pulse and k is its sign (1). The energy for even pulse position
codevectors in Equation 27 is given by:
= (m0 , m0 )
+ (m1 , m1 ) + 2 0 1(m0 , m1 )
+ (m2 , m2 ) + 2[ 0 2 (m0 , m2 ) + 1 2 (m1 , m2 )]
+ (m3 , m3 ) + 2[ 0 3 (m0 , m3 ) + 1 3 (m1 , m3 ) + 2 3 (m2 , m3 )]
(31)
For odd pulse position codevectors, the energy in Equation 27 is approximated by the energy of the
equivalent even pulse position codevector obtained by shifting the odd position pulses to one
sample earlier in time. To simplify the search procedure, the functions d[ j] and (m1, m2) are
modified. The simplification is performed as follows (prior to the codebook search). First, the signal
s[ j] is defined and then the signal d[ j] is constructed.
if d [2 j ] > d [2 j +1]
otherwise
(32)
and the signal d is given by d[ j] = d[ j]s[ j]. Second, the matrix is modified by including the
signal information; that is, (i, j) = s[i]s[ j](i, j). The correlation in Equation 30 is now given by:
(33)
= (m0 , m0 )
+ (m1 , m1 ) + 2 (m0 , m1 )
+ (m2 , m2 ) + 2[ (m0 , m2 ) + (m1 , m2 )]
+ (m3 , m3 ) + 2[ (m0 , m3 ) + (m1 , m3 ) + (m2 , m3 )]
(34)
A focused search approach is used to further simplify the search procedure. In this approach a
precomputed threshold is tested before entering the last loop, and the loop is entered only if this
threshold is exceeded. The maximum number of times the loop can be entered is fixed so that a low
percentage of the codebook is searched. The threshold is computed based on the correlation C. The
maximum absolute correlation and the average correlation due to the contribution of the first three
pulses, max3 and av3, are found prior to the codebook search. The threshold is given by:
(35)
The fourth loop is entered only if the absolute correlation (due to three pulses) exceeds thr3. Note
that this results in a variable complexity search. To further control the search, the number of times
the last loop is entered (for the 4 subframes) is not allowed to exceed 600. (The average worst case
per subframe is 150 times. This can be viewed as searching only 150 8 entries of the codebook,
ignoring the overhead of the first three loops.)
A special feature of the codebook is that, for pitch delays less than 60, a pitch contribution
depending on the index PGIndi of the LTP pitch predictor gain vector is added to the code. That is,
after the optimum algebraic code v[n] is determined, it is modified by v[n] v[n] + b(PGIndi)v[n
Li e(PGIndi)], the values (PGIndi) and (PGIndi) are tabulated and Li being the integer pitch
period. Note that prior to the codebook search, the impulse response should be modified in a similar
fashion if Li < 60.
12
The last step, after getting the v[n] sequence, is quantizing the gain G. The gain is quantized in the
same way as in the high rate excitation. This is done by stepping through the gain quantization table
~
and selecting the index MGIndi which minimizes the following expression: G G j , 0 j 23.
2.17
Excitation decoder
File : EXC_LBC.C
Procedure : Fcbk_Unpk()
(36)
where:
~
GSize = 24 is the size of the G table; and
PGIndi is obtained in 2.18.
2)
3)
4)
5)
6)
7)
2.18
( )
File : EXC_LBC.C
Procedure : Get_Rez()
File : EXC_LBC.C
Procedure : Decod_Acbk()
Li = PIndi + 18
2)
i = 0,2
(37)
Li = Li 1 + i
i = 1,3
(38)
where i {1, 0, + 1, + 2} .
3)
The gain vector of the pitch predictor in ith subframe is derived from the gain index GIndi.
For the low rate this index contains the information about the pitch predictor gain vector
and the index of the gain of the pulse sequence. In this case pitch gain index PGIndi is
derived as follows:
i = 0..3
(39)
where x indicates the greatest integer x. For the high rate, in the case that the condition
Li 58 is met, this index is derived in the same manner as described in (39). In the above
cases PGIndi is a pointer to 170 entries gain vector codebook. Otherwise, this index is a
ITU-T Rec. G.723.1 (05/2006)
13
pointer to 85 entries gain vector codebook and contains an additional information about the
impulse train bit. In this case pitch gain index is derived as follows:
i = 0..3
(40)
The pitch predictor lag and gain vector are decoded from these indices and utilized for the
pitch contribution u[n] extraction as described below. First, a signal e[n] is defined by:
e[0] = e[ Li 2]
e[1] = e[ Li 1]
e[n] = e[(n mod Li ) Li ]
(41.1)
2 n 63
j=4
e[n + j ]
ij
0 n 59
(41.2)
j =0
2.19
Memory update
File : LPC.C
Procedure : Upd_Ring ()
Memory update
The last task of the ith subframe before proceeding to encode the next subframe is to update the
~
memories of the synthesis filter Ai ( z ) , the formant perceptual weighting filter Wi(z), and the
harmonic noise shaping filter Pi(z). To accomplish this, the complete response of combined filter
Si(z), is computed by passing the reconstructed excitation sequence through this filter. At the end of
the excitation filtering, the memory of the combined filter is saved and will be used to compute the
zero input response during the encoding of the next speech vector.
2.20
Bit allocation
File : UTIL_LBC.C
Procedure : Line_Pack()
Bitstream packing
This clause presents the bit allocation tables for both high and low bit rates. The major differences
between two rates are in the pulse positions and amplitudes coding. Also, at the lower rate 170
codebook entries are always used for the gain vector of the long term predictor. See Tables 2, 3
and 4.
Subframe 0
Subframe 1
Subframe 2
Subframe 3
LPC indices
Total
24
18
12
12
12
12
48
Pulse positions
20
18
20
18
73 (Note)
Pulse signs
22
Grid index
Total
189
NOTE By using the fact that the number of codewords in the fixed codebook is not a power of 2,
3 additional bits are saved by combining the 4 MSBs of each pulse position index into a single 13-bit
word.
14
Subframe 0
Subframe 1
Subframe 2
Subframe 3
LPC indices
Total
24
18
12
12
12
12
48
Pulse positions
12
12
12
12
48
Pulse signs
16
Grid index
4
158
Total
Transmitted parameters
High rate
# bits
Low rate
# bits
24
24
LPC
LSP VQ index
ACL0
ACL1
ACL2
ACL3
GAIN0
12
12
GAIN1
12
12
GAIN2
12
12
GAIN3
12
12
POS0
20 (Note)
12
POS1
18 (Note)
12
POS2
20 (Note)
12
POS3
18 (Note)
12
PSIG0
PSIG1
PSIG2
PSIG3
GRID0
Grid index
GRID1
Grid index
GRID2
Grid index
GRID3
Grid index
NOTE The 4 MSBs of these codewords are combined to form a 13-bit index, MSBPOS.
2.21
Coder initialization
File : CODER.C
Procedure : Init_Coder()
Coder initialization
All the static coder variables should be initialized to 0, with the exception of the previous
LSP vector, which should be initialized to LSP DC vector, pDC.
15
Decoder principles
3.1
General description
The decoder operation is also performed on a frame-by-frame basis. First the quantized LPC indices
are decoded, then the decoder constructs the LPC synthesis filter. For every subframe, both the
adaptive codebook excitation and fixed codebook excitation are decoded and input to the synthesis
filter. The adaptive postfilter consists of a formant and a forward-backward pitch postfilter. The
excitation signal is input to the pitch postfilter, which in turn is input to the synthesis filter whose
output is input to the formant postfilter. A gain scaling unit maintains the energy at the input level
of the formant postfilter.
LSP decoder
File : LSP.C
Procedure : Lsp_Inq()
3.3
LSP interpolator
File : LSP.C
Procedure : Lsp_Int()
LSP interpolator
File : LSP.C
Procedure : LsptoA()
3.4
File : EXC_LBC.C
Procedure : Get_Rez()
File : EXC_LBC.C
Procedure : Decod_Acbk()
16
3.5
Excitation decoder
File : EXC_LBC.C
Procedure : Fcbk_Unpk()
3.6
Pitch postfilter
File : EXC_LBC.C
Procedure : Comp_Lpf()
File : EXC_LBC.C
Procedure : Find_F()
File : EXC_LBC.C
Procedure : Find_B()
File : EXC_LBC.C
Procedure : Get_Ind()
Gain computation
File : EXC_LBC.C
Procedure : Filt_Lpf()
Pitch postfiltering
A pitch postfilter is used to improve the quality of the synthesized signal. It is important to note that
the pitch postfilter is performed for every subframe, and to implement it, it is required that the
whole frame excitation signal {e[n]}n = 0..239 is generated and saved. The quality improvement is
obtained by increasing the SNR at multiples of the pitch period. This is done in the following way.
The postfiltered signal {ppf [n]}n = 0..59 is obtained from the decoded excitation signal {e[n]}n = 0..59 as
given by the following expression:
] + wb gb e[n M b ] ) }
(42)
where e[n] is the decoded excitation signal. The computation of the gains, gp, gf, gb, and the delays,
Mf, Mb, is based on a forward and backward cross-correlation analysis. The weights wf, wb may have
the following values: (0,0), (0,1), and (1,0). The delays are selected by maximizing the
cross-correlations. The cross-correlation for the forward pitch lag is given by:
Cf =
e[n] e[n + M ]
59
M1 Mf M 2
(43.1)
n=0
and the cross-correlation for the backward pitch lag is given by:
Cb =
59
e[n] e[n M ]
M1 M b M 2
(43.2)
n=0
where M1 = Li 3 and M2 = Li + 3 and {Li}i = 0.2 are the received pitch lags for the first and third
subframes. L0 is utilized for the first two subframes and L2 for the last two. Note that if one of the
maximum correlations is negative or if for some n [0..59] there is no sample value e[n + Mf]
available, then the corresponding weight and delay are set to 0. This makes four possible cases:
0)
both maxima are negative and no pitch postfilter weights need to be computed;
1)
only the forward maximum is positive, so it is selected;
2)
only the backward maximum is positive, so it is selected; or
3)
both maxima are positive and the one making the larger contribution is selected. This
procedure is described below. For cases 1), 2) and 3), the relevant signal energies (Ten, Df
and/or Db) for the optimum pitch lag (Mf or Mb) are computed according to Equations 44.1,
44.2 and 44.3:
Df =
e[n + M ] e[n + M ]
59
(44.1)
n=0
17
59
e[n M ] e[n M ]
Db =
(44.2)
n=0
Ten =
59
e[n] e[n]
(44.3)
n=0
(e[n] g e[n + M ] )
59
Ef =
(45.1)
n=0
Eb =
59
(e[n] g e[n M ] )
b
(45.2)
n=0
C2
. In case 3), the selection between forward and backward is
D
C 2b
C2 f
. The prediction gain is equal to 10 log10
and
made by selecting the larger of
Df
Db
E is minimized by maximizing
1 C . If this gain is less than 1.25 dB, then the contribution is judged to be negligible and no
DT
en
pitch postfilter is used. If a pitch postfilter is used, then the optimum gain is given by:
g=
C
D
(46)
According to the speech coder bit rate, the optimum gain is multiplied by a weighting factor, ltp,
which equals 0.1875 for the high rate and 0.25 for the low rate. Finally, the scaling gain, gp, is
computed as:
59
e [n]
2
gp =
n=0
59
( ppf [n] )
(47)
2
n=0
If the denominator in Equation 47 is less than the numerator, the gain is set to 1.
3.7
File : LPC.C
Procedure : Synt()
~
The 10th order LPC synthesis filter Ai (z ) is used to synthesize the speech signal sy[n] from the
decoded pitch postfiltered residual ppf [n].
sy [n ] = ppf [n ] +
18
10
a~ij sy[n j ]
j =1
(48)
3.8
Formant postfilter
File : LPC.C
Procedure : Spf()
Formant postfiltering
File : UTIL_LBC.C
Procedure : Comp_En()
A conventional ARMA postfilter is used. The transfer function of the short term postfilter is given
by the following equations:
59
sy[n] sy[n 1]
k =
n =1
(49.1)
59
sy[n] sy[n]
n=0
k1 = 3 k1old + 1 k
4
4
1
F (z ) =
(49.2)
10
a~i i1 z i
(1 0.25k1z 1 )
i =1
10
1 a~i i 2 z i
(49.3)
i =1
where 1 = 0.65 and 2 = 0.75, k is the first autocorrelation coefficient that is estimated from the
synthesized speech sy[n], and k1old is the value of k1 from the previous subframe. The postfiltered
signal pf [n] is obtained as the output of the formant postfilter with input signal sy[n].
3.9
File : UTIL_LBC.C
Procedure : Scale()
This unit receives two input vectors, the synthesized speech vector {sy[n]}n = 0..59 and the
postfiltered output vector {pf [n]}n = 0..59. First, the amplitude ratio gs is computed, using:
59
sy [n]
2
gs =
n=0
(50)
59
pf [n]
2
n=0
(51)
(52)
3.10
File : EXC_LBC.C
Procedure : Comp_Info()
File : EXC_LBC.C
Procedure : Regen()
19
This coder has been designed to be robust for indicated frame erasures. An error concealment
strategy for frame erasures has been included in the decoder. However, this strategy must be
triggered by an external indication that the bitstream for the current frame has been erased. Because
the coder was designed for burst errors, there is no error correction mechanism provided for random
bit errors. If a frame erasure has occurred, the decoder switches from regular decoding to frame
erasure concealment mode. The frame interpolation procedure is performed independently for the
LSP coefficients and the residual signal.
2)
The predicted vector, pn , is added to the vector, e~n , and DC vector, pDC, to form the
pn . For pn generation a different fixed predictor is used: be = 23/32.
decoded LSP vector, ~
From this point the decoding of the LSP continues as in 2.6, except that min = 62.5 Hz, rather than
31.25 Hz.
3.11
Decoder initialization
File : DECOD.C
Procedure : Init_Decod()
Decoder initialization
All the static decoder variables should be initialized to 0, with the exception of:
1)
Previous LSP vector, which should be initialized to LSP DC vector, pDC.
2)
Postfilter gain g[1], which should be initialized to 1.
20
Bitstream packing
NOTE Each bit of transmitted parameters is named PAR(x)_By, where PAR is the name of the parameter
and x indicates the subframe index if relevant and y stands for the bit position starting from 0 (LSB) to the
MSB. The expression PARx_By..PARx_Bz stands for the range of transmitted bits from bit y to bit z. The
unused bit is named UB (value = 0). RATEFLAG_B0 tells whether the high rate (0) or the low rate (1) is
used for the current frame. VADFLAG_B0 tells whether the current frame is active speech (0) or
non-speech (1). The combination of RATEFLAG and VADFLAG both being set to 1 is reserved for future
use. Octets are transmitted in the order in which they are listed in Tables 5 and 6. Within each octet, the bits
are with the MSB on the left and the LSB on the right.
Table 5/G.723.1 Octet bit packing for the high bit rate codec
Transmitted octets
PARx_By, ...
LPC_B13...LPC_B6
LPC_B21...LPC_B14
GAIN0_B11...GAIN0_B4
GAIN1_B7...GAIN1_B0
GAIN2_B3...GAIN2_B0, GAIN1_B11...GAIN1_B8
10
GAIN2_B11...GAIN2_B4
11
GAIN3_B7...GAIN3_B0
12
13
MSBPOS_B6...MSBPOS_B0, UB
14
15
POS0_B9...POS0_B2
16
17
POS1_B9...POS1_B2
18
POS2_B3...POS2_B0, POS1_B13...POS1_B10
19
POS2_B11...POS2_B4
20
POS3_B3...POS3_B0, POS2_B15...POS2_B12
21
POS3_B11...POS3_B4
22
23
PSIG2_B2...PSIG2_B0, PSIG1_B4...PSIG1_B0
24
PSIG3_B4...PSIG3_B0, PSIG2_B5...PSIG2_B3
21
Table 6/G.723.1 Octet bit packing for the low bit rate codec
Transmitted octets
PARx_By, ...
LPC_B13...LPC_B6
LPC_B21...LPC_B14
GAIN0_B11...GAIN0_B4
GAIN1_B7...GAIN1_B0
GAIN2_B3...GAIN2_B0, GAIN1_B11...GAIN1_B8
10
GAIN2_B11...GAIN2_B4
11
GAIN3_B7...GAIN3_B0
12
13
POS0_B7...POS0_B0
14
POS1_B3...POS1_B0, POS0_B11...POS0_B8
15
POS1_B11...POS1_B4
16
POS2_B7...POS2_B0
17
POS3_B3...POS3_B0, POS2_B11...POS2_B8
18
POS3_B11...POS3_B4
19
PSIG1_B3...PSIG1_B0, PSIG0_B3...PSIG0_B0
20
PSIG3_B3...PSIG3_B0, PSIG2_B3...PSIG2_B0
ANSI C code
ANSI C code simulating the dual rate encoder/decoder in 16-bit fixed point arithmetic is available
from ITU-T. Table 7 lists all the files included in this code.
Description
TYPEDEF.H
CST_LBC.H
LBCCODEC.C
LBCCODEC.H
Functions prototypes
CODER.C
CODER.H
Functions prototypes
DECOD.C
DECOD.H
Functions prototypes
LPC.C
LPC.H
Functions prototypes
22
Description
LSP.C
LSP.H
Functions prototypes
EXC_LBC.C
EXC_LBC.H
Functions prototypes
UTIL_LBC.C
UTIL_LBC.H
Functions prototypes
TAB_LBC.C
Tables of constants
TAB_LBC.H
BASOP.C
BASOP.H
Functions prototypes
Glossary
This is a glossary containing the mathematical symbols used in the text and a brief description of
what they represent. See Table 8.
Description
y[j]
s[j]
x[j]
R[j]
ai
a~i
~
pn
pn
PDC
en
~
en
Quantized value of en
Wn
1, 2
Wi
f[n]
LOL
COL( j)
CPW ( j)
23
Gopt
Pi
Si
Combined harmonic and formant weighting and synthesis filters for subframe i
h[n]
z[n]
w[n]
t[n]
Target vector
p[n]
r[n]
r[n]
v[n]
Number of pulses
Sign of pulse k
mk
Position of pulse k
d[n]
Gmax
Li
max3
av3
thr3
MGIndi
GIndi
PIndi
PGIndi
GSize
~
G
Quantized gain
ij
u[n]
e[n]
ppf [n]
Mf, Mb
ppf [n]
ltp
24
Description
Description
Pitch postfilter scaling gain
gp, gb
Cf, Cb
Df, Db
Ef, Eb
Ten
sy[n]
pf [n]
q[n]
g[n]
k1
1, 2
gs
be
Fixed predictor for LSP interpolation during frame erasure concealment, 23/32
25
Annex A
Silence compression scheme
A.1
Introduction
This annex describes the silence compression system that has been designed for the G.723.1 speech
coder. Silence compression techniques are used to reduce the transmitted bit rate during silent
intervals of speech. Systems allowing discontinuous transmission are based on a Voice Activity
Detection (VAD) algorithm and a Comfort Noise Generator (CNG) algorithm that allows the
insertion of an artificial noise during silence periods. This feature is necessary to avoid noise
modulation introduced when the transmission is switched off: if the background acoustic noise that
was present during active periods abruptly disappears, this very unpleasant noise modulation may
even reduce the intelligibility of the speech.
The purpose of the VAD is to reliably detect the presence or absence of speech and to convey this
information to the CNG algorithm. Typically, VAD algorithms base their decisions on several
successive frames of information in order to make them more reliable and to avoid producing
intermittent decisions. The VAD is constrained to operate on the same 30-ms speech frames which
will subsequently either be encoded by the speech coder or filled with comfort noise by the comfort
noise generator. The output of the VAD algorithm is passed to the CNG algorithm.
The largest difficulty in the detection of speech is the presence of any of a diverse range of
background noise conditions. The VAD must be able to detect speech even in very low signal-tonoise ratio conditions. It is impossible to distinguish between speech and noise using simple level
detection techniques when parts of the speech utterance are buried below the noise. The distinction
between these conditions can only be made by taking into consideration the spectral characteristics
of the input signal. In order to do this, the VAD incorporates an inverse filter, the coefficients of
which are derived during noise-only periods by the CNG. All further details of the VAD are
included in A.2.
The purpose of the CNG algorithm is to create a noise that matches the actual background noise
with a global transmission cost as low as possible. At the transmitting end, the CNG algorithm uses
the activity information given by the VAD for each frame, then computes the encoded parameters
needed to synthesize the artificial noise at the receiving end. These encoded parameters compose
the Silence Insertion Descriptor (SID) frames, which require less bits than the active speech frames
and are transmitted during inactive periods.
The main feature of this CNG algorithm is that the transmission of SID frames is not periodic: for
each inactive frame, the algorithm makes the decision of sending a SID frame or not, based on a
comparison between the current inactive frame and the preceding SID frame. In this way, the
transmission of the SID frames is limited to the frames where the power spectrum of the noise has
changed.
During inactive frames, the comfort noise is synthesized at the decoder by introducing a
pseudo-white excitation into the short-term synthesis filter. The parameters used to characterize the
comfort noise are the LPC synthesis filter coefficients and the energy of the excitation signal. At the
encoder, for each SID frame the algorithm computes a set of LPC parameters and quantizes the
corresponding LSPs using the coder LSP quantizer on 24 bits. It also evaluates the excitation energy
and quantizes it with 6 bits. This yields encoded SID frames of 4 bytes including the 2 bits for bit
rate and DTX information.
A notable feature of this CNG algorithm is the method used to evaluate the spectrum of the ambient
noise for each SID frame. It takes into account the local stationarity or non-stationarity of the input
signal.
26
Finally, the excitation corresponds to the higher bit rate excitation of the G.723.1 codec. Since the
fixed excitation has a rather poor spectrum, the long-term excitation is also used in order to obtain a
better white-noise-type of excitation. The algorithm randomly chooses the codes of the long-term
parameters (delays and gains) and the fixed codebook parameters (grid, pulse positions and signs).
For every two subframes, it computes the gain of the fixed excitation to achieve a global energy
derived from the transmitted SID energy.
The computation of the excitation needs to be performed both at the encoder and at the decoder to
keep both parts synchronized.
At the receiver, to simplify the procedure, the harmonic postfilter is switched off during comfort
noise generation since the generated noise is not a voiced signal.
The results of the tests on the VAD/DTX/CNG scheme as described in this annex will be published
at a later date as an appendix to this annex.1
A.2
This clause describes the Voice Activity Detector (VAD) used in the G.723.1 speech coder. The
function of the VAD is to indicate whether each 30-ms frame produced by the speech encoder
contains speech or not. The VAD decision at frame t is labelled as Vadt and is the input to the
COD-CNG block that computes Ftypt, as described in A.3 and Figure A.1. The performance of the
VAD algorithm is characterized by the amount of audible speech clipping and the percentage of
speech activity it indicates.
The VAD is basically an energy detector. The energy of the inverse filtered signal is compared with
a threshold. Speech is indicated whenever the threshold is exceeded. The threshold is computed by
a two-step procedure. First, the noise level is updated based on its previous value and the energy of
the filtered signal. Second, the threshold is computed from the noise level via a logarithmic
approximation.
"Hangover" is a term describing the practice of declaring the first few frames of silence following a
speech burst to still be speech. It is used to eliminate low-level speech clipping. Hangover is only
added to speech bursts which exceed a certain duration to avoid extending noise spikes.
A.2.1
An adaptation enable flag, denoted Aent for the current frame t, is used to be sure that the
VAD noise level is adapted only when speech is not present. It is based on the fact that the
background noise or the silence is neither a voiced signal nor a sine wave:
Voiced/Unvoiced detection
The open loop pitch delays of the preceding and current frame are used to test voicing. Let
us
note
LOL , j = 0,1,2,3
those
four
values.
The
minimum
delay
Lmin
OL = Min LOL , j = 0,1,2,3 is first computed. The counter pc [1,2,3,4] indicating how
j
j
many delays LOL
lie in the neighborhood of a multiple of Lmin
OL (3) is evaluated. If pc is
equal to 4, the signal is considered as voiced.
____________________
1
It should be noted that test conditions were not sufficiently severe for mobile conditions.
ITU-T Rec. G.723.1 (05/2006)
27
If kit [2] 0.95 for at least 14 of the 15 last values, then a sine wave is detected (SinD = 1).
In the other case, SinD = 0.
Compute the adaptation enable flag:
Aent = Aent 1 + 2
Aent = Aent 1 1
if pc = 4 or SinD = 1
otherwise
Inverse filtering
The input signal frame, {s[n ]}n = 60..239 , is inverse filtered by a FIR filter Ano (z) with coefficients
{ano [ j ]}j = 1..10 . This filter is calculated by the CNG block and provides an estimation of the
LPC filter associated to the current background noise.
et [n] = s[n] +
10
a [ j ] s[n j ]
n = 60 239
no
(A-1)
j =1
A.2.3
The energy, Enrt, is computed from the inverse filtered signal of the current frame by:
Enrt =
A.2.4
1
80
239
et2 [n]
(A-2)
n = 60
The noise level at frame t, Nlevt, is updated based on its previous value and on the previous energy,
Enrt 1 and on the adaptation enable flag Aent. This update procedure is characterized by slow attack
and fast decay. The dynamic range of the noise level at frame t is limited to the range
[Nlevmin , Nlevmax ].
1)
2)
if Aent = 0
otherwise
28
(A-3)
(A-4)
A.2.5
Threshold computation
The relationship between the noise level at frame t, Nlevt, and the threshold, Thr, is defined by
logarithmic approximation and defined by the following formula:
5.012
0.7 0.05 log
2
Thr = 10
2.239
A.2.6
if Nlev = 128
Nlev
128
(A-5)
if Nlev 16384
The VAD decision is based on the comparison between the threshold, Thr, and the current energy,
Enrt.
1 Enrt Thr
Vad t =
0 Enrt < Thr
A.2.7
(A-6)
A hangover of 6 frames is added only in the case of speech bursts (Vadt = 1) larger than or equal to
2 frames.
A.2.8
VAD initialization
All static variables of the VAD algorithm are initialized to zero, except the following variables:
Nlev1
Enr1
j
LOL
j
LOL
A.3
=
=
=
=
1024
1024
1
60
j = 0,1
j = 2,3
(A-7)
The algorithm is divided into two blocks situated at the encoder and the decoder that will be called
respectively COD-CNG and DEC-CNG. At the encoder (see Figure A.1), the COD-CNG block uses
the autocorrelation function of the speech signal computed for each 60 samples subframe, the past
excitation samples and LSPs from the preceding frame.
29
For inactive frames, COD-CNG computes the CNG excitation samples in order to synchronize the
local decoder of the encoder with the distant decoder.
Because of the predictive coding of the LSPs in the G.723.1 scheme, a similar input/output with
update is done for LSP parameters during inactive frames.
COD-CNG outputs the encoded SID frames and the final decision Ftypt (Frame type of frame t) as
one of the three values, 0, 1, or 2 corresponding to untransmitted frame, active speech frame or
SID frame, respectively.
At the receiver (see Figure A.2), the DEC-CNG block processes only inactive speech frames, for
which the input information Ftypt is equal to 0 or 2 (untransmitted/SID). DEC-CNG decodes the
SID frames and both for SID and untransmitted frames, computes the current LSPs and excitation
using the same method as COD-CNG.
Then the G.723.1 decoder synthesizes the comfort noise using the CNG excitation and LSPs.
30
A.4
For each frame of 240 samples (active or inactive), the COD-CNG block processes the data coming
from the VAD and the coder, produces the Ftypt information and the coded SID frame according to
the procedure depicted by Figure A.3 and detailed in A.4.1 to A.4.7.
File: COD_CNG.C
Procedure: Update_Acf()
For every frame t (active or inactive), the autocorrelation coefficients (calculated in the encoder as
described in 2.4) Ri [ j ] , j = 0 to 10 of the four subframes indexed by i = 0 to 3 are summed. The
cumulated autocorrelation function of the current frame t, is given by:
R t [ j] =
R [ j]
i
for j = 0 to 10
(A-8)
i = 10
A.4.2
File: COD_CNG.C
Procedure: Cod_Cng()
File: COD_CNG.C
Procedure: LpcDiff()
File: LPC.C
Procedure: Durbin()
Levinson-Durbin recursion
If the current frame t is an active speech frame (Vadt = 1), then Ftypt = 1 and no other processing is
performed.
31
In the other case, the decision SID/untransmitted frame is taken according to the following
procedure:
The LPC filter At(z) of the current frame t is calculated by the Durbin procedure (see 2.4) using
Rt[ j ] as input. The coefficients of At(z) are noted at[j], j = 1 to 10. The Durbin procedure also
provides the residual energy Et , that will be used as an estimate of the frame excitation energy.
Then, the current frame type Ftypt is determined in the following way:
If the current frame is the first inactive frame of the inactive zone, the frame is selected as
SID frame, the variable E which reflects the energy sum is taken equal to Et, and the
number of frames involved in the summation, kE, is initialized to 1:
(Vad
t 1
Ftypt = 2
= 1) E = E
k = 1
E
(A-9)
Else, if the current filter is significantly different from the preceding SID filter, or if the
current excitation energy significantly differs from the preceding SID energy, then the
frame is selected as SID (Ftypt = 2).
Otherwise, if the current frame is not the first of an inactive period, and if the current
LPC filter and the excitation energy are similar to the SID ones, the frame is not transmitted
(Ftypt = 0).
The LPC filters and energies are compared according to the following methods:
Comparison of the LPC filters
The current LPC filter and SID filter are considered as significantly different if the Itakura distance
between the two filters exceeds the given threshold, which is expressed by:
10
R [ j] R [ j] > E
t
thr1
(A-10)
j =0
where Ra[j], j = 0 to 10 is a function derived from the autocorrelation of the coefficients of the SID
filter, given by :
10 j
[
]
R
j
=
2
a
asid [k ] asid [k + j ]
k =0
10
R (0 ) =
asid [k ]2
k =0
if j 0
(A-11)
KE being first incremented up to the maximum value 3, the sum of the frame energies
E =
Et
is calculated.
i = t k E +1
Then E is quantized, using the 6-bit pseudo-logarithmic quantizer described in A.4.3. The coded
gain index GIndt is compared to the previous coded SID gain index GIndsid. If the difference
exceeds the threshold thr2 = 3, the two energies will be considered as significantly different.
32
A.4.3
File: UTIL_CNG.C
Procedure: Qua_SidGain()
File: UTIL_CNG.C
Procedure: Dec_SidGain()
The quantization procedure operates on the sum of the energies E , and the decoding provides a
gain, which corresponds to the decoded value of the average energy square root.
A scaling factor w = 2.70375 is introduced to take into account the effect of windowing and
bandwidth expansions present in the subframes autocorrelation functions Ri[ j ].
The value used at the input of the gain quantizer is:
G = w
1
E , bounded in [0, 352].
k E 240
The quantizer is a pseudo-log one, that divides [0, 352] into three segments indexed isg = 0 to 2 of
length N[isg] = 16, 16, 32 with the associated resolutions 2, 4 and 8.
Let Gisg [j], j = 0 to N[isg] 1 be the decoded values for segment isg. Those values are given by:
Gisg [ j ] = Gisg [0] + j 2 ( isg + 1)
(A-12)
The procedure uses G2 to calculate the index isg of the segment which contains G and the index is
of Gisg ( is) the closer to G.
The current quantization index is given by:
GInd t = 16 isg + i s
(A-13)
(A-14)
File: COD_CNG.C
Procedure: Cod_Cng()
File: COD_CNG.C
Procedure: ComputePastAvFilter()
File: COD_CNG.C
Procedure: LpcDiff()
File: COD_CNG.C
Procedure: CalcRc()
File: LPC.C
Procedure: Durbin()
Levinson-Durbin recursion
File: LSP.C
Procedure: AtoLsp()
File: LSP.C
Procedure: Lsp_Qnt()
LSP quantization
File: LSP.C
Procedure: Lsp_Inq()
When the current frame is a SID frame, the SID parameters are calculated and quantized. Notice
that those parameters will serve in making the SID decision for the next inactive frames up to the
next SID frame.
33
Computation of SID LPC filter [Asid(z)] and update of VAD LPC filter [Ano(z)]
First, the past average LPC filter A p ( z ) built from the three frames preceding the current one is
estimated, using the Durbin procedure with the following autocorrelation function as input:
R p [ j] =
t 1
R k [ j]
for j = 0 to 10
(A-15)
k = t 3
j = 1,2,...10
(A-16)
(A-17)
~
and the decoded value is denoted Gsid .
A.4.5
File: UTIL_CNG.C
Procedure: Calc_Exc_Rand()
File: UTIL_CNG.C
Procedure: random_number()
File: UTIL_CNG.C
Procedure: distG()
File: UTIL_LBC.C
Procedure: Sqrt_lbc()
Square root
File: UTIL_LBC.C
Procedure: Rand_lbc()
Pseudo-random sequence
The update of the excitation signal is performed both for SID frames and for untransmitted frames.
~
First, let us define the target excitation gain Gt as the square root of the average energy that must
~
be obtained for the current frame t synthetic excitation. Gt is calculated using the following
smoothing procedure:
~
Gsid
if Vad t 1 = 1
~
Gt = 7 ~
(A-18)
1~
otherwise
8 Gt 1 + 8 Gsid
34
The 240 samples of the frame are divided into two blocks of 120 samples, each block comprising
two subframes of 60 samples.
For each block, the CNG excitation samples are synthesized using the following algorithm:
First, the LTP parameters of the two subframes are selected.
The pitch lag for the first subframe is randomly chosen in the interval [123, 143].
The two subframes gain vector indices are randomly chosen into [0, 49], which corresponds
to the first 50 vectors of the 170 entries gain codebook.
The second subframe lag offset is taken equal to 0 for the first block, and 3 for the second
block.
Next, the fixed codebook vectors of the two subframes are built by random selection of the grid, the
pulses signs and positions, corresponding to the higher rate fixed excitation pattern.
Then, a unique fixed excitation gain is computed for the two subframes of the block.
The adaptive excitation vector on the current block is noted u[n], n = 0 to 119 and the fixed
excitation v[n], n = 0 to 119.
The fixed excitation gain is obtained by calculating the value Gf that yields a block average energy
~
the closest to the target energy Gt2 .
Select Gf such that
119
(u[n] + Gf v[n] )
20
~
Gt2 minimum.
(A-19)
n0
119
119
~
a = v[n ]2 , b = u[n ]v[n ] , c = u[n ]2 120Gt2
n =0
n =0
n =0
b
is selected, else the two roots are calculated and the one
a
with the lowest absolute value is selected.
If the discriminant is 0, then Gf =
A.4.6
n = 0 to 119
(A-20)
File: COD_CNG.C
Procedure: Cod_Cng()
File: LSP.C
Procedure: Lsp_Int()
LSP Interpolator
Both for SID frames and for untransmitted frames, the interpolated sets of LPC coefficients are
calculated using ~
psid and the previous LSP vector ~
pt 1 provided to COD-CNG.
p =~
p .
The LSP update is also performed: ~
t
sid
35
A.4.7
COD-CNG initialization
The following initialization must be performed on the frame autocorrelation functions, the target
excitation gain, VAD information, and seed of the random generator used to compute the CNG
excitation:
R k [ j] = 0
for j = 0,...,10 and k = 1,2,3
~
G1 = 0
Vad 1 = 1
rseed = 12345
At the receiving end, DEC-CNG processes SID frames and untransmitted frames to produce the
synthesized comfort noise.
The procedures developed to deal with frame erasures are described in the following subclauses.
A.5.1
Description of DEC-CNG
File: DEC_CNG.C
Procedure: Dec_Cng()
File: UTIL_CNG.C
Procedure: Calc_Exc_Rand()
File: UTIL_CNG.C
Procedure: random_number()
File: UTIL_CNG.C
Procedure: distG()
File: UTIL_LBC.C
Procedure: Sqrt_lbc()
Square root
File: UTIL_LBC.C
Procedure: Rand_lbc()
Pseudo-random sequence
File: LSP.C
Procedure: Lsp_Inq()
File: LSP.C
Procedure: Lsp_Int()
LSP Interpolator
File: UTIL_CNG.C
Procedure: Qua_SidGain()
File: UTIL_CNG.C
Procedure: Dec_SidGain()
36
Figure A.4 provides a general description of the comfort noise generation at the decoder part.
When the decoder receives a SID frame, DEC-CNG decodes the SID parameters.
Both in case of SID and untransmitted frames, the module DEC-CNG uses the decoded
SID parameters to compute the LSPs and the excitation of the comfort noise that will be synthesized
by the decoder synthesis module.
The CNG type of frame information Ftypt (for frame t) provided at the receiver is the same as the
value computed by COD-CNG at the encoder.
~
p for the LSPs and G
sid
Then, in both cases, the CNG excitation is calculated according to the procedure described in
psid , is used to compute the interpolated
COD-CNG in A.4.5. The new LSP vector, ~
LPC coefficients, and the LSP updating is performed: ~
pt = ~
psid .
37
A.5.2
File: DECOD.C
Procedure: Decod()
Frame decoding
When a frame erasure is detected by the decoder, the erased frame type depends on the preceding
frame type:
If the preceding frame was active, then the current erased frame is considered as active.
Else, if the preceding frame was either a SID frame or an untransmitted frame, the current
erased frame is considered as untransmitted:
Ftypt = 1
Ftypt 1 = 1
Ftyp
t 1 = 0 or 2 Ftypt = 0
(A-21)
If it is not the first SID frame of the current inactive period, then the previous SID
parameters are kept.
If it is the first SID frame of an inactive period, a special protection has been taken:
As stated in A.5.1, this case is detected by the fact that Ftypt1 = 1 and Ftypt = 0.
This combination of events does not imply that the preceding frame was a good active frame:
several frames up to the preceding one may have been erased. What is certain is that the last good
frame was an active frame, that the present frame was not erased, and that the SID frame supposed
to provide information for the current untransmitted frame is lost.
To recover the SID information, DEC-CNG uses parameters provided by the G.723.1 decoder main
part:
psid .
The LSPs of the last valid active frame are used for ~
The energy term Enr calculated by the decoder during the the residual interpolation
procedure (see 3.10.2) over the 120 last excitation samples of the last valid active frame is
~
used to recover Gsid , according to the method described in A.5.1.
Finally, to avoid de-synchronization of the random generator used to compute the excitation, the
pseudo-random sequence reset is performed at each active frame, both at the encoder and coder
part: rseed = 12345.
A.5.3
DEC-CNG initialization
Vad 1 = 1
rseed = 12345
38
A.6
Table A.1 shows the bit stream of the SID frames according to the notations used in clause 4.
Table A.1/G.723.1 Bit packing for SID frames
A.7
Transmitted octets
PARx_By, ...
Glossary
a no [ j ]
j
LOL
pc
Aent
et [n]
Enrt
Nlevt
Nlev min
Nlev max
Thr
k it [2]
SinD
e[n]
Rt [ j ]
thr2
Ra[j]
GIndsid
~
Gsid
GIndt
isg
N[isg]
Gisg [ j ]
is
w
asid
ap
Rp[ j]
~
p
sid
39
A.8
u[n]
v[n]
~
Gt
a, b, c
C(X)
Gf
rseed
All details of the silence compression algorithm are included as part of bit-exact, fixed-point
ANSI C source code. In the event of any discrepancy between the above descriptions and the C
source, the C source code is presumed to be correct. This C source code is a part of ITU-T
Rec. G.723.1.
40
Annex B
Alternative specification based on floating point arithmetic
B.1
Introduction
This Recommendation provides a bit-exact, fixed-point specification of a dual rate 6.3 and
5.3 kbit/s speech coder for multimedia telecommunications applications. Exact details of this
specification are given in bit-exact, fixed-point C code available from the ITU-T. This annex
describes an alternative implementation of G.723.1 contained in floating point C source code. A set
of digital test vectors for this floating point specification is also available from the ITU-T in order to
facilitate the implementation of this Recommendation. Note that passing the test vectors is a
necessary but not a sufficient condition to comply with this Recommendation.
B.2
Algorithm description
The floating point version of G.723.1 has the same algorithmic steps as the fixed-point version.
Similarly, the bit stream is identical to that of G.723.1. The reader is referred to clauses 2, 3, 4 and 6
for details.
B.3
ANSI C code
ANSI C code simulating the dual/rate encoder/decoder in floating point arithmetic is available from
the ITU-T. Table B.1 lists all the files included in this code. Individual C function names are the
same as in the main body text of this Recommendation.
Table B.1/G.723.1 List of software filenames
File name
Description
TYPEDEF2.H
CST2.H
LBCCODE2.C
LBCCODE2.H
Function prototypes
CODER2.C
CODER2.H
Function prototypes
DECOD2.C
DECOD2.H
Function prototypes
LPC2.C
LPC2.H
Function prototypes
LSP2.C
LSP2.H
Function prototypes
EXC2.C
EXC2.H
Function prototypes
UTIL2.C
UTIL2.H
Function prototypes
TAB2.C
Tables of constants
TAB2.H
41
Annex C
Scalable channel coding scheme for wireless applications
C.1
Introduction
C.1.1
Scope
This annex specifies a channel coding scheme which can be used with the triple rate speech codec
G.723.1. The channel codec is scalable in bit-rate and is designed for mobile multimedia
applications as a part of the overall H.324 family of standards. With this annex, G.723.1 can be
adapted to any wireline or wireless transmission system. Transmission system dependent
functionalities like interleaving or burst formatting are not subject of this annex.
C.1.2
Bit-rates
A wide range of channel codec bit-rates is supported ranging from 0.7 kbit/s up to 14.3 kbit/s. The
channel codec supports all three operational modes of the G.723.1 codec, namely high rate, low rate
and discontinuous transmission modes.
The applied channel codec bit-rates depend on the transmission system and the application and are
therefore not specified in this annex.
C.1.3
Delay
No additional delay is introduced by the channel encoder and decoder described in this annex.
C.1.4
The channel codec is based on punctured convolutional codes. The possibility to generate multiple
rates from a given mother code is the basis for the scalability feature. Based on the subjective
importance of each class of information bits, the available channel codec bit-rate is allocated
optimally to the bit classes. This allocation is based on an algorithm which has to be known by
encoder and decoder. Each time the system control signals either a change in the G.723.1 rate or in
the available channel codec bit-rate, this algorithm is executed to adapt the channel codec to the
new speech service configuration.
If only a very low channel codec bit-rate is available, the subjectively most sensitive bits are
protected first. When increasing the channel codec bit-rate, the additional bits are used first to
protect more information bits and second to increase the protection of the already protected classes.
A special configuration of the channel codec enables a robust G.723.1 transmission without the use
of convolutional codes but with only an error detection code for the subjectively most sensitive bits.
The channel codec bit-rate for this configuration is 667 bits per second.
Prior to the application of the channel encoding functions, the speech parameters are partly
modified in a channel adaptation layer to improve their robustness against transmission errors.
Two parameters are sufficient to configure and control the channel codec:
1)
G.723.1 configuration bit, signalling the operational mode of the speech encoder;
2)
Channel codec configuration bits, signalling the channel codec bit-rate to apply.
C.1.5
The channel codec bit-rate available for the speech service is provided by the control H.245 using
one channel codec configuration bits. Both channel codec configuration bit and G.723.1
configuration bits are protected and transmitted together with the protected information bitstream to
the decoder. Prior to the decoding of the information bitstream, the configuration bits are decoded
and channel and speech decoder are configured accordingly.
42
C.2
Channel encoder
C.2.1
This clause describes the transformation of the original G.723.1 speech parameters to a modified
bitstream with increased robustness to transmission errors.
C.2.1.1
The format of the speech parameters is specified in Tables 5 and 6. The 12-bit overall gain index
(GAINx_By) of each subframe is a compressed index of two individual gain indices. These indices
shall be decompressed as it is described in the speech decoding part (see 2.17 and 2.18). This results
in one 8-bit (AGAINx_By) and one 5-bit (FGAINx_By) gain index per subframe, requiring one
additional bit compared to the compressed index. The overhead introduced by this decompression
operation shall be exploited in the channel decoder for error detection and error concealment
purposes.
C.2.1.2
LPC reordering
The 24-bit LPC index (LPC_Bx, x = 0,23) consists of three 8-bit sub-vectors as described in 2.6.
e0 = { LPC _ B7, LPC _ B6, LPC _ B5, LPC _ B4, LPC _ B3, LPC _ B1, LPC _ B0}
e1 = { LPC _ B15, LPC _ B14, LPC _ B13, LPC _ B12, LPC _ B11, LPC _ B10, LPC _ B9, LPC _ B8}
e2 = { LPC _ B 23, LPC _ B22, LPC _ B21, LPC _ B20, LPC _ B19, LPC _ B18, LPC _ B17, LPC _ B16}
[ ]
emR = ReorderTab m em
m = 0,1,2
The LPC reordering tables are given in Table C.1a for the first sub-vector, in Table C.1b for the
second and in Table C.1c for the third sub-vector. The entry (i,j) to the tables is determined by:
em = 16 j + i . The output obtained is the transmitted index of the sub-vector.
Table C.1a/G.723.1 Reordering table for m = 0 (ReorderTab0[e0])
i=0
i=1
i=2
i=3
i=4
i=5
i=6
i=7
i=8
j=0
82
91
190
191
189
36
187
32
38
39
185
112
166
175
116
120
j=1
34
35
122
41
40
43
42
56
57
90
168
85
74
170
234
174
j=2
169
172
178
182
184
179
181
180
37
186
44
33
159
183
188
155
j=3
253
252
147
154
246
165
218
139
63
160
157
146
145
144
177
176
j=4
244
131
148
129
128
161
219
135
34
203
200
206
207
204
205
254
j=5
212
222
213
220
221
141
216
88
38
137
136
217
133
132
201
197
j=6
196
76
77
243
192
195
193
119
18
108
121
249
247
245
130
240
j=7
241
235
69
68
210
226
227
224
25
79
211
250
251
255
198
242
j=8
194
127
45
46
117
125
124
67
101
72
13
97
65
j=9
66
98
96
158
248
228
229
93
31
109
100
110
111
75
73
115
j=10 114
23
94
95
22
230
99
81
20
21
83
113
j=11 102
14
103
15
12
16
52
17
19
29
28
30
31
11
j=12
126
24
18
26
25
27
86
87
64
71
70
153
152
173
162
j=13 167
164
89
80
238
208
232
233
15
236
239
92
202
209
199
143
j=14 142
140
223
84
78
171
151
156
37
150
149
214
106
107
10
53
j=15
54
55
104
105
123
48
58
59
49
50
51
62
60
61
63
47
43
i=1
i=2
i=3
i=4
i=5
i=6
i=7
i=8
j=0
122
150
201
213
214
212
246
244
136
135
130
134
140
216
218
146
j=1
209
208
110
153
156
158
152
144
147
145
199
197
193
204
205
249
j=2
248
127
185
172
174
173
179
169
168
181
182
177
183
161
178
167
j=3
163
162
160
165
232
234
235
233
237
236
238
239
78
155
154
75
j=4
74
69
68
71
70
73
72
139
123
114
59
58
113
170
137
141
j=5
186
133
151
132
148
159
196
149
187
189
191
157
198
190
121
126
j=6
125
120
57
112
171
63
45
62
47
46
99
109
98
103
124
102
j=7
107
106
129
128
131
89
118
105
104
95
91
90
94
86
83
82
j=8
138
119
117
88
116
92
37
36
38
32
96
87
81
39
93
49
j=9
48
23
22
21
20
29
56
55
44
175
60
61
53
43
143
j=10 115
176
180
142
166
184
188
207
206
231
252
195
194
219
217
221
j=11 223
211
215
210
192
200
164
226
202
203
230
227
229
253
255
224
j=12 225
243
228
220
240
241
242
222
245
251
250
247
254
111
108
77
j=13
76
79
101
100
30
15
97
52
54
41
50
51
40
33
42
35
j=14
34
13
12
14
85
84
25
31
27
80
17
67
66
16
28
j=15
24
65
11
10
26
64
19
18
i=1
i=2
i=3
i=4
i=5
i=6
i=7
i=8
j=0
243
154
190
170
189
169
140
188
171
168
166
164
165
216
146
132
j=1
242
234
158
144
252
147
148
13
74
75
19
77
76
79
78
70
j=2
71
198
199
72
12
73
220
150
151
131
204
133
205
230
229
225
j=3
200
201
226
227
196
192
197
195
202
203
136
143
142
175
207
206
j=4
137
138
139
174
173
172
240
228
209
224
238
239
128
130
145
250
j=5
236
237
255
248
253
232
153
179
181
180
235
233
152
156
184
178
j=6
185
162
183
182
163
160
161
167
157
159
155
134
186
191
135
187
j=7
141
254
177
251
176
33
244
245
61
60
34
35
32
249
126
41
j=8
62
59
58
63
57
56
42
43
25
29
40
125
120
24
26
127
j=9
123
124
122
48
121
46
47
105
104
109
108
28
116
112
113
247
j=10 246
36
54
149
55
52
53
49
37
39
51
50
107
211
215
j=11
20
21
17
223
222
231
129
218
217
219
194
221
j=12 193
210
38
212
14
15
241
44
208
45
11
10
110
j=13
68
69
103
100
213
214
16
106
18
111
22
23
64
65
j=14
27
118
117
102
31
30
97
114
115
96
88
89
90
91
99
94
j=15
80
82
83
119
98
101
92
95
66
67
84
93
85
81
86
87
44
C.2.1.3
The speech parameter format of the high and low bit-rate modes after gain decompression and
LPC reordering is given in Tables C.2a and C.2b (UB means "unused bit").
Table C.2a/G.723.1 Format of decompressed high rate codec speech parameters
Channel encoder octets
PARx_By
R_LPC_B13...R_LPC_B6
R_LPC_B21...R_LPC_B14
AGAIN1_B3...AGAIN1_B0, AGAIN0_B7...AGAIN0_B4
AGAIN2_B3...AGAIN2_B0, AGAIN1_B7...AGAIN1_B4
AGAIN3_B3...AGAIN3_B0, AGAIN2_B7...AGAIN2_B4
10
FGAIN0_B3...FGAIN0_B0, AGAIN3_B7...AGAIN3_B4
11
12
FGAIN3_B4...FGAIN3_B0, FGAIN2_B4...FGAIN2_B2
13
14
MSBPOS_B11...MSBPOS_B4
15
POS0_B6...POS0_B0, MSBPOS_B12
16
POS0_B14...POS0_B7
17
POS1_B6...POS1_B0, POS0_B15
18
POS2_B0, POS1_B13...POS1_B7
19
POS2_B8...POS2_B1
20
POS3_B0, POS2_B15...POS2_B9
21
POS3_B8...POS3_B1
22
PSIG0_B2...PSIG0_B0, POS3_B13...POS3_B9
23
PSIG1_B4...PSIG1_B0, PSIG0_B5...PSIG0_B3
24
25
45
C.2.2
PARx_By
R_LPC_B13...R_LPC_B6
R_LPC_B21...R_LPC_B14
AGAIN1_B3...AGAIN1_B0 , AGAIN0_B7...AGAIN0_B4
AGAIN2_B3...AGAIN2_B0 , AGAIN1_B7...AGAIN1_B4
AGAIN3_B3...AGAIN3_B0 , AGAIN2_B7...AGAIN2_B4
10
FGAIN0_B3...FGAIN0_B0 , AGAIN3_B7...AGAIN3_B4
11
12
FGAIN3_B4...FGAIN3_B0 , FGAIN2_B4...FGAIN2_B2
13
14
POS0_B11...POS0_B8 , POS0_B7...POS0_B4
15
POS1_B7...POS1_B4 , POS1_B3...POS1_B0,
16
POS2_B3...POS2_B0 , POS1_B11...POS1_B8
17
POS2_B11...POS2_B8 , POS2_B7...POS2_B4
18
POS3_B7...POS3_B4 , POS3_B3...POS3_B0
19
PSIG0_B3...PSIG0_B0 , POS3_B11...POS3_B8
20
PSIG2_B3...PSIG2_B0 , PSIG1_B3...PSIG1_B0
21
The bits of the decompressed speech parameters specified in Tables C.2a and C.2b are rearranged
according to Table C.3a for the high bit-rate mode and Table C.3b for the low bit-rate mode.
The numbers indicate the position of the corresponding information bit in the ordered bitstream.
The subjectively most sensitive bit is copied to position "0" of the ordered bitstream.
The ordered bitstream is denoted with:
i(0), i(0), i(92)
for the high rate codec, and:
i(0), i(1), ... i(162)
for the low rate codec.
46
175
180
189
190
191
192
VAD
Rate
98
73
107
154
167
168
169
170
30
17
16
31
48
55
49
71
11
26
10
14
12
27
24
60
44
66
62
82
25
61
45
67
63
83
78
36
50
40
46
68
64
84
79
37
51
41
47
69
65
85
80
38
52
42
10
56
99
159
185
81
39
53
43
11
161
187
20
57
100
160
186
19
12
22
59
102
162
188
21
58
101
13
35
54
70
72
179
178
177
176
14
13
15
23
28
29
32
33
34
15
128
132
146
155
163
171
181
18
16
76
86
90
94
103
108
112
116
17
129
133
147
156
164
172
182
74
18
183
87
91
95
104
109
113
117
19
114
118
130
134
148
157
165
173
20
184
75
77
88
92
96
105
110
21
115
119
131
135
149
158
166
174
22
136
124
120
89
93
97
106
111
23
151
141
137
125
121
144
150
140
24
127
123
145
152
142
138
126
122
25
UB
UB
UB
UB
UB
153
143
139
47
C.2.3
152
153
158
159
160
161
VAD
Rate
69
64
70
91
145
140
147
146
24
15
14
25
46
50
47
63
11
18
10
13
12
19
16
48
42
59
55
65
17
49
43
60
56
66
26
30
34
38
44
61
57
67
27
31
35
39
45
62
58
68
28
32
36
40
10
51
87
141
154
29
33
37
41
11
143
156
21
52
88
142
155
20
12
23
54
90
144
157
22
53
89
13
100
128
96
104
151
150
149
148
14
132
112
116
136
108
120
124
92
15
109
121
125
93
101
129
97
105
16
102
130
98
106
133
113
117
137
17
134
114
118
138
110
122
126
94
18
111
123
127
95
103
131
99
107
19
83
79
75
71
135
115
119
139
20
85
81
77
73
84
80
76
72
21
UB
UB
UB
UB
86
82
78
74
The format of the SID frame type is specified in Annex A. The bits of this frame are rearranged
according to Table C.3c. For SID frames neither a gain decompression nor a LPC reordering is
performed. The numbers indicate the position of the corresponding information bit in the ordered
bitstream. The subjectively most sensitive bit is copied to position "0" of the ordered bitstream.
Table C.3c/G.723.1 Bit ordering table for SID frames
Octet
21
22
23
24
25
26
VAD
Rate
13
14
15
16
17
18
20
19
10
11
12
29
28
27
CRC encoder
The subjectively most sensitive bits are protected by five parity bits used for error detection in the
channel decoder.
48
The first CRCWIN bits of the ordered bitstream are within the CRC window:
Low rate codec: CRCWIN = 34
High rate codec: CRCWIN = 44
CRCWIN = 30
SID frames:
The following generator polynomial for the cyclic code shall be applied:
g ( D) = D 5 + D 2 + 1
The five parity bits p(0) to p(4) are copied right after the CRCWIN ordered information bits.
The bitstream after CRC encoding shall therefore have the following format:
Low rate codec: bs(n) = i(0),i(1), ... i(33),p(0),p(1),p(2),p(3),p(4),i(34),i(35), ... i(161); n = 0 ... 166
High rate codec: bs(n) = i(0),i(1), ... i(43),p(0),p(1),p(2),p(3),p(4),i(44),i(45), ... i(192); n = 0 ... 197
SID frame:
C.2.5
Convolutional encoder
The information bits bs(n) shall be encoded with punctured convolutional codes defined by the
mother polynomials:
0
= D4 + D + 1
g CC
1
= D4 + D3 + D2 + 1
g CC
2
g CC
= D4 + D2 + D + 1
Puncturing of the mother code with a puncturing period of 12 generates 24 intermediate rates:
12
12
r
36
13
The puncturing matrices are given in Table C.4.
Table C.4/G.723.1 Puncturing tables (all values in hexadecimal representation)
Rate r 12/13
12/14
12/15
12/16
12/17
12/18
12/19
12/20
12/21
12/22
12/23
12/24
Pr(0)
D6F
D7F
D7F
D7F
DFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
Pr(1)
690
690
691
695
695
695
69D
6DD
6DF
7DF
7FF
FFF
Pr(2)
000
000
000
000
000
000
000
000
000
000
000
000
Rate r 12/25
12/26
12/27
12/28
12/29
12/30
12/31
12/32
12/33
12/34
12/35
12/36
Pr(0)
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
Pr(1)
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
FFF
Pr(2)
001
009
109
309
329
729
72D
72F
7AF
7BF
7FF
FFF
49
C.2.5.1
In this clause the resource allocation algorithm is described which shall be applied for determining
the degree of protection for each information bit. This algorithm has to be executed in the channel
encoder and channel decoder each time at least one of the following events occur:
Reset;
The ordered bitstream shall be divided in five-bit classes c[i], i = 0...4, where c[0] contains
the most sensitive and c[4] the least sensitive bits.
A weighting factor w[i], i = 0...4 shall be assigned to each class, controlling the rate
allocation algorithm.
Class 0 contains c[0]5 information bits and the five parity check code bits which are
copied to the end of that class.
If no channel bits ( B = 0 ) are available, the channel bitstream shall contain the
decompressed G.723.1 bitstream and the five parity check bits.
The maximum number of channel codec bits per frame is achieved, when all classes have
the protection rate r = 1/3.
i =4
Bmax = 2 c[ i ] + 4
i =0
If:
B c[0] + 8
channel codec bits are available per frame, these shall be equally distributed over the bits of
class 0, applying the formula:
r[0] =
12
INT12 1 +
c[0] + 4
r [i ] =
12
B w[i ]
1
r[i ] =
r [i ]
50
12
18
otherwise
if r [i ] >
If the above algorithm allocates less bits than available, in the second pass bits from the
first unprotected class shall be shifted to the last protected class until all available bits are
allocated.
If more bits than available have been allocated, in the second pass bits from the last
protected class shall be shifted to the first unprotected class until the maximum allowed
channel codec bit-rate is achieved.
In the original configuration, class 3 contains the tail bits. If this class receives no
protection applying the above formula, the tail bits shall be attached to the last protected
class.
The following settings of c[i] and w[i] in Table C.5a shall be chosen:
Table C.5a/G.723.1 Settings of control parameters for rate allocation (active speech frames)
G.723.1 6.3 kbit/s
c[i]
44+5crc
44
46
47+4tails
12
34+5crc
40
40
40+4tails
w[i]
0.26
0.29
0.24
0.21
0.00
0.24
0.31
0.24
0.21
0.00
The following settings of c[i] and w[i] in Table C.5b shall be chosen:
Table C.5b/G.723.1 Settings of control parameters for rate allocation (SID frames)
G.723.1 SID
i
c[i]
30+5crc+4tails
w[i]
1.0
0.00
0.00
0.00
0.00
C.2.5.2
To take the lower error probability of the last protected bits into account, the bits:
bs(Nlastn); n = 0 ... 9
shall be exchanged with the bits:
bs(Nlast19 + n); n = 0 ... 9
Nlast is the last protected information bit in the bitstream.
This operation shall only be performed if more than 19 information bits are protected in the
classes 1 to 3.
C.2.5.3
Each bit bs(j) which receives protection ( r i < 1) shall be encoded applying the following steps:
n
. This
Encoding of information bit bs(j) of class i with the three generator polynomials g CC
results in three intermediate channel bits u(n) at the output (n = 0,1,2).
Puncturing of u(n) with Pr[i ] (n) in the following way:
If bit p of Pr[i ] (n) equals 0, then u(n) is discarded; otherwise, it is inserted in the channel
bitstream u(m) and m is incremented. p is a modulo 12 counter which is incremented with
the information bit counter j.
51
After encoding of the protected bits, the channel bitstream is contained in array u(m),
m = 0..Mp 1.
The unprotected bits shall be copied to the end of the protected channel bitstream, starting at
location m = MP and ending at location m = MAll 1.
C.2.5.4
Three channel codec configuration bits shall be added to the two source codec configuration bits as
shown in Table C.6. With this it is possible to add to each of the G.723.1 source codec modes (high
rate, low rate, SID) one of 8 different channel codec bit-rates. The choice of the channel codec
bit-rates depends on the transmission system and is therefore not subject of this annex.
Table C.6/G.723.1 G.723.1 and channel codec configuration bits
Octet
Configuration bits
UB
UB
UB
UB
UB
ccc
VAD
RATE
Channel bit
ucb[7]
ucb[6]
ucb[5]
ucb[4]
ucb[3]
ucb[2]
ucb[1]
ucb[0]
u[2]
u[1]
u[0]
ucb[12]
ucb[11]
ucb[10]
ucb[9]
ucb[8]
u[10]
u[9]
u[8]
u[7]
u[6]
u[5]
u[4]
u[3]
M p /8 + 2
...
...
u[ M p ]
u[ M p 1]
u[ M p 2]
...
...
...
...
...
...
...
...
...
...
...
...
M All /8 + 2
UB
UB
UB
u[ M All 1]
u[ M All 2]
Channel decoder
C.3.1
1000
bits per second.
30
The first 13 bits shall be decoded to obtain the speech and channel codec configurations. The rate
allocation algorithm described in C.2.5.1 is executed after reset or after a change in one of the
configuration bits.
52
C.3.2
The speech parameter format of the channel codec shall be adapted to the speech parameter format
of G.723.1.
C.3.2.1
If the speech decoder operates either in high or low bit-rate mode the gain compression operation as
described in 2.17 and 2.18 shall be performed to generate the bitstream format required by the
speech decoder. One 8-bit (AGAINx_By) and one 5-bit (FGAINx_By) gain index per subframe are
compressed to form one 12-bit compressed gain index. If forbidden indices occur either in the
decoded adaptive codebook gain index or in the decoded fixed excitation gain index the forbidden
index, indication flag (FII) shall be set and the indices shall be replaced by suitable valid values.
The FII flag shall also be set, if a forbidden lag index is decoded (see 2.18). The decoded lag shall
then be replaced by a suitable valid value.
C.3.2.2
LPC ordering
The three LPC sub-vectors shall be ordered applying the inverse tables of C.2.1.2.
[ ]
R
em = ReorderTabinv
m em
C.3.3
m = 0,1,2
Error analysis
Two flags indicating the reliability of the received G.723.1 speech parameters shall be generated in
the channel decoder:
1)
Bad Frame Indication (BFI) which signals uncorrected errors within the bits of the
CRC window (CRCWIN).
2)
Erroneous Frame Indication (EFI) which signals uncorrected errors outside the bits covered
by the CRC code. This flag shall only be evaluated if information bits outside the
CRC window are protected.
These flags shall be made available to the speech decoder together with the speech parameters and
the FII flag. In the case of received SID frames, only the BFI flag shall be determined.
C.3.4
The speech parameters shall be transmitted from the channel decoder to the speech decoder in the
format specified in Tables 5 and 6 with the error flags appended to the end. This format is shown in
Table C.8a for the high bit-rate mode, in Table C.8b for the low bit-rate mode and in Table C.8c for
SID frames.
53
54
PARx_By, ....
LPC_B13...LPC_B6
LPC_B21...LPC_B14
GAIN0_B11...GAIN0_B4
GAIN1_B7...GAIN1_B0
GAIN2_B3...GAIN2_B0, GAIN1_B11...GAIN1_B8
10
GAIN2_B11...GAIN2_B4
11
GAIN3_B7...GAIN3_B0
12
13
MSBPOS_B6...MSBPOS_B0, UB
14
15
POS0_B9...POS0_B2
16
17
POS1_B10...POS1_B3
18
POS2_B3...POS2_B0, POS1_B13...POS1_B11
19
POS2_B11...POS2_B4
20
POS3_B3...POS3_B0, POS2_B15...POS2_B12
21
POS3_B11...POS3_B4
22
23
PSIG2_B2...PSIG2_B0, PSIG1_B4...PSIG1_B0
24
PSIG3_B4...PSIG3_B0, PSIG2_B5...PSIG2_B3
25
PARx_By, ....
LPC_B13...LPC_B6
LPC_B21...LPC_B14
GAIN0_B11...GAIN0_B4
GAIN1_B7...GAIN1_B0
GAIN2_B3...GAIN2_B0, GAIN1_B11...GAIN1_B8
10
GAIN2_B11...GAIN2_B4
11
GAIN3_B7...GAIN3_B0
12
13
POS0_B7...POS0_B0
14
POS1_B3...POS1_B0, POS0_B11...POS0_B8
15
POS1_B11...POS1_B4
16
POS2_B7...POS2_B0
17
POS3_B3...POS3_B0, POS2_B11...POS2_B8
18
POS3_B11...POS3_B4
19
PSIG1_B3...PSIG1_B0, PSIG0_B3...PSIG0_B0
20
PSIG3_B3...PSIG3_B0, PSIG2_B3...PSIG2_B0
21
PARx_By, ....
LPC_B13...LPC_B6
LPC_B21...LPC_B14
55
C.3.5
Error concealment
The flags produced by the channel decoder and by the gain compression operation shall be used in
the speech decoder for error concealment purposes.
C.3.6
This clause specifies the minimum required error analysis performance in terms of bad frame
detection probability (PDBFI) and erroneous frame detection probability (PDEFI). The requirements
contained in Table C.9 shall apply to all transmission channels.
Table C.9/G.723.1 Detection performance of BFI and EFI flags
C.4
G.723.1 SID
PDBFI
0.99
0.99
0.99
PDEFI
0.65
0.65
---
All details of the channel encoder algorithm are described in the form of a fixed point ANSI-C
source code. In the case of any discrepancy between the above description and the C source code,
the latter has to be adopted as reference. This C source code is a part of ITU-T Rec. G.723.1.
Test vectors are included to check the bit-exactness of an implementation compared to the C source
code.
56
Series D
Series E
Overall network operation, telephone service, service operation and human factors
Series F
Series G
Series H
Series I
Series J
Cable networks and transmission of television, sound programme and other multimedia signals
Series K
Series L
Construction, installation and protection of cables and other elements of outside plant
Series M
Series N
Series O
Series P
Series Q
Series R
Telegraph transmission
Series S
Series T
Series U
Telegraph switching
Series V
Series X
Series Y
Series Z
Printed in Switzerland
Geneva, 2007