CERTIFICATE
This is to certify that the work titled DSP Tools in Wireless Communication submitted by Piyush Virmani & Palash Relan in partial fulfilment for the award of degree B.TECH of Jaypee Institute of Information Technology University, Noida has been carried out under my supervision. This work has not been submitted partially or wholly to any other University or Institute for the award of this or any other degree or diploma.
ACKNOWLEDGEMENT
We are highly obliged to our project supervisor, Mr. Hemant Kumar Meena, for assigning this work of study on the topic DSP Tools in Wireless Communication, which has helped us develop an understanding of speech processing. We are grateful to him for all his time, assistance and guidance, which motivated us to work on this topic and without which our major project would not have seen its end. We are also thankful to the external examiners, Mr. R.K. Dubey and Mr. V.K. Dwivedi, who helped us build a better understanding of the matter.
CONTENTS
1. Certificate
2. Acknowledgement
3. Contents
4. Abstract
     i. Wireless Communication for Voice Transmission
     ii. Digital Speech Processing
5. Application of Digital Speech Processing
     i. Speech Coding
     ii. Text to Speech Synthesis
     iii. Speech Recognition and Pattern Matching
     iv. Other Applications
6. Human Speech
7. Properties of Speech
8. Speech Analysis
     i. Short Term Energy
     ii. Short Term Zero Crossing
     iii. Short Term Autocorrelation Function
9. General Encoding of Arbitrary Waveforms
     i. Types of Vocoders
     ii. Vocoder Quality Measurement
10. Linear Predictive Analysis
     i. Introduction
     ii. LPC Model
     iii. LPC Analysis
          i. Input Speech
          ii. Pitch Period Estimation
          iii. Vocal Tract Filter
          iv. Voiced/Unvoiced Determination
          v. Levinson-Durbin Algorithm
     iv. LPC Synthesis/Decoding
     v. Transmission of Parameters
     vi. Applications of LPC
11. Full LPC Model and Implementation
     i. LPC Encoder Model
     ii. LPC Decoder Model
     iii. MATLAB Implementation
12. Discussion and Conclusion
13. References
Abstract
Wireless Communication for Voice Transmission
Wireless communications operators see phenomenal growth in consumer demand for high-quality, low-cost services. Since the physical spectrum for wireless services is limited, operators and equipment suppliers continually find ways to optimise bandwidth efficiency. Digital communications technology provides an efficiency advantage over analog wireless communications: multiplexing and filtering are easier, components are cheaper, encryption is more secure, and network management is simpler. Additionally, digital technology provides more value-added services to customers (security, combined text and voice messaging, etc.). Today wireless communication is primarily voice. The operator meets the increasing need for services by combining digital technology and special encoding techniques for voice. These encoders ("vocoders") take advantage of predictable elements in human speech. Several low-data-rate encoders are described here with an assessment of their subjective quality. Test methods to determine voice quality are necessarily subjective. The most efficient vocoders have acceptable quality levels and data rates between 2 and 8 kbit/s. Higher-data-rate encoders (8-13 kbit/s) have improved quality, while 32 kbit/s coders have excellent quality (but use more network resources). The operator must engineer the proper balance between cost, quality and available resources to provide the optimum solution to the customer.
Speech Coding
Perhaps the most widespread applications of digital speech processing technology occur in the areas of digital transmission and storage of speech signals. In these areas the centrality of the digital representation is obvious, since the goal is to compress the digital waveform representation of speech into a lower bit-rate representation. It is common to refer to this activity as speech coding or speech compression.
Speech coders enable a broad range of applications including narrowband and broadband wired telephony, cellular communications, voice over internet protocol (VoIP) (which utilizes the internet as a real-time communications medium), secure voice for privacy and encryption (for national security applications), extremely narrowband communications channels (such as battlefield applications using high frequency (HF) radio), and for storage of speech for telephone answering machines, interactive voice response (IVR) systems, and pre-recorded messages. Speech coders often utilize many aspects of both the speech production and speech perception processes, and hence may not be useful for more general audio signals such as music. Coders that are based on incorporating only aspects of sound perception generally do not achieve as much compression as those based on speech production, but they are more general and can be used for all types of audio signals. These coders are widely deployed in MP3 and AAC players and for audio in digital television systems.
Text-to-Speech Synthesis
For many years, scientists and engineers have studied the speech production process with the goal of building a system that can start with text and produce speech automatically. In a sense, a text-to-speech synthesizer such as the one depicted in the figure is a digital simulation of the entire upper part of the speech chain diagram.
Text to Speech Synthesis Block Diagram

The input to the system is ordinary text such as an email message or an article from a newspaper or magazine. The first block in the text-to-speech synthesis system, labelled linguistic rules, has the job of converting the printed text input into a set of sounds that the machine must synthesize. The conversion from text to sounds involves a set of linguistic rules that must determine the appropriate set of sounds (perhaps including things like emphasis, pauses, rates of speaking, etc.) so that the resulting synthetic speech will express the words and intent of the text message in what passes for a natural voice that can be decoded accurately by human speech perception. Once the proper pronunciation of the text has been determined, the role of the synthesis algorithm is to create the appropriate sound sequence to represent the text message in the form of speech. In essence, the synthesis algorithm must simulate the action of the vocal tract system in creating the sounds of speech.
Speech Recognition and Pattern Matching

The first block in the pattern matching system converts the analog speech waveform to digital form using an A-to-D converter. The feature analysis module converts the sampled speech signal to a set of feature vectors. Often, the same analysis techniques that are used in speech coding are also used to derive the feature vectors. The final block in the system, namely the pattern matching block, dynamically time aligns the set of feature vectors representing the speech signal with a concatenated set of stored patterns, and chooses the identity associated with the pattern which is the closest match to the time-aligned set of feature vectors of the speech signal. The symbolic output consists of a set of recognized words, in the case of speech recognition; the identity of the best matching talker, in the case of speaker recognition; or a decision as to whether to accept or reject the identity claim of a speaker, in the case of speaker verification.
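The dynamic time alignment mentioned above is typically performed with dynamic time warping (DTW). As a rough illustration (a Python sketch, not part of this report's MATLAB implementation), a DTW distance between two feature-vector sequences can be computed as:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature-vector
    sequences a (n x d) and b (m x d), with Euclidean local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Best way to reach (i, j): match, insertion, or deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy one-dimensional "feature" sequences (illustrative, not real speech).
ref = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
stretched = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]])
other = np.array([[5.0], [5.0], [5.0], [5.0], [5.0]])
```

A time-stretched copy of a pattern aligns at low cost, while an unrelated pattern does not, which is what lets the recognizer pick the closest stored template.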
The major areas where such a system finds applications include command and control of computer software, voice dictation to create letters, memos, and other documents, natural language voice dialogues with machines to enable help desks and call centres, and for agent services such as calendar entry and update, address list modification and entry, etc.
Human Speech
The fundamental purpose of speech is communication, i.e., the transmission of messages. According to Shannon's information theory, a message represented as a sequence of discrete symbols can be quantified by its information content in bits, and the rate of transmission of information is measured in bits/second (bps). In speech production, as well as in many human-engineered electronic communication systems, the information to be transmitted is encoded in the form of a continuously varying (analog) waveform that can be transmitted, recorded, manipulated, and ultimately decoded by a human listener. In the case of speech, the fundamental analog form of the message is an acoustic waveform, which we call the speech signal. Speech signals, as illustrated in Figure 1.1, can be converted to an electrical waveform by a microphone, further manipulated by both analog and digital signal processing, and then converted back to acoustic form by a loudspeaker, a telephone handset or headphone, as desired. This form of speech processing is, of course, the basis for Bell's telephone invention as well as today's multitude of devices for recording, transmitting, and manipulating speech and audio signals.
Properties of Speech
The two types of speech sounds, voiced and unvoiced, produce different sounds and spectra due to their differences in sound formation. With voiced speech, air pressure from the lungs forces the normally closed vocal cords to open and vibrate. The vibrational frequency (pitch) varies from about 50 to 400 Hz (depending on the person's age and sex) and forms resonances in the vocal tract at odd harmonics. These resonance peaks are called formants and can be seen in the voiced speech figures below.
Unvoiced sounds, called fricatives (e.g., s, f, sh) are formed by forcing air through an opening (hence the term, derived from the word friction). Fricatives do not vibrate the vocal cords and therefore do not produce as much periodicity as seen in the formant structure in voiced speech; unvoiced sounds appear more noise-like (see figures 3 and 4 below). Time domain samples lose periodicity and the power spectral density does not display the clear resonant peaks that are found in voiced sounds.
The spectrum for speech (combined voiced and unvoiced sounds) has a total bandwidth of approximately 7000 Hz, with the average energy concentrated around 3000 Hz. The auditory canal optimizes speech detection by acting as a resonant cavity at this average frequency. Note that the power of speech spectra and the periodic nature of formants diminish drastically above 3500 Hz. Speech encoding algorithms can be less complex than general audio encoding by concentrating (through filters) on this region. Furthermore, since line-quality telephone channels employ filters that pass frequencies only up to 3000-4000 Hz, the high frequencies produced by fricatives are removed. A caller will often have to spell or otherwise distinguish these sounds to be understood (e.g., "F as in Frank").
Speech Analysis
Since our goal is to extract the parameters of the model by analysis of the speech signal, it is common to assume structures (or representations) for both the excitation generator and the linear system. One such model uses a more detailed representation of the excitation, with separate source generators for voiced and unvoiced speech, as shown in the figure.
In this model the unvoiced excitation is assumed to be a random noise sequence, and the voiced excitation is assumed to be a periodic impulse train with impulses spaced by the pitch period (P0) rounded to the nearest sample. The pulses needed to model the glottal flow waveform during voiced speech are assumed to be combined (by convolution) with the impulse response of the linear system, which is assumed to be slowly time-varying (changing every 50-100 ms or so). By this we mean that over the timescale of phonemes, the impulse response, frequency response, and system function of the system remain relatively constant. For example, over time intervals of tens of milliseconds, the system can be described by the convolution expression

s[n] = Σm hn[m] e[n-m]

where the subscript n denotes the time index pointing to the block of samples of the entire speech signal s[n] wherein the impulse response hn[m] applies. We use n for the time index within that interval, and m is the index of summation in the convolution sum. To simplify analysis, it is often assumed that the system is an all-pole system with system function of the form:

H(z) = G / (1 - Σk=1..M ak z^(-k))
Although the linear system is assumed to model the composite spectrum effects of radiation, vocal tract tube, and glottal excitation pulse shape (for voiced speech only) over a short time interval, the linear system in the model is commonly referred to as simply the vocal tract system and the corresponding impulse response is called the vocal tract impulse response. For all-pole linear systems, as represented by the equation, the input and output are related by a difference equation of the form:

s[n] = Σk=1..M ak s[n-k] + G e[n]
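The all-pole difference equation s[n] = Σk ak s[n-k] + G e[n] can be simulated sample by sample. The following Python sketch (with illustrative coefficients, not values from this report) shows a single resonance, i.e., one formant, ringing in response to an impulse:

```python
import numpy as np

def all_pole_synthesize(a, gain, excitation):
    """Run an excitation through the all-pole vocal tract model:
    s[n] = sum_k a[k] * s[n-k] + gain * e[n]."""
    M = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, M + 1):
            if n - k >= 0:
                acc += a[k - 1] * s[n - k]
        s[n] = acc
    return s

# One complex pole pair (radius 0.95) gives a damped resonance:
# the impulse response oscillates and decays, like a single formant.
a = [2 * 0.95 * np.cos(2 * np.pi * 0.05), -0.95 ** 2]
impulse = np.zeros(100)
impulse[0] = 1.0
response = all_pole_synthesize(a, 1.0, impulse)
```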
The short-time energy is defined as a weighted sum of the squared signal samples within a sliding analysis window:

En = Σm (s[m] w[n-m])²

Similarly, the short-time zero-crossing rate is defined as the weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to:

Zn = Σm (1/2)|sgn(s[m]) - sgn(s[m-1])| w[n-m]
The short-time energy and short-time zero-crossing rate are important because they abstract valuable information about the speech signal, and they are simple to compute. The short-time energy is an indication of the amplitude of the signal in the interval around the analysis time. From our model, we expect unvoiced regions to have lower short-time energy than voiced regions. Similarly, the short-time zero-crossing rate is a crude frequency analyzer. Voiced signals show a high-frequency (HF) fall-off due to the lowpass nature of the glottal pulses, while unvoiced sounds have much more HF energy. Thus, the short-time energy and short-time zero-crossing rate can be the basis for an algorithm for deciding whether the speech signal is voiced or unvoiced at a particular time.
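As a sketch of this voiced/unvoiced evidence (in Python, using illustrative synthetic signals rather than real speech), a low-frequency tone stands in for voiced speech and low-level noise for unvoiced speech:

```python
import numpy as np

def short_time_energy(x, frame_len):
    """Sum of squared samples within each non-overlapping frame."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)

def short_time_zcr(x, frame_len):
    """Fraction of sign changes per sample within each frame."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) / 2, axis=1)

fs = 8000
t = np.arange(fs) / fs
voiced_like = np.sin(2 * np.pi * 150 * t)        # low frequency, high amplitude
rng = np.random.default_rng(0)
unvoiced_like = 0.1 * rng.standard_normal(fs)    # noise-like, low amplitude

frame = 180
# Voiced-like frames: higher energy and lower zero-crossing rate.
```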
Introduction
There exist many different types of speech compression that make use of a variety of different techniques. However, most methods of speech compression exploit the fact that speech production occurs through slow anatomical movements and that the speech produced has a limited frequency range. The frequency of human speech production ranges from around 300 Hz to 3400 Hz. Speech compression is often referred to as speech coding, which is defined as a method for reducing the amount of information needed to represent a speech signal. Most forms of speech coding are based on a lossy algorithm. Lossy algorithms are considered acceptable when encoding speech because the loss of quality is often undetectable to the human ear.

There are many other characteristics of speech production that can be exploited by speech coding algorithms. One fact that is often used is that periods of silence take up more than 50% of conversations. An easy way to save bandwidth and reduce the amount of information needed to represent the speech signal is to not transmit the silence. Another fact about speech production that can be taken advantage of is that mechanically there is a high correlation between adjacent samples of speech.

Most forms of speech compression are achieved by modelling the process of speech production as a linear digital filter. The digital filter and its slowly changing parameters are usually encoded to achieve compression of the speech signal. Linear Predictive Coding (LPC) is one of the methods of compression that models the process of speech production. Specifically, LPC models this process as a linear sum of earlier samples using a digital filter driven by an excitation signal. An alternate explanation is that linear prediction filters attempt to predict future values of the input signal based on past samples. LPC models speech as an autoregressive process, and sends the parameters of the process as opposed to sending the speech itself.
All vocoders, including LPC vocoders, have four main attributes: bit rate, delay, complexity, and quality. Any voice coder, regardless of the algorithm it uses, has to make trade-offs between these attributes.

The first attribute, the bit rate, determines the degree of compression that a vocoder achieves. Uncompressed speech is usually transmitted at 64 kb/s using 8 bits/sample and a sampling rate of 8 kHz. Any bit rate below 64 kb/s is considered compression. The linear predictive coder transmits speech at a bit rate of 2.4 kb/s, an excellent rate of compression.

Delay is another important attribute for vocoders that are involved with the transmission of an encoded speech signal. Vocoders that are used for storage of compressed speech, as opposed to transmission, are not as concerned with delay. The general delay standard for transmitted speech conversations is that any delay greater than 300 ms is considered unacceptable.

The third attribute of voice coders is the complexity of the algorithm used. The complexity affects both the cost and the power consumption of the vocoder. Linear predictive coding, because of its high compression rate, is very complex and involves executing millions of instructions per second.

The general algorithm for linear predictive coding involves an analysis or encoding part and a synthesis or decoding part. In the encoding, LPC takes the speech signal in blocks or frames and determines the input signal and the coefficients of the filter that will be capable of reproducing the current block of speech. This information is quantized and transmitted. In the decoding, LPC rebuilds the filter based on the coefficients received. The filter can be thought of as a tube which, when given an input signal, attempts to output speech. Additional information about the original speech signal is used by the decoder to determine the input or excitation signal that is sent to the filter for synthesis.
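The bit-rate figures above can be checked directly from the numbers quoted in this report (8000 samples/s at 8 bits/sample uncompressed; 180-sample frames carrying 54 bits each for LPC-10):

```python
# Compression arithmetic using the figures quoted in this report.
fs = 8000                                        # samples per second
bits_per_sample = 8
uncompressed_bps = fs * bits_per_sample          # 64000 b/s

frame_samples = 180
frame_seconds = frame_samples / fs               # 0.0225 s (22.5 ms) per frame
bits_per_frame = 54                              # bits transmitted per segment
lpc_bps = bits_per_frame / frame_seconds         # 2400 b/s

compression_ratio = uncompressed_bps / lpc_bps   # about 26.7x
```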
LPC Model
The particular source-filter model used in LPC is known as the linear predictive coding model. It has two key components: analysis or encoding and synthesis or decoding. The analysis part of LPC involves examining the speech signal and breaking it down into segments or blocks. Each segment is then examined further to find the answers to several key questions: Is the segment voiced or unvoiced? What is the pitch of the segment? What parameters are needed to build a filter that models the vocal tract for the current segment?
LPC analysis is usually conducted by a sender who answers these questions and usually transmits these answers onto a receiver. The receiver performs LPC synthesis by using the answers received to build a filter that when provided the correct input source will be able to accurately reproduce the original speech signal.
Essentially, LPC synthesis tries to imitate human speech production. The figure demonstrates which parts of the receiver correspond to which parts of the human anatomy. This diagram is for a general voice or speech coder and is not specific to linear predictive coding. All voice coders tend to model two things: excitation and articulation. Excitation is the type of sound that is passed into the filter or vocal tract, and articulation is the transformation of the excitation signal into speech.
LPC Analysis/Encoding
Input Speech

The input signal is sampled at a rate of 8000 samples per second. This input signal is then broken up into segments or blocks, which are each analysed and transmitted to the receiver. The 8000 samples in each second of speech signal are broken into 180-sample segments. This means that each segment represents 22.5 milliseconds of the input speech signal.

Voiced/Unvoiced Determination

According to the LPC-10 standard, before a speech segment is determined as being voiced or unvoiced it is first passed through a low-pass filter with a bandwidth of 1 kHz. Determining if a segment is voiced or unvoiced is important because voiced sounds have a different waveform than unvoiced sounds. The differences in the two waveforms create a need for two different input signals for the LPC filter in the synthesis or decoding. One input signal is for voiced sounds and the other is for unvoiced. The LPC encoder notifies the decoder whether a signal segment is voiced or unvoiced by sending a single bit. Recall that voiced sounds are usually vowels and can be considered pulses similar to periodic waveforms. These sounds have high average energy levels, which means that they have very large amplitudes. Voiced sounds also have distinct resonant or formant frequencies.

Pitch Period Estimation

Determining whether a segment is a voiced or unvoiced sound is not all of the information that is needed by the LPC decoder to accurately reproduce a speech signal. In order to produce an input signal for the LPC filter, the decoder also needs another attribute of the current speech segment known as the pitch period. The period of any wave, including speech signals, can be defined as the time required for one wave cycle to completely pass a fixed position. For speech signals, the pitch period can be thought of as the period of the vocal cord vibration that occurs during the production of voiced speech. Therefore, the pitch period is only needed for the decoding of voiced segments; it is not required for unvoiced segments, since they are produced by turbulent air flow, not vocal cord vibrations.

It is very computationally intensive to determine the pitch period for a given segment of speech. There are several different types of algorithms that could be used. One type of algorithm takes advantage of the fact that the autocorrelation of a periodic function, Rxx(k), will have a maximum when k is equivalent to the pitch period. These algorithms usually detect a maximum value by checking the autocorrelation value against a threshold value. One problem with algorithms that use autocorrelation is that the validity of their results is susceptible to interference as a result of other resonances in the vocal tract.
When interference occurs the algorithm cannot guarantee accurate results. Another problem with autocorrelation algorithms occurs because voiced speech is not entirely periodic. This means that the maximum will be lower than it should be for a true periodic signal. LPC does not use an algorithm with autocorrelation; instead it uses an algorithm called the average magnitude difference function (AMDF), which is defined as

AMDF(P) = (1/N) Σn |yn - yn-P|
Since the pitch period, P, for humans is limited, the AMDF is evaluated only over a limited range of possible pitch period values. Therefore, in LPC there is an assumption that the pitch period is between 2.5 and 19.5 milliseconds. If the signal is sampled at a rate of 8000 samples/second, then 20 < P < 160. For voiced segments we can consider the set of speech samples for the current segment, {yn}, as a periodic sequence with period P0. This means that samples that are P0 apart should have similar values and that the AMDF function will have a minimum at P0, that is, when P is equal to the pitch period.
An advantage of the AMDF function is that it can also be used to determine whether a segment is voiced or unvoiced. When the AMDF function is applied to an unvoiced signal, the difference between the minimum and the average values is very small compared to voiced signals. This difference can be used to make the voiced/unvoiced determination. For unvoiced segments the AMDF function also has a minimum when P equals the pitch period; however, any additional minima that are obtained will be very close to the average value. This means that these minima will not be very deep.
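A minimal Python sketch of the AMDF search and the depth-of-minimum voiced/unvoiced cue described above (using a synthetic 100 Hz tone and white noise in place of real speech):

```python
import numpy as np

def amdf(y, P):
    """Average magnitude difference at candidate pitch period P."""
    return np.mean(np.abs(y[P:] - y[:-P]))

def amdf_pitch(y, p_min=20, p_max=160):
    """Search 20 <= P < 160 samples (roughly 2.5-20 ms at 8000
    samples/s) for the lag that minimizes the AMDF."""
    values = np.array([amdf(y, P) for P in range(p_min, p_max)])
    return p_min + int(np.argmin(values)), values

fs = 8000
n = np.arange(360)
voiced_seg = np.sin(2 * np.pi * 100 * n / fs)    # 100 Hz -> 80-sample period
rng = np.random.default_rng(1)
unvoiced_seg = rng.standard_normal(360)          # noise-like stand-in

period, v_vals = amdf_pitch(voiced_seg)
_, u_vals = amdf_pitch(unvoiced_seg)
# Voiced: deep minimum well below the average; unvoiced: shallow minima.
voiced_depth = v_vals.min() / v_vals.mean()
unvoiced_depth = u_vals.min() / u_vals.mean()
```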
Figures: AMDF of a voiced segment and of an unvoiced segment.
Vocal Tract Filter

Each sample of the segment is predicted as a linear combination of the previous M samples, with prediction error

en = yn - Σi=1..M ai yn-i

where {yn} is the set of speech samples for the current segment and {ai} is the set of coefficients. In order to provide the most accurate coefficients, {ai} is chosen to minimize the average value of en² (the mean squared error) for all samples in the segment. The first step in minimizing the average mean squared error is to take the derivative.
Taking the derivative produces a set of M equations. In order to solve for the filter coefficients, E[yn-i yn-j] has to be estimated. There are two approaches that can be used for this estimation: autocorrelation and autocovariance. Although there are versions of LPC that use both approaches, autocorrelation is the approach that will be explained in this paper. Autocorrelation requires that several initial assumptions be made about the set or sequence of speech samples, {yn}, in the current segment. First, it requires that {yn} be stationary, and second, it requires that the {yn} sequence be zero outside of the current segment. In autocorrelation, each E[yn-i yn-j] is converted into an autocorrelation function of the form Ryy(|i-j|). The estimation of an autocorrelation function Ryy(k) can be expressed as:

Ryy(k) = (1/N) Σn=k..N-1 yn yn-k
Using Ryy(k), the M equations that were acquired from taking the derivative of the mean squared error can be written in matrix form RA = P where A contains the filter coefficients.
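As an illustration of these normal equations, the following Python sketch builds the Toeplitz matrix R from estimated autocorrelations and solves R A = P directly for a signal generated by a known second-order recursion (direct inversion here, purely to show the equations; the report's implementation uses the Levinson-Durbin recursion instead):

```python
import numpy as np

def lpc_coefficients(y, M=10):
    """Solve the autocorrelation normal equations R a = p for the
    order-M predictor coefficients (direct Toeplitz solve)."""
    N = len(y)
    r = np.array([np.dot(y[:N - k], y[k:]) for k in range(M + 1)])
    # R is Toeplitz: entry (i, j) depends only on |i - j|.
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    p = r[1:M + 1]
    return np.linalg.solve(R, p)

# A signal generated by a known stable 2-tap recursion is recovered
# (coefficients here are illustrative, not from the report).
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
e = rng.standard_normal(2000)
y = np.zeros(2000)
for n in range(2000):
    y[n] = e[n]
    if n >= 1:
        y[n] += true_a[0] * y[n - 1]
    if n >= 2:
        y[n] += true_a[1] * y[n - 2]

a_est = lpc_coefficients(y, M=2)
```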
In order to determine the contents of A, the filter coefficients, the equation A = R⁻¹P must be solved. This equation cannot be solved without first computing R⁻¹. This is an easy computation if one notices that R is symmetric and, more importantly, all diagonals consist of the same element. This type of matrix is called a Toeplitz matrix and can be easily inverted. The Levinson-Durbin (L-D) algorithm is a recursive algorithm that is considered very computationally efficient, since it takes advantage of the properties of R when determining the filter coefficients. This algorithm is denoted with a superscript, {ai(j)} for a jth-order filter, and the average mean squared error of a jth-order filter is denoted Ej instead of E[en²]. When applied to an Mth-order filter, the L-D algorithm computes all filters of order less than M. That is, it determines all order-N filters where N = 1,...,M-1.
During the process of computing the filter coefficients {ai} a set of coefficients, {ki}, called reflection coefficients or partial correlation coefficients (PARCOR) are generated. These coefficients are used to solve potential problems in transmitting the filter coefficients. The quantization of the filter coefficients for transmission can create a major problem since errors in the filter coefficients can lead to instability in the vocal tract filter and create an inaccurate output signal. This potential problem is averted by quantizing and transmitting the reflection coefficients that are generated by the Levinson-Durbin algorithm. These coefficients can be used to rebuild the set of filter coefficients {ai} and can guarantee a stable filter if their magnitude is strictly less than one.
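A compact Python sketch of the Levinson-Durbin recursion, returning both the filter coefficients {ai} and the reflection coefficients {ki} (checked against a direct solve of the normal equations; the autocorrelation values below are illustrative):

```python
import numpy as np

def levinson_durbin(r, M):
    """Levinson-Durbin recursion: from autocorrelations r[0..M],
    return predictor coefficients a[1..M] and reflection
    coefficients k[1..M]."""
    a = np.zeros(M + 1)
    k_all = np.zeros(M)
    E = r[0]                                   # order-0 prediction error
    for j in range(1, M + 1):
        # k_j = (r[j] - sum_{i=1}^{j-1} a_i r[j-i]) / E_{j-1}
        acc = r[j] - np.dot(a[1:j], r[j - 1:0:-1])
        k = acc / E
        k_all[j - 1] = k
        a_new = a.copy()
        a_new[j] = k
        for i in range(1, j):                  # update lower-order taps
            a_new[i] = a[i] - k * a[j - i]
        a = a_new
        E *= (1 - k * k)                       # error shrinks each order
    return a[1:], k_all

# A small, valid autocorrelation sequence (illustrative values).
r = np.array([2.0, 1.2, 0.3])
a, k = levinson_durbin(r, 2)
```

A stable filter is guaranteed when every |ki| < 1, which is exactly why the reflection coefficients are the quantities quantized and transmitted.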
LPC Synthesis/Decoding
The process of decoding a sequence of speech segments is the reverse of the encoding process. Each segment is decoded individually and the sequence of reproduced sound segments is joined together to represent the entire input speech signal. The decoding or synthesis of a speech segment is based on the 54 bits of information that are transmitted from the encoder.

The speech segment is declared voiced or unvoiced based on the voiced/unvoiced determination bit. The decoder needs to know what type of signal the segment contains in order to determine what type of excitation signal will be given to the LPC filter. Unlike other speech compression algorithms such as CELP, which have a codebook of possible excitation signals, LPC has only two possible signals. For voiced segments a pulse is used as the excitation signal. This pulse consists of 40 samples and is locally stored by the decoder. A pulse is defined as "...an isolated disturbance, that travels through an otherwise undisturbed medium" [10]. For unvoiced segments, white noise produced by a pseudorandom number generator is used as the input for the filter.

The pitch period for voiced segments is then used to determine whether the 40-sample pulse needs to be truncated or extended. If the pulse needs to be extended, it is padded with zeros, since by definition a pulse travels through an undisturbed medium. This combination of voiced/unvoiced determination and pitch period is all that is needed to produce the excitation signal.

Each segment of speech has a different LPC filter that is eventually produced using the reflection coefficients and the gain that are received from the encoder. Ten reflection coefficients are used for voiced segment filters and four reflection coefficients are used for unvoiced segments. These reflection coefficients are used to generate the vocal tract coefficients or parameters which are used to create the filter.
The final step of decoding a segment of speech is to pass the excitation signal through the filter to produce the synthesized speech signal.
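The excitation construction described above (a pitch-spaced pulse train for voiced frames, pseudorandom noise for unvoiced frames) can be sketched in Python as follows. The stored 40-sample pulse shape is replaced here by a single unit impulse, an illustrative placeholder rather than the actual LPC-10 pulse:

```python
import numpy as np

def frame_excitation(voiced, pitch_period, frame_len=180, pulse_len=40):
    """Build one frame of excitation: for voiced frames the stored
    pulse is truncated or zero-padded to one pitch period and
    repeated; for unvoiced frames pseudorandom noise is used."""
    if not voiced:
        rng = np.random.default_rng(0)
        return rng.standard_normal(frame_len)
    pulse = np.zeros(pulse_len)
    pulse[0] = 1.0                         # placeholder pulse shape
    if pitch_period <= pulse_len:
        one_period = pulse[:pitch_period]  # truncate the pulse
    else:                                  # extend by zero-padding
        one_period = np.concatenate([pulse, np.zeros(pitch_period - pulse_len)])
    reps = int(np.ceil(frame_len / pitch_period))
    return np.tile(one_period, reps)[:frame_len]

# An 80-sample pitch period over a 180-sample frame gives pulses
# at samples 0, 80, and 160.
e_voiced = frame_excitation(True, 80)
e_unvoiced = frame_excitation(False, 0)
```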
LPC Applications
In general, the most common use for speech compression is in standard telephone systems. In fact, much of the technology used in speech compression was developed by the phone companies. Linear predictive coding has application mainly in the area of secure telephony because of its low bit rate. Secure telephone systems require a low bit rate, since speech is first digitized, then encrypted and transmitted. These systems have a primary goal of decreasing the bit rate as much as possible while maintaining a level of speech quality that is understandable. Other standards, such as the digital cellular standard and the international telephone network standard, have higher quality requirements and therefore require a higher bit rate. In these standards, understanding the speech is not good enough; the listener must also be able to recognize the speech as belonging to the original source.

A second area where linear predictive coding has been used is in text-to-speech synthesis. In this type of synthesis the speech has to be generated from text. Since LPC synthesis involves the generation of speech based on a model of the vocal tract, it provides a natural method for generating speech from text.

Further applications of LPC and other speech compression schemes are voice mail systems, telephone answering machines, and multimedia applications. Most multimedia applications, unlike telephone applications, involve one-way communication and involve storing the data. An example of a multimedia application involving speech is an application that allows voice annotations about a text document to be saved with the document. The method of speech compression used in multimedia applications depends on the desired speech quality and the limitations of storage space for the application. Linear predictive coding provides a favourable method of speech compression for multimedia applications since it requires the smallest storage space as a result of its low bit rate.
MATLAB Implementation
Main.m

%MAIN BODY
clear all; clc;
disp('wavfile');

%INPUT
inpfilenm = 'sample1';
[x, fs] = wavread(inpfilenm);

%LENGTH (IN SEC) OF INPUT WAVEFILE
t = length(x)./fs;
sprintf('Processing the wavefile "%s"', inpfilenm)
sprintf('The wavefile is %3.2f seconds long', t)

%THE ALGORITHM STARTS HERE
M = 10;    %prediction order
[aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M);    %pitch_plot holds pitch periods
synth_speech = f_DECODER(aCoeff, pitch_plot, voiced, gain);

%RESULTS
beep;
disp('Press a key to play the original sound!');
pause;
soundsc(x, fs);
disp('Press a key to play the LPC compressed sound!');
pause;
soundsc(synth_speech, fs);
figure;
subplot(2,1,1), plot(x);
title(['Original signal = "', inpfilenm, '"']);
subplot(2,1,2), plot(synth_speech);
title(['Synthesized speech of "', inpfilenm, '" using the LPC algorithm']);
f_ENCODER.m
function [aCoeff, pitch_plot, voiced, gain] = f_ENCODER(x, fs, M)
M = 10;    %prediction order = 10
b = 1;
fsize = 30e-3;    %frame size
frame_length = round(fs .* fsize);
N = frame_length - 1;

%VOICED/UNVOICED and PITCH (independent of frame segmentation)
[voiced, pitch_plot] = f_VOICED(x, fs, fsize);

%FRAME SEGMENTATION for aCoeff and GAIN
for b = 1 : frame_length : (length(x) - frame_length)
    y1 = x(b:b+N);
    y = filter([1 -.9378], 1, y1);    %pre-emphasis filtering

    %aCoeff [LEVINSON-DURBIN METHOD]
    [a, tcount_of_aCoeff, e] = func_lev_durb(y, M);
    aCoeff(b : (b + tcount_of_aCoeff - 1)) = a;

    %GAIN
    pitch_plot_b = pitch_plot(b);    %pitch period
    voiced_b = voiced(b);
    gain(b) = f_GAIN(e, voiced_b, pitch_plot_b);
end
func_lev_durb.m
%function implementing the Levinson-Durbin recursion
function [aCoeff, tcount_of_aCoeff, e] = func_lev_durb(y, M)
if (nargin < 2), M = 10; end
sk = 0;
a = zeros(M+1, M+1);              %a(i,s): i-th coefficient of the order-(s-1) predictor
z = xcorr(y);

%keep the non-negative lags of the autocorrelation, R[0..]
R = z( ((length(z)+1)./2) : length(z) );
s = 1;
J(1) = R(1);                      %order-0 prediction error power

%recursively build the predictor of order (s-1):
for s = 2 : M+1,
    sk = 0;
    for i = 2 : (s-1),
        sk = sk + a(i,(s-1)) .* R(s-i+1);
    end
    k(s) = (R(s) + sk) ./ J(s-1);             %reflection coefficient
    J(s) = J(s-1) .* (1 - (k(s)).^2);         %updated error power
    a(s,s) = -k(s);
    a(1,s) = 1;
    for i = 2 : (s-1),
        a(i,s) = a(i,(s-1)) - k(s) .* a((s-i+1),(s-1));
    end
end

aCoeff = a((1:s), s)';
tcount_of_aCoeff = length(aCoeff);

%prediction residual
est_y = filter([0 -aCoeff(2:end)], 1, y);
e = y - est_y;
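The recursion above can be cross-checked against a compact Python/NumPy version of the Levinson–Durbin algorithm. This sketch uses the common convention A(z) = 1 + a₁z⁻¹ + … + a_M z⁻M and is not a line-by-line port of func_lev_durb:

```python
import numpy as np

def levinson_durbin(R, M):
    """Solve the LPC normal equations from autocorrelation lags R[0..M].

    Returns a[0..M] (with a[0] = 1) for the prediction-error filter
    A(z) = 1 + a[1] z^-1 + ... + a[M] z^-M, and the final error power J.
    """
    a = np.zeros(M + 1)
    a[0] = 1.0
    J = R[0]                          # order-0 error power
    for m in range(1, M + 1):
        # reflection coefficient for order m
        k = -(R[m] + np.dot(a[1:m], R[m-1:0:-1])) / J
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] + k * a[m-1:0:-1]   # update lower coefficients
        a = a_new
        J *= (1.0 - k * k)            # error power shrinks each order
    return a, J
```

For an AR(1) source with autocorrelation R = [1, ρ, ρ²], the order-2 solution is [1, −ρ, 0], as expected.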
f_VOICED.m
%main function of voiced/unvoiced detection
function [voiced, pitch_plot] = f_VOICED(x, fs, fsize)

frame_length = round(fs .* fsize);
N = frame_length - 1;

%FRAME SEGMENTATION: compute the three features for each frame
for b = 1 : frame_length : (length(x) - frame_length),
    y1 = x(b : b+N);
    y = filter([1 -.9378], 1, y1);            %pre-emphasis filter
    msf(b:(b + N)) = func_vd_msf(y);          %low-band magnitude sum
    zc(b:(b + N)) = func_vd_zc(y);            %zero-crossing count
    pitch_plot(b:(b + N)) = func_pitch(y, fs);
end

%adaptive thresholds of the form min + weight*(mean - min)
thresh_msf = (( (sum(msf)./length(msf)) - min(msf) ) .* (0.67)) + min(msf);
voiced_msf = msf > thresh_msf;                %1 = voiced, 0 = unvoiced

thresh_zc = (( (sum(zc)./length(zc)) - min(zc) ) .* (1.5)) + min(zc);
voiced_zc = zc < thresh_zc;

thresh_pitch = (( (sum(pitch_plot)./length(pitch_plot)) - min(pitch_plot) ) .* (0.5)) + min(pitch_plot);
voiced_pitch = pitch_plot > thresh_pitch;

%a frame is declared voiced only when all three tests agree
for b = 1 : (length(x) - frame_length),
    if voiced_msf(b) .* voiced_pitch(b) .* voiced_zc(b) == 1,
        voiced(b) = 1;
    else
        voiced(b) = 0;
    end
end
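The voiced/unvoiced rule combines three per-frame features using adaptive thresholds of the form min + weight·(mean − min). A Python sketch of just the decision logic (the feature vectors are assumed to be precomputed; names are illustrative):

```python
import numpy as np

def adaptive_threshold(feature, weight):
    """Threshold at min + weight * (mean - min), as in f_VOICED."""
    f = np.asarray(feature, dtype=float)
    return f.min() + weight * (f.mean() - f.min())

def voiced_decision(msf, zc, pitch):
    """A frame is voiced only when all three feature tests agree."""
    v_msf = msf > adaptive_threshold(msf, 0.67)       # strong low-band energy
    v_zc = zc < adaptive_threshold(zc, 1.5)           # few zero crossings
    v_pitch = pitch > adaptive_threshold(pitch, 0.5)  # long pitch period
    return v_msf & v_zc & v_pitch
```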
func_pitch.m
function pitch_period = func_pitch(y, fs)

period_min = round(fs .* 2e-3);               %shortest plausible pitch period (2 ms)
period_max = round(fs .* 20e-3);              %longest plausible pitch period (20 ms)

R = xcorr(y);
[R_max, R_mid] = max(R);                      %R_mid indexes the zero-lag peak
pitch_per_range = R(R_mid + period_min : R_mid + period_max);
[R_max, R_mid] = max(pitch_per_range);        %strongest peak inside the search range
pitch_period = R_mid + period_min - 1;        %element 1 corresponds to lag period_min
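func_pitch picks the strongest autocorrelation peak between lags of 2 ms and 20 ms. The same idea can be sketched in Python/NumPy (the function name is illustrative):

```python
import numpy as np

def estimate_pitch_period(y, fs):
    """Pitch period (in samples) from the autocorrelation peak,
    restricted to lags between 2 ms and 20 ms."""
    lag_min = int(round(fs * 2e-3))
    lag_max = int(round(fs * 20e-3))
    # one-sided autocorrelation: r[l] = sum_n y[n] * y[n+l]
    r = np.correlate(y, y, mode='full')[len(y) - 1:]
    search = r[lag_min:lag_max + 1]
    return lag_min + int(np.argmax(search))
```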
func_vd_msf.m

function m_s_f = func_vd_msf(y)

[B, A] = butter(9, .33, 'low');    %9th-order low-pass, cutoff at 0.33 of Nyquist
y1 = filter(B, A, y);
m_s_f = sum(abs(y1));              %magnitude sum of the low-band signal
func_vd_zc.m
function ZC = func_vd_zc(y)

ZC = 0;
for n = 1 : length(y)-1,
    ZC = ZC + (1./2) .* abs(sign(y(n+1)) - sign(y(n)));
end
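The zero-crossing count sums |sign(y[n+1]) − sign(y[n])|/2 over the frame, so each full sign change contributes one crossing. A vectorized Python equivalent:

```python
import numpy as np

def zero_crossings(y):
    """Count sign changes: sum of |sign(y[n+1]) - sign(y[n])| / 2."""
    s = np.sign(y)
    return 0.5 * np.sum(np.abs(np.diff(s)))
```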
f_GAIN.m
%function to compute the gain of each frame
function [gain_b, power_b] = f_GAIN(e, voiced_b, pitch_plot_b)

if voiced_b == 0,                  %unvoiced: RMS of the whole residual
    denom = length(e);
    power_b = sum(e(1:denom).^2) ./ denom;
    gain_b = sqrt(power_b);
else                               %voiced: power over whole pitch periods
    denom = floor(length(e)./pitch_plot_b) .* pitch_plot_b;
    power_b = sum(e(1:denom).^2) ./ denom;
    gain_b = sqrt(pitch_plot_b .* power_b);
end
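f_GAIN takes the square root of the residual power; for voiced frames the power is averaged over a whole number of pitch periods and scaled by the pitch period. A Python sketch of the same rule (function name is illustrative):

```python
import numpy as np

def frame_gain(e, voiced, pitch_period=None):
    """Per-frame gain from the prediction residual e, mirroring f_GAIN."""
    e = np.asarray(e, dtype=float)
    if not voiced:
        power = np.mean(e ** 2)               # average residual power
        return np.sqrt(power)
    # voiced: average power over an integer number of pitch periods
    n = (len(e) // pitch_period) * pitch_period
    power = np.sum(e[:n] ** 2) / n
    return np.sqrt(pitch_period * power)
```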
f_DECODER.m
%DECODER PORTION
function synth_speech = f_DECODER(aCoeff, pitch_plot, voiced, gain)

%recover the frame length: gain was stored only at frame-start indices,
%so count the zero entries between the first two stored values
frame_length = 1;
for i = 2 : length(gain)
    if gain(i) == 0,
        frame_length = frame_length + 1;
    else
        break;
    end
end

%decoding starts here
for b = 1 : frame_length : length(gain),
    if voiced(b) == 1,                        %voiced frame: pulse-train excitation
        pitch_plot_b = pitch_plot(b);
        syn_y1 = f_SYN_V(aCoeff, gain, frame_length, pitch_plot_b, b);
    else                                      %unvoiced frame: noise excitation
        syn_y1 = f_SYN_UV(aCoeff, gain, frame_length, b);
    end
    synth_speech(b : b+frame_length-1) = syn_y1;
end
f_SYN_V.m
%a function of f_DECODER: synthesis of a voiced frame
function syn_y1 = f_SYN_V(aCoeff, gain, frame_length, pitch_plot_b, b)

%create a pulse train with one impulse per pitch period
for f = 1 : frame_length
    if f./pitch_plot_b == floor(f./pitch_plot_b)
        ptrain(f) = 1;
    else
        ptrain(f) = 0;
    end
end

%drive the all-pole vocal-tract filter with the pulse train and apply the gain
%(mirrors the noise-driven path in f_SYN_UV)
syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], ptrain);
syn_y1 = syn_y2 .* gain(b);
f_SYN_UV.m
%a function of f_DECODER: synthesis of an unvoiced frame
function syn_y1 = f_SYN_UV(aCoeff, gain, frame_length, b)

wn = randn(1, frame_length);                  %white-noise excitation
syn_y2 = filter(1, [1 aCoeff((b+1):(b+1+9))], wn);
syn_y1 = syn_y2 .* gain(b);
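Both synthesis branches excite the all-pole vocal-tract filter 1/A(z), using an impulse train at the pitch period for voiced frames and white noise for unvoiced frames, then scale by the frame gain. A self-contained Python sketch (the inner loop implements y[n] = e[n] − Σ aₖ·y[n−k] directly, so no filtering library is needed; names are illustrative):

```python
import numpy as np

def synthesize_frame(a, gain, frame_length, pitch_period=None, seed=0):
    """Synthesize one frame through the all-pole filter 1/A(z).

    a = [1, a1, ..., aM] are the prediction-error filter coefficients;
    the excitation is an impulse train (voiced) or white noise (unvoiced).
    """
    if pitch_period is not None:                  # voiced frame
        exc = np.zeros(frame_length)
        exc[pitch_period - 1::pitch_period] = 1.0  # one impulse per period
    else:                                          # unvoiced frame
        exc = np.random.default_rng(seed).standard_normal(frame_length)
    M = len(a) - 1
    y = np.zeros(frame_length)
    for n in range(frame_length):                  # y[n] = exc[n] - sum_k a[k]*y[n-k]
        acc = exc[n]
        for k in range(1, M + 1):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return gain * y
```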
References
[1] Lawrence R. Rabiner and Ronald W. Schafer. Introduction to Digital Speech Processing, Vol. 1, Nos. 1-2 (2007) 1-194.
[2] V. Hardman and O. Hodson. Internet/Mbone Audio (2000) 5-7.
[3] Scott C. Douglas. Introduction to Adaptive Filters, Digital Signal Processing Handbook (1999) 7-12.
[4] Poor, H. V., Looney, C. G., Marks II, R. J., Verdú, S., Thomas, J. A., Cover, T. M. Information Theory, The Electrical Engineering Handbook (2000) 56-57.
[5] R. Sproat and J. Olive. Text-to-Speech Synthesis, Digital Signal Processing Handbook (1999) 9-11.
[6] Richard C. Dorf, et al. Broadcasting (2000) 44-47.
[7] Richard V. Cox. Speech Coding (1999) 5-8.
[8] Randy Goldberg and Lance Riek. A Practical Handbook of Speech Coders (1999) Chapter 2: 1-28, Chapter 4: 1-14, Chapter 9: 1-9, Chapter 10: 1-18.
[9] Mark Nelson and Jean-Loup Gailly. Speech Compression, The Data Compression Book (1995) 289-319.
[10] Khalid Sayood. Introduction to Data Compression (2000) 497-509.
[11] Richard Wolfson and Jay Pasachoff. Physics for Scientists and Engineers (1995) 376-377.