
AUDIO SEARCH USING DYNAMIC PROGRAMMING

Patha Sreedhar
Sreedhar.patha@research.iiit.ac.in

Speech and Vision Lab, International Institute of Information Technology, Hyderabad


Abstract: An unsupervised approach to audio search is explored using dynamic programming. Segmental dynamic time warping is applied to the MediaEval 2012 African database, and the results are scored with the NIST evaluation.

Keywords: Dynamic Programming (DP), Dynamic Time Warping (DTW), Segmental Dynamic Time Warping (SDTW).

Introduction:

Dynamic programming: A problem is divided into overlapping subproblems, and the result of each subproblem is stored the first time it is solved. When the same subproblem recurs, its result is looked up instead of being recomputed, which reduces the computational complexity.

Dynamic time warping: DTW aligns two time-dependent sequences through a nonlinear warping of the time axis. The resulting alignment gives a measure of similarity between the two sequences.

Segmental dynamic time warping: If the two sequences differ greatly in length, aligning them end to end with DTW is not appropriate. SDTW instead takes a part of the long sequence, aligns it to the short sequence and computes the alignment cost. The short sequence is then aligned again after a shift, and this continues to the end of the long sequence. Each position yields an alignment cost, and the best cost is selected.

Sakoe-Chiba band: In SDTW, the window considered for alignment, the shift, and the condition for computing the trellis together define a band. One widely used band is the Sakoe-Chiba band, in which the window size is the query length plus R and the shift is R. The trellis is computed only where the distance between the test frame and the model frame is less than or equal to R. In speech applications, R is generally chosen according to the speech rate.

Itakura transitions: Many transition types are possible in the trellis calculation. One widely used type is the Itakura transition, which considers an insertion arc, a substitution arc and diagonal arcs.

Figure 1: Itakura transitions

Needleman-Wunsch transitions: This is the other transition type used here; it has a deletion arc in place of the diagonal arc.

Figure 2: Needleman-Wunsch transitions
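The band-limited alignment and the sliding-window search described in the introduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the system used in the experiments: the function names `banded_dtw` and `sdtw_search` and the `dist` argument are hypothetical, the three predecessor arcs (vertical, horizontal, diagonal) are a generic choice rather than the exact arcs of Figures 1 and 2, and each alignment is forced to span the whole window.

```python
import math


def banded_dtw(query, window, R, dist):
    """Accumulated DTW cost of aligning `query` to `window`.

    Only trellis cells with |i - j| <= R are computed (band constraint).
    Assumption: the path starts at the first frames and ends at the last
    frames of both sequences, using vertical, horizontal and diagonal arcs.
    """
    n, m = len(query), len(window)
    # D[i][j] holds the best cost of aligning query[:i] with window[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Skip every cell that lies outside the band around the diagonal.
        for j in range(max(1, i - R), min(m, i + R) + 1):
            best_prev = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            if best_prev < math.inf:
                D[i][j] = dist(query[i - 1], window[j - 1]) + best_prev
    return D[n][m]


def sdtw_search(query, utterance, R, dist):
    """Slide a window of length len(query) + R over `utterance` in steps of R.

    Returns (best_cost, best_start) for the window with the lowest cost.
    """
    if R < 1:
        raise ValueError("R must be at least 1 so that the window advances")
    win_len = len(query) + R
    best_cost, best_start = math.inf, None
    start = 0
    while start < len(utterance):
        window = utterance[start:start + win_len]
        cost = banded_dtw(query, window, R, dist)
        if cost < best_cost:
            best_cost, best_start = cost, start
        if start + win_len >= len(utterance):
            break  # the window has reached the end of the utterance
        start += R
    return best_cost, best_start
```

Here `dist` is any local frame distance; for scalar toy data `lambda a, b: abs(a - b)` is enough, while a distance suited to posteriorgram frames would be substituted in practice. Calling `sdtw_search(query, utterance, 6, dist)` corresponds to the window and shift of 6 frames mentioned in the text.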

Methodology: Gaussian posteriorgrams are extracted from the Mel Frequency Cepstral Coefficients (MFCC) of both the test and model utterances, and Segmental Dynamic Time Warping (SDTW) is applied to these features. A test utterance cannot be aligned to the entire model utterance, because only a part of the model utterance will match it; hence SDTW is used rather than DTW directly. The window length for SDTW is generally set to the test utterance length plus 6, to accommodate speech-rate variability. The trellis is computed only where the distance between the test and model frame is less than 6 (in general). The window is shifted by 6 frames at a time and slid along the entire length of the model utterance. At each shift the trellis is computed and the alignment cost is noted. The minimum alignment cost among all warping paths is used to spot the required test utterance.

Experimentation: Both Itakura and Needleman-Wunsch transitions were tried in the trellis calculation. A frame shift of 1 instead of 6 was also tried, as was a variable speech-rate allowance of (query length / 3) instead of the constant 6.

Database and evaluation: The African MediaEval 2012 database is used for experimentation, and the NIST evaluation is used for scoring.

Results:
Table 1: Results of the various experiments


Window size             Window shift
Query length + 6        1
Query length + 6        6
(4/3) * Query length    (1/3) * Query length

Maximum Term Weighted Value (Itakura arc, Needleman-Wunsch arc): NA 0.2149 0.1432 0.2254 0.2251 NA
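For readers unfamiliar with the metric in Table 1, the Term Weighted Value as commonly defined in NIST spoken term detection evaluations can be written as below. The symbols are notation assumed here and do not appear in the text: Q is the set of query terms, \theta the decision threshold, and \beta the weight on false alarms, whose value for these experiments is not stated.

```latex
\mathrm{TWV}(\theta) = 1 - \frac{1}{|Q|} \sum_{q \in Q}
  \left[ P_{\mathrm{miss}}(q, \theta) + \beta \, P_{\mathrm{FA}}(q, \theta) \right],
\qquad
\mathrm{MTWV} = \max_{\theta} \, \mathrm{TWV}(\theta)
```

A higher value is better: a system that finds every occurrence with no false alarms scores 1, and a system that returns nothing scores 0.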

Future developments: Techniques such as Pseudo Relevance Feedback (PRF) can be applied on top of this system to increase accuracy. Other features, such as voicing and sonority, can be considered to improve the results. A different type of band, such as the Itakura parallelogram, can also be used to improve accuracy.

Acknowledgement: This work is part of the course "Topics in Speech Processing: Audio Information Retrieval", held at IIIT Hyderabad during spring 2013. The author is very thankful to Dr. Kishore Prahallad and Dr. Suryakanth V Gangashetty for their support, and to Gautham Mantenna for his constant guidance.
