
Audio Engineering Society

Convention Paper 6161


Presented at the 116th Convention, 2004 May 8-11, Berlin, Germany
This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Evaluation of Objective Loudness Meters


Gilbert A. Soulodre

Communications Research Centre, Ottawa, Ontario K2H 8S2, Canada gilbert.soulodre@crc.ca

ABSTRACT

There are many applications where it is desirable to objectively measure the perceived loudness of typical audio signals. The ITU-R is investigating suitable objective measures (meters) that would allow the perceived loudness of various program materials to be equalized for broadcast applications. Ten objective loudness meters were submitted for formal evaluation by several private companies and research organizations. The loudness meters were evaluated in their ability to predict the results of an extensive database derived from a series of formal subjective tests conducted at five test sites around the world. The performance of the various loudness meters is compared and rated using several newly proposed metrics. Several basic objective loudness measures were also evaluated.

1. INTRODUCTION

In many applications it is desirable to be able to measure and control the subjective loudness of typical program material. Examples of this include television and radio broadcast applications where the nature and content of the audio material changes frequently. In these applications the audio content can continually switch between music and speech, or some combination of the two. These changes in the content of the program material can result in significant changes in subjective loudness. Moreover, various forms of dynamics processing are frequently applied to the signals, which can have a significant effect on the perceived loudness of the signal. Of course, the matter of subjective loudness is also of great importance to the music industry where dynamics processing is commonly used to maximize the perceived loudness of a recording.

Currently within the ITU-R, a Special Rapporteur Group (SRG3) has been given the task of identifying an objective means (a meter) of measuring the perceived loudness of typical program material for broadcast applications. The intent is to develop a means of reducing the variation in perceived loudness as the nature and content of the program material changes, or as the user changes between broadcast stations. Ultimately, as a result of the work, some loudness metering system(s) may be shown to be suitable for this purpose and a new ITU-R recommendation will be established.

In the first part of the ITU-R study a subjective test method was developed to examine loudness perception of typical program materials [1]. Subjective tests were conducted at several sites around the world to create a subjective database for evaluating the performance of the proposed loudness meters. Subjects matched the loudness of various audio sequences to a reference sequence. The audio sequences were taken from actual broadcasts (television and radio).

A total of ten commercially developed loudness meters were submitted by seven different proponents for evaluation at the Audio Perception Lab of the Communications Research Centre. In addition, the author contributed two basic loudness algorithms to serve as a performance baseline. The individual audio sequences of the subjective database were processed through each of the loudness meters and the measured loudness estimates were recorded. These objective readings were then compared against the subjective loudness ratings using a variety of metrics to assess each meter's performance.

2. SUBJECTIVE DATABASE

As a first step in identifying a suitable objective loudness meter, a subjective database was developed that provides perceived loudness ratings for a variety of typical program materials. The program materials used in the tests were taken from actual television and radio broadcasts from various locations around the world. The sequences included music, television and movie dramas, sporting events, news broadcasts, sound effects, and advertisements. Included in the sequences were speech segments in several languages.

The subjective tests were carried out at five separate sites around the world providing a total of 97 listeners. The test sites consisted of the Australian Broadcasting Corporation (Australia), the British Broadcasting Corporation (England), the Communications Research Centre (Canada), the National Acoustic Laboratories (Australia), and the National Film Board (Canada).

2.1. Subjective Test Set-up

The test setup used in the subjective experiments is depicted in Figure 1. The subjective tests were conducted in acoustically dead listening environments. A single loudspeaker was placed directly in front of the seated listener.

Figure 1: Subjective test setup.

The subjective tests consisted of a loudness-matching task. Subjects listened to a broad range of typical program material and adjusted the level of each test item until its perceived loudness matched that of a reference signal. The reference signal in this experiment consisted of English female speech reproduced at a level of 60 dBA. In an unpublished study, Benjamin found this to be a typical listening level for television viewing in actual homes [2].

Figure 2: User interface of subjective test system.

A software-based subjective test system developed at the Australian Broadcasting Corporation allowed the listener to switch instantly back and forth between test items and adjust the level (loudness) of each item. A screen-shot of the test software is shown in Figure 2. The level of the test items could be adjusted in 0.33 dB steps. Selecting the button labeled 1 accessed the reference signal. The level of the reference signal was held fixed. Using the computer keyboard, the subject selected a given test item and adjusted its level until its loudness matched the reference signal. Subjects could instantly


switch between any of the test items by selecting the appropriate key. The sequences played continuously (looped) during the tests. The software recorded the gain settings for each test item as set by the subject. Therefore, the subjective tests produced a set of gain values (in decibels) required to match the loudness of each test sequence with the reference sequence.

Prior to conducting the formal blind tests, each subject underwent a training session in which they became acquainted with the test software and their task in the experiment. Since many of the test items contained a mixture of speech and other sounds (e.g., music, background noises), the subjects were specifically instructed to match the loudness of the overall signal, not just the speech component of the signals.

In the formal blind tests, subjects matched the perceived loudness of 96 monophonic audio sequences. A three-member panel made up of SRG3 members selected the test sequences as well as the reference item. The sequences were taken from actual television and radio broadcasts, and were chosen to provide a broad range of program material from around the world. The audio sequences were recorded at a sampling rate of 48 kHz using 16-bit PCM.

The test sequences can be roughly classified into eight different categories according to their content. The eight categories are listed in Table 1. It can be seen that about 74% of the sequences contained some speech element. This seems reasonably representative of typical television content.

Table 1: Content of program material used in subjective tests.

Category  Number of Items  Description
1         16               Speech only, no background sounds
2         4                Drama (dialogue with environmental sounds)
3         22               Speech with background music
4         28               Speech with background sounds (interview/sports)
5         14               Instrumental music
6         6                Music with a lead singer
7         4                Singing voice with no instruments
8         2                Sound effects, environmental sounds, no speech
The experiment was designed such that the consistency of each subject's loudness matching could be evaluated in post hoc analysis. Specifically, there were 48 separate test sequences used in the experiment, and each one was reproduced with two different level offsets. In one case, an attenuation was applied to the sequence so that the subjects would likely have to boost the sequence in order to match its loudness to that of the reference signal. In the other case, a boost was applied and so the subjects would likely need to attenuate the test sequence in order to match its loudness to the reference signal. This provided the 96 test items used in the experiment. Using this approach, it is possible to measure the precision with which a listener can match loudness levels, which in turn provides an indication of the level of precision required for a practical loudness meter.

Each of the test sites sent their results to the author and the data were combined into a single subjective database for analysis. It was well appreciated beforehand that there were numerous experimental variables that could not be fully controlled across all test sites. For example, parameters of the test set-up such as the choice of loudspeaker and the room conditions (reverberant characteristics and background noise) would be different from one test site to the next. Other factors such as the training of subjects or any cultural differences could not be entirely controlled either.

Despite these variables, an analysis of the subjective results showed a very high degree of agreement and consistency between the test sites. Specifically, the correlation between the results of the different test sites was 0.99. This indicates that the various sites did very well at conducting the tests as instructed, and suggests that the subjective database is quite reliable.

The correlation between each subject's data and the average over all subjects was calculated and is plotted in Figure 3. It can be seen that for the most part there is very good agreement among the subjects' data. However, the correlation for four of the subjects was significantly lower, as indicated by the squares around the data points in the figure. The members of SRG3 decided that the data for these four subjects should be discarded from further analysis.
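The screening step described above can be sketched roughly as follows. This is an illustration only: the function names, the toy gain values, and the 0.90 cut-off are assumptions made for the example, since the paper does not state the numerical criterion SRG3 used to discard subjects.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_subjects(gains, threshold=0.90):
    """gains[s][i] is the gain (dB) that subject s applied to test item i.

    Returns (correlations, kept): correlations[s] is the correlation between
    subject s and the mean over all subjects; kept lists the subjects whose
    correlation reaches the threshold (a hypothetical value, not the paper's).
    """
    n_items = len(gains[0])
    mean = [sum(subj[i] for subj in gains) / len(gains) for i in range(n_items)]
    correlations = [pearson(subj, mean) for subj in gains]
    kept = [s for s, r in enumerate(correlations) if r >= threshold]
    return correlations, kept


# Toy data: three listeners who agree and one who does not follow the panel.
toy_gains = [
    [-6.0, -2.0, 0.0, 3.0, 7.0],
    [-5.5, -2.5, 0.5, 2.5, 6.5],
    [-6.5, -1.5, -0.5, 3.5, 7.5],
    [4.0, -3.0, 5.0, -2.0, 1.0],
]
correlations, kept = screen_subjects(toy_gains)
print([round(r, 2) for r in correlations], kept)
```

As in the text, the mean here is taken over all subjects, including any that are later discarded.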

Figure 3: Correlation between individual subjects' data and mean result (horizontal axis: subject; vertical axis: correlation to mean). Squares indicate discarded data.

The members of SRG3 further decided that data from the remaining 93 subjects should be processed using a zero-order correction. That is, each subject's data was normalized to remove the DC component (the mean offset) of his/her loudness ratings. This was done because it was determined that a successful loudness meter does not need to account for the absolute loudness of a signal, but rather it needs to estimate the relative loudness between audio sequences.

3. OBJECTIVE LOUDNESS METERS

A call for proposals was put forward by the ITU-R SRG3 and a total of 10 loudness meters were submitted by 7 different private companies and research organizations. Some of these meters are currently available commercially. The meter proponents (listed alphabetically) consisted of DAG2000/IRT, Dolby Laboratories, Dorrough, NHK/Yamaki, Opticom, Pinguin, and TC Electronic. The meters were sent to the Communications Research Centre's Audio Perception Lab, where the author conducted the analysis.

A verification test was first undertaken to confirm that all of the loudness meters were functioning as intended and to ensure that the author fully understood how to operate each meter. The meter implementations were a mixture of hardware and software, with some being a combination of the two. Some of the meters produced loudness measures in units of phons, while others gave values in decibels. Since the subjective database consisted of a set of decibel gain values, any readings in phons had to be converted to a decibel representation. Certain meters needed to repeat each audio sequence before the loudness reading stabilized.

There are several issues regarding the operation of a successful loudness meter that must ultimately be considered in the final ITU-R recommendation. For example, the complexity and cost of a meter must be assessed before it can be recommended. Another important consideration is whether or not the loudness meter can operate in real time. However, these issues are beyond the scope of the present paper and will not be considered here.

The 96 audio sequences and the reference sequence were processed through each of the meters and the predicted loudness ratings were recorded. Each meter's loudness ratings were then normalized by subtracting the loudness measurement for the reference sequence from the ratings for each of the test sequences. As a result of this process, the overall performance of each loudness meter could be evaluated based on the agreement between the predicted ratings and the actual subjective ratings obtained in the formal subjective tests. The ability of each meter to predict individual audio sequences could also be examined.

Once the process of obtaining loudness measures from each of the meters was complete, the identities of the meters were hidden. Each meter was randomly assigned a letter from A to K, and neither the author nor the meter proponents knew the identities of the meters during the subsequent analysis. Meter E was not included in the analysis.

Before examining the meters' abilities to predict the subjective loudness ratings, the level of agreement between meters was examined. Specifically, the inter-meter correlations were calculated using the 96 loudness readings from each meter. While this does not provide any indication of how well the meters are able to match the subjective database, it does provide an initial comparison of their performance. The inter-meter correlations are given in Table 2. It can be seen that there is reasonably good agreement among the various meters. The main exception to this is Meter H, which tends to have lower correlation with the remaining meters. The relatively high correlation between the meters suggests that they are basing their measures on similar aspects of the signals.

Table 2: Correlation between loudness meter outputs.

     A     B     C     D     F     G     H     I     J     K
A    1.00  0.98  0.90  0.97  0.96  0.97  0.93  0.94  0.93  0.97
B    0.98  1.00  0.88  0.95  0.94  0.98  0.88  0.92  0.90  0.95
C    0.90  0.88  1.00  0.94  0.95  0.94  0.77  0.97  0.96  0.94
D    0.97  0.95  0.94  1.00  0.97  0.98  0.88  0.97  0.98  1.00
F    0.96  0.94  0.95  0.97  1.00  0.96  0.86  0.97  0.95  0.97
G    0.97  0.98  0.94  0.98  0.96  1.00  0.85  0.96  0.96  0.98
H    0.93  0.88  0.77  0.88  0.86  0.85  1.00  0.84  0.83  0.88
I    0.94  0.92  0.97  0.97  0.97  0.96  0.84  1.00  0.98  0.97
J    0.93  0.90  0.96  0.98  0.95  0.96  0.83  0.98  1.00  0.98
K    0.97  0.95  0.94  1.00  0.97  0.98  0.88  0.97  0.98  1.00

3.1. Basic Loudness Measures

In a previous paper by the author, several basic objective loudness measures were evaluated [3]. These objective measures consisted of a simple frequency weighting function, followed by an RMS measurement block. The best performance in that study was obtained with a frequency weighting curve referred to as the Revised Low-frequency B-curve (RLB). Unweighted RMS was also found to perform quite well in that study. Therefore these two basic loudness measures (Leq and Leq(RLB)) were included in the present analysis to provide a benchmark for the performance of the 10 proposed meters. The Leq measure was labeled Meter L, while the Leq(RLB) measure was labeled Meter M in the subsequent analysis. Leq is defined as

\[ L_{eq} = 10 \log_{10} \left[ \frac{1}{T} \int_{0}^{T} \frac{x^{2}}{x_{Ref}^{2}} \, dt \right] \ \text{dB} \tag{1} \]

where x is the signal at the input to the meter, xRef is some reference level, and T is the length of the audio sequence.

4. PERFORMANCE METRICS

In order to assess the performance of the various loudness meters objectively, it was necessary to establish a set of suitable performance metrics that would effectively reflect the requirements of a practical loudness meter. In general, we want the meter to match the relative levels of the database as closely as possible. However, small errors in the meter's predictions are probably acceptable since listeners are unlikely to detect (or be annoyed by) small changes in loudness. Based on the findings of Soulodre et al., loudness errors of less than 1.25 dB are expected to go largely unnoticed [1]. Therefore, a meter could be considered to be ideal if all of its errors were less than 1.25 dB. Conversely, even a single error beyond some limit could be considered entirely unacceptable, thus disqualifying a given meter from further consideration. For example, what is the potential impact of an objective meter that wrongly predicts the loudness of a given sequence by, say, 10 dB?

The SRG3 members agreed on nine different performance metrics to compare the meters. Certain metrics were designed to reflect a meter's average performance while accounting for the impact of outliers. Severe outliers resulted in a corresponding penalty on the meter's performance rating. Some of the metrics had been suggested previously in a study by Soulodre and Norcross [3].

One obvious way to evaluate the performance is to compute the correlation between the subjective database and the relative loudness values predicted by the individual meters. While the correlation coefficient (R) is a common metric for comparing the performance of various systems, there are inherent limitations with this approach. For example, the correlation coefficient can provide an indication of the overall relation between the subjective database and the output of a given loudness meter. However, it does not provide any information regarding the performance of the meter for individual audio sequences. In the context of an objective loudness meter, this can be a very important consideration. A loudness meter may eventually be used to automatically monitor and equalize the levels of audio signals in a broadcast application. Therefore, if the meter makes a significant error in judging the loudness of a particular audio signal, then the result of this error could be a sudden drop or increase in loudness.

While small drops or increases in loudness may be acceptable (or indeed, undetectable) to the end listener, larger changes will be quite unacceptable. Consider a situation where someone is listening at a low level in order not to disturb other people nearby. If there is a sudden drop in loudness then the listener may not be able to properly hear the audio content at that time. Conversely, if there is a sudden increase in loudness, then this may disturb those nearby.


Spearman's rho was also used as a metric in evaluating the performance of the meters. It is similar to the correlation metric but is based on the relative ranking of the data rather than the actual numerical values of the data. This metric does not provide any information regarding individual audio sequences.

Another potential problem with a simple correlation (or Spearman's rho) metric is that it is insensitive to possible scaling issues. For example, two meters may have identical correlation coefficients, while one of the meters is always off by a factor of 10. To avoid this particular problem, the RMS (root-mean-square) error between the subjective loudness ratings and the objective values could be computed. The RMS error metric as defined below directly accounts for the error (in dB) between individual subjective and objective data points.

\[ \mathrm{RMSE} = \left[ \frac{1}{N} \sum_{i=1}^{N} e(i)^{2} \right]^{1/2} \ \text{dB} \tag{2} \]

where e(i) = objective(i) - subjective(i) is the difference between the meter output and the subjective loudness rating for each audio sequence, and N is the number of audio sequences. The average absolute error (AAE) was also included as a performance metric. It is similar to the RMS error and is defined as

\[ \mathrm{AAE} = \frac{1}{N} \sum_{i=1}^{N} \left| e(i) \right| \ \text{dB} \tag{3} \]

Given the above discussion, it is reasonable to try to devise a performance metric that reflects the loudness meter's average performance while accounting for the impact of outliers (errors). Severe outliers should result in a corresponding penalty on the meter's performance rating. There are several possible approaches to penalize the objective meters based on the severity of outlying data points. One basic approach is to determine the maximum absolute error (MAE) that a meter produces over all of the audio sequences in the database. The loudness meter with the lowest MAE (in dB), as defined in Equation 4, would be preferred in this case. Alternatively, one may wish to eliminate any loudness meter that produces a maximum absolute error above some limit value. That is, any error beyond some limit may be deemed unacceptable.

\[ \mathrm{MAE} = \max \{ \left| e(i) \right| ;\ i = 1, \ldots, N \} \ \text{dB} \tag{4} \]

A more elaborate penalty-based metric derived in [3] was also used to evaluate the meters. The goal of this metric was to specifically account for the requirements of a successful objective loudness meter in typical broadcast applications. For each audio sequence a performance index between 0 and 1 is computed based on the error between the objective measure and the subjective database. A value of 1 indicates perfect performance for that audio sequence (i.e., the objective meter perfectly predicted the subjective loudness). Conversely, a value of 0 indicates that the performance of the objective meter was unacceptable. A performance index of 0 occurs when the absolute error (in dB) is beyond some acceptable limit. The proposed performance index is shown in Figure 4.

Figure 4: Proposed performance index curve.

As stated above, the performance index is computed for each audio sequence in the database. An overall Loudness Performance Index (LPI) is then computed using the product of the individual performance indices as follows,

\[ \mathrm{LPI} = \prod_{i=1}^{N} \max \left( 1 - \left| \frac{subj(i) - obj(i)}{L} \right|^{p},\ 0 \right) \tag{5} \]

where p determines the shape of the performance index curve, and L is the limit (in dB) for the maximum absolute error. A higher value of p has the effect of


flattening the top of the curve while causing the slopes to be steeper. The curve in Figure 4 has p = 2.5 and L = 10 dB. With this choice of parameters any individual errors of less than 1 dB will receive virtually no penalty, whereas any error greater than 10 dB will receive a value of 0. Taking the product of the individual performance indices produces a desirable result. Specifically, the overall LPI will be zero if any of the individual performance indices is zero. Moreover, individual errors (subjective result minus objective result) are more severely penalized as their magnitude increases towards the limit. With this metric a perfect objective loudness meter would have an LPI of 1.
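The error-based metrics of Equations 2 to 5 can be sketched as below, together with the correlation coefficient for comparison. The function names and the toy gain values are illustrative assumptions, and the LPI line follows the reading of Equation 5 in which each index is max(1 - |e/L|^p, 0) with p = 2.5 and L = 10 dB. The example also illustrates the scaling issue raised earlier: a meter that is perfectly correlated with the subjective data but wrongly scaled still receives a large RMSE.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient R between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def error_metrics(objective, subjective, p=2.5, limit_db=10.0):
    """R, RMSE, AAE, MAE (all in dB except R) and LPI for one meter."""
    errors = [o - s for o, s in zip(objective, subjective)]  # e(i)
    n = len(errors)
    lpi = 1.0
    for e in errors:
        # Per-item performance index: 1 for zero error, 0 at or beyond the limit.
        lpi *= max(1.0 - abs(e / limit_db) ** p, 0.0)
    return {
        "R": pearson(objective, subjective),
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "AAE": sum(abs(e) for e in errors) / n,
        "MAE": max(abs(e) for e in errors),
        "LPI": lpi,
    }


# Toy relative gains in dB (not data from the paper).
subjective = [-8.0, -4.0, -1.0, 2.0, 5.0, 9.0]
meter_close = [-7.5, -4.5, -0.5, 2.5, 4.0, 9.5]
meter_scaled = [2.0 * g for g in subjective]   # perfectly correlated, wrong scale
meter_outlier = meter_close[:-1] + [20.0]      # one error of 11 dB

close = error_metrics(meter_close, subjective)
scaled = error_metrics(meter_scaled, subjective)
outlier = error_metrics(meter_outlier, subjective)
for name, m in (("close", close), ("scaled", scaled), ("outlier", outlier)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Because the LPI is a product, the single 11 dB error in `meter_outlier` drives its LPI to zero regardless of how well the other items are predicted.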

The subjective database against which the loudness meters are being evaluated may also be considered. That is, individual points in the database represent the average loudness ratings as derived from numerous listeners. There is an inherent variance in the subjective data, which is commonly reflected in the 95% confidence intervals around the mean values as shown in Figure 5. This variance limits the resolution with which the loudness meters can be evaluated. It may be reasonable to consider any objective data point that falls within the confidence interval to have zero error (i.e., a perfect prediction) for that test item.

Figure 5: Mean subjective loudness ratings with confidence intervals (horizontal axis: test sequence; vertical axis: gain, dB).

It can be seen from Figure 5 that the confidence intervals for the database are actually quite small, illustrating the high level of agreement among subjects. The confidence intervals are less than 2 dB for all audio sequences, and are typically less than 1 dB.

To reflect the variance in the subjective data, the meters were also evaluated using a variant of the LPI. The Advanced Loudness Performance Index (ALPI) modifies the curve in Figure 4 so that it has a flat top (Performance Index = 1) for any errors less than +/- 1.25 dB. That is, the ALPI does not penalize a meter for any errors less than 1.25 dB. This range encompasses the range of confidence intervals for the subjective database. Therefore, this metric does not penalize a meter if its loudness rating falls within the confidence interval for a given audio sequence.

Skovenborg proposed a metric referred to as the Subjective Deviation as a means of evaluating the impact of errors for individual audio sequences [4]. It is defined as,

\[ SubjDev(i) = \frac{\left| MeterPrediction(i) - \mathrm{median}(SubjectiveValues(i)) \right|}{\mathrm{InterquartileRange}(SubjectiveValues(i))} \tag{6} \]

With the Subjective Deviation metric a meter will not be penalized as severely on test sequences with larger variances in the subjective ratings. In other words, if the subjects did not agree very well on the loudness of a given audio sequence, then the meters will not be overly penalized for not accurately predicting the mean loudness rating.

Subjective Deviation values are calculated for each audio sequence. In order to obtain a single overall rating for a given meter, the Mean Subjective Deviation was used. This is obtained as follows,

\[ MeanSubjDev = \frac{1}{N} \sum_{i=1}^{N} SubjDev(i) \tag{7} \]

Therefore, a lower value for the Mean Subjective Deviation indicates better performance by the loudness meter. In order to take advantage of some of the properties of both the LPI metric described earlier and the Subjective Deviation metric, another metric was derived. The Product Subjective Deviation is defined as,

\[ ProdSubjDev = \left[ \prod_{i=1}^{N} \frac{1}{1 + SubjDev(i)} \right]^{1/N} \tag{8} \]
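A minimal sketch of Equations 6 to 8 follows. It rests on several assumptions: the Subjective Deviation is taken as an absolute difference, the quartiles use the "inclusive" method of Python's `statistics.quantiles` (the paper does not say how the interquartile range was computed), and the function names and toy listener data are invented for the example.

```python
import math
import statistics


def subjective_deviation(meter_prediction, subjective_values):
    """Equation 6 for one item: |prediction - median| / interquartile range.

    Raises ZeroDivisionError if all listeners gave identical values.
    """
    q1, _, q3 = statistics.quantiles(subjective_values, n=4, method="inclusive")
    median = statistics.median(subjective_values)
    return abs(meter_prediction - median) / (q3 - q1)


def mean_subj_dev(deviations):
    """Equation 7: arithmetic mean of the per-item deviations (lower is better)."""
    return sum(deviations) / len(deviations)


def prod_subj_dev(deviations):
    """Equation 8: N-th root of the product of 1 / (1 + SubjDev(i)); 1 is perfect."""
    return math.prod(1.0 / (1.0 + d) for d in deviations) ** (1.0 / len(deviations))


# Toy data: per-item listener gains (dB) and one meter's predictions.
listener_gains = [
    [-3.0, -2.0, -1.0, 0.0, 1.0],   # median -1.0, interquartile range 2.0
    [2.0, 4.0, 6.0, 8.0, 10.0],     # median 6.0, interquartile range 4.0
]
predictions = [0.0, 6.0]

deviations = [subjective_deviation(p, g) for p, g in zip(predictions, listener_gains)]
print(deviations, mean_subj_dev(deviations), prod_subj_dev(deviations))
```

In this form the second item contributes no penalty because the prediction equals the listeners' median, while the first item is off by half an interquartile range.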


Like the LPI metric, a Product Subjective Deviation of 1 indicates that the meter performance is perfect. Each of the ten proposed loudness meters as well as the two basic loudness measures were evaluated using the nine performance metrics described above. The results are presented in the following section.

5. RESULTS

A good first step in evaluating the performance of the loudness meters is to plot the predicted (objective) loudness values against the corresponding values from the database. Figures 6 through 17 plot the results for the twelve loudness meters. The data are plotted in terms of the gain that needs to be applied to a given audio signal in order to match its level to the reference signal. The open circles represent speech-based audio sequences, while the stars are nonspeech-based sequences. It should be noted that a perfect objective meter would result in all data points falling on the diagonal line having a slope of 1 and passing through the origin (as shown in the figures). Any data point falling above the diagonal line indicates that the meter overestimated the gain required to match the loudness of that audio sequence to the reference signal. That is, the meter underestimated the perceived loudness of that particular audio sequence.

From Figure 6 it can be seen that Meter A's performance is not very good. Specifically, the data points are not tightly clustered around the diagonal as desired. There are several outlying data points for this meter. Notably, there is a data point in the top right-hand corner of the graph that is quite far from the diagonal. This point represents the result for a selection of classical music and proved to be a problem for all of the loudness meters. All of the meters underestimated the perceived loudness of this sequence.

Figure 7 shows the results for Meter B. It can be seen from the graph that its performance is similar to Meter A's. Of note, the outlying data point in the upper right-hand corner is not as severe for this meter. In fact, Meter B performs best at predicting the loudness of this audio sequence. It was revealed in an SRG3 meeting that Meter B was an implementation of A-weighted Leq.

The results for Meter C are plotted in Figure 8. The figure shows slightly better clustering of the data points along the diagonal for this meter.

Meter D performed significantly better than the previous three meters. Figure 9 shows the majority of data points closely following the diagonal for this meter.

Figure 10 plots the results for Meter F. Its performance appears to be similar to Meter D, although the worst-case outlier for Meter F is not as severe (3.6 dB versus 4.7 dB).

The results for Meter G are plotted in Figure 11. The performance of this meter is quite reasonable and is somewhat similar to Meter D. It was revealed in an SRG3 meeting that Meter G was an implementation of B-weighted Leq.

Figure 12 plots the results for Meter H. This meter's performance is significantly worse than all of the other meters. This meter was also found to have the lowest correlation with the other meters (see Table 2).

Figures 13 and 14 provide the results for Meter I and Meter J respectively. It can be seen that the two meters have similar performance, with Meter I being slightly better. This is most evident at lower gain values.

Meter K performs quite well as shown in Figure 15. The data points are well clustered along the diagonal. The performance of Meter K is very similar to that of Meter D as indicated by the inter-meter correlation in Table 2.

The results for Meter L are plotted in Figure 16. This meter performs very well with all of the data points closely following the diagonal. This is the Leq meter that was contributed by the author.

Finally, Figure 17 plots the results for Meter M. Meter M provides the best performance of all of the loudness meters examined in this study. All of the data points are tightly clustered along the diagonal, with a worst-case outlier of 3.6 dB. While Meter L and Meter M have similar performance, data points at gains between 5 and 10 dB are closer to the diagonal for Meter M.

Therefore, based on a visual inspection of the scatter plots for each meter, Meter M appears to be the best at predicting the results in the subjective database. This is the Leq(RLB) meter proposed by the author.
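Since several of the meters discussed in this section are Leq variants, it may help to make the basic measure of Equation 1 concrete. The following is a minimal discrete-time sketch, not the author's implementation: the function names are illustrative, `weighting` is a hypothetical hook for a frequency-weighting filter (the A, B, and RLB curves are not specified in this paper and are not implemented here), and the sign convention of `matching_gain_db` is an assumption chosen to match the "gain to be applied" axis of the scatter plots.

```python
import math


def leq_db(samples, x_ref=1.0, weighting=None):
    """Discrete form of Equation 1: 10*log10 of the mean of (x / x_ref)^2.

    weighting, if given, is a callable that filters the samples first.
    Raises ValueError for an all-zero (silent) sequence.
    """
    if weighting is not None:
        samples = weighting(samples)
    mean_square = sum((x / x_ref) ** 2 for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def matching_gain_db(test_samples, reference_samples, weighting=None):
    """Gain (dB) to apply to the test item so that its Leq equals the reference Leq."""
    return (leq_db(reference_samples, weighting=weighting)
            - leq_db(test_samples, weighting=weighting))


# One second of a 1 kHz sine at 48 kHz, and the same tone at half amplitude.
fs = 48000
reference = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(fs)]
test_item = [0.5 * x for x in reference]

print(round(leq_db(reference), 2))                       # full-scale sine
print(round(matching_gain_db(test_item, reference), 2))  # gain needed by the quieter tone
```

Halving the amplitude lowers Leq by about 6 dB, so the quieter tone needs roughly 6 dB of gain to match the reference under this measure.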


Figure 6: Meter A.

Figure 7: Meter B [Leq(A)].

Figure 8: Meter C.

Figure 9: Meter D.

Figure 10: Meter F.

Figure 11: Meter G [Leq(B)].


Figure 12: Meter H.

Figure 13: Meter I.

Figure 14: Meter J.

Figure 15: Meter K.

Figure 16: Meter L [Leq].

Figure 17: Meter M [Leq(RLB)].


Table 3: Performance of the loudness meters for the nine performance metrics. Values in brackets [ ] indicate relative ranking for each metric.

Meter          R           Spearman rho  RMSE (dB)  MAE (dB)   AAE (dB)   LPI          ALPI         Prod SubjDev  Mean SubjDev
A              0.944 [10]  0.9165 [10]   2.37 [11]  6.37 [10]  1.88 [11]  0.1139 [11]  0.1177 [11]  0.0000 [10]   0.9608 [11]
B [Leq(A)]     0.929 [11]  0.8888 [11]   2.19 [10]  6.39 [11]  1.77 [10]  0.1754 [10]  0.1830 [10]  0.0000 [10]   0.8945 [10]
C              0.955 [9]   0.9522 [8]    1.75 [9]   5.76 [9]   1.35 [9]   0.3574 [9]   0.3737 [9]   0.6239 [8]    0.6722 [9]
D              0.976 [3]   0.9580 [5]    1.31 [3]   4.70 [5]   0.99 [3]   0.5960 [3]   0.6325 [3]   0.6793 [4]    0.5152 [4]
F              0.965 [8]   0.9508 [9]    1.55 [8]   3.61 [1]   1.28 [8]   0.5059 [7]   0.5296 [7]   0.6200 [9]    0.6686 [8]
G [Leq(B)]     0.972 [5]   0.9524 [7]    1.37 [6]   4.19 [4]   1.07 [5]   0.5770 [5]   0.6131 [5]   0.6695 [5]    0.5390 [5]
H              0.848 [12]  0.8407 [12]   3.33 [12]  6.89 [12]  2.90 [12]  0.0092 [12]  0.0093 [12]  0.0000 [10]   1.4698 [12]
I              0.972 [5]   0.9602 [3]    1.36 [5]   4.80 [6]   1.09 [6]   0.5889 [4]   0.6218 [4]   0.6637 [6]    0.5513 [6]
J              0.968 [7]   0.9547 [6]    1.51 [7]   4.97 [7]   1.17 [7]   0.4910 [8]   0.5193 [8]   0.6500 [7]    0.5926 [7]
K              0.975 [4]   0.9582 [4]    1.33 [4]   5.13 [8]   0.99 [3]   0.5754 [6]   0.6105 [6]   0.6800 [3]    0.5140 [3]
L [Leq]        0.979 [2]   0.9713 [1]    1.26 [2]   4.33 [3]   0.93 [2]   0.6245 [2]   0.6640 [2]   0.7090 [2]    0.4509 [2]
M [Leq(RLB)]   0.982 [1]   0.9713 [1]    1.15 [1]   3.62 [2]   0.87 [1]   0.6973 [1]   0.7459 [1]   0.7147 [1]    0.4337 [1]

The twelve loudness meters were assessed using the nine performance metrics described in Section 4. The results are summarized in Table 3. Each column in the table represents the results for one of the performance metrics. The numbers shown in square brackets indicate the relative ranking of the loudness meters for each metric.

It can be seen that Meter M is ranked as the best meter (tied with Meter L for Spearman rho) for all of the metrics except the maximum absolute error (MAE), for which it is ranked second. However, it can be considered effectively equivalent to the first-ranked meter for this measure, since its error is only 0.01 dB larger. Meter M is the Leq(RLB) measure proposed in [3]. According to the various performance metrics, the second-best meter is Meter L. This is a simple Leq measure (i.e., an RMS measure). Therefore, for the present study, none of the loudness meters submitted by the proponents performed as well as Leq or Leq(RLB).

It is interesting to note that the various performance metrics tend to agree in their ranking of the best and worst meters. For example, nearly all of the metrics indicate that Meter M is the best loudness meter, while Meter H performed the poorest.
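As an illustration of how the error-based columns of Table 3 relate to the scatter plots, the following minimal sketch computes five of the metrics from paired subjective and predicted gains. MAE is the maximum absolute error, as stated in the text. Reading AAE as the average absolute error, computing RMSE without removing any mean offset, and using tie-averaged ranks for Spearman rho are assumptions here, since the formal definitions are given in Section 4. LPI, ALPI and the SubjDev metrics are not attempted. All function names and data values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def error_metrics(subjective_db, predicted_db):
    """Assumed definitions of R, Spearman rho, RMSE, MAE and AAE (gains in dB)."""
    errors = [p - s for s, p in zip(subjective_db, predicted_db)]
    n = len(errors)
    return {
        "R": pearson_r(subjective_db, predicted_db),
        "Spearman rho": pearson_r(average_ranks(subjective_db),
                                  average_ranks(predicted_db)),
        "RMSE (dB)": math.sqrt(sum(e * e for e in errors) / n),
        "MAE (dB)": max(abs(e) for e in errors),   # maximum absolute error
        "AAE (dB)": sum(abs(e) for e in errors) / n,  # assumed: average absolute error
    }


# Hypothetical gains in dB, not taken from the subjective database.
subjective = [-6.0, -3.0, 0.0, 2.0, 5.0]
predicted = [-5.0, -3.5, 0.5, 1.0, 7.0]
for name, value in error_metrics(subjective, predicted).items():
    print(f"{name}: {value:.4f}")
```

Because MAE depends on a single worst-case sequence while the other metrics average over the whole database, it is plausible for a meter to rank differently on MAE than on the rest, as Meter M does.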

6. CONCLUSIONS

A comprehensive examination of the current state of the art in objective loudness meters was conducted. Formal subjective loudness-matching experiments were performed at five test sites around the world in order to create a subjective database. The tests used actual broadcast material representative of the full range of expected broadcast audio content. Seven proponents submitted ten objective loudness meters for evaluation. In addition, two basic loudness measures (Leq(RLB) and Leq) were contributed by the author. The performance of the twelve loudness meters was evaluated using nine different performance metrics. Leq(RLB) was ranked first (in one case jointly with Leq) on eight of the nine performance metrics and was within 0.01 dB of the first-ranked meter on the remaining one (MAE). Leq was found to perform almost equally well. The findings suggest that for typical broadcast material, a simple energy-based loudness measure is more robust than more complex measures that may include detailed perceptual models. This finding is supported by the fact that one European broadcaster (TV2/Denmark) has been successfully using high-pass filtered RMS for many years as a measure of loudness [5].



The results of this study are being used by the ITU-R in the preparation of a new recommendation on loudness metering in broadcast applications.

7. ACKNOWLEDGEMENTS

The author would like to acknowledge the contributions of the following people, who oversaw the subjective tests at the different test sites: Michael Bennett and Ian Dash (ABC), Andrew Mason (BBC), Michel Lavoie (CRC), Michael Drolet (NFB), and Elizabeth Convery and Gitte Keidser (NAL). The author would also like to thank the meter proponents for their participation in this study.

8. REFERENCES

[1] Soulodre, G.A., Lavoie, M.C., and Norcross, S.G., "The Subjective Loudness of Typical Program Material," presented at the 115th AES Convention, New York, October 2003, preprint 5892.
[2] Private communication, Eric Benjamin, Dolby Laboratories, May 2003.
[3] Soulodre, G.A. and Norcross, S.G., "Objective Measures of Loudness," presented at the 115th AES Convention, New York, October 2003, preprint 5896.
[4] Private communication to the SRG3 committee, Esben Skovenborg, TC Electronic, August 2003.
[5] Private communication to the SRG3 committee, Eddy B. Brixen, August 2003.

9. APPENDIX

Leq(RLB) was found to be the best loudness meter in the present study. The RLB weighting curve is shown in Figure 18. It can be seen that the RLB weighting consists of a simple high-pass filter.

Figure 18: RLB weighting curve (relative level in dB versus frequency in Hz).

The RLB weighting curve can be implemented as a 2nd-order filter,

[Block diagram: 2nd-order filter structure with two unit delays (z^-1), feedback coefficients a1 and a2, and feedforward coefficients b0, b1 and b2.]

with the following coefficients.

a1   -1.99004745483398
a2    0.99007225036621
b0    1.0
b1   -2.0
b2    1.0

With the RLB curve defined, Leq(RLB) is measured as

$$L_{eq}(RLB) = 10 \log_{10}\left[\frac{1}{T}\int_{0}^{T}\frac{x_{RLB}^{2}}{x_{Ref}^{2}}\,dt\right] \ \text{dB} \qquad (10)$$

where x_RLB is the input signal filtered by the RLB weighting curve, x_Ref is some reference level, and T is the length (duration) of the audio sequence.
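A minimal discrete-time sketch of the RLB filter and of equation (10) follows. It rests on several assumptions that are not stated in this appendix. The three feedforward values 1.0, -2.0, 1.0 are taken as b0, b1, b2. The feedback terms are assumed to be subtracted, i.e. H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2). The coefficients are assumed to apply to a 48 kHz sampling rate. The integral is approximated by the mean of the squared samples. The function names and the default x_ref = 1.0 are hypothetical.

```python
import math

# Coefficients from the appendix (sampling rate of 48 kHz is an assumption).
A1, A2 = -1.99004745483398, 0.99007225036621
B0, B1, B2 = 1.0, -2.0, 1.0


def rlb_filter(x):
    """Apply the 2nd-order RLB weighting filter to a sequence of samples.

    Assumed difference equation:
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    """
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = B0 * xn + B1 * x1 + B2 * x2 - A1 * y1 - A2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def leq_rlb(x, x_ref=1.0):
    """Discrete approximation of equation (10), returned in dB."""
    y = rlb_filter(x)
    mean_square = sum(v * v for v in y) / len(y)
    return 10.0 * math.log10(mean_square / (x_ref * x_ref))


# Illustrative input: one second of a full-scale 1 kHz sine at the assumed rate.
FS = 48000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(FS)]
print(f"Leq(RLB) of the test tone: {leq_rlb(tone):.2f} dB re x_ref")
```

With these coefficients the numerator places a double zero at DC, so a constant offset in the input is rejected and low-frequency content is attenuated, consistent with the high-pass shape described for Figure 18.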
