
Principles of Performance Measurement

Tom Wilson

Introduction

Many engineers are brought into the performance world without sufficient exposure to the scientific principles that exist. This paper provides an introduction to the principles of performance measurement. Extensive discussion of several topics exists in the referenced material and is not repeated here. Exposure to the field of statistics is inevitable, and, in this paper, it is kept to a minimum. Without these foundational principles, the execution of more complicated tasks, such as performance evaluation, is likely to be tarnished with unnecessary error.

Measurement is the process of obtaining the magnitude of a quantity relative to a unit of measurement. Length, mass, and time are common quantities. There are numerous units of measurement (e.g., meter, gram, second) belonging to a variety of systems (refer to [Low10]). Performance measurement is the concept of measurement applied to the performance aspects of computing systems. Performance measurement often measures time intervals, or the time between events; response time is a performance measurement. Performance measurement also includes event counting, such as arrivals to the system. Counting is performed over an interval, which can itself be measured. We will explain why a count is best defined as a statistic.

When we take measurements, accuracy is a concern in at least three areas. The first is detailing what was measured (i.e., what the starting and ending points of the measured quantity are). The second is the accuracy associated with the measuring device and/or process. The third is the accuracy of the representation of the measurement. We will also discuss the loss of information introduced by statistics as a source of inaccuracy.

The term metric is synonymous with measurement. In system development situations, metrics can include estimates. It is important to understand when estimates are mixed with measurements, as this affects accuracy. Older literature may refer to a performance metric as a performance index.

Measuring Intervals

In computing systems, the most common interval type is time. An example of a measurement that does not involve time is the size of a file before and after compression; what is being measured there is the number of bytes of memory or disk space. Despite many such examples, time is usually the focus of performance measurements. Time intervals are usually bounded by start and end events; response time, for instance, is the time between command submission and response receipt. One of the areas of accuracy already mentioned involves detailing what is measured. For the system developer, the response time is typically bounded by the entry of data into the system and the exit of a result out of the system. The user's view of the system also includes the network in between. Figure 1 illustrates these two intervals. Not understanding the interval bounds is a common source of error: in this example there are two very different things that can be measured, and the error is not in the measuring but in specifying what was measured.
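As a small illustration of the two interval definitions, the Python sketch below (all names hypothetical, and with no real network involved) timestamps the same request at the client and inside the system. Only the choice of start and end events differs; in a deployed system, the gap between the two intervals would be the network and client overhead.

import time

def handle(command):
    # Hypothetical system: the system response time is bounded by
    # entry of the command and exit of the result.
    entry = time.perf_counter()
    result = command.upper()          # stand-in for real processing
    exit_ = time.perf_counter()
    return result, exit_ - entry      # system response time (seconds)

# The user's perceived response time is bounded by different events:
# command sent and result received, which include everything outside
# the system (e.g., the network).
sent = time.perf_counter()
result, system_rt = handle("list jobs")
received = time.perf_counter()
perceived_rt = received - sent

print(f"system response time:    {system_rt:.6f} s")
print(f"perceived response time: {perceived_rt:.6f} s")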
Figure 1: Example Measurement Intervals (the system response time, bounded by the command reaching and the result leaving the system, versus the user's perceived response time, bounded by the command being sent and the result being received)


Published in CMG MeasureIT, June 2010.

Since time is a continuous variable, measuring it and performing calculations with those measurements involves error, because measuring devices have limited resolution. Systems that interface with the physical world typically have sensors that attempt to measure physical quantities (e.g., the amount of fuel in a tank, the weight of an object, wind speed); these, too, are continuous variables. In computing systems, however, the continuous realm is converted to a discrete one. Again, we will concentrate on time intervals. [Gun05] contains a significant discussion of the various aspects of time.

During this conversion from the continuous realm to a discrete one, the resolution of our measuring device becomes a factor. This is the second area of accuracy previously mentioned. Chemistry and physics give appropriate attention to the error introduced by measuring, since both deal with physical materials and environments. Studies in computing tend to ignore this concept by using as many digits of precision as the computer allows, resulting in a false sense of resolution. There is abundant material on the topic of significant digits, or significant figures ([Low10], [Gun06], [wik, Significant figures, 2010]), and we will not repeat it here. The basic premise is that measuring instruments have limited resolution, and using them introduces error once the object being measured exceeds that resolution. Claiming resolution beyond what actually exists is an incorrect practice that can have large consequences, especially when a small error is multiplied by a large number. Rules therefore exist for expressing and manipulating (e.g., adding or multiplying) measurements.

Representing numbers can also affect accuracy. If the measurement cannot be stored exactly, error is introduced; a common example is representing 0.1 in a floating-point format (refer to [wik, Floating point, 2010]). If the arithmetic manipulation of two numbers requires more resolution than the format supports (e.g., adding a very large number and a very small number), there is also error. Note that such errors are different from those related to significant digits. Much literature exists on the topic of number representation, and some software packages overcome hardware limitations by using more bits to represent numbers.
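Both representation problems mentioned above are easy to demonstrate. The short Python sketch below shows that 0.1 cannot be stored exactly in the binary floating-point format Python uses, and that adding a very small number to a very large one loses the small number entirely; neither error has anything to do with the significant digits of a measuring instrument.

from decimal import Decimal

# 0.1 has no exact binary floating-point representation; what is stored
# is a nearby value, and the error shows up in simple arithmetic.
print(Decimal(0.1))        # 0.1000000000000000055511151231257827...
print(0.1 + 0.2 == 0.3)    # False: the representation errors do not cancel

# Adding a very large and a very small number exceeds the roughly 16
# significant decimal digits an IEEE 754 double provides.
print(1.0e16 + 1.0 - 1.0e16)   # 0.0: the 1.0 was lost entirely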

Measurement Qualities

Once we take measurements, we would like to understand their qualities (i.e., how good they are) and, moreover, to quantify those qualities. Accuracy expresses how close a measured value is to its expected, or true, value; unfortunately, the true value may not be known. Precision expresses how close a set of measurements are to each other. Statistically, the standard deviation can reflect the precision of a set of measurements. To quantify accuracy, a calculation similar to the standard deviation can compute the deviation from the expected value rather than from the mean. [wik, Accuracy and precision, 2010] discusses these two terms with respect to a probability distribution. A measurement is repeatable if the same person and/or the same environment reproduces it; it is reproducible if another person or another environment produces it. These qualities may be hard to quantify, but as systems grow in complexity they become important. Figure 2 illustrates the accuracy and precision qualities with four examples. The bull's-eye provides a two-dimensional view, while the line below each bull's-eye gives a one-dimensional view. For the bull's-eyes, the center is the expected value; for the lines, a green arrow marks the expected value.
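A minimal sketch of how these qualities might be quantified, assuming the expected (true) value is known: precision as the standard deviation of the measurements about their own mean, and accuracy as the analogous deviation computed from the expected value instead. The sample values below are invented for illustration.

import math

def precision(samples):
    """Spread of the measurements about their own mean (standard deviation)."""
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))

def accuracy(samples, expected):
    """Root-mean-square deviation from the expected (true) value."""
    return math.sqrt(sum((x - expected) ** 2 for x in samples) / len(samples))

# Hypothetical response-time measurements (seconds) with a true value of 2.0.
measurements = [2.31, 2.29, 2.33, 2.30, 2.32]   # precise but not accurate (case b)
print(precision(measurements))                   # small spread about the mean
print(accuracy(measurements, expected=2.0))      # large deviation from 2.0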

Figure 2: Accuracy vs. Precision. (a) high accuracy, high precision; (b) low accuracy, high precision; (c) high accuracy, low precision; (d) low accuracy, low precision. Each example marks the target (the expected value) and the mean of the measurements.

All four examples give a qualitative expression (high vs. low), not a quantitative one. In case (a), both accuracy and precision are high; this is the ideal. In case (b), precision is high but accuracy is low. This situation often involves a systematic error that is very repeatable; measuring the user's perceived response time instead of the system response time is a good example, since all of the values would be consistently higher than the expected value and would cluster around a mean. In case (c), accuracy is high but precision is low. This situation is unlikely but can happen; it can result from many unknown factors that contribute small errors which tend to cancel each other out. In case (d), both accuracy and precision are low. This can result from a bad test or a bad system. Cases (c) and (d) can arise from executing non-repeatable tests; altering the solution or unknowingly changing what is measured makes a test non-repeatable. Factors that impact measurement are discussed below.

Sampling Counts

Counters are frequently used to measure performance, but as previously mentioned, counters are best viewed as statistics. Descriptive, or summary, statistics (e.g., the mean or the median) represent a large amount of data with a small amount of data, and that is effectively what counters do. When a counter changes value, the information about when it changed is lost. This is not incorrect; it is simply an observation about a shortcoming of counters.

A counter is sampled at certain intervals to capture its values over time. In statistics, such values are termed time-series data ([wik, Time Series, 2010]). Usually, the sampling interval has a consistent length, and it is often independent of the frequency of the data (i.e., the rate at which the events arrive). Signal-processing techniques ([wik, Sampling (signal processing), 2010]) deal with many of the issues that arise when converting continuous data to discrete data. While we are not dealing with truly continuous data (events in a system occur at discrete times corresponding to clock ticks), we are dealing with rates of change that the sampling frequency may not handle. Increasing the sampling rate by decreasing the interval size is one way to improve accuracy.

We will examine the concerns with sampling counts by looking at an example. Table 1 lists 100 values that were randomly generated. The numbers were rounded to three decimal places and sorted for convenience of presentation. We interpret the numbers as the times at which events occur that we would count.
Table 1: Source Data to be Sampled

3.845 5.080 5.685 6.511 7.250 8.326 9.399 9.878 10.270 11.051
3.906 5.268 5.745 6.583 7.290 8.493 9.408 9.897 10.316 11.053
4.353 5.350 5.901 6.601 7.433 8.755 9.425 9.926 10.323 11.107
4.516 5.417 5.961 6.630 7.520 8.829 9.478 9.950 10.452 11.190
4.545 5.471 6.128 6.668 7.628 8.912 9.499 10.029 10.487 11.507
4.621 5.477 6.132 6.686 7.708 8.944 9.554 10.021 10.510 11.713
4.696 5.557 6.134 6.688 7.763 9.092 9.641 10.105 10.598 11.862
4.786 5.607 6.263 6.690 7.946 9.114 9.672 10.143 10.857 11.962
4.887 5.633 6.369 6.724 8.127 9.274 9.804 10.230 10.924 12.107
4.950 5.646 6.387 7.021 8.130 9.319 9.876 10.263 11.004 12.527

The numbers were sampled using three different sets of parameters; the sets differ in the choice of starting value and the size of the sampling interval. Essentially, the sampling method creates a histogram, and it is well established that histograms are sensitive to the data they represent. Table 2 shows the results of the sampling (note: when the counter is sampled, it is reset to 0). Color has been added to correlate the samples to a subsequent figure. Samples 1 and 2 have the same interval width (1.0) but different starting points; sample 3 has half the width (i.e., twice the frequency) of the first two. Note that samples 1 and 2 can be derived from sample 3 (by summing adjacent pairs of values) and therefore provide no information beyond sample 3.

The sampling results are graphed in Figure 3. All three curves summarize the same data. The first thing to note is that the magnitude is directly affected by the interval width (or frequency). The second is that there is some variation in magnitude depending on the starting value and the actual data. We might conclude that we are using different measuring devices to take measurements, and thus we get different results.

A key reminder about summary statistics is this: information is lost, and so error is introduced. This is natural and often acceptable. Each of the samples in the previous example carries some amount of error. Usually multiple summary statistics are used together in order to minimize the impact of the loss of information. Nonetheless, the conclusions we draw can be risky: if we use the summary data for further analysis rather than the original data, we must remember that we may be introducing more error.
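The sampling just described is, in effect, a histogram computation. The sketch below (Python, with hypothetical function names) bins event times into intervals of a chosen width and starting point, and also shows one way an average can be approximated from the counts alone, by weighting each interval's midpoint by its count; this is the approximation assumed in the discussion that follows the tables.

def sample_counts(event_times, start, width, n_intervals):
    """Count events falling in consecutive intervals of the given width;
    equivalent to sampling a counter every `width` and resetting it."""
    counts = [0] * n_intervals
    for t in event_times:
        i = int((t - start) // width)
        if 0 <= i < n_intervals:
            counts[i] += 1
    return counts

def approx_mean(counts, start, width):
    """Approximate the mean of the original data from the counts alone,
    weighting each interval's midpoint by its count."""
    total = sum(counts)
    weighted = sum(c * (start + (i + 0.5) * width) for i, c in enumerate(counts))
    return weighted / total

# events = [3.845, 3.906, 4.353, ...]            # the 100 values of Table 1
# sample1 = sample_counts(events, start=3.0, width=1.0, n_intervals=11)
# sample3 = sample_counts(events, start=3.0, width=0.5, n_intervals=23)
# approx_mean(sample1, 3.0, 1.0)                 # ~8.10 vs. the true mean of ~8.11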

Table 2: Sampling Results

Sample 1 (width 1.0, sampled on the hour):
Hour:   3.0  4.0  5.0  6.0  7.0  8.0  9.0 10.0 11.0 12.0 13.0 14.0
Count:    0    2    8   14   15    9    8   18   15    9    2    0

Sample 2 (width 1.0, sampled on the half hour):
Hour:   3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5
Count:    0    3   13   14   13    9   13   20    9    5    1    0

Sample 3 (width 0.5):
Hour:   3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5
Count:    0    0    2    1    7    6    8    6    9    4    5    4    4    9    9   11    4    5    4    1    1    0    0    0

Figure 3: Three Sampling Results for Initial Data (count per sampling interval vs. hour, for hourly sampling on the hour, hourly sampling on the half hour, and half-hourly sampling)

The average of the data in Table 1 is 8.110 (rounded). If we approximate an average from the sampling results in Table 2 (for example, by weighting each interval's midpoint by its count), we obtain 8.100, 8.130, and 8.115 for the respective samples. As expected, the higher-frequency sample is closer to the actual average (i.e., has less error). Summary data may become the basis of some future modeling effort where these errors are likely to be overlooked.

Of the 100 values in Table 1, nine were changed slightly. Table 3 shows the old and new values. The impact seems as if it would be rather insignificant.
Table 3: Modified Data

Old:  9.319  9.399  9.408  9.425  9.478  9.499   9.554  10.598  10.857
New:  9.519  9.529  9.578  9.585  9.598  9.619  10.054  10.398  10.457

However, the impact on the sampling results appears to be significant. Figure 4 illustrates the new samples. The first sample (the green line) now has a maximum of 17, while the second sample (the blue line) has a maximum of 28. The choice of sampling parameters now seems important. The actual average is now 8.118 (rounded), and the averages computed from the summary data are 8.110, 8.170, and 8.140; these averages give us no insight into the potential problem. Frequency analysis may be necessary to detect this behavior and establish an appropriate sampling interval.


Figure 4: Three Sampling Results for Modified Data (plot titled "Potential Sampling Anomaly"; count per sampling interval vs. hour for the same three sampling schemes)

Performance monitoring functions sometimes return other statistics rather than measurements or counts. Such functions are intended to be viewed in real time: rather than presenting the actual data over a time period, they attempt to convey history in a snapshot via a statistic. An example of such a function is a moving average, which averages the counts that fall within a moving interval. Moving averages come in several forms (refer to [wik, Moving Average, 2010]) and have the effect of smoothing out sharp peaks and valleys; they also lag behind the real data. By averaging counts, we are computing a statistic from statistics, so there is an increased loss of information. Figure 5 shows two examples where a count and two moving averages are graphed. The count is a counter sampled during every time period; one moving average (Avg. 2) averages the last two counts, while the other (Avg. 4) averages the last four. The effects of smoothing and lag are visible, and the smoothing effect is more pronounced in the second example because the frequency changes more quickly. UNIX's load average function returns an exponential moving average (refer to [wik, Exponential Moving Average]). Several references ([Wal01], [Gun03], [Gun05, Ch. 4], and [wik, Load (computing), 2010]) discuss how this function works as well as its interpretation.
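As a rough sketch (Python; the 2- and 4-period windows mirror the figure, while the exponential smoothing constant below is an arbitrary illustration, not the one UNIX uses), the two kinds of moving average might be computed as follows:

from collections import deque

def moving_average(counts, k):
    """Simple moving average over the last k counts (lags behind the data)."""
    window, out = deque(maxlen=k), []
    for c in counts:
        window.append(c)
        out.append(sum(window) / len(window))
    return out

def exp_moving_average(counts, alpha=0.5):
    """Exponential moving average: new = alpha * count + (1 - alpha) * old."""
    ema, out = 0.0, []
    for c in counts:
        ema = alpha * c + (1 - alpha) * ema
        out.append(ema)
    return out

counts = [2, 8, 14, 15, 9, 8, 18, 15, 9, 2]
print(moving_average(counts, 4))     # smoother, but lags the peaks and valleys
print(exp_moving_average(counts))    # history decays geometrically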

Factors Impacting Measurement

The factors that impact measurement can be divided into two groups: direct and indirect. A direct factor is unavoidable; it sets a minimal bound below which the measurement can never fall. One direct factor is the work that must necessarily be done to achieve the objective. For example, it takes a certain amount of time for a specific algorithm to find the maximum number in a specific set of numbers, and a different amount of time to sort that same set. This minimum amount of time cannot be avoided. Changing the set of numbers might change the measurement, because the measurement is sensitive to the input; this aspect can be referred to as the problem. The time is also impacted by several other direct factors that collectively deal with the solution, or implementation: the algorithm, the software language, the hardware platform, and so on. Change any of these aspects and the measurement might also change. Caching and virtual memory are dynamically changing, run-time strategies that cause variation in measurement.

Indirect factors are theoretically avoidable. In complex systems, other work is executed concurrently (though not necessarily in parallel), and that work competes for the resources that the object we are measuring needs (e.g., the CPU, memory). The interval that we are measuring may therefore include subintervals that do not need to be part of it.
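A minimal timing sketch of the direct factors (Python; the workloads and input sizes are arbitrary): the same input measured under two different solutions yields different minimum times, and changing the input changes the measurement as well.

import random
import time

def measure(work, data):
    """Elapsed wall-clock time for one call; the resolution of
    perf_counter() bounds how finely the interval can be resolved."""
    start = time.perf_counter()
    work(data)
    return time.perf_counter() - start

data = [random.random() for _ in range(100_000)]

# The "problem" (this particular input) and the "solution" (max vs. sort)
# together determine the direct, unavoidable part of the measurement.
print(measure(max, data))      # find the maximum: O(n)
print(measure(sorted, data))   # sort the same numbers: O(n log n)

# Changing the input changes the measurement as well.
print(measure(sorted, data[:10_000]))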

Figure 5: Two Examples Comparing Counts and Moving Averages (count per time period together with its 2-period and 4-period moving averages; the count in example (b) changes more quickly than in example (a))

Figure 6 provides a notional illustration in which the interval we are measuring contains a subinterval that is not associated with what we are measuring, yet still impacts the desired measurement.
Figure 6: Notional Factors That Impact Measurement (a measured interval composed of direct intervals and an indirect interval)

So, how can these indirect intervals be removed (or, conversely, accidentally introduced)? Systems have many configuration, or tunable, parameters that can be changed, and these settings dictate how concurrent work is handled. When there is contention for resources, prioritization policies dictate the ordering of work, thereby impacting the measured interval. Figure 7 reflects a sequence of disk requests. In the timeline on the left, a first-come, first-served (FCFS) policy processes each request in the order in which it arrives. We are trying to measure request 2, which arrives while request 1 is in progress; the subsequent requests are irrelevant under this policy. The measurement contains the direct interval (request 2) and an indirect interval (request 1).
Figure 7: Two Disk Request Prioritization Policies (the response time of request 2 under the FCFS policy versus under the scan policy)

In the right timeline, the scan policy is used. This policy reorders the requests according to their positions relative to the current head position. In this case, request 2 is across the disk and the subsequent requests lie in between, so the measured interval is significantly different for the same sequence of requests. Note that the other requests would have shorter measurements (i.e., this policy favors them).
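A toy model can make the difference concrete. In the sketch below (Python; the track positions, seek cost, and transfer time are all invented, and the time request 2 spends waiting for request 1 is omitted because it is the same under both policies), request 2 is served first under FCFS but last under the scan pass, so its measured response time differs substantially.

def service_times(order, positions, start_head=0, seek_per_track=0.01, transfer=1.0):
    """Completion time of each request when served in the given order.
    Seek time is proportional to head movement; transfer time is fixed."""
    clock, head, done = 0.0, start_head, {}
    for req in order:
        clock += seek_per_track * abs(positions[req] - head) + transfer
        head = positions[req]
        done[req] = clock
    return done

# Hypothetical queue after request 1 completes: request 2 is far across
# the disk, requests 3-5 lie in between (track numbers are made up).
positions = {2: 100, 3: 20, 4: 50, 5: 80}

fcfs = service_times([2, 3, 4, 5], positions)   # arrival order
scan = service_times([3, 4, 5, 2], positions)   # sweep toward request 2

print("request 2 under FCFS:", fcfs[2])   # served first: shorter response time
print("request 2 under scan:", scan[2])   # served last: much longer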

Finally, some indirect work may be required in order to measure at all. In some cases, the time and/or space needed to take numerous measurements can affect the measurements themselves. When debugging a timing problem, adding instrumentation can alter the timing and make the problem disappear, which makes it non-repeatable: by changing the overhead, we change the timing of a race condition, which in turn changes the overhead due to contention. This can be a very frustrating situation to deal with. Even a simple insertion of instrumentation (e.g., lines of code that capture counting data) can cause data to change locations in a cache and/or pages in a virtual memory system.

What to Measure

What needs to be measured is mostly dictated by the objective, but several general comments can be made. In many situations, more measurements lead to higher accuracy and to better understanding; in situations where measuring negatively impacts the measurement, fewer measurements will be available. Event data are preferable to statistics, but collecting them may not be feasible because of the potential volume. Consider a simple system where requests are added to a queue as they arrive and queued requests are removed in order to be processed. We are interested in understanding the queue-length behavior. Figure 8 illustrates an instance of this system in use.
Figure 8: Queue Length Example (plot titled "Counting Events vs. Tracking State"; per-interval additions, removals, current queue length, and maximum queue length over time)

If we sample only the current queue length (the blue line), we will not learn very much: too much happens between samples, so a higher sampling rate would be necessary to rely on this information alone. Instrumentation can be added to the system to track the maximum length that the queue reached during each interval (the purple dashed line); this provides more insight and answers our initial question. Additional instrumentation could count the addition events (the green line) and removal events (the red line, shown as negative values for clarity) during each interval, giving even more insight into the actions affecting the queue. Ultimately, each addition and removal event could be logged and examined at a convenient time. Each of these options has a cost, which may or may not be affordable.

Performance is about resource (e.g., CPU, memory, I/O) usage, so understanding what the resources of interest are is the first step. Counters are usually involved in keeping track of resource requests. Times are desired to measure how long resources are held, how long requests wait, and how much time passes between requests. When these aspects are measured and studied, the performance of the system can be efficiently managed. This entire section is really about performance evaluation, a complex topic on its own; the previous discussion typifies the plethora of possibilities that can exist. Evaluation requires measurement in order to achieve its objectives.
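A sketch of the instrumentation options discussed for the queue example (Python; the class and its interface are hypothetical): the queue always knows its current length, and a few extra counters per interval capture the maximum length and the addition/removal counts that sampling the current length alone would miss.

class InstrumentedQueue:
    """FIFO queue that tracks per-interval maximum length and event counts."""

    def __init__(self):
        self._items = []
        self.max_len = 0       # maximum length seen this interval
        self.additions = 0     # add events this interval
        self.removals = 0      # remove events this interval

    def add(self, item):
        self._items.append(item)
        self.additions += 1
        self.max_len = max(self.max_len, len(self._items))

    def remove(self):
        self.removals += 1
        return self._items.pop(0)

    def sample(self):
        """Called once per sampling interval; returns and resets the statistics."""
        snapshot = (len(self._items), self.max_len, self.additions, self.removals)
        self.max_len, self.additions, self.removals = len(self._items), 0, 0
        return snapshot

q = InstrumentedQueue()
for job in ("a", "b", "c"):
    q.add(job)
q.remove()
print(q.sample())   # (current=2, max=3, additions=3, removals=1)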

Conclusion

Measurement is a foundational concept of performance evaluation. When taking measurements, we are concerned with three forms of accuracy: specifying what is to be measured, the expression of the measurement given the resolution of the measuring device, and the resolution of the storage representation. Some measurements are really statistics that count events during an interval; when dealing with statistics, it is important to remember that some information is lost. Measurement is directly affected by the problem (i.e., the work being performed) and the solution (i.e., the implementation of the system), and indirectly affected by contending work and the system's configuration settings. Knowledge of these aspects helps us understand why measurements can vary.

References
[Gun03] Neil Gunther. UNIX Load Average Series. http://www.teamquest.com/resources/gunther/display/4/index.htm, 2003.

[Gun05] Neil Gunther. Analyzing Computer System Performance with Perl::PDQ. Springer, 1st edition, 2005.

[Gun06] Neil Gunther. Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services. Springer, Berlin, 1st edition, 2006.

[Low10] Steve Lower. The Measure of Matter: All About Units, Measurements, and Error. http://www.chem1.com/acad/webtext/matmeasure, 2010.

[Wal01] Ray Walker. Examining Load Average. http://www.linuxjournal.com/article/9001, 2001.

[wik] Wikipedia. http://en.wikipedia.org. Specific page and last date referenced are noted in each citation.
*Caveat lector: Because of the number of contributors to Wikipedia and the ease with which bad information can be introduced, readers should always validate what they read there against other established sources. The references are given because of their easy accessibility.