2.1 - Introduction
The Quantitative Methods section of the CFA curriculum has traditionally been placed second in the sequence
of study topics, following the Ethics and Professional Standards review. It's an interesting progression: the
ethics discussions and case studies will invariably make most candidates feel very positive, very high-minded
about the road on which they have embarked. Then, without warning, they are smacked with a smorgasbord of
formulas, graphs, Greek letters, and challenging terminology. We know – it's easy to become overwhelmed. At
the same time, the topics covered in this section – time value of money, performance measurement, statistics
and probability basics, sampling and hypothesis testing, correlation and linear regression analysis – provide the
candidate with a variety of highly essential analytical tools and are a crucial prerequisite for the subsequent
material on fixed income, equities, and portfolio management. In short, mastering the material in this section
will make the CFA's entire Body of Knowledge that
much easier to handle.
5. Maturity Premium - All else being equal, the longer a bond's time to maturity, the more sensitive its price is to interest rate fluctuations.
2.4 - Time Value Of Money Calculations
Here we will discuss the effective annual rate, time value of money problems, PV of a perpetuity, an ordinary
annuity, annuity due, a single cash flow and a series of uneven cash flows. For each, you should know how to
both interpret the problem and solve the problems on your approved calculator. These concepts cover LOS 5.b and 5.c.
The Effective Annual Rate
CFA Institute's LOS 5.b is explained within this section. We'll start by defining the terms, and then presenting
the formula.
The effective annual rate (EAR), also called the effective annual yield, represents the actual rate of return, reflecting all of the compounding periods during the year. The EAR can be computed given the stated rate and the frequency of compounding. We'll discuss how to make this computation next.
Formula 2.1
Effective annual rate (EAR) = (1 + Periodic interest rate)^m – 1
Keep in mind that the effective annual rate will always be higher than the stated rate if there is more than one
compounding period (m > 1 in our formula), and the more frequent the compounding, the higher the EAR.
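The exam expects this computation on an approved calculator, but the logic is easy to verify. The Python sketch below (purely for illustration) applies Formula 2.1, taking the periodic rate as the stated annual rate divided by the number of compounding periods m:

```python
def effective_annual_rate(stated_rate, m):
    """EAR = (1 + periodic rate)^m - 1, where the periodic rate
    is the stated annual rate divided by m compounding periods."""
    periodic = stated_rate / m
    return (1 + periodic) ** m - 1

# A 6% stated rate compounded monthly:
print(round(effective_annual_rate(0.06, 12), 6))  # 0.061678
```

With monthly compounding, a 6% stated rate produces an EAR of about 6.17%, confirming that more frequent compounding raises the EAR above the stated rate.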
On PV and FV problems, switching the time units - either by calling for quarterly or
monthly compounding or by expressing time in months and the interest rate in years - is
an often-used tactic to trip up test takers who are trying to go too fast. Remember to
make sure the units agree for r and N, and are consistent with the frequency of
compounding, prior to solving.
The formula for the PV of a perpetuity is derived from the PV of an ordinary annuity, which at N = infinity, and
assuming interest rates are positive, simplifies to:
Formula 2.2
PV of a perpetuity = A/r, where A is the periodic payment and r the interest rate per period
Formula 2.3
(1) FV = PV * (1 + r)^N
(2) PV = FV * 1/(1 + r)^N
For example, $10,000 invested for five years at 8%, compounded annually, grows to:
FV = ($10,000)*(1.08)^5 = $14,693.28
Conversely, at an interest rate of 8%, we calculate today's value that will grow to $10,000 in five years:
PV = ($10,000)*(0.680583) = $6,805.83
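The two directions of Formula 2.3 translate directly into code. This sketch (for illustration only; the figures match the examples above) compounds forward and discounts back:

```python
def fv_single(pv, r, n):
    """Formula 2.3 (1): compound a present value forward n periods."""
    return pv * (1 + r) ** n

def pv_single(fv, r, n):
    """Formula 2.3 (2): discount a future value back n periods."""
    return fv / (1 + r) ** n

print(round(fv_single(10_000, 0.08, 5), 2))    # 14693.28
print(round(pv_single(10_000, 0.08, 5), 2))    # 6805.83
print(round(pv_single(1_000_000, 0.10, 20)))   # 148644
```

The last line previews the kind of problem that follows: the value today of $1,000,000 in 20 years at 10% compounded annually.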
A. $100,000
B. $117,459
C. $148,644
D. $161,506
Answer:
The problem asks for a value today (PV). It provides the future sum of money (FV) = $1,000,000; an interest rate (r) = 10% or 0.1; yearly time periods (N) = 20; and it indicates annual compounding. Using the PV formula listed above, we get PV = ($1,000,000)*(1/(1.10)^20) = ($1,000,000)*(0.148644) = $148,644, or choice C.
Using a calculator with financial functions can save time when solving PV and FV problems. At the same time,
the CFA exam is written so that financial calculators aren't required. Typical PV and FV problems will test the
ability to recognize and apply concepts and avoid tricks, not the ability to use a financial calculator. The
experience gained by working through more examples and problems increases your efficiency much more than a
calculator.
FV and PV of an Ordinary Annuity and an Annuity Due
To solve annuity problems, you must know the formulas for the future value annuity factor and the present
value annuity factor.
Formula 2.4
FV annuity factor = ((1 + r)^N – 1)/r
Formula 2.5
PV annuity factor = (1 – 1/(1 + r)^N)/r
FV Annuity Factor
The FV annuity factor formula gives the future total dollar amount of a series of $1 payments, but in problems
there will likely be a periodic cash flow amount given (sometimes called the annuity amount and denoted by A).
Simply multiply A by the FV annuity factor to find the future value of the annuity. Likewise for PV of an
annuity: the formula listed above shows today's value of a series of $1 payments to be received in the future. To
calculate the PV of an annuity, multiply the annuity amount A by the present value annuity factor.
The FV and PV annuity factor formulas work with an ordinary annuity, one that assumes the first cash flow is
one period from now, or t = 1 if drawing a timeline. The annuity due is distinguished by a first cash flow
starting immediately, or t = 0 on a timeline. Since the annuity due is basically an ordinary annuity plus a lump
sum (today's cash flow), and since it can be fit to the definition of an ordinary annuity starting one year ago, we
can use the ordinary annuity formulas as long as we keep track of the timing of cash flows. The guiding
principle: make sure, before using the formula, that the annuity fits the definition of an ordinary annuity with
the first cash flow one period away.
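As a quick check on these definitions, here is a Python sketch of the two annuity factors, together with the annuity-due adjustment (multiplying the ordinary-annuity result by (1 + r) shifts every cash flow one period earlier):

```python
def fv_annuity_factor(r, n):
    """Future value, at t = n, of n payments of $1 (ordinary annuity)."""
    return ((1 + r) ** n - 1) / r

def pv_annuity_factor(r, n):
    """Present value, at t = 0, of n payments of $1 (ordinary annuity)."""
    return (1 - (1 + r) ** -n) / r

print(round(fv_annuity_factor(0.08, 5), 4))  # 5.8666
print(round(pv_annuity_factor(0.08, 5), 4))  # 3.9927
# An annuity due receives each payment one period earlier, so multiply by (1 + r):
print(round(fv_annuity_factor(0.08, 5) * 1.08, 4))
```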
A. $109,000
B. $143,200
C. $151,900
D. $165,600
Answer:
The problem gives the annuity amount A = $10,000, the interest rate r = 0.09, and time periods N = 10. Time
units are all annual (compounded annually) so there is no need to convert the units on either r or N. However,
the starting today introduces a wrinkle. The annuity being described is an annuity due, not an ordinary annuity,
so to use the FV annuity factor, we will need to change our perspective to fit the definition of an ordinary
annuity.
Drawing a timeline should help visualize what needs to be done:
Multiplying this amount by the annuity amount of $10,000, we have the future value at time period 9: FV = ($10,000)*(15.19293) = $151,929. To finish the problem, we need the value at t = 10. To calculate, we use the future value of a lump sum, FV = PV*(1 + r)^N, with N = 1, PV = the annuity value after 9 periods, and r = 0.09: FV = ($151,929)*(1.09) = $165,603, or choice D ($165,600, rounded).
Notice that choice "C" in the problem ($151,900) agrees with the preliminary result of the value of the annuity
at t = 9. It's also the result if we were to forget the distinction between ordinary annuity and annuity due, and go
forth and solve the problem with the ordinary annuity formula and the given parameters. On the CFA exam,
problems like this one will get plenty of takers for choice "C" – mostly the people trying to go too fast!!
It helps to set up this problem as if it were on a spreadsheet, to keep track of the cash flows and to make sure
that the proper inputs are used to either discount or compound each cash flow. For example, assume that we are
to receive a sequence of uneven cash flows from an annuity and we're asked for the present value of the annuity
at a discount rate of 8%. Scratch out a table similar to the one below, with periods in the first column, cash
flows in the second, formulas in the third column and computations in the fourth.
Suppose we are required to find the future value of this same sequence of cash flows after period 5. Here's the
same approach using a table with future value formulas rather than present value, as in the table above:
Check the present value of $9,122.86, discounted at the 8% rate for five years:
PV = ($9,122.86)/(1.08)^5 = $6,208.86. In other words, the principle of equivalence applies even in examples
where the cash flows are unequal.
2.5 - Time Value Of Money Applications
I. MORTGAGES
Most of the problems from the time value material are likely to ask for either PV or FV and will provide the
other variables. However, on a test with hundreds of problems, the CFA exam will look for unique and creative
methods to test command of the material. A problem might provide both FV and PV and then ask you to solve
for an unknown variable, either the interest rate (r), the number of periods (N) or the amount of the annuity (A).
In most of these cases, a quick use of freshman-level algebra is
all that's required. We'll cover two real-world applications –
each was the subject of an example in the resource textbook, so
either one may have a reasonable chance of ending up on an
exam problem.
Formula 2.6
Growth rate (g) = (FV/PV)^(1/N) – 1
For example, if a company's earnings were $100 million five years ago, and are $200 million today, the
annualized five-year growth rate could be found by:
growth rate (g) = (FV/PV)^(1/N) – 1 = (200,000,000/100,000,000)^(1/5) – 1 = (2)^(1/5) – 1 = (1.1486984) – 1 = 14.87%
Since PV of an annuity = (annuity payment)*(PV annuity factor), we solve for annuity payment (A), which will
be the monthly payment:
Formula 2.7
Annuity payment (A) = PV of the annuity/PV annuity factor
With a loan of $250,000, the monthly payment in this example would be $250,000/166.7916, or $1,498.88 a
month.
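For practice, both applications can be verified in Python. In the mortgage example, the factor 166.7916 is consistent with a 30-year loan of monthly payments at a 6% stated annual rate (r = 0.005, N = 360); that rate and term are inferred from the factor rather than stated above:

```python
def implied_growth_rate(pv, fv, n):
    """Formula 2.6: solve FV = PV*(1 + g)^N for g."""
    return (fv / pv) ** (1 / n) - 1

def mortgage_payment(principal, annual_rate, years):
    """Formula 2.7: payment = loan amount / PV annuity factor at the monthly rate."""
    r = annual_rate / 12            # monthly periodic rate
    n = years * 12                  # number of monthly payments
    factor = (1 - (1 + r) ** -n) / r
    return principal / factor

print(round(implied_growth_rate(100e6, 200e6, 5), 4))   # 0.1487
print(round(mortgage_payment(250_000, 0.06, 30), 2))    # 1498.88
```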
Exam Tips and Tricks
Higher-level math functions usually don't end up on the test, partly because they give an
unfair advantage to those with higher-function calculators and because questions must be
solved in an average of one to two minutes each at Level I. Don't get bogged down with
understanding natural logs or transcendental numbers.
Answer:
To organize and summarize this information, we will need her three cash inflows to be the equivalent of her one
cash outflow.
All amounts are given to calculate inflows 1 and 2 and the outflow. The third inflow has an unknown annuity
amount that will need to be determined using the other amounts. We start by drawing a timeline and specifying
that all amounts be indexed at t = 30, or her retirement day.
Next, calculate the three amounts for which we have all the necessary information, and index to t = 30.
Since the three cash inflows = cash outflow, we have ($100,627) + ($337,606) + X = $800,608, or X =
$362,375 at t = 30. In other words, the money she saves from years 11 through 30 will need to be equal to
$362,375 in order for her to meet retirement goals.
We find that by increasing the annual savings from $5,000 to $7,919 starting in year 11 and continuing to year
30, she will be successful in accumulating enough income for retirement.
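The retirement example can be reproduced step by step. The figures above are consistent with an 8% annual rate (e.g. $10,000 × 1.08^30 = $100,627), so the sketch below assumes r = 8%; that rate is inferred from the numbers rather than restated here:

```python
r = 0.08   # rate inferred from the figures (e.g. 10,000 * 1.08**30 = 100,627)

def fvaf(rate, n):
    """FV annuity factor: ((1 + r)^N - 1)/r."""
    return ((1 + rate) ** n - 1) / rate

inflow1 = 10_000 * (1 + r) ** 30                # today's $10,000, grown 30 years
inflow2 = 5_000 * fvaf(r, 10) * (1 + r) ** 20   # years 1-10 savings, indexed to t=30
required = 800_608                              # retirement outflow at t = 30
gap = required - inflow1 - inflow2              # what years 11-30 must supply
annual_saving = gap / fvaf(r, 20)               # annuity amount for years 11-30

print(round(inflow1), round(inflow2), round(gap), round(annual_saving))
# 100627 337606 362375 7919
```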
How are Present Values, Future Value and Cash Flows connected?
The cash flow additivity principle allows us to add amounts of money together, provided they are indexed to the
same period. The last example on retirement savings illustrates cash flow additivity: we were planning to
accumulate a sum of money from three separate sources and we needed to determine what the total amount
would be so that the accumulated sum could be compared with the client's retirement cash outflow requirement.
Our example involved uneven cash flows from two separate annuity streams and one single lump sum that has
already accumulated. Comparing these inputs requires each amount to be indexed first, prior to adding them
together. In the last example, the annuity we were planning to accumulate in years 11 to 30 was projected to
reach $362,375 by year 30. The current savings initiative of $5,000 a year projects to $72,433 by year 10. Right
now, time 0, we have $10,000. In other words, we have three amounts at three different points in time.
Example:
To illustrate, assume we are asked to use the NPV approach to choose between two projects, and our company's
weighted average cost of capital (WACC) is 8%. Project A costs $7 million in upfront costs, and will generate
$3 million in annual income starting three years from now and continuing for a five-year period (i.e. years 3 to
7). Project B costs $2.5 million upfront and $2 million in each of the next three years (years 1 to 3). It generates
no annual income but will be sold six years from now for a sales price of $16 million.
Project A: The present value of the outflows is equal to the current cost of $7 million. The inflows can be
viewed as an annuity with the first payment in three years, or an ordinary annuity at t = 2 since ordinary
annuities always start the first cash flow one period away.
Multiplying by the annuity payment of $3 million, the value of the inflows at t = 2 is ($3 million)*(3.99271) =
$11.978 million.
Project B: The inflow is the present value of a lump sum, the sales price in six years discounted to the present:
$16 million/(1.08)^6 = $10.083 million.
Cash outflow is the sum of the upfront cost and the discounted costs from years 1 to 3. We first solve for the
costs in years 1 to 3, which fit the definition of an annuity.
Problems on the CFA exam are frequently set up so that it is tempting to pick a choice
that seems intuitively better (i.e. by people who are guessing), but this is wrong by NPV
rules. In the case we used, Project B had lower costs upfront ($2.5 million versus $7
million) with a payoff of $16 million, which is more than the combined $15 million
payoff of Project A. Don't rely on what feels better; use the process to make the
decision!
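Here is a Python sketch of the two NPV calculations (amounts in dollars, WACC = 8%); it confirms that Project A's NPV exceeds Project B's despite B's intuitive appeal:

```python
def pvaf(r, n):
    """Present value annuity factor for n $1 payments (ordinary annuity)."""
    return (1 - (1 + r) ** -n) / r

wacc = 0.08

# Project A: $7M today buys a 5-payment, $3M annuity whose first payment is at
# t = 3, i.e. an ordinary annuity valued at t = 2, then discounted two periods.
inflows_a = 3e6 * pvaf(wacc, 5) / (1 + wacc) ** 2
npv_a = inflows_a - 7e6            # about $3.27 million

# Project B: $2.5M today plus a 3-year, $2M cost annuity; one $16M inflow at t = 6.
outflows_b = 2.5e6 + 2e6 * pvaf(wacc, 3)
npv_b = 16e6 / (1 + wacc) ** 6 - outflows_b   # about $2.43 million

print(npv_a > npv_b)  # True: the NPV rule picks Project A
```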
The Internal Rate of Return
The IRR, or internal rate of return, is defined as the discount rate that makes NPV = 0. Like the NPV process, it
starts by identifying all cash inflows and outflows. However, instead of relying on external data (i.e. a discount
rate), the IRR is purely a function of the inflows and outflows of that project. The IRR rule states that projects
or investments are accepted when the project's IRR exceeds a hurdle rate. Depending on the application, the
hurdle rate may be defined as the weighted average cost of capital.
Example:
Suppose that a project costs $10 million today and will provide a $15 million payoff three years from now. We use the FV of a single-sum formula and solve for r to compute the IRR.
IRR = (FV/PV)^(1/N) – 1 = (15 million/10 million)^(1/3) – 1 = (1.5)^(1/3) – 1 = (1.1447) – 1 = 0.1447, or 14.47%
In this case, as long as our hurdle rate is less than 14.47%, we green light the project.
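For a single-sum project the IRR can be solved algebraically, and for a general series it must be found numerically. Both approaches are sketched below (the bisection routine is a simple illustration, not an exam requirement):

```python
def irr_single_sum(pv, fv, n):
    """One outflow today, one inflow at period n: IRR = (FV/PV)^(1/N) - 1."""
    return (fv / pv) ** (1 / n) - 1

def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-9):
    """General IRR by bisection; assumes NPV is decreasing in r
    (true when the only outflow is the initial one)."""
    def npv(r):
        return sum(cf / (1 + r) ** t for t, cf in enumerate(cash_flows))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(irr_single_sum(10e6, 15e6, 3), 4))   # 0.1447
print(round(irr([-10e6, 0, 0, 15e6]), 4))        # 0.1447
```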
However, it requires an assumed discount rate, and also assumes that this percentage rate will be stable over the
life of the project, and that cash inflows can be reinvested at the same discount rate. In the real world, those
assumptions can break down, particularly in periods when interest rates are fluctuating. The appeal of the IRR
rule is that a discount rate need not be assumed, as the worthiness of the investment is purely a function of the
internal inflows and outflows of that particular investment. However, IRR does not assess the financial impact
on a firm; it only requires meeting a minimum return rate.
The NPV and IRR methods can rank two projects differently, depending on the size of the investment. Consider
the case presented below, with a discount rate of 6%:
By the NPV rule we choose Project A, and by the IRR rule we prefer B. How do we resolve the conflict if we
must choose one or the other? The convention is to use the NPV rule when the two methods are inconsistent, as
it better reflects our primary goal: to grow the financial wealth of the company.
Outflows
1. The cost of any investment purchased
2. Reinvested dividends or interest
3. Withdrawals
Inflows
1. The proceeds from any investment sold
2. Dividends or interest received
3. Contributions
Example:
Each inflow or outflow must be discounted back to the present using a rate (r) that will make PV (inflows) = PV
(outflows). For example, take a case where we buy one share of a stock for $50 that pays an annual $2 dividend,
and sell it after two years for $65. Our money-weighted rate of return will be a rate that satisfies the following
equation:
PV of outflows = PV of inflows: $50 = $2/(1 + r) + ($65 + $2)/(1 + r)^2
Solving for r using a spreadsheet or financial calculator, we have a money-weighted rate of return = 17.78%.
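Without a financial calculator, the same rate can be found numerically. The sketch below searches for the r that sets the NPV of the flows (–$50 at t = 0, +$2 at t = 1, +$67 at t = 2, i.e. the $65 sale plus the final $2 dividend) to zero:

```python
def npv(r, cash_flows):
    """NPV of flows indexed from t = 0."""
    return sum(cf / (1 + r) ** t for t, cf in enumerate(cash_flows))

flows = [-50, 2, 67]   # buy at $50; $2 dividend at t=1; $65 sale + $2 dividend at t=2

# npv is decreasing in r for these flows, so bisection finds the root.
lo, hi = -0.99, 10.0
while hi - lo > 1e-9:
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if npv(mid, flows) > 0 else (lo, mid)

print(round((lo + hi) / 2, 4))  # 0.1778
```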
Note that the exam will test knowledge of the concept of money-weighted return, but any computations should not require the use of a financial calculator.
It's important to understand the main limitation of the money-weighted return as a tool for evaluating managers.
As defined earlier, the money-weighted rate of return factors all cash flows, including contributions and
withdrawals. Assuming a money-weighted return is calculated over many periods, the formula will tend to place
a greater weight on the performance in periods when the account size is highest (hence the label money-
weighted).
In practice, if a manager's best years occur when an account is small, and then (after the client deposits more
funds) market conditions become more unfavorable, the money-weighted measure doesn't treat the manager
fairly. Here it is put another way: say the account has annual withdrawals to provide a retiree with income, and
the manager does relatively poorly in the early years (when the account is larger), but improves in later periods
after distributions have reduced the account's size. Should the manager be penalized for something beyond his
or her control? Deposits and withdrawals are usually outside of a manager's control; thus, a better performance
measurement tool is needed to judge a manager more fairly and allow for comparisons with peers – a
measurement tool that will isolate the investment actions, and not penalize for deposit/withdrawal activity.
Formula 2.8
Holding-period return: HPR = (MV1 – MV0 + D1 – CF1)/MV0, where MV0 and MV1 are beginning and ending market values, D1 is any distribution and CF1 is any net contribution during the period
For time-weighted performance measurement, the total period to be measured is broken into many sub-periods,
with a sub-period ending (and portfolio priced) on any day with significant contribution or withdrawal activity,
or at the end of the month or quarter. Sub-periods can cover any length of time chosen by the manager and need
not be uniform. A holding-period return is computed using the above formula for all sub-periods. Linking (or
compounding) HPRs is done by
(a) adding 1 to each sub-period HPR, then
(b) multiplying all 1 + HPR terms together, then
(c) subtracting 1 from the product:
The annualized rate of return takes the compounded time-weighted rate and standardizes it by computing a
geometric average of the linked holding-period returns.
Formula 2.9
Annualized return = ((1 + HPR1)*(1 + HPR2)* … *(1 + HPRn))^(1/Y) – 1, where Y is the number of years in the measurement period
Answer:
For this example, the year is broken into four holding-period returns to be calculated for each quarter. Also,
since a significant contribution of $20,000 was received intra-period, we will need to calculate two holding-
period returns for the third quarter, June 30, 2004, to July 30, 2004, and July 30, 2004, to Sept 30, 2004. In total,
there are five HPRs that must be computed using the formula HPR = (MV1 – MV0 + D1 – CF1)/MV0. Note that
since D1, or dividend payments, are already factored into the ending-period value, this term will not be needed
for the computation. On a test problem, if dividends or interest is shown separately, simply add it to ending-
period value. The calculations are done below (dollar amounts in thousands):
Now we link the five periods together, by adding 1 to each HPR, multiplying all terms, and subtracting 1 from
the product, to find the compounded time-weighted rate of return:
Annualizing: Because our compounded calculation was for one year, the annualized figure is the same,
+14.90%. If the same portfolio had a 2003 return of 20%, the two-year compounded number would be ((1 +
0.20)*(1 + 0.1490)) – 1, or 37.88%. Annualize by adding 1, taking the result to the 1/Y power, and then
subtracting 1: (1 + 0.3788)^(1/2) – 1 = 17.42%.
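The linking and annualizing steps reduce to a few lines of code. This sketch repeats the two-year example (20% in 2003, 14.90% in 2004):

```python
def linked_return(hprs):
    """Chain-link sub-period holding-period returns into one compound return."""
    total = 1.0
    for hpr in hprs:
        total *= 1 + hpr
    return total - 1

def annualized(compound_return, years):
    """Geometric average: take (1 + return) to the 1/Y power, subtract 1."""
    return (1 + compound_return) ** (1 / years) - 1

two_year = linked_return([0.20, 0.1490])
print(round(two_year, 4))                 # 0.3788
print(round(annualized(two_year, 2), 4))  # 0.1742
```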
Note: The annualized number is the same as a geometric average, a concept covered in the statistics section.
Answer:
For money-weighted returns covering a single period, we know PV (inflows) – PV (outflows) = 0. If we pay
$100 for a stock today, sell it one year later for $105, and collect a $2 dividend, the money-weighted return
(IRR) satisfies ($105 + $2)/(1 + r) – $100 = $0, so r = ($105 + $2)/$100 – 1, or 7%.
Money-weighted return = time-weighted return for a single period where the cash flow is received at the end. If
the period is any time frame other than one year, take (1 + the result) to the 1/Y power and subtract 1 to find the
annualized return.
2.8 - Calculating Yield
Formula 2.10
Bank discount yield: rBD = (D/F)*(360/t), where D is the dollar discount from face value, F is the face value and t is days to maturity
By bank convention, years are 360 days long, not 365. If you recall the joke about banker's hours being shorter
than regular business hours, you should remember that banker's years are also shorter.
For example, if a T-bill has a face value of $50,000, a current market price of $49,700 and a maturity in 100
days, we have:
rBD = (D/F)*(360/t) = (($50,000 – $49,700)/$50,000)*(360/100) = (300/50,000)*3.6 = 2.16%
On the exam, you may be asked to compute the market price, given a quoted yield, which can be accomplished
by using the same formula and solving for D:
Formula 2.11
D = rBD*F*(t/360)
Example:
Using the previous example, if we have a bank discount yield of 2.16%, a face value of $50,000 and days to
maturity of 100, then we calculate D as follows:
D = (0.0216)*(50000)*(100/360) = 300
Formula 2.12
Holding period yield: HPY = (P1 – P0 + D1)/P0, where P0 is the purchase price, P1 the price at maturity and D1 any cash distribution. For our T-bill, HPY = $300/$49,700 = 0.6036%.
Formula 2.13
EAY = (1 + HPY)^(365/t) – 1
Example:
Continuing with our example T-bill, we have:
EAY = (1 + HPY)^(365/t) – 1 = (1 + 0.006036)^(365/100) – 1 = 2.22 percent.
Remember that EAY > bank discount yield, for three reasons: (a) yield is based on purchase price, not face
value, (b) it is annualized with compound interest (interest on interest), not simple interest, and (c) it is based on
a 365-day year rather than 360 days. Be prepared to compare these two measures of yield and use these three
reasons to explain why EAY is preferable.
The third measure of yield is the money market yield, also known as the CD equivalent yield, and is denoted by
rMM. This yield measure can be calculated in two ways:
1. When the HPY is given, rMM is the annualized yield based on a 360-day year:
Formula 2.14
rMM = (HPY)*(360/t)
For our example, we computed HPY = 0.6036%, thus the money market yield is: rMM = (0.006036)*(360/100) = 2.173%.
2. When bond price is unknown, bank discount yield can be used to compute the money market yield, using this
expression:
Formula 2.15
rMM = (360*rBD)/(360 – (t*rBD)) = (360*0.0216)/(360 – (100*0.0216)) = 2.173%, which is identical to the result
at which we arrived using HPY.
Interpreting Yield
This involves essentially nothing more than algebra: solve for the unknown and plug in the known quantities.
You must be able to use these formulas to find yields expressed one way when the provided yield number is
expressed another way.
Since HPY is common to the two others (EAY and MM yield), know how to solve for HPY to answer a
question.
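All four yield measures for the example T-bill flow from the same dollar discount, which makes the conversions easy to practice. A Python sketch:

```python
t, face, price = 100, 50_000, 49_700
d = face - price                      # dollar discount, $300

r_bd = (d / face) * (360 / t)         # bank discount yield: 2.16%
hpy = d / price                       # holding period yield: ~0.6036%
eay = (1 + hpy) ** (365 / t) - 1      # effective annual yield: ~2.22%
r_mm = hpy * (360 / t)                # money market yield: ~2.17%

# Converting straight from the bank discount yield gives the same money market yield:
r_mm_from_bd = (360 * r_bd) / (360 - t * r_bd)
print(round(r_bd, 4), round(hpy, 6), round(eay, 4), round(r_mm, 4))
```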
Basics
• Descriptive Statistics - Descriptive statistics are tools used to summarize and consolidate large masses
of numbers and data so that analysts can get their hands around it, understand it and use it. The learning
outcomes in this section of the guide (i.e. the statistics section) are focused on descriptive statistics.
• Inferential Statistics - Inferential statistics are tools used to draw larger generalizations from observing
a smaller portion of data. In basic terms, descriptive statistics intend to describe. Inferential statistics
intend to draw inferences, the process of inferring. We will use inferential statistics in section D.
Probability Concepts, later in this chapter.
Measurement Scales
Data is measured and assigned to specific points based on a chosen scale. A measurement scale can fall into one
of four categories:
1. Nominal - This is the weakest level as the only purpose is to categorize data but not rank it in any way.
For example, in a database of mutual funds, we can use a nominal scale for assigning a number to
identify fund style (e.g. 1 for large-cap value, 2 for large-cap growth, 3 for foreign blend, etc.). Nominal
scales don't lend themselves to descriptive tools – in the mutual fund example, we would not report the
average fund style as 5.6 with a standard deviation of 3.2. Such descriptions are meaningless for
nominal scales.
2. Ordinal - This category is considered stronger than nominal as the data is categorized according to some
rank that helps describe rankings or differences between the data. Examples of ordinal scales include the
mutual fund star rankings (Morningstar 1 through 5 stars), or assigning a fund a rating between 1 and 10
based on its five-year performance and its place within its category (e.g. 1 for the top 10%, 2 for funds
between 10% and 20% and so forth). An ordinal scale doesn't always fully describe relative differences
– in the example of ranking 1 to 10 by performance, there may be a wide performance gap between 1
and 2, but virtually nothing between 6, 7, and 8.
3. Interval - This is a step stronger than the ordinal scale, as the intervals between data points are equal,
and data can be added and subtracted together. Temperature is measured on interval scales (Celsius and
Fahrenheit), as the difference in temperature between 25 and 30 is the same as the difference between 85
and 90. However, interval scales have no zero point – zero degrees Celsius doesn't indicate no
temperature; it's simply the point at which water freezes. Without a zero point, ratios are meaningless –
for example, nine degrees is not three times as hot as three degrees.
4. Ratio - This category represents the strongest level of measurement, with all the features of interval
scales plus the zero point, giving meaning to ratios on the scale. Most measurement scales used by
financial analysts are ratios, including time (e.g. days-to-maturity for bonds), money (e.g. earnings per
share for a set of companies) and rates of return expressed as a percentage.
Frequency Distribution
A frequency distribution seeks to describe a large data set by sorting the observations into a set of intervals and tallying how many fall into each interval.
Frequency distribution is one of the simplest methods employed to describe populations of data and can be used
for all four measurement scales – indeed, it is often the best and only way to describe data measured on a
nominal, ordinal or interval scale. Frequency distributions are sometimes used for equity index returns over a
long history – e.g. the S&P 500 annual or quarterly returns
grouped into a series of return intervals.
2.10 - Basic Statistical Calculations
Holding Period Return
The holding period return formula was introduced previously when discussing time-weighted return measurement. The same formula applies when applied to frequency distributions (descriptions changed slightly):
Formula 2.16
Rt = (Pt – Pt–1 + Dt)/Pt–1
There are 40 observations in this distribution (last 10 years, four quarters per year), and the relative frequency is
found by dividing the number in the second column by 40. The cumulative absolute frequency (fourth column)
is constructed by adding the frequency of all observations at or below that point. So for the fifth interval, +5%
to +10%, we find the cumulative absolute frequency by adding the absolute frequency in the fifth interval and
all previous intervals: 2+1+5+17+10=35. The last column, cumulative relative frequency, takes the number in
the fourth column and divides by 40, the total number of observations.
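The frequency columns can be rebuilt in a few lines. Note that only the first five interval counts (2, 1, 5, 17, 10) are stated above; the final two counts below are hypothetical, chosen only so the total reaches 40:

```python
# Absolute frequencies per return interval. The first five counts (2, 1, 5, 17, 10)
# come from the example above; the last two (4, 1) are hypothetical fillers so the
# counts total the 40 observations.
absolute = [2, 1, 5, 17, 10, 4, 1]
n = sum(absolute)                              # 40 observations

relative = [f / n for f in absolute]           # relative frequency
cumulative_abs = []
running = 0
for f in absolute:
    running += f                               # all observations at or below this point
    cumulative_abs.append(running)
cumulative_rel = [c / n for c in cumulative_abs]

print(cumulative_abs[4])                        # 35 = 2+1+5+17+10
print(round(cumulative_rel[4], 3))              # 0.875
```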
A return polygon presents a line chart rather than a bar chart. Here is the data from the frequency distribution
presented with a return polygon:
Central Tendency
The term "measures of central tendency" refers to the various methods used to describe where large groups of
data are centered in a population or a sample. Here it is stated another way: if we were to pull one value or
observation from a population or sample, what would we typically expect the value to be? Various methods are
used to calculate central tendency. The most frequently used is the arithmetic mean, or the sum of observations
divided by the number of observations.
Suppose we have the following 20 quarterly return observations:
-1.5%, -2.5%, +5.6%, +10.7%, +0.8%, -7.7%, -10.1%, +2.2%, +12.0%, +10.9%, -2.6%, +0.2%, -1.9%, -6.2%, +17.1%, +4.8%, +9.1%, +3.0%, -0.2%, +1.8%
We find the arithmetic mean by adding the 20 observations together, then dividing by 20:
((-1.5%) + (-2.5%) + 5.6% + 10.7% + 0.8% + (-7.7%) + (-10.1%) + 2.2% + 12.0% + 10.9% + (-2.6%) + 0.2% +
(-1.9%) + (-6.2%) + 17.1% + 4.8% + 9.1% + 3.0% + (-0.2%) + 1.8%) = 45.5%
Arithmetic mean = 45.5%/20 = 2.275%
The arithmetic mean formula is used to compute population mean (often denoted by the Greek symbol μ),
which is the arithmetic mean of the entire population. The population mean is an example of a parameter, and
by definition it must be unique. That is, a given population can have only one mean. The sample mean (denoted
by X or X-bar) is the arithmetic mean value of a sample. It is an example of a sample statistic, and will be
unique to a particular sample. In other words, five samples drawn from the same population may produce five
different sample means.
While the arithmetic mean is the most frequently used measure of central tendency, it does have shortcomings
that in some cases tend to make it misleading when describing a population or sample. In particular, the
arithmetic mean is sensitive to extreme values.
Example:
For example, let's say we have the following five observations: -9000, 1.4, 1.6, 2.4 and 3.7. The arithmetic
mean is –1798.2 [(-9000 + 1.4 + 1.6 + 2.4 + 3.7)/5], yet –1798.2 has little meaning in describing our data set.
The outlier (-9000) draws down the overall mean. Statisticians use a variety of methods to compensate for
outliers, such as, for example, eliminating the highest and lowest value before calculating the mean.
For example, by dropping –9000 and 3.7, the three remaining observations have a mean of 1.8, a more
meaningful description of the data. Another approach is to use either the median or mode, or both.
Weighted Average or Mean
The weighted average or weighted mean, when applied to a portfolio, takes the mean return of each asset class
and weights it by the allocation of each class.
Say a portfolio manager has the following allocation and mean annual performance returns achieved for each
class:
The weighted mean is calculated by weighting the return on each class and summing:
Median
Median is defined as the middle value in a series that is sorted in either ascending or descending order. In the
example above with five observations, the median, or middle value, is 1.6 (i.e. two values below 1.6, and two
values above 1.6). In this case, the median is a much fairer indication of the data compared to the mean of –
1798.2.
Mode
Mode is defined as the particular value that is most frequently observed. In some applications, the mode is the
most meaningful description. Take a case with a portfolio of ten mutual funds and their respective ratings: 5, 4,
4, 4, 4, 4, 4, 4, 3, 2 and 1. The arithmetic mean rating is 3.5 stars. However, in this example, the modal rating of
four describes the majority of observations and might be seen as a fairer description.
Weighted Mean
Weighted mean is frequently seen in portfolio problems in which various assets classes are weighted within the
portfolio – for example, if stocks comprise 60% of a portfolio, then 0.6 is the weight. A weighted mean is
computed by multiplying the mean of each weight by the weight, and then summing the products.
Take an example where stocks are weighted 60%, bonds 30% and cash 10%. Assume that the stock portion
returned 10%, bonds returned 6% and cash returned 2%. The portfolio's weighted mean return is:
Stocks (wtd) + Bonds (wtd) + Cash (wtd) = (0.6)*(0.1) + (0.3)*(0.06) + (0.1)*(0.02) = (0.06) + (0.018) +
(0.002) = 8%
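The weighted mean computation is essentially a one-liner:

```python
weights = [0.60, 0.30, 0.10]   # stocks, bonds, cash
returns = [0.10, 0.06, 0.02]   # mean return of each class

# Multiply each class return by its portfolio weight and sum the products.
weighted_mean = sum(w * r for w, r in zip(weights, returns))
print(round(weighted_mean, 4))  # 0.08
```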
Geometric Mean
We initially introduced the geometric mean earlier in the computations for time-weighted performance. It is
usually applied to data in percentages: rates of return over time, or growth rates. With a series of n observations
of statistic X, the geometric mean (G) is:
Formula 2.17
G = (X1*X2*X3* … *Xn)^(1/n)
Applied to returns of 4%, 5%, –3% and 10%:
G = ((1.04)*(1.05)*(0.97)*(1.1))^(1/4) – 1 = 3.9%.
It's important to gain experience with using geometric mean on percentages, which involves linking the data
together: (1) add 1 to each percentage, (2) multiply all terms together, (3) carry the product to the 1/n power and
(4) subtract 1 from the result.
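The four linking steps look like this in code, using the returns of 4%, 5%, –3% and 10% implied by the factors in the example above:

```python
returns = [0.04, 0.05, -0.03, 0.10]   # the returns implied by the factors above

linked = 1.0
for r in returns:
    linked *= 1 + r                    # steps 1 and 2: add 1 to each, multiply together

geo_mean = linked ** (1 / len(returns)) - 1   # steps 3 and 4: 1/n power, subtract 1
print(round(geo_mean, 3))  # 0.039
```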
The harmonic mean is most associated with questions about dollar cost averaging, but its use is limited.
Arithmetic mean, weighted mean and geometric mean are the most frequently used measures and should be the
main emphasis of study.
By the same process, quartiles are the result of a distribution being divided into four parts; quintiles refer to five
parts; deciles, 10 parts; and percentiles, 100 parts. A manager in the second quintile is better than the bottom three quintiles (60% of managers) but below the top quintile (the best 20%), i.e. somewhere between 20% and 40% in percentile terms. A manager at the 21st percentile has 20 percentiles above and 79 below.
2.11 - Standard Deviation And Variance
1. Take the difference between each observed value and the mean (the deviation)
2. Take the absolute value of each deviation and sum the absolute deviations
3. Divide the sum by the number of observations
Answer:
Range = Maximum – Minimum = (+12.3%) – (+5.0%) = 7.3%
Mean absolute deviation starts by finding the mean: (10.1% + 7.7% + 5.0% + 12.3% + 12.2% + 10.9%)/6 =
9.7%.
Each of the six observations deviate from the 9.7%; the absolute deviation ignores +/–.
1st: 10.1 – 9.7 = 0.4
2nd: 7.7 – 9.7 = 2.0
3rd: 5.0 – 9.7 = 4.7
4th: 12.3 – 9.7 = 2.6
5th: 12.2 – 9.7 = 2.5
6th: 10.9 – 9.7 = 1.2
Next, the absolute deviations are summed and divided by 6: (0.4 + 2.0 + 4.7 + 2.6 + 2.5 + 1.2)/6 = 13.4/6 = 2.233333, or rounded, 2.2%.
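The range and mean absolute deviation worked above can be reproduced with a short sketch:

```python
# Range and mean absolute deviation for the six mid-cap fund
# returns from the example (figures in %).
returns = [10.1, 7.7, 5.0, 12.3, 12.2, 10.9]

rng = max(returns) - min(returns)                 # 12.3 - 5.0
mean = sum(returns) / len(returns)                # 9.7
mad = sum(abs(r - mean) for r in returns) / len(returns)

print(round(rng, 1), round(mean, 1), round(mad, 2))  # 7.3 9.7 2.23
```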
Variance
Variance (σ^2) is a measure of dispersion that in practice can be easier to apply than mean absolute deviation
because it removes +/– signs by squaring the deviations.
Returning to the example of mid-cap mutual funds, we had six deviations. To compute variance, we take the
square of each deviation, add the terms together and divide by the number of observations.
Variance = (0.16 + 4.0 + 22.09 + 6.76 + 6.25 + 1.44)/6 = 6.7833. Variance is not in the same units as the
underlying data. In this case, it's expressed as 6.7833% squared – difficult to interpret unless you are a
mathematical expert (percent squared?).
Standard Deviation
Standard deviation (σ) is the square root of the variance, or (6.7833)^(1/2) = 2.60%. Standard deviation is expressed
in the same units as the data, which makes it easier to interpret. It is the most frequently used measure of
dispersion.
Our calculations above were done for a population of six mutual funds. In practice, an entire population is either
impossible or impractical to observe, and by using sampling techniques, we estimate the population variance
and standard deviation. The sample variance formula is very similar to the population variance, with one
exception: instead of dividing by n observations (where n = population size), we divide by (n – 1) degrees of
freedom, where n = sample size. So in our mutual fund example, if the problem was described as a sample of a
larger database of mid-cap funds, we would compute variance using n – 1 degrees of freedom.
Sample variance (s^2) = (0.16 + 4.0 + 22.09 + 6.76 + 6.25 + 1.44)/(6 – 1) = 8.14
Sample standard deviation = (8.14)^(1/2) = 2.85%.
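A minimal sketch contrasting the population and sample calculations for the same six returns:

```python
# Population vs. sample variance for the six fund returns: the only
# difference is the divisor, n for a population, n - 1 for a sample.
returns = [10.1, 7.7, 5.0, 12.3, 12.2, 10.9]
n = len(returns)
mean = sum(returns) / n
sq_devs = [(r - mean) ** 2 for r in returns]

pop_var = sum(sq_devs) / n          # about 6.78
samp_var = sum(sq_devs) / (n - 1)   # about 8.14
pop_sd = pop_var ** 0.5
samp_sd = samp_var ** 0.5

print(round(pop_sd, 2), round(samp_sd, 2))  # 2.6 2.85
```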
In fact, standard deviation is so widely used because, unlike variance, it is expressed in the same units as the
original data, so it is easy to interpret, and can be used on distribution graphs (e.g. the normal distribution).
Target semivariance is a variation of this concept, considering only those squared deviations below a certain
target. For example, if a mutual fund has a mean quarterly return of +3.6%, we may wish to focus only on
quarters where the outcome is –5% or lower. Target semivariance eliminates all quarters above –5%. From
there, the process of computing target semivariance follows the same procedure as other variance measures.
Chebyshev's Inequality
Chebyshev's inequality states that the proportion of observations within k standard deviations of an arithmetic
mean is at least 1 – 1/k2, for all k > 1.
# of Standard Deviations from Mean (k)    Chebyshev's Inequality            % of Observations
2                                         1 – 1/(2)^2 = 1 – 1/4 = 3/4       75 (0.75)
3                                         1 – 1/(3)^2 = 1 – 1/9 = 8/9       89 (0.8889)
4                                         1 – 1/(4)^2 = 1 – 1/16 = 15/16    94 (0.9375)
Given that at least 75% of observations fall within two standard deviations, if a distribution has an annual mean return of 10% and a standard deviation of 5%, we can state that in at least 75% of the years the return will be anywhere from 0% to 20%; in at most 25% of the years, it will be either below 0% or above 20%. Given that at least 89% fall within three standard deviations, in at least 89% of the years the return will be within a range of –5% to +25%; at most 11% of the time it won't.
Later we will learn that for so-called normal distributions, we expect about 95% of the observations to fall
within two standard deviations. Chebyshev's inequality is more general and doesn't assume a normal
distribution, that is, it applies to any shaped distribution.
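The bound for each k can be computed directly; the helper name below is illustrative, not from the text:

```python
# Chebyshev's inequality: at least 1 - 1/k^2 of observations lie
# within k standard deviations of the mean, for any distribution.
def chebyshev_min_fraction(k):
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    print(k, round(chebyshev_min_fraction(k), 4))
# 2 0.75
# 3 0.8889
# 4 0.9375
```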
Coefficient of Variation
The coefficient of variation (CV) helps the analyst interpret relative dispersion. In other words, a calculated
standard deviation value is just a number. Does this number indicate high or low dispersion? The coefficient of
variation helps describe standard deviation in terms of its proportion to its mean by this formula:
Formula 2.18
CV = s/X̄ (sample standard deviation divided by the sample mean)
Sharpe Ratio
The Sharpe ratio is a measure of the risk-reward tradeoff of an investment security or portfolio. It starts by
defining excess return, or the percentage rate of return of a security above the risk-free rate. In this view, the
risk-free rate is a minimum rate that any security should earn. Higher rates are available provided one assumes
higher risk.
The Sharpe ratio is calculated by dividing the excess return by the standard deviation of return:
Formula 2.19
Sharpe ratio = (portfolio return – risk-free rate)/standard deviation of portfolio return
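Both measures reduce to one-line formulas; the function names and input figures below are hypothetical, chosen only to illustrate:

```python
# Coefficient of variation: standard deviation relative to the mean.
def coefficient_of_variation(std_dev, mean):
    return std_dev / mean

# Sharpe ratio: excess return over the risk-free rate, per unit of risk.
def sharpe_ratio(portfolio_return, risk_free_rate, std_dev):
    return (portfolio_return - risk_free_rate) / std_dev

# Hypothetical inputs: 10% return, 4% risk-free rate, 5% standard deviation.
print(coefficient_of_variation(0.05, 0.10))      # 0.5
print(round(sharpe_ratio(0.10, 0.04, 0.05), 2))  # 1.2
```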
A nonsymmetrical or skewed distribution occurs when one side of the distribution does not mirror the other.
Applied to investment returns, nonsymmetrical distributions are generally described as being either positively
skewed (meaning frequent small losses and a few extreme gains) or negatively skewed (meaning frequent small
gains and a few extreme losses).
For positively skewed distributions, the mode (point at the top of the curve) is less than the median (the point
where 50% are above/50% below), which is less than the arithmetic mean (sum of observations/number of
observations). The opposite rules apply to negatively skewed distribution: mode is greater than median, which
is greater than arithmetic mean.
Positive: Mean > Median > Mode
Negative: Mean < Median < Mode
Notice that in alphabetical order, the listing is mean → median → mode. For positive skew, they are separated with a greater-than sign; for negative, less-than.
Kurtosis
Kurtosis refers to the degree of peak in a distribution. More peak than normal (leptokurtic) means that a
distribution also has fatter tails and that there is a greater chance of extreme outcomes compared to a normal
distribution.
The kurtosis formula measures the degree of peak. Kurtosis equals three for a normal distribution; excess
kurtosis calculates and expresses kurtosis above or below 3.
In figure 2.5 below, the solid line is the normal distribution; the dashed line is leptokurtic distribution.
Figure 2.5: Kurtosis
I. Basics
Random Variable
A random variable refers to any quantity with uncertain
expected future values. For example, time is not a random
variable since we know that tomorrow will have 24 hours,
the month of January will have 31 days and so on.
However, the expected rate of return on a mutual fund
and the expected standard deviation of those returns are
random variables. We attempt to forecast these random
variables based on past history and on our forecast for the
economy and interest rates, but we cannot say for certain
what the variables will be in the future – all we have are
forecasts or expectations.
Outcome
Outcome refers to any possible value that a random
variable can take. For expected rate of return, the range of
outcomes naturally depends on the particular investment or proposition. Lottery players have a near-certain
probability of losing all of their investment (–100% return), with a very small chance of becoming a
multimillionaire (+1,000,000% return – or higher!). Thus for a lottery ticket, there are usually just two extreme
outcomes. Mutual funds that invest primarily in blue chip stocks will involve a much narrower series of
outcomes and a distribution of possibilities around a specific mean expectation. When a particular outcome or a
series of outcomes are defined, it is referred to as an event. If our goal for the blue chip mutual fund is to
produce a minimum 8% return every year on average, and we want to assess the chances that our goal will not
be met, our event is defined as average annual returns below 8%. We use probability concepts to ask what the
chances are that our event will take place.
Event
If a list of events is mutually exclusive, it means that only one of them can possibly take place. Exhaustive
events refer to the need to incorporate all potential outcomes in the defined events. For return expectations, if
we define our two events as annual returns equal to or greater than 8% and annual returns equal to or less than
8%, these two events would not meet the definition of mutually exclusive since a return of exactly 8% falls into
both categories. If our defined two events were annual returns less than 8% and annual returns greater than 8%,
we've covered all outcomes except for the possibility of an 8% return; thus our events are not exhaustive.
1. The probability of any event is a number between 0 and 1, or 0 ≤ P(E) ≤ 1. A P followed by parentheses is the probability of the event E occurring. Probabilities fall on a scale between 0, or 0% (impossible), and 1, or 100% (certain). There is no such thing as a negative probability (less than impossible?) or a probability greater than 1 (more certain than certain?).
2. The sum of all probabilities of all events equals 1, provided the events are both mutually exclusive and
exhaustive. If events are not mutually exclusive, the probabilities would add up to a number greater than
1, and if they were not exhaustive, the sum of probabilities would be less than 1. Thus, there is a need to
qualify this second property to ensure the events are properly defined (mutually exclusive, exhaustive).
On an exam question, if the probabilities in a research study add up to a number other than 1, you might question whether this principle has been met.
These terms refer to the particular approach an analyst has used to define the events and make predictions on
probabilities (i.e. the likelihood of each event occurring). How exactly does the analyst arrive at these
probabilities? What exactly are the numbers based upon? The approach is empirical, subjective or a priori.
Empirical Probabilities
Empirical probabilities are objectively drawn from historical data. If we assembled a return distribution based
on the past 20 years of data, and then used that same distribution to make forecasts, we have used an empirical
approach. Of course, we know that past performance does not guarantee future results, so a purely empirical
approach has its drawbacks.
Subjective Probabilities
Relationships must be stable for empirical probabilities to be accurate and for investments and the economy,
relationships change. Thus, subjective probabilities are calculated; these draw upon experience and judgment to
make forecasts or modify the probabilities indicated from a purely empirical approach. Of course, subjective
probabilities are unique to the person making them and depend on his or her talents – the investment world is
filled with people making incorrect subjective judgments.
A Priori Probabilities
A priori probabilities represent probabilities that are objective and based on deduction and reasoning about a
particular case. For example, if we forecast that a company is 70% likely to win a bid on a contract (based on either an empirical or a subjective approach), and we know this firm has just one business competitor, then we can
also make an a priori forecast that there is a 30% probability that the bid will go to the competitor.
Conditional Probability
Conditional probability answers this question: what is the probability of this one event occurring, given that
another event has already taken place? A conditional probability has the notation P(A | B), which represents the
probability of event A, given B. If we believe that a stock is 70% likely to return 15% in the next year, as long
as GDP growth is at least 3%, then we have made our prediction conditional on a second event (GDP growth).
In other words, event A is the stock will rise 15% in the next year; event B is GDP growth is at least 3%; and
our conditional probability is P(A | B) = 0.7.
2.14 - Joint Probability
Joint probability is defined as the probability of both A and B taking place, and is denoted by P(AB).
Joint probability is not the same as conditional probability, though the two concepts are often confused.
Conditional probability assumes that one event has taken place or will take place, and then asks for the
probability of the other (A, given B). Joint probability does not
have such conditions; it simply asks for the chances of both
happening (A and B). In a problem, to help distinguish between
the two, look for qualifiers that one event is conditional on the
other (conditional) or whether they will happen concurrently (joint).
Probability definitions can find their way into CFA exam questions. Naturally, there may also be questions that
test the ability to calculate joint probabilities. Such computations require use of the multiplication rule, which
states that the joint probability of A and B is the product of the conditional probability of A given B, times the
probability of B. In probability notation:
Formula 2.20
P(AB) = P(A | B)*P(B)
Given a conditional probability P(A | B) = 40%, and a probability of B = 60%, the joint probability P(AB) =
0.6*0.4 or 24%, found by applying the multiplication rule.
Formula 2.21
P(A or B) = P(A) + P(B) – P(AB)
For example, if the probability of A = 0.4, and the probability of B = 0.45, and the joint probability of both is
0.2, then the probability of either A or B = 0.4 + 0.45 – 0.2 = 0.65.
Remembering to subtract the joint probability P(AB) is often the difficult part of applying this rule. Indeed, if
the addition rule is required to solve a probability problem on the exam, you can be sure that the wrong answers
will include P(A) + P(B), and P(A)*P(B). Just remember that the addition rule is asking for either A or B, so
you don't want to double count. Thus, the probability of both A and B, P(AB), is an intersection and needs to
be subtracted to arrive at the correct probability.
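Both rules from the examples above can be sketched as:

```python
# Multiplication rule: P(AB) = P(A | B) * P(B).
p_a_given_b, p_b = 0.40, 0.60
p_joint = p_a_given_b * p_b            # joint probability of A and B

# Addition rule: P(A or B) = P(A) + P(B) - P(AB), subtracting the
# intersection so it is not double counted.
p_a, p_b2, p_ab = 0.40, 0.45, 0.20
p_a_or_b = p_a + p_b2 - p_ab

print(round(p_joint, 2), round(p_a_or_b, 2))  # 0.24 0.65
```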
Two events are not independent when the conditional probability of A given B is higher or lower than the
unconditional probability of A. In this case, A is dependent on B. Likewise, if P(B | A) is greater or less than
P(B), we know that B depends on A.
Formula 2.22
For independent events: P(AB) = P(A)*P(B)
Moreover, the rule generalizes for more than two events provided they are all independent of one another, so the joint probability of three events P(ABC) = P(A)*P(B)*P(C), again assuming independence.
Formula 2.23
P(A) = P(A | S)*P(S) + P(A | S^c)*P(S^c), where S and S^c are mutually exclusive and exhaustive scenarios
This rule is easiest to remember if you compare the formula to the weighted-mean calculation used to compute
rate of return on a portfolio. In that exercise, each asset class had an individual rate of return, weighted by its
allocation to compute the overall return. With the total probability rule, each scenario has a conditional
probability (i.e. the likelihood of event A, given that scenario), with each conditional probability weighted by
the probability of that scenario occurring.
The total probability rule applies to three or more scenarios provided they are mutually exclusive and
exhaustive. The formula is the sum of all weighted conditional probabilities (weighted by the probability of
each scenario occurring).
Answer:
The analyst's expected value for next year's sales is (0.1)*(16.0) + (0.3)*(15.0) + (0.3)*(14.0) + (0.3)*(13.0) =
$14.2 million.
The total probability rule for finding the expected value of variable X is given by E(X) = E(X | S)*P(S) + E(X | S^c)*P(S^c) for the simplest case: two scenarios, S and S^c, that are mutually exclusive and exhaustive. If we refer to them as Scenario 1 and Scenario 2, then E(X | S) is the expected value of X in Scenario 1, and E(X | S^c) is the expected value of X in Scenario 2.
Tree Diagram
The total probability rule can be easier to visualize if the information is presented in a tree diagram. Take a case
where we have forecasted company sales to be anywhere in a range from $13 to $16 million, based on
conditional probabilities.
This company is dependent on the overall economy and on Wal-Mart's same-store sales growth, leading to the
conditional probability scenarios demonstrated in figure 2.7 below:
In a good economy, our expected sales would be 25% likely to be $16 million, and 75% likely to be $15
million, depending on Wal-Mart's growth number. In a bad economy, we would be equally likely to generate
$13 million if Wal-Mart sales drop more than 2% or $14 million (if the growth number falls between –2% and
+1.9%).
We predict that a good economy is 40% likely, and a bad economy 60% likely, leading to our expected value
for sales: (0.4)*(15.25) + (0.6)*(13.5) = 14.2 million.
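The tree-diagram calculation can be reproduced in a few lines:

```python
# Expected sales via the total probability rule, mirroring the tree
# diagram: conditional expected values weighted by the probability
# of each economic scenario (figures from the example above).
good_econ = 0.25 * 16 + 0.75 * 15   # expected sales | good economy = 15.25
bad_econ = 0.50 * 13 + 0.50 * 14    # expected sales | bad economy = 13.5

expected_sales = 0.4 * good_econ + 0.6 * bad_econ
print(round(expected_sales, 1))  # 14.2 ($ millions)
```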
2.15 - Advanced Probability Concepts
Covariance
Covariance is a measure of the relationship between two random variables, designed to show the degree of co-
movement between them. Covariance is calculated based on the probability-weighted average of the cross-
products of each random variable's deviation from its own
expected value. A positive number indicates co-
movement (i.e. the variables tend to move in the same
direction); a value of 0 indicates no relationship, and a
negative covariance shows that the variables move in the
opposite direction.
Correlation
Correlation is a concept related to covariance, as it also gives an indication of the degree to which two random
variables are related, and (like covariance) the sign shows the direction of this relationship (positive (+) means
that the variables move together; negative (-) means they are inversely related). Correlation of 0 means that
there is no linear relationship one way or the other, and the two variables are said to be unrelated.
A correlation number is much easier to interpret than covariance because a correlation value will always be
between –1 and +1.
• –1 indicates a perfectly inverse relationship (a unit change in one means that the other will have a unit
change in the opposite direction)
• +1 means a perfectly positive linear relationship (unit changes in one always bring the same unit
changes in the other).
Moreover, there is a uniform scale from –1 to +1 so that as correlation values move closer to 1, the two
variables are more closely related. By contrast, a covariance value between two variables could be very large
and indicate little actual relationship, or look very small when there is actually a strong linear correlation.
Correlation is defined as the ratio of the covariance between two random variables and the product of their two
standard deviations, as presented in the following formula:
Formula 2.24
Correlation (A, B) = Covariance (A, B)/(Standard Deviation (A)*Standard Deviation (B))
As a result: Covariance (A, B) = Correlation (A, B)*Standard Deviation (A)*Standard Deviation (B)
With these formulas, either correlation or covariance is likely to be required in a calculation in which the other terms are provided. Such an exercise simply requires remembering the relationship and substituting the terms
provided. For example, if a covariance of 30 between two variables is given, and their standard deviations are 5 and 15, the correlation would be 30/(5*15) = 0.40. If you are given a correlation of 0.40 and standard deviations of 5 and 15, the covariance would be (0.4)*(5)*(15), or 30.
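The substitution both ways can be sketched as (helper names are illustrative):

```python
# Converting between covariance and correlation, as in the example:
# correlation = covariance / (sd_a * sd_b), and the reverse.
def correlation(cov, sd_a, sd_b):
    return cov / (sd_a * sd_b)

def covariance(corr, sd_a, sd_b):
    return corr * sd_a * sd_b

print(correlation(30, 5, 15))  # 0.4
print(covariance(0.4, 5, 15))  # 30.0
```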
Variance (σ^2) is computed by finding the probability-weighted average of squared deviations from the expected
value.
Example: Variance
In our previous example on making a sales forecast, we found that the expected value was $14.2 million.
Calculating variance starts by computing the deviations from $14.2 million, then squaring:
Answer:
Variance weights each squared deviation by its probability: (0.1)*(3.24) + (0.3)*(0.64) + (0.3)*(0.04) +
(0.3)*(1.44) = 0.96
The variance of return is a function of the variance of the component assets as well as the covariance between
each of them. In modern portfolio theory, a low or negative correlation between asset classes will reduce overall
portfolio variance. The formula for portfolio variance in the simple case of a two–asset portfolio is given by:
Formula 2.25
Portfolio variance = w_A^2*σ^2(R_A) + w_B^2*σ^2(R_B) + 2*(w_A)*(w_B)*Cov(R_A, R_B)

        Stock    Bond
Stock   350      80
Bond    80       150
From this matrix, we know that the variance on stocks is 350 (the covariance of any asset to itself equals its
variance), the variance on bonds is 150 and the covariance between stocks and bonds is 80. Given our portfolio
weights of 0.5 for both stocks and bonds, we have all the terms needed to solve for portfolio variance.
Answer:
Portfolio variance = w_A^2*σ^2(R_A) + w_B^2*σ^2(R_B) + 2*(w_A)*(w_B)*Cov(R_A, R_B) = (0.5)^2*(350) + (0.5)^2*(150) + 2*(0.5)*(0.5)*(80) = 87.5 + 37.5 + 40 = 165.
Standard deviation (σ), as defined earlier in our statistics discussion, is the positive square root of the variance. In the sales-forecast example, σ = (0.96)^(1/2), or about $0.98 million; for this portfolio,
σ = (165)^(1/2) = 12.85%.
A two–asset portfolio was used to illustrate this principle; most portfolios contain far more than two assets, and
the formula for variance becomes more complicated for multi-asset portfolios (all terms in a covariance matrix
need to be added to the calculation).
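A sketch of the two-asset calculation, using the covariance-matrix values above:

```python
# Two-asset portfolio variance with the covariance-matrix values
# from the example: var(stocks) = 350, var(bonds) = 150,
# cov(stocks, bonds) = 80, and 50/50 weights.
w_a, w_b = 0.5, 0.5
var_a, var_b, cov_ab = 350.0, 150.0, 80.0

port_var = w_a ** 2 * var_a + w_b ** 2 * var_b + 2 * w_a * w_b * cov_ab
port_sd = port_var ** 0.5

print(port_var, round(port_sd, 2))  # 165.0 12.85
```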
Answer:
To calculate covariance, we start by finding the probability-weighted sales estimate (expected value):
GM = (0.3)*(10) + (0.4)*(4) + (0.3)*(–4) = 3 + 1.6 – 1.2 = 3.4%
In the following table, we compute covariance by taking the deviations from each expected value in each
market environment, multiplying the deviations together (the cross products) and then weighting the cross
products by the probability of that scenario.
The last column (prob-wtd.) was found by multiplying the cross product (column 4) by the probability of that
scenario (column 5).
The covariance is found by adding the values in the last column: 6.534+0.072+8.214 = 14.82.
Bayes' Formula
We all know intuitively of the principle that we learn from experience. For an analyst, learning from experience
takes the form of adjusting expectations (and probability estimates) based on new information. Bayes' formula
essentially takes this principle and applies it to the probability concepts we have already learned, by showing how to calculate an updated probability: the probability of the event, given new information:
Bayes' formula:
Updated probability of the event = Conditional probability of the new info. given the event * Prior probability of the event / Unconditional probability of the new info.
In probability notation: P(E | I) = P(I | E)*P(E)/P(I)
Formula 2.26

Step    Number of ways this step can be done
1       6
2       3
3       1
4       5
Factorial Notation
n! = n*(n – 1)*(n – 2) … *1. In other words, 5!, or 5 factorial is equal to (5)*(4)*(3)*(2)*(1) = 120. In counting
problems, it is used when there is a given group of size n, and the exercise is to assign the group to n slots; then
the number of ways these assignments could be made is given by n!. If we were managing five employees and
had five job functions, the number of possible assignments is 5! = 120.
Combination Notation
Combination notation refers to the number of ways that we can choose r objects from a total of n objects, when
the order in which the r objects is listed does not matter.
In shorthand notation:
Formula 2.27
nCr = n!/((n – r)!*r!)
Thus if we had our five employees and we needed to choose three of them to team up on a new project, where
they will be equal members (i.e. the order in which we choose them isn't important), the formula tells us that there are 5!/((5 – 3)!*3!) = 120/((2)*(6)) = 120/12, or 10 possible combinations.
Permutation notation
Permutation notation takes the same case (choosing r objects from a group of n) but assumes that the order in which the r objects are listed matters. It is given by this notation:
Formula 2.28
nPr = n!/(n – r)!
Returning to our example, if we not only wanted to choose three employees for our project, but wanted to
establish a hierarchy (leader, second-in-command, subordinate), by using the permutation formula, we would
have 5!/(5 – 3)! = 120/2 = 60 possible ways.
Now, let's consider how to calculate problems asking the number of ways to choose r objects from a total of n objects when the order in which the r objects are listed matters, and when the order does not matter.
• The combination formula is used if the order of r does not matter. For choosing three objects from a total of five objects, we found 5!/((5 – 3)!*3!), or 10 ways.
• The permutation formula is used if the order of r does matter. For choosing three objects from a total of five objects, we found 5!/(5 – 3)!, or 60 ways.
Method         When appropriate?
Factorial      Assigning a group of size n to n slots
Combination    Choosing r objects (in any order) from a group of n
Permutation    Choosing r objects (in a particular order) from a group of n
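Python's math module implements all three counting methods directly, which makes the examples above easy to verify:

```python
# Verifying the factorial, combination, and permutation examples
# with Python's math module.
import math

print(math.factorial(5))  # 120: ways to assign 5 employees to 5 job functions
print(math.comb(5, 3))    # 10: choose 3 of 5 when order does not matter
print(math.perm(5, 3))    # 60: choose 3 of 5 when order matters
```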
2.16 - Common Probability Distributions
The topics in this section provide a number of the quantitative building blocks useful in analyzing and
predicting random variables such as future sales and earnings, growth rates, market index returns and returns on
individual asset classes and specific securities. All of these variables have uncertain outcomes; thus there is risk
that any downside uncertainty can result in a surprising and material impact. By understanding the mechanics of
probability distributions, such risks can be understood and analyzed, and measures taken to hedge or reduce
their impact.
Probability Distribution
A probability distribution gathers together all possible outcomes of a random variable (i.e. any quantity for
which more than one value is possible), and summarizes these outcomes by indicating the probability of each of
them. While a probability distribution is often associated with the bell-shaped curve, recognize that such a curve is only indicative of one specific type of probability distribution, the so-called normal distribution. The CFA
curriculum does focus on normal distributions since they frequently apply to financial and investment variables,
and are used in hypothesis testing. However, in real life, a probability distribution can take any shape, size and
form.
Suppose, for example, we select a day of the week at random. In this case, we would have a uniform probability distribution: the chances that our random day would fall on any particular day are the same (one in seven), and the graph of our probability distribution would be a straight line.
Probability distributions can be simple to understand as in this example, or they can be very complex and
require sophisticated techniques (e.g., option pricing models, Monte Carlo simulations) to help describe all
possible outcomes.
Discrete Random Variables
Discrete random variables can take on a finite or countable number of possible outcomes. The previous example
asking for a day of the week is an example of a discrete variable, since it can only take seven possible values.
Monetary variables expressed in dollars and cents are always discrete, since money is rounded to the nearest
$0.01. In other words, we may have a formula that suggests a stock worth $15.75 today will be $17.1675 after it
grows 9%, but you can’t give or receive three-quarters of a penny, so our formula would round the outcome of
9% growth to an amount of $17.17.
A continuous random variable, by contrast, can take an infinite number of possible outcomes. For example:
• a stock can grow by 9% next year or by 10%, and in between this range it could grow by 9.3%, 9.4%,
9.5%
• in between 9.3% and 9.4% the rate could be 9.31%, 9.32%, 9.33%, and in between 9.32% and 9.33% it
could grow 9.32478941%
• clearly there is no end to how precise the outcomes could be broken down; thus it’s described as a
continuous variable.
Rates of return can theoretically range from –100% to positive infinity. Time is bound on the lower side by 0.
Market price of a security will also have a lower limit of $0, while its upper limit will depend on the security –
stocks have no upper limit (thus a stock price’s outcome > $0), but bond prices are more complicated, bound by
factors such as time-to-maturity and embedded call options. If the face value of a bond is $1,000, there's an upper
limit (somewhere above $1,000) above which the price of the bond will not go, but pinpointing the upper value
of that set is imprecise.
Probability Function
A probability function gives the probabilities that a random variable will take on a given list of specific values.
For a discrete variable, if (x1, x2, x3, x4 …) are the complete set of possible outcomes, p(x) indicates the chances
that X will be equal to x. Each x in the list for a discrete variable will have a p(x). For a continuous variable, a
probability function is expressed as f(x).
The two key properties of a probability function, p(x) (or f(x) for continuous), are the following:
1. 0 ≤ p(x) ≤ 1 for every possible outcome x.
2. The sum of p(x) across all possible outcomes equals 1.
Determining whether a function satisfies the first property should be easy to spot since we know that
probabilities always lie between 0 and 1. In other words, p(x) could never be 1.4 or –0.2. To illustrate the
second property, say we are given a set of three possibilities for X: (1, 2, 3) and a set of three for Y: (6, 7, 8),
and given the probability functions f(x) and g(y).
x f(x) y g(y)
1 0.31 6 0.32
2 0.43 7 0.40
3 0.26 8 0.23
For all possibilities of f(x), the sum is 0.31+0.43+0.26=1, so we know it is a valid probability function. For all
possibilities of g(y), the sum is 0.32+0.40+0.23 = 0.95, which violates our second principle. Either the given
probabilities for g(y) are wrong, or there is a fourth possibility for y where g(y) = 0.05. Either way it needs to
sum to 1.
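A small sketch of this validity check (the helper name is illustrative):

```python
# The two key properties of a probability function: each p(x) must
# lie between 0 and 1, and the probabilities must sum to 1.
def is_valid_pmf(probs, tol=1e-9):
    return all(0 <= p <= 1 for p in probs) and abs(sum(probs) - 1.0) < tol

f_x = [0.31, 0.43, 0.26]   # sums to 1.00
g_y = [0.32, 0.40, 0.23]   # sums to 0.95 - violates the second property

print(is_valid_pmf(f_x), is_valid_pmf(g_y))  # True False
```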
x f(x)
<0 0.2
>0 0.8
2.17 - Common Probability Distribution Calculations
From the table, we find that the probability that x is less than or equal to 4 is 0.55, the summed probabilities of the first three P(X) terms, or the number found in the cdf column for the row where x = 4. Sometimes a question might ask for the probability of x being greater than 4, which here is 1 – P(x ≤ 4) = 1 – 0.55 = 0.45. This is a question most people should get right – but one that will still have too many people answering 0.55 because they weren't paying attention to the "greater than".
x     P(x)    cdf
2     0.2     0.2
4     0.2     0.4
6     0.2     0.6
8     0.2     0.8
10    0.2     1.0
According to the distribution above, we have the probability of x = 8 as 0.2. The probability of x = 2 is the
same, 0.2.
Suppose that the question called for P(4 ≤ X ≤ 8). The answer would be the sum of P(4) + P(6) + P(8) = 0.2 + 0.2 + 0.2 = 0.6.
Suppose instead the question called for P(4 < X < 8). In this case, the answer would omit P(4) and P(8) since it is strictly less than, NOT less than or equal to, and the correct answer would be P(6) = 0.2. The CFA exam writers love to test
whether you are paying attention to details and will try to trick you – the probability of such tactics is pretty
much a 1.0!
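A sketch that builds the interval probabilities from the distribution above, highlighting the inclusive/exclusive distinction:

```python
# The discrete uniform distribution from the table above: each of
# five outcomes has probability 0.2.
outcomes = [2, 4, 6, 8, 10]
pmf = {x: 0.2 for x in outcomes}

# Inclusive bounds pick up P(4), P(6) and P(8); strict bounds only P(6).
p_incl = sum(pmf[x] for x in outcomes if 4 <= x <= 8)
p_excl = sum(pmf[x] for x in outcomes if 4 < x < 8)

print(round(p_incl, 1), round(p_excl, 1))  # 0.6 0.2
```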
Thus, a binomial random variable is described by two parameters: p (the probability of success of one trial) and
n (the number of trials). A binomial probability distribution with p = 0.50 (equal chance of success or failure)
and n = 4 would appear as:
x    p(x)      cdf
0    0.0625    0.0625
1    0.25      0.3125
2    0.375     0.6875
3    0.25      0.9375
4    0.0625    1.0000
The reference text demonstrates how to construct a binomial probability distribution by using the formula p(x) = (n!/((n – x)!*x!))*(p^x)*((1 – p)^(n – x)). We used this formula to assemble the above data, though the exam would probably not expect you to create each p(x); it would probably provide you with the table, and ask for an interpretation. For this table, the probability of exactly one success is 0.25; the probability of three or fewer successes is 0.9375 (the cdf value in the row where x = 3); and the probability of at least one success is 1 – P(0) = 1 – 0.0625 = 0.9375.
Calculations
The expected value of a binomial random variable is given by the formula n*p. In the example above, with n =
4 and p = 0.5, the expected value would be 4*0.5, or 2.
The variance of a binomial random variable is calculated by the formula n*p*(1 – p). Using the same example,
we have variance of 4*0.5*0.5 = 1.
If our binomial random variable still had n = 4 but with a greater predictability in the trial, say p = 0.9, our
variance would reduce to 4*0.9*0.1 = 0.36. For successive trials (i.e. for higher n), both mean and variance
increase but variance increases at a lower rate – thus the higher the n, the better the model works at predicting
probability.
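The binomial table, mean and variance above can be checked with a short Python sketch (the helper name binom_pmf is ours, not from the curriculum):

```python
from math import comb

def binom_pmf(n, p, x):
    # p(x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 4, 0.5
pmf = [binom_pmf(n, p, x) for x in range(n + 1)]
# pmf -> [0.0625, 0.25, 0.375, 0.25, 0.0625]

# Cumulative distribution: running sum of the pmf
cdf = []
total = 0.0
for prob in pmf:
    total += prob
    cdf.append(total)
# cdf -> [0.0625, 0.3125, 0.6875, 0.9375, 1.0]

mean = n * p                # expected value: 2.0
variance = n * p * (1 - p)  # 1.0
```

Note that the cumulative value at x = 3 works out to 0.9375, the figure used for "three or fewer successes" above.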
To calculate probabilities for a continuous uniform random variable, find the area under the pdf curve – here, a
flat density with a height of 0.2 (implying an interval of width 5, e.g. 0 to 5). In this example, what is the
probability that the random variable will be between 1 and 3? The area is a rectangle with a width of 2 (the
distance between 1 and 3) and a height of 0.2: 2*0.2 = 0.4.
What is the probability that x is less than 3? The rectangle has a width of 3 and the same height of 0.2:
3*0.2 = 0.6
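Both rectangle areas can be computed with a tiny sketch (we assume the graphed pdf is uniform on the interval 0 to 5, consistent with the 0.2 height):

```python
def uniform_prob(a, b, lo, hi):
    # P(lo <= X <= hi) for X uniform on [a, b]:
    # interval width times the constant density 1 / (b - a)
    return (hi - lo) * (1 / (b - a))

prob_1_3 = uniform_prob(0, 5, 1, 3)   # ~0.4
prob_lt_3 = uniform_prob(0, 5, 0, 3)  # ~0.6
```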
2.18 - Common Probability Distribution Properties
Normal Distribution
The normal distribution is a continuous probability distribution that, when graphed as a probability density,
takes the form of the so-called bell-shaped curve. The bell shape results from the fact that, while the range of
possible outcomes is infinite (negative infinity to positive infinity), most of the potential outcomes tend to be
clustered relatively close to the distribution’s mean value. Just how close they are clustered is given by the
standard deviation. In other words, a normal distribution is described completely by two parameters: its mean
(μ) and its standard deviation (σ).
While any normal distribution will share these defining characteristics, the mean and standard deviation will be
unique to the random variable, and these differences will affect the shape of the distribution. On the following
page are two normal distributions, each with the same mean, but the distribution with the dotted line has a
higher standard deviation.
For a portfolio distribution with n stocks, the multivariate distribution is completely described by the n mean
returns, the n standard deviations and the n*(n – 1)/2 correlations. For a 20-stock portfolio, that's 20 mean
returns, 20 standard deviations of return and 20*19/2, or 190, correlations.
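The parameter count can be verified with a one-liner (the function name is ours, for illustration):

```python
def multivariate_normal_param_count(n):
    # n means + n standard deviations + n*(n-1)/2 pairwise correlations
    means = n
    stdevs = n
    correlations = n * (n - 1) // 2
    return means, stdevs, correlations

counts = multivariate_normal_param_count(20)  # (20, 20, 190)
```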
2.19 - Confidence Intervals
While a normally-distributed random variable can have
many potential outcomes, the shape of its distribution
gives us confidence that the vast majority of these
outcomes will fall relatively close to its mean. In fact, we
can quantify just how confident we are. By
using confidence intervals - ranges that are a function of
the properties of a normal bell-shaped curve - we can
define ranges of probabilities.
In other words, by assuming a normal distribution, we are 68% confident that a variable will fall within one
standard deviation of the mean. Within two standard deviations, our confidence grows to 95%; within three
standard deviations, 99%. Take the example of a distribution of returns on a security with a mean of 10% and a
standard deviation of 5%:
• 68% of the returns will be between 5% and 15% (within 1 standard deviation: 10 ± 5).
• 95% of the returns will be between 0% and 20% (within 2 std. devs.: 10 ± 2*5).
• 99% of the returns will be between –5% and 25% (within 3 std. devs.: 10 ± 3*5).
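These interval probabilities can be reproduced with Python's statistics module; note that the exact figures are 68.27%, 95.45% and 99.73% – the 68/95/99 numbers are the usual rounded versions:

```python
from statistics import NormalDist

returns = NormalDist(mu=10, sigma=5)  # mean 10%, std dev 5%

within_1 = returns.cdf(15) - returns.cdf(5)    # ~0.6827
within_2 = returns.cdf(20) - returns.cdf(0)    # ~0.9545
within_3 = returns.cdf(25) - returns.cdf(-5)   # ~0.9973
```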
Standardizing a random variable X is done by subtracting the mean value (μ) from X, and then dividing the
result by the standard deviation (σ). The result is a standard normal random variable, denoted by the
letter Z.
Formula 2.31
Z = (X – μ)/σ
Example 1:
If a distribution has a mean of 10 and a standard deviation of 5, and a random observation X is –2, we
standardize our random variable with the equation for Z: Z = (–2 – 10)/5 = –2.4.
The standard normal random variable Z tells us how many standard deviations the observation lies from the
mean. In this case, the observation –2 is 2.4 standard deviations below the mean of 10.
Example 2:
You are considering an investment portfolio with an expected return of 10% and a standard deviation of 8%.
The portfolio's returns are normally distributed. What is the probability of earning a return less than 2%?
Again, we'd start by standardizing the random variable X, which in this case is the 2% return:
Z = (2 – 10)/8 = –1.
Next, one would ordinarily consult a Z-table of cumulative probabilities for the standard normal distribution to
determine the probability. In this case, for Z = –1, P(Z ≤ –1) = 0.158655, or about 16%.
Keep in mind that your upcoming exam will not provide Z-tables, so, how would you solve this problem on test
day?
The answer is that you need to remember that 68% of observations fall within ±1 standard deviation on a normal
curve, which means that 32% fall outside that range. This question essentially asked for the probability of
falling more than one standard deviation below the mean, or 32%/2 = 16%. Study the earlier diagram that shows
specific percentages for certain standard deviation intervals on a normal curve – in particular, remember 68%
for ±1 standard deviation away, and remember 95% for ±2 away.
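The two standardization examples above can be replicated in Python (the z_score helper is our own naming):

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    # How many standard deviations x lies from the mean
    return (x - mu) / sigma

# Example 1: observation -2 from a distribution with mean 10, std dev 5
z1 = z_score(-2, 10, 5)      # -2.4

# Example 2: probability of a return below 2% when mean = 10%, std dev = 8%
z2 = z_score(2, 10, 8)       # -1.0
prob = NormalDist().cdf(z2)  # ~0.1587, i.e. about 16%
```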
Shortfall Risk
Shortfall risk is essentially a refinement of mean-variance analysis – that is, of the idea that one must focus on
both risk and return, as opposed to the return alone. Risk is typically measured by
standard deviation, which measures all deviations – i.e. both positive and negative. In other words, positive
deviations are treated as if they were equal to negative deviations. In the real world, of course, negative
surprises are far more important to quantify and predict with clarity if one is to accurately define risk. Two
mutual funds could have the same risk if measured by standard deviation, but if one of those funds tends to have
more extreme negative outcomes, while the other had a high standard deviation due to a preponderance of
extreme positive surprises, then the actual risk profiles of those funds would be quite different. Shortfall risk
defines a minimum acceptable level, and then focuses on whether a portfolio will fall below that level over a
given time period.
Formula 2.32
SFRatio = (E(Rp) – RL) / σp
where E(Rp) is the portfolio's expected return, RL is the minimum acceptable (threshold) return, and σp is the
portfolio's standard deviation.
Example: Given a minimum threshold return of –2%, which of the following portfolios is preferred under the
safety-first criterion?
                          Portfolio A    Portfolio B
Expected Annual Return    8%             12%
Standard Deviation        10%            16%
Answer:
The SFRatio for portfolio A is (8 – (–2))/10 = 1.0
The SFRatio for portfolio B is (12 – (–2))/16 = 0.875
In other words, the minimum threshold is one full standard deviation below the expected return in Portfolio A,
but only 0.875 standard deviations below in Portfolio B, so by safety-first rules we opt for Portfolio A.
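The safety-first comparison can be scripted as follows (the –2% threshold comes from the example; the function name is ours):

```python
def sf_ratio(expected_return, threshold, stdev):
    # Roy's safety-first ratio: distance from the threshold in std-dev units
    return (expected_return - threshold) / stdev

a = sf_ratio(8, -2, 10)        # 1.0
b = sf_ratio(12, -2, 16)       # 0.875
best = "A" if a > b else "B"   # safety-first rule: pick the higher ratio
```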
Lognormal Distributions
A lognormal distribution has two distinct properties: it is always positive (bounded on the left by zero), and it is
skewed to the right. Prices for stocks and many other financial assets (anything which by definition can never
be negative) are often found to be lognormally distributed. Also, the lognormal and normal distributions are
related: if a random variable X is lognormally distributed, then its natural log, ln(X) is normally distributed.
(Thus the term “lognormal” – the log is normal.) Figure 2.11 below demonstrates a typical lognormal
distribution.
2.20 - Discrete and Continuous Compounding
In discrete compounding of rates of return, time moves
forward in increments, with each increment having a rate
of return equal to (ending price / beginning price) – 1. Of
course, the more frequent the compounding, the higher
the effective rate of return. Take a security that is expected to
return 12% annually:
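The effect of compounding frequency on a 12% stated rate can be sketched in Python (frequencies chosen for illustration):

```python
def effective_rate(stated_rate, periods_per_year):
    # Effective annual rate for a given discrete compounding frequency
    return (1 + stated_rate / periods_per_year) ** periods_per_year - 1

rates = {m: effective_rate(0.12, m)
         for m in (1, 2, 4, 12, 365, 365 * 24)}
# annual 12.0000%, semiannual 12.3600%, quarterly 12.5509%,
# monthly 12.6825%, daily ~12.7475%, hourly ~12.7496%
```

Each halving of the holding period adds less to the effective rate than the one before, which is the pattern the next paragraph extends to continuous compounding.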
With greater frequency of compounding (i.e. as holding periods become smaller and smaller) the effective rate
gradually increases but in smaller and smaller amounts. Extending this further, we can reduce holding periods
so that they are sliced smaller and smaller so they approach zero, at which point we have the continuously
compounded rate of return. Discrete compounding relates to measurable holding periods and a finite number of
holding periods. Continuous compounding relates to holding periods so small they cannot be measured, with
frequency of compounding so large it goes to infinity.
The continuous rate associated with a holding period is found by taking the natural log of (1 + holding-period
return). Say the holding period is one year and the holding-period return is 12%: ln(1.12) = 0.1133, so the
continuously compounded rate is about 11.33%.
In other words, if 11.33% were continuously compounded, its effective rate of return would be about 12%.
Earlier we found that 12% compounded hourly comes to about 12.7496%. In fact, e (the transcendental number)
raised to the 0.12 power, minus 1, yields 12.7497% (approximately).
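Both conversions can be checked with Python's math module (the 12% figures come from the text):

```python
from math import exp, log

# Continuous rate equivalent to a 12% one-year holding-period return
continuous_rate = exp_input = log(1 + 0.12)   # ~0.1133, i.e. about 11.33%

# Going the other way: 12% compounded continuously
effective = exp(0.12) - 1                     # ~0.127497, i.e. about 12.7497%
```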
As we've stated previously, actual calculations of natural logs are not likely for answering a question as they
give an unfair advantage to those with higher function calculators. At the same time, an exam problem can test
knowledge of a relationship without requiring the calculation. For example, a question could ask:
Q. A portfolio returned 5% over one year, if continuously compounded, this is equivalent to ____?
A. ln 5
B. ln 1.05
C. e^5
D. e^1.05
The answer would be B based on the definition of continuous compounding. A financial function calculator or
spreadsheet could yield the actual percentage of 4.879%, but wouldn't be necessary to answer the question
correctly on the exam.
Monte Carlo Simulation
Monte Carlo simulations are used in a number of applications, often as a complement to other risk-assessment
techniques in an effort to further define potential risk. For example, a pension-benefit administrator in charge of
managing assets and liabilities for a large plan may use computer software with Monte Carlo simulation to help
understand any potential downside risk over time, and how changes in investment policy (e.g. higher or lower
allocations to certain asset classes, or the introduction of a new manager) may affect the plan. While traditional
analysis focuses on returns, variances and correlations between assets, a Monte Carlo simulation can help
introduce other pertinent economic variables (e.g. interest rates, GDP growth and foreign exchange rates) into
the simulation.
Monte Carlo simulations are also important in pricing derivative securities for which there are no existing
analytical methods. European- and Asian-style options are priced with Monte Carlo methods, as are certain
mortgage-backed securities for which the embedded options (e.g. prepayment assumptions) are very complex.
A general outline for developing a Monte Carlo simulation involves the following steps (please note that we are
oversimplifying a process that is often highly technical):
1. Identify all variables about which we are interested, the time horizon of the analysis and the distribution
of all risk factors associated with each variable.
2. Draw K random numbers using a spreadsheet generator. Each random variable would then be
standardized so we have Z1, Z2, Z3… ZK.
3. Simulate the possible values of the random variable by calculating its observed value with Z1, Z2, Z3…
ZK.
4. Following a large number of iterations, estimate each variable and quantity of interest to complete one
trial. Go back and complete additional trials to develop more accurate estimates.
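A minimal sketch of the idea – far simpler than the multi-variable procedure outlined above – is to simulate one risk factor many times and estimate a quantity of interest from the draws. Here we estimate a shortfall probability for normally distributed returns (the function name and numbers are illustrative, reusing the earlier 10%/8% portfolio):

```python
import random

def simulate_shortfall_prob(mu, sigma, threshold, trials=100_000, seed=1):
    """Estimate P(return < threshold), assuming normally distributed returns."""
    rng = random.Random(seed)
    below = sum(1 for _ in range(trials) if rng.gauss(mu, sigma) < threshold)
    return below / trials

# Portfolio with a 10% expected return and 8% std dev; 2% threshold.
# The analytical answer is ~0.1587, so the estimate should land nearby.
estimate = simulate_shortfall_prob(0.10, 0.08, 0.02)
```

With more trials the estimate converges on the analytical value – the "additional trials develop more accurate estimates" point in step 4.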
Historical Simulation
Historical simulation, or back simulation, follows a similar process for large numbers of iterations, with
historical simulation drawing from the previous record of that variable (e.g. past returns for a mutual fund).
While both of these methods are very useful in developing a more meaningful and in-depth analysis of a
complex system, it's important to recognize that they are basically statistical estimates; that is, they are not as
analytical as (for example) the use of a correlation matrix to understand portfolio returns. Such simulations tend
to work best when the input risk parameters are well defined.
<< Back
Next >>
2.21 - Sampling and Estimation
A data sample, or subset of a larger population, is used to help understand the behavior and characteristics of
the entire population. In the investing world, for example, all of the familiar stock market averages are samples
designed to represent the broader stock market and indicate its performance. For the domestic publicly-
traded stock market, populated with some 10,000 or more companies, the Dow Jones Industrial
Average (DJIA) has just 30 representatives; the S&P 500 has 500. Yet these samples are taken as valid
indicators of the broader population. It's important to understand the mechanics of sampling and estimating,
particularly as they apply to financial variables, and have the insight to critique the quality of research derived
from sampling efforts.
BASICS
Sometimes it is impractical or impossible to label every single member of an entire population, in which case
systematic sampling methods are used. For example, take a case where we wanted to research whether the S&P
500 companies were adding or laying off employees, but we didn't have the time or resources to contact all 500
human resources departments. We do have the time and resources for an in-depth study of a 25-company
sample. A systematic sampling approach would be to take an alphabetical list of the S&P 500 and contact every
20th company on the list, i.e. companies #20, #40, #60, etc., up until #500. This way we end up with 25
companies, and it was done under a system that's approximately random and didn't favor any particular company
or industry.
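The every-k-th-member approach can be sketched as follows (company names are stand-ins for the alphabetical list):

```python
def systematic_sample(population, sample_size):
    """Select every k-th member, where k = population size // sample size."""
    k = len(population) // sample_size
    return population[k - 1::k][:sample_size]

companies = [f"Company {i:03d}" for i in range(1, 501)]
sample = systematic_sample(companies, 25)   # picks #020, #040, ..., #500
```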
Sampling Error
Suppose we polled our 25 companies and came away with a
conclusion that the typical S&P 500 firm will be adding
approximately 5% to their work force this fiscal year, and, as a result, we are optimistic about the health of the
economy. However, the daily news continues to indicate a fair number of layoffs at some companies and hiring
freezes at other firms, and we wonder whether this research has actually done its job. In other words, we suspect
sampling error: the difference between the statistic from our sample (5% job growth) and the population
parameter we were estimating (actual job growth).
Sampling Distribution
A sampling distribution is analogous to a population distribution: it describes the range of all possible values
that the sampling statistic can take. In the assessment of the quality of a sample, the approach usually involves
comparing the sampling distribution to the population distribution. We expect the sampling distribution to be a
pattern similar to the population distribution – that is, if a population is normally distributed, the sample should
also be normally distributed. If the sample is skewed when we were expecting a normal pattern with most of the
observations centered around the mean, it indicates potential problems with the sample and/or the methodology.
The table below illustrates a stratified approach to improving our economic research on current hiring
expectations. In our earlier approach that randomly drew from all 500 companies, we may have accidentally
drawn too heavily from a sector doing well, and under-represented other areas. In stratified random sampling,
each of the 500 companies in the S&P 500 index is assigned to one of 12 sectors. Thus we have 12 strata, and
our sample of 25 companies is based on drawing from each of the 12 strata, in proportions relative to the
industry weights within the index. The S&P weightings are designed to replicate the domestic economy, which
is why financial services and health care (which are relatively more important sectors in today's economy) are
more heavily weighted than utilities. Within each sector, a random approach is used – for example, if there are
120 financial services companies and we need five financial companies for our research study, those five would
be selected via a random draw, or by a systematic approach (i.e. every 24th company on an alphabetical list of
the subgroup).
(Table omitted in this extract – its columns were: Sector | Percent of S&P 500 | Companies to sample | Percent
of sample.)
Time-Series Data
Time-series data refers to one variable taken over discrete, equally spaced periods of time. The distinguishing
feature of a time series is that it draws back on history to show how one variable has changed. Common
examples include historical quarterly returns on a stock or mutual fund for the last five years, earnings per share
on a stock each quarter for the last ten years or fluctuations in the market-to-book ratio on a stock over a 20-
year period. In every case, past time periods are examined.
Cross-Sectional Data
Cross section data typically focuses on one period of time and measures a particular variable across several
companies or industries. A cross-sectional study could focus on quarterly returns for all large-cap value mutual
funds in the first quarter of 2005, or this quarter's earnings-per-share estimates for all pharmaceutical firms, or
differences in the current market-to-book ratio for the largest 100 firms traded on the NYSE. We can see that
the actual variables being examined may be similar to a time-series analysis, with the difference being that a
single time period is the focus, and several companies, funds, etc. are involved in the study. The earlier example
of analyzing hiring plans at S&P 500 companies is a good example of cross-sectional research.
The first assumption - that the sampling distribution of the sample mean will be normal - holds regardless of the
distribution of the underlying population. Thus the central limit theorem can help make probability estimates for a sample of a
non-normal population (e.g. skewed, lognormal), based on the fact that the sample mean for large sample sizes
will be a normal distribution. This tendency toward normally distributed series for large samples gives the
central limit theorem its most powerful attribute. The assumption of normality enables samples to be used in
constructing confidence intervals and to test hypotheses, as we will find when covering those subjects.
Exactly how large is large in terms of creating a large sample? Remember the number 30. According to the
reference text, that's the minimum number a sample must be before we can assume it is normally distributed.
Don't be surprised if a question asks how large a sample should be – should it be 20, 30, 40, or 50? It's an easy
way to test whether you've read the textbook, and if you remember 30, you score an easy correct answer.
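The central limit theorem can be illustrated with a quick simulation: draw many samples of size 30 from a clearly non-normal (right-skewed) population and look at where the sample means land (a sketch; the lognormal population is our choice for illustration):

```python
import random
from statistics import mean

rng = random.Random(0)

# A right-skewed population: lognormal with parameters (0, 1).
# Draw 2,000 samples of size 30 and record each sample mean.
sample_means = [mean(rng.lognormvariate(0, 1) for _ in range(30))
                for _ in range(2000)]

# The population mean of lognormal(0, 1) is e^0.5 ~ 1.6487; the sample
# means cluster near it even though the population itself is skewed.
grand_mean = mean(sample_means)
```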
Standard Error
The standard error is the standard deviation of the sample statistic. Earlier, we indicated that the variance of the
sample mean is the population variance divided by n (sample size). The formula for standard error is derived by
taking the positive square root of that variance.
If the population standard deviation is given, standard error is calculated by this ratio: population standard
deviation / square root of sample size, or σ/n^(1/2). If the population standard deviation is unknown, the sample
standard deviation (s) is used to estimate it, and standard error = s/n^(1/2). Note that the "n" in the denominator
means that the standard error becomes smaller as the sample size becomes larger, an important property to remember.
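That shrinking effect is easy to see numerically (values chosen to match the confidence-interval examples later in this section):

```python
from math import sqrt

def standard_error(stdev, n):
    # Standard error of the sample mean: sigma / sqrt(n)
    return stdev / sqrt(n)

se_16 = standard_error(25, 16)    # 6.25
se_100 = standard_error(25, 100)  # 2.5 -- larger n, smaller standard error
```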
The level of confidence we want to establish is given by the number α, or alpha, which is the probability that a
point estimate will not fall in a confidence range. The lower the alpha, the more confident we want to be – e.g.
alpha of 5% indicates we want to be 95% confident; 1% alpha indicates 99% confidence.
Properties of an Estimator
The three desirable properties of an estimator are that they are unbiased, efficient and consistent:
1. Unbiased - The expected value (mean) of the estimate's sampling distribution is equal to the underlying
population parameter; that is, there is no upward or downward bias.
2. Efficiency - While there are many unbiased estimators of the same parameter, the most efficient has a
sampling distribution with the smallest variance.
3. Consistency - Larger sample sizes tend to produce more accurate estimates; that is, the sample parameter
converges on the population parameter.
Formula 2.33
Confidence interval = point estimate ± (reliability factor) * (standard error)
In other words, if we want to be 99% confident that a parameter will fall within a range, we need to make
that interval wider than we would if we wanted to be only 90% confident. The actual reliability factors
used are derived from the standard normal distribution, or Z-value, at a probability of alpha/2, since the
interval is two-tailed – above and below the point estimate.
Degrees of Freedom
Degrees of freedom are used for determining the reliability-factor portion of the confidence interval with the
t-distribution. In finding sample variance, for any sample size n, degrees of freedom = n – 1. Thus for a sample
size of 8, degrees of freedom are 7; for a sample size of 58, degrees of freedom are 57.
The concept of degrees of freedom is taken from the fact that a sample variance is based on a series of
observations, not all of which can be independently selected if we are to arrive at the true parameter. One
observation essentially depends on all the other observations. In other words, if the sample size is 58, think of
that sample of 58 in two parts: (a) 57 independent observations and (b) one dependent observation, on which the
value is essentially a residual number based on the other observations. Taken together, we have our estimates
for mean and variance. If degrees of freedom is 57, it means that we would be "free" to choose any 57
observations (i.e. sample size – 1), since there is always that 58th value that will result in a particular sample
mean for the entire group.
Characteristic of the t-distribution is that additional degrees of freedom reduce the range of the confidence
interval, and produce a more reliable estimate. Increasing degrees of freedom is done by increasing sample size.
For larger sample sizes, use of the z-statistic is an acceptable alternative to the t-distribution – this is true since
the z-statistic is based on the standard normal distribution, and the t-distribution moves closer to the standard
normal at higher degrees of freedom.
Student's t-distribution
Student's t-distribution is a series of symmetrical distributions, each distribution defined by its degrees of
freedom. All of the t-distributions appear similar in shape to a standard normal distribution, except that,
compared to a standard normal curve, the t-distributions are less peaked and have fatter tails. With each increase
in degrees of freedom, two properties change: (1) the distribution's peak increases (i.e. the probability that the
estimate will be closer to the mean increases), and (2) the tails (in other words, the parts of the curve far away
from the mean estimate) approach zero more quickly – i.e. there is a reduced probability of extreme values as
we increase degrees of freedom. As degrees of freedom become very large – as they approach infinity – the t-
distribution approximates the standard normal distribution.
At the same time, two other factors tend to make larger sample sizes less desirable. The first consideration,
which primarily affects time-series data, is that
population parameters have a tendency to change over
time. For example, suppose we are studying a mutual fund
using five years of quarterly returns in our analysis (i.e. a
sample size of 20: 5 years x 4 quarters a year). The
resulting confidence interval appears too wide, so in an
effort to increase precision, we use 20 years of data (80
observations). However, when we reach back into the
1980s to study this fund, it had a different fund manager,
plus it was buying more small-cap value companies,
whereas today it is a blend of growth and value, with mid
to large market caps. In addition, the factors affecting
today's stock market (and mutual fund returns) are much
different compared to back in the 1980s. In short, the
population parameters have changed over time, and data
from 20 years ago shouldn't be mixed with data from the most recent five years.
The other consideration is that increasing sample size can involve additional expenses. Take the example of
researching hiring plans at S&P 500 firms (cross-sectional research). A sample size of 25 was suggested, which
would involve contacting the human resources department of 25 firms. By increasing the sample size to 100, or
200 or higher, we do achieve stronger precision in making our conclusions, but at what cost? In many cross-
sectional studies, particularly in the real world, where each sample takes time and costs money, it's sufficient to
leave sample size at a certain lower level, as the additional precision isn't worth the additional cost.
Bookshelves are filled with hundreds of such models that "guarantee" a winning investment strategy. Of course,
to borrow a common industry phrase, "past performance does not guarantee future results". Data-mining bias
refers to the errors that result from relying too heavily on data-mining practices. In other words, while some
patterns discovered in data mining are potentially useful, many others might just be coincidental and are not
likely to be repeated in the future - particularly in an "efficient" market. For example, we may not be able to
continue to profit from the January effect going forward, given that this phenomenon is so widely recognized.
As a result, stocks are bid for higher in November and December by market participants anticipating the
January effect, so that by the start of January, the effect is priced into stocks and one can no longer take
advantage of the model. Intergenerational data mining refers to the continued use of information already put
forth in prior financial research as a guide for testing the same patterns and overstating the same conclusions.
Distinguishing between valid models and valid conclusions, and those ideas that are purely coincidental and the
product of data mining, presents a significant challenge as data mining is often not easy to discover. A good
start to investigate for its presence is to conduct an out-of-sample test - in other words, researching whether the
model actually works for periods that do not overlap the time frame of the study. A valid model should continue
to be statistically significant even when out-of-model tests are conducted. For research that is the product of
data mining, a test outside of the model's time frame can often reveal its true nature. Other warning signs
involve the number of patterns or variables examined in the research - that is, did this study simply search
enough variables until something (anything) was finally discovered? Most academic research won't disclose the
number of variables or patterns tested in the study, but oftentimes there are verbal hints that can reveal the
presence of excessive data mining.
Above all, it helps when there is an economic rationale to explain why a pattern exists, as opposed to simply
pointing out that a pattern is there. For example, years ago a research study discovered that the market tended to
have positive returns in years that the NFC wins the Super Bowl, yet it would perform relatively poorly when
the AFC representative triumphs. However, there's no economic rationale for explaining why this pattern exists
- do people spend more, or companies build more, or investors invest more, based on the winner of a football
game? Yet the story is out there every Super Bowl week. Patterns discovered as a result of data mining may
make for interesting reading, but in the process of making decisions, care must be taken to ensure that mined
patterns not be blindly overused.
Survivorship Bias
A common form of sample-selection bias in financial databases is survivorship bias, or the tendency for
financial and accounting databases to exclude information on companies, mutual funds, etc. that are no longer
in existence. As a result, certain conclusions can be made that may in fact be overstated were one to remove this
bias and include all members of the population. For example, many studies have pointed out the tendency of
companies with low price-to-book-value ratios to outperform those firms with higher P/BVs. However, these
studies most likely aren't going to include those firms that have failed; thus data is not available and there is
sample-selection bias. In the case of low and high P/BV, it stands to reason that companies in the midst of
declining and failing will probably be relatively low on the P/BV scale yet, based on the research, we would be
guided to buy these very same firms due to the historical pattern. It's likely that the gap between returns on low-
priced (value) stocks and high-priced (growth) stocks has been systematically overestimated as a result of
survivorship bias. Indeed, the investment industry has developed a number of growth and value indexes.
However, in terms of defining for certain which strategy (growth or value) is superior, the actual evidence is
mixed.
Sample selection bias extends to newer asset classes such as hedge funds, a heterogeneous group that is
somewhat more removed from regulation, and where public disclosure of performance is much more
discretionary compared to that of mutual funds or registered advisors of separately managed accounts. One
suspects that hedge funds will disclose only the data that makes the fund look good (self-selection bias),
compared to a more developed industry of mutual funds where the underperformers are still bound by certain
disclosure requirements.
Look-Ahead Bias
Research is guilty of look-ahead bias if it makes use of information that was not actually available on a
particular day, yet the researchers assume it was. Let's return to the example of buying low price-to-book-
value companies; the research may assume that we buy our low P/BV portfolio on Jan 1 of a given year, and
then (compared to a high P/BV portfolio) hold it throughout the year. Unfortunately, while a firm's current stock
price is immediately available, the book value of the firm is generally not available until months after the start
of the year, when the firm files its official 10-K. To overcome this bias, one could construct P/BV ratios using
current price divided by the previous year's book value, or (as is done by Russell's indexes) wait until midyear
to rebalance after data is reported.
Time-Period Bias
This type of bias refers to an investment study that may appear to work over a specific time frame but may not
last in future time periods. For example, any research done in 1999 or 2000 that covered a trailing five-year
period may have touted the outperformance of high-risk growth strategies, while pointing to the mediocre
results of more conservative approaches. When these same studies are conducted today for a trailing 10-year
period, the conclusions might be quite different. Certain anomalies can persist for a period of several quarters or
even years, but research should ideally be tested in a number of different business cycles and market
environments in order to ensure that the conclusions aren't specific to one unique period or environment.
2.23 - Calculating Confidence Intervals
When population variance (σ2) is known, the z-statistic can be used to calculate a reliability factor. Relative to
the t-distribution, it will result in tighter confidence intervals and more reliable estimates of mean and standard
deviation. Z-values are based on the standard normal distribution.
Formula 2.34
For an alpha of 5% (i.e. a 95% confidence interval), the reliability factor (Zα/2) is 1.96, but for a CFA exam
problem it is usually sufficient to round to an even 2 to solve the problem. (Remember that the z-value at 95%
confidence is about 2, as tables of z-values are sometimes not provided!) Given a sample size of 16, a sample
mean of 20 and a population standard deviation of 25, a 95% confidence interval would be 20 ± 2*(25/16^(1/2)) =
20 ± 2*(25/4) = 20 ± 12.5. In short, for this sample size and for these sample statistics, we would be 95% confident
that the actual population mean falls in a range from 7.5 to 32.5.
Suppose that this 7.5-to-32.5 range was deemed too broad for our purposes. Reducing the confidence interval is
accomplished in two ways: (1) increasing sample size, and (2) decreasing our allowable level of confidence.
1. Increasing sample size from 16 to 100 - Our 95% confidence interval is now equal to 20 ± 2*(25/100^(1/2)) =
20 ± 2*(25/10) = 20 ± 5. In other words, increasing the sample size to 100 narrows the 95% confidence range to a
minimum of 15 and a maximum of 25.
When population variance is unknown, we will need to use the t-distribution to establish confidence intervals.
The t-statistic is more conservative; that is, it results in broader intervals. Assume the following sample
statistics: sample size = 16, sample mean = 20, sample standard deviation = 25.
To use the t-distribution, we must first calculate degrees of freedom, which for sample size 16 equals n – 1
= 15. Using an alpha of 5% (95% confidence interval), our confidence interval is 20 ± (2.131)*(25/√16),
which gives a range minimum of 6.68 and a range maximum of 33.32.
As before, we can reduce this range with (1) larger samples and/or (2) reducing allowable degree of confidence:
1. Increase sample size from 16 to 100 - The range is now equal to 20 ± 2*(25/10), giving a minimum of 15 and a
maximum of 25 (for large sample sizes the t-distribution is sufficiently close to the z-value that it becomes an
acceptable alternative).
2. Reduce confidence from 95% to 90% - The range is now equal to 20 ± 1.65*(25/10), giving a minimum of 15.875 and
a maximum of 24.125.
For a 95% confidence interval, if sample size = 100, sample standard deviation = 10 and our point estimate is
15, the confidence interval is 15 ± 2*(10/√100), or 15 ± 2. We are 95% confident that the population mean will
fall between 13 and 17.
Suppose we wanted to construct a 99% confidence interval. The reliability factor now becomes 2.58 and we have
15 ± 2.58*(10/√100), or 15 ± 2.58 - a minimum of 12.42 and a maximum of 17.58.
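Every interval above follows the same pattern - point estimate ± reliability factor × standard error - so, as a quick self-check, the arithmetic can be sketched in Python (the function name is our own illustration, not part of the curriculum):

```python
import math

def confidence_interval(point_estimate, reliability_factor, std_dev, n):
    # Half-width = reliability factor * standard error,
    # where standard error = std_dev / sqrt(n).
    half_width = reliability_factor * std_dev / math.sqrt(n)
    return (point_estimate - half_width, point_estimate + half_width)

# Known population variance, z rounded to 2 as in the text:
print(confidence_interval(20, 2, 25, 16))      # (7.5, 32.5)

# Unknown population variance, t-value 2.131 for df = 15:
print(confidence_interval(20, 2.131, 25, 16))  # approximately (6.68, 33.32)
```

Swapping in 1.96 for the rounded factor of 2 tightens the first interval slightly, which is why exam answers based on the rounded factor can differ a little from table-based answers.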
The table below summarizes the statistics used in constructing confidence intervals, given various situations:
Population variance known - use the z-statistic (any sample size).
Population variance unknown, large sample (n at least 30) - use the t-statistic (z is an acceptable alternative).
Population variance unknown, small sample - use the t-statistic.
While these calculations aren't difficult, this material can seem at times to run
together, particularly if a CFA candidate has never used it or hasn't studied it in some time.
While not likely to be a major point of emphasis, expect at least a few questions on confidence
intervals and, in particular, a case study that tests basic knowledge of definitions, or that
compares and contrasts the two statistics presented (the t-distribution and the z-value) to make sure you
know which is useful in a given application. More than anything, the idea is to introduce
confidence intervals and how they are constructed as a prerequisite for hypothesis testing.
2.24 - Hypothesis Testing
Hypothesis testing provides a basis for taking ideas or theories that someone initially develops about the
economy, investing or markets, and then deciding whether these ideas are true or false. More precisely,
hypothesis testing helps decide whether the tested ideas are probably true or probably false, as the conclusions
made with the hypothesis-testing process are never made with 100% confidence. As we found in the
sampling and estimating process, we have degrees of confidence - e.g. 95% or 99% - but not absolute certainty.
Hypothesis testing is often associated with the procedure for acquiring and developing knowledge known as the
scientific method. As such, it relates the fields of investment and economic research (i.e., business topics)
to other traditional branches of science (mathematics, physics, medicine, etc.).
What is a Hypothesis?
A hypothesis is a statement made about a population parameter. These are typical hypotheses: "the mean annual
return of this mutual fund is greater than 12%", and "the mean return is greater than the average return for the
category". Stating the hypothesis is the initial step in a defined seven-step process for hypothesis testing – a
process developed based on the scientific method. We indicate each step below. In the remainder of this section
of the study guide, we develop a detailed explanation for how to answer each step's question.
Null Hypothesis
Step #1 in our process involves stating the null and alternative hypotheses. The null hypothesis is the statement
that will be tested. The null hypothesis is usually denoted with "H0". For investment and economic research
applications, and as it relates to the CFA exam, the null hypothesis will be a statement on the value of a
population parameter, usually the mean value if a question relates to return, or the standard deviation if it relates
to risk. It can also refer to the value of any random variable (e.g. sales at company XYZ are at least $10 million
this quarter). In hypothesis testing, the null hypothesis is initially regarded to be true, until (based on our
process) we gather enough proof to either reject the null hypothesis, or fail to reject the null hypothesis.
Alternative Hypothesis
The alternative hypothesis is a statement that will be accepted as a result of the null hypothesis being rejected.
The alternative hypothesis is usually denoted "Ha". In hypothesis testing, we do not directly test the worthiness
of the alternate hypothesis, as our testing focus is on the null. Think of the alternative hypothesis as the residual
of the null – for example, if the null hypothesis states that sales at company XYZ are at least $10 million this
quarter, the alternative hypothesis to this null is that sales will fail to reach the $10 million mark. Between the
null and the alternative, it is necessary to account for all possible values of a parameter. In other words, if we
gather evidence to reject this null hypothesis, then we must necessarily accept the alternative. If we fail to reject
the null, then we do not accept the alternative.
One-Tailed Test
The labels "one-tailed" and "two-tailed" refer to the tails of the standard normal distribution (as well as the t-
distributions). The key words for identifying a one-tailed test are "greater than" or "less than". For example, if our
hypothesis is that the annual return on this mutual fund will be greater than 8%, it's a one-tailed test that will be
rejected based only on finding observations in the right tail.
Figure 2.13 below illustrates a one-tailed test for "greater than" (rejection in the right tail). (A one-tailed test for
"less than" would look similar to the graph below, with the rejection region in the left tail rather than
the right.)
Two-Tailed Test
A two-tailed test is characterized by the words "equal to" or "not equal to". For example, if our hypothesis were that the return on a
mutual fund is equal to 8%, we could reject it based on observations in either tail (sufficiently higher than 8% or
sufficiently lower than 8%).
Choosing what will be the null and what will be the alternative depends on the case and what it is we wish to
prove. We usually have two different approaches to what we could make the null and alternative, but in most
cases, it's preferable to make the null what we believe we can reject, and then attempt to reject it. For example,
in our case of a one-tailed test with the return hypothesized to be greater than 8%, we could make the greater-
than case the null (alternative being less than), or we could make the greater-than case the alternative (with less
than the null). Which should we choose? A hypothesis test is typically designed to look for evidence that may
possibly reject the null. So in this case, we would make the null hypothesis "the return is less than or equal to
8%", which means we are looking for observations in the right tail. If we reject the null, then the alternative is
true, and we conclude the fund is likely to return more than 8%.
Test Statistic
Step #2 in our seven-step process involves identifying an appropriate test statistic. In hypothesis testing, a test
statistic is defined as a quantity taken from a sample that is used as the basis for testing the null hypothesis
(rejecting or failing to reject the null).
Calculating a test statistic will vary based upon the case and our choice of probability distribution (for example,
t-test, z-value). The general format of the calculation is:
Formula 2.36
Test statistic = (sample statistic – hypothesized value) / standard error of the sample statistic
Errors in hypothesis testing come in two forms: Type I and Type II. A type I error is defined as rejecting the
null hypothesis when it is true. A type II error is defined as not rejecting the null hypothesis when it is false. As
the table below indicates, these errors represent two of the four possible outcomes of a hypothesis test:
Null is true and we fail to reject - correct decision.
Null is true and we reject - type I error.
Null is false and we reject - correct decision.
Null is false and we fail to reject - type II error.
The reason for separating type I and type II errors is that, depending on the case, there can be serious
consequences from a type I error, while in other cases type II errors need to be avoided; it is
important to understand which type is more important to avoid.
Significance Level
Denoted by α, or alpha, the significance level is the probability of making a type I error, or the probability that
we will reject the null hypothesis when it is true. So if we choose a significance level of 0.05, it means there is a
5% chance of making a type I error. A 0.01 significance level means there is just a 1% chance of making a type
I error. As a rule, the significance level is specified prior to calculating the test statistic; otherwise, the analyst
conducting the research might use the result of the test statistic calculation to influence the choice of
significance level (prompting a change to a higher or lower significance). Such a change would take away from
the objectivity of the test.
While any level of alpha is permissible, in practice there is likely to be one of three possibilities for significance
level: 0.10 (semi-strong evidence for rejecting the null hypothesis), 0.05 (strong evidence), and 0.01 (very
strong evidence). Why wouldn't we always opt for 0.01 or even lower probabilities of type I errors – isn't the
idea to reduce and eliminate errors? In hypothesis testing, we have to control two types of errors, with a tradeoff
that when one type is reduced, the other type is increased. In other words, by lowering the chances of a type I
error, we must reject the null less frequently – including when it is false (a type II error). Actually quantifying
this tradeoff is impossible because the probability of a type II error (denoted by β, or beta) is not easy to define
(i.e. it changes for each value of θ). Only by increasing sample size can we reduce the probability of both types
of errors.
Decision Rule
Step #4 in the hypothesis-testing process requires stating a decision rule. This rule is crafted by comparing two
values: (1) the result of the calculated value of the test statistic, which we will complete in step #5 and (2) a
rejection point, or critical value (or values) that is (are) the function of our significance level and the probability
distribution being used in the test. If the calculated value of the test statistic is as extreme as (or more extreme
than) the rejection point, then we reject the null hypothesis and state that the result is statistically significant.
Otherwise, if the test statistic does not reach the rejection point, then we cannot reject the null hypothesis and
we state that the result is not statistically significant. A rejection point depends on the probability distribution,
on the chosen alpha, and on whether the test is one-tailed or two-tailed.
For example, suppose that in our case we are able to use the standard normal distribution (the z-value), we choose an
alpha of 0.05, and we have a two-tailed test (i.e. we reject the null hypothesis when the test statistic is either
sufficiently above or sufficiently below the hypothesized value). The two rejection points are taken from the
z-values for the standard normal distribution: below -1.96 and above +1.96. Thus if the calculated test statistic
falls in either of these two rejection ranges, the decision would be to reject the null hypothesis. Otherwise, we
fail to reject the null hypothesis.
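The two-tailed decision rule just described amounts to a single comparison, sketched below (a minimal illustration; the function name is our own):

```python
def reject_null_two_tailed(test_statistic, critical_value=1.96):
    # Reject when the calculated statistic falls in either rejection region.
    return test_statistic <= -critical_value or test_statistic >= critical_value

print(reject_null_two_tailed(2.10))   # True  -> reject the null
print(reject_null_two_tailed(-0.85))  # False -> fail to reject
```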
Look Out!
Traditionally, it was said that we "accepted" the null hypothesis; however, the authors have
discouraged use of the word "accept", in terms of accepting the null hypothesis, as it
implies a greater degree of conviction about the null than is warranted. Having made the
effort to make this distinction, do not be surprised if this subtle change (which seems
inconsequential on the surface) somehow finds its way onto the CFA exam (if you answer
"accept the null hypothesis", you get the question wrong; if you answer "fail to reject the
null hypothesis", you score points).
Power of a Test
The power of a hypothesis test refers to the probability of correctly rejecting the null hypothesis. There are two
possible outcomes when the null hypothesis is false: either we (1) reject it (as we correctly should), or (2) fail to
reject it – and make a type II error. Thus the power of a test is equivalent to 1 minus beta (β), the
probability of a type II error. Since beta isn't easily quantified, neither is the power of a test. For hypothesis tests, it is
sufficient to specify significance level, or alpha. However, given a choice between more than one test statistic
(for example, z-test, t-test), we will always choose the test that increases the test's power, all other factors equal.
Hypothesis tests, as a basis for testing the value of population parameters, are also set up to reject or not reject
based on "number of standard deviations away from the mean". The basic structure for testing the null
hypothesis at the 5% significance level, again using the standard normal, is -1.96 < [(sample mean –
hypothesized population mean) / standard error] < +1.96, or, equivalently, -1.96 * (std. error) < (sample mean) –
(hypo. pop. mean) < +1.96 * (std. error).
In hypothesis testing, we essentially create an interval within which the null will not be rejected, and we are
95% confident in this interval (i.e. there's a 5% chance of a type I error). By slightly rearranging terms, the
structure for a confidence interval and the structure for rejecting/not rejecting a null hypothesis appear very
similar – an indication of the relationship between the concepts.
The final step, or step #7, involves making the investment or economic decision (i.e. the real-world decision). In
this context, the statistical decision is but one of many considerations. For example, take a case where we
created a hypothesis test to determine whether a mutual fund outperformed its peers in a statistically significant
manner. For this test, the null hypothesis was that the fund's mean annual return was less than or equal to a
category average; the alternative was that it was greater than the average. Assume that at a significance level of
0.05, we were able to establish statistical significance and reject the null hypothesis, thus accepting the
alternative. In other words, our statistical decision was that this fund would outperform peers, but what is the
investment decision? The investment decision would likely take into account (for example) the risk tolerance of
the client and the volatility (risk) measures of the fund, and it would assess whether transaction costs and tax
implications make the investment decision worth making. In other words, rejecting/not rejecting a null
hypothesis does not automatically require that a decision be carried out; thus there is the need to assess the
statistical decision and the economic or investment decision in two separate steps.
2.25 - Interpreting Statistical Results
Results Where Data is Normally Distributed and Variance is Known or Unknown
Answer:
Test statistic = (10.6 – 12)/1.6 = -1.4/1.6 = -0.875. This value does not fall below the rejection point, so
we cannot reject the null hypothesis with statistical certainty.
2. When we are making hypothesis tests on a population mean, it's relatively likely that the population
variance will be unknown. In these cases, we use a sample standard deviation when computing standard
error, and the t-statistic for the decision rule (i.e. as the source for our rejection level). Compared to the z
or standard normal, a t-statistic is more conservative (i.e. higher rejection points for rejecting the null
hypothesis). In cases with large sample sizes (at least 30), the z-statistic may be substituted.
Example:
Take a case where sample size is 16. In this case, the t-stat is the only appropriate choice. For the t-
distribution, degrees of freedom are calculated as (sample size – 1), df = 15 in this example. In this case,
assume we are testing a hypothesis that a population mean is greater than 8, so this will be a one-tailed
test (right tail): the null hypothesis is μ ≤ 8, and the alternative is μ > 8. Our required significance level
is 0.05. Using the table for Student's t-distribution for df = 15 and p = 0.05, the critical value (rejection
point) is 1.753. In other words, if our calculated test statistic is greater than 1.753, we reject the null
hypothesis.
Answer:
Moving to step 5 of the hypothesis-testing process, we take a sample where the mean is 8.3 and the
standard deviation is 6.1. For this sample, standard error = s/√n = 6.1/√16 = 6.1/4 = 1.53. The test
statistic is (8.3 – 8.0)/1.53 = 0.3/1.53, or 0.196. Comparing 0.196 to our rejection point of 1.753, we are
unable to reject the null hypothesis.
Note that in this case, our sample mean of 8.3 was actually greater than 8; however, the hypothesis test
is set up to require statistical significance, not simply compare a sample mean to the hypothesis. In
other words, the decisions made in hypothesis testing are also a function of sample size (which at 16 is
low), the standard deviation, the required level of significance and the t-distribution. Our interpretation
in this example is that the 8.3 from the sample mean, while nominally higher than 8, simply isn't
significantly higher than 8, at least to the point where we would be able to definitively make a
conclusion regarding the population mean being greater than 8.
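The calculation in this example - sample statistic minus hypothesized value, divided by the standard error - can be checked in Python (our own sketch; note that carrying full precision gives 0.197 rather than the 0.196 produced by rounding the standard error to 1.53):

```python
import math

def t_statistic(sample_mean, hypothesized_mean, sample_std, n):
    # Test statistic = (sample mean - hypothesized mean) / standard error.
    standard_error = sample_std / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

t = t_statistic(8.3, 8.0, 6.1, 16)
print(round(t, 3))  # 0.197 at full precision
print(t > 1.753)    # False -> fail to reject the null
```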
Testing the Relative Equality of the Population Means of Two Normally Distributed Populations, Based on
Independent Random Samples, where Variances are Assumed Equal or Unequal
For the case where the population variances for two separate groups can be assumed to be equal, a technique for
pooling an estimate of the population variance (sp²) from the sample data is given by the following formula (which
assumes two independent random samples):
Formula 2.37
sp² = [(n1 – 1)*s1² + (n2 – 1)*s2²] / (n1 + n2 – 2)
Where: n1, n2 are sample sizes, and s1², s2² are sample variances.
Degrees of freedom = n1 + n2 – 2
For testing the equality of two population means (i.e. μ1 = μ2), the test statistic divides the difference in sample
means (X1 – X2) by the standard error: the square root of (sp²/n1 + sp²/n2).
Answer:
If the sample means were 8.6 and 8.9, then t = (8.6 – 8.9)/2 = -0.3/2 = -0.15. Tests of equality/inequality are two-
sided tests. With df = 38 (sum of sample sizes – 2) and assuming 0.05 significance (p = 0.025 in each tail), the
rejection levels are t < -2.024 and t > +2.024. Since our computed test statistic was -0.15, we cannot reject the null
hypothesis that these population means are equal.
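The text does not reproduce the two sample variances behind this example, so the sketch below assumes a variance of 40 for each sample - a hypothetical value chosen because, with n1 = n2 = 20 (which matches df = 38), it backs out the standard error of 2 used above:

```python
import math

def pooled_variance(s1_sq, s2_sq, n1, n2):
    # Formula 2.37: weight each sample variance by its degrees of freedom.
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def two_sample_t(mean1, mean2, s1_sq, s2_sq, n1, n2):
    sp_sq = pooled_variance(s1_sq, s2_sq, n1, n2)
    standard_error = math.sqrt(sp_sq / n1 + sp_sq / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical variances of 40 each (not from the text):
t = two_sample_t(8.6, 8.9, 40, 40, 20, 20)
print(round(t, 2))  # -0.15, well inside the +/-2.024 rejection points
```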
1. For hypothesis tests of equal population means where variances cannot be assumed to be equal, the
appropriate test statistic is still the t-stat, but we can no longer pool an estimate of the standard
deviation, and the standard error becomes the square root of [(s1²/n1) + (s2²/n2)]. The null hypothesis remains μ1
= μ2, and the test statistic is calculated similarly to the previous example (i.e. difference in sample means /
standard error). Degrees of freedom are approximated by a separate formula based on the sample variances and
sample sizes.
Look Out!
Take a case where we are comparing two mutual funds that are both classified as large-cap growth, in which we
are testing whether returns for one are significantly above the other (statistically significant). The paired-
comparisons test is appropriate since we assume some degree of correlation, as returns for each will be
dependent on the market. To calculate the t-statistic, we first find the sample mean difference, denoted by d:
d̄ = (1/n)(d1 + d2 + d3 + ... + dn), where n is the number of paired observations (in our example, the number of
quarters for which we have quarterly returns), and each di is the difference between a pair of observations. Next,
the sample variance - the sum of squared deviations from d̄, divided by (n – 1) - is calculated, with the standard
deviation (sd) being the positive square root of the variance. Standard error = sd/√n.
For our mutual fund example, if our mean returns cover 10 years (40 quarters of data), with a sample mean
difference of 2.58 and a sample standard deviation of 5.32, our test statistic is computed as (2.58)/((5.32)/
√40), or 3.067. At 39 degrees of freedom with a 0.05 significance level, the rejection point is about 2.02. Thus we
reject the null hypothesis and state that there is a statistically significant difference in returns between these
funds.
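The paired-comparisons statistic reduces to the mean difference divided by its standard error, as this short sketch (our own illustration) confirms:

```python
import math

def paired_t(mean_diff, std_diff, n):
    # Test statistic = mean difference / (std dev of differences / sqrt(n)).
    return mean_diff / (std_diff / math.sqrt(n))

t = paired_t(2.58, 5.32, 40)
print(round(t, 3))  # 3.067 -> exceeds the rejection point, so reject the null
```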
In hypothesis tests for the variance of a single normally distributed population, the appropriate test statistic is
known as the "chi-square", denoted by χ². Unlike the distributions we have used previously, the chi-square
is asymmetrical, as it is bounded on the left by zero. (This must be true, since variance is always a positive
number.) The chi-square is actually a family of distributions, similar to the t-distributions, with different degrees
of freedom resulting in different chi-square distributions.
Formula 2.38
χ² = (n – 1)*s² / σ0²
Where: n = sample size, s² = sample variance, σ0² = hypothesized value of the population variance
Sample variance s² is the sum of squared deviations between the observed values and the sample mean, divided
by the degrees of freedom, n – 1.
Our test will examine quarterly returns over the past five years, so n = 20 and degrees of freedom = 19. Our test
is a greater-than test with a null hypothesis of σ² ≤ (10)², or 100, and an alternative hypothesis of σ² > 100.
Using a 0.05 level of significance, our rejection point, from the chi-square tables with df = 19 and p = 0.05 in
the right tail, is 30.144. Thus if our calculated test statistic is greater than 30.144, we reject the null hypothesis
at 5% level of significance.
Answer:
Examining the quarterly returns for this period, we find our sample variance (s²) is 135. With n = 20 and σ0² =
100, we have all the data required to calculate the test statistic: χ² = (19)*(135)/100 = 25.65.
Since 25.65 is less than our critical value of 30.144, we do not have enough evidence to reject the null
hypothesis. While this fund may indeed be quite volatile, its volatility isn't statistically more meaningful than
the market average for the period.
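Formula 2.38 is simple enough to verify directly (a sketch, using the figures from this example; the function name is our own):

```python
def chi_square_stat(n, sample_variance, hypothesized_variance):
    # Formula 2.38: chi-square = (n - 1) * s^2 / sigma0^2
    return (n - 1) * sample_variance / hypothesized_variance

stat = chi_square_stat(20, 135, 100)
print(stat)           # 25.65
print(stat > 30.144)  # False -> fail to reject the null
```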
Hypothesis Tests Relating to the equality of the Variances of Two Normally Distributed Populations,
where both Samples are Random and Independent
For hypothesis tests concerning the relative values of the variances from two populations – whether σ1² (variance of
the first population) and σ2² (variance of the second) are equal/not equal/greater than/less than – we can
construct hypotheses in one of three ways.
When a hypothesis test compares variances from two populations and we can assume that random samples from
the populations are independent (uncorrelated), the appropriate test is the F-test, which represents the ratio of
sample variances. As with the chi-square, the F-distribution is a family of asymmetrical distributions (bound on
the left by zero). The F-family of distributions is defined by two values of degrees of freedom: the numerator
(df1) and the denominator (df2). Each is taken from its sample size (sample size – 1).
The F-statistic taken from the sample data could be either s1²/s2² or s2²/s1², with the convention being to use whichever
ratio produces the larger number. This way, the F-test need only be concerned with values greater than 1, since
one of the two ratios will always be a number above 1.
Answer:
Our F-statistic is (8.5)²/(6.3)² = 72.25/39.69 = 1.82.
Since 1.82 does not reach the rejection level of 2.51, we cannot reject the null hypothesis, and we state that the
risk between these funds is not significantly different.
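The convention of placing the larger sample variance in the numerator can be built into a small helper (our own sketch):

```python
def f_statistic(s1, s2):
    # Square the sample standard deviations, then put the larger
    # variance in the numerator so that F >= 1.
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)

f = f_statistic(8.5, 6.3)
print(round(f, 2))  # 1.82
print(f > 2.51)     # False -> fail to reject the null
```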
Concepts from the hypothesis-testing section are unlikely to be tested by rigorous exercises in number
crunching but rather in identifying the unique attributes of a given statistic. For example, a typical question
might ask, “In hypothesis testing, which test statistic is defined by two degrees of freedom, the numerator and
the denominator?”, giving you these choices: A. t-test, B. z-test, C. chi-square, or D. F-test. Of course, the
answer would be D. Another question might ask, “Which distribution is NOT symmetrical?”, and then give you
these choices: A. t, B. z, C. chi-square, D. normal. Here the answer would be C. Focus on the defining
characteristics, as they are the most likely source of exam questions.
Nonparametric hypothesis tests are designed for cases where either (a) fewer or different assumptions about
the population data are appropriate, or (b) where the hypothesis test is not concerned with a population
parameter.
In many cases, we are curious about a set of data but believe that the required assumptions (for example,
normally distributed data) do not apply to this example, or else the sample size is too small to comfortably make
such an assumption. A number of nonparametric alternatives have been developed to use in such cases. The
table below indicates a few examples that are analogous to common parametric tests.
Concern of hypothesis          Parametric test                   Nonparametric test
Single mean                    t-test, z-test                    Wilcoxon signed-rank test
Differences between means      t-test (or approximate t-test)    Mann-Whitney U-test
Paired comparisons             t-test                            Sign test, or Wilcoxon signed-rank test
A number of these tests are constructed by first converting data into ranks (first, second, third, etc.) and then
fitting the data into the test. One such test applied to testing correlation (the degree to which two variables are
related to each other) is the Spearman rank correlation coefficient. The Spearman test is useful in cases where a
normal distribution cannot be assumed – usually when a variable is bound by zero (always positive), or where
the range of values is limited. For the Spearman test, each observation in the two variables is ranked from
largest to smallest, and then the differences between the ranks are measured. The data are then used to find the
test statistic rs = 1 – [6*(sum of squared rank differences) / (n*(n² – 1))]. This result is compared to a rejection point
(based on the Spearman rank correlation) to determine whether to reject or not reject the null hypothesis.
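The rank-and-difference procedure behind the Spearman coefficient can be sketched as follows (our own illustration, assuming no tied values):

```python
def spearman_rank_correlation(x, y):
    def ranks(values):
        # Rank from largest (rank 1) to smallest; assumes no ties.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    n = len(x)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

# Identical orderings give a perfect rank correlation:
print(spearman_rank_correlation([3, 1, 4, 2], [30, 10, 40, 20]))  # 1.0
# Opposite orderings give a perfect inverse rank correlation:
print(spearman_rank_correlation([1, 2, 3, 4], [4, 3, 2, 1]))      # -1.0
```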
Another situation requiring a nonparametric approach is to answer a question about something other than a
parameter. For example, analysts often wish to address whether a sample is truly random or whether the data
have a pattern indicating that it is not random (tested with the so-called "runs test"). Tests such as the
Kolmogorov-Smirnov test determine whether a sample comes from a population that is distributed a certain way. Most of these
nonparametric examples are specialized and unlikely to be tested in any detail on the CFA Level I exam.
2.26 - Correlation and Regression
Financial variables are often analyzed for their correlation to other variables and/or market averages. The
relative degree of co-movement can serve as a powerful predictor of future behavior of that variable. The sample
covariance and correlation coefficient are tools used to indicate the relationship, while linear regression is a
technique designed both to quantify the linear relationship between random variables and to test whether one
variable is dependent on another. When you are analyzing a security, if returns are found to be significantly
dependent on a market index or some other independent source, then both return and risk can be better
explained and understood.
Scatter Plots
A scatter plot is designed to show a relationship between two
variables by graphing a series of observations on a two-
dimensional graph – one variable on the X-axis, the other on the
Y-axis.
To illustrate, take a sample of five paired observations of annual returns for two mutual funds, which we will
label X and Y:
Average X and Y returns were found by dividing the sum by n or 5, while the average of the cross-products is
computed by dividing the sum by n – 1, or 4. The use of n – 1 for covariance is done by statisticians to ensure
an unbiased estimate.
Interpreting a covariance number is difficult for those who are not statistical experts. The 99.64 we computed
for this example has units of "returns squared", since the inputs were percentage returns, and a return
squared is not an intuitive concept. The fact that Cov(X,Y) of 99.64 was greater than 0 does indicate a positive
linear relationship between X and Y. Had the covariance been a negative number, it would imply an inverse
relationship, while 0 means no linear relationship. Thus 99.64 indicates that the returns have positive co-movement
(when one moves higher, so does the other), but it doesn't offer any information on the extent of the co-movement.
Formula 2.39
r = (covariance between X, Y) / [(sample standard deviation of X) * (sample standard deviation of Y)]
Answer:
As with sample covariance, we use (n – 1) as the denominator in calculating sample variance (with the sum of squared
deviations as the numerator) – thus in the above example, each sum was divided by 4 to find the variance.
Standard deviation is the positive square root of variance: in this example, the sample standard deviation of X is
the square root of 136.06, or 11.66; the sample standard deviation of Y is the square root of 99.14, or 9.96.
Therefore, the correlation coefficient is (99.64)/(11.66*9.96) = 0.858. A correlation coefficient is a value between
-1 (perfect inverse relationship) and +1 (perfect linear relationship) – the closer it is to 1, the stronger the
relationship. This example computed a number of 0.858, which would suggest a strong linear relationship.
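The five paired observations behind this example are not reproduced in the text, so the sketch below applies the same covariance and correlation formulas to a hypothetical sample:

```python
import math

def sample_covariance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations, divided by n - 1.
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def sample_std(x):
    n = len(x)
    m = sum(x) / n
    return math.sqrt(sum((a - m) ** 2 for a in x) / (n - 1))

def correlation(x, y):
    # Formula 2.39: r = Cov(X, Y) / (s_X * s_Y)
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))

# Hypothetical paired returns (not the observations from the text):
x = [10.0, -2.0, 8.0, 15.0, 4.0]
y = [12.0, 1.0, 6.0, 11.0, 3.0]
print(round(sample_covariance(x, y), 2))  # 28.0
print(round(correlation(x, y), 3))        # approximately 0.906
```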
Hypothesis Testing: Determining Whether a Positive or Inverse Relationship Exists Between Two
Random Variables
A hypothesis-testing procedure can be used to determine whether there is a positive relationship or an inverse
relationship between two random variables. This test uses each step of the hypothesis-testing procedure,
outlined earlier in this study guide. For this particular test, the null hypothesis, or H0, is that the correlation in
the population is equal to 0. The alternative hypothesis, Ha, is that the correlation is different from 0. The t-test
is the appropriate test statistic. Given a sample correlation coefficient r and sample size n, the formula for the
test statistic is:
t = r*√(n – 2) / √(1 – r²)
Using our computed sample r of 0.858, t = (0.858)*√3/√(1 – (0.858)²) = (1.486)/(0.514) = 2.891. Comparing
2.891 to our rejection point of 3.182, we do not have enough evidence to reject the
null hypothesis that the population correlation coefficient is 0. In this case, while it does appear that there is a
strong linear relationship between our two variables (and thus we may well be risking a type II error), the results
of the hypothesis test show the effects of a small sample size; that is, we had just three degrees of freedom,
which required a high rejection level for the test statistic in order to reject the null hypothesis. Had there been
one more observation in our sample (i.e. degrees of freedom = 4), then the rejection point would have been
2.776 and we would have rejected the null and accepted that there is likely to be a significant difference from 0
in the population correlation. In addition, the level of significance plays a role in this hypothesis test. In this
particular example, we would reject the null hypothesis at a 0.1 level of significance, where the rejection level
would be any test statistic higher than 2.353.
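The t-statistic for the significance of a correlation coefficient can be checked directly (a sketch; carrying full precision gives 2.893 versus the 2.891 obtained with rounded intermediate values):

```python
import math

def correlation_t_stat(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = correlation_t_stat(0.858, 5)
print(round(t, 3))  # 2.893 at full precision
print(t > 3.182)    # False -> fail to reject at the 0.05 level (df = 3)
```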
Of course, the hypothesis-testing process is designed to give information about that example, given the required
assumptions (made prior to calculating the test statistic), so it stands that the null could not be rejected
in this case. Quite frankly, the hypothesis-testing exercise gives us a tool to establish the significance of a sample
correlation coefficient, taking into account the sample size. Thus, even though 0.858 feels close to 1, it's also
not close enough to draw conclusions about the correlation of the underlying populations – with the small sample size
probably a factor in the test.
Linear Regression
A linear regression is constructed by fitting a line through
a scatter plot of paired observations between two
variables. The sketch below illustrates an example of a
linear regression line drawn through a series of (X, Y)
observations:
Regression Equation
The regression equation describes the relationship between two variables and is given by the general format:
Formula 2.40
Y = a + bX + ε
In this format, given that Y is dependent on X, the slope b indicates the unit change in Y for every unit
change in X. If b = 0.66, it means that every time X increases (or decreases) by a certain amount, Y
increases (or decreases) by 0.66 times that amount. The intercept a indicates the value of Y at the point where
X = 0. Thus if X indicated market returns, the intercept would show how the dependent variable
performs when the market has a flat quarter where returns are 0. In investment parlance, a manager has
a positive alpha when a linear regression between the manager's performance and the performance of
the market has an intercept a greater than 0.
1. The relationship between the dependent variable Y and the independent variable X is linear in the slope
and intercept parameters a and b. This requirement means that neither regression parameter can be
multiplied or divided by another regression parameter (e.g. a/b), and that both parameters are raised to
the first power only. In other words, we can't construct a linear model where the equation is Y = a + b²X + ε, as unit changes in X would then have a b² effect on Y, and the relation would be nonlinear.
2. The independent variable X is not random.
3. The expected value of the error term "ε" is 0. Assumptions #2 and #3 allow the linear regression model
to produce estimates for slope b and intercept a.
4. The variance of the error term is constant for all observations. Assumption #4 is known as the "homoskedasticity assumption". When a linear regression is heteroskedastic, the variance of its error terms differs across observations and the model may not be useful in predicting values of the dependent variable.
5. The error term ε is uncorrelated across observations; in other words, the covariance between the error
term of one observation and the error term of the other is assumed to be 0. This assumption is necessary
to estimate the variances of the parameters.
6. The distribution of the error terms is normal. Assumption #6 allows hypothesis-testing methods to be
applied to linear-regression models.
Assume the following experience over a five-year period, where predicted data is a function of the model and GDP, and "actual" data indicates what happened at the company:
To find the standard error of the estimate (SEE), we take the sum of all squared residual terms, divide by (n – 2), and then take the square root of the result. In this case, the sum of the squared residuals is 0.09 + 0.16 + 0.64 + 2.25 + 0.04 = 3.18. With five observations, n – 2 = 3, and SEE = (3.18/3)^(1/2) = 1.03%.
The computation for the standard error is similar to that of the standard deviation for a sample (n – 2 is used instead of n – 1). It gives some indication of the predictive quality of a regression model, with lower SEE numbers indicating that more accurate predictions are possible. However, the standard-error measure doesn't indicate the extent to which the independent variable explains variations in the dependent variable.
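The SEE calculation above can be sketched in a few lines. The individual residuals are hypothetical here – they are chosen only so that their squares match the text's example sum of 3.18:

```python
import math

# Hypothetical residuals (actual – predicted), in percentage points, whose
# squares reproduce the text's sum: 0.09 + 0.16 + 0.64 + 2.25 + 0.04 = 3.18
residuals = [0.3, -0.4, 0.8, -1.5, 0.2]

n = len(residuals)
sse = sum(e ** 2 for e in residuals)     # sum of squared residuals
see = math.sqrt(sse / (n - 2))           # divide by n – 2, then square root
print(round(see, 2))  # 1.03
```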
Coefficient of Determination
Like the standard error, this statistic gives an indication of how well a linear-regression model serves as an
estimator of values for the dependent variable. It works by measuring the fraction of total variation in the
dependent variable that can be explained by variation in the independent variable.
The coefficient of determination is explained variation as a percentage of total variation. It is sometimes expressed as 1 – (unexplained variation / total variation).
For a simple linear regression with one independent variable, the simple method for computing the coefficient of determination is squaring the correlation coefficient between the dependent and independent variables. Since the correlation coefficient is given by r, the coefficient of determination is popularly known as "R², or R-squared". For example, if the correlation coefficient is 0.76, the R-squared is (0.76)² = 0.578. R-squared terms are usually expressed as percentages; thus 0.578 would be 57.8%. A second method of computing this number would be to find the total variation in the dependent variable Y as the sum of the squared deviations from the sample mean, then calculate the unexplained variation in Y (the sum of squared residuals) following the process outlined in the previous section. The coefficient of determination is then computed as (total variation in Y – unexplained variation in Y) / total variation in Y. This second method is necessary for multiple regressions, where there is more than one independent variable, but for our context we will be provided with r (the correlation coefficient) to calculate R-squared.
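Both routes to R² can be shown side by side. The correlation coefficient of 0.76 is the text's example; the sums of squares in the second route are hypothetical values chosen to be consistent with it:

```python
# Route 1: square the correlation coefficient (0.76 is the text's example).
r = 0.76
r_squared = r ** 2
print(round(r_squared, 3))  # 0.578, i.e. 57.8%

# Route 2: (total variation – unexplained variation) / total variation,
# using hypothetical sums of squares consistent with the same r.
total_variation = 10.0                    # sum of squared deviations of Y from its mean
unexplained = total_variation * (1 - r_squared)  # sum of squared residuals
print(round((total_variation - unexplained) / total_variation, 3))  # 0.578 again
```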
What R² tells us is the portion of the changes in the dependent variable Y that is explained by changes in the independent variable X. An R² of 57.8% tells us that 57.8% of the changes in Y result from X; it also means that 1 – 57.8%, or 42.2%, of the changes in Y are unexplained by X and are the result of other factors. So the higher the R-squared, the better the predictive nature of the linear-regression model.
Regression Coefficients
For either regression coefficient (intercept a, or slope b), a confidence interval can be determined with the
following information:
For a slope coefficient, the formula for the confidence interval is given by b ± tc*sb, where tc is the critical t-value at our chosen significance level and sb is the standard error of the slope coefficient.
To illustrate, take a linear regression with a mutual fund's returns as the dependent variable and the S&P 500 index as the independent variable. For five years of quarterly returns, the slope coefficient b is found to be 1.18, with a standard error of the coefficient of 0.147. Student's t-distribution for 18 degrees of freedom (20 quarters – 2) at a 0.05 significance level is 2.101. This data gives us a confidence interval of 1.18 ± (0.147)*(2.101), or a range of 0.87 to 1.49. Our interpretation is that there is only a 5% chance that the slope of the population is either less than 0.87 or greater than 1.49 – we are 95% confident that this fund is at least 87% as volatile as the S&P 500, but no more than 149% as volatile, based on our five-year sample.
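The interval above follows directly from the three inputs given in the example:

```python
# Confidence interval for the slope coefficient, using the mutual-fund
# example: b = 1.18, standard error of the coefficient = 0.147,
# critical t = 2.101 (df = 18, 0.05 significance level).
b, s_b, t_c = 1.18, 0.147, 2.101

lower = b - t_c * s_b
upper = b + t_c * s_b
print(round(lower, 2), round(upper, 2))  # 0.87 1.49
```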
The mechanics of hypothesis testing are similar to the examples we have used previously. A null hypothesis is chosen based on a not-equal-to, greater-than or less-than case, with the alternative satisfying all values not covered by the null. Suppose, in our previous example where we regressed a mutual fund's returns on the S&P 500 for 20 quarters, our hypothesis is that this mutual fund is more volatile than the market. A fund equal in volatility to the market will have a slope b of 1.0, so for this hypothesis test we state the null hypothesis (H0) as the case where the slope is less than or equal to 1.0 (i.e. H0: b ≤ 1.0). The alternative hypothesis Ha has b > 1.0. We know that this is a greater-than case (i.e. one-tailed) – if we assume a 0.05 significance level, the critical t is 1.734 at degrees of freedom = n – 2 = 18.
For this example, the calculated test statistic is (1.18 – 1.0)/0.147 = 1.22, below the rejection level of 1.734, so we are not able to reject the null hypothesis and cannot conclude that the fund is more volatile than the market.
Interpretation: establishing that b > 1 for this fund with statistical significance would probably require more observations (degrees of freedom). Also, with 1.18 only slightly above 1.0, it is quite possible that this fund is actually no more volatile than the market, and we were correct not to reject the null hypothesis.
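The one-tailed slope test can be sketched with the same numbers (b = 1.18, standard error 0.147, critical t = 1.734 at df = 18):

```python
# One-tailed test of H0: b <= 1.0 against Ha: b > 1.0
b, b_null, s_b = 1.18, 1.0, 0.147
t_critical = 1.734                 # df = 18, 0.05 significance, one-tailed

t_stat = (b - b_null) / s_b
print(round(t_stat, 2))            # 1.22
print(t_stat > t_critical)         # False: we cannot reject H0
```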
1. About 75% of the variation in the fund is explained by changes in the Russell 2000 index. This is true because the square of the correlation coefficient, (0.864)² = 0.746, gives us the coefficient of determination, or R-squared.
2. The fund will slightly underperform the index when index returns are flat. This results from the value of
the intercept being –0.417. When X = 0 in the regression equation, the dependent variable is equal to the
intercept.
3. The fund will on average be more volatile than the index. This fact follows from the slope of the
regression line of 1.317 (i.e. for every 1% change in the index, we expect the fund's return to change by
1.317%).
4. The fund will outperform in strong market periods and underperform in weak markets. This follows from the regression: additional risk is compensated with additional reward, with the reverse true in down markets. Predicted values of the fund's return, given a return for the market, can be found by solving Y = –0.417 + 1.317X (X = Russell 2000 return).
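The fitted line Y = –0.417 + 1.317X from the example can be used to generate predicted fund returns; the sample index returns below (0%, +5%, –5%) are illustrative values, not from the text:

```python
# Predicted fund return given a Russell 2000 return X, per the fitted
# regression in the example above.
def predict(x):
    return -0.417 + 1.317 * x

for x in (0.0, 5.0, -5.0):       # illustrative index returns, in percent
    print(f"index {x:+.1f}% -> fund {predict(x):+.2f}%")
# A flat index (X = 0) gives the intercept, about -0.42%, consistent with
# point 2 above.
```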
Applied to regression parameters, ANOVA techniques are used to determine the usefulness of a regression model – the degree to which changes in an independent variable X can be used to explain changes in a dependent variable Y. For example, we can conduct a hypothesis-testing procedure to determine whether the slope coefficient is equal to zero (i.e. the variables are unrelated), or whether there is statistical meaning to the relationship (i.e. the slope b is different from zero). An F-test can be used for this process.
F-Test
The formula for the F-statistic in a regression with one independent variable is given by the following:
Formula 2.41
F = (RSS/1) / [SSE/(n – 2)]
1. RSS, or the regression sum of squares, is the amount of total variation in the dependent variable Y that
is explained in the regression equation. The RSS is calculated by computing each deviation between a
predicted Y value and the mean Y value, squaring the deviation and adding up all terms. If an
independent variable explains none of the variations in a dependent variable, then the predicted values of
Y are equal to the average value, and RSS = 0.
2. SSE, or the sum of squared errors (residuals), is calculated by finding the deviation between each predicted Y and the actual Y, squaring the result and adding up all terms.
TSS, or total variation, is the sum of RSS and SSE. In other words, this ANOVA process breaks variance into
two parts: one that is explained by the model and one that is not. Essentially, for a regression equation to have
high predictive quality, we need to see a high RSS and a low SSE, which will make the ratio (RSS/1)/[SSE/(n –
2)] high and (based on a comparison with a critical F-value) statistically meaningful. The critical value is taken
from the F-distribution and is based on degrees of freedom.
For example, with 20 observations, degrees of freedom would be n – 2, or 18, resulting in a critical value (from the table) of 2.19. If RSS were 2.5 and SSE were 1.8, the computed test statistic would be F = (2.5/1)/(1.8/18) = 25, which is above the critical value and indicates that the regression equation has predictive quality (b is different from 0).
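The F-statistic example above can be reproduced directly from the formula:

```python
# F-statistic for a one-variable regression: (RSS/1) / (SSE/(n – 2)),
# using the example values RSS = 2.5, SSE = 1.8, n = 20.
rss, sse, n = 2.5, 1.8, 20
f_critical = 2.19                      # from the F-table at df = 18

f_stat = (rss / 1) / (sse / (n - 2))
print(round(f_stat, 1))                # 25.0
print(f_stat > f_critical)             # True: b is different from 0
```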
Y = 0.154 + 0.917X
Using this model, the predicted inflation number would be calculated for the following inflation estimates:
Inflation estimate (X)    Predicted inflation (Y)
–1.1%                     –0.85%
+1.4%                     +1.43%
+4.7%                     +4.46%
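The predicted column can be reproduced by plugging each estimate into the fitted model Y = 0.154 + 0.917X:

```python
# Predicted inflation (Y) from each inflation estimate (X), in percent,
# per the model Y = 0.154 + 0.917X.
for x in (-1.1, 1.4, 4.7):
    y = 0.154 + 0.917 * x
    print(f"{x:+.1f}% -> {y:+.2f}%")
# Note: with these (rounded) coefficients, +1.4% maps to +1.44% rather than
# the +1.43% shown above, presumably because the original used unrounded
# parameter estimates.
```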
The predictions based on this model seem to work best for typical inflation estimates, and suggest that extreme estimates tend to overstate inflation – e.g. a predicted inflation of just 4.46% when the estimate was 4.7%. The model does seem to suggest that estimates are highly predictive, though to better evaluate it we would need to see the standard error and the number of observations on which it is based. If we knew the true values of the regression parameters (slope and intercept), the variance of any predicted Y value would be equal to the square of the standard error.
In practice, we must estimate the regression parameters; thus our predicted value for Y is an estimate based on an estimated model. How confident can we be in such a process? In order to determine a prediction interval, employ the following steps:
1. Predict the value of the dependent variable Y for the chosen value of X.
2. Compute the variance of the prediction error, using the following equation:
Formula 2.42
sf² = s²*[1 + 1/n + (X – X̄)²/((n – 1)*sx²)]
Where: s² is the squared standard error of the estimate, n is the number of observations, X is the value of the independent variable used to make the prediction, X̄ is the sample mean of the independent variable, and sx² is the variance of X.
3. Choose a significance level and find the critical t-value for n – 2 degrees of freedom.
4. Construct the interval Ŷ ± tc*sf, where sf is the square root of the variance from step 2.
Here's another case where the material becomes much more technical than necessary, and one can get bogged down in preparation when, in reality, the formula for the variance of a prediction error isn't likely to be covered. Prioritize – don't squander precious study hours memorizing it. If the concept is tested at all, you'll likely be given the answer to step 2. Simply know how to use the structure in step 4 to answer a question.
For example, if the value of X is 2 for the regression Y = 1.5 + 2.5X, we would have a predicted Y of 1.5 + 2.5*(2), or 6.5. Our prediction interval is 6.5 ± tc*sf. The critical t-value is based on the chosen confidence level and the degrees of freedom, while sf is the square root of the variance of the prediction error computed above. If these numbers are tc = 2.10 for 95% confidence and sf = 0.443, the interval is 6.5 ± (2.1)*(0.443), or 5.57 to 7.43.
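The interval construction (step 4) can be sketched with the example's numbers:

```python
# Prediction interval for Y at X = 2 under the regression Y = 1.5 + 2.5X,
# with the example's given t-critical (2.10) and standard error of the
# prediction (sf = 0.443).
y_hat = 1.5 + 2.5 * 2          # predicted Y = 6.5
t_c, s_f = 2.10, 0.443

lower = y_hat - t_c * s_f
upper = y_hat + t_c * s_f
print(round(lower, 2), round(upper, 2))  # 5.57 7.43
```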
Limitations of Regression Analysis
1. Parameter Instability - This is the tendency for relationships between variables to change over time due to changes in the economy or the markets, among other uncertainties. If a mutual fund produced its return history in a market where technology was a leadership sector, the model may not work when foreign and small-cap markets are the leaders.
2. Public Dissemination of the Relationship - In an efficient market, public knowledge of a relationship can limit its effectiveness in future periods. For example, the discovery that low price-to-book-value stocks outperform high price-to-book-value stocks means that these stocks can be bid higher, and value-based investment approaches will not retain the same relationship as in the past.
3. Violation of Regression Relationships - Earlier we summarized the six classic assumptions of a linear
regression. In the real world these assumptions are often unrealistic – e.g. assuming the independent
variable X is not random.