
CS 8 Study Guide

Association
Studies vs experiments:
-Observational:
can compare some aspect of a treatment group and control group
the assignment of people to groups is out of our control
we can determine only an association between the treatment and result
-Randomized control:
can compare some aspect of a treatment and control group
the assignment of people to groups is controlled and randomized so that there are no
systematic differences or confounding factors
we can demonstrate causality between the treatment and the result
Correlation
-The relation between two variables
-The correlation coefficient r measures the strength of the linear
relationship between two variables. Graphically, it measures how
clustered the scatter diagram is around a straight line.
Limitations:
-Correlation only measures association. Correlation does not imply
causation.
- Correlation measures linear association. Variables that have strong
non-linear association might have very low correlation.

r = average(x in standard units * y in standard units)


r is always between -1 and 1
r is a pure number since it is measured in standard units
r is unaffected by switching axes
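A minimal sketch of the formula above, using only the Python standard library (np.mean and np.std would do the same job). The data and the names standard_units and correlation are made up for illustration.

```python
from statistics import fmean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - average) / SD."""
    avg = fmean(values)
    sd = pstdev(values)  # population SD, the same quantity np.std computes by default
    return [(v - avg) / sd for v in values]

def correlation(x, y):
    """r = average of (x in standard units * y in standard units)."""
    x_su = standard_units(x)
    y_su = standard_units(y)
    return fmean([a * b for a, b in zip(x_su, y_su)])

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_xy = correlation(x, y)
r_yx = correlation(y, x)  # switching the axes gives the same r

# A strong non-linear relationship (y = x squared) can still have r = 0
curve_x = [-2, -1, 0, 1, 2]
curve_y = [v ** 2 for v in curve_x]
r_curve = correlation(curve_x, curve_y)

print(r_xy, r_yx, r_curve)
```

The last case matches the limitation listed above: curve_y is completely determined by curve_x, yet the linear association is zero.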
Estimation

To make assumptions
-Variability: the value of an estimate varies from one sample to the next. Good
estimates have low variability and low bias.
-Bias: the estimate is systematically on one side or the other of the true value. To
judge bias, first locate the thing you are trying to estimate, then check whether the
estimates fall evenly on both sides of it. To compare variability, look at how wide
the histogram of the estimates is.
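One way to see bias and variability side by side is to simulate two estimators of the same quantity. This is only a sketch under assumptions: the setup (estimating the largest serial number N from a random sample) and every name in it are hypothetical, not taken from these notes.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

N = 300                       # hypothetical true value we are trying to estimate
population = range(1, N + 1)  # serial numbers 1..N

max_estimates = []
doubled_mean_estimates = []
for _ in range(1000):
    sample = random.choices(population, k=30)  # random sample with replacement
    max_estimates.append(max(sample))                 # estimator 1: largest value seen
    doubled_mean_estimates.append(2 * fmean(sample))  # estimator 2: twice the sample average

# Bias: is the center of the estimates on one side of the true value?
max_center = fmean(max_estimates)
doubled_center = fmean(doubled_mean_estimates)

# Variability: how wide is the histogram of estimates?
max_spread = pstdev(max_estimates)
doubled_spread = pstdev(doubled_mean_estimates)

print(max_center, max_spread)
print(doubled_center, doubled_spread)
```

The sample maximum can never exceed N, so it sits systematically below the true value (bias) but varies little; twice the sample average is centered near N but varies more from sample to sample.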

Programming
Useful expressions
t.num_rows
t.where
t.group
np.average/np.mean/np.std

Arrays
Tables

where

column, row

group[s]

pivot?

join

with_column[s], append_column

with_row[s], append

apply

drop

sort

take

relabel[ed]

Example problems:
a line of code computing the average values of a column
np.mean(exports.column(1))
a line of code that returns a table with only the values that are above average.
exports.where(exports.column(1) > np.mean(exports.column(1)))
a line of code that returns True if any exports are less than 1 million dollars.
exports.sort(1).column(1).item(0) < 1e6
a line of code that returns the proportion of exports that are between 50 and 100 million
dollars.
exports.where(np.logical_and(exports.column(1) >= 50e6,
exports.column(1) <= 100e6)).num_rows / exports.num_rows
an expression that filters a table so that some conditions are satisfied
t.where(np.logical_and(
t.column(' ') <= larger number,
t.column(' ') >= smaller number))

average of each group in a column, assigned to a variable
variable = t.group('column', np.mean)
group with the highest average rating in that grouped table
variable.sort('rating mean', descending=True).column('column').item(0)
PIVOT example!!!

Functions
def fizzbuzz(num):
    if num % 15 == 0:
        return 'fizzbuzz'
    elif num % 3 == 0:
        return 'fizz'
    elif num % 5 == 0:
        return 'buzz'
    else:
        return str(num)
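Iteration:
for i in range(1, 101):  # the loop header is missing in the notes; this range is an assumption
    if i % 15 == 0:
        print('FizzBuzz')
    elif i % 3 == 0:
        print('Fizz')
    elif i % 5 == 0:
        print('Buzz')
    else:
        print(i)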

Conditions
Visualizations
Growth rate
The birthday problem
Sampling and estimation
Sampling: allows us to make statistical inferences. Allows us to investigate both issues
of probability and statistics
-Deterministic sample: you choose which elements to take from the data yourself; no
chance is involved
-Probability sample: one for which it is possible to calculate, before the sample is
drawn, the chance with which any subset of elements will enter the sample. In a
probability sample, all elements need not have the same chance of being chosen.
-To draw a probability sample, we need to generate numbers at random. In Python
we can do this by np.random.randint( , )
-Random sample with replacement is one of the easiest ways to make a probability
sample: each element in the population has the same chance of being drawn, and
each can be drawn more than once in the sample.
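A small sketch of both kinds of random draw, using the standard library's random module in place of np.random.randint (random.randrange plays the same role of picking a random index). The population and variable names are made up.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

population = ['a', 'b', 'c', 'd', 'e']  # made-up population

# Random sample WITH replacement: pick a random index each time,
# so the same element can be drawn more than once.
with_replacement = [population[random.randrange(len(population))]
                    for _ in range(10)]

# Random sample WITHOUT replacement: each element can appear at most once.
without_replacement = random.sample(population, k=3)

print(with_replacement)
print(without_replacement)
```

Drawing 10 times from 5 elements is only possible with replacement, which is why the first sample must contain repeats.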

Estimation: computing a statistic from our sample about a property of the population
Mean:
-The mean is dependent upon all of the values
-It is the average! The sum of all the values divided by how many
values there are in the dataset.
-It is the balance point of the histogram
-The mean is pulled toward the tail, that is, in the direction of the skew.
Median:
-The median depends only on what is happening in the center.
-The halfway point of the data
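A quick check of how the mean and median react to a tail, with made-up numbers. Adding one very large value drags the mean toward it, while the median barely moves.

```python
from statistics import mean, median

values = [30, 35, 40, 45, 50]   # made-up, symmetric data
skewed = values + [500]         # one large value creates a right tail

symmetric_mean = mean(values)      # 40
symmetric_median = median(values)  # 40

skewed_mean = mean(skewed)         # 700 / 6, pulled toward the tail
skewed_median = median(skewed)     # 42.5, still near the center

print(symmetric_mean, symmetric_median)
print(skewed_mean, skewed_median)
```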

Spread
How dispersed our data is
The rough size of deviations from average
Bias vs. variability
SD is how our spread is measured (when you think of spread think of SD)
Standard units: defining the mysterious z
z measures how many SDs a value is above average
If z is negative, the value is below average
z is typically less than about 5 in size (in the range -5, 5)
z = (value - average) / SD
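A minimal sketch of converting a single value to standard units and back, with the subtraction done before the division. The function names and the numbers (average 70, SD 5) are made up for illustration.

```python
def to_standard_units(value, average, sd):
    """z = (value - average) / SD"""
    return (value - average) / sd

def from_standard_units(z, average, sd):
    """Undo the conversion: value = z * SD + average"""
    return z * sd + average

z_above = to_standard_units(75, 70, 5)   # one SD above average
z_below = to_standard_units(62, 70, 5)   # negative, so below average
round_trip = from_standard_units(z_below, 70, 5)

print(z_above, z_below, round_trip)
```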

How to spot the SD on a bell-shaped curve: For bell-shaped distributions, the SD is
the distance between the mean and the points of inflection on either side.

Probability

Probability is the study of the outcomes of random processes; processes that include
chance. Statistics is just the opposite: it is the study of random processes given
observations of their outcomes.
With replacement: when we pick an item we put it back, so if we had a population of 100
and we picked one, we could still pick it again.
Without replacement: you don't put it back, so you can't pick it again
Practice problem:
Chance that three cards drawn from a 52-card deck are all from one given suit (1 of 4),
drawing with replacement?
(1/4)^3
without replacement?
(1/4)(12/51)(11/50)
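The two answers can be checked with exact fractions. This sketch assumes the practice problem asks for the chance that all three cards come from one particular suit of a standard 52-card deck (13 cards per suit), which is what the answers (1/4)^3 and (1/4)(12/51)(11/50) suggest.

```python
from fractions import Fraction

# With replacement: the deck is the same on every draw, so each draw is 13/52 = 1/4.
with_replacement = Fraction(13, 52) ** 3

# Without replacement: one fewer card of the suit and one fewer card in the deck each time.
without_replacement = Fraction(13, 52) * Fraction(12, 51) * Fraction(11, 50)

print(with_replacement)     # 1/64
print(without_replacement)  # 11/850
```

Without replacement the chance is smaller, because each card of the suit that is drawn leaves fewer of that suit behind.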
- Prediction
Simple random sample: each item in the population has an equal chance of being picked,
and sampling is done without replacement

Prediction is possible if we believe that a scatter plot reflects the underlying
relation between the two variables being plotted, but does not specify the relation
completely.
Original units:
-Can be used to predict.
Equation of the regression line:
y in standard units = r * (x in standard units)
Regression line:
y = slope * x + intercept
Practice problem:
Fathers av height = 68
Sons av height = 70
SD = 5 (both fathers and sons)
r = .5
1. Given a father who is 75 inches tall, how tall would you expect his son to
be?
(y - 70) = .5*(x - 68)   (the SDs are equal, so they cancel)
y - 70 = .5*(75 - 68)
y = .5*x + 36
.5*75 + 36 = 73.5

2. Given a son who is 62 inches tall, how tall would you expect his father
to be?
(y - 68) = .5*(x - 70)
y - 68 = .5*(62 - 70)
y = .5*x + 33
.5*62 + 33 = 64
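The practice problem can be checked by going through standard units explicitly. This is a sketch: the function name regression_predict and its parameter names are made up, and it assumes the SD of 5 applies to both fathers and sons.

```python
def regression_predict(x, x_avg, x_sd, y_avg, y_sd, r):
    """Predict y from x using: y in standard units = r * (x in standard units)."""
    x_su = (x - x_avg) / x_sd   # convert x to standard units
    y_su = r * x_su             # regression estimate in standard units
    return y_su * y_sd + y_avg  # convert back to original units

# 1. Father is 75 inches: predict the son (x = fathers, y = sons)
predicted_son = regression_predict(75, x_avg=68, x_sd=5, y_avg=70, y_sd=5, r=0.5)

# 2. Son is 62 inches: predict the father (x = sons, y = fathers)
predicted_father = regression_predict(62, x_avg=70, x_sd=5, y_avg=68, y_sd=5, r=0.5)

print(predicted_son, predicted_father)  # about 73.5 and 64
```

Note that the second question is not answered by inverting the first line: predicting fathers from sons uses its own regression line, with the roles of x and y swapped.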

Regression line: unique in the sense that, for a fixed set of data points, there is
only one line that minimizes the mean squared error of the predictions
- Simple regression
- Regression Errors
- The regression line is the line that minimizes mean squared error (and therefore
root mean squared error)
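A small numerical check of the minimizing property, on made-up data. It builds the regression line from r and the SDs (slope = r * SD of y / SD of x, which follows from the standard-units equation) and compares its mean squared error with a few nearby lines. The names fit_line and mse are hypothetical.

```python
from statistics import fmean, pstdev

def fit_line(x, y):
    """Slope and intercept of the regression line, built from r and the SDs."""
    x_avg, y_avg = fmean(x), fmean(y)
    x_sd, y_sd = pstdev(x), pstdev(y)
    r = fmean([((a - x_avg) / x_sd) * ((b - y_avg) / y_sd) for a, b in zip(x, y)])
    slope = r * y_sd / x_sd
    intercept = y_avg - slope * x_avg
    return slope, intercept

def mse(x, y, slope, intercept):
    """Mean squared error of the line's predictions."""
    return fmean([(b - (slope * a + intercept)) ** 2 for a, b in zip(x, y)])

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
best_mse = mse(x, y, slope, intercept)

# Nudge the slope and intercept a little: every nearby line does worse.
nearby_mses = [mse(x, y, slope + ds, intercept + di)
               for ds in (-0.1, 0, 0.1)
               for di in (-0.5, 0, 0.5)
               if (ds, di) != (0, 0)]

print(slope, intercept, best_mse, min(nearby_mses))
```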
Histograms
-A histogram of proportions of all possible outcomes of a known
random process is called a probability histogram. It shows all the possible
values of the quantity and the probabilities of all those values.
-A histogram of proportions of actual outcomes generated by sampling
is called an empirical histogram (The histogram of observed results).
Based on running the random process repeatedly and keeping track
of the value of the quantity each time. It shows all the values of the
quantity and the proportion of times each value was observed among
all the repetitions.
*The relation between an empirical histogram and a probability histogram is
that the more times you run the experiment, the more your empirical
histogram will look like the probability histogram. (Law of averages).
Simulating random processes repeatedly is a way of approximating probability
distributions without figuring out the probabilities mathematically
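A sketch of the idea with a fair six-sided die, chosen here only as an example of a known random process. The probability histogram has every face at 1/6; the empirical proportions get closer to it as the number of rolls grows. All names are made up.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is repeatable

faces = [1, 2, 3, 4, 5, 6]
probability = {face: 1 / 6 for face in faces}  # probability histogram of a fair die

def empirical_proportions(num_rolls):
    """Roll the die num_rolls times and return the proportion of each face."""
    counts = Counter(random.choice(faces) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in faces}

few_rolls = empirical_proportions(60)
many_rolls = empirical_proportions(60000)

# Largest difference between the empirical and probability histograms
gap_few = max(abs(few_rolls[f] - probability[f]) for f in faces)
gap_many = max(abs(many_rolls[f] - probability[f]) for f in faces)

print(gap_few, gap_many)
```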

Central Limit Theorem: shows that the distribution of the average of a large random
sample will be roughly normal, no matter what the distribution of the population from
which the sample is drawn.

We should expect empirical histograms of sample averages to look roughly bell-shaped,
if the samples are large and we take a lot of them
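A simulation sketch of this claim, starting from a population that is clearly not bell-shaped. The population, sample size, and number of repetitions are made up. The check that the spread of the sample averages is close to (population SD) / sqrt(sample size) is a standard fact that these notes do not state.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(7)  # fixed seed so the sketch is repeatable

# Made-up right-skewed population: mostly 1s and 2s with a few 10s
population = [1] * 70 + [2] * 20 + [10] * 10
pop_mean = fmean(population)
pop_sd = pstdev(population)

sample_size = 100
sample_means = [fmean(random.choices(population, k=sample_size))
                for _ in range(2000)]

center = fmean(sample_means)   # should be close to the population mean
spread = pstdev(sample_means)  # should be close to pop_sd / sqrt(sample_size)

# For a roughly bell-shaped histogram, about 68% of values are within 1 SD of the center
within_one_sd = fmean([1 if abs(m - center) <= spread else 0 for m in sample_means])

print(pop_mean, center)
print(pop_sd / sqrt(sample_size), spread)
print(within_one_sd)
```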
right skew means the mean is bigger than the median
Measures of spread:
Variance: average squared deviation from the mean
Standard deviation: square root of the variance
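The two definitions above, computed step by step on made-up data (np.std gives the same SD in one call).

```python
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

average = sum(values) / len(values)                      # 5.0
squared_deviations = [(v - average) ** 2 for v in values]
variance = sum(squared_deviations) / len(values)         # average squared deviation
sd = sqrt(variance)                                      # square root of the variance

print(average, variance, sd)  # 5.0 4.0 2.0
```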
Approximate Percentages within SDs:
Percent in Range     All Distributions    Normal Distribution
average ± 1 SD       at least 0%          about 68%
average ± 2 SDs      at least 75%         about 95%
average ± 3 SDs      at least 88.88%      about 99.73%
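Both sets of percentages can be reproduced in a few lines. The "all distributions" figures follow Chebyshev's bound, 1 - 1/k^2 for k SDs, and the normal figures come from the normal curve, computed here with math.erf. The function names are made up.

```python
from math import erf, sqrt

def chebyshev_bound(k):
    """At least this proportion of ANY distribution is within k SDs of average."""
    return 1 - 1 / k ** 2

def normal_within(k):
    """Proportion of a normal distribution within k SDs of average."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, chebyshev_bound(k), normal_within(k))
```

The bound for all distributions is much weaker because it has to hold no matter what shape the histogram has.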

Standard units: (value - average) / SD
-used with the variable z
Normal curve: Bell Shaped, smoothed version in Standard Units

HW1
number of characters to the number of periods?
geometrically, that point forms the highest slope line between itself and the origin (0,0),
from 0 to the first point on the highest point slope
