
Statistics: Statistics is the science of collecting, organizing, analyzing and interpreting data in order to make decisions.

Statistics as a subject is broken into two branches: descriptive statistics and inferential statistics.

Descriptive statistics refers to statistical techniques used to summarize and describe a data set, and also to the statistics (measures) used in such summaries. Ex: measures of central tendency, such as the mean and median, and measures of dispersion, such as the range and standard deviation.

Inferential statistics means the use of statistics to make inferences concerning some unknown aspect of a population from a sample of that population. Ex: hypothesis testing, linear regression and principal components analysis.

Application of statistics: Reasoning based on probability and statistics gives one the ability to cope with uncertainty. It has the power to improve decision-making accuracy and to test new ideas. Statistics has been useful in research in almost all disciplines. A few fields are: planning, population studies, health, family planning, biology, business and commerce, agriculture, physical science, socio-economic studies, environment, medicine, psychology, the production industry, astronomy, etc.

Importance of statistics in banking: Before implementing a new policy or product, a bank may conduct an institutional survey broken down by state, by rural and urban areas, by education level, by gender, and so on, and use the information gathered to choose the places where the new policy is likely to get a good response. Based on probability theory, banks also estimate how many people will deposit money, for how long, and in what amounts; on this basis they extend loans, place deposits with other financial institutions, and buy shares. Statistics also allows a bank to compare itself with other banks and to project its future growth.

Biostatistics: Biostatistics is the application of statistics to the analysis of biological and medical data. Applications of biostatistics include: public health (epidemiology, health services research, nutrition, and environmental health); the design and analysis of clinical trials in medicine; and ecology and ecological forecasting.

Informatics: Informatics is a broad academic field encompassing human-computer interaction, information science, information technology, algorithms, and social science. It is the study of the application of computer and statistical techniques to the management of information. Informatics is concerned with how data are collected and stored, how they are organized, and how they are retrieved and transmitted. It can also include issues such as data security, storage limitations and so forth.

Some statistical terms:

Data: Data are a set of observations, values and elements under investigation. In other words, data are the raw, disorganized facts and figures collected from any field of inquiry.

Variable: A variable is a characteristic whose value varies from person to person, object to object or from phenomenon to phenomenon. Ex: age, income, etc.

Qualitative variable: A variable that expresses a qualitative attribute. Ex: hair color, eye color, religion, favorite movie, gender, profession, etc.

Quantitative variable: A variable that is measured in terms of numbers. Ex: height, weight, etc.

Discrete variable: When a variable can assume only isolated values within a given range, it is called a discrete variable. Ex: family size, class size, etc.

Continuous variable: When a variable can theoretically assume any value within a given range, it is said to be a continuous variable. Ex: age, height, weight, time, temperature, price of a commodity, etc.

Random variable: A random variable is a variable whose values are random but whose statistical distribution is known. In other words, a random variable is a real-valued function defined over a sample space.

Population: An aggregate of all individuals or items of interest in a particular study, defined by some common characteristic, is called a population.

Sample: A sample is the portion or representative part of the population chosen for study.

Random sample: A random sample is a subset of the population in which every member of the population has an equal likelihood of being selected.

Parameter: A parameter is a value, usually unknown, used to represent a certain population characteristic. Ex: population mean.

Statistic: A statistic is something that describes a sample and is used as an estimator for a population parameter. Ex: sample mean.

Estimator: An estimator is the method or formula for estimating the value of unknown population parameters from the sample values.

Estimate: An estimate is the numerical value of the estimator obtained from a sample by using the method of estimation.

Bias: Bias refers to how far the average value of a statistic lies from the parameter it is estimating, that is, the error which arises when estimating a quantity.

Precision: Precision is a measure of how close an estimator is expected to be to the true value of a parameter.

Measures of central tendency: Measures of central tendency are measures of the location of the middle or the center of a distribution. The three most common measures of central tendency are the mean, the median, and the mode.

Mean: The mean is the sum of all the scores divided by the number of scores.

Median: The median is the number in the middle when the numbers in a data set are arranged in ascending or descending order. If the number of values in a data set is even, the median is the mean of the two middle numbers.

Mode: The mode is the value that occurs most frequently in a set of data.
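A minimal sketch of the three measures on an invented set of scores, using Python's built-in statistics module:

```python
import statistics

scores = [4, 8, 6, 5, 3, 8, 9, 5, 8]  # invented scores; sorted: 3 4 5 5 6 8 8 8 9

mean = statistics.mean(scores)      # sum of scores / number of scores
median = statistics.median(scores)  # middle value after sorting
mode = statistics.mode(scores)      # most frequent value

print(mean)    # 6.222...  (56 / 9)
print(median)  # 6         (the 5th of the 9 sorted values)
print(mode)    # 8         (occurs three times)
```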

Measures of dispersion or variability: Measures of variability provide information about the degree to which individual scores are clustered about, or deviate from, the average value in a distribution.

Variance: The variance is a measure based on the deviations of individual scores from the mean.

Standard deviation: The standard deviation is defined as the positive square root of the variance.

Standard error: The standard error of a statistic is the standard deviation of the sampling distribution of that statistic. It reflects how much sampling fluctuation the statistic will show.

Covariance: Covariance measures the strength of the linear relationship between two variables.

Symmetric: Distributions that have the same shape on both sides of the center are called symmetric.

Skewness: It measures the deviation of the distribution from symmetry.

Kurtosis: It measures the "peakedness" of the distribution.

Correlation: Correlation is a measure of the relation between two or more variables. The most widely used type of correlation coefficient is Pearson's r, which measures the strength and the direction of a linear relationship between two variables. The value of r is such that -1 ≤ r ≤ +1.

Rank correlation: A rank correlation coefficient is a coefficient of correlation between two random variables that is based on the ranks of the measurements rather than the actual values.

Spurious correlation: It is a false presumption that two variables are correlated when in reality they are not. Spurious correlation is often the result of a third factor that is not apparent at the time of examination.
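A minimal sketch of these dispersion and correlation measures on invented data, again using the standard statistics module (statistics.correlation and statistics.covariance require Python 3.10 or later):

```python
import statistics

x = [1, 2, 3, 4, 5]   # invented paired observations
y = [2, 4, 5, 4, 5]

var_x = statistics.variance(x)        # sample variance (n - 1 denominator)
sd_x = statistics.stdev(x)            # positive square root of the variance
cov_xy = statistics.covariance(x, y)  # strength of the linear relationship
r = statistics.correlation(x, y)      # Pearson's r, always between -1 and +1

print(var_x, sd_x)   # 2.5, about 1.581
print(cov_xy, r)     # 1.5, about 0.775 (a positive linear relationship)
```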

Regression: It is a statistical measure that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables). The two basic types of regression are linear regression and multiple regression. Linear regression uses one independent variable to explain and/or predict the outcome of Y, while multiple regression uses two or more independent variables to predict the outcome. The general form of each type of regression is:

Linear regression: Y = a + bX + e
Multiple regression: Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + e

where Y is the variable that we are trying to predict, X is the variable that we are using to predict Y, a is the intercept, b is the slope, and e is the regression residual.

Example: GPA may best be predicted as 1 + 0.02*IQ. Thus, knowing that a student has an IQ of 130 would lead us to predict that her GPA would be 3.6 (since 1 + 0.02*130 = 3.6).

Coefficient of determination, R²: The coefficient of determination is the ratio of the explained variation to the total variation. It is used in statistical model analysis to assess how well a model explains and predicts future outcomes, and one use of R² is to test the goodness of fit of the model. The goodness of fit of a statistical model describes how well it fits a set of observations; measures of goodness of fit typically summarize the discrepancy between the observed values and the values expected under the model in question. The coefficient of determination is such that 0 ≤ R² ≤ 1, and it denotes the strength of the linear association between x and y. A value of 1 indicates a perfect fit, and therefore a very reliable model for future forecasts; a value of 0, on the other hand, would indicate that the model fails to accurately model the data set. For example, R² = 0.850 means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation); the other 15% of the total variation in y remains unexplained.
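A minimal sketch of fitting the linear form above by least squares, using numpy. The IQ/GPA pairs are invented to lie exactly on the line GPA = 1 + 0.02*IQ from the worked example, so R² comes out as 1 (a perfect fit):

```python
import numpy as np

iq = np.array([95, 105, 110, 120, 130, 140])      # invented predictor values
gpa = np.array([2.9, 3.1, 3.2, 3.4, 3.6, 3.8])    # exactly 1 + 0.02 * iq

b, a = np.polyfit(iq, gpa, 1)   # degree-1 fit returns slope b, then intercept a
pred = a + b * iq               # fitted values Y-hat = a + bX

# Coefficient of determination: explained variation / total variation
ss_res = np.sum((gpa - pred) ** 2)        # unexplained (residual) variation
ss_tot = np.sum((gpa - gpa.mean()) ** 2)  # total variation
r2 = 1 - ss_res / ss_tot

print(f"GPA = {a:.2f} + {b:.3f} * IQ, R^2 = {r2:.3f}")  # R^2 = 1.000 here
```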

Hypothesis: A hypothesis is a statement about a population parameter.

Null hypothesis: The statistical hypothesis that states that there is no difference between observed and expected data. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug. We would write H0: there is no difference between the two drugs on average.

Alternative hypothesis: The hypothesis that the given data do not conform with the null hypothesis; it is accepted only when the null hypothesis can be rejected at a predetermined significance level. For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug. We would write H1: the two drugs have different effects, on average.

Hypothesis testing: It is an inferential procedure that uses sample data to evaluate the credibility of a hypothesis about a population.

One-tailed test: A one-tailed test is a statistical hypothesis test in which the values for which we can reject the null hypothesis H0 are located entirely in one tail of the probability distribution.

Two-tailed test: A two-tailed test is a statistical hypothesis test in which the values for which we can reject the null hypothesis H0 are located in both tails of the probability distribution.

Example: Suppose we wanted to test a manufacturer's claim that there are, on average, 50 matches in a box. We could set up the hypotheses H0: μ = 50, against H1: μ < 50 or H1: μ > 50; either alternative leads to a one-tailed test. Another alternative hypothesis could be tested against the same null, leading this time to a two-tailed test: H0: μ = 50, against H1: μ ≠ 50.
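As a sketch of the matchbox example, the snippet below runs a one-sample, two-tailed t-test with scipy.stats.ttest_1samp; the ten match counts are invented for illustration:

```python
from scipy import stats

counts = [48, 50, 47, 49, 51, 46, 48, 50, 47, 49]  # invented matches per box

# Two-tailed test of H0: mu = 50 against H1: mu != 50
t_stat, p_value = stats.ttest_1samp(counts, popmean=50)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Here t = -3.000 and p is about 0.015, so at the 5% significance
# level we would reject H0 in favour of H1: mu != 50.
```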

Test statistic: A test statistic is a quantity calculated from our sample of data. Its value is used to decide whether or not the null hypothesis should be rejected in our hypothesis test.

Critical value: It is the value that a test statistic must exceed in order for the null hypothesis to be rejected.

Critical region: The critical region CR, or rejection region RR, is the set of values of the test statistic for which the null hypothesis is rejected. If the observed value of the test statistic is a member of the critical region, we conclude "reject H0"; if it is not a member of the critical region, we conclude "do not reject H0".

Significance level: The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis H0 when it is in fact true. It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error. Significance level = P(type I error) = α.

Confidence interval (CI): A confidence interval is an interval or range computed from the sample data that has a known probability of containing the unknown population parameter; the limits defining the interval are called the confidence limits.

Type I error: In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected. A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error. The probability of a type I error can be precisely computed as P(type I error) = significance level = α.

Type II error: In a hypothesis test, a type II error occurs when the null hypothesis H0 is not rejected when it is in fact false. P(type II error) = β.

P-value: A p-value is a measure of how much evidence we have against the null hypothesis. Put simply, the p-value of a test is the smallest value of α for which the null hypothesis can be rejected.
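To make these quantities concrete, here is a minimal sketch for a two-tailed z-test, assuming scipy is available; the observed statistic of 2.31 is invented for illustration:

```python
from scipy import stats

alpha = 0.05                          # significance level = P(type I error)
z_crit = stats.norm.ppf(1 - alpha/2)  # two-tailed critical value, about 1.96

z_obs = 2.31                               # invented observed test statistic
p_value = 2 * (1 - stats.norm.cdf(z_obs))  # two-tailed p-value, about 0.021

print(z_crit, p_value)
# |z_obs| > z_crit, equivalently p_value < alpha: z_obs falls in the
# critical region, so H0 is rejected at the 5% significance level.
```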

Sampling: It is the process of choosing a representative sample from a target population and collecting data from that sample in order to understand something about the population as a whole.

Sampling frame: A sampling frame is a list of units or groups of units of the population to be sampled, organized and arranged in such a manner that every unit occurs once and only once in the list and no unit is excluded from the list.

Simple random sampling: A sampling technique in which each element in the population has a known and equal probability of selection is known as simple random sampling (SRS). Example: a person researching education levels within a company takes the full employee list and applies a random number algorithm to it in order to select people to interview.

Systematic sampling: It is a method of selecting sample members from a larger population according to a random starting point and a fixed periodic interval. Example: we may choose a random start page and take every 45th name in a directory until we have the desired sample size.

Stratified sampling: It is a sampling plan in which the population is divided into several non-overlapping strata and a random sample is selected from each stratum, in such a way that units within a stratum are homogeneous but units between strata are heterogeneous. Example: suppose a farmer wishes to work out the average milk yield of each cow type in his herd, which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide up his herd into 4 sub-groups and take samples from each.

Cluster sampling: It is a method of sampling which consists of first selecting, at random, groups called clusters of elements from the population, and then choosing all of the elements within each cluster to make up the whole sample. Example: in a study of the opinions of homeless people across a country, rather than studying a few homeless people in every town, a number of towns are selected and a significant number of homeless people are interviewed in each one.
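A minimal sketch of the first two schemes, using only the standard library; the population of 500 IDs is invented for illustration:

```python
import random

population = list(range(1, 501))  # 500 hypothetical employee IDs
n = 10                            # desired sample size

# Simple random sampling: every member equally likely to be chosen
srs = random.sample(population, n)

# Systematic sampling: random starting point, then every k-th member
k = len(population) // n          # fixed periodic interval (50 here)
start = random.randrange(k)       # random start within the first interval
systematic = population[start::k]

print(srs)         # 10 IDs scattered anywhere in the list
print(systematic)  # 10 IDs spaced exactly 50 apart
```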

Frequency distribution: It is the organization of raw data in table form with classes and frequencies.

Probability distribution: A listing of all the values a random variable can assume, together with their corresponding probabilities, makes a probability distribution.

Sampling distribution: A probability distribution of a statistic obtained through a large number of samples drawn from a specific population is known as a sampling distribution. In other words, the sampling distribution is the distribution of a statistic across an infinite number of samples.

Normal distribution: The normal distribution is a theoretical frequency distribution for a set of variable data, usually represented by a bell-shaped curve that is symmetrical about the mean, with the most probable scores concentrated around the mean.

t-distribution: Student's t-distribution (or simply the t-distribution) is a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small. The t-test is a data analysis procedure that assesses whether the means of two groups are statistically different from each other.

F-distribution: The F-distribution is the distribution of the ratio of two estimates of variance; it is formed by the ratio of two independent chi-square variables, each divided by its degrees of freedom. The F-test is designed to test whether two population variances are equal, by comparing their ratio.

Chi-square distribution: It is the distribution of the sum of the squares of a set of variables, each of which has a normal distribution and is expressed in standardized units. The chi-square test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.

Degrees of freedom: Degrees of freedom is the number of independent values entering into a statistical measure or frequency distribution.
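The idea of a sampling distribution can be made concrete with a small simulation. This sketch (standard library only) draws repeated samples from a deliberately skewed, invented population and shows that the sample means cluster around the population mean in a roughly normal shape:

```python
import random
import statistics

random.seed(1)
# An invented, right-skewed population (exponential, mean about 1.0)
population = [random.expovariate(1.0) for _ in range(100_000)]

# The sampling distribution of the mean: 2,000 samples of size 50
sample_means = [
    statistics.mean(random.sample(population, 50))
    for _ in range(2_000)
]

# The means cluster near the population mean (about 1.0), with spread
# close to the standard error, even though the population is skewed.
print(statistics.mean(sample_means), statistics.stdev(sample_means))
```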

Probability: Probability is the likelihood or chance of an event occurring.

Probability = (number of favorable outcomes) / (total number of possible outcomes)
For example, the probability of flipping a coin and getting heads is 1/2.

Independent and mutually exclusive events: Two events are independent if they have no influence on each other. Two events are mutually exclusive (or disjoint) if it is impossible for them to occur together.

Central limit theorem: The theorem states that the sampling distribution of the sample mean will be normal or nearly normal if the sample size is large enough.

Factor analysis: It is a statistical technique that uses correlations between variables to determine the underlying dimensions represented by the variables.

ANOVA: Analysis of variance (ANOVA) is a collection of statistical models and their associated procedures which compare means by splitting the overall observed variance into different parts.

Odds ratio: The odds of an event are the ratio of the probability that the event occurs to the probability that it does not occur; the odds ratio is the ratio of the odds in one group to the odds in another. The odds ratio is a measure of effect size, describing the strength of association or non-independence between two binary data values.

Dummy variable: Dummy variable refers to the technique of using a dichotomous variable (coded 0 or 1) to represent the separate categories of a nominal-level measure. In regression analysis we sometimes need to modify the form of non-numeric variables, for example sex or marital status, to allow their effects to be included in the regression model. This can be done through the creation of dummy variables, whose role is to identify each level of the original variable separately.

Index number: An index number is a number indicating change in magnitude, as of price, wage, employment, or production shifts, relative to the magnitude at a specified point, usually taken as 100. Index numbers are used especially to compare business activity, the cost of living, and employment. For example, if a commodity cost twice as much in 1970 as it did in 1960, its index number would be 200 relative to 1960.

Contingency table: It is a method of displaying a frequency distribution as a table with rows and columns, showing how two variables may be dependent (contingent) upon each other.
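As a sketch tying together the last few ideas, the snippet below builds a hypothetical 2x2 contingency table, computes the odds ratio by hand, and runs a chi-square test with scipy.stats.chi2_contingency; all counts are invented for illustration:

```python
from scipy.stats import chi2_contingency

#                 outcome   no outcome
table = [[30, 70],    # exposed group
         [15, 85]]    # unexposed group

(a, b), (c, d) = table
odds_ratio = (a * d) / (b * c)   # (30*85)/(70*15), about 2.43

# Chi-square test of independence between the two binary variables
chi2, p, dof, expected = chi2_contingency(table)

print(f"odds ratio = {odds_ratio:.2f}, chi2 = {chi2:.2f}, p = {p:.4f}")
# An odds ratio above 1 and a p-value below 0.05 here would suggest
# the two variables are associated (not independent).
```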
