
Facilitator Guide SSC/ Q2101 Associate Analytics

ASSOCIATE ANALYTICS
Appendix : Additional Practice
Questions and Solutions

This Student Handbook for the Associate Analytics program contains
detailed study notes and practice sessions for the program.


Table of Contents
Appendix to Associate Analytics: Additional Practice Questions and
Solutions to Additional Practice Questions

CORE CONTENT
Basic R Programming (Module 1 Unit 1)
Summarizing Data (Module 1 Unit 1)
Probability Theories (Module 1 Unit 1)
Big Data Analytics (Module 2 Unit 2)
Linear Regression (Module 3 Unit 1)
Logistic Regression (Module 3 Unit 2)
Time Series Modelling (Module 3 Unit 3)



Basic R Programming: Solve the below questions:

1. Use R Studio to calculate the summation of 102473 and 239904.
2. Now, import the CSV file Class.txt using R.
3. Export the data frame Iris to the CSV file Iris.txt.
4. Create a new variable Ratio_Sepal_Petal with the ratio of Sepal Length to Petal Length.
5. Check for outliers in the Iris data frame.
6. Create a new data set by combining 2 data sets.
7. Check for missing data in Iris_new and remove the missing data.

R Studio looks like this:

Summation of 2 nos. - 102473 and 239904


Syntax for summation. Similarly, we can multiply and divide.
The solution, or the output.
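Since the screenshot is not reproduced here, the summation can be sketched at the R console as follows. The variable names are only illustrative.

```r
# Add the two numbers from question 1
total <- 102473 + 239904
print(total)

# Multiplication and division follow the same pattern
product  <- 102473 * 239904
quotient <- 239904 / 102473
print(product)
print(quotient)
```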

Importing Class.txt file in R Studio

The output can be seen in the environment window of R Studio.
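A minimal sketch of the import step is given below. The contents of Class.txt are not shown in this guide, so the sketch first writes a small hypothetical file (the columns Name, Age and Height are assumptions) to a temporary folder and then reads it back with read.csv().

```r
# Hypothetical stand-in for Class.txt; the real file's columns are not shown in the guide
class_path <- file.path(tempdir(), "Class.txt")
writeLines(c("Name,Age,Height",
             "Asha,13,56.5",
             "Ravi,14,62.8"), class_path)

# read.csv() reads comma-separated text regardless of the file extension
Class <- read.csv(class_path, header = TRUE)
str(Class)
```

With the real file, only the path passed to read.csv() would change.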

Export Data from R Studio


We use the write.csv() function to export data as a CSV file. The iris data set is available in R by default.
Before exporting, we need to set our working directory using the setwd() function.
Setting working directory


We can check the current working directory using the getwd() command. Here we have set
F:\NASSCOM as the working directory.

Now, to export the iris data set to the working directory, we use the write.csv() command.

To check whether the file was created, we look in the working directory.

The file looks like this:
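The export steps above can be sketched as follows. As an assumption, a temporary folder stands in for F:\NASSCOM so that the example runs on any machine.

```r
# Assumption: a temporary folder replaces F:\NASSCOM from the guide
old_wd <- setwd(tempdir())   # setwd() returns the previous working directory
getwd()                      # confirm the current working directory

# Export the built-in iris data frame as comma-separated text
write.csv(iris, file = "Iris.txt", row.names = FALSE)

exported <- file.exists("Iris.txt")   # TRUE if the file was written
head(read.csv("Iris.txt"), 3)         # look at the first rows of the exported file

setwd(old_wd)                # restore the original working directory
```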


Creating a new variable Ratio_Sepal_Petal with ratio of Sepal Length and Petal Length

New variable is created
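One way to create the ratio variable is sketched below. It works on a copy named iris_ratio (a hypothetical name) so that the built-in iris data frame stays unchanged.

```r
# Work on a copy so the built-in iris data frame is not modified
iris_ratio <- iris

# Ratio of sepal length to petal length for every flower
iris_ratio$Ratio_Sepal_Petal <- iris_ratio$Sepal.Length / iris_ratio$Petal.Length

head(iris_ratio)
```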


Checking for Outliers in the Iris data frame


Using the summary() command we can see a basic descriptive overview of the data. We also use boxplot() to see
any outliers in the data.

Here in Sepal.Width we see there are 3 outliers but they are not very significant.
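A possible version of the outlier check is sketched here. The use of boxplot.stats() to list the flagged values is an addition for illustration and is not taken from the guide.

```r
# Descriptive overview of every column
summary(iris)

# Box plots of the four numeric measurements; points beyond the whiskers are flagged as outliers
boxplot(iris[, 1:4], main = "Iris measurements")

# List the Sepal.Width values that the box plot flags as outliers
sepal_width_out <- boxplot.stats(iris$Sepal.Width)$out
sepal_width_out
```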
Creating a new data set by combining 2 data sets Iris and Iris_miss
We use rbind() to append rows and cbind() to append columns when combining data sets.


We have added 3 more entries to iris data set.
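The row-wise combination can be sketched as below. The contents of Iris_miss are not shown in the guide, so the three rows used here (including the NA values) are hypothetical.

```r
# Hypothetical Iris_miss: three extra rows, some with missing values (NA)
Iris_miss <- data.frame(
  Sepal.Length = c(5.0, NA, 6.1),
  Sepal.Width  = c(3.4, 3.0, NA),
  Petal.Length = c(1.5, 4.5, 5.6),
  Petal.Width  = c(0.2, 1.5, 2.0),
  Species      = factor(c("setosa", "versicolor", "virginica"),
                        levels = levels(iris$Species))
)

# rbind() appends rows; both data frames must have the same column names
Iris_new <- rbind(iris, Iris_miss)
dim(Iris_new)
tail(Iris_new, 3)
```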


We can also use the merge() command to combine data sets.
Practice the above example using the merge() command.
Checking for missing data in Iris_new and removing missing data
We use the is.na() command to see whether there is any missing data in the data set.


TRUE means a value is missing, while FALSE means it is present.
Another, more efficient way to find missing values is given below:

To remove missing data we use the na.omit() command.

There are no missing values now.
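The missing-data steps can be sketched as follows. Iris_new is rebuilt here with a few hypothetical NA entries, and colSums(is.na()) is shown as one compact way to count missing values; it may differ from the approach in the original screenshot.

```r
# Hypothetical Iris_new: iris with a few values blanked out
Iris_new <- iris
Iris_new[c(3, 77), "Sepal.Length"] <- NA
Iris_new[120, "Petal.Width"] <- NA

head(is.na(Iris_new))        # element-wise TRUE/FALSE table
colSums(is.na(Iris_new))     # number of missing values per column
anyNA(Iris_new)              # TRUE if anything is missing

# Drop every row that contains at least one missing value
Iris_clean <- na.omit(Iris_new)
anyNA(Iris_clean)
```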


Summarizing Data

A researcher wants to understand the data he has collected about 3 species of flowers.
He wants the following:

1. The summary of the 150 flower records, including Sepal Length, Sepal Width, Petal Length and
   Petal Width. He also wants the summary of Sepal Length vs Petal Length.
2. He wants to understand the mean Petal Length of each species.
3. He wants to segregate the data of flowers having Sepal Length greater than 7.
4. He wants to segregate the data of flowers having Sepal Length greater than 7 and Sepal
   Width greater than 3 simultaneously.
5. He wants to view the first 7 rows of data.
6. He wants to view the first 3 rows and first 3 columns of data.

To summarize data in R Studio we mainly use two functions, summary() and aggregate().
Using the summary() command:

We get Min, Max, 1st Quartile, 3rd Quartile, Median and Mean as the output of the summary() command.
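A short sketch of the summary() calls for question 1 is given below; the object name sepal_summary is only illustrative.

```r
# Summary of all columns of the 150 flower records
summary(iris)

# Summary restricted to Sepal Length and Petal Length
summary(iris[, c("Sepal.Length", "Petal.Length")])

# Summary of a single column, stored for later use
sepal_summary <- summary(iris$Sepal.Length)
sepal_summary
```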
To apply a chosen function (such as the mean) to each group in the data we use the aggregate() command.
Using the aggregate() command:


In the above example, we have calculated the mean sepal length of the different species. Similarly, we can
calculate other statistics such as frequency, median, sum, etc.
For more details on the arguments of aggregate(), use ?aggregate to open the help page.
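The grouped calculation can be sketched with the formula interface of aggregate(). The example covers both the mean sepal length described above and the mean petal length asked for in question 2.

```r
# Mean sepal length of each species
mean_sepal <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
mean_sepal

# Mean petal length of each species (question 2)
mean_petal <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
mean_petal

# Any other summary function can be passed through FUN, for example the median
aggregate(Petal.Length ~ Species, data = iris, FUN = median)
```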
We also use the subset() function to form subsets of data.
Using the subset() command:

When we need more than one condition, we combine them with &, as shown below:

To keep only the required columns, we use the select argument of subset():
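The three uses of subset() described above can be sketched as follows. The object names and the columns chosen for select are illustrative.

```r
# Flowers with sepal length greater than 7 (question 3)
long_sepal <- subset(iris, Sepal.Length > 7)

# Two conditions combined with & (question 4)
long_wide <- subset(iris, Sepal.Length > 7 & Sepal.Width > 3)

# Same row filter, keeping only two columns through the select argument
long_cols <- subset(iris, Sepal.Length > 7, select = c(Sepal.Length, Species))

long_sepal
long_wide
long_cols
```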


To subset data purely by row and column position, without any condition, we use square brackets [].
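A brief sketch for questions 5 and 6: head() is used for the first rows as a convenience, and square brackets select by position.

```r
# First 7 rows of the data (question 5)
first7 <- head(iris, 7)      # equivalent to iris[1:7, ]
first7

# First 3 rows and first 3 columns (question 6): [rows, columns]
first3x3 <- iris[1:3, 1:3]
first3x3
```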


Probability Theories:

1. If you throw a die 20 times, what is the probability that you get the following
   results:
   a. 3 sixes
   b. 6 sixes
   c. 1, 2 and 3 sixes
2. In the Iris data set, check whether Sepal Length is normally distributed or not.
3. Prove that the population mean of Sepal Length is significantly different from the mean
   of the first 10 observations.
4. Do an ANOVA test of 3 different data sets which are subsets of Sepal Length.
5. Create a random walk plot of 100 inputs starting at t=10secs.

The probability of getting exactly 3 sixes in 20 throws is given by the dbinom() function, as shown:

The probability of getting exactly 6 sixes in 20 throws is given below:

So we see that the probability of getting 6 sixes in 20 trials is less than the probability of getting 3 sixes in
the same number of trials.

The cumulative probability of getting up to 3 sixes in 20 trials is found using the pbinom() command.
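The binomial calculations can be sketched as below, assuming a fair die so that the chance of a six on each throw is 1/6. Part c is read here as "between 1 and 3 sixes", which is an interpretation of the question.

```r
p_six <- 1 / 6   # assumption: fair die

# Probability of exactly 3 and exactly 6 sixes in 20 throws
p3 <- dbinom(3, size = 20, prob = p_six)
p6 <- dbinom(6, size = 20, prob = p_six)
p3
p6

# Cumulative probability of at most 3 sixes, P(X <= 3)
p_upto3 <- pbinom(3, size = 20, prob = p_six)
p_upto3

# Probability of 1, 2 or 3 sixes, P(1 <= X <= 3)
p_1to3 <- sum(dbinom(1:3, size = 20, prob = p_six))
p_1to3
```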

To check whether Sepal Length is normally distributed we use 2 commands: qqnorm() and qqline().


qqnorm() plots the sample quantiles of the data against the theoretical normal quantiles, while qqline() draws
the line on which the points would lie if the data were normally distributed. Deviation of the points from the
line indicates that the data are not normally distributed.

In the example below, Sepal Length is almost normally distributed.
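A minimal sketch of the normality check follows; the plot title and line colour are cosmetic choices, not part of the guide.

```r
# Q-Q plot: sample quantiles of Sepal.Length against theoretical normal quantiles
qq <- qqnorm(iris$Sepal.Length, main = "Normal Q-Q plot of Sepal.Length")

# Reference line through the first and third quartiles
qqline(iris$Sepal.Length, col = "red")
```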

T-Test of sample subset of Iris data set.


Here the p-value is much less than 0.05, so we reject the null hypothesis and accept the alternative
hypothesis, which says that the mean of the sample is less than the population mean.
Also, the sample mean is 4.86 and the degrees of freedom are 9, which is the sample size minus 1.
Similarly, we can do a two-sided test by writing alternative = "two.sided", and a paired-sample t-test by
passing paired = TRUE as an argument.
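The one-sample t-test described above can be sketched as follows, taking the mean of all 150 Sepal.Length values as the population mean. The object names are illustrative.

```r
sample10 <- iris$Sepal.Length[1:10]   # first 10 observations
pop_mean <- mean(iris$Sepal.Length)   # treated as the population mean

# One-sided test: is the sample mean less than the population mean?
tt_less <- t.test(sample10, mu = pop_mean, alternative = "less")
tt_less

# Two-sided version of the same test
tt_two <- t.test(sample10, mu = pop_mean, alternative = "two.sided")
tt_two
```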
ANOVA of Iris data with its own subset.


From the ANOVA of the above data we see that the degrees of freedom of the independent variable are 2 and the
F value is 2.447.
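The guide does not show which three subsets of Sepal Length were compared, so the sketch below draws three hypothetical random subsets of 30 values each. Its F value will therefore differ from the 2.447 reported above, but with three groups the degrees of freedom of the grouping variable are again 2.

```r
set.seed(1)   # reproducible hypothetical subsets

# Three hypothetical subsets of Sepal.Length, stacked with a group label
groups <- data.frame(
  Sepal.Length = c(sample(iris$Sepal.Length, 30),
                   sample(iris$Sepal.Length, 30),
                   sample(iris$Sepal.Length, 30)),
  subset_id = factor(rep(c("A", "B", "C"), each = 30))
)

# One-way ANOVA: does mean Sepal.Length differ between the subsets?
fit_aov <- aov(Sepal.Length ~ subset_id, data = groups)
anova_table <- summary(fit_aov)
anova_table
```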
To create a random walk for 100 trials starting at t=10.
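The code that builds y is not shown in the guide. One possible construction, assuming Gaussian steps and reading "starting at 10" as the starting level of the walk, is sketched here; the plot call from the guide then draws it.

```r
set.seed(42)            # reproducible steps
steps <- rnorm(100)     # assumption: 100 standard normal increments

# Random walk: running sum of the steps, shifted to start near 10
y <- 10 + cumsum(steps)
head(y)
```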

> plot(y,type='l')


Big Data Analytics:


Integrate HBase and HIVE with R Tool to further use Hadoop with R.

In this section we deal with connecting HIVE and HBase with the R tool.
To connect HIVE with R we use the RHive package, which can be installed in 3 ways:

To integrate HBase with R we install the rhbase package. We can also use the open-source Revolution R to
work with a Hadoop system.
Below is a list of software used for this setup.

OS and other tools:
o Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0
Hadoop and HBase:
o Hadoop 1.1.2, HBase 0.94.17
R and RHadoop packages:
o R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0


Linear Regression:

1. Generate a simple linear regression equation in two variables of the cats dataset. The
   two variables are Heart Weight and Body Weight of the cats being examined in
   the research.
2. Also find out if there is any relation between Heart Weight and Body Weight.
3. Now check if Heart Weight is affected by any other factor or variable.
4. Find out how Heart Weight is affected by Body Weight and Sex together using
   Multiple Regression.

Below is a very simple example of a simple regression equation:

The coefficient of correlation is 0.804, which shows that there is a strong positive relation between Hwt and
Bwt.
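The correlation can be checked with a short sketch, assuming the cats data set is the one shipped in the MASS package (columns Sex, Bwt and Hwt).

```r
library(MASS)   # assumption: cats comes from the MASS package

# Correlation between body weight (kg) and heart weight (g)
r <- cor(cats$Bwt, cats$Hwt)
r

# Scatter plot to inspect the relation visually
plot(Hwt ~ Bwt, data = cats, xlab = "Body weight", ylab = "Heart weight")
```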

The equation formed will be:

Hwt = 4.03 × Bwt - 0.357

QUICK TIPS:-

So we can say that about 65% of the variation in Heart Weight can be explained by the model.

The p-value is less than 0.05, which means we reject the null hypothesis. The residual degrees of freedom are 142.
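The simple regression results quoted above (slope, intercept, R-squared and degrees of freedom) can be reproduced with a sketch like the following, again assuming the cats data set from the MASS package.

```r
library(MASS)   # assumption: cats comes from the MASS package

# Simple linear regression of heart weight on body weight
fit <- lm(Hwt ~ Bwt, data = cats)
summary(fit)

coef(fit)                           # intercept and slope of the fitted line
r2 <- summary(fit)$r.squared        # share of variation explained
df_resid <- df.residual(fit)        # residual degrees of freedom
r2
df_resid
```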
For other examples use the link: http://www.ats.ucla.edu/stat/r/dae/rreg.htm
Also refer to the book: - Practical Regression and Anova using R
Now to create a linear model of effect of Body Weight and Sex on Heart Weight we use multiple
regression modeling.

So we can say that about 65% of the variation in Heart Weight can be explained by the model.
The equation becomes Hwt = 4.07 Bwt - 0.08 Sex - 0.41, where Sex is an indicator (dummy) variable for the cat's sex.
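A sketch of the multiple regression follows, assuming the cats data set from the MASS package. R encodes the factor Sex as a dummy variable, which appears in the output as SexM.

```r
library(MASS)   # assumption: cats comes from the MASS package

# Multiple regression: heart weight on body weight and sex
fit2 <- lm(Hwt ~ Bwt + Sex, data = cats)
summary(fit2)

coef(fit2)                      # intercept, Bwt slope and SexM dummy coefficient
summary(fit2)$r.squared         # share of variation explained
```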


Logistic Regression:
Import a data set from web storage and name it. Then fit a logistic regression to
find the relation between the variables that affect a student's admission to a
college: GRE score, GPA obtained and Rank of the student. Also
check whether your model fits well.


In the output above, the first thing we see is the call: this is R reminding us what model we
ran, what options we specified, etc.

Next we see the deviance residuals, which are a measure of model fit. This part of the output shows
the distribution of the deviance residuals for the individual cases used in the model. Below we
discuss how to use summaries of the deviance statistic to assess model fit.

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes
called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically
significant, as are the three terms for rank. The logistic regression coefficients give the change in
the log odds of the outcome for a one unit increase in the predictor variable.
o For every one unit change in gre, the log odds of admission (versus non-admission)
  increases by 0.002.
o For a one unit increase in gpa, the log odds of being admitted to graduate school
  increases by 0.804.
o The indicator variables for rank have a slightly different interpretation. For example,
  having attended an undergraduate institution with a rank of 2, versus an institution with
  a rank of 1, changes the log odds of admission by -0.675.
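The model call itself is not reproduced in this guide, and the web location of the data is not given. The sketch below therefore fits the same kind of model on synthetic data: the data frame admissions, its generated values and the coefficients used to simulate it are all hypothetical, so the output will not match the numbers quoted above.

```r
set.seed(123)   # reproducible synthetic data
n <- 400

# Hypothetical admissions data with the same variable names as in the text
admissions <- data.frame(
  gre  = round(runif(n, 300, 800)),
  gpa  = round(runif(n, 2.2, 4.0), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
log_odds <- with(admissions,
                 -4 + 0.002 * gre + 0.8 * gpa - 0.5 * (as.integer(rank) - 1))
admissions$admit <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: rank is a factor, so R creates three indicator terms
logit_fit <- glm(admit ~ gre + gpa + rank, data = admissions, family = binomial)
summary(logit_fit)

exp(coef(logit_fit))   # coefficients expressed as odds ratios

# Overall fit: likelihood-ratio test of the model against the intercept-only model
p_model <- with(logit_fit,
                pchisq(null.deviance - deviance, df.null - df.residual,
                       lower.tail = FALSE))
p_model
```

A small p_model indicates that the model with predictors fits significantly better than a model with only an intercept.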

Refer to http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/ for more examples on
Logistic Regression.
Quick Tip:-
Refer to http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html for more advanced examples
of Logistic Regression using R.

Time Series Analysis:

1. Install the Time Series Package.
2. Use the monthly milk prod data set for time series analysis and predict the
   production for the next 5 years.


There is no seasonal component to be removed, so we move ahead with ARIMA modeling and finally
check the fit of the model.


In time series analysis our basic aim is to reduce the error terms and obtain a more accurate forecast using
time-varying data.
We get the forecast using the intercept and standard error values. We use the tseries package for time series analysis.
For more help, use ?arima to get a better understanding.


Plot of y versus t.


Plot of d.y

The autocorrelation function (ACF) shows a seasonal component, so we will next look at the partial
autocorrelation function (PACF) plot.


The PACF plot shows that more than one lag in the data is significant. So now we will
check the ACF and PACF of the differenced variable d.y.
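The plotting and differencing steps can be sketched as below. The milk production file is not included in this guide, so y is a synthetic monthly series (168 hypothetical values) and the plots will not look like the originals; only the sequence of calls is illustrated.

```r
set.seed(7)   # reproducible synthetic series

# Hypothetical stand-in for the milk production series: an integrated AR(1) process
t <- 1:168
y <- 600 + cumsum(as.numeric(arima.sim(list(ar = 0.5), n = 168, sd = 5)))

plot(t, y, type = "l", main = "y versus t")

# First difference of the series
d.y <- diff(y)
plot(d.y, type = "l", main = "d.y")

# ACF and PACF of the original and the differenced series
acf(y)
pacf(y)
acf(d.y)
pacf(d.y)
```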


Now we will convert the data frame into a time series data set, as this helps us in time series analysis and
forecasting.

This is another way of representing the time span. It shows which part of the year each observation
refers to, and this is also how R stores the dates of time series data.
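A sketch of the conversion with ts() is given below. The data frame milk, its column production, the values and the start date of January 1962 are all hypothetical, chosen only to illustrate a monthly series.

```r
set.seed(7)

# Hypothetical data frame with 168 monthly production values
milk <- data.frame(production = 600 + 1.5 * (1:168) + rnorm(168, sd = 10))

# Convert to a monthly time series; the start date is an assumption
milk_ts <- ts(milk$production, start = c(1962, 1), frequency = 12)

start(milk_ts)        # first period as (year, month)
end(milk_ts)          # last period as (year, month)
head(time(milk_ts))   # R stores each month as a fractional year
```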


We have now built different MA, ARMA and ARIMA models.
All these models are used to verify and compare goodness of fit.

Now we will plot the forecast for the next 5 years.
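The model fitting and the 5-year forecast can be sketched with arima() and predict() from base R. The series, its start date and the model orders are hypothetical; the orders used in the guide are not shown, so three candidate orders on the differenced series are compared by AIC purely for illustration.

```r
set.seed(7)

# Hypothetical monthly series standing in for the milk production data
milk_ts <- ts(600 + cumsum(as.numeric(arima.sim(list(ar = 0.5), n = 168, sd = 5))),
              start = c(1962, 1), frequency = 12)

# Candidate models (orders are assumptions, all with one difference)
fit_ma    <- arima(milk_ts, order = c(0, 1, 1))
fit_ar    <- arima(milk_ts, order = c(1, 1, 0))
fit_arima <- arima(milk_ts, order = c(1, 1, 1))

# Lower AIC suggests a better trade-off between fit and complexity
AIC(fit_ma, fit_ar, fit_arima)

# Forecast the next 5 years (60 months) from one of the models
fc <- predict(fit_arima, n.ahead = 60)

# Observed series as a solid line, forecast as a dashed line
ts.plot(milk_ts, fc$pred, lty = c(1, 2))
```

predict() also returns fc$se, the standard errors of the forecast, which can be used to draw approximate prediction bands.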



In the forecast plot, the predicted milk production is shown as dotted lines, which suggests that milk production
has reached its optimum level and will stay the same for the next 5 years.
