ASSOCIATE ANALYTICS

Appendix: Additional Practice Questions and Solutions

Associate Analytics program contains detailed study notes and practice sessions for the Associate Analytics program.


Analytics: Additional Practice Questions and Solutions to Additional Practice Questions

CORE CONTENT

Basic R Programming (Module 1 Unit 1)
Summarizing Data (Module 1 Unit 1)
Probability Theories (Module 1 Unit 1)
Big Data Analytics (Module 2 Unit 2)
Linear Regression (Module 3 Unit 1)
Logistic Regression (Module 3 Unit 2)
Time Series Modelling (Module 3 Unit 3)

Basic R Programming:

2. Now, import the CSV file Class.txt using R.
3. Export the data frame Iris to the CSV file Iris.txt.
4. Create a new variable Ratio_Sepal_Petal with the ratio of Sepal Length and Petal Length.
5. Check for outliers in the Iris data frame.
6. Create a new data set by combining 2 data sets.
7. Check for missing data in Iris_new and remove the missing data.

multiply and divide also.

The solution or the output

We use write.csv() to export a data frame as a CSV file. The iris data set ships with R by default.

Before exporting, we need to set the working directory with setwd().

Setting working directory

We can check the current working directory with the getwd() command. Here we have set F:\NASSCOM as the working directory.

To export the iris data set to the working directory, we use the write.csv() command.

To check whether the file was created, we look in the working directory.
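The export steps can be sketched as follows. This is a minimal sketch, assuming a temporary folder in place of the F:\NASSCOM directory used in the notes; the `row.names = FALSE` argument and the read-back check are additions for illustration.

```r
# Assumption: a temporary folder stands in for the notes' F:\NASSCOM directory.
old_wd <- setwd(tempdir())    # setwd() returns the previous directory
getwd()                       # confirm the current working directory

# Write the built-in iris data frame to Iris.txt in CSV format.
write.csv(iris, file = "Iris.txt", row.names = FALSE)

# Confirm the file exists, then read it back as a check.
file.exists("Iris.txt")
iris_back <- read.csv("Iris.txt")
dim(iris_back)

setwd(old_wd)                 # restore the original working directory
```

Without `row.names = FALSE`, write.csv() adds the row names as an extra first column.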


Creating a new variable Ratio_Sepal_Petal with ratio of Sepal Length and Petal Length
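One way to create the new variable is shown below. It is a sketch that works on a copy named `iris_ratio` (a name chosen here for illustration) so that the built-in data frame stays unchanged.

```r
# Work on a copy so the built-in iris data frame is left untouched.
iris_ratio <- iris

# New column: ratio of Sepal Length to Petal Length, row by row.
iris_ratio$Ratio_Sepal_Petal <- iris_ratio$Sepal.Length / iris_ratio$Petal.Length

head(iris_ratio, 3)
```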

Checking for Outliers in the Iris data frame


Using the summary() command we can see a basic descriptive overview of the data. We also use boxplot() to spot any outliers in the data.

Here the boxplot for Sepal.Width flags four points as outliers (three high and one low), but none of them is extreme.
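The outlier check can be sketched as below. The use of boxplot.stats() to list the flagged values is an addition here, not something the notes show.

```r
if (!interactive()) pdf(NULL)   # when run as a script, draw to a null device

summary(iris)                   # min, quartiles, median, mean, max per column
boxplot(iris[, 1:4], main = "Iris measurements")

# boxplot.stats() returns the values the boxplot draws as individual points,
# i.e. those more than 1.5 times the box height away from the box.
out <- boxplot.stats(iris$Sepal.Width)$out
out
```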

Creating a new data set by combining 2 data sets Iris and Iris_miss

We use rbind() to append rows and cbind() to append columns when combining data sets.

We can also use the merge() command to combine data sets on a common column.

Practice the above example using the merge() command.
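The three functions can be sketched as follows. The contents of `Iris_miss`, the `ids` column and the `lookup` table are hypothetical, made up here because the original data sets are not shown.

```r
# Hypothetical Iris_miss: ten iris rows with a few values blanked out.
Iris_miss <- iris[1:10, ]
Iris_miss$Sepal.Length[c(2, 5)] <- NA
Iris_miss$Petal.Width[7] <- NA

# rbind() stacks rows; both data frames must have the same columns.
Iris_new <- rbind(iris, Iris_miss)
dim(Iris_new)

# cbind() places columns side by side; both inputs need the same row count.
ids <- data.frame(ID = seq_len(nrow(Iris_new)))
Iris_new_id <- cbind(ids, Iris_new)

# merge() joins on a shared key, here a hypothetical lookup keyed on Species.
lookup <- data.frame(Species = c("setosa", "versicolor", "virginica"),
                     Code = c("SE", "VE", "VI"))
Iris_coded <- merge(Iris_new, lookup, by = "Species")
head(Iris_coded, 3)
```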

Checking for missing data in Iris_new and removing missing data

We use the is.na() command to see whether a data set contains any missing data.

TRUE means the value is missing, while FALSE means it is present.

Another more efficient way to find missing values is given below:
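A compact way to count and remove missing values is sketched below. The `Iris_new` built here is a hypothetical stand-in, and colSums(), complete.cases() and na.omit() are common base R choices that may differ from what the original screenshot showed.

```r
# Hypothetical Iris_new: iris plus three rows, two of which contain an NA.
extra <- iris[1:3, ]
extra$Sepal.Length[1] <- NA
extra$Petal.Width[3] <- NA
Iris_new <- rbind(iris, extra)

head(is.na(Iris_new))            # TRUE marks a missing cell
colSums(is.na(Iris_new))         # number of missing values per column
sum(!complete.cases(Iris_new))   # rows with at least one missing value

Iris_clean <- na.omit(Iris_new)  # drop every row that contains an NA
anyNA(Iris_clean)
```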


Summarizing Data

A researcher wants to understand the data he has collected on 3 species of flowers. He wants the following:

1. The summary of the 150 flower records, including Sepal Length, Sepal Width, Petal Length and Petal Width. He also wants the summary of Sepal Length vs Petal Length.
2. He wants to understand the mean Petal Length of each species.
3. He wants to segregate the data of flowers having Sepal Length greater than 7.
4. He wants to segregate the data of flowers having Sepal Length greater than 7 and Sepal Width greater than 3 simultaneously.
5. He wants to view the first 7 rows of the data.
6. He wants to view the first 3 rows and first 3 columns of the data.

To summarize data in R we mainly use two functions, summary() and aggregate().

Using summary() command:

The summary() command returns the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each numeric column.
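A short sketch of summary() on the built-in iris data; the choice of columns in the second and third calls is illustrative.

```r
summary(iris)                                        # every column
summary(iris[, c("Sepal.Length", "Petal.Length")])   # the two lengths only

# For a single numeric vector the result is a named vector of six statistics.
s <- summary(iris$Sepal.Length)
names(s)
```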

To apply one or more summary functions to groups within the data, we use the aggregate() command.

Using aggregate() command:

In the above example, we have calculated the mean sepal length of the different species. Similarly, we can calculate other summaries such as frequency, median and sum.

For more details on the arguments of aggregate(), use ?aggregate to get help.
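The grouped summaries can be sketched with the formula interface of aggregate(). The first call answers the question about mean Petal Length per species; the other two are illustrative variations.

```r
# Mean Petal.Length for each species.
mean_petal <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
mean_petal

# Several columns at once, with the median instead of the mean.
aggregate(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris, FUN = median)

# Frequency (number of rows) for each species.
aggregate(Sepal.Length ~ Species, data = iris, FUN = length)
```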

We also use the subset() function to form subsets of data.

Using subset() command:

When we need more than one condition, we combine them with & as shown below.

To keep only the columns we need, we use the select argument of subset():
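The three uses of subset() can be sketched as follows; the object names and the columns kept by `select` are chosen here for illustration.

```r
# One condition: flowers with Sepal.Length greater than 7.
long_sepal <- subset(iris, Sepal.Length > 7)

# Two conditions joined with &: both must hold for a row to be kept.
long_wide <- subset(iris, Sepal.Length > 7 & Sepal.Width > 3)

# select keeps only the named columns of the rows that pass the condition.
long_cols <- subset(iris, Sepal.Length > 7, select = c(Sepal.Length, Species))

nrow(long_sepal)
nrow(long_wide)
head(long_cols)
```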


For subsetting data without any condition, based only on row and column positions, we use square brackets [].
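Positional indexing takes the form `data[rows, columns]`. A minimal sketch for the two views asked for in the questions:

```r
first7 <- iris[1:7, ]      # rows 1 to 7, all columns
top3x3 <- iris[1:3, 1:3]   # rows 1 to 3 and columns 1 to 3

first7
top3x3
```

Leaving the position before or after the comma empty selects all rows or all columns.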


Probability Theories:

1. If you throw a die 20 times, what is the probability that you get the following results:
   a. 3 sixes
   b. 6 sixes
   c. 1, 2 and 3 sixes
2. In the Iris data set, check whether Sepal Length is normally distributed or not.
3. Show that the population mean of Sepal Length is significantly different from the mean of the first 10 observations.
4. Do an ANOVA test of 3 different data sets which are subsets of Sepal Length.
5. Create a random walk plot of 100 inputs starting at t = 10 secs.

The probability of getting exactly 3 sixes in 20 throws is given by the dbinom() function, as shown:

The probability of getting exactly 6 sixes in 20 throws is given below:

So we see that the probability of getting 6 sixes in 20 throws is less than the probability of getting 3 sixes in the same number of throws.

The cumulative probability of getting up to 3 sixes in 20 throws is found using the pbinom() command.
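The binomial calculations can be sketched as below, assuming a fair die (probability 1/6 of a six on each throw). Reading part (c) as "1, 2 or 3 sixes" is an interpretation made here.

```r
p_six <- 1 / 6   # assumption: a fair die

# dbinom() gives the probability of exactly k successes in size trials.
p3 <- dbinom(3, size = 20, prob = p_six)
p6 <- dbinom(6, size = 20, prob = p_six)

# Part (c), read as 1, 2 or 3 sixes: add the three exact probabilities.
p_1_to_3 <- sum(dbinom(1:3, size = 20, prob = p_six))

# pbinom() gives the cumulative probability of at most k successes (0 to 3 here).
p_up_to_3 <- pbinom(3, size = 20, prob = p_six)

c(p3 = p3, p6 = p6, p_1_to_3 = p_1_to_3, p_up_to_3 = p_up_to_3)
```

The difference between `p_up_to_3` and `p_1_to_3` is the probability of throwing no six at all.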

To check whether Sepal Length is normally distributed we use two commands, qqnorm() and qqline().

qqnorm() plots the sample quantiles of the data against the quantiles of a normal distribution, while qqline() draws the line on which the points would lie if the data were normally distributed. The deviation of the points from the line shows that the data is not normally distributed.
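The normality check can be sketched as follows. The Shapiro-Wilk test at the end is an addition here, a numerical complement to the plot that the notes do not use.

```r
if (!interactive()) pdf(NULL)   # when run as a script, draw to a null device

x <- iris$Sepal.Length

qqnorm(x, main = "Normal Q-Q plot of Sepal.Length")   # sample vs normal quantiles
qqline(x, col = "red")                                # reference line

# Addition: Shapiro-Wilk test; a small p-value is evidence against normality.
sw <- shapiro.test(x)
sw$p.value
```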


Here the p-value is much less than 0.05, so we reject the null hypothesis and accept the alternative hypothesis, which says that the mean of the sample is less than the population mean (sample mean < population mean).

The sample mean is 4.86 and the degrees of freedom are 9, which is the sample size minus 1.

Similarly, we can do a two-sided test by writing alternative = "two.sided", and a paired-sample t-test by passing paired = TRUE as part of the arguments.
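The one-sample t-test can be sketched as below, treating the mean of all 150 Sepal Length values as the population mean and the first 10 values as the sample. The object names are chosen here for illustration.

```r
pop_mean <- mean(iris$Sepal.Length)   # treated as the population mean
first10 <- iris$Sepal.Length[1:10]    # the sample: first 10 observations

# One-sided test. H0: sample mean equals pop_mean; H1: it is less.
res_less <- t.test(first10, mu = pop_mean, alternative = "less")
res_less

# Two-sided version of the same test.
res_two <- t.test(first10, mu = pop_mean, alternative = "two.sided")
res_two$p.value
```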

ANOVA of Iris data with its own subsets.

From the ANOVA of the above data we see that the degrees of freedom of the independent variable are 2 and the F value is 2.447.
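A one-way ANOVA on three subsets can be sketched as follows. The subsets used in the notes are not shown, so three random samples of 30 values are assumed here; this sketch will therefore not reproduce the F value quoted above, only the 2 degrees of freedom that come from having three groups.

```r
set.seed(42)   # assumption: random subsets, fixed seed for repeatability

# Three hypothetical subsets of Sepal.Length, stacked with a group label.
groups <- data.frame(
  Sepal.Length = c(sample(iris$Sepal.Length, 30),
                   sample(iris$Sepal.Length, 30),
                   sample(iris$Sepal.Length, 30)),
  subset_id = factor(rep(c("A", "B", "C"), each = 30))
)

# One-way ANOVA: does mean Sepal.Length differ between the subsets?
fit <- aov(Sepal.Length ~ subset_id, data = groups)
tab <- summary(fit)[[1]]   # table with Df, Sum Sq, Mean Sq, F value, Pr(>F)
tab
```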

To create a random walk for 100 trials starting at t=10.

> plot(y,type='l')
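How `y` was built is not shown, so the following is only a sketch under stated assumptions: the steps are standard normal, the walk is their running sum, and "starting at t = 10" is read as the time axis beginning at 10 seconds.

```r
if (!interactive()) pdf(NULL)   # when run as a script, draw to a null device

set.seed(1)                     # fixed seed so the walk is repeatable
time_s <- 10:109                # assumption: 100 time points from t = 10
steps <- rnorm(100)             # assumption: standard-normal increments
y <- cumsum(steps)              # the walk is the running sum of the steps

plot(time_s, y, type = "l", xlab = "t (secs)", ylab = "y")
```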


Big Data Analytics:

Integrate HBase and Hive with R to further use Hadoop with R.

In this section we deal with connecting Hive and HBase with R.

For connecting Hive with R we use the RHive package, which can be installed in 3 ways:

For integrating HBase with R we install the RHBase package. We can also use Revolution R open source to work with a Hadoop system.

Below is a list of software used for this setup.

o Mac OS X 10.6.8, Java 1.6.0_65, Homebrew, thrift 0.9.0

Hadoop and HBase:

o Hadoop 1.1.2, HBase 0.94.17

R and RHadoop packages:

o R 3.1.0, rhdfs 1.0.8, rmr2 3.1.0, plyrmr 0.2.0, rhbase 1.2.0


Linear Regression:

1. Generate a simple linear regression equation in two variables of the cats dataset. The two variables are Heart Weight and Body Weight of the cats being examined in the research.
2. Also find out if there is any relation between Heart Weight and Body Weight.
3. Now check if Heart Weight is affected by any other factor or variable.
4. Find out how Heart Weight is affected by Body Weight and Sex together using Multiple Regression.

The coefficient of correlation is 0.804, which shows that there is a strong positive relation between Hwt and Bwt.
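The correlation can be checked as below. This sketch assumes the cats data set is the one in the MASS package (columns Sex, Bwt and Hwt); the cor.test() call is an addition for illustration.

```r
library(MASS)   # assumption: cats is the data set shipped with MASS

# Pearson correlation between body weight (kg) and heart weight (g).
r <- cor(cats$Bwt, cats$Hwt)
r

# Addition: test whether the correlation differs from zero.
cor.test(cats$Bwt, cats$Hwt)
```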


QUICK TIPS:-

So we can say that about 65% of the variation in Heart Weight can be explained by the model.

The p-value is less than 0.05, which means we reject the null hypothesis. The residual degrees of freedom are 142.
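The simple regression behind these figures can be sketched as follows, again assuming the cats data set from the MASS package.

```r
library(MASS)   # assumption: cats is the data set shipped with MASS

# Simple linear regression of heart weight on body weight.
fit <- lm(Hwt ~ Bwt, data = cats)
s <- summary(fit)

coef(s)            # estimates, standard errors, t values and p-values
s$r.squared        # share of variation in Hwt explained by the model
df.residual(fit)   # residual degrees of freedom: rows minus 2 coefficients
```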

For other examples use the link: http://www.ats.ucla.edu/stat/r/dae/rreg.htm

Also refer to the book: - Practical Regression and Anova using R

To model the effect of Body Weight and Sex on Heart Weight together, we use multiple regression.

So we can say that about 65% of the variation in Heart Weight can be explained by the model.

The equation becomes Hwt = 4.07*Bwt - 0.08*SexM - 0.41, where SexM is 1 for a male cat and 0 for a female cat.
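The multiple regression can be sketched as below, assuming the cats data set from the MASS package. R codes the Sex factor as an indicator named SexM, with female as the baseline level.

```r
library(MASS)   # assumption: cats is the data set shipped with MASS

# Heart weight modelled on body weight and sex together.
fit2 <- lm(Hwt ~ Bwt + Sex, data = cats)

coef(fit2)                # intercept, Bwt slope and the SexM indicator
summary(fit2)$r.squared   # share of variation explained

# Predicted heart weight for a hypothetical 3 kg male cat.
predict(fit2, newdata = data.frame(Bwt = 3, Sex = "M"))
```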


Logistic Regression:

Import a data set from web storage and name it. Then use logistic regression to find the relation between the variables that affect the admission of a student to a college: GRE score, GPA obtained and Rank. Also check whether your model fits well.


In the output above, the first thing we see is the call: this is R reminding us what model we ran, what options we specified, etc.

Next we see the deviance residuals, which are a measure of model fit. This part of the output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

o For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
o For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
o The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
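The model fit and an overall fit check can be sketched as below. Because the web data set is not available here, the sketch uses synthetic data generated to resemble it; the column names follow the text, but every value is simulated, so the estimates will differ from those quoted above. The likelihood-ratio test is one common way to summarise the deviance and is an assumption about what the notes intended.

```r
set.seed(123)   # synthetic data only, NOT the admissions data from the notes
n <- 400
admissions <- data.frame(
  gre  = round(rnorm(n, mean = 580, sd = 115)),
  gpa  = round(runif(n, min = 2.3, max = 4.0), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
# Simulate admission with made-up effects on the log-odds scale.
rank_effect <- c(0, -0.7, -1.3, -1.5)[as.integer(admissions$rank)]
log_odds <- -4 + 0.002 * admissions$gre + 0.8 * admissions$gpa + rank_effect
admissions$admit <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: rank is a factor, so R creates rank2, rank3, rank4.
fit <- glm(admit ~ gre + gpa + rank, data = admissions, family = binomial)
summary(fit)
exp(coef(fit))   # odds ratios: change in odds per one unit increase

# Overall fit: likelihood-ratio test against the intercept-only model.
lr_stat <- fit$null.deviance - fit$deviance
lr_df <- fit$df.null - fit$df.residual
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(chi_sq = lr_stat, df = lr_df, p_value = lr_p)
```

A small p-value from this test indicates that the model with predictors fits better than a model with only an intercept.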

Logistic Regression.

Quick Tip:-

of Logistic Regression using R.


Time Series Modelling:

Use the monthly milk production data set for time series analysis and predict the production for the next 5 years.


There is no seasonal component to be removed, so we move ahead with ARIMA modelling and finally check the fit of the model.


In time series analysis our basic aim is to separate the error terms from time-varying data and so get a more accurate forecast.

We obtain the forecast using the estimated intercept and standard error values. We use the tseries package for time series analysis.

For more help we can use ?arima to get a better understanding.


Plot of y versus t.


Plot of d.y

The autocorrelation function (ACF) plot shows a seasonal component. So we first look at the partial autocorrelation function (PACF) plot.

The PACF plot shows that more than one lag in the data is significant. So we now check the ACF and PACF of the differenced variable d.y.
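The ACF and PACF checks can be sketched as follows. The milk production data is not available here, so the series `y` below is synthetic (a trend plus a 12-month cycle plus noise) and stands in for it; only the name d.y comes from the notes.

```r
if (!interactive()) pdf(NULL)   # when run as a script, draw to a null device

set.seed(7)   # synthetic stand-in for the monthly milk production series
n <- 168
month <- seq_len(n)
y <- 600 + 1.5 * month + 50 * sin(2 * pi * month / 12) + rnorm(n, sd = 10)

d.y <- diff(y)   # first difference: y[t] - y[t-1], removes the linear trend

acf(y, lag.max = 36, main = "ACF of y")
pacf(y, lag.max = 36, main = "PACF of y")
acf(d.y, lag.max = 36, main = "ACF of d.y")
pacf(d.y, lag.max = 36, main = "PACF of d.y")
```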


Now we convert the data frame into a time series object, as this helps with time series analysis and forecasting.

This is another way of representing the time span. It shows which part of the year each observation belongs to, and this is also how R stores the dates of time series data.
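The conversion can be sketched with ts(). The data frame `milk`, its values and the start date of January 1962 are hypothetical, chosen only to show how R represents monthly dates.

```r
set.seed(7)   # hypothetical two years of monthly production figures
milk <- data.frame(production = 600 + 1.5 * (1:24) + rnorm(24, sd = 10))

# frequency = 12 marks the data as monthly; start = c(year, month).
milk_ts <- ts(milk$production, start = c(1962, 1), frequency = 12)

milk_ts         # printed as a year-by-month table
time(milk_ts)   # the same dates stored as fractional years
```

In the fractional form, each month advances the time value by 1/12 of a year.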


We have now built different MA, ARMA and ARIMA models.

We fit all of these models so that we can compare them and verify their fit.


The forecast is drawn as a dotted line. It suggests that milk production has reached its optimum level and will stay at about the same level for the next 5 years.
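A 5-year forecast can be sketched with arima() and predict() from base R. The series is synthetic, the model orders are illustrative rather than those chosen in the notes, and method = "ML" is an assumption made to keep the fit stable; the resulting forecast will not match the one described above.

```r
if (!interactive()) pdf(NULL)   # when run as a script, draw to a null device

set.seed(7)   # synthetic stand-in for the monthly milk production series
n <- 168
month <- seq_len(n)
y <- ts(600 + 1.5 * month + 50 * sin(2 * pi * month / 12) + rnorm(n, sd = 10),
        start = c(1962, 1), frequency = 12)

# Two candidate models on the differenced series; order = c(p, d, q).
fit_011 <- arima(y, order = c(0, 1, 1), method = "ML")
fit_111 <- arima(y, order = c(1, 1, 1), method = "ML")
c(AIC_011 = fit_011$aic, AIC_111 = fit_111$aic)   # lower AIC is preferred

# Forecast 60 months (5 years) ahead with standard errors.
fc <- predict(fit_111, n.ahead = 60)

# Observed series as a solid line, forecast as a dotted line.
ts.plot(y, fc$pred, gpars = list(lty = c(1, 3), ylab = "Production"))
```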
