
SAS ROUTINES

Contents

1 Label variable
2 Report
3 Define
4 Reading from text file
5 Reading from csv file
6 Capturing output in file
6.1.1 Creating an output data set
6.1.2 How to identify output objects
6.1.3 Using object label to create an output data set
6.1.4 Turn the listing output off
6.1.5 Output to an HTML file
7 95% confidence interval of mean
8 Column input
9 formatted input
10 Formatting in proc report
11 Proc report
12 Across variable to group variable horizontally
13 Computed variable
14 Summary report
15 Viewing contents of dataset
16 printing portion of dataset
17 Print by
18 create new library
19 creating and adding data to dataset
20 importing csv file
21 importing excel file
22 IMPORTING TEXT FILE
23 COPYING DATASET
24 ADDING NEW VARIABLES AND CREATING DATASET
25 DROP AND KEEP VARIABLES IN NEW DATASET
26 PRINTING NO OBSERVATION NUMBER
27 SUBSTRING
28 OTHER STRING FUNCTIONS
29 DATE FUNCTION
30 PRINTING VARIABLES IN DATASET
31 SORT
32 REMOVING DUPLICATE
33 REMOVE DUPLICATE BASED ON KEY
34 MOVE DUPLICATES INTO NEW DATASET
35 PLOT
36 Median in proc sql
37 PROC SQL
38 Proc sql case
39 MERGING TWO DATASETS
40 SAMPLING
41 PRINTING VERTICAL HEADING
42 MEAN CALCULATION
43 Moving means data into output file
44 Merging two data together
45 QUANTILES
46 TO REMOVE NA VALUES TO NUMERICAL
47 CREATE FREQUENCY TABLE
48 create two variable categorical frequency table
49 Weight statement
50 order
51 three variable frequency
52 Correlation
53 Regression
54 logistic regression
55 test stationarity
56 create a differenced time series
57 create ACF and PACF Plots
58 to calculate the ESACF and SCAN function values
59 forecast using ARIMA
60 advanced mean concepts
60.1 Basic mean
60.2 Selecting Analysis Variables, Analyses to be Performed by PROC MEANS, and Rounding of Results
60.3 Selecting Other Analyses
60.4 Step 4: Analysis with CLASS (variables)
60.5 Step 5: Don't Miss the Missings!
60.6 Survey means - How to Estimate a Ratio of Means using SAS
61 Mean and ratios
62 SAS QUESTIONS

1 Label variable
Label varname = 'label text';

2 Report
Proc report data = <dataset name>;
Column age weight; /* prints age and weight in columns */
Run;

3 Define
Assign formats to variables
Specify column headings and width
Proc report data = dataset <options>;
Define variable/ <usage><attributes><options><justification><Columnheading>;
Run;

4 Reading from text file


data Sample2;
infile 'c:\books\statistics by example\delim.txt';
length Gender $ 1;
input ID Age Gender $;
run;

The LENGTH statement tells SAS that the variable Gender is character (the
dollar sign indicates this) and that you want to store Gender in 1 byte (the 1
indicates this). The INPUT statement lists the variable names in the same order
as the values in the text file. Because you already told SAS that Gender is a
character variable, the dollar sign following the name Gender on the INPUT
statement is not necessary. If you had not included a LENGTH statement, the
dollar sign following Gender on the INPUT statement would have been necessary.
SAS assumes variables are numeric unless you tell it otherwise.

5 Reading from csv file


data Sample2;
infile 'c:\books\statistics by example\comma.csv' dsd;
length Gender $ 1;
input ID Age Gender $;
run;

The DSD option specifies that two consecutive commas represent a missing value
and that the default delimiter is a comma. It also removes quotation marks from
quoted values as they are read.
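The effect of DSD on consecutive commas can be seen with a small sketch that reads inline data instead of a file. The data set name dsd_demo and the three records are made up for illustration; the second record leaves Age empty.

```sas
/* Hypothetical inline data: record 2 has two consecutive commas */
data dsd_demo;
   infile datalines dsd;      /* DSD: comma delimiter, ",," means missing */
   length Gender $ 1;
   input ID Age Gender;
   datalines;
1,23,M
2,,F
3,45,M
;
run;

proc print data=dsd_demo;
run;
```

With DSD in effect, Age should be missing for ID 2 while Gender is still read as F. Without DSD, SAS would treat the two commas as a single delimiter and shift the remaining values to the left.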

6 Capturing output in file


Ods listing;
Ods csvall file = 'd:\ramesh\output\secind.csv';
Proc means data = mydata.loan_all mean;
Var default;
Class Tenure;
Run;
ods csvall close;

6.1.1 Creating an output data set


6.1.2 How to identify output objects
6.1.3 Using object label to create an output data set
6.1.4 Turn the listing output off
6.1.5 Output to an HTML file
SAS introduced the Output Delivery System (ODS) with Version 7, making output much more
flexible. We show some examples using ODS here. We are going to use the data set below for
the purpose of demonstration.
OPTIONS nocenter;
DATA hsb25;
INPUT id female race ses schtype $ prog
read write math science socst;
DATALINES;
147 1 1 3 pub 1 47 62 53 53 61
108 0 1 2 pub 2 34 33 41 36 36
18 0 3 2 pub 3 50 33 49 44 36
153 0 1 2 pub 3 39 31 40 39 51
50 0 2 2 pub 2 50 59 42 53 61
51 1 2 1 pub 2 42 36 42 31 39
102 0 1 1 pub 1 52 41 51 53 56
57 1 1 2 pub 1 71 65 72 66 56
160 1 1 2 pub 1 55 65 55 50 61
136 0 1 2 pub 1 65 59 70 63 51
88 1 1 1 pub 1 68 60 64 69 66
177 0 1 2 pri 1 55 59 62 58 51
95 0 1 1 pub 1 73 60 71 61 71
144 0 1 1 pub 2 60 65 58 61 66
139 1 1 2 pub 1 68 59 61 55 71
135 1 1 3 pub 1 63 60 65 54 66
191 1 1 1 pri 1 47 52 43 48 61
171 0 1 2 pub 1 60 54 60 55 66
22 0 3 2 pub 3 42 39 39 56 46
47 1 2 3 pub 1 47 46 49 33 41
56 0 1 2 pub 3 55 45 46 58 51
128 0 1 1 pub 1 39 33 38 47 41
36 1 2 3 pub 2 44 49 44 35 51
53 0 2 2 pub 3 34 37 46 39 31
26 1 4 1 pub 1 60 59 62 61 51
;
RUN;

Creating an output data set

Let's say we have a data set of student scores and want to conduct a paired t-test on writing score
and math score for each program type. For some reason, we want to save the t-values and p-values
to a data set for later use. Without ODS, it would not be an easy thing to do since proc
ttest does not have an output statement. With ODS it is only one more line of code.

We will sort the data set first by variable prog and use the statement ods output Ttests=ttest_output
to create a temporary data set called ttest_output containing the t-values and p-values
together with the degrees of freedom for each t-test conducted.
proc sort data=hsb25;
by prog;
proc ttest data=hsb25;
by prog;
paired write*math;
ods output Ttests=ttest_output;
run;
proc print data=ttest_output;
run;
The SAS System

Obs  prog  Variable1  Variable2  Difference    tValue  DF   Probt
  1     1  write      math       write - math   -1.57  14  0.1389
  2     2  write      math       write - math    0.66   4  0.5475
  3     3  write      math       write - math   -2.37   4  0.0766

How to identify output objects

For each SAS procedure, SAS produces a group of ODS output objects. For example, in the
above example, Ttests is the name of one such object associated with proc ttest. In order to know
what objects are associated with a particular proc, we use the ods trace on statement right before the
proc and turn the trace off right after it. Let's look at another example using proc reg. The option
listing with ods trace on displays the information about an object along with the corresponding
output. Below we see three objects (data sets in this case) associated with proc reg when no
extra options are used. The ANOVA part of the output is stored in a data set called ANOVA. The
parameter estimates are stored in ParameterEstimates. Each object has a name, a label and a
path along with its template. Once we obtain the name or the label of the object, we can use the ods
output statement to output it to a data set as shown in the example above.
ods trace on /listing;
proc reg data=hsb25;
model write = female math;
run;
quit;
ods trace off;
The REG Procedure
Model: MODEL1
Dependent Variable: write

Output Added:
-------------
Name:       ANOVA
Label:      Analysis of Variance
Template:   Stat.REG.ANOVA
Path:       Reg.MODEL1.Fit.write.ANOVA
-------------

Analysis of Variance
                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               2   2154.11191   1077.05596     19.39   <.0001
Error              22   1222.04809     55.54764
Corrected Total    24   3376.16000

Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.write.FitStatistics
-------------

Root MSE            7.45303    R-Square   0.6380
Dependent Mean     50.44000    Adj R-Sq   0.6051
Coeff Var          14.77603

Output Added:
-------------
Name:       ParameterEstimates
Label:      Parameter Estimates
Template:   Stat.REG.ParameterEstimates
Path:       Reg.MODEL1.Fit.write.ParameterEstimates
-------------

Parameter Estimates
                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|
Intercept    1      7.07533     7.56161      0.94     0.3596
female       1      5.95697     3.07209      1.94     0.0654
math         1      0.76991     0.14323      5.38     <.0001

Using object label to create an output data set

Along with the name of an object, we also see the label for the object. We can use the label to
create a data set just as using the name.
ods output "Parameter Estimates"=parest;
proc reg data=hsb25;
model write = female math;

run;
quit;
ods output close;
proc print data=parest;
run;
Obs  Model   Dependent  Variable   DF  Estimate   StdErr  tValue   Probt
  1  MODEL1  write      Intercept   1   7.07533  7.56161    0.94  0.3596
  2  MODEL1  write      female      1   5.95697  3.07209    1.94  0.0654
  3  MODEL1  write      math        1   0.76991  0.14323    5.38  <.0001

Turn the listing output off

Since we can save the output from a proc to a data set using ODS, we sometimes want to turn the
listing output off. We can NOT use the noprint option, since ODS requires an output object. Instead,
we use ODS statements, as shown in the example below. This makes sense because
listing output is just one form of ODS output. The statement ods listing close prevents the output
from appearing in the output window. After the proc reg, we turn the listing output back on so that
output will appear in the output window again.
ods listing close;
ods output "Parameter Estimates"=parest;
proc reg data=hsb25;
model write = female math;
run;
quit;
ods output close;
ods listing;

Output to an HTML file

Let's say that we want to write the output of our proc reg to an HTML file. This can be done
very easily using ODS. First we specify the file name we are going to use. Then we point the ods
html output to it. At the end we close the ods html output to finish writing to the HTML file. You
can view procreg.html created by the following code.
filename myhtml "c:\examples\procreg.html";
ods html body=myhtml;
proc reg data=hsb25;
model write= female math;
run;
quit;
ods html close;

7 95% confidence interval of mean


Use the CLM option in PROC MEANS.
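A minimal sketch of the CLM option, assuming the sashelp.class sample data set that ships with SAS is available. ALPHA=0.05 is the default and is written out only to make the 95% level explicit.

```sas
/* 95% confidence limits for the mean of height */
proc means data=sashelp.class n mean clm alpha=0.05 maxdec=2;
   var height;
run;
```

The output should contain the lower and upper 95% confidence limits next to N and the mean. Changing ALPHA= (for example ALPHA=0.10) changes the confidence level.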

8 Column input
If you have ID data in columns 1-3, Age in columns 4-6, and Gender in
column 7 of your raw data file, your input statement might look like this:
input ID $ 1-3 Age 4-6 Gender $ 7;

9 formatted input
input @1 ID $3.
@4 Age 3.
@7 Gender $1.;

The informat $3. tells SAS to read three columns of character data; the 3.
informat says to read three columns of numeric data; the $1. informat says to
read one column of character data. The two informats n. and $n. are used to
read n columns of numeric and character data, respectively.
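The pointer-and-informat style above can be tried on inline data. This is only a sketch: the data set name fixed_demo and the two records are invented, laid out so that ID occupies columns 1-3, Age columns 4-6 and Gender column 7.

```sas
/* Hypothetical fixed-width records: ID in 1-3, Age in 4-6, Gender in 7 */
data fixed_demo;
   input @1 ID     $3.
         @4 Age    3.
         @7 Gender $1.;
   datalines;
001 23M
002 45F
;
run;

proc print data=fixed_demo;
run;
```

Because ID is read with $3., the leading zeros are kept. The same layout could also be read with column input (input ID $ 1-3 Age 4-6 Gender $ 7;).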

10 Formatting in proc report
Format= option assigns a format to a column, e.g. dollar15.2:
Define revenue / format=dollar15.2;
Width= option sets the column width:
Define flight / width=7;
Spacing= option sets the spacing between the selected column and the next column to
its left

11 Proc report
Usage types: display, order, group, across, analysis or computed

Character variables are display variables by default


Define flight / order 'Flight/Number' width=6 center;
Numeric variables are analysis variables by default
proc print data=vlib.emp1;
where lastname < 'KAP' and payrate > 30 * overtime;
run;

12 Across variable to group variable horizontally


Define flight/
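As an illustration of the across usage, here is a sketch that assumes the sashelp.class sample data set rather than the flight data referred to above. Each value of sex becomes its own column, with the mean height underneath it.

```sas
/* Values of sex are laid out horizontally; age forms the rows */
proc report data=sashelp.class nowd;
   column age sex, height;
   define age    / group 'Age';
   define sex    / across 'Sex';
   define height / analysis mean format=6.1 'Mean height';
run;
```

The comma in the COLUMN statement nests height under each value of sex, which is what produces one column per across value.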

13 Computed variable
A computed variable is not part of the input dataset; its values are calculated in a COMPUTE block inside proc report.
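A minimal sketch of a computed column, assuming the sashelp.class sample data set. The column name bmi and its formula are illustrative only and do not come from these notes.

```sas
/* bmi exists only in the report, not in sashelp.class */
proc report data=sashelp.class nowd;
   column name height weight bmi;
   define name   / display;
   define height / display;
   define weight / display;
   define bmi    / computed format=6.1 'BMI';
   compute bmi;
      /* sashelp.class stores height in inches and weight in pounds */
      bmi = 703 * weight / (height * height);
   endcomp;
run;
```

Height and weight are defined as display variables here so that they can be referenced by name in the COMPUTE block; analysis variables would have to be referenced as weight.sum and height.sum.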

14 Summary report
Define flight / group 'Flight/Number' width=6 center;

PROC SUMMARY DATA=preteen NWAY;


CLASS sex;
VAR age height weight;
OUTPUT OUT=group_averages(DROP = _type_ _freq_)
MIN (age )=Youngest
MAX (age )=Oldest
MEAN(height)=Avg_Height
MEAN(weight)=Avg_Weight;
RUN;

NWAY suppresses the grand total row: only the combinations of all CLASS variables are written to the output data set.

15 Viewing contents of dataset


proc contents data=sashelp.air;
run;

16 printing portion of dataset


proc print data=sashelp.air(obs=10);

run;

17 Print by
Proc print data = order_finance;
Var payment_gateway payment_mode;
By payment_mode;
Run;

Prints the values of the VAR variables separately for each payment_mode. The
data set must already be sorted by payment_mode.

The DATA= option is not mandatory in proc print. If no data set is given, the
most recently created data set is printed.

18 create new library


libname <Library name> '<Library path>';

19 creating and adding data to dataset


data mydata.income;
input income expense;
datalines;

4500 2000
5000 2300
7890 2810
8900 5400
2300 2000
;
run;

20 importing csv file


PROC IMPORT OUT= MYDATA.sat_exam
DATAFILE= "C:\Users\Documents\Datasets\SAT_Exam.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

21 importing excel file


PROC IMPORT OUT= MYDATA.sat_exam
DATAFILE= "C:\Users\Documents\Datasets\SAT_Exam.xls"
DBMS=EXCEL REPLACE;
SHEET="Sheet1$";
GETNAMES=YES;
RUN;

PROC IMPORT OUT= WORK.add_budget
DATAFILE= "C:\Users\VENKAT\Google Drive\Training\Books\Content\Regression Analysis\Add_budget_data.xls"
DBMS=EXCEL REPLACE;
RANGE="budget$";

GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;

22 IMPORTING TEXT FILE


PROC IMPORT OUT= MYDATA.SAT_EXAM_data_from_text_file
DATAFILE= "C:\Users\Documents\Datasets\SAT_Exam.txt"
DBMS=TAB REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;

23 COPYING DATASET
data MYDATA.sat_exam_copy;
set MYDATA.sat_exam;
run;

24 ADDING NEW VARIABLES AND CREATING DATASET


data new_data;
set old_data;
<new var statements>;
run;

25 DROP AND KEEP VARIABLES IN NEW DATASET


data new_data;
set old_data(Keep=Var1 Var2 Var3);
<Rest of the statements>

run;

data new_data;
set old_data(Drop=Var5 Var6 Var7);
<rest of the statements>
run;

26 PRINTING NO OBSERVATION NUMBER


proc print data=market_asset(obs=10) noobs;
run;

27 SUBSTRING
New_variable = SUBSTR(variable, start position, number of characters);
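A short sketch of SUBSTR in a DATA step. The variable names and the phone value are made up for illustration.

```sas
data _null_;
   phone = '080-4567-1234';        /* made-up value */
   area  = substr(phone, 1, 3);    /* 3 characters starting at position 1 */
   last4 = substr(phone, 10, 4);   /* 4 characters starting at position 10 */
   put area= last4=;               /* writes the two values to the log */
run;
```

Positions are counted from 1, so the log should show area=080 and last4=1234. If the third argument is omitted, SUBSTR returns everything from the start position to the end of the string.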

28 OTHER STRING FUNCTIONS


LENGTH
TRIM
UPCASE
LOWCASE

29 DATE FUNCTION
Duration_days=INTCK('day',start_date,end_date); /* Finds the duration in days */
Duration_months=INTCK('month',start_date,end_date); /* Finds the duration in months */
Duration_weeks=INTCK('week',start_date,end_date); /* Finds the duration in weeks */
MONTH
YEAR
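One point worth keeping in mind is that, by default, INTCK counts the interval boundaries crossed between the two dates rather than the elapsed time. The sketch below uses two made-up dates one day apart that fall in different months; the optional 'continuous' method measures intervals from the start date instead.

```sas
data _null_;
   start_date  = '31JAN2020'd;
   end_date    = '01FEB2020'd;
   days        = intck('day',   start_date, end_date);
   months      = intck('month', start_date, end_date);               /* boundaries crossed */
   months_cont = intck('month', start_date, end_date, 'continuous'); /* full months elapsed */
   put days= months= months_cont=;
run;
```

The log should show days=1 months=1 months_cont=0: one month boundary was crossed even though only one day elapsed. MONTH(date) and YEAR(date) simply extract the month number and the year from a SAS date.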

30 PRINTING VARIABLES IN DATASET


proc contents data=sashelp.stocks varnum;
run;

proc contents data=sashelp.stocks varnum short;


run;

31 SORT
proc sort data=<dataset>;
by <variable>;
run;

proc sort data=<dataset> out = <New Data set>;


by <variable>;
run;

proc sort data=MYDATA.bill out=mydata.bill_top;


by descending Bill_Amount ;
run;

proc sort data=MYDATA.bill out=mydata.bill_top100k;


by descending Bill_Amount ;
where Bill_Amount>100000;
run;

32 REMOVING DUPLICATE
proc sort data=MYDATA.bill out=mydata.bill_wod nodup;
by cust_id ;
run;

33 REMOVE DUPLICATE BASED ON KEY


proc sort data=MYDATA.bill out=mydata.bill_cust_wod nodupkey;
by Bill_Id ;
run;

34 MOVE DUPLICATES INTO NEW DATASET


proc sort data=MYDATA.bill out=mydata.bill_wod nodup
dupout=mydata.nodup_cust_id ;
by cust_id ;
run;
proc print data=mydata.nodup_cust_id;
run;

35 PLOT
proc gplot data= <data set>;
plot y*x;
run;

symbol i=none;
proc gplot data= market_asset;
plot reach*budget;
run;

proc gchart data= market_asset;

vbar category;
Run;

proc gchart data= market_asset;


pie category ;
Run;

/* 3D bar chart */
proc gchart data= market_asset;
vbar3d category ;
Run;
/* 3D pie chart */
proc gchart data= market_asset;
pie3d category ;
Run;

36 Median in proc sql


PROC SQL;
SELECT MEDIAN(a) LABEL='Median of 1'
FROM threex3
;
QUIT;

We get:

  Median
    of 1
--------
     1.1
       6
     7.7

You were probably expecting to see just a 6, the median of the values in column A.
Instead we have, for each row, the trivial median of a single value of A. In other words,
the processing was horizontal rather than vertical.

The explanation: vertical calculation of medians is not supported in PROC SQL (though
it is in PROC SUMMARY). Thus, there is no ambiguity. In SQL, the only valid
interpretation of MEDIAN with a single argument is that it is a SAS function call, to be
computed horizontally.
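To get the vertical (column) median, one option is PROC MEANS. The sketch below rebuilds a small threex3 table: the three values of a are taken from the output above, while columns b and c are made-up fillers.

```sas
/* Values of a come from the output above; b and c are hypothetical */
data threex3;
   input a b c;
   datalines;
1.1 2 3
6 5 4
7.7 8 9
;
run;

/* Column-wise median of a */
proc means data=threex3 median maxdec=1;
   var a;
run;
```

This should report a median of 6 for a, the value the SQL query was expected to return.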

37 PROC SQL
proc sql;
create table buss_fin /* This is the new dataset */
as select *
from market_asset
where Category= 'Business/Finance';
Quit;

proc sql;
select Category, sum(budget) as total_budget
from market_asset
group by Category;
Quit;

38 Proc sql case


PROC SQL;
CREATE TABLE trip_list AS
SELECT fname,
age,
sex,
CASE WHEN age=11 THEN 'Zoo'
WHEN sex='F' THEN 'Museum'
ELSE '[None]'
END
AS Trip
FROM preteen
;
QUIT;

39 MERGING TWO DATASETS


Data Students_1_2;
set students1 students2;
run;

proc sort data=students1;


by name;
run;
proc sort data=students2;
by name;
run;
data studentmerge;
Merge students1 students2;
by name;
run;

data final;
Merge data1(in=a) data2(in=b);
by var;
if a;
run;

data final1;
Merge data1(in=a) data2(in=b);
by var;
if b;
run;

data final2;
Merge data1(in=a) data2(in=b);
by var;
if a and b;
run;
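The three IN= variants above keep different sets of rows: if a keeps every key found in the first data set, if b every key found in the second, and if a and b only the keys found in both. A small sketch with made-up data sets left_demo and right_demo (the key is called id here) shows the last case.

```sas
/* Hypothetical inputs, already sorted by id */
data left_demo;
   input id x;
   datalines;
1 10
2 20
3 30
;
run;

data right_demo;
   input id y;
   datalines;
2 200
3 300
4 400
;
run;

/* Keep only ids present in both inputs (like an inner join) */
data both_demo;
   merge left_demo(in=a) right_demo(in=b);
   by id;
   if a and b;
run;

proc print data=both_demo;
run;
```

Only ids 2 and 3 should remain. Replacing the condition with if a; would keep ids 1 to 3 with y missing for id 1, and if b; would keep ids 2 to 4 with x missing for id 4.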

40 SAMPLING
proc surveyselect data = sashelp.prdsale
method = SRS
rep = 1
sampsize = 30 seed = 12345 out = prod_sample_30;
id _all_;
run;

41 PRINTING VERTICAL HEADING


proc print data=<<data set>> label noobs heading=vertical;
var <<variable-list>>;
by var1; run;
Label: This option prints variable labels as column headings instead of variable names.
Noobs: This option removes the OBS column from output.
Heading=vertical: This option prints the column headings vertically. This is useful
when the names are long but the values of the variable are short.
By: The by statement produces output grouped by values of the mentioned variables.

42 MEAN CALCULATION
Proc means data=online_sales mean;
var listPrice;
class brand;
run;

proc means data = online_sales nmiss kurtosis var;


run;

The CLASS statement produces a separate mean for each value of brand.

What if you want to see the grand mean, as well as the means broken down by
Drug, all in one listing? The PROC MEANS option PRINTALLTYPES does this for you
when you include a CLASS statement. Here is the modified program:

title "Descriptive Statistics Broken Down by Drug";


proc means data=example.Blood_Pressure n nmiss
mean std median printalltypes maxdec=3;
class Drug;
var SBP DBP;
run;

libname example 'c:\books\statistics by example';


title "Descriptive Statistics for SBP and DBP";
proc means data=example.Blood_Pressure n nmiss mean std median
maxdec=3;
var SBP DBP;
run;

Option     Description
N          Number of nonmissing observations
NMISS      Number of observations with missing values
MEAN       Arithmetic mean
STD        Standard deviation
STDERR     Standard error
MIN        Minimum value
MAX        Maximum value
MEDIAN     Median
MAXDEC=    Maximum number of decimal places to display
CLM        95% confidence limits on the mean
CV         Coefficient of variation

43 Moving means data into output file

44 Merging two data together

45 QUANTILES
Proc univariate data= online_sales ;
var listPrice ;
run;

title "Demonstrating PROC UNIVARIATE";


proc univariate data=example.Blood_Pressure;

id Subj; /* The ID statement is not necessary, but it is particularly useful with PROC UNIVARIATE */

var SBP DBP;


histogram;
probplot / normal(mu=est sigma=est);
run;

The PROBPLOT statement requests a probability plot. This plot shows
percentiles from a theoretical distribution on the x-axis and data values on the
y-axis. This example program selects the normal distribution using the NORMAL
option after the forward slash. If your data values are normally distributed,
the points on this plot will form a straight line. To make it easier to see
deviations from normality, the option NORMAL also produces a reference line
where your data values would fall if they came from a normal distribution. When
you use the NORMAL option, you also need to specify a mean and standard
deviation. Specify these by using the keyword MU= to specify the mean and the
keyword SIGMA= to specify a standard deviation. The keyword EST tells the
procedure to use the data values to estimate the mean and standard deviation,
instead of some theoretical value.

Notice the slash between the word PROBPLOT and NORMAL. Using a slash here
follows standard SAS syntax: if you want to specify options for any statement in
a PROC step, follow the statement keyword with a slash.

/* BOX PLOT*/
Proc univariate data= health_claim plot;
var Claim_amount ;
run;

46 TO REMOVE NA VALUES TO NUMERICAL


Data cust_cred_raw_v1;
Set cust_cred_raw;
MonthlyIncome_new= MonthlyIncome*1;
NumberOfDependents_new=NumberOfDependents*1;
run;
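Multiplying a character variable by 1 forces an automatic character-to-numeric conversion: values such as NA become missing, and SAS writes conversion notes to the log. An explicit alternative is the INPUT function with the ?? modifier, which suppresses those notes. The sketch below uses a made-up data set na_demo with a character MonthlyIncome column.

```sas
/* Hypothetical character column containing 'NA' */
data na_demo;
   input MonthlyIncome $;
   /* ?? suppresses invalid-data notes; 'NA' becomes a numeric missing value */
   MonthlyIncome_new = input(MonthlyIncome, ?? best12.);
   datalines;
5400
NA
7200
;
run;

proc print data=na_demo;
run;
```

MonthlyIncome_new should be numeric, with a missing value (.) in the second row.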

47 CREATE FREQUENCY TABLE


Title 'Frequency table for Serious delinquency in 2 years ';
proc freq data= cust_cred_raw_v1;

table SeriousDlqin2yrs;
run;

48 create two variable categorical frequency table


data respire;
input treat $ outcome $ count;
datalines;
placebo f 16
placebo u 48

test f 40
test u 20
;
proc freq;
weight count;
tables treat*outcome;
run;

49 Weight statement
The WEIGHT statement is necessary to tell the procedure that the data are count
data, or frequency data; the variable listed in the WEIGHT statement contains the
values of the count variable.
If the data are stored in record form (one observation per subject), the WEIGHT statement is not required, e.g.:
data respire;
input treat $ outcome $ @@;
datalines;
placebo f placebo f placebo f
placebo f placebo f
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u
test f test f test f
test f test f test f
test f test f
test u test u test u
;
proc freq;
tables treat*outcome;
run;

50 order
ORDER=DATA specifies that the sort order is the order in which the values are
encountered in the data set. Thus, since marked comes first, it is first in the
sort order. Since some is the second value for IMPROVE encountered in the data
set, it is second in the sort order, and none is third. This is the desired sort
order. The following PROC FREQ statements produce a table displaying the sort
order resulting from the ORDER=DATA option:
proc freq order=data;
weight count;
tables treatment*improve;
run;

51 three variable frequency


proc freq order=data;
weight count;
tables sex*treatment*improve / nocol nopct;
run;

NOCOL and NOPCT options suppress the printing of column percentages and cell
percentages, respectively

52 Correlation
proc corr data=add_budget ;
var Online_Budget Responses_online ;
run;

53 Regression
/* Predicting SAT score using the other four variables: General_knowledge,
Aptitude, Mathematics, and Science */
proc reg data=sat_score;
model SAT=General_knowledge Aptitude Mathematics Science;
run;

54 logistic regression
Proc logistic data=ice_cream_sales;
model buy_ind=age;
run;

55 test stationarity
proc arima data= ms;
identify var= stock_price stationarity=(DICKEY);

run;

56 create a differenced time series


data ms1;
set ms;
dif_stock_price=stock_price-lag1(stock_price);
run;
ms1 is the new data set, lag1 returns the previous value (Yt-1), and dif_stock_price is
the new differenced series
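The same first difference can also be obtained with the DIF function, or requested directly inside PROC ARIMA without creating a new variable. The sketch below assumes the ms data set and the stock_price variable used above; ms1_dif is a made-up name.

```sas
/* First difference with the DIF function: x(t) - x(t-1) */
data ms1_dif;
   set ms;
   dif_stock_price = dif(stock_price);
run;

/* Differencing inside PROC ARIMA: (1) requests a first difference */
proc arima data=ms;
   identify var=stock_price(1) stationarity=(dickey);
run;
quit;
```

In both cases the first observation of the differenced series is missing, because there is no earlier value to subtract.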

57 create ACF and PACF Plots


proc arima data=ts15 plots=all;
identify var=Series_Values ;
run;

58 to calculate the ESACF and SCAN function values


proc arima data= TS13 plots=all;
identify var=x SCAN ESACF ;
run;

59 forecast using ARIMA


proc arima data=web_views;
Identify var=Visitors;
Estimate p=1 q=0 method=ml;
Forecast lead=7;
run;

60 advanced mean concepts


Two SAS data sets are used to generate the examples you'll see in this tutorial. An Early Adopter Release of SAS 9
Software was used to create the code and output, but everything presented in this paper is available in Release 8.0
and higher of the SAS System.
The first data set, ELEC_ANNUAL, contains about 16,300 customer-level observations (rows) with information about
how much electricity they consumed in a year, the rate schedule on which they were billed for the electricity, the total
revenue billed for that energy and the geographic region in which they live. The variables in the data set are:
PREMISE Premise Number [Unique identifier for customer meter]
TOTKWH Total Kilowatt Hours [KwH is the basic unit of electricity consumption]
TOTREV Total Revenue [Amount billed for the KwH consumed]
TOTHRS Total Hours [Total Hours Service in Calendar Year]
RATE_SCHEDULE Rate Schedule [Table of Rates for Electric Consumption Usage]
REGION Geographic Region [Area in which customer lives]
The second data set, CARD_TRANS2, contains about 1.35 million observations (rows), each representing one
(simulated) credit card transaction. The variables in the data set are:
CARDNUMBER Credit Card Number
CARDTYPE Credit Card Type [Visa, MasterCard, etc.]
CHARGE_AMOUNT Transaction Amount (in dollars/cents)
TRANS_DATE Transaction Date [SAS Date Variable]
TRANS_TYPE Transaction Type [1=Electronic 2=Manual]

60.1 Basic mean


By default, PROC MEANS will analyze all numeric variables in your data set and deliver those analyses to your
Output Window. Five default statistical measures are calculated:
N Number of observations with a non-missing value of the analysis variable
MEAN Mean (Average) of the analysis variable's non-missing values
STD Standard Deviation
MAX Largest (Maximum) Value
MIN Smallest (Minimum) Value
Using the ELEC_ANNUAL Data Set and PROC MEANS, we can see how the default actions of PROC MEANS are
carried out by submitting the following code:
* Step 1: Basics and Defaults;
PROC MEANS DATA=SUGI.ELEC_ANNUAL;
title 'SUGI 29 in Montreal';
title2 'Steps to Success with PROC MEANS';
title3 'Step 1: The Basics and Defaults';
run;
The results displayed in the Output Window are:

Since TOTKWH, TOTREV and TOTHRS are all numeric variables, PROC MEANS calculated the five default
statistical measures on them and placed the results in the Output Window.

60.2 Selecting Analysis Variables, Analyses to be Performed by PROC MEANS, and Rounding of Results

In most situations, your data sets will probably have many more numeric variables than you want PROC MEANS to
analyze. This is particularly true if some of your numeric variables don't admit of a meaningful arithmetic operation,
which is a fancy way of saying that calculating a statistic on them produces meaningless information.
For example, the sum of ZIPCODE or the MEAN of telephone number is unlikely to be useful. So, we don't want to
waste time having these values calculated or clutter up our output with meaningless information. Also, we may not
need all of the five statistical analyses that PROC MEANS will perform automatically. And, we may want to round the
values to a more useful number of decimal places than what PROC MEANS will do for us automatically.
Again using the ELEC_ANNUAL data set, here is how we can take more control over what PROC MEANS will do for
us. Suppose we just want the SUM and MEAN of TOTREV, rounded to two decimal places. The following PROC
MEANS task gets us just what we want.

A box has been drawn around the important features presented in Step 2. First, the SUM and MEAN statistics
keywords were specified, which instructs PROC MEANS to perform just those analyses. Second, the MAXDEC
option was used to round the results in the Output Window to just two decimal places. (If we had wanted the
analyses rounded to the nearest whole number, then MAXDEC = 0 would have been specified.) Finally, the VAR
Statement was added, giving the name of the variable for which the analyses were desired. You can put as many
(numeric) variables as you need/want into one VAR Statement in your PROC MEANS task.
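The Step 2 listing itself did not survive in this copy. A minimal sketch consistent with the description above (SUM and MEAN keywords, MAXDEC=2, and a VAR Statement naming TOTREV) might look like the following; the comment and title3 text are assumptions, not the original author's wording.

```sas
* Step 2: choose the statistics, the rounding, and the analysis variable;
PROC MEANS DATA=SUGI.ELEC_ANNUAL SUM MEAN MAXDEC=2;
   VAR TOTREV;                                /* analyze TOTREV only */
   title3 'Step 2: Selecting Variables and Statistics';  /* assumed title text */
run;
```

Changing MAXDEC=2 to MAXDEC=0 would round the displayed results to whole numbers, as noted above.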
The Output Window displays:

60.3 Selecting Other Analyses


So far we've worked with some of the (five) default statistical analyses available from PROC MEANS. There are many
other statistical analyses you can obtain from the procedure! Here is a complete list:

Suppose the observations in ELEC_ANNUAL are a random sample from a larger population of utility customers. We
might therefore want to obtain, say, a 95 percent confidence interval around the mean total KwH consumption and
around the mean billed revenue, along with the mean and median. From the above table, you can see that the
MEAN, MEDIAN and CLM statistics keywords will generate the desired analyses. The PROC MEANS task below
generates the desired analyses. The task also includes a LABEL Statement, which adds additional information about
the variables in the Output Window.
* Step 3: Selecting Statistics;
PROC MEANS DATA=SUGI.ELEC_ANNUAL
   MEDIAN MEAN CLM MAXDEC=0;
   Label TOTREV = 'Total Billed Revenue'
         TOTKWH = 'Total KwH Consumption';
   VAR TOTREV TOTKWH;
   title3 'Step 3: Selecting Statistics';
run;
The output generated is:

60.4 Step 4: Analysis with CLASS (variables)


So far we've analyzed the values of variables from ELEC_ANNUAL without regard to the values of potentially
interesting and useful classification variables. PROC MEANS can do this for you with a minimum of additional
coding. First, we need to understand what the CLASS and BY Statements do when included in a PROC MEANS
task. The CLASS statement does not require that the input (source) data set be sorted by the values of the
classification variables. On the other hand, using the BY Statement requires that the input data set be sorted by the
values of the classification variables.
In most situations, it does not matter if you use the CLASS or BY statement to request analyses classified by the
values of a classification variable. If you are working with a very large file, however, with many classification
variables (and/or classification variables with many distinct values), you may obtain significant processing time
reductions if you first use PROC SORT to sort the data by the values of the classification variable and then use
PROC MEANS with a BY Statement. Unfortunately, I cannot give you a magic number of observations or variables
at which it becomes more efficient to first sort and then use a BY statement versus using the CLASS statement on an
unsorted data set. Factors such as the actual number of observations, the number of unique values of the CLASS
variables, memory allocation/availability, CPU power, etc. all come into play and can't really be estimated in
advance. You'll have to use some trial and error to figure out which approach is best for your unique data structures
and computing capabilities.
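As a rough sketch of the sort-then-BY alternative described above, the data set is sorted first and the BY Statement then replaces CLASS. REGION is the classification variable used later in this section; the output data set name WORK.ELEC_SORTED and the MAXDEC value are assumptions made for illustration.

```sas
* Sort once by the classification variable (WORK.ELEC_SORTED is a hypothetical name);
PROC SORT DATA=SUGI.ELEC_ANNUAL OUT=WORK.ELEC_SORTED;
   BY REGION;
run;

* BY-group processing requires the sorted input created above;
PROC MEANS DATA=WORK.ELEC_SORTED MEAN SUM MAXDEC=2;
   BY REGION;
   VAR TOTREV;
run;
```

Timing both this version and the CLASS version on your own data (for example with the FULLSTIMER system option) is one way to carry out the trial and error suggested above.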
Having said all of this, let's take a look at how we can obtain the MEAN and SUM of TOTREV classified by REGION
in the ELEC_ANNUAL data set. All we need to do is add the CLASS statement (with REGION as the classification
variable) to the PROC MEANS task, as shown below.

By specifying REGION in the CLASS Statement, we now have the MEAN and SUM of TOTREV and TOTKWH for
each unique value of region. We also have a column called N Obs, which is worthy of further discussion. By
default, PROC MEANS shows the number of observations for each value of the classification variable. So, we can
see that there are, for example, 5,061 observations in the data set from the WESTERN Region.
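The Step 4 listing is missing from this copy. A sketch that matches the description (MEAN and SUM of TOTREV and TOTKWH for each value of REGION) could be written as follows; the MAXDEC value and the title text are assumptions.

```sas
* Step 4: analysis classified by REGION (no prior sort is needed with CLASS);
PROC MEANS DATA=SUGI.ELEC_ANNUAL MEAN SUM MAXDEC=2;
   CLASS REGION;
   VAR TOTREV TOTKWH;
   title3 'Step 4: Analysis with CLASS Variables';  /* assumed title text */
run;
```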
How does PROC MEANS handle missing values of classification variables? Suppose there were some observations
in ELEC_ANNUAL with missing values for REGION. By default, those observations would not be included in the
analyses generated by PROC MEANS. But we have an option in PROC MEANS that we can use to include
observations with missing values of the classification variables in our analysis. This option is shown in Step 5.

60.5 Step 5: Don't Miss the Missings!


As we saw in Step 4, PROC MEANS automatically creates a column called N Obs when a classification variable is
placed in a CLASS Statement. But observations with a missing value of the CLASS variable are, by default, excluded
(not portrayed) in the output analysis. There are certainly many instances where it would be useful to know: a) how many
observations have a missing value for the classification variable and b) what the analyses of the analysis variables
are for observations that have a missing value for the given classification variable. We can easily obtain this
information by specifying the MISSING option in the PROC MEANS statement. Here's how to do it:
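The Step 5 listing is also missing here. Under the same assumptions as the earlier steps (REGION as the classification variable, an assumed MAXDEC value and title), adding the MISSING option might look like this:

```sas
* Step 5: treat missing values of REGION as a valid classification level;
PROC MEANS DATA=SUGI.ELEC_ANNUAL MEAN SUM MAXDEC=2 MISSING;
   CLASS REGION;
   VAR TOTREV TOTKWH;
   title3 'Step 5: Including Missing CLASS Values';  /* assumed title text */
run;
```

With MISSING in effect, observations whose REGION is missing form their own row in the output, with their own N Obs count.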

60.6 Survey means - How to Estimate a Ratio of Means using SAS


This section describes how to use SAS to estimate a ratio of means for all adults and for males
and females separately. To illustrate this, the sum of calcium from milk is divided by the sum of
total calcium for each population group as an example.
Sorting is not a necessary first step in SAS as it is in SUDAAN. Therefore, properly weighted
estimated means and standard errors, using complex survey design factors (e.g., strata and PSU),
can be obtained with the single SAS procedure PROC SURVEYMEANS.

60.6.1.1 Use SAS to Estimate How Much Dietary Calcium Consumed by Adults, Ages 20 Years and Older, Comes from Milk

60.6.1.2 Sample Code

*-------------------------------------------------------------------------;
* Use the PROC SURVEYMEANS procedure in SAS to compute a properly weighted;
* estimated ratio of means for all persons ages 20+ and by gender.        ;
*-------------------------------------------------------------------------;
* Run analysis for overall subpopulation of interest;
proc surveymeans data=DTTOT;
   where usedat=1;
   strata SDMVSTRA;
   cluster SDMVPSU;
   weight WTDRD1;
   var D1MCALC DR1TCALC;
   ratio D1MCALC / DR1TCALC;
   title "Ratio of Means -- All Persons ages 20+";
run;

*-------------------------------------------------------------------------;
* Use the PROC SORT procedure to sort the data by gender.                 ;
*-------------------------------------------------------------------------;
proc sort data=DTTOT;
   by RIAGENDR;
run;

* Run analysis by gender within subpopulation of interest;
proc surveymeans data=DTTOT;
   where usedat=1;
   strata SDMVSTRA;
   cluster SDMVPSU;
   weight WTDRD1;
   var D1MCALC DR1TCALC;
   ratio D1MCALC / DR1TCALC;
   by RIAGENDR;
   title "Ratios of Means -- by Gender";
run;

60.6.1.3 Output of Program

Ratio of Means -- All Persons ages 20+

Data Summary
Number of Strata                  15
Number of Clusters                30
Number of Observations          4448
Sum of Weights             205284669

Statistics
                                              Std Error     Lower 95%     Upper 95%
Variable  Label            N         Mean       of Mean   CL for Mean   CL for Mean
------------------------------------------------------------------------------------
d1mcalc   Calcium (mg)  4448   101.162167      7.647887     84.861081    117.463253
DR1TCALC  Calcium (mg)  4448   880.130855     16.722099    844.488545    915.773166
------------------------------------------------------------------------------------

Ratio Analysis
Numerator  Denominator     N      Ratio    Std Err    95% Confidence Interval
------------------------------------------------------------------------------
d1mcalc    DR1TCALC     4448   0.114940   0.006826    0.100390      0.129490
------------------------------------------------------------------------------

Ratios of Means -- by Gender

Gender - Adjudicated=male

Data Summary
Number of Strata                  15
Number of Clusters                30
Number of Observations          2135
Sum of Weights            98664010.2

Statistics
                                              Std Error     Lower 95%     Upper 95%
Variable  Label            N         Mean       of Mean   CL for Mean   CL for Mean
------------------------------------------------------------------------------------
d1mcalc   Calcium (mg)  2135   122.142347      8.719800    103.556533    140.728162
DR1TCALC  Calcium (mg)  2135   998.359501     21.809584    951.873474   1044.845528
------------------------------------------------------------------------------------

Ratio Analysis
Numerator  Denominator     N      Ratio    Std Err    95% Confidence Interval
------------------------------------------------------------------------------
d1mcalc    DR1TCALC     2135   0.122343   0.007148    0.107107      0.137579
------------------------------------------------------------------------------

Gender - Adjudicated=female

Data Summary
Number of Strata                  15
Number of Clusters                30
Number of Observations          2313
Sum of Weights             106620659

Statistics
                                              Std Error     Lower 95%     Upper 95%
Variable  Label            N         Mean       of Mean   CL for Mean   CL for Mean
------------------------------------------------------------------------------------
d1mcalc   Calcium (mg)  2313    81.747649      9.880726     60.687380    102.807918
DR1TCALC  Calcium (mg)  2313   770.725113     15.292108    738.130756    803.319469
------------------------------------------------------------------------------------

Ratio Analysis
Numerator  Denominator     N      Ratio    Std Err    95% Confidence Interval
------------------------------------------------------------------------------
d1mcalc    DR1TCALC     2313   0.106066   0.011329    0.081919      0.130213
------------------------------------------------------------------------------

Highlights from the output include:

The ratio of mean calcium from milk to total calcium, for all persons ages 20
and older, is 0.11 (with a standard error of 0.01). The corresponding values
for males and females, respectively, are 0.12 (0.01) and 0.11 (0.01).

Note that, even though this analysis did not incorporate a domain statement,
the results are exactly equal to those obtained using SUDAAN and its
SUBPOPN statement because the subgroup of interest was one for which the
weighted NHANES sample is representative.

61 Mean and ratios

I need to find a ratio of two mean values that I have found using proc means.
proc means data=a;
class X Y;
var x1 x2;
run;

Then I get the output mean values for variables x1 and x2 in the two categories of X and Y,
but it is x1/x2 for each category that I am interested in, and doing it by hand is not really a
solution.
You need to precompute x1/x2 or postcompute x1/x2, depending on whether you want
mean(x1/x2) or mean(x1)/mean(x2). The two generally give different answers, especially if x1 and x2 have
different numbers of responses.
So either (... means fill in what you have already)
data premean;
   set have;
   x1x2 = x1/x2;
run;

proc means ... ;
   class ... ;
   var x1x2;
run;

or
proc means ...;
   class ... ;
   var x1 x2;
   output out=postmeans mean=;
run;

data want;
   set postmeans;
   x1x2 = x1/x2;
run;

62 SAS QUESTIONS
1. To create a raw data file:
Use the SET statement
Format:
DATA _null_;
SET dataset;
_NULL_ allows the DATA step to be used without
creating a SAS data set
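The format above only reads the data set; to actually write the raw file, a FILE statement and a PUT statement are normally added. A minimal sketch, assuming the SASHELP.CLASS sample data set is available and using a hypothetical output path:

```sas
data _null_;
   set sashelp.class;                  /* read each observation */
   file 'class_raw.txt';               /* hypothetical output file path */
   put name $8. +1 sex $1. +1 age 2.;  /* write one fixed-column record per observation */
run;
```

Because the step is named _NULL_, no SAS data set is created; only the external file is written.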

2. Result formats for SAS output can be specified:
a. Traditional SAS listing
b. HTML document
c. Both a listing and HTML document.

3. The appearance of the output can be controlled, specifically (for SAS listings):
a. line size (the maximum width of the log and output)
b. page size (the number of lines per printed page)
c. page numbers displayed
d. date and time displayed
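These four settings correspond to the LINESIZE=, PAGESIZE=, NUMBER and DATE system options. A short sketch, assuming SASHELP.CLASS is available and using arbitrary example values:

```sas
* Set line width, lines per page, page numbers, and the date/time stamp;
options linesize=80 pagesize=60 number date;

proc print data=sashelp.class;
run;

* Suppress page numbers and the date/time stamp for later output;
options nonumber nodate;
```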
4. Variable length identifies the number of bytes used to store the variable.
Length is dependent on type:
a. Character variables can be up to 32,767 bytes long
b. Numeric variables have a constant default length of 8 bytes, no matter
how many digits the value contains
c. Numeric variables have a constant length because they
are stored as floating-point numbers
d. A different length can be specified for numeric variables.

5. A data set has two parts that SAS can locate: a descriptive portion and a
data portion

6. Descriptive Portion - Contains information about the data set, including:


a. data set name
b. data type
c. creation date and time
d. number of observations
e. number of variables
f. number of indexes
g. Contains information about the variables in the data set, including:
i. name
ii. type
iii. length
iv. format
v. informat
vi. label.
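The descriptive portion can be inspected with PROC CONTENTS. A minimal example, assuming the SASHELP.CLASS sample data set is available:

```sas
* Print the data set attributes and the variable attributes;
proc contents data=sashelp.class;
run;
```

The report lists the data set name, creation date, number of observations and variables, followed by each variable's type, length, and any format, informat, or label.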
7. Drop statements
If a variable should be processed but not appear in the
new data set, use the DROP= option in the DATA
statement
If a variable should not be processed nor appear in the
new data set, use the DROP= option in the SET
statement
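A small sketch of the two placements, assuming SASHELP.CLASS (with HEIGHT in inches and WEIGHT in pounds); the output data set name and the BMI calculation are chosen only for illustration:

```sas
data work.bmi_report (drop=height weight);  /* used in the step, but not written out */
   set sashelp.class (drop=age);            /* never read into the step at all */
   bmi = (weight / height**2) * 703;        /* needs HEIGHT and WEIGHT during processing */
run;
```

HEIGHT and WEIGHT are available to the assignment statement because they are dropped only on output, whereas AGE is never brought into the program data vector.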
8. Observations in the input data set are read as they appear in the physical file,
or sequentially. Sequential reading can be bypassed using a POINT= option
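A minimal sketch of direct access with POINT=, assuming SASHELP.CLASS; the variable name OBSNUM and the choice of observation 3 are arbitrary:

```sas
data work.third_obs;
   obsnum = 3;                        /* observation number to read directly */
   set sashelp.class point=obsnum;    /* POINT= names a temporary variable */
   output;                            /* write the observation explicitly */
   stop;                              /* POINT= never hits end-of-file, so stop the step */
run;
```

The STOP statement matters here: without it the DATA step would loop indefinitely, because direct access does not trigger the usual end-of-file condition.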
9. The criteria used by one-to-one readings to select data:
a. The new data set contains all variables from all input data sets
b. If data sets have variables of the same name, the values from the last
data set overwrite the values read from previous data sets
c. The number of observations in the new data set will be the same as
the number of observations in the smallest original data set
d. Observations are combined based on their relative position, that is,
the first observations from each data set are combined, and so on.
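These rules can be seen with two small made-up data sets (all names and values here are invented for illustration) combined by two SET statements in one DATA step:

```sas
data work.left;
   input id score;
   datalines;
1 10
2 20
3 30
;
run;

data work.right;
   input id grade $;
   datalines;
7 A
8 B
;
run;

* One-to-one reading: one SET statement per input data set;
data work.combined;
   set work.left;
   set work.right;
run;
```

WORK.COMBINED contains ID, SCORE and GRADE, and has two observations because the smaller input has two. ID takes its values from WORK.RIGHT, the last data set read.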
10. Raw data can be organized in several ways:
a. arranged in columns, or fixed fields
b. arranged without columns, or free format.
11. Raw data can contain:
a. standard data without any special characters
b. nonstandard data with special characters.
12. SAS has three input styles:
a. column input
b. formatted input
c. list input.
13. Informats
a. The $w. informat allows character data to be read. The dollar sign
indicates character-only data. The w represents the field width, or
number of columns, of the data. The period ends the informat.
b. The w.d informat allows standard numeric data to be read. The w
represents the field width, or number of columns, of the data. If a
decimal point exists in the raw data, it counts as one column of the
width. The period acts as a delimiter. The optional d specifies the number of
implied decimal places (not necessary if the value already has decimal
places).
c. The COMMAw.d will read nonstandard numeric data, removing any
embedded:
i. blanks
ii. commas
iii. dashes
iv. dollar signs
v. percent signs
vi. right parentheses
vii. left parentheses.
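A short formatted-input sketch using all three informats on invented in-stream data; the data set name, variable names, and column positions are assumptions made for this example:

```sas
data work.sales;
   /* product: columns 1-8, qty: columns 10-12, amount: 10 columns starting at 14 */
   input @1 product $8. @10 qty 3. @14 amount comma10.;
   datalines;
Widget   120 $1,234.50
Gadget     7 45,000
;
run;

proc print data=work.sales;
run;
```

The COMMAw.d informat strips the dollar sign and commas before the value is stored, so AMOUNT is an ordinary numeric variable.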
14. Record formats of an external file define how data is read by column input and
formatted input processes - The default value of the maximum record length
is determined by the operating environment. The maximum record length can
be changed using the LRECL= option in the INFILE statement.
15. List input can read standard and nonstandard data in a free-format record.
16. By default, list input does not have specified column locations, so:
a. all fields must be separated by at least one delimiter
b. fields cannot be skipped or re-read
c. the order for reading fields is from left to right
17. List input - By default several limitations exist on the type of data that can be
read using list input:
a. Character values that are longer than eight characters will be
truncated
b. Data must be in standard numeric or character format
c. Character values cannot contain embedded delimiters
d. Missing numeric and character values must be represented by a
period or some other character
18. List input - The default length of character values is 8. Variables that are
longer than 8 are truncated when written to the program data vector. Using a
LENGTH statement before the INPUT statement will define the length and
type of the variable.
19. List input missing values:
a. If missing values occur at the end of the record, the MISSOVER option
in the INFILE statement can be used to read them. The MISSOVER
option will prevent SAS from going to another record if values
cannot be found for every specified variable in the current line.
b. MISSOVER only works with missing values at the end of a record.
c. To read missing values in the beginning or middle of the
record, the DSD option in the INFILE statement can be used.
d. DSD changes how delimiters are treated when using list input:
i. by setting the default delimiter to a comma
ii. treating two consecutive delimiters as a missing value
iii. removing quotation marks from values.
e. If multiple delimiters are used in the original file, they can be identified
by using the DLM= option.
f. Modifying list input makes it more versatile. Two modifiers can
be used:
i. the ampersand is used to read character values that contain
embedded blanks
ii. the colon is used to read nonstandard data values and
character values longer than eight characters
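The options and modifiers above can be combined in one step. A sketch on invented comma-delimited data (data set and variable names are hypothetical), using DSD, MISSOVER, a LENGTH statement, and the colon modifier:

```sas
data work.customers;
   infile datalines dsd missover;   /* comma-delimited; do not jump to the next record */
   length city $20;                 /* allow character values longer than 8 */
   input id city $ balance :comma10. joined :date9.;
   format joined date9.;
   datalines;
101,"San Francisco","1,250",01JAN2020
102,,300
;
run;
```

In the second record the two consecutive commas give a missing CITY, and MISSOVER sets JOINED to missing instead of reading it from the next line.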
20. Creating customized layout
a. Using the ID statement in conjunction with the BY and SUM statements
will show the BY variable heading only once. If the variable specified by
the ID statement is the same as the one in the BY statement, then:
i. The OBS column is suppressed
ii. The ID/BY variable is printed in the far left column
iii. Each ID/BY value is printed at the start of each BY group and on
the same line as the group's subtotal
b. Each BY group can be printed on a separate page using the PAGEBY
statement. Format: PAGEBY BY-variable;
c. To double space the report, use the DOUBLE option in the PROC PRINT
statement.
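A sketch of such a layout, assuming SASHELP.CLASS is available; the sorted data set name is hypothetical:

```sas
* BY-group processing needs sorted input;
proc sort data=sashelp.class out=work.class_sorted;
   by sex;
run;

proc print data=work.class_sorted double;  /* DOUBLE: double-spaced listing output */
   by sex;
   id sex;              /* same variable as BY: OBS column is suppressed */
   sum height weight;   /* subtotal per BY group plus a grand total */
   pageby sex;          /* start each BY group on a new page */
run;
```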
21. Footnotes and titles - TITLE and FOOTNOTE statements are global
statements and are in place until they are modified, canceled, or the SAS
session ends. When redefining a title or footnote, all higher-numbered titles
or footnotes are canceled. A null TITLE or FOOTNOTE statement (one with no
number or text) cancels all titles or footnotes.
22. To temporarily assign a label or format to the data output, use the LABEL or
FORMAT statements within the PROC PRINT step. To permanently assign a
label or format to the data, use the LABEL or FORMAT statements within
the DATA step.
23. PROC REPORT LISTING - The appearance of the headings found in the list
report can be changed using two options:
a. HEADLINE underlines all column headings and the spaces between
them
b. HEADSKIP creates a blank line beneath all column headings or after
the underline if the HEADLINE option is used.
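A brief example of both options, assuming SASHELP.CLASS; note that HEADLINE and HEADSKIP affect the traditional listing destination only:

```sas
proc report data=sashelp.class nowd headline headskip;
   column name age height;   /* columns to display, in this order */
run;
```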
