SAS is produced by the SAS Institute in Cary, NC. It is the most powerful and comprehensive statistics
software available. We should avoid calling SAS a "program," since we write "programs" in SAS. But it
is also not appropriate to refer to SAS as a "language" like C++, Fortran, etc. SAS actually contains
several computer languages within it. Even "application" doesn't seem to fully describe SAS, so maybe
we should just use the term SAS gives itself at the top of its output, "The SAS System."
In 1976 the SAS Institute came out with its first software, a mainframe-based statistical analysis
software package. Since then, SAS has enjoyed phenomenal growth. Its software is available for all
the major platforms, and has grown beyond statistics to a variety of data management and business
applications. To read more about SAS, see their website at www.sas.com. A history timeline is given
here.
SAS is also a community. Users groups (SUGI, NDSUG, RRVSUG) provide opportunities to interact with
other SAS enthusiasts, hold conferences and publish articles about SAS programming techniques.
There is even an annual SAS ballot, in which users can vote for the changes and enhancements they
would like to see SAS work on.
SAS Windows
PC SAS is a Windows GUI for running SAS. When you open the program, you will notice three
windows (probably overlapping) called “Editor,” “Log,” and “Output.” An example appears below. The
windows have been arranged so you can see them all.

The editor window is used to write programs. Text output will appear in the output window, and the
log window will contain error messages and other program execution information. The typical process
is to type a program in the editor, then submit it (running man icon) and look at your log and output
results. After a program has been submitted, it remains in the editor, so you can make modifications
and submit it again.
You can save your programs and open previously saved programs from the file menu. However, the
"Save" command will apply to the active (top) window. You can save editor, log, or output files,
accordingly. An extension of .sas, .log, or .lst (respectively) will be added automatically unless you
supply some other extension. (Using .txt may be helpful if you intend to open the file with Notepad or
a word processor, but most of the time it is best to stick with SAS's defaults.) There is one little catch
you should remember: In order to open a saved program, an editor window must be active when you
select "Open" from the file menu. If you have closed your editor window, you must first open a new
one by selecting "Enhanced Editor" from the "View" menu, then proceed to the "File-->Open" dialog.
If you have font problems (incorrect characters) when you open an output file in a word processor, try
changing the font to "SAS Monospace." This should work if you are on a computer with SAS installed.
If SAS is not installed, you may not be able to get all of the characters correct, but any monospace
font, like Courier, will straighten out most of the formatting.
Take a look at the program statements in the editor window. Notice that each line ends with a
semicolon. SAS uses semicolons to define the end of a statement. It doesn't matter how the text is
arranged, whether there are extra spaces, indentation, extra lines, multiple statements on a line, or
statements split across lines. Such formatting can, and should, be used to enhance readability for
humans, but to SAS all that matters is where the semicolons are.
Enterprise Guide is a new environment for running SAS. Although the same programs work and
produce similar results, there are some differences in the appearance and behavior of the interface. If
you purchase the "SAS Learning Edition," this is the interface that will be presented after it is installed.
An example appears below. In Enterprise Guide, your work is organized into projects, which appear in
a collapsible tree structure on the left. The three windows mentioned above do not open
automatically. In order to type in a program, you open a code window which replaces the editor.
Output windows open as needed when programs run, but not necessarily automatically. Output, log,
and code windows (also datasets) can be opened by double-clicking icons in the project tree. The
contents do not accumulate in the log and output windows as they do in PC SAS. When you submit a
program, you have the choice of overwriting previous output or starting a new node in the project
tree. This keeps your results more organized. Note that the "running man" icon that is used for "submit"
in the old version is replaced by a sheet of paper with a down-arrow beside it.
You can use the regular interface described under "PC SAS" with Learning Edition, but you have to find
the sas.exe file in the program files directory. Make a shortcut to this file and place it on your desktop
to use for starting SAS.

Help
When you need more information about making SAS do what you want, there are several sources you
can access. The first is the "Help" facility installed on your computer and found at the right end of the
menu bar. Unfortunately, because SAS is a huge and powerful product, the help can be a challenge to
navigate. The appearance and content of your help menu may vary depending on the version of SAS
you are working with. The examples here use version 9.1.3. Under "Help," select "SAS Help and
Documentation." Then you should see a window like this:

Click on the plus sign by SAS Products and the tree opens like this:

There is an important lesson right here in looking at this list in help. SAS is not just one program. It is
a system of interconnected modules ("products"). In some ways it is like Microsoft Office, which has a
number of components like Word, Excel, PowerPoint, etc. The list you see under "SAS Products" is not
necessarily all of them, and you do not necessarily have all those listed available to you. In this
course, we will only be dealing with a few of them, and most of the time, we will be in "Base SAS."
Click the plus sign by "Base SAS" and you will see:

Again we have a long list of choices. The challenge is to learn where to look for the information you
need, because there is simply so much that it can easily cause "information overload." As you begin
learning, the most helpful sources of information will be under "SAS Command Reference," "SAS
Procedures," or "SAS Language Dictionary." Spend some time looking around in "Help" and
familiarizing yourself with what is there. Notice that there are also index and search facilities.
Unfortunately, these often return too many "hits" from modules other than those you are interested in.
Thus, there is no substitute for learning to navigate the help tree!
SAS also provides SAS OnlineDoc 9.1.3 for the Web which is similar to the help but may be more up-
to-date. In fact, you can download it in PDF form and print whatever you want. The Online version
has a search facility which allows you to restrict your search to a particular module or procedure. In
some cases this can be extremely helpful for avoiding unwanted hits.
Program Organization
A statement in SAS is sort of like a complete sentence (although the analogy won't extend as far as
having a subject and verb) or a single command. Every statement must end with a semicolon, which
signals the end of a command. Inside a statement you will find such things as keywords, options,
and user-supplied names. SAS is pretty flexible in its ability to interpret words within a statement.
For example, you can separate words with spaces, tabs, or returns--any white space will be treated the
same way. You can split statements across multiple lines or put multiple statements on one line. We
will try to develop a style that is easy to read and follow by using indentation and comments. SAS is
also forgiving, in that it tries to figure out things that might be mistakes. For example, you can put
two statements together without a space, because as long as a semicolon is there, SAS can tell where
one ends and the next begins.
SAS is designed for data analysis, so programs are organized into “steps” that correspond roughly to
steps you go through in analyzing data. However, these steps are not just a matter of programming
style. They are blocks of code that SAS treats as a whole. Some information is passed from one step
to the next, but you should think of steps as independent units.
There are data steps and proc steps. The main purpose of data steps is to create data sets. Proc
steps may perform analysis tasks or other actions. Every data step begins with the keyword "data"
and every proc step begins with the keyword "proc." (How logical is that?) A "run" statement can
signal the end of a step (we will deal with exceptions later) and triggers execution of statements
preceding it. A step always ends when another statement that begins with "data" or "proc" is
encountered, signaling a new step. Most of the time there must be a "run" statement at the end of a
program or the last step will not be executed. Some procs also require a "quit" statement in order to
stop them completely.
Some commands are called "global statements" because they are not really part of a step. Some
examples are "options," "title," and "libname" statements, which will be described later. The effects of
these commands typically hold across many steps.
We should understand the flow of events after a program is submitted. In most cases, SAS first reads
all the commands in a step, checks them for errors, then executes the instructions before going on to
the next step. If errors are found, SAS writes error messages and warnings to the log. If there are
serious errors, SAS will not execute the statements in the step. If another step follows, SAS will go on
and try to run the next step. This may produce unexpected results if a later step depends on output
from a previous step that had errors. So, errors do not always cause SAS to stop reading and
executing code. This is why it is important to ALWAYS CHECK THE LOG. You may get output that
looks fine, but was produced in spite of errors that caused the results to be wrong!
A step could be just one line of code, but it could also be many lines. The program shown below has
four steps. Each begins with a keyword, either "data" or "proc," and ends with a run statement (even
though it is redundant to place a "run" before another data or proc step, it turns out to be useful in
some cases). Don't worry about the meaning of the other statements right now, but study the
structure of the steps. Note how the editor separates the steps with horizontal lines and adds color
codes to various elements of syntax.
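For example, a program like the following has four steps; the data set names, variables, and values
here are just placeholders, not the exact program in the original screenshot.

data one;
   x = 1;
   y = 2;
run;
proc print data=one;
run;
data two;
   set one;
   z = x + y;
run;
proc means data=two;
run;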

Incidentally, SAS has two editors. The one that opens by default in PC SAS is the "Enhanced Editor."
We may explore uses of the other editor later, but for now we'll stick to this one. If you ever need to
open another editor window (such as if you have closed one, or you want to have two programs
showing at once), choose "Enhanced Editor" rather than "Program Editor" in the "View" menu.
Exercises:
The following may be copied into a word processing document, which may be edited to answer the
questions and then submitted. There is no need to run SAS programs for these exercises.
1. How many steps are there in the following program? Mark the beginning and end of each step and
label it as a data or proc step.
data one;
do x=1 to 1000;
y=int(ranuni(0)*6+1);
output;
end;
run;
proc freq;
tables y;
data two;
set one;
z=7-y;
run;
proc freq;
tables z;
run;
2. Suppose the program above was modified so that an error occurred in the second step. What
would probably happen when SAS tried to process the remaining steps? (Be sure you have #1 right!)
3. Find three errors in the following program, based on the information in this lesson. (Assume there
are no statements split across multiple lines.)
data thirsty;
infile "c:\drinkexp.txt"
input subject brand rating;
procprint;run;
data two;
set one;
Chapter 2
Data Sets
SAS stores data in “SAS data sets,” using its own internal database format. If you are familiar with any
database software (Oracle, Sybase, Dbase, MS Access, etc.), you will find that data sets correspond to
tables in a database program. The rows are called observations (database "records"), and the columns
are called variables (database "fields"). In the example data set below, x, y, z, and w are variables,
while 1, 2, 3, and 4 are observation numbers. It is also becoming more common to use the database
terms with SAS, so it is good to be aware of both terminologies. Rows=Observations=Records.
Columns=Variables=Fields.

However, data sets contain more than just variables and observations. Such things as how values are
to be printed, labels to use instead of variable names, and sorting and indexing information can also be
included.
As you learn about SAS, you will see that SAS gives you a great deal of control over almost any aspect
of your work that you might think of. However, greater control is obtained by using more complex
program statements. So, SAS has many "default" settings that save you from extra work and
headaches as long as you are satisfied with what is specified by the default. Therefore, many standard
actions can be accomplished by very short SAS commands. When there is need for more control,
additional commands are available. The first of these "defaults" that we will encounter concerns where
data sets are stored.
Libraries
SAS stores data sets in libraries. (They are really just computer directories or folders, but the term
comes from the mainframe language.) There is a default library called “work.” This is where the data
will be placed if you don't specify another location. However, work is a temporary library: when SAS
is closed, all data sets in work are deleted. If you want a data set to be saved permanently, such as
in your "My Documents" folder or anywhere else you wish, you can tell SAS to
designate a permanent library using a libname statement. The libname statement will look
something like this:
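For example (the folder path here is only an illustration; use any folder that exists on your computer):

libname myownlib "C:\mysasfiles";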

There are three parts to this command. The first part is the keyword "libname" which tells SAS you
want to define a new library. The second part, "myownlib" is the name that SAS will use to refer to this
library. This is called the libref. You can supply any name you like here, as long as it meets the
guidelines for allowed names, which are: you can use letters, numbers, and underscores, but can't
start with a number. A libref can only be eight characters long, though SAS variable names can be up
to 32 characters. The third part of the statement is the pathname (DOS style) showing the actual
location of the folder or directory you want to use. It is enclosed in quotes (single or double). On
some Windows systems, 'Desktop' and 'My Documents' can be used and will automatically be
assigned to the correct path. 'A:' usually works to assign the floppy drive, but only if there is a disk in
the drive (applies to any removable volume). Note: The libname statement does not create a folder. It
essentially creates a shortcut or alias to an existing folder. We call the library "permanent" because the
data is not deleted when SAS closes, but the libref itself will normally have to be recreated in future
SAS sessions.
The illustration below shows the SAS Explorer pane (from the left side of the SAS window, note the tab
at the bottom) displaying the currently defined libraries. Some of these libraries are standard in SAS.
The one called "Rao" has been created with a libname statement. Note that "Work" is explicitly listed.

If you double-click the "Work" file drawer, the data sets in "Work" become visible. The spreadsheet
icon in the example below represents a data set named "One." If you double-click on "One" it will open
in a "Viewtable," a spreadsheet-like view.

The viewtable has two modes, "Edit" and "Browse." In browse mode, you cannot change the data. To
switch modes, go to the "Edit" menu and select the mode you want. A new data set can be created by
selecting "Table Editor" under the "Tools" menu.
(Note: In order to go back to previous windows in the explorer pane, you press the folder with the up
arrow on it. But this will disappear if the "Explorer" is not the active window. Click on the "Explorer"
window to bring it back.)
(Note: There is also another "Explorer" under the "View" menu. This opens a window much like a
Windows Explorer.)
When you create a SAS data set, you give it a name, such as "One" in the example above. Names can
contain letters, numbers, and underscores, but cannot start with a number. They can be up to 32
characters long. You can refer to a data set in the work directory by its name alone, because that is
the default location. But data sets are actually identified by “two-level names,” where the library is
given first, followed by a dot, then the dataset name. In other words, the form is libref.datasetname.
Since work is the default library, datasetname alone is equivalent to work.datasetname. In order to
store a dataset permanently, specify a two-level name, with a first level being a defined libname other
than work.
In rare cases, you may want to change the default library to something other than work. To do this,
use the special libref "user" (libname user 'C:\folder1\myfolder';). This allows you to use a one-level
name with a permanent library that you specify. It does not change which libraries are permanent and
temporary. To create temporary data sets when the default library has been changed, use a two-level
name with "work" in the first position.
There is also a way to export data in some common file formats. See "File-->Export Data."
In Enterprise Guide, the datasets appear in the project tree under the code that created them. They
will be saved when you save your project. (But when you run code you may get a choice of whether or
not to replace the existing data. This may affect what is saved in the project.)
Getting Data Into SAS
Most of the time, when you begin working on a project in SAS, your data will not be in a SAS data set.
In order for SAS to perform analysis tasks, the data will need to be brought into a SAS data set.
Because data may come in so many different forms, SAS is very flexible and provides a variety of ways
to do this. We will keep our first examples simple, but be assured that SAS can handle very complex
data reading tasks!
The simplest way to get data into a SAS data set is using “instream data.” This means the data is
included “in the stream” of the programming statements that will load it. Here is an example. An
explanation of the program follows.
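The program is along these lines; the folder path, names, and ages below are placeholders, but the
structure matches the explanation that follows.

* Read three observations instream;
* and store them in a permanent library;
libname mysaslib "C:\mysasfiles";
data mysaslib.myfirst;
   input name $ age;
   cards;
Fred 45
Wilma 42
Pebbles 3
;
proc print;
run;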

The first two lines are comments. A comment is simply text inserted into computer code that the
programming language will ignore. Comments are often used to explain what is happening in the
program (for others or for future reference). Sometimes they are used to temporarily disable some
statements in the program. Sometimes they are used to "beautify" or enhance the readability of the
program. In SAS, there are two ways to write comments. A statement that begins with an asterisk
and ends with a semicolon is a comment. This type of comment will only work for one statement at a
time. If a larger section of a program is to be commented, that is, multiple statements in one group,
you can use "/*" to begin the comment and "*/" to close it. Even semicolons are ignored by this
syntax.
The third line contains a libname definition. The library will be called “mysaslib” and the actual folder
location on your computer is given in the quote marks.
The fourth line begins with the keyword "data," which tells SAS this is the beginning of a data step.
We can see by the two-level name that the dataset will be called “myfirst” and will be stored in the
“mysaslib” library.
The fifth line, the input statement, tells SAS what variables to put into the data set. Each variable
name may need to be followed by a code that tells SAS what kind of variable it is and how to read it.
These codes are called informats. However, there is a default for this, called "standard numeric,"
which is just an ordinary number in decimal form, with no commas. Since "age" fits this description,
we do not have to include an informat for it. On the other hand, "name" is a character variable. The
"$" tells SAS to read "name" as character variable eight characters long. Character variables have no
numeric value, and can contain letters, numbers, and most other symbols. They are also known as
"strings."
The “cards” statement tells SAS that the list of data to read is coming next ("datalines" and "lines"
may be used as synonyms for "cards"). The data are organized in a straightforward way. Each
observation is on one line and the variable values are separated by spaces. This is called “list input.” A
semicolon on a line by itself indicates the end of the data. (The data are not program statements, so
there are no semicolons in the data list.)
In addition to comments, we can use indenting to make our programs more readable. For example,
the statements under a data or proc statement can be indented to make the steps look like an
outline. Data given in cards should be placed along the margin, though.
Next, SAS encounters a proc statement, and will therefore compile and execute the data step before
going on. The dataset “myfirst” is now created and populated with three observations.
SAS continues by compiling the proc step, which consists of only the “proc print” statement. Without
any other commands specified, this will cause the default action of printing the most recently created
data set, which, of course, is “myfirst.” Print does NOT mean "print to the printer." Proc print produces
a formatted "printout" of the data set in the output window. You can save or print (really) this output
using File Menu commands. The result looks like this:

In case you do not want to use the default (last created) data set, or just want your program to be
more obvious to the reader, you can specify the data set that proc print will use this way:
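One way to write this:

proc print data=mysaslib.myfirst;
run;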

Using Titles
SAS provides several ways to modify the appearance of the output it produces. Notice that in our
example the heading "The SAS System" together with the time, date, and page number appear at the
top of the page. "The SAS System" is the default page title. You can supply your own titles by using
title statements, as shown here:
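For example (the title text is whatever you like):

title 'My First SAS Data Set';
title2 'An Example of Title Statements';
proc print data=mysaslib.myfirst;
run;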

You can have multiple lines of titles. Just add more title statements with higher numbers. Title
statements are global; they don't belong to a particular proc and are in effect until changed or
deleted. Redefining a title deletes all previously defined titles of that number or higher. To delete a
title without replacing it, just include a blank title statement, like "title3;" . This will delete the old
title3, as well as title4 or any other higher-numbered titles.
Producing HTML (web page) output
Both PC SAS and Enterprise Guide can produce HTML or text output. Enterprise Guide displays HTML
by default:

Pretty, isn't it? To get HTML in PC SAS, you can go to "Tools-->Options-->Preferences," click the
"Results" tab, then check "Create HTML."
Exercises
Submit one Word document with all answers either typed in or copied in.
1. Indicate which of the following are valid names for variables and librefs (two questions).
a. sales
b. SALES
c. Sales
d. Sales.12
e. Sales.Month.12
f. 12MonthSales
g. Month12Sales
h. More_Sales
i. More&More
j. Moore_and_Moore
k. Month12
l. _Month12_
2. If a database has a table with 500 records and 14 fields, how many observations and variables
would a SAS dataset containing the same information have?
3. If a spreadsheet had 5 columns and 4 rows, how many observations and variables would a SAS
dataset containing the same information have?
4. Looking at the screen print below, answer the following questions:
a. Is this a temporary or permanent data set?
b. How many observations are there?
c. How many variables are there?
d. How many records are there?
e. How many fields are there?
f. How many rows are there?
g. How many columns are there?

5. Copy the data below into the SAS editor (use copy and paste) and write a data step to read it,
followed by a proc step to print it (to the output window). Make sure the printout matches the original
data. The variables are Name, Age, and Grade. Age and Grade are to be read as numeric variables.
Allow the data to be saved to the work directory. (Submit Program, Log, and Output.)
Marissa 13 7
Andy 7 1
Martha 9 3
John 10 4
Larry 11 6
6. Copy the following data into the SAS editor and write a data step to read it into a data set. Print
the results in html format. The variables are Field, Fertilizer, and Variety. Submit the Program, Log,
and the html output instead of the normal output.
1 A Magnus
2 B Arbin
3 A Carver
4 B Visser
5 A Turnip
6 B Danun
7. Create a libref for a folder on your hard drive and another for a floppy disk. (If you don't have a
floppy disk, you may use a pen drive). Modify the program in problem 5 so that the data set is saved
in each location. Use the explorer window and the viewtable to verify that it is actually there. (Submit
your program and log for this problem.)
8. Submit only the program statements (editor) for this problem. Note that to "reference a data set"
means to tell a proc, like proc print, which data set to use.
a. Write a libname statement that assigns the name "SaleData" to a library that corresponds to folder on
the C drive with the same name. Then show how you would reference a data set called "January" in
that folder.
b. Write a libname statement that makes the folder in part a the default library and show how you would
reference the data set "January" in that folder. Show how you would now reference a temporary
dataset called "February".
Chapter 3
Keeping Up Appearances
Suppose your output doesn't fit on the page the way you like when you print it or paste it into a word
processor. An options statement can be used to adjust the number of lines and columns used in the
page formatting. In the example below, "ps" stands for "pagesize" (you can also spell it out if you
like). This is the number of lines on a page. Next, "ls" stands for "linesize" (you can spell this out
too), which is the width of a line in characters. If you don't want the date and time displayed, you can
include "nodate," and if you'd like to reset the starting page number, "pageno=n" will do that, where
you replace "n" with the number you want. If page number is not reset, the page numbers keep
incrementing in the output window, even if you clear it. There are many more options available (see
the help or manuals).
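A typical options statement might look like this (the values shown are just examples):

options ps=55 ls=80 nodate pageno=1;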

Reading from a File


Now, back to reading data. Our example data set is small, and easy enough to type in. If the data set
is large, it may not be convenient to type everything into the program. In this case, the data may be
saved in a text file outside of SAS. Let's say we had the same data saved in a file called F:\sample.txt.
The following program would then have almost the same effect as the previous one (can you tell what
will be different?):
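A sketch of that program, assuming the same variable names as before:

data myfirst;
   infile "F:\sample.txt";
   input name $ age;
run;
proc print;
run;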

(Since the library isn't specified, this data will be saved in work.)
There are some details to take note of here. It may seem that the infile statement replaces the
"cards" section, but that is not what really happens. The infile statement comes before the input
statement, whereas the "cards" section comes after the input statement. (In fact, "cards" must always
be placed last in the data step.) When you include a "cards" statement, SAS automatically assumes
there is a default infile statement that says "infile cards;" before the input statement. In other words,
SAS treats the "cards" section just like an external file. This is confirmed by the fact that it does not
appear in the program statements copied to the log. You can explicitly include the infile cards
statement, especially if you need some of the optional commands available with infile. This will be
discussed further in a future lesson.
There is also an import wizard available from the File Menu. It will load Excel and other popular file
types too. (One caveat for Excel imports: You must look under the "options" button in the wizard,
where there is a check box to indicate whether the first line contains data or variable names.) In
Enterprise Guide, this can be accomplished by going to the "Insert" menu and choosing "data."
Labels for Variables
SAS allows variable names to be 32 characters long. You can have upper and lower case letters,
numbers, and underscores, but cannot start with a number. Programming statements do not
distinguish between upper and lower case, but the cases are remembered for use in output. This
flexibility takes care of most variable naming needs. However, there are times when even more
flexibility is desired. Perhaps you want true spaces (not underscores) between words, or special
characters that are not allowed. Or maybe you'd like to use a short name like "LEye" in your program,
but want the output to say "Left Eye Acuity." For these situations, SAS allows us to assign labels to
variables. Labels can be up to 256 characters long and may include almost any text symbols. Labels
are assigned in the data step and are stored with the data set. Many SAS procedures, like proc print,
can use the labels in producing output. In the example below, note the syntax of the label statement
in the data step: The statement begins with the keyword label, followed by a variable name, equal
sign, and the label in quote marks. More labels may be assigned in the same statement. They are
listed one after the other separated by spaces (no commas). To use the labels in proc print, the option
label is added to the proc print statement. The example shows the results both without and with the
label option.
Page No.15

Character Informats
So far we've discussed reading fairly straightforward data consisting of numbers or short words. We
will now explore more complex data types.
Consider this example:

No errors in the log....


Why isn't "Amphitheater" complete? The default length for character variables is eight. SAS has only
read eight characters from the data even when more characters are present. We need to tell SAS to
make the x variable hold more characters. Try this:
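A sketch of the corrected program (only the one value discussed is shown in the data):

data words;
   input x $12.;
   cards;
Amphitheater
;
proc print;
run;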

The "$12." expression in the input statement is called an "informat." Think of it as an "input format"
that tells SAS what to expect the data to look like. The dollar sign signifies that it is a character
variable, and the 12 is the length of the field to be read, and also the length of the resulting variable.
All informats have a period in them. This is part of the syntax that SAS uses to recognize an informat.
Let's look at four more examples to demonstrate the behavior of informats. These examples will use
two character variables. In the first example, we show what happens with only dollar signs to indicate
that they are character variables. Of course, the values are now cut off at eight characters, but
otherwise the data are read correctly:

Now suppose we attempt to fix the length problem by putting in informats. Then, SAS reads the full
12 characters for x, regardless of whether or not there are spaces included:

To fix this, we need SAS to treat a space as a delimiter just as it does when a dollar sign is used alone.
The "colon modifier" placed in front of the informat will give this result. In fact, the dollar sign alone
is an abbreviation for ":$8.".
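A sketch of the colon-modifier version; the words in the data are made up.

data words;
   input x :$12. y :$12.;
   cards;
Amphitheater Garden
Pool Cabana
;
proc print;
run;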

The character informat can also be used to create variables shorter than eight characters. For large
data sets, this can result in considerable space savings. For example, the data set might contain a
variable that is a one-character code (such as M or F for male or female). Using an informat of $1.
would then be appropriate, and would save seven bytes of storage for every observation.
Running out of Line
An additional problem occurs when using these techniques to read from an external file. A character
informat at the end of a line causes SAS to try to read all of the characters for the width of the field
specified by the informat. If the line is shorter than that, errors will be reported in the log and the data
will not be read correctly. One solution is to add the option "truncover" in the infile statement after
the file path. The meaning of "truncover" is something like "keep reading over the whole field even if
the line is truncated." (This problem does not occur with instream data.) A colon modifier with the
informat can often solve this problem, too.
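For example (the file path and variable names here are hypothetical):

data scores;
   infile "C:\mydata\scores.txt" truncover;
   input name $12. score;
run;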
There is a related option, "missover" that can also be used when there are not enough variable values
at the end of the line and you want the remaining ones set to missing in the SAS data set. Without
"missover", SAS would go to the next line and try to continue reading the input variables for which it
did not find values. The two options behave almost identically. The difference is that missover will set
a short value at the end of the line to missing when an informat (no colon) is used, while truncover will
keep whatever characters are there. Like missover, truncover sets variables to missing when their
values are absent entirely at the end of a line.
Exercises
From now on, include titles in all your output that give your name, the lesson number, and problem
number. Unless otherwise directed, do not change data sets, and turn in your editor, log, and output.
1. Download the file at this link. It contains four variables, FirstName, LastName, Age, and Score.
Write a SAS program to read this data from the external file and print it to the output window. Save
the data set in a library that you specify (not work). Experiment with title and options statements to
change the appearance of your output.
2. Copy the data below into the SAS editor and write a data step to read it, followed by a proc step to
print it (to the output window). Make sure the printout matches the original data. The variables are
first and last name, sex, and age. Include labels in your data set and output. Also include a two-line
title and suppress the printing of the time and date on each page.
Andy Stewart M 47
Martha Gustafson F 55
Marissa Maneschevitz F 32
John Fitzgerald M 28
Jacqueline Martin F 33
3. Download the file at this link. Similar to a previous exercise, it contains four variables, Age, Score,
LastName, and FirstName. Examine the data file carefully, then write a SAS program to read this data
from the external file and print it to the output window. Include an appropriate title.

Lesson 4: Numeric Formats and Informats

SAS stores all data with only two variable types, character and numeric. We have seen that character
variables can have different lengths. In contrast, numeric variables (including integers and dates) are
almost always stored in 8-byte floating-point form. These numbers have a precision of about 16
significant digits. Some other languages refer to this as a "real" data type (which is not
mathematically correct, of course). Therefore, we do not have to be concerned about how numbers
will be stored. We do, however, have to think about how to read and write them.
Let's begin with a simple example. Here we see a data set with three numbers. There are no
complications in reading these numbers; the variable x in the input statement is a numeric variable by
default (no informat). Observe that in the output, the first two observations were written in the same
form they appeared in the data, but the third was not. Because it is such a large number, SAS
defaulted to printing this number in scientific notation. The "E18" is interpreted as "times 10 to the
18th power."

The appearance of observation 3 in the output can be changed by adding a format statement to proc
print, as shown below. The actual format code is "19.". The period or decimal point is part of the
code and is standard syntax for all formats and informats. Character formats and informats start with
"$". In most formats and informats there is a number right before the period that indicates the field
width. For numeric variables only, a number after the period indicates the width of the decimal portion
of the number.
Notice that 19 digits are now displayed; however, the last two are not the same as the original data.
There has been some rounding error.
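The format statement in proc print looks something like this (the data set name below is a
placeholder; the variable is x as before):

proc print data=bignums;
   format x 19.;
run;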

Commas can be included in the output. Note that the field width had to be increased to accommodate
the commas.
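Something like this (the width is just an example):

proc print data=bignums;
   format x comma25.;
run;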

Or perhaps you want dollars:
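Again, a sketch (the width and decimal places are chosen arbitrarily):

proc print data=bignums;
   format x dollar26.2;
run;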



Let's turn to reading numbers in various formats. First, we should note that we cannot read numbers
that are not in "standard format" without an informat, a code that tells SAS how to interpret the
number it is reading. See what happens in this example:

The commaw. (dollarw. is the same) informat reads numbers with commas, as well as dollar signs and
some other imbedded symbols. The w stands for the field width, but be sure to count the commas
(and dollar signs) when determining the number of columns needed. Also, make use of the colon
modifier just as with character informats, if the field widths vary. (If you are working with other
currencies or European style numbers, check the documentation for alternative informats.)
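A sketch of reading such values with the comma informat and a colon modifier (the data lines are
placeholders):

data sales;
   input amount :comma12.;
   cards;
$1,499.99
1,250
300
;
proc print;
   format amount dollar12.2;
run;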

Handling decimal places correctly can sometimes be tricky. Here we see a basic example with no
informat. In addition, no format has been specified in proc print, so SAS chooses a format that it
"thinks" is "best." Note that the third observation is rounded, but this is only because of the printing
format that was used, and does not mean the number is rounded off in the data set. If you add a
format statement with a 9.4 format to the proc print step, all the original digits will be displayed.

Here is an example that is incorrect. Numeric formats and informats can specify a number of decimal
places by putting a number after the period. The informat below, 4.2, is saying that SAS should read a
field of width 4 with two decimal places.

One problem is that the decimal is part of the field and needs to be counted, which caused the second
and third observations to be cut short. We should have 5 instead of 4 for the field width. Secondly,
when a width for the decimal portion of the number is specified in an informat, it means that those
decimal places are to be assumed whenever no decimal point appears in the number. It does not
override an existing decimal point. The first observation is thus interpreted as 11.22, that is, assuming
that the last two digits should be in the decimal portion. This may or may not be correct, so great care
must be taken when using this method. Normally, it would only be used when the data is known to be
recorded with an implied decimal. Usually such data are not mixed (with and without decimals). The
most common situation in which mixing might occur is when a variable is a percent or proportion, and
has been recorded inconsistently, using both notations. In that case, you might want 22 and .22 to
mean the same thing. A 3.2 informat would accomplish the correct result.
Here the field width is corrected, the decimal is left off of the informat, and a format is included in proc
print to display all the decimals.
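A corrected sketch (the data values here are placeholders):

data decs;
   input x 5.;
   cards;
11.22
3.456
678.9
;
proc print;
   format x 9.4;
run;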

Formats can be permanently stored with the data set. If this is done, the formats are then available to
any proc that can use them, without writing another format statement. The program shown below
uses the "5." informat, but also stores the "9.4" format in the data set. There is no need for another
format statement in proc print. However, if you want to use a format other than the one stored with
the data, you can still specify it in proc print (perhaps "format x 7.2;"). It should also be mentioned
here that SAS syntax allows specifying a format for several variables at once. For example, "format x
y z 7.2;" would apply the 7.2 format to all three variables, x, y, and z. Or, you can use "format x 7.2 y
z 5.;" which will apply the 7.2 format to x and the 5. format to y and z.

While this covers the most frequently needed informats for numbers, there are many other special
cases. Check the SAS documentation (under Base SAS, Language Reference: Dictionary) for other
formats and informats.
Date Formats and Informats
Since SAS has only two data types, you may wonder what we do with dates. While it is possible to
store dates in character form, doing so would make calculations with dates very difficult. Dates are
stored as numbers: specifically, the number of days since (or before) January 1, 1960, which is "day
zero."

You cannot read a date without an informat, except perhaps in the rare event that it is already coded
as the number of days since 1/1/1960. Dates can be written in many ways, and SAS can read almost
any of them, with the right instructions. In the example above, the dates are given in the most
common American format, month/day/year. The informat uses the codes mm for month, dd for day,
and yy for year. In some other countries, dates are written in the form day/month/year, so in SAS we
simply switch the order accordingly, to "ddmmyyw." Once again, w stands for the width of the field that
is being read, including the delimiters. SAS does not require a specific delimiter when interpreting the
date, so it does not matter if it says 1/1/1960 or 1-1-1960 or 1.1.1960. The width is usually 10 to
accommodate two digits for month, two for day, and four for year, plus two delimiters. However, it
could be 8 if only two year digits are used, and it could be 6 or 8 if no delimiters are used (010160 or
01011960). SAS will interpret these variations correctly as long as they are not ambiguous (1160 for
January 1, 1960 would not work), as in this example:

Of course, we don't want our printout to give dates like "0" or "16604," since they are not very
meaningful to human readers! Therefore, we should include a nice format to make it readable.
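For example, using the data set from the sketch above:

proc print data=dates;
   format bday mmddyy10.;
run;
proc print data=dates;
   format bday worddate20.;
run;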

There are many formats to choose from. See the SAS documentation for more. Here are some more
examples. In the program below, the dates appear in the data in three different formats. The first one
is like those discussed above, the second is what SAS considers a "standard" date, and the third is a
Julian date (used in many businesses--it's the year followed by the number of the day in the year).
Three proc print steps follow, to demonstrate the dates with no format, and with six different formats.
As with character and numeric formats, date formats can also be stored with the data set.

Note that SAS also has informats and formats for time values and date-time combination values. If
you have need of these, or want to explore the many other possibilities, check out the SAS
documentation.
Interpreting Two-digit Years
In the years leading up to 2000, the government and businesses became very concerned about the
"Y2K" problem. Many programs stored years using two digits, because most business software did not
need to deal with dates outside of the twentieth century. In 2000, this all changed. SAS never had a
"Y2K" problem for storing dates, since a date is just the number of days before or after January 1,
1960. However, there can still be a problem when reading dates with two-digit years from a file or
instream data. SAS interprets two-digit years as belonging in a specific 100-year interval. The first
year of this interval is called the Year Cut Off, and the SAS system option "YearCutOff=n" is used to set
it in an options statement. The default value of YearCutOff is set by the administrator when SAS is
installed, typically 1920.
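A sketch of how the option affects two-digit years (the dates are placeholders):

* With yearcutoff=1920, the two-digit year 59 is read as 1959 and 21 as 1921;
options yearcutoff=1920;
data dates;
   input bday :mmddyy8.;
   cards;
1/1/59
1/1/21
;
proc print;
   format bday mmddyy10.;
run;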

Exercises
1. Copy the raw data below into a SAS program. a) Write a data step to read these data into three
variables: Invoice, Amount, and Quantity. Using proc print, display the data so that all the Amount
values are formatted like the Amount value in the first observation of the raw data. b) Then, revise the
data step so that a format for Amount is stored with the data set, and show the results in proc print,
without using any format statement in proc print. Use appropriate titles that identify which part of the
exercise the output comes from.
12244 $1,499.99 144
32189 $20,000 1
92314 49.28 3
2. Copy the data below into a SAS program. Write a data step to read them into a SAS data set. The
variables are capital, state, capital population, and state population. Store labels with the data set.
Print the data with proc print, displaying labels, using appropriate formats and a title.
Bismarck ND 56,344 633,837
Pierre SD 13,939 764,309
Helena MT 26,718 917,621
Madison WI 218,432 5,472,299
3. Download this file. The data contain the names of some of the past presidents of the United States
together with their birth and death dates. The data are aligned in columns, as shown in the example
below (the longest name). Save the file to your computer and use an infile statement to read the data
from the file into a SAS data set. Read the entire name into one variable, using a character variable
length of 23. Store formats for the dates with the data set. Write three proc print steps that result in
three different sets of date formats. Use appropriate titles.
William Henry Harrison 02/09/1773 04/04/1841

Lesson 5: Input Styles


It's time to dig a little deeper into the technical details of data reading. Before we do that, it is
important to make clear what we mean by "reading" and "writing" in a data step. "Reading" is the
process whereby SAS interprets raw data from a file or an instream cards section of the program, or
when it accesses an existing data set. Thus, "reading" means SAS is bringing data values into
computer memory to process in the data step. "Writing" is the process of saving the finished data to a
SAS data set file on a hard drive (or other computer storage device). It is NOT the same as "printing,"
which is a term by which we usually mean "use proc print to display text in the Output Window."
Imagine that you are a computer and are instructed to read a file. What do you actually do?
Well, as a human, how do you read a book? First you have to open it. You find the starting place, and
you begin reading, which at the most fundamental level means you read a character, interpret it, and
move on to the next one, and repeat. Upon reaching the end of a word, you interpret the word, and
move on to the next one. Upon reaching the end of a line, you move down a line and go back to the
left and read the first character, and so on, until you reach the end of the book, at which point you
close it and stop. Well, something like that, anyway. For us, so much happens automatically, we don't
have to think about it. But, a computer doesn't think at all. It merely follows a sequence of
instructions, and this sequence, in some ways, is similar to processes we humans follow when reading
a book.
When SAS opens a file to read it, it creates two pointers, which are nothing more than numbers for
keeping track of position in the file. One pointer is a position in a line of text (column), the other is the
line number in the file. (We are assuming a standard text file here. SAS can also read binary files, but
that is another subject.) The pointers are initially at column 1, line 1. As SAS reads data, the pointer
moves along the line (not literally, of course, the number just changes). SAS needs to know when to
begin reading a value and when to stop. What is obvious to us is not necessarily obvious to a
computer. Consider the following line:
12 33 42 51 24
Do you see five numbers? How about this:
1233425124
If I told you this was a string of five two-digit numbers, you would know immediately how to interpret
it. We are now going to consider what must be done to tell SAS how to interpret lines of data in ways
somewhat analogous to how a person would interpret them.
SAS has four input styles. It is easiest to learn about them one at a time, but they can actually be
used together (mixed input) on the same file in almost any order. The input styles are: list, column,
formatted, and named input.
List input is perhaps the most straightforward. The data values are simply given in a "list," one after
the other, with a delimiter between them. The delimiter is often just a space (that's the default), but
may be a comma, tab, or other character. List input, in the simplest case, can be recognized if all you
see in the input statement is variable names and dollar signs following the character variables.
Typically, the data will not be aligned in columns.
To make our example easier to follow, the data is shown in a "cards" section, but we will discuss it as
though it were really an external file. Some differences involved in actually using an external file will
be dealt with later. Notice that the data values are separated by a single space and are not aligned in
columns. The input statement contains only variable names and one dollar sign.
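A sketch of such a program; the names and numbers are placeholders.

data people;
   input name $ age height;
   cards;
John 21 70
Jo 18 62
Mark 32 68
;
proc print;
run;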

When SAS processes the input statement, it creates, in memory, an "input vector," which is temporary
storage for all the variables in one observation. At this point, the attributes of the variables (length,
character vs. numeric) are fixed. As the data step goes through its commands, it fills up the fields of
the input vector, and when it gets to the end, it outputs one observation to the data set. If there is
more to do, it goes back and starts over, first clearing the input vector, then filling it up again. One
pass like this is known as a data step iteration.
The default length for character variables is eight characters, so in our example, that is the length of
the "name" variable. Age and height are numeric variables, also taking up eight bytes of memory, but
that has no bearing on the number of characters to be read. As reading begins, the pointers are set to
line 1, column 1. SAS is going to read up to eight characters into the variable "name." However, in the
list input style, it first checks to see if the current character is a space (or other blank character). If so,
it advances to the next, and keeps going until it comes to a non-blank character. Then it starts reading
characters into the input vector. When it comes to another blank, it stops. If the data value has more
than eight characters, SAS will read the first eight and save them, but will then continue advancing the
pointer until a space is encountered, then stop. (Longer character variables can be specified using the
"colon modifier," ":$w." as has been shown in a previous section. The colon causes SAS to read in list
style even when there is an informat present. There is also an "ampersand modifier" which allows use
of a character informat and reads imbedded spaces. Multiple spaces then signal the end of the field.)
The second variable, "age," is a number. With list input, SAS can only read numbers in "standard"
format, which essentially means there can be only numerals, decimals, and minus signs present. SAS
again advances the pointer to the first non-blank character and begins reading the number until it
encounters a space. The number of columns read is immaterial this time, as SAS will simply store the
number with as much precision as it can, regardless of how many numerals were given. When it
reaches a space, SAS saves the number to the input vector and stops. The third variable, "height," is
read the same way, except that we are now at the end of a line. Upon reaching the end of the line,
SAS saves the value, moves the column pointer back to 1, and advances the line pointer to the next
line. As it has come to the end of the input statement as well, it considers the observation
complete and writes the contents of the input vector to the output data set. It then clears the input
vector and starts over, repeating until the end of the file is reached.
Missing values for character variables require special care with list input, since spaces are interpreted
as delimiters and cannot then represent missing values. Missing values may have to be coded using a
word or symbol (e.g., "missing"). Options that change the delimiter (discussed later) can also solve
this problem. Numeric missing values are represented by a decimal point (period). If values are
missing at the end of a line (without a placeholder of some kind), the infile option missover can be
used to set the remaining variables to missing and go on to the next line. If this is not done, SAS will
try to read the variables from the next line.
Column input requires the data to be arranged in aligned columns. The key element of column input
is that you give a range of columns after each variable name, or after the dollar sign of a character
variable, in the form of start-end. Informats cannot be used in the input statement with column input.
(They can be specified in a separate informat statement, introduced in the next lesson.) There is no
need to worry about the column pointer in this case, as SAS changes it according to the ranges
specified. The ranges need not be in order. They can even overlap or be re-read.
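A sketch of column input; the column ranges and data are made up, and the data lines must actually
be aligned to those columns.

data people;
   input name $ 1-10 age 12-13 height 15-16;
   cards;
John       21 70
Jo         18 62
Mark       32 68
;
proc print;
run;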

The next example illustrates some of the advantages of column input. Notice that the character
variable will read whatever is in the associated columns and is automatically set to the length specified
by the range of columns. Even spaces and special characters are correctly read and stored. Missing
numeric values need not be coded with a period, as spaces will be correctly interpreted as missing.

One of the best features of column input is that no delimiters are required. Many mainframe programs
produce this kind of file, which has no extra spaces or delimiters. This can greatly reduce the size of a
file. As long as the column positions are known, there is no difficulty with reading this data in SAS.

Formatted input relies on informats and pointer controls to determine where and what to read into
each variable. In the example below, the @n syntax gives the starting column for each variable. It
repositions the column pointer to that column. The "name" variable has a default length of eight. The
numeric variables are read until a space is encountered. It is important to note that the syntax here is
the @n comes in front of the variable name, and the informat comes in its usual position after the
variable name.

Variables need not be in left-to-right order. Informats may be used as needed.



Even though I have shown the @n for each variable here, it is not really necessary in all cases. The
initial "@1" is not required, as the pointer starts out in column 1. With formatted input, the line
pointer will end up at the next character past the field that was just read. IF this is the start of the
next variable, you are ready to go and do not need to reposition the pointer. The syntax "+n" may be
used to advance the pointer a specified number of columns. Just as with column input, delimiters are
not necessary when using formatted input.

Because formatted input and column input cause SAS to read a specified number of columns, a
problem occurs if the end of the line varies, as in the example of the dates given below. When the
data are instream, SAS does not have a problem. But if the file is external, a line that is too short will
cause an error. The truncover option in the infile statement will cause SAS to treat short lines as if
there were spaces in the file to fill out the remainder of the field. If there is no data at all for the field,
the observation will be set to missing. The syntax is demonstrated here with instream data, but is only
necessary for external files.

It is often helpful with formatted input, if there are many variables, to line them up vertically in the
input statement, with the @'s all in front and the informats after. I have not shown an example, but
with a large number of variables, this makes it much easier to prevent and correct mistakes.
The log below is the result of trying to read this same data from an external file without the truncover
option. Observe that SAS is looking for the "bday" variable at the beginning of line 2, and since what is
actually there is a name variable, it declares the data to be invalid. The reason for this is that the
"bday" given in line 1 is only 7 characters in length. Since there are not enough characters, SAS goes
to the next line to look for the variable. After failing to read "bday," the end of the input statement is
reached, so SAS tries to move the next line to start a new input cycle, but finding the end of the file, it
quits and writes only one observation to the data set.

Including the truncover option in the infile statement fixes this problem.
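The original program is shown only as an image, so here is a small sketch of the idea with made-up names and data. The second data line is too short for the full width of the date informat; with truncover, SAS reads whatever is there instead of jumping to the next line. As the text notes, instream data would be read correctly anyway--the option matters when the same lines come from an external file.

data birthdays;
   infile cards truncover;
   input name $ 1-10 bday mmddyy10.;
   format bday mmddyy10.;
   cards;
John      01/04/1960
Mary      1/4/65
;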

Named input is used when each variable is identified in the data with a variable name and the equals
sign. This input style is not very common. While named input can be mixed with the other styles, it
must be last on the line. You cannot change back to another style.

However, the idea demonstrated here is very useful. In some data sets, it is difficult to identify where
the variables are in the line. If they are "signaled" by some identifiable text, we can use that text to
determine the position of the pointer and use formatted input. In the example below, we have given a
character string in quotes with the @ sign, where we had the column number in previous examples.
SAS scans from the current column until it matches the text in quotes after the @ sign. Note that a
trailing space was included in the quotes in this example. An alternative would be to leave out the
space and increase the field width to 3. In any case, you want to make sure the pointer ends up in the
right place, which will be right after the text in quotes, and the field width matches the location of the
value to be read. In cases like this, a lot of effort is required to determine what consistent properties
of the data source can be reliably used. Careful study of the file and testing of various scenarios are
needed, or the result may be misread data, quite possibly with no warning of the errors.

If the variable values are not in a consistent order in the data, we can deal with that too. By using
"@1" to reset the pointer back to the beginning of the line, we can make sure nothing is missed:

To finish this off, here is an example of mixed input styles. All four styles, list, column, formatted, and
named, are demonstrated in one input statement.
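The original program is an image; the sketch below is my own illustration of the same idea. Here id uses list input, name uses column input (so the embedded blank is kept), bday uses formatted input with a pointer control, and height uses named input at the end of the statement.

data mixed;
   input id $ name $ 5-14 @16 bday mmddyy10. height=;
   format bday mmddyy10.;
   cards;
A12 John Smith 02/14/1997 height=70
B07 Mary Jones 03/27/1999 height=64
;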

Exercises:
For each problem below, write a SAS program that creates a data set using instream data. Copy and
paste the given data segments into the SAS editor. This is important in order to keep the data in the
same format as presented. DO NOT adjust the data, such as by adding spaces or aligning columns.
Include a proc print step in the program, and turn in the editor, output, and log files. Use appropriate
titles and formats, if necessary, in your output.
1. The variables are Name, Age, Height, Sex, and JoinDate. Assume that no values will be longer than
those included here, and use the shortest possible length for the character variables. Use list input
(with colon modifiers where needed).
John 21 70 M 2/14/97
Jo 18 62 F 3/27/99
Mark 32 68 M 6/22/98
Linda 25 65 F 12/14/97
Carey 27 59 F 8/20/98
2. The variables and the columns they are found in are LastName (1-15), FirstName (16-30), Address
(31-55), City (56-65), State (66-67), and Zip (68-72). Assume that character variables may fill the
entire field width. Use column input.
Johnson Michael 121 1st St S Brookings SD57006
Big Hammer Beatrice 45031 271st Ave S Moorhead MN56560
Helms-Marquart Charlotte 302 N Mason-Dixon Ave Somewhere DC01221
Cutler George Rural Route 2 Zap ND58563
3. In this problem, the same variables are used as in the previous problem, but additional variables
are included, namely the Social Security Number, Ownership Status, and Move-in Date. You can see
where these new variables are added. The field widths of the variables that were in the previous
problem are unchanged. Use formatted input to read this data, but change the order of reading to
First Name, Last Name, Social Security Number, Ownership Status, Move-in Date, Address, City, State,
and Zip. Use at least one example of the "+n" syntax to move the pointer.
503118596Johnson Michael Own 121 1st St S 01051974Brookings SD57006
471559684Big Hammer Beatrice Rent45031 271st Ave S 10221987Moorhead MN56560
362995874Helms-Marquart Charlotte Own 302 N Mason-Dixon Ave 07091991Somewhere
DC01221
474843859Cutler George RentRural Route 2 12161996Zap ND58563

Lesson 6: Creating Variables


Creating Variables with Assignment Statements
You can create variables that are not in the data that you are reading. In the following program,
avgrowth is calculated from the other variables in the data, and a second variable, dummy, is assigned a constant
value. Since the value assigned to it is a character string, indicated by the quotes around it, dummy
will be a character variable of length 3, because that is the length of the first value assigned to it.
Unless a length statement is used to set the length of a character variable before it is used, the length
will be determined by the first value assigned to it. Numeric constants can also be assigned, simply by
putting a number on the right side of the equal sign (no quotes).
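The program itself appears only as an image, so the sketch below is a guess at its shape; the input variables are invented, but it shows the kinds of assignment described: a calculated variable, a character constant, and a numeric constant.

data plants;
   input plant $ growth1 growth2 growth3;
   avgrowth = (growth1 + growth2 + growth3) / 3;
   dummy = 'yes';     * character constant, so dummy is character with length 3;
   flag = 0;          * numeric constant;
   cards;
A 1.2 1.5 1.1
B 2.0 1.8 2.2
;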

Obviously, variables are created by the input statement, but they are also created if they are specified
in a length, attrib, format, or informat statement (see below). They can also be created by array
definitions (a later topic), or by assignment statements, such as those in the example above. An
assignment statement is made up of a variable name, an equal sign, and an expression representing
the value to be assigned to the variable. The variable can appear in the expression being assigned to it, such
as x=x+1, or x=log(x). A very special form of assignment statement, called a sum statement, or an
accumulator, is an exception to this syntax. In the example below, p and q are accumulators. Their
values are incremented, starting from zero, by the amount specified, for each succeeding observation.
The accumulator p+1, below, is essentially the same as p=p+1, except that it is initialized to zero,
which does not automatically happen if you use p=p+1. (See also the retain statement.)
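A minimal sketch of the sum statement, with invented data: p counts the observations and q keeps a running total of x, both starting from zero and retained across observations.

data running;
   input x;
   p + 1;     * starts at zero and is retained, so this counts observations;
   q + x;     * running total of x;
   cards;
5
3
8
;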

The arithmetic operations and mathematical functions used in assignment statements for numeric
values are quite intuitive. The syntax is similar to that used for formulas on a graphing calculator or
spreadsheet. The arithmetic operations are "+" (add), "-" (subtract or negative), "*" (multiply), and
"/" (divide). Exponents are given with a double asterisk, such as "3**2" (three to the second power).
Parentheses are used in the usual manner for controlling order of operations. Many functions are
available, and their names can often be guessed because of their similarity to standard mathematical
notation. All functions have at least one argument enclosed in parentheses. Some examples are
sqrt(x) for square root of x (where x can be a number, variable name, or other expression that
evaluates to a non-negative number), log(x) for natural log, and exp(x) for the exponential function ("e
to the x"). There are also some constants, such as pi, given by the function constant(pi). For more
detailed information about functions, see the SAS Documentation under "Base SAS/SAS Language
Reference: Dictionary/Dictionary of Language Elements/Functions and CALL Routines." (Note: Some
of the documented functions may not work in The Learning Edition.) Here are a few more examples:
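The original examples are an image; a few statements of the same flavor, with arbitrary names, might be:

data calc;
   x = 16;
   rootx = sqrt(x);
   lnx = log(x);
   growth = exp(0.05);
   area = constant('pi') * 3**2;     * area of a circle with radius 3;
run;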

Since dates are numbers, you can do simple things like subtract two dates to find the number of days
between them, without any problem. However, for more complicated tasks, SAS has quite a few
date-related functions. For example, day(x) returns the day of the month, month(x) returns the month
number for a date, and qtr(x) returns the quarter number. If you have to do any serious computations
with dates, check the SAS documentation for available tools. Remember, SAS also has date-time
values, and functions to go along with them, as well.

For character variables, there is an operation called concatenation, indicated by "||" (two vertical bars),
that puts two character strings together. There are many, many functions for character variables. We
will just look at a few: substr(source, position, length) which extracts a substring, trim(source)
which eliminates trailing blanks, length(source) which calculates the length of the value excluding
trailing blanks, and upcase(source) which changes all the letters to upper case.
In the program below, a length statement has been used for the city variable, to allow up to 15
letters. Note that this method would not work for city names that have more than one word, like "New
York City." The st (state) variable has been given an informat for two characters, but a colon modifier
is used so that the pointer will move on to the beginning of the zip code. Zip codes should always be
character variables, otherwise those that start with zero will be shortened.
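The program discussed below is shown only as an image in the original, so the following sketch reconstructs it from the description; the data lines are invented, and the exact spacing in addr3 is a guess.

data addresses;
   length city $ 15;
   input city $ st :$2. zip :$5.;
   addr1 = city || st || zip;
   addr2 = trim(city) || st || zip;
   addr3 = trim(city) || ', ' || st || '  ' || zip;
   addr4 = upcase(addr3);
   abbrev = substr(city, 1, 2) || st;
   midchar = substr(city, ceil(length(city)/2), 1);
   cards;
Brookings SD 57006
Fargo ND 58102
Moorhead MN 56560
;
proc print data=addresses;
run;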

The first assigned variable, addr1, is created by simply concatenating all three variables. Note the
(possibly undesirable) result, with the "extra" spaces between city and state, and the lack of spaces
between state and zip code. The spaces are there because the variable length is, in fact, 15, and the
unused positions are filled with spaces. Concatenation uses the whole variable, including spaces.
In addr2 we have removed the trailing spaces from city by using a trim function. Now there are no
spaces between any of the combined variables.
In addr3, we have included punctuation and spaces between the variables. Notice that the
concatenation operation works with constant expressions enclosed in quotes, as well as variables.
Spaces are preserved just as written between the quotes, including the one space after the comma and
the two spaces in front of the zip code.
The upcase function is demonstrated in addr4, which converts addr3 to all uppercase characters.
Following that, the substring function is used to create a four-letter abbreviation, by extracting the
first two letters of city and combining them with the state code. Note the order of the three
arguments, first the source variable, then the starting position, then the number of characters to
extract.
The last assignment statement shows how we can combine various functions to perform a specialized
task. The idea here was to find the middle character of the city variable, defined to be the actual
middle character for odd lengths and the letter immediately prior to the middle for even lengths. The
substring function is used to extract the character, but the starting position must be calculated. The
length function divided by two would be almost right, as it works fine for even lengths, but for odd
lengths gives a half, like 4.5 for "Brookings." Since the middle character is the next higher whole
number, we can use the ceiling function, one of several rounding functions available, this being the one
that always rounds any decimal value up to the next integer.
Length, Informat, and Attrib Statements
An alternative to specifying informats in the input statement is to use an informat statement. The
informat statement has the same syntax as the format statement. It doesn't do anything that can't be
done in the input statement, but it might be convenient to keep things organized, as in this example:
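A rough sketch of what such a step might look like; the names and informats here are my own.

data staff;
   informat hired mmddyy10. salary comma8.;
   format hired mmddyy10.;
   input name $ hired salary;
   cards;
John 02/14/1997 32,500
Mary 03/27/1999 41,000
;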

Numeric variables have a default length of 8 bytes in SAS. As we have seen, there is also a default
length of 8 for character variables if they are read with a simple $ in list input; if a $n. informat is used,
the length is given by the n in "$n.". In a later section, we will see that if character variables are created
using data step programming statements, they get their lengths from the first value assigned to them.
The length statement can be used to override the default lengths for both character and numeric
variables.
It's not often we want to change the length of a numeric variable. Sometimes space can be saved
when the values are integers. The allowed lengths are from 3 to 8 for PC SAS. A length of 3 will
accommodate accurate integer values from -8192 to 8192. A length of 4 works to slightly over 2
million. It is not recommended to use shortened numeric values when fractions (decimals) are
involved.

In the above example, you can see that the length statement has syntax similar to the format or
informat statements. However, the "dot" is not required. Here it has been left out for the numeric
length and included for the character length, just for an example. The dollar sign, however, is required
for character variables. The length statement must occur before the first use of the variable in the
program, or it will not have any effect.
Another way to use the length statement is shown below. This example sets the default numeric
length to 3. Unless you specify other lengths, all numeric variables in this data set will have length 3.
(This only works for numeric variables.) The character variables will have length 8.

Another way to do this is with the attrib statement, which is more complicated and allows you to set
the lengths, formats, informats, and labels all in one command:
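A sketch of an attrib statement with made-up variables and attributes:

data cars;
   attrib make  length=$12 label='Manufacturer'
          price length=8 format=dollar10. informat=comma8. label='Asking Price';
   input make $ price;
   cards;
Ford 12,500
Toyota 9,900
;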

Exercises:
For each of the following exercises, copy and paste the data given in the problem into the SAS editor.
Write a data step to read the data and create the new variables described, then print the results using
proc print, using appropriate titles.
1. These numbers represent dimensions of cardboard boxes, length, width, and height, in inches.
a. Calculate the volume of each box in cubic feet.
b. Calculate the amount of cardboard needed to make each box in square feet, assuming that the top and
bottom flaps meet in the middle (this results in a double layer of cardboard for both top and bottom).
c. Suppose the cost of cardboard is $.05 per square foot, and there is a fixed cost of $.25 for
manufacturing each box regardless of size. Calculate the cost of manufacturing each box.
d. Calculate the cost per cubic foot of volume.
32 18 12
16 15 24
48 12 32
15 30 45
20 30 36
2. This problem will provide a little practice in writing complicated formulas in SAS, paying attention to
order of operations. Use the data below, with variables a, b, and c, and apply the following formulas to
create two new variables called root and trunk. The first observation's results are -1 and 2.094,
respectively.
4 -20 2
12 22 -11
3 -15 -9
3. Read the following data into three variables, making sure to get complete names. Use the
character functions and operators to extract initials from the following names so that they look like
"J.F.K." Then create an abbreviation for each name that looks like "J-n F-d K-y".
John Fitzgerald Kennedy
Martha Helena Goetz
Frederich Anthony Sailer
Albert Blake Codwell

Lesson 7: Data Set Options, Set and Merge Statements

Data Set Options


"Options" are used in SAS in various places, and with various different kinds of syntax. For this
reason, the concept of options can be a bit confusing. An earlier lesson introduced the idea of a global
system options statement, where we can set things like linesize, page numbering, and so forth. Data
set options are something different. They are commands added to a basic data step statement to
refine or modify the work of the data step, or they can be used to modify how a data set is used in a
proc step. (The term, "options," is also used for any optional command in any SAS statement. This is
why we have to distinguish between system options, data set options, and other options.)
We begin with the drop and keep options. The examples below show how these two options can be
used to accomplish the same thing. The choice between them is purely a matter of convenience. The
drop option lists the variables to be omitted from the data set, while the keep option lists those which
are to remain in the data set. There could be many reasons why you might not want all of the original
variables in your data set. In the next example, we use the original variables for calculations, but do
not include them in the data set. In other cases, we may create variables that are only used in the
program and do not need to be saved.
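The examples referred to above are images; a sketch of the two equivalent approaches, using invented box data, could be:

data boxes(drop=length width height);
   input length width height;
   volume = length * width * height;
   cards;
32 18 12
16 15 24
;

data boxes2(keep=volume);
   input length width height;
   volume = length * width * height;
   cards;
32 18 12
16 15 24
;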
Take note of the syntax used here, as it is unique for data set options. All the options to be applied to
a particular data set are enclosed in parentheses, following the name of the data set to which they
apply. An equal sign after each option is followed by the list of items involved. Here they are part of
the data step, but they are NOT called DATA STEP options, but rather DATA SET options, and they can
be used any time a data set is referenced, such as in this proc print statement:

(A var statement under proc print would be more appropriate, but this works. Var statements are
covered later.)
The Set Statement
The set statement is used to create a new data set from an existing one. In the simplest example, it
merely copies a data set. The following program will create a new data set called two that is an exact
copy of one, provided one exists.

Now suppose that a data set called one has been created, as shown in the data step below, and that
you later wanted to create a new data set called two, with some new variables, but leaving the old one
unchanged. The set statement can be used to copy the original data set and add new variables.

SAS will now go through all the observations in one, and place in the new data set two the original
variables together with the new ones that are defined.
There are two data sets now. Suppose we want to print both of them. Most procs, including proc
print, will use the most recently created data set by default. Thus the second statement below
would print only two. To specify another data set, use the syntax shown in the first statement below.
Each “proc print” statement prints one data set.

Using Data Set Options with Set Statements


Next we show how data set options can be applied to either input or output data sets when using the
set statement to read another data set. In the example below, data set one is first created to serve as
the input data set. The data sets two and three will be exactly the same, but the processes by which
they are produced differ. When two is created, all four variables from one are read into memory, then
when the data are written to two the variables x and y are eliminated. When three is created, only
sumxy and prodxy are read into memory from one, and subsequently saved to three. If there are
many variables and observations involved, the second method is preferred because it is more efficient
(saves system resources). However, you cannot drop variables from an input data set if you intend to
use them in calculations!
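A sketch of this comparison, using made-up values for x and y:

data one;
   input x y;
   sumxy = x + y;
   prodxy = x * y;
   cards;
1 2
3 4
;

data two(drop=x y);
   set one;       * all four variables are read, then x and y are dropped on output;
run;

data three;
   set one(keep=sumxy prodxy);   * only sumxy and prodxy are ever read in;
run;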

The firstobs and obs options specify the starting and stopping observation number to read. They do
not apply to output data sets, but are used with set statements and in procs where data sets are
referenced. In the first example, a second data set is created using these options to make a subset of
the original data. Note the observation numbers in the output.

In the next example, the option is given in the proc print statement. Compare the observation
numbers. In both cases, the observation numbers correspond to the observations actually stored in
the data. They are not renumbered by proc print.

Finally, we discuss the rename option. At times it will be necessary to change the names of variables.
As with the other options, this can be done when reading or writing the data. When used with an
output data set, it renames the variables when the data set is saved, and does not affect the names
used within the data step, as shown below. Also, the rename option requires another set of
parentheses for the list of variables to be renamed. The name change is specified as "old name=new
name."

When the rename option is used with an input data set, the new names are in effect during the
data step, as in this example:

Using Firstobs and Obs in the Infile Statement


Firstobs and obs can also be used in the infile statement when reading from an external file or cards.
Here, the syntax is different, since they are not connected with a data set; the options are simply
typed in the infile statement after the filename or the cards keyword. There is one other difference--in this case,
the numbers refer to the starting and ending line in the data, which is not always the same as the
observation (e.g., observations might take up two lines each).

Concatenating Data Sets Using the Set Statement


The set statement can also be used to combine data sets in various ways. The first way is called
concatenation, and is simply combining them in order, one after the other. The example below uses
two data sets, but there can be more. When using concatenation, all of the observations of all of the
listed data sets are included in the result. All variables from all the sets are included, with missing
values assigned in cases where a variable does not occur in the original.

Suppose the data sets have different variable names.



You could use data set options to re-align the variables in a case like this.

Combining Observations with the Set Statement


The second method of combining data sets is called "one-to-one reading" and may be thought of as
a side-by-side version of concatenation. The programming difference is that while in concatenation the
source data sets are listed in one set statement, in one-to-one reading each source data set is given in
a separate set statement. If the same variable names occur in more than one source data set, the
values of the later sets overwrite the earlier ones. If the data sets do not have the same number of
observations, the result is cut off at the length of the shortest set. The example below shows two data
sets of different lengths and the same variables. The x variable is changed using a rename option
during input. Thus the source variable is overwritten with the second data set values, but because of
the name change, we now have x and y coming from the original x variables.

Interleaving Data with the Set Statement


The third method is called interleaving. This is like concatenation, except that the observations are
combined so that they are sorted, instead of having one data set placed after the other. The
variable(s) on which the sort is done are called by variables, and the interleave is done by adding a
by statement to the concatenation program. The by variables must be sorted before the interleave is
done, so if they are not in order to begin with, we use proc sort to do the job. Proc sort will also have
a by statement, listing the by variable(s), just like the data step that does the interleave.
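A sketch of an interleave, assuming two data sets girls and boys that each contain a name variable:

proc sort data=girls;
   by name;
run;
proc sort data=boys;
   by name;
run;

data both;
   set girls boys;
   by name;          * the by statement turns the concatenation into an interleave;
run;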

Merging Data Sets


Now, if you're thinking ahead, you must realize that we should be able to take this to another level--
how about putting the data side-by-side while matching up variables? That's called a merge. And
that's where we're going next. But before getting to the programming, there are some things that
need to be explained. We need to be very careful about how things match up. For convenience, we'll
visualize the data sets laid out side by side, so that we have a right and a left data set. The simplest
case occurs when there is exactly one item on the left to match exactly one item on the right. This is
called a one-to-one merge. If there is exactly one on the left to match more than one on the right,
we call it a one-to-many merge, and if the roles are reversed, a many-to-one merge. Finally, if
there are multiple instances of variable values on both sides, it is called a many-to-many merge.
The latter should usually be avoided, but an example will be provided to show what happens.
Our first example will be one-to-one reading again, but this time using a merge statement. The
difference is that the data set continues building until the end of the longest data set is reached. (In
the following examples the variables have been given unique names to avoid needing rename
options.)

Next, we do a one-to-one merge. The data sets will be matched by the name variable, therefore this
variable may be called the match key. The match key is identified using a by statement, and it must
be the same in both source data sets. Just as in interleaving, the data sets must be sorted. To
simplify the example, the data are entered in sorted order. Note that if something doesn't have a
match, it is included, with missing values where the matching data should be. Also, this is a good time
to mention that SAS character variables are case sensitive, so that "john" and "John" are two different
values. If your data might have mixed case, the upcase function can be used to convert everything to
upper case to ensure it matches.
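A sketch of a one-to-one merge with invented data; Linda and Mark each lack a match, so they appear with missing values.

data demog;
   input name $ age;
   cards;
Jo 18
John 21
Linda 25
;
data scores;
   input name $ score;
   cards;
Jo 88
John 91
Mark 75
;
data combined;
   merge demog scores;
   by name;          * name is the match key; both data sets are entered in sorted order;
run;
proc print data=combined;
run;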

Now consider a one-to-many merge. We will have one instance of each name on the left, and multiple
instances of some names on the right. There is no change to the program (only the data). The result
is that one observation is produced for each observation on the "many" side, but if there is no match,
again the relevant observations on either side are included with missing values.

Here is the result of a many-to-many merge. The technical details of what happens are a bit
complicated, but we can say, in a simplified way, that SAS matches lines side-to-side until one side
runs out of the matching observations, then it repeats the last observation from the short side until the
long side runs out of the matching value. (If you go back to the previous examples, you can see that
they are special cases of this more general result.) There are very few situations where this behavior
is desirable. Most merging is done with one-to-one or one-to-many relationships.

As a final comment, note that while there is a useful purpose for having two or more set statements,
there is nothing similar for merge statements. You can put two or more merge statements in a data
step, with or without a by statement, and not get an error message, but the resulting data values may
be intermingled in "unexpected" ways.

Exercises
1. Refer to the data in Exercise #1 of Lesson 6. For each of the following, include a proc print to
display the results. Use appropriate titles. This can all be done in one program.
a. Create a data set with only the three given variables. Name the variables x, y, and z. (This is the only
time you will use cards.)
b. Create another data set, using a set statement, that takes the SAS data set in Part a as input, renames
the variables length, width, and height during reading (on the way in), and calculates the volume and
cost as in Lesson 6.
c. Do as in Part b, except this time rename the variables when the data set is written (on the way out,
this does not mean in proc print).
d. Do as in Part b, except this time use obs and firstobs to bring in only the third and fourth observations
from the original data set.
e. Using the data set created in Part b as input, create a new data set that contains only the volume and
cost variables.
2. Refer to Exercise #2 of Lesson 5.
a. Begin with the same input statement as was used in Lesson 5, then create a new name variable of the
form "Lastname, Firstname" and another of the form "City, ST zip#". Use an option to eliminate the
original five variables used to create these new variables.
b. Create another data set where you use obs and firstobs in the infile statement to read the second and
third lines of data only.
3. Download this data. Do not change the file or copy it into your editor. Look at it with a text editor,
such as Notepad. There are two sections, with headings that say "1." and "2." (These lines are not to
be considered observations).
a. Write two data steps, using this one file as the source in both cases, but using firstobs and obs to
control the starting and ending line so that each section is read into its own data set. The data have a
city name and state abbreviation. Read these as two variables, "City" and "State." Use a length
statement to make the city variable long enough not to cut any of the names short and make the state
abbreviation of length 2. Create a new variable that forms a single five-character abbreviation using
the first three letters of the city and the two letters of the state abbreviation. Create another variable
that concatenates the city and state with a comma and one space immediately following the city. It
should look something like this:
Obs city state shortcity longcity
1 Bismarck ND BisND Bismarck, ND
b. Concatenate the two data sets and print the result. (Do not sort the data before this step.)
c. Next, interleave the two data sets. Remember they must be sorted first. Use "state" as the by
variable. Print the results but use a data set option within proc print to show only the "longcity"
variable and observation number.
d. Next, merge the two data sets using "state" as the matching variable and applying the rename option
as needed (within this data step). Use an option in the data statement so that only the two original
city variables and the state remain in the new data set. Print the result (with no data set options used
in proc print this time).
4. Copy the SAS code below into the editor to start with. Assume that these data sets represent an
inventory list that is being revised at each step. The prices change each time, but the "itemno" is
revised between new1 and new2 only. Write a program that does the following, and print each of the
data sets you create.
a. Merge new2 and new3 with itemno as the match key, and show old and new prices.
b. Merge new1 and new3 with name as the match key, including the old and new values of "itemno" and
price in the result.
data new1;
input itemno name $ price;
cards;
325 PrintCrd 211
276 KeyPad 37
842 PnclHldr 8
422 PaprShrd 132
523 Basket 29
;
data new2;
input itemno name $ price;
cards;
333 PrintCrd 399
277 KeyPad 25
802 PnclHldr 12
417 PaprShrd 122
515 Basket 17
;
data new3;
input itemno name $ price;
cards;
333 PrintCrd 386
277 KeyPad 25
802 PnclHldr 11
417 PaprShrd 135
515 Basket 15
;

Lesson 8: Proc Print, Proc Sort and ODS


Getting More From Proc Print
Proc print is used to display data in the output window. The output can be saved, printed, or copied
into a word processor or other program. To make the formatting come out right in a word processor, a
monospace font (all letters take up the same amount of space) must be used. A computer with SAS
installed should have the "SAS Monospace" font available; otherwise, choose something like "Courier."
In previous lessons we have made frequent use of proc print, but we have only introduced a few
options, namely label and data=. Actually the output of proc print is highly customizable. The
details can be found in the documentation under "Base SAS Procedures Guide/Procedures/The
PRINT Procedure." We will explore some of the more common options and statements here.
Recall the example from a previous lesson, where we showed that data set options can be used in a
proc print statement.

As a demonstration of data set options, that was good, but it is not the way this is usually done. Proc
print has a var statement which is used to list the variables that are to be printed. The proc step
below will produce the exact same output. The var statement can also be used to change the order of
the columns, as they will be printed in the same order they are listed in the var statement.

The noobs (no observation number) option can be added to the proc print statement to suppress
printing the column of observation numbers.
Or, perhaps there is a variable in the data set that uniquely identifies each observation and is not the
same as the observation number, such as a customer number, Social Security Number, or subject
number in an experiment. In that case, add an id statement (that's pronounced eye-dee). The id
statement causes the specified variable to be printed at the left instead of the observation number.
Note that the id variable is not included in the var statement, if one is used (including it will cause it to
be printed twice).

The label option was introduced in an earlier lesson. Consider the following program and output, with
labels made up of several words. By default, SAS fits the labels on the column headings as best it can,
using spaces to split the labels onto multiple lines. In this example it is not quite satisfactory. We
notice that although "First Name" occupies two lines, "Last Name" does not. (data)

The split= option can be used to control where the breaks in the labels occur. When using the split
option, the label option does not have to be specified because it is implied. Any character can be used
to control the split. Often a space will do.
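The programs and output here are images in the original; a sketch of the idea, with an assumed data set quiz and made-up labels, might look like this, using a space as the split character:

proc print data=quiz split=' ';
   id subject;
   label lname='Last Name'
         fname='First Name'
         q1='Question 1 (first try)'
         q2='Question 2 (first try)';
run;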

That fixes the "Last Name" problem, but now the "Question" headings take up four lines. Maybe that's
not what we want. By specifying a different split character we can control where the splits occur.

Due to the wider headings, the variables are now printed in two sections. Note that the id variable is
repeated in the second section (obs will do the same if used). Even if there are many observations,
SAS will, by default, break each page into sections like this, if all the columns don't fit on one line. You
can use a rows=page option to force each page to be all one section.
SAS has many rules for trying to decide what the best way to fit things on a page will be. Sometimes
it prints the headings vertically. The direction can be forced using a heading= option. The
arguments are v (or vertical) and h (or horizontal).

The sum statement causes SAS to calculate the sum of one or more variables. (data)

The by statement causes proc print to generate separate reports for each value of the by variable.
(Data must be sorted by the by variables.)

And you can print subtotals for the by groups like this:

Getting More From Proc Sort


Data sets can be sorted using proc sort. A by statement lists one or more variables to use as sort
keys. The keyword descending can precede any variable for which the sort order is to be reversed. A
data= option specifies the data set to sort; otherwise the most recently created data set is used.
Here is a simple example that sorts by one column. The original data set is replaced by the sorted one.

In the example below, an option to create a new data set called two is included, along with a data set
option specifying which variables to keep. This leaves the original data one as it is.

Any number of sort keys can be specified. The next example shows a sort by school, then score within
school, in descending order.
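A sketch, assuming a data set scores with school and score variables:

proc sort data=scores out=sorted;
   by school descending score;
run;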

Proc sort can also eliminate duplicate observations while sorting. The noduprecs and nodupkey
options can be added to the proc sort statement. Noduprecs eliminates observations that are exactly
alike for all the variables, while nodupkey eliminates those that have the same values in the sort key
variables, even if there are differences in other variables.
Introduction to ODS
ODS stands for "Output Delivery System." When a SAS procedure produces output, it is actually
producing data, which is then passed to the Output Delivery System, which determines what should be
done with the output data.
ODS has "destinations," which refer to the type of output to be produced. For example, the "listing"
destination is the output window. In this lesson, we will demonstrate two other destinations, "html"
and "pdf." Another important one is "rtf" which produces documents you can import into word
processors. Destinations are opened and closed as shown below. The ods keyword, followed by a
destination name, will open the destination. To close a destination, include the keyword "close." In
the example, the listing destination is closed, meaning no output will go to the output window. Then
the html destination is opened, meaning html will be created. After proc print does its job, the
destinations are returned to their previous state (always a good idea). We see a nice html document in
the results viewer.
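The code is shown as an image in the original; a sketch of the sequence described, assuming a data set one, is:

ods listing close;      * stop sending output to the output window (optional);
ods html;               * open the html destination;

proc print data=one;
run;

ods html close;         * return the destinations to their previous state;
ods listing;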
ODS can send output to multiple destinations. (We didn't need to close the listing destination--it was
just done to provide an example.) It can also create files, as shown below.

SAS usually displays the html or pdf document in the Results Viewer window, even if it is written to a
file. Here is a pdf displayed in the Results Viewer.

If you're preparing documents for publication, presentation, or the web, you may not be satisfied with
the way this output looks. ODS allows us to modify the way documents look by applying styles. For
example, the journal style is designed to be compatible with the requirements of many scientific
journals:

A set of styles is installed with your SAS system. This little program will display the styles available on
your system:
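The program is shown only as an image in the original; the usual way to list the installed styles is with proc template:

proc template;
   list styles;
run;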

Exercises:
The file usedcars3.txt contains part of an inventory of used cars from a local car dealer. The variables
are year, make, model, color, miles, inventory number, and price. Read this data into a SAS data set.
Include descriptive labels for all the variables, at least some of which should be made up of several
words (separated by spaces) for this exercise. Include formats for miles and price. Save this data set
for the next lesson (think about what that means--hint: permanent). Don't forget to include titles that
identify which question your output belongs to.
1) Print the data without labels and with the inventory number as an ID variable.
2) Print the data with labels, including inventory number as an ID variable. Use a statement (not data
set option) so that only inventory number, make, model, year, and price, in that order, are printed.
3) Look at your headings in part 2. They will be printed either vertically or horizontally. Use an option
to force them to print the other way.
4) Now take the label statement you put in the data set and insert it in proc print. Revise this label
statement to include a split character and print all of the variables (without using an ID statement)
with the split labels. Do not just replace spaces with a split character. Make sure to do something that
will demonstrate the difference that using a split character makes. Continue using these labels for the
rest of the parts below.
5) Sort the data in order of price from highest to lowest and print the result.
6) Print the data with columns for all the variables except color, but separate the observations into
groups for each color, without any sums or subtotals.
7) Print the year, make, model, color, and price, but include subtotals and a grand total of the price for
each make.
8) Sort the data in alphabetical order of make, and within make, in order of miles from lowest to
highest. Include a nice title, suppress the page number and date, and put the result in a new data set
and print it using the ODS pdf destination with journal style. Save the pdf file and submit it separately.

Lesson 9: Proc Rank, Proc Contents, and Proc Format

Ranking Observations
Proc rank is used to generate rankings for observations. Ranks may be useful in their own right, or
they may be needed for non-parametric statistical methods. The procedure is fairly straightforward.
A var statement names the variables to be ranked, and a ranks statement names the variables that
will contain the ranks. Both data= and out= options are available as in proc sort, but there is a
difference in default behavior that sometimes causes confusion. Suppose I submit the following
program:

Whereas proc sort would have given us a sorted data set one, proc rank didn't put the ranks in one.
Where did they go? A look in the log shows us that a new data set called data1 was created. Proc rank
is one of several SAS procedures that follow this convention: if you do not provide data set names for
new data sets, they will be named sequentially as data1, data2, etc. Proc rank will not over-write an
existing data set unless you supply a name.

If we had not specified a data set for proc print, it would simply have printed data1 since it is the most
recently created data set. Specifying the data set is a good idea, though, because we can easily make
mistakes by not paying attention to which data set is being processed. Here, we name the output data
set, with ranks, two, and let proc print display it by default.

Look at the scores and the ranks. Are they what you expected? Perhaps not, as we often think of the
highest score as being "number one," but here the lowest score is "number one." This is an
ascending ranking. If you want the ranks to go the other way, you need a descending option to
reverse the order of the ranking. Proc rank doesn't allow ascending and descending ranks in the same
proc rank step. You can overcome this by using two steps, taking the output of the first as input for
the second.
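The program appears as an image; a sketch consistent with the description, assuming a data set one with score and grade variables, would be:

proc rank data=one out=two descending;
   var score grade;
   ranks srank grank;     * both descending;
run;

proc rank data=two out=three;
   var score;
   ranks srank2;          * an ascending rank added in a second step;
run;

proc print data=three;
run;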

Notice that srank and grank are both produced in the first rank step, and both give lower ranks to
larger numbers (descending). The second step takes the ranked data set, two, as its input, and adds
srank2, an ascending rank for the scores.
Did you notice the values for grank? You may wonder how or why you get ranks that are not whole
numbers. This happens because some values are tied. In fact, there are two of each grade. SAS has
taken all the tied observations and averaged their ranks. You can use a ties= option to specify what to
do in case of ties. The possibilities are high, low, and mean. If you use high or low, it will take the
highest or lowest rank of the tied cases. Be careful! The result you want may be affected by whether
you are ranking in ascending or descending order.
Getting Information About the Contents of a Data Set
Proc contents displays information about the variables in a data set, as well as various characteristics
of the data set. The information you are most likely to be interested in is the third section on variable
attributes. The variables appear in alphabetic order. The "#" column indicates the order of the
variables in the file, while the "pos" column gives the actual position in bytes from the beginning of the
line. "Type" should be obvious, and "Len" is length, of course.

Proc contents doesn't have many options, but here are a couple of them. Short gives a very short
version of the output, which is actually just a list of the variables. Varnum causes the variables to be
displayed in the order of their position instead of alphabetically.

Custom Formats and Informats


Proc format creates custom formats and informats. As we have seen, informats are used in reading
data and determine how a value will be stored. Formats are used to determine how a value will be
printed. Custom formats and informats allow grouping of values, for example, ranges of numbers
could be recorded or printed as "Low," "Med" or "Hi" values. We will focus on formats, although similar
commands can be used to produce informats.
Your format will need a name. There are some rules that must be observed, in addition to the normal
rules for SAS names (you can use letters, numbers, and underscores, but can't start with a number).
First of all, the name you choose cannot be the same as that of an existing format supplied by SAS.
The length cannot exceed 32 characters, but this includes the "$" that must begin a character format,
and an "@" prefix automatically appended by SAS to user-defined informats. (You may see this in the
log.) Also, character format names cannot end in a number. Well, it's not likely you'll want to make
any names that long anyway, and the "$" requirement is familiar, so if you make it a practice not to
end with a number, you shouldn't have too much trouble. To avoid duplicating a name SAS already
has, it is a good idea to include a short character combination that is unusual--perhaps your initials,
business acronym, etc. as part of the name.
Suppose you have a data set like the following, with a product number (a character variable) and a
price. You'd like to print a report that contains the product description and the price.

A good way to do this would be to create a format that associates a description with each product
number. In proc format, the value statement is the actual command that defines a format. (A similar
invalue statement defines an informat.) In the example below, the expression following the key
word "value" is the name of the format. Note that there is no period at the end of the format name
here, but the period is used when the format is associated with a variable, as in the format statement
under proc print. The expressions after the name are called value range sets. In this case each
variable value is assigned one formatted value, but there are other possibilities. The formatted values
can be up to 32,767 characters long, but some procedures only use the first 8 or 16 characters.
The original data did not contain labels, so here we show that label statements can be added in proc
print (and some other procs) as well.
A real world application like this would probably involve thousands of items. It would not be good to
rebuild the format every time it was used, so user-defined formats can be permanently saved and
accessed when needed. The simplest way to do this is to use the special libname library. If you
include a libname statement like that shown below, together with the library=library option in proc
format when creating the format, then put the same libname statement in any program that uses the
format, SAS will store the format in the specified directory and will search for it there when you want
to use it.

The following program will find and use the format created above.

In the next example, we have a list of students in various grades. An informat is created to classify
the values into categories representing the school level. This is a character informat, since the
resulting values will be character strings. It may be tempting to think that the numbers representing
grades are numeric, but they are treated as character values too. This is important when specifying
ranges, because numbers and alphabetic expressions do not sort the same way. The first value range
set, which defines the Elementary category, represents grades 1, 2, and 3. Multiple values can be
listed on the left side, separated by commas, or ranges can be specified, using a dash. SAS accepts
these values without quote marks around them, but quotes can be included if desired, such as "1", "2",
"3". There is an important reason why this range was not given as 1-3. That is because in character
sort order, 10, 11, and 12 come between 1 and 2! Using 1-3 would indicate that students in grades 10-
12 were to be classified as Elementary, which is incorrect. Furthermore, these values are defined
again later, which produces an error and the informat is not created. Similarly, for the High Schl
category, a range of 9-12 cannot be used. SAS complains with an error message in the log, that "Start
is greater than end" in this case. If a value occurs in the data that is not defined in the format
procedure, SAS uses a default informat, as occurs with the "K" grade level. You can also use other as
a range for anything that does not fit what has been listed.

The next example shows a numeric format. The ranges include the words low and high, which can be
used for unspecified or infinite lower and upper bounds. Also, the less-than sign is used as a way of
excluding endpoints. Each of the ranges (except the last) given here will exclude the upper endpoint.
For example, a score of 79.99999 would get a value of 2, but a score of 80 would get a value of 3. If
you want to exclude a lower endpoint, put the less-than sign before the minus, such as "60<-70."
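A sketch of such a format; the cutoffs are made up, but they reproduce the behavior described (79.99999 falls in the '2' range, while 80 starts the '3' range, and only the last range includes its upper endpoint).

proc format;
   value scorecat
      low -< 70 = '1'
      70  -< 80 = '2'
      80  -< 90 = '3'
      90 - high = '4';
run;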

Exercises:
Make use of the used cars data set created in the previous lesson. There should be no data step in this
lesson.
1) Rank the prices so that the highest is number 1, and the miles so that the lowest is number 1, and
send the result to a new data set and print it. (Note: this has to be done in two steps because all
ranks in one step will go the same direction.)
2) Run proc contents on the data set created in the previous problem, displaying the variables in the
order they exist in the data set.
3) Create a format that combines the colors into three categories, "light," "dark," and "other" for those
that aren't specifically assigned. Use your own judgment in classifying the colors. Be sure to leave at
least one out so there is something for the "other" category.
4) Create a format to classify the miles into categories of "high," "medium," and "low." Use your own
judgment to define these categories.
5) Print the data using your new formats. Keep these formats for use in later exercises.

Lesson 10: Basic Statistics with Proc Univariate, More ODS


Summarizing Data with Proc Univariate
Proc Univariate and Proc Means are procedures in Base SAS that calculate statistics one variable at a
time (they do not explore relationships between variables). The two procedures have quite different
listing output but many similar capabilities. Proc univariate is the more extensive of the two. In order
to demonstrate these procedures in a meaningful way, a larger data set than those we have seen
previously will be needed. The data set we will use is shown below. A text file containing the data is
here. The data set contains three variables, a group variable with values 1, 2, and 3, a discrete
variable x with values 1-6 (a die toss) and a continuous variable y.

First, we look at a very simple proc univariate step. The "var" statement lists the variables for which
analysis is to be performed. If "var" is omitted, univariate will give analyses for all numeric variables in
the data set. Incidentally, this would include some for which the analysis is silly, such as the "group"
variable in this example. Thus, specifying variables to be analyzed is a good idea.
The results include a fairly detailed summary with all kinds of statistics for the variable, spread over
two pages.

Proc univariate has many options and optional statements. We will explore a few of the more common
ones. For more, see the documentation under Base SAS/Base SAS Procedures Guide: Statistical
Procedures.
In the middle of the first page of output, above, note the section titled "Tests for Location: Mu0=0."
These are statistical hypothesis tests where the null hypothesis is that the mean of the random variable
is equal to zero. The small p-values indicate that the null hypothesis should be rejected, and the
conclusion drawn that the mean is not zero. Perhaps you would like to test whether the mean is some
other value, say, 100, for example. You can add the following option to the proc univariate statement.
The higher p-values indicate that this null hypothesis cannot be rejected.

Two other options in the proc univariate statement are normal and plot. The normal option produces
the section on tests for normality, and the plot option gives the stem and leaf, box and whisker, and
normal probability plots below.

The notation below the stem and leaf plot, which says "Multiply Stem.Leaf by 10**+2" means that
if you read the numbers like 6.9 for the first one, you should multiply that by 10 to the second power,
so it is really 690. This data is badly skewed, so the box plot is not at all symmetrical. It usually has a
dashed line through the middle of the box for the median. The "+" represents the mean. To interpret
the normal probability plot, look at the band made up of "+" signs. The asterisks are the data, and if
they mostly fall within the band, the data may be considered normal. In this case, the data are not
normal, as the normality tests also show, since the low p-values indicate the assumption of normality
should be rejected.
Optional Statements in Proc Univariate
Like proc print, proc univariate has a by statement, which will produce separate analyses for each
value of the variable specified. In this case, the result is three sets of output for each value of "group"
(results not shown).

The graphics shown above are somewhat rough, but proc univariate can also produce high resolution
graphs, such as a histogram, which is displayed in a graph window. If a "var" statement is used, the
histogram variable must be included in the listed variables. The "/normal" portion is an option to the
histogram statement that superimposes a normal curve on the histogram. This demonstrates again
that the normal distribution is not a good fit to this data. (Other distributions can be specified for the
curve.)

The "qqplot" statement produces a high resolution version of the qqplot. Here is an example with the
exponential distribution. A qqplot should fall in a nice straight line if the distribution is a good fit.
Obviously, we are not having too much luck fitting a distribution for y!

Producing an Output Data Set


Univariate can also produce a data set containing the statistics seen in the output. If this is the only
goal, a "noprint" option in the proc statement is a good idea. This suppresses the usual listing output
in the output window. A "var" statement must be used with the output statement to determine
which variables will be used for the output data set. The desired statistics must also be specified.
There is a long list of these; again, see the help or documentation for details. In the example here,
standard deviation and mean have been requested.
Note that the syntax of the output statement requires a keyword for each requested statistic, followed
by an equals sign, followed by a list of variable names for the statistics, one for each variable in the
"var" statement. There will be only one observation, unless a "by" statement is also given, in which
case there will be one for each value of the "by" variable, as shown in this example.
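A sketch, assuming the demonstration data set is called demo and contains the variables group and y:

proc univariate data=demo noprint;
   var y;
   by group;
   output out=stats mean=ymean std=ystd;
run;

proc print data=stats;
run;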

Using ODS to Control Output


We saw that proc univariate creates several sections, each with its own heading and a table of
information (except the graphs). In ODS, each of these sections is an output object. An output
object generally has two parts, the data component, and the table definition. The data component
is obviously the data that will be displayed in the table, and the table definition is a set of instructions
that describes how to format the data. Each output object has a name and can be accessed separately
through ODS. To see information about the output objects your procedure is producing, you can issue
the following ODS commands and look at the results in the log (only part shown here):

Notice that the name of each object corresponds roughly to the label in the output. In most cases, just
the spaces are eliminated. This makes it fairly easy to identify the object name. Sometimes in the
output there will be more specifics, like the section on "Tests For Location" which also gives the null
hypothesis value in the output, but that is not part of the name or label. In any case, having the
name, we can now use the ODS select statement to choose which objects to print or send to any ODS
destination. Alternatively, an ODS exclude statement can be used to eliminate unwanted objects
with similar syntax.

ODS can also be used to save objects to SAS data sets. They are then available for use in data steps or
other procedures.

Exercises:
Use the data set of used cars inventory from previous lessons for the following problems:
1) Use proc univariate to analyze the price variable in the used cars data. In addition to the default
output, produce tests of normality, and low-resolution plots (box plot, stem & leaf, and qqplot).
2) Use proc univariate to analyze the miles variable, change the null hypothesis value for the tests of
location to 50,000, and use ODS commands to display only the tests of location in the output window.
3) Use proc univariate to analyze the price variable. Use ODS commands to print only the "Moments"
object in the output window and to save the "Moments" object to a SAS data set and print it.
Lesson 11: Proc Means and Proc Freq

Summarizing Data with Proc Means


Proc Means can provide some of the same information as proc univariate, but has different output
formatting and different options. For relatively simple reporting of summary statistics, proc means
provides a more compact output. The example below shows proc means with a by statement and the
first section of the output. The function of the by statement is the same as in proc print or proc
univariate.

The statistics for the listing are requested as options in the proc statement (see documentation for
complete list). Those shown above are printed by default. The next example shows how some others
might be requested. And, like proc univariate, proc means produces an output data set. Note that the
statistics for the listing and for the output data set need not match.
Proc means allows a shortcut in the output statement if only one statistic is requested and the same
variable names as the original variables are desired for the output statistics. This does not work in
proc univariate.

Proc means also has a class statement. This is somewhat like a by statement but the results are
grouped together.

If you have more than one class variable, there is also a way to get summarizations at more than one
level of combination of the classes. This is done with the types statement. The example below gives the
overall summary, represented by the empty parentheses, a summary by group, and a summary for
each combination of group and x. Only part of the output is shown. In the output window, the
different levels are placed into different tables.
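A sketch of the kind of step being described, assuming a data set named one with class variables group and x and an analysis variable y:

proc means data=one mean std;
  class group x;
  types () group group*x;
  var y;
run;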

When the class and types statements are used together with an output statement, all the different
combinations go into the data set. The _type_ variable indicates which level it is, such as overall,
where _type_=0, group, where _type_=2, etc. Here again, we show only part of the output.

Summarizing Categorical Data with Proc Freq


Proc Freq produces frequency tables for numeric or character variables. The "tables" statement is
used to specify which variables to use in the table(s). If no tables statement is given, a one-way table
for each variable in the data set will be produced (this is not usually a good idea). Multiple tables can
be specified in one tables statement, and multiple tables statements can be given. The data for this
example is here.

Several options are available in the tables statement and are listed after a slash if used. The example
below shows nocum and nopct options, which suppress the cumulative statistics and percents. The
nofreq option will suppress the frequencies.

Two-way tables are requested using an "*" symbol notation, as shown below. The first variable will be
listed vertically. The upper left cell gives the key for the numbers in the table. Three-way and higher
tables can be requested, if desired. Proc Freq then produces a collection of two-way tables, one for
each of the additional values of the other variable(s).

Some useful options for two-way tables are norow, nocol, nofreq, and nopct. These are used to
suppress each of the four numbers in the table cells, and are especially helpful if the tables are large.
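As a sketch, a two-way request using some of these options might look like this (cars, color, and make are placeholder names based on the used cars data):

proc freq data=cars;
  tables color*make / norow nocol nopct;
run;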

If a table is to be built from a continuous variable, proc format can be used to group the values in a
suitable way.

Proc freq has many more capabilities. It can produce output data sets and many statistical tests and
measures of association. See the documentation for further information, under "Base SAS/Base SAS
Procedures Guide: Statistical Procedures." Here is one example of using proc freq to conduct a chi-
square test of independence.

And if you want the results in an output data set:
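A rough sketch of both steps together (the data set and variable names are placeholders; note that the tables statement must request the chisq statistics for them to be available to the output statement):

proc freq data=cars;
  tables color*make / chisq;
  output out=chistats chisq;
run;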



Exercises:
Use your permanent usedcars dataset as the source for the following problems. Use your saved
formats whenever you are asked to use your formats from the previous lesson.
1) Using proc means:
a) Display the mean and standard deviation (only these two statistics) for the miles variable in the
output window.
b) Display the mean and median of the price, but using color as a class variable.
c) Use your formats for color and miles and display the mean and median prices, with both color and
miles as class variables, in their formatted form.
d) Produce an output data set that gives the mean and standard deviation of the miles for each make
of car, using a by statement. Print the result.
e) Produce an output data set that gives the mean and standard deviation of the price for all the cars
and for each make of car, using class and types statements.
2. Using proc freq:
a) Display a frequency table of the makes.
b) Display a two-way table of color by make, showing only the counts in each cell, and include tests of
independence. Align colors vertically and makes horizontally in the table.
c) Use the format for classifying miles that you created in Lesson 9 to make a table of make by mile-
groups. Align makes vertically and miles horizontally. Print the counts and row percents (for each
make, percent in a mile-group).
Lesson 12: Proc Chart, Proc Plot, and Proc Corr
Bar Graphs with Proc Chart
Proc chart is a procedure that produces text-based bar charts as well as pie and star charts.

The vbar command produces a "vertical bar chart," as seen above. The discrete option, listed after the
slash on the vbar statement, indicates that the variable is to be treated as a discrete value, with one
bar for each value of the variable. (This does not apply to character variables.) Without this option,
SAS groups numerical values in evenly-spaced classes like a histogram. This is appropriate for the y
variable, as shown below. Note that the horizontal axis now indicates that the labels are midpoints of
the respective ranges.
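A sketch of the two requests side by side, assuming the lesson's example data are in a data set named one:

proc chart data=one;
  vbar x / discrete;   /* one bar for each distinct value of x */
  vbar y;              /* y is grouped into evenly spaced intervals, labeled by midpoints */
run;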
This histogram-style grouping is inappropriate for variables that are actually discrete (unless there are a
large number of values), as the chart for x done without the discrete option makes apparent: the data
have been strangely grouped around non-integer midpoints (3 and 4 are combined under the label 3.6).

Complementing the vbar statement is the hbar statement, for "horizontal bar chart," of course. In
addition to the horizontal display, the hbar statement also displays some statistics for each class.

Another possibility is the block statement, shown here with a subgroup= option which subdivides the
columns by another variable value. First, a warning: these "3-D" charts have limits because they
require a certain amount of space on the page. If SAS cannot fit your requested chart on a page, you
may get an error message in the log and a substitute graph in your output. You may be able to adjust
your linesize and pagesize options, or change the number of bars to display, or turn dimensions around
in order to get better results.

A group= option is also available to produce side-by-side bars, and they can even be used in
combination. Here is part of a block chart of y using a group option with the variable "group."

Here is a vertical bar chart using both the group option for a side-by-side arrangement and the
subgroup option for vertical stacking, thus combining three variables in one chart.

Some other options you can try include "levels=n" which specifies the number of bars, "space=n"
which regulates the space between bars, "midpoints=list" where you specify the midpoints you want
for the bars, and "freq=variable" which is used when one of your variables already contains the
frequency to be used in making the height of the bars. For complete information about the chart
procedure and other available options see the SAS Online Documentation under "Base SAS/Base SAS
Procedures Guide/Procedures/The Chart Procedure."
And finally, here is an example of a pie chart.

Scatterplots with Proc Plot


Proc plot produces text-based two-dimensional plots. Note that there are special characters used in
the output window for the axes; if you copy and paste these, you need to use the SAS Monospace font
to have it come out right.
The data set shown below is used in these examples, and can be downloaded here.
A simple plot of y by x is produced by the following code. Note that the first named variable goes on
the vertical axis, and the second on the horizontal axis. By default, SAS prints characters to represent
the data points. Sometimes more than one point falls in the same place; letters further along in the
alphabet (B, C, and so on) indicate how many points that position represents. That did not happen in
this example, but SAS prints a legend to explain it anyway.

Another thing to notice is that the title bar of the editor window says "PROC PLOT running." Proc plot
is one of a few procs that does not exit when it encounters a run statement. Instead, it stays active
and waits for more plot statements. It will exit if followed by another step; otherwise, a quit
statement can be used at the end, as shown in the next example. (Although "run;" is not necessary
in this example, it would be good practice to include it, before the "quit;".)
If you want some other symbol, instead of the default A-B-C for the data points, you can assign it as
shown below. In this case, if there are two or more points in one place, SAS prints a message telling
how many points are hidden, but you cannot tell which ones they are. (There is an example further
down.)
You can also create a kind of three-dimensional plot by assigning symbols based on a third variable.
SAS only uses the first character of the value, but if this is a problem, a custom format can be used to
define more useful symbols.

It is also possible to display a relationship between one variable and two or more others. Usually we
put the common variable on the horizontal axis. This is done by putting two or more plot expressions
in one plot statement, and using an overlay option, which follows the "/". In the example below, a
by statement has also been added, and the output is shown for group=2 only.

Sometimes it's nice to have reference lines on a plot. Proc plot has the vref= and href= options to
provide these. Here the v and h refer not to the orientation of the line but to the axis on which the
reference value lies. Thus vref= places a reference value on the vertical axis, drawn as a horizontal line at that height.
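A sketch of a plot with both kinds of reference lines (the data set name and the reference values are placeholders):

proc plot data=one;
  plot y*x / vref=20 href=5;
run;
quit;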

Information about more options for proc plot can be found in the documentation under "Base SAS/Base
SAS Procedures Guide."
Correlations with Proc Corr
Proc corr computes measures of association between two variables. The primary one, of course, is
correlation (Pearson product-moment correlation), but Spearman's rank-order, Kendall's tau-b,
Cronbach's Alpha, and others are also available. For details of these, see the documentation, under
"Base SAS/Base SAS Procedures Guide: Statistical Procedures."
Using proc corr can be as simple as writing one line, much like proc print. By default, proc corr
calculates correlations for all pairs of numeric variables in the data set. Some of these may not be
sensible (such as for an id variable). The output includes simple statistics (n, mean, standard
deviation, sum, min, and max) and a matrix with all the variables listed vertically and horizontally, so
that one finds the desired correlation by looking at the intersection of a row and column. In fact, the
layout resembles that of an actual correlation matrix as used in statistical theory, if only the top
number of each entry is considered. The bottom number is a p-value for the hypothesis test whose null
hypothesis is that the correlation is zero (there is no correlation), and whose alternative hypothesis is
that the correlation is not zero (there is correlation).

If there are variables you do not need to correlate, or if there are too many variables to look at all at
once, some modifications can be made. A var statement can be added to determine which variables
will be included.

In addition to the var statement, a with statement can be used to make the output more compact.
The "var variables" are listed horizontally, and the "with variables" are listed vertically.

Exercises:
1. Using proc chart or proc plot and the used cars data from the previous lessons:
a) Make a vertical bar chart for the makes.
b) Make a histogram for price with seven bars (histograms for continuous variables should have no
spaces between bars).
c) Produce a plot that relates year and price, with price on the vertical axis.
2. The following numeric data are values of the variables x and y respectively.
cards;
23 35
27 36
24 35
21 32
29 36
28 39
;
a) Copy this data into the SAS editor and write a data step to read it. The prediction equation
(regression line) relating y to x for this data is given by yhat=20.60811 + 0.58784x, where yhat is
the name for the predicted value. Define this variable and include it in the data set. Also include the
natural log of x and the natural log of y in the data set.
b) Use proc means to find the means of y and yhat (you might notice something interesting).
c) Make an overlay plot that shows the y values and the predicted values with two different symbols,
plotted against the x variable. Include a vertical reference line at the mean of y, as found in part b.
d) Find the correlations between all five variables (default output of proc corr).
e) Display the two by two correlation table of x and y.
f) Use var and with statements to display only the correlation of the log of x with the log of y.

Lesson 13: Proc Reg

No doubt one of the most widely used (and therefore abused) statistical procedures is regression. We
are not going to learn how to do regression here. However, because of its popularity, we will use proc
reg to demonstrate some of the typical syntax of statistical programs. Proc reg is not part of Base
SAS; it is part of the statistical package called SAS/STAT. Therefore the documentation is found under
"SAS/STAT," then "SAS/STAT User's Guide," then "The REG Procedure."
The data used for these examples is here. The file contains the variables x, z, and y and has two
header lines. The idea is to use proc reg to derive an equation for the prediction of y based on x and
z. This is often called the "model" or "prediction equation." The model statement in proc reg is used
to define the form of the equation. The dependent variable (the one to be predicted) is given, followed
by an equal sign, and then the independent variables (those on which prediction is based) are listed
after the equal sign. Technically, the statement "model y=x z;" specifies an equation of the form

y-hat = beta-hat-0 + beta-hat-1*x + beta-hat-2*z

where y-hat is the predicted value of y, and the beta-hats are the estimated coefficients of the
equation, with beta-hat-naught being the intercept. In the SAS output the beta-hats are called
"parameter estimates." Here is the program and output for this data:

We are not going to go into the interpretation of these results here. What we want to do is study how
additional statements and options are used to customize the results of the regression procedure. While
the options, statements, and syntax vary for different statistical procedures, learning about proc reg
will give you some of the general ideas, and thus give you some background for learning other
procedures when you need them.
Note that the model statement specifies the form of the equation that will be fit to the data. There can
be more than one model statement. If that is the case, it might be helpful to give an identifying label
to each model. This label will be printed at the top of the output for each model. By default these are
numbered, as in the above example, where the model is called "MODEL1." Labels can be up to 32
characters, with no spaces, followed by a colon, and placed in front of the model statement, as follows:

This produces one section of output for each model. The first is identical to that shown above, except
for the heading that says "Model: Full" instead of "Model: MODEL1." The second section looks like this:

Proc reg, like proc plot, does not automatically quit running when it encounters a run statement.
Unless another proc follows, it will wait for more statements to be submitted. For example, if you
added the following lines to the program above, left them selected as shown, and clicked submit, SAS
would produce the output for the next model, without re-running the rest of the program. (Any
selected text in the Enhanced Editor is submitted without the rest of the program, a source of great
irritation when done accidentally!) If you want to make proc reg quit, issue a "quit;" statement at the
end of the program. One of the minor benefits of this is that it leaves the Output window on top,
rather than bringing the Editor back up. Experiment with this a bit and you'll see.

Sometimes it is convenient to have the results of the regression, such as the parameter estimates and
other statistics, in a data set. Proc reg uses an option in the proc statement, "outest=", to do this.
Other options can be added to control what statistics are included. (You might notice that the editor
has some trouble with the color coding on these, but even if they aren't blue, they still work.)

Notice the naming of some of the variables, how they begin and end with an underscore. It is
important to include these underscores when referencing the variables, since they are part of the
name.
Next, we will give some example options for the model statement, which are placed after a slash.
Some of these options control what goes in the output, and some affect the modeling process. The
"noint" option is used to fit a model with no intercept (recall the intercept is automatically included in
the examples above). The "VIF" option adds a "Variance Inflation" column to the parameter table, and
the "P" option gives a table of "Output Statistics" that includes predicted values of y (y-hats) and the
"Residual," which is the difference between y and y-hat..

Another statement in proc reg is the output statement. This creates a data set, but unlike the
"outest=" option in the proc statement, which gives observations for each model, this data set will
contain output statistics for each observation in the data, such as those printed in the example above. There
is no slash in the output statement; the options simply follow the word "output." You should specify a
data set name with the "out=" option, and then list the statistics you want included, such as predicted
values, studentized residuals, etc. Each statistic has a keyword that requests it, followed by an equal
sign and the variable name to use in the output data set. Thus "p=yhat" means to include the predicted value
using the variable name "yhat." Some other examples are "r=" for residuals, "ucl=" and "lcl=" for the
upper and lower confidence limits of the prediction. See the Online Documentation for the complete
list.
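A sketch of such an output statement (regdata, regout, yhat, and resid are placeholder names):

proc reg data=regdata;
  model y = x z;
  output out=regout p=yhat r=resid;
run;
quit;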

Of course you can take this data set and make plots with proc plot. But proc reg also has its own plot
statement built in. You can plot any of the variables in the original data set, plus the same new
variables that are available in the output statement. These are named like the keyword that specifies
them in the output statement, followed by a period. Thus, the predicted values are given by "p." and
the studentized residuals are given by "student.", for example.

Exercises:
Use the used cars data from previous lessons. In proc reg, do the following (This should all be done in
one program, with one proc reg step):
a) Compute a regression model for price based on miles and age of the car, and a second model for
price based on miles alone. Use labels for the models.
b) Create a data set which contains the parameter estimates and rsquare values for each model.
c) Create a data set containing the predicted values and residuals (as well as the original data, which
is included automatically).
d) Plot price vs miles and price vs year using a plot statement in proc reg.
e) Plot the residuals against the predicted values for each model using plot statements in proc reg
(note: these plots use the residuals and predicted values produced by the immediately preceding
model statement).
f) Print the data sets created by proc reg.
Lesson 14: Proc Transpose and Proc Report
Proc transpose is used when you need to "turn your data on its side." Basically, it turns observations
into variables and variables into observations. Suppose we had the following sales data in Excel:

Now we can clearly see that this data consists of four columns and five rows. The header row across
the top and the first column are not really data. However, when we read this into SAS, we will need to
consider the first column to be a variable also, or we will lose the designation for the week that goes
with each row. The following program will do:

And the result looks like this:

Now suppose what we really wanted was for each salesperson to be an observation, and each week to
be a variable. The id statement tells proc transpose which column holds the new variable names. If no
id is given, SAS will use default variable names like _COL1_ etc. Also notice that SAS puts the old
variable names in a new column called _NAME_. This variable name can be changed using a "name="
option in the proc transpose statement.
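In outline, the step might look like this, assuming the original table is in a data set named sales whose week column holds values such as Week1 through Week4:

proc transpose data=sales out=sales2 name=salesperson;
  id week;
run;

The name= option here renames the _NAME_ column to salesperson.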

Now suppose we have another variable, say "City," and within each city we have weekly sales data for
our four salesmen. The number of weeks is reduced here for brevity and also to show how certain
missing values are handled. We want to transpose the variables again, but this time within each city.

At first glance, proc report looks like it does the same thing as proc print. In the simplest form, the
output can be very similar to proc print. But you might say proc report is like proc print on steroids--it
has far more capabilities. The example below shows how the default settings of proc print and proc
report compare. Proc report doesn't print observation numbers, doesn't skip a line below the headings,
and aligns the columns differently. But that's just the beginning. The data for this example is here.

Notice the "nowindows" option in the above example. We will always use this option in this lesson.
Proc report comes with its own window-based user interface, which will come up automatically if you
do not specify "nowindows." We will not be studying the report window in this lesson.
The column statement is used to specify which variables are printed and the order in which they
appear. A define statement is used for each variable to specify the options associated with that
variable. In the example below, name is defined as a group variable. Notice that the names are no
longer repeated on each line (there will be more significance to the group variable later). For each
variable, an expression in quotes has been included--this is the same as a label, and can be controlled
with split characters similar to what we have seen in proc print. Notice that the last column has the
label words split in an awkward way. This is due to a column width of 5, which comes from the format
given in the define statement for the last column. Observe the three different formats that were used,
and the effect of each.

Now for a few more options to dress up the output. First, in the proc report statement, we add the
options "headskip" and "headline". They add a line under the headings and a "skip" or blank line
beneath them. The formats have all been changed to accommodate the money amounts properly, but
two new options are demonstrated in the define statements. For the cheese variable, a "width=10"
option is included. This expands the column for that variable to 10, so that the column is now wider
than the format specifies. For the cream variable, a "spacing=5" option is added. This puts five blank
spaces in front of the cream column. Notice how this affects things differently, compared with
widening the whole column.
Next, we have two new statements, break and rbreak. Break works together with the "group"
designation given for name. It breaks the report for each change in the value of the group variable
listed in the break statement. Notice that it says "break after". This gives the location of the break
information, which can be either before or after the group it refers to. After the slash come some
options. The "ol" means overline--it prints a line above the break line. There is also "ul" which means
underline. Then we have "dul" which means double underline. This is done using equals signs. There
is also "dol" for double overline. "Skip" means to insert a blank line after the break line, and
summarize means to print summary statistics for the numeric columns. The default is sum, but
many other statistics are available. The specific statistic is given as an option in the define statement,
"sum" for sum, "min" for minimum, "max" for maximum, "mean" for mean, etc.
The rbreak command means "report break" so it provides summarization for the whole report. Similar
options apply, except no variable is specified since it applies to the whole report and not the grouping
of any variable. This provides the grand totals in our example.

The next example shows how you can duplicate a column for the purpose of getting different summary
statistics. In the column statement, new columns are introduced using the syntax
"variable=newvariable". The new variables are then given their own define statements, where the
summary statistic "mean" is now included. The other thing that is changed is that the columns have
been grouped, inside parentheses, and each group has a label specified as the first item in the
parentheses. This provides a group heading for all the variables in the parentheses. Also, the labels
have a minus sign at the beginning and end. This causes a line to be drawn out to the width of the
column group for each label. (The headings are incorrect)

You can also calculate columns from the other columns. The new column must be listed in the column
statement, and will have a define statement as well. To actually calculate the values, a compute
block is used. It begins with the statement "compute newvariable" and ends with "endcomp;". The
formulas for calculation use data step syntax, for the most part. However, there is a two-level name
for some variables, as shown here. The variables are given as "variable.sum" because they are used
as sum variables for summarization purposes. If we did not have the group variable, this would not be
necessary.
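A sketch that pulls these pieces together (sales, name, month, cream, and cheese follow the names used in this lesson; the total column, the formats, and the break options are illustrative choices, and month is treated here as a second group variable so the detail rows group cleanly):

proc report data=sales nowindows headline headskip;
  column name month cream cheese total;
  define name   / group 'Sales person';
  define month  / group 'Month';
  define cream  / sum format=dollar10.2 'Cream';
  define cheese / sum format=dollar10.2 'Cheese';
  define total  / computed format=dollar10.2 'Total';
  compute total;
    total = cream.sum + cheese.sum;
  endcomp;
  break after name / ol summarize skip;
  rbreak after / dol summarize;
run;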

Perhaps the idea of the two-level variable names, like cream.sum, will be more clear if we leave out
the detail levels and show these variables only as summary variables. Taking the month variable,
which relates to the detail rows, out of the column statement and deleting its define statement will do
that. We also no longer need the break statement, and add an overline to the rbreak statement since
all the dividing lines were in the break statement before.

Proc report has many more features and capabilities. We have only introduced some of the basic ideas
here. If you are ever in a situation that calls for the generation of periodic reports on the same kind of
data, it is worthwhile to spend the time to create a nice report that can be used over and over.
Exercises
1. Begin with this data step:
data one;
input trt $ s1 s2 s3 s4 s5;
cards;
Cont 4 5 5 4 6
Fast 5 5 6 6 5
Drug 7 7 6 5 6
;
Suppose this is data from a designed experiment with three treatments and five subjects. To analyze
the data in SAS, we will need it rearranged so that there are 15 observations, each having the
treatment, subject identifier, and just one subject response. Use proc transpose to rearrange the data
in this way.
2. Begin with this data step:
data one;
input trt $ subject $ t1 t2 t3;
cards;
Cont s1 4 5 5
Cont s2 5 4 6
Cont s3 6 6 5
Fast s1 5 5 6
Fast s2 5 6 7
Fast s3 7 8 8
Drug s1 7 7 8
Drug s2 6 8 9
Drug s3 5 7 9
;
Use proc transpose to turn the subject and time around within each treatment, calling the new time
variable "time" and the using the subject values for the new subject variable names.
3. Refer to the used cars data from previous lessons (usedcars3.txt). Create a report that groups the
cars by year and gives the average mileage and price for each year. Include nice headers and at least
one example of each of the format, width, and spacing commands. Give each column a heading in the
define statements.
4. Again using the used cars data, create a report that summarizes the prices for each make, giving
the total and mean but omitting detail information.

Lesson 15: Tabs and Other Delimiters, Controlling Observations, Variables, and Output Data
Sets
There are essentially three ways that data can be identified in a raw data file: 1) the position of
each value in the file is known, either by columns or some other scheme; 2) the values are
identified by a name or other symbol; or 3) the file is "delimited" by a symbol that tells where one
variable stops and the next begins. The latter case is the subject of this lesson. We also discuss some
issues about missing values in list input.
We have already dealt with delimited files where the delimiter is a space. This is what we were doing
whenever we worked with the list input style. Now consider the program below. Notice that there are
two missing values, indicated by a dot (period). Although all the data values are numbers, the second
variable has been defined as character in the input statement. In the output, the missing character
value is blank, but the missing numeric value is represented by a dot (i.e., the decimal point of a
number that isn't there). If column input is used, blanks in the data will be correctly interpreted as
missing values, but with list input that would cause a problem.

The data in the above program was entered using tabs between the values. The SAS editor does a
funny thing with tabs--although you can move the cursor back and forth over them, and the cursor will
"jump" just as you expect, when SAS executes the program, it interprets the tabs as a series of
spaces. If you try reading the data above with column input, you will find that the data are located in
columns 1, 5, and 9. However, when the data are read from an external file, we encounter a much
different situation.

Fortunately, there is an option for the infile statement, called expandtabs, to fix this. It still requires
that the "dots" be in place for the missing values, though.

Of course, our delimiters need not be tabs or spaces--they can be almost any character. Commas are
probably the most common alternative, though sometimes we see hyphens or slashes used. In such
cases, we can use the "dlm=" option, or spell it out, "delimiter=". Here we have replaced the tabs
with commas in the external file.

There are special delimited files that have more complex rules. You may have seen "csv" files, for
example, that can be produced and read by Excel. That's an abbreviation for "comma separated
values," which implies that they use a comma as a delimiter. But there is more to it that that. Missing
values are not represented by dots or spaces, just two delimiters in a row. There is also a feature that
text with embedded commas will be saved inside quote marks, so that the imbedded commas are not
interpreted as delimiters. To read this kind of file, use the "dsd" option in the infile statement. The
"dlm=" option can be used with it, in the rare event that the delimiter is something other than a
comma. Here is a csv file, created in Excel, with the same data and empty cells for missing values.
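Reading it might look something like this (the path and variable names are placeholders):

data one;
  infile 'c:\mysasfiles\scores.csv' dsd;   /* dsd implies dlm=',' unless told otherwise */
  input x y $ z;
run;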

The data import wizard will also import csv and tab-delimited files, as well as a number of other types,
so you may want to check that out as well.
In a previous lesson, the obs and firstobs options were used to control which observations were
included in a data set. In this lesson, some ways to fine-tune the selection of observations and
variables are introduced. A text file containing the data below can be downloaded here.

You can decide whether or not an observation is included based on the value of a variable.

Or you can go the other way.

You can use comparison operators, like <, >, <=, >=, and their corresponding character abbreviations,
lt, gt, le, and ge. If you use a character variable for the test, enclose the value in quotes and
remember that case matters.

You can use logical operators like "and" and "or," and control order of operations using parentheses.

When reading from another SAS data set, such as when using a set statement, the same syntax can be
used, or, you can use "where" in place of "if."

The use of if and where can be summarized as follows:


IF: Use in any data step, not in proc steps
WHERE: Use only when input is a SAS data set, in data or proc steps.
A data step can create more than one data set at a time. Perhaps you want to create a temporary data
set for immediate use in the work directory, but also want to save it to a permanent library, as shown
in the example below. Note that the data set names were placed on separate lines and indented in
order to make it more clear to the reader what is being done.

Suppose, on the other hand, that you want to create different data sets. The following program will
create three data sets, each containing different variables, but the same number of observations as the
complete data.

You can also send different observations to different data sets. To do this, construct a logical test
involving "if-then" clauses together with the "output" keyword to specify the target data set.
Notice that the three data sets being created are still named in the data statement. Data set options,
such as keep and drop, can be used here just as in the last example. The input statement comes next,
then the output instructions. As each observation is read from the cards (or infile, or source data set),
the three if statements are processed, and whenever the logical condition is met, the observation is
output to that data set. Some observations are output to more than one data set, because more than
one condition is met.
If the intention is to split the data into mutually exclusive and exhaustive sets, then "if-then-else"
clauses can be used to good effect. An "else statement" is only executed if the previous "if statement"
is not (the logical test fails).
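As a sketch, splitting by an age variable might look like this (kids is a placeholder source data set, and the cutoffs are arbitrary):

data young middle old;
  set kids;
  if age < 9 then output young;
  else if age < 11 then output middle;
  else output old;
run;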

Exercises:
Use the example data given at the beginning of the lesson to do the following. Print all data sets.
1. In one data step, create three data sets containing the following combinations of variables:
a) Name and age,
b) Height and weight, and
c) Name and height.
2. In one data step, create three data sets containing the following groups of observations with all
variables:
a) All observations for names that come after "E" in alphabetical order,
b) All observations with ages of 9 or 10, and
c) All observations where the height is less than or equal to 48 and the weight is greater than or
equal to 90.
3. In one data step, create three data sets containing mutually exclusive observations (use if-then-else):
a) All observations where the weights are less than 75,
b) All observations where the weights are at least 75 but less than 100, and
c) All observations where the weights are at least 100.

Lesson 16: Multiple Lines, Multiple Observations


There are a number of situations where the lines in a raw data file do not correspond to the
observations in a SAS data set. In this lesson, we explore ways to read data files that have multiple
observations on one line and multiple lines for one observation. Since the latter is usually simpler, we
will start with that.
Consider the following data step (data here):

By default, when SAS runs out of data on one line but still has variables to read, it goes on to the next
line. In past lessons, we found that sometimes this causes problems, but here, it does what we want.
In fact, it doesn't matter whether some observations take up one line and some take up two, as long
as each new observation starts at the beginning of a line. If an informat causes problems with a short
line, the truncover option or the colon modifier can be used, as demonstrated in previous lessons. Do
not use missover, though!
As long as you read all of the variables (in order), the above method works fine. There are some other
tricks available if you do not want to read everything on a line, or want to read things in a different
order. A slash in the input statement tells SAS explicitly to go to the next line. In this case, if there
were more variables on the line after first and last name, the slash would still cause SAS to go to the
next line after reading Lname. Of course, now there is no flexibility in terms of having some lines
broken and some whole--all observations must be on two lines.

Another kind of notation explicitly numbers the lines in an observation. Again, there must be a
consistent pattern in the data, so that all observations occupy the same number of lines. Both of these
examples give the same output as shown above.

The "number" notation allows skipping around within the lines that make up an observation, as shown
here: (This example needs to be fixed; it is reading the same variables twice. This works best with
column or formatted input.)

Now let's look at the opposite case--where there is more than one observation in one line of the source
file. Consider the following data step:

In this case, only two observations were included in the data, because SAS automatically goes to a
new line when it reaches the end of the input statement. What we need is a command to stop it from
doing that.
But first, a word about iterations. When a SAS data step reads from an external file or another data
set, there is an automatic loop in the data step that repeats over and over until the input data runs
out. One pass through this loop is called an iteration. Many things can happen during an iteration,
including the creation of new variables by assignment statements, and the elimination of observations
by "if" or "where" statements. In the simplest cases, one iteration reads one line from the raw data
and outputs one observation, but many modifications are possible. In general, the commands for an
iteration begin with the input statement (or set or merge) and go to the end of the data step, excluding
cards. An iteration ends with an implicit output or the last of the data step commands if there is an
explicit output statement. There can be other loops inside an iteration, as we shall see momentarily.
But the iteration itself is an implied loop that repeats as long as there is input data available.
To hold the input line, that is, keep the data pointer from moving to the next line, we place an "at"
symbol ("@") or "double at" ("@@") at the end of the input statement. The "@" holds the line during
the current iteration only, while the "@@" holds the line across multiple iterations. In the example
above, the iteration consists only of reading name, ht, wt, and age, then outputting an observation.
Therefore, we need to hold the line for the next iteration, which requires the "double at."

In Lesson 14 we used the following example in an exercise on proc transpose. We first read the data
into five variables, then, transposed it so that there was a treatment and one response for each
observation (together with a "name" column).

Now, we will see how to read this directly in the form we need. In the example below, the lines from
"input" to "end;" determine one iteration. We first read the "trt" variable, then hold the line for this
iteration. Then we start a "do loop." The "do loop" allows us to repeat some commands according to
a pattern. It begins with the keyword "do" and ends with an "end" statement. The variable s is the
index of the loop. It will hold the values from one to five, in turn, as the loop is executed. Each time
through the loop, the input statement directs SAS to read a value of y, hold the line (for the current
iteration), after which the output statement writes an observation to the data set. When the loop is
finished, the iteration is complete, and SAS moves to the next line in the data and begins another
iteration.

Next, we put together the do loop and an if-then-else construct. Note how the size values are
defined. You must be careful with the length of the variables in this case, since it will be set by the
first value given in the program. (Another way to handle it is to use a length statement before the
input statement.)

Exercises:
Write data steps to read the following data (copy and paste it into cards just as it is). Print the results.
1. A series of y values:
13 25 22 17 19 11 16 18 21
14 17 20 18 15
2. Name, age, and grade.
Kelly 9 3 John 10 3 Mark 10 4 Joan 11 4
April 8 2 Larry 9 2 Daniel 11 5
3. Treatment and four responses for each. Read in univariate style, with a treatment and one
response per observation.
ctrl 77 69 72 79
trad 87 96 89 82
new 89 94 96 81
4. Treatment, gender (male), three responses, gender (female), three responses. Read in univariate
style, with a treatment, gender, and one response per observation.
ctrl M 77 69 72 F 79 81 72
trad M 87 96 89 F 82 99 85
new M 89 94 96 F 81 83 87

Lesson 17: Generating Random Data


SAS provides several functions for generating pseudo-random data. The most popular are the
functions that provide normal and uniform distributions. Uniformly distributed values may be
generated by the uniform(seed) function (alias ranuni), which gives random numbers in the interval
[0,1). Standard normal values are generated with the normal(seed) function (alias rannor). In both
cases, using a seed of 0 gives a random start based on the system clock. For publication, specify a
seed of your choice so that others may duplicate your values.

Note that there is no source data (either raw or other SAS data set) for this data step. There is only
one iteration of the data step, therefore we control the entire process using a do loop. The next
example shows how to generate a simulated die toss. The die tosses are, of course, integers from 1 to
6. So we need to convert from a uniform interval on [0,1) to a uniform discrete distribution with
values from 1 to 6. Multiplying the uniform values by 6 gives the interval [0,6). The "int" function
takes the integer part and discards the decimal, so now we have integers from 0 to 5. Adding one
gives the desired result.

If you want to generate random normal values to simulate a population, you need to know the
standard deviation and mean. You multiply the standard normal values by the standard deviation, then
add the mean. Say we wanted heights of male college students, and believed the mean was 70 inches
and the standard deviation was 5 inches. Then, the following program would give a good simulation.
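Something along these lines would do (the data set name and the number of observations are arbitrary):

data heights;
  do i = 1 to 50;
    x = normal(0)*5 + 70;   /* multiply by the standard deviation, then add the mean */
    output;
  end;
  drop i;
run;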

Perhaps it would have been more satisfying to write the equation above as x=70+normal(0)*5. The
result is the same, of course. But we like to think of the mean as the value around which the
population varies, so it makes sense to start with the mean, then add the term that creates the
variation.
A similar strategy can be used to obtain function values for a series of numbers. For example, suppose
you wanted to make a table of the probabilities for a binomial distribution with n=10 and p=0.2. The
following program gives the cumulative probabilities. Notice that the loop counter (index) is actually a
variable we want to keep, and is used in the calculations.

Or, suppose you'd like to graph a parabola in SAS. In this example, the loop counter is used in the
calculations too, but this time, we don't increment it by 1, but in steps of 5 each time the loop
executes.

Suppose we'd like to simulate a discrete distribution with unequal probabilities for each value, such as
the following:
x P(x)
1 0.1
2 0.2
3 0.4
4 0.2
5 0.1
This can be done by "cutting up" the uniform interval and assigning different values to different sized
parts of it.
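A sketch of the idea, simulating 1000 values (the data set name and the number of observations are arbitrary):

data sim;
  do i = 1 to 1000;
    u = uniform(0);
    if u < 0.1 then x = 1;        /* P(1) = 0.1 */
    else if u < 0.3 then x = 2;   /* P(2) = 0.2 */
    else if u < 0.7 then x = 3;   /* P(3) = 0.4 */
    else if u < 0.9 then x = 4;   /* P(4) = 0.2 */
    else x = 5;                   /* P(5) = 0.1 */
    output;
  end;
  drop i u;
run;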

Exercises:
1. Generate 1000 tosses of two dice, calculate the sums, and make a bar chart for the sums.
2. Simulate 10,000 observations from the following distribution and print a frequency table of the
results.
x P(x)
0 0.2
1 0.3
2 0.5
3. Suppose the population of male college students has a mean height of 69 inches and a standard
deviation of 4.5 inches, while the population of female college students has a mean height of 64 inches
and a standard deviation of 3.5 inches. Simulate heights for 50 male and 50 female college students.
Each observation should include a gender variable and a height variable. Use proc means to see how
close the mean and standard deviation of your simulated values come to the specified values.

Lesson 18: Using Arrays to Program With Variables


An array in SAS is a data step programming tool that allows us to reference a series of variables
mathematically, most often as part of a loop. The array itself is not part of the resulting data set. It is
only a temporary structure used during the data step to manipulate variables. An array essentially
assigns an index number to each variable in the array. Then statements in the program can be used to
calculate the index number of the variable to use or set its value.
In the following example, the array definition also creates the variables that are referenced by the
array. The syntax consists of the "array" keyword, followed by the name of the array (which doubles
as the base of the variable names) and then the number of variables to create, in parentheses. The
variable names will have their index number appended. When used in an array reference, the index
(or mathematical expression to generate the index) is enclosed in square brackets.

Notice that we have created just one observation. The do loop is cycling through the variables, rather
than through observations. Study the correspondence between the variable names shown in the data
set and the values of the index of the array, which was also used to calculate the variable values. An
implicit output was used in this example. Since only one observation was created, there was no need
to specify an "output."
In the next example, we add another loop in order to create 10 observations. This time an explicit
output is needed, just before the end of the outer loop, when the values of an observation have been
assigned, and the observation is ready to be saved. We also demonstrate that mathematical
expressions can be used to control the index values, and that loop counters can be creatively used in
assignment statements as well.

The names of variables in an array are not restricted to this form, though. Nor does the array have to
create new variables. In the next example, we use a format statement as the first line in the data
step. This creates the three variables, length, width, and height, while at the same time saving a
format for them. Next, we define an array called "dims." Instead of giving a number of variables to
create, e.g. "(3)," we actually list the names of the variables to be used in the array. The effect of this
is that the array reference "dims[1]" is associated with the variable "length," "dims[2]" is associated
with "width," and "dims[3]" is associated with "height." In the example, we show that both the actual
variable names and the array references can be used in the program. This program shows how you
might go about adding a simulated "random measurement error" to the measurements of a rectangular
solid.

Suppose a teacher gives 5 quizzes and drops the lowest score. To "drop" the lowest score means to
replace it with zero. Here is a SAS program that will do that, while also keeping the information about
the dropped score. This example differs from the previous ones because the data step is reading from
cards. Therefore, the "iteration" of the data step comes into play. The statements in the data step
repeat for each observation read from the data. Here, the array statement is given first and creates
the variables q1-q5. However, it can also be placed after the input statement, allowing the input
statement to create the variables. Either way, there is no conflict. Two new variables are created,
"low" to hold the low score, and "lowi" to hold the index of the low scoring variable. These are initially
assigned the values of q1, then the do loop compares the values to each of the other scores, to see if
any are lower. When the loop is finished, the proper variable is assigned a zero value.

Exercises:
1. Create a data set that contains the first four powers of the numbers 1 through 10 (e.g., 2, 4, 8, and
16 are the first four powers of 2). Use an array to assign the values to each variable. Use only one
assignment statement, and take advantage of a loop index to assign the right power to each variable.
2. For the following data, use arrays to find the highest score in each row then create adjusted scores
so that each is a percent of the maximum. Assume that the columns are students and the rows are
quizzes.
10 9 7 8 5
5 8 4 6 8
4 2 6 4 9
3 5 8 4 5
3. Use the same data as #2, but consider each row to be the scores of one student. Use an array to
move the smallest score to the last position. You will compare each pair in turn, and if the first is
smaller, switch them. (Note there are four pairs to check.) You will need a temporary variable to hold
one value while you do the switch (i.e., x2-->tmp, x1-->x2, tmp-->x1).
Lesson 19-20: Project Yahtzee Simulation
In the game of Yahtzee, five dice are tossed, and various combinations of numbers, similar to poker
hands, are assigned point values. In the game, dice can be selected and re-tossed, but we will focus
on calculating the probabilities for the first toss only. We will also deal only with the "lower half" of the
score card in the game. For the interested student, continuing this project to account for the complete
rules of play would be an entertaining challenge.
Anyone not familiar with Yahtzee should try a web search for the rules of the game. Some sites have
applets that let you play online.
All you really need to know for this lesson, though, is which combinations are counted. We will call
these "hands," as the combinations in poker are called. The hands in Yahtzee are:
Three of a Kind (three of one number and two others that are different)
Full House (three of one number and two of another number)
Four of a Kind (four of one number and one other)
Yahtzee (five of the same number)
Small Straight (four consecutive numbers)
Large Straight (five consecutive numbers)
Chance (anything that does not fit the above patterns)
We begin by creating an array to hold the die tosses. There are five dice, so there will be five elements
in the array. A do loop can be used to "toss" the dice.

That gives us one toss. To simulate the game, we will want to toss the dice many times and estimate
the probability of getting each scoring combination, or hand. We will need another loop, surrounding
this one, to give these repeated observations. Note that an explicit output is now needed. Since this
program is going to get rather complicated, we will pay close attention to issues of style and
readability. Putting in comments to identify the beginning and end of major loops is helpful. Care
should be taken with indenting, to make sure all lines associated with a particular loop are indented at
the same level. The statement that begins a loop and the corresponding end should be at the same
level of indenting, and statements within the loop should be indented two spaces from the level of the
loop. Statements within nested loops are indented again.

Suppose we systematically build up the identification of the hands. There are many ways to do this.
Some are easier to program, some are more efficient from a processing standpoint. At the beginning,
it may not be clear what the best method is, so you should try some of your own ideas before reading
on. The solution presented here is kind of a compromise. It may not be the easiest to program, nor
the most efficient method.
To start, let's see if we can identify a Yahtzee. Now, Yahtzees are quite rare, so we can't rely on
getting one by doing 10 random tosses. The best thing is to put in a temporary piece of code that will
artificially give us a Yahtzee, so we are sure to have something to identify.

That was easy, huh?


OK, now you should take some time and think about what is required to identify a "Four of a Kind."
Consider all the possible ways that one would show up in the data. How can you check for all the
possibilities in an efficient way? Is there something that can be done to make the search easier?
Don't read on until you've thought about it!

Well, I hope you thought about it. Maybe you came up with the idea that it would be easier to identify
the hands if the dice were sorted. In fact, that is a very big help. But, sorting between variables is not
such a straightforward thing. We can do something called a "Bubble Sort." It is one of the simplest
sort algorithms to program. For more information, look it up on the internet (Wikipedia has a good
explanation). The sort routine can be inserted after the data are generated, and before the
identification part of the program. Here we have included a set of test values that are exactly
backwards. The sort routine handles these correctly, along with all the random observations. Examine
the sort routine thoroughly so you understand how it works, and how it makes good use of the array
structure. (A drop option has also been added to the data set to streamline the output.)
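In outline, the toss followed by the sort looks like this for a single observation (variable and data set names follow the discussion above; i, j, k, and temp are work variables):

data toss (drop=i j k temp);
  array x(5);
  do i = 1 to 5;                    /* toss five dice */
    x[i] = int(uniform(0)*6) + 1;
  end;
  do j = 1 to 4;                    /* bubble sort: larger values move to the right */
    do k = 1 to 5 - j;
      if x[k] > x[k+1] then do;
        temp   = x[k];
        x[k]   = x[k+1];
        x[k+1] = temp;
      end;
    end;
  end;
run;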

Sorting the dice means that all dice that are equal will be next to each other. Thus, to check for a
Yahtzee, all we need is to find out if x1=x5. If x1 and x5 are the same, it is not possible (in sorted
order) for the numbers in between to be different.
Some examples of Four of a Kind (after sorting) are:
12222
22223
As you can see, either x1=x4 or x2=x5. If it is not a Yahtzee, then these two conditions will identify
Four of a Kind.
When it comes to Three of a Kind, we run into a little complication. If we follow the strategy used for
Four of a Kind, we would check if x1=x3, x2=x4, or x3=x5. Consider the following examples:
11123
12225
24555
These would all be correctly identified. But what about:
11122
33555
As you can see, these would all fulfill the first and third conditions proposed above, but they should be
classified as Full House. Therefore we also need to check for a Full House in these cases. The
following identification routine checks for these types of Hands. At each stage, we have to be very
careful that all possibilities are accounted for.

Now we are down to the straights. Large straights are simpler, as the only possibilities are 1-2-3-4-5
and 2-3-4-5-6. Small straights have a number of different forms, such as 1-2-3-4-6 and 1-3-4-5-6,
where none of the numbers are the same, and a number of possibilities involving numbers that are
doubled, such as 1-1-2-3-4, 1-2-2-3-4, and 2-3-4-5-5, to give just a few examples.

Now we can change the number of observations to 10,000 and use proc freq to count the hands (DO
NOT PRINT!). Here is a table of the theoretical proportions. These are given with 4 decimal places,
which is convenient for a simulation of 10,000, because if you ignore the decimal point, it is the
expected number out of 10,000.
Yahtzee .0008
Four of a Kind .0193
Three of a Kind .1543
Full House .0386
Large Straight .0309
Small Straight .1235
Chance .6326
Exercises:
1. Create another array called "d" (for differences) with four variables. After the sort routine, load the
"d" variables with the differences between the dice. That is d1=x2-x1, d2=x3-x2, etc. Rewrite the
identification routine to use the differences rather than the original die values. Plan your strategy by
writing out what the differences look like for each hand, and try to come up with an efficient method of
identifying the hands. When finished, run 10,000 simulations and compare your results (use proc freq)
with the theoretical values given above.
Lesson 21: Data Null
It has been the author's experience that in many job interviews where SAS programming is an
important part of the job description, there are questions about "data null." It appears that employers
consider this a sort of "litmus test" of the level of a candidate's ability. Therefore, it is important that
we give a little attention to this topic.
The idea of "null" here is that we have a data step that actually doesn't create a data set. For a data
set name, we use the special name "_null_" where the underscores are part of the name. This causes
SAS to carry out the commands in the data step, but as far as the output is concerned, it is, well, "null"
or nothing.
Why have a data step that doesn't save anything? Actually, it doesn't save a data set, but it can save
something else, in particular, a text file. Thus, a data step can be used for report writing or the
creation of "raw data" files.
The process is simply the reverse of reading a raw data file. Instead of an "infile" statement, there will
be a "file" statement. Instead of "input," there will be "put." Instead of informats, formats.
First, let's see what a put statement does in an ordinary data step. It sends lines of text to the log.
You can have character expressions in quotes and variable names in the statement. The variables will
have their current values printed. The following data step has three iterations, so three lines are
printed to the log.

This is very useful for debugging data steps with loops and conditional statements, since you can
examine the values the variables take as the data step executes.
Next, add a file statement. This gives the location and filename that you want to save the results of
the put statements in. Note that the log shows where the file is located, as well as the number of
records written. It also shows that the data step is still writing a data set.
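A sketch of the same stand-in program with a file statement added (the path is a placeholder); the put lines now go to the text file instead of the log.

   data three;
      file 'c:\temp\demo.txt';
      input x;
      y = 2*x;
      put 'The value of y is ' y;
      datalines;
   1
   2
   3
   ;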

The same program, with only the data set name changed to "_null_", gives the log below. There is no
"NOTE" about the data set, because none was created.

Now let's revisit the used car data from previous lessons. Suppose we begin with reading the data into
a SAS data set. Then, we use data null to write some of the data to another file. Notice that the data
step will iterate once for each observation in the source data set. The variables, as they are listed in
the put statement, are sent to the file with just a space between them. This is a list put style, similar
to the list input style.

We can use the formatted put style to control the appearance of the output.
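A hedged sketch of a formatted put for the used car data; the path and the particular formats are guesses.

   data _null_;
      set usedcars;
      file 'c:\temp\carlist.txt';
      /* compare the list style:  put make model year price miles;  */
      put make $10. +2 model $12. +2 year 4. +2 price dollar9. +2 miles comma8.;
   run;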

But we can do way more than that! Here is an example that uses the internal variable "_n_" (the
underscores are part of the name), which keeps track of the observations, to control when to print a
heading. So, if we are on the first observation, the first if statement puts the two header lines into the
file. The second if condition is not true, so it does not execute, then the final put statement sends the
first observation to the file. For the remaining observations, the first if condition is false, so the header
lines are never printed again.
The second if condition has two parts. The observation number must be greater than 1, and the value
of make must be different from the previous observation. The lag function allows us to compare values
between observations, with lag1 being the previous observation, lag2 being the second previous
observation, etc. So, after the first observation, if there is a difference in makes, a blank line will be
inserted before the last put statement sends the detail information.
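Because the original program appears only as an image, here is a sketch of the logic just described. The path, formats, and header text are placeholders, and lag(make) is pulled into its own assignment so that it executes on every observation.

   data _null_;
      set usedcars;
      file 'c:\temp\carreport.txt';
      prevmake = lag(make);                 /* make from the previous observation */
      if _n_ = 1 then do;                   /* header lines, first observation only */
         put 'Used Car Inventory';
         put 'Make       Model        Year   Price';
      end;
      if _n_ > 1 and make ne prevmake then put ' ';   /* blank line at each new make */
      put make $10. +2 model $12. +2 year 4. +2 price dollar9.;
   run;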

Another Example:

Lesson 22-23: SQL


SQL stands for "Structured Query Language." SQL is an industry standard, that is, it is not something
invented by SAS, but is used by all major software companies that deal with databases. Although each
company has its own "enhancements," the basic language is the same for all of them. Thus, what you
learn here will be applicable in Microsoft Access, Oracle, Sybase, and many others.
As we have seen in an earlier lesson, in database terms, a data set is called a "table," an observation is
called a "record," and a variable is called a "field." Most of the work in databases is done by means of
"queries." Query, of course, is a word related to "enquire" and "question." A query, more or less, is a
question asked of a database, for which the answer is usually some portion of one or more tables, or a
summary of the data in the tables.
SQL is used in SAS by invoking Proc SQL, then submitting queries. Proc SQL is interactive, like proc
plot and proc reg, so it will continue to accept queries after the initial statements are submitted. In
fact, proc SQL does not require run statements to work. Each query is executed immediately upon
submission. Proc SQL continues running until another proc step or data step is encountered, or a quit
statement is submitted.
We will once again refer to the used cars data set for some examples. Here is a simple query to start
with.
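A minimal query of that shape, using the usedcars data set from the earlier lessons:

   proc sql;
      select make, model, price
      from usedcars;
   quit;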

A basic query begins with the keyword select. The idea is that we are going to "select" something from
the database that we want to display. The results of the select statement are displayed in the output
window. A select statement will have, at least, a list of variables to display, and a source table, given
in the from clause. Unlike most lists in native SAS language, items in SQL lists are separated by
commas. This is one of the most common sources of mistakes for beginning programmers. SQL is
intended to be fairly "plain English" in nature, so we may be tempted to put commas between clauses,
as in English, but SQL only uses commas for list separators. The next example shows how to form a
list of variables, as well as how to use a where clause to restrict the records that are selected.
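A sketch of such a query (the price cutoff is just an example):

   proc sql;
      select make, model, year, price
      from usedcars
      where price < 10000;
   quit;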

There is a shortcut if you want to get all the fields from a table:

In the next example, we show that new fields can be created by using functions (or any mathematical
expressions). The count function counts the number of times a value occurs. The "group by" clause is
added to the end so that the counts, as well as the sums, apply to each make. The effect is similar to
the by statement we have seen in other procs. The keyword "as" is used to assign a name to the new
field. This syntax can also be used to assign aliases for existing field names. Also included is a format
for the price. Note that the result includes one row for each value of make, the group by field. Be
careful not to select any other field which has differing values within one make, because that will force
the output to give multiple rows for each make. As long as the results are unique within the group by
values, you will get one row for each. Also, if you leave out the group by clause, you will get one row
for each row in the table, and all the "number" and "totprice" values will be the same, the grand total
for the whole table.
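A hedged sketch of a query of this kind; the new field names number and totprice follow the description above, and the format is illustrative.

   proc sql;
      select make,
             count(model) as number,
             sum(price)   as totprice format=dollar12.
      from usedcars
      group by make;
   quit;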

You can also add an "order by" clause to sort the results differently. The example below will sort the
above output by totprice. (To reverse the order, put "desc" AFTER the sort field.) As queries get more
complex, writing style becomes important to keep track of the various parts of the query. It is good to
start each clause on a new line, and if the lists are long, put each item on a separate line, with
appropriate indenting.
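For example, sorting the previous query by total price:

   proc sql;
      select make,
             count(model) as number,
             sum(price)   as totprice format=dollar12.
      from usedcars
      group by make
      order by totprice;      /* order by totprice desc  reverses the order */
   quit;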

The above examples show that SQL provides useful and easy ways to extract both detail and summary
information from data sets (tables). But some of the most powerful uses of SQL have to do with
extracting information from multiple tables. There is some similarity between what we will do here and
what can be done with a merge statement in a data step, but SQL does things differently and usually
more efficiently. One major difference is that SQL never requires data to be sorted in advance.
In database systems, much attention is given to a process called "normalization." This essentially
means that data are split up across multiple tables in order to avoid redundancy. For example, in a
sales database, you might have one table that lists customers with their contact information, another
table that lists the salesmen's information, and a third table that gives individual sales, including one
field to link to the customer table and another field to link to the salesmen table. Then a query might
be written to extract a particular sale, including the customer information and the salesman's
information for that sale, getting some of the information from each of the three tables. To
demonstrate how this works, we will split up our usedcars data into two tables. In this data, all of the
model names are unique to one make, so model will be the linking field. We can use SQL to create
these new tables, with, can you guess? "create table" queries. Here's one. (The create table query
doesn't send anything to the output window.)
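A sketch of the first query, using a placeholder table name; it keeps one row per model along with its make.

   proc sql;
      create table makelist as
      select distinct make, model
      from usedcars;
   quit;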

And here's the other.
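A matching sketch for the second table, again with a placeholder name, holding the remaining detail fields keyed by model.

   proc sql;
      create table cardetails as
      select model, year, price, miles
      from usedcars;
   quit;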

So now we suppose that these are the two tables we have to get our data from, and we want to print a
report with the make, model, year, price, and miles. In the list of fields selected, we now use two-level
names to specify which table the variables come from (actually this is only necessary for those in more
than one table). The asterisk can still be used to request all the variables from one table. In the from
clause, we list all the tables being used. In the where clause, we "join" the tables by specifying which
fields to match up in the two tables.
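With the placeholder names above, the join might look like this:

   proc sql;
      select makelist.make, cardetails.*
      from makelist, cardetails
      where makelist.model = cardetails.model;
   quit;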

The where clause can contain the equality that defines how the tables are joined as well as other
conditions or restrictions, combined by using logical expressions. More than two tables can be joined,
and they need not be joined by the same field. Again, logical expressions can be used to combine
these conditions. In case there are values that have no match, no rows are included for them. This
differs from the merge statement in a data step, where missing values are generated in such a case.
However, SQL can produce the same effect, with something called an "outer join," if needed.
Exercises:
Use SQL and the School Sales data as the basis for the following exercises.
Do not create any new tables except for #7.
Also, there should be no other procedures used, nor any data steps beyond those that are in the
program schoolsalesdata.sas, which is linked above.
1. Display the complete contents of the "sales" table, sorting by name.
2. Display the name and grade of students in the 11th and 12th grades only, sorted by grade from
highest to lowest.
3. Display each student's name and her goal.
4. For each student, display name, grade, and total sales.
5. Run the following five queries. In each case, explain what SQL is counting, and why the results are
different (or similar).
a) select count(grade) as numgrade from students;
b) select count(name) as numname from students;
c) select count(distinct grade) as numgrade from students;
d) select count(grade) as numgrade from students group by grade;
e) select grade, count(grade) as numgrade from students group by grade;
6. For each grade, display grade, goal, number of students, and total goal (students*goal).
7. Create a new table that contains the name and sales (only those two fields) of students in the 12th
grade. Use a select statement to display the results.
Lesson 24-25: SAS Macro
You may have used macros in Excel, or at least, you have probably heard about them. In Excel,
macros are basically used to carry out a series of complex tasks by using one command (perhaps
linked to a button in the spreadsheet). Thus a macro command carries out other commands.
In SAS, the macro language is a language "above" the regular SAS language. It essentially generates
SAS code for you. When you submit a SAS program, the first thing that happens is that SAS scans it
for macro code. The macro code is interpreted, or compiled, into SAS language statements, then the
statements are executed.
We must distinguish between macro code in general, and "a macro" in particular. Some macro
commands work anywhere in a program, but most work only inside a macro. A macro is like a sub-
routine or procedure that is compiled and stored in a library, and then "called" or "invoked" by a SAS
program. SAS statements that are not part of a macro are called "open code." All of the programs
we have written so far have consisted entirely of open code.
We begin with some ways that macro language is used in open code. A fairly convenient application is
to define macro variables that can be used to change values in a program. A macro variable can be
defined at the top of the program, where it is easy to find and change. It can then determine what happens
further down. Here we look at a data step similar to that used in the Yahtzee program. The first two
lines are assignment statements for macro variables. The percent sign is always the first character in
macro language keywords. The %let commands assign values to the macro variables dice and reps.
The values of macro variables are always text (character). This is because they will be inserted into
the program statements that the macro language is writing for us, which becomes part of the program
before SAS executes the program statements.
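A sketch of such a data step; the %let values and the array and loop bounds follow the description below, while the data set name and the die-rolling expression are guesses.

   %let dice=5;
   %let reps=10;

   data toss;
      array x (&dice);
      do i = 1 to &reps;
         do j = 1 to &dice;
            x(j) = ceil(6*ranuni(0));
         end;
         output;
      end;
   run;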

When the macro variable is used, that is, when we want to insert its value into the program, it is
preceded by an ampersand (&). Whenever the macro interpreter sees the ampersand, it tries to
interpret the value of the variable that follows. SAS calls this resolving the macro variable. So, in the
array definition, you can see that the value of "&dice" will be "5", so that after interpretation, the line
says "array x (5);". The next line, which sets up the loop for the number of repetitions, will be
interpreted as "do i=1 to 10;". Since the number of dice tossed has to match the array size, the next
line uses the dice variable again, and it will be interpreted as "do j=1 to 5;". In other words, what SAS
sees after the macro compiler gets done is like this:

So, the idea is, if you have numbers or names in your code that you might want to change later,
especially if they occur many times and might be anywhere in a long program, it is a good idea to put
them in macro variables that are defined at the top of the program. Then you can change the values in
the macro variables and automatically change all the related values in the program.
The macro variables are also available in title statements, which can be very useful to keep track of the
settings you use. For this to work, the title must be enclosed in double quotes, not single quotes.
Single quotes will prevent resolving the macro variables, and you will get a title that says exactly what
is typed between the single quotes.
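For example, using the macro variables above (note the double quotes):

   title "Rolling &dice dice, &reps repetitions";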

Now let's create a macro. The macro begins with the keyword "%macro" followed by a name for the
macro, and ends with the keyword "%mend" which means "macro end." You can put the name of the
macro after %mend if you like, to keep track of which macro is ending. Now, what we are actually
doing here is creating a macro, not executing the statements inside. When the code below is
submitted, there will not be any data set created or any output from proc print. The macro will be
compiled and saved in a subfolder of the work directory called "sasmacr." There is NO message in the
log that this has happened. (You will only get a message if there are errors.)
This macro makes use of a %do loop. Note that there is also a %to and a %end. This is macro
code that creates a loop for the statements inside. That means, as the macro runs, it is writing the
code inside three times. The variable in the %do loop, dsnum, is a macro variable. Its value is text,
just like the other macro variables. The data statement uses this value to name the data set
differently each time the loop executes: one1, one2, and one3.
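A hedged sketch of such a macro; the macro name and the body of the data step are placeholders, while the %do loop and the data set names one1, one2, and one3 follow the description above.

   %macro makedata;
      %do dsnum = 1 %to 3;
         data one&dsnum;
            x = &dsnum;
         run;
         proc print data=one&dsnum;
         run;
      %end;
   %mend makedata;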

Now we need to run the macro. To do that, issue a command in SAS using the percent sign and macro
name.
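Using the placeholder name above, the call is simply:

   %makedata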

In the explorer window we can see that three data sets have been created.

Here is the output from the third loop.

Now consider the data step below. Suppose this is an example of a much larger data set that we have
on hand (perhaps 52 weeks worth). For some reason, we need to split it up into separate data sets for
each week.

We could write a data step like this, but you can see that if there were many more weeks, this could be
tedious.
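A sketch of that data step, assuming a source data set called sales with a numeric variable week (both names are placeholders):

   data week1 (where=(week=1))
        week2 (where=(week=2))
        week3 (where=(week=3))
        week4 (where=(week=4));
      set sales;
   run;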

So let's get a macro to do the typing for us. Did you notice that the enhanced editor's color coding
changes? Inside a macro it focuses on the macro language elements. The regular SAS language
keywords no longer turn blue. Another thing to notice is the usage of semicolons. Macro statements
end in semicolons like any other SAS statements. However, all macro statements are separate from
the code that they create. In the data step above, there is no semicolon until all four weekly data sets
have been named. In the macro, there is no semicolon after the data set option. If we put one there,
we would get a semicolon after each one. Furthermore, there is a "%end;" which has a semicolon, but
that semicolon is not going to be part of the text the macro writes, because it is part of a macro
statement. Instead we put a semicolon by itself after the loop. This semicolon will be part of the text
in the final program. The macro shown below writes a program just like that shown above.
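Here is one way that macro might look, under the same assumptions as the data step above; note where the semicolons fall.

   %macro split;
      data
      %do i = 1 %to 4;
         week&i (where=(week=&i))
      %end;
      ;            /* this bare semicolon ends the generated data statement */
      set sales;
      run;
   %mend split;
   %split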
Remember that a macro is stored in a library, and can also be stored permanently. The creation of a
macro is separate from calling (using) a macro. The last line in the program above is the macro call.
Once the macro is saved, one needs only to call it, not re-run the code that creates it. With that in
mind, consider that we might want to pass information to the macro when we call it. Suppose in the
program above we want to choose which weeks to split out when we call the macro, rather than
automatically doing four of them. Let's assume for this example that there are 52 weeks in the original
data set and we want to pick an arbitrary range. Then we can make our macro accept parameters
(which are macro variables named in parentheses after the macro name, separated by commas), which
we "pass" when we call the macro (putting the variable values in parentheses after the macro name,
separated by commas). Here we show the macro with two parameters which supply the starting and
ending week numbers for the %do loop. In the example, weeks 1 to 2 are requested in the macro call.
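A sketch of the parameterized version and its call:

   %macro split(start, stop);
      data
      %do i = &start %to &stop;
         week&i (where=(week=&i))
      %end;
      ;
      set sales;
      run;
   %mend split;
   %split(1,2)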

The idea here is that this macro would probably be saved permanently, then whenever we needed
some of these data sets, we just call the macro with the appropriate week numbers. We don't have to
rewrite the program each time.
Now let's turn things around. The following SQL code combines the data from four weekly datasets for
one of the salesmen and calculates a total.
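The original code appears only as an image; the sketch below reconstructs its general shape, assuming each weekly table saleswk1-saleswk4 contains the salesman's name plus a sales column week1-week4, and using a made-up salesman "Smith".

   proc sql;
      select saleswk1.name,
             saleswk1.week1, saleswk2.week2, saleswk3.week3, saleswk4.week4,
             sum(saleswk1.week1, saleswk2.week2,
                 saleswk3.week3, saleswk4.week4) as total
      from saleswk1, saleswk2, saleswk3, saleswk4
      where saleswk1.name = saleswk2.name
        and saleswk2.name = saleswk3.name
        and saleswk3.name = saleswk4.name
        and saleswk1.name = "Smith";
   quit;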

Our goal will be to turn this into a macro in order to accomplish two things: 1. to eliminate the
repetitive typing, and 2. to make it possible to pass parameters that specify the range of weeks and
the salesman for the report. The way to approach this, as a beginning macro programmer, is to take
an example of working code, like that shown above, put it inside a macro, then change things a little at
a time, frequently checking if it still works. It is also good to separate the code in strategic places so
that the repeated parts are on their own lines, being careful to note whether the punctuation is also
repeated.

Now focus on the first line of repeated phrases. Note that in this case, all end with a comma and are
the same except for the numbers. We put a do loop around this line, and delete all but one example.

Next, we need to replace the numbers with the macro variable i (the loop counter). But before we can
do that, we need to understand a little more about how macro variables are resolved. For the following
discussion, suppose that the value of macro variable num is "1" and the value of macro variable
month1 is "January" (remember, macro variable values are text).
In previous examples, our macro variables have either stood alone or they have come at the end of a
word. Sometimes macro variables are embedded inside words. For example, suppose you wanted to
replace the number in "month1sales" with the macro variable, num. If you put "month&numsales" the
macro compiler would look for a macro variable called "numsales", not just "num". To get
around this problem, the macro language uses a period (dot) to signal the end of a macro variable.
Writing "month&num.sales" would be correct. The period is part of the macro variable reference, so
that "&num." would all be replaced by the value of the num macro variable, resulting in the correct
resolution, "month1sales".
Macro variables can be "nested" so that one variable resolves to complete the name of another. A
double & is used to indicate this nesting. For example, "&&month&num" will first resolve to "&month1"
and then to "January". You can think of it this way: The compiler will make two passes through this
phrase (if it finds "&&&" it will make three, and so on). In one pass, it will hold off on interpreting
things that are preceded by more than one "&", but it will remove one "&" in preparation for the next
pass. Wherever there is only one "&" left, it resolves the variable, and the value becomes part of the
text it can resolve in the next pass.
If nesting is used, one dot is resolved with each macro variable. If there is supposed to be a dot in the
result, make sure to include that in addition to the others. Thus "month&num..sales" resolves to
"month1.sales", "&&month&num..sales" resolves to "Januarysales", but "&&month&num...sales"
resolves to "January.sales".
Now, to return to our example, we have a period as part of the SAS code we want to generate. If we
write "saleswk&i.week&i," the macro compiler will resolve this to "saleswk1week1," with no period!
Therefore, we have to put in an extra period.

Now let's move on to the next line with a repeated pattern, the one that comes from inside the
parentheses of the sum statement. There is one difference between this line and the previous one--
this one does not have a comma at the end. That means we have to include a comma in all but the
last loop. To do that, we use a %if - %then statement. The text between %then and the semicolon
will be added to the program if the condition is true. The semicolon is the end of the macro statement
and is not part of the generated text.
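As a sketch, the loop for the terms inside the sum() parentheses could look like the fragment below; it generates saleswk1.week1 through saleswk4.week4 with no comma after the last term.

   %do i = 1 %to 4;
      saleswk&i..week&i %if &i < 4 %then ,;
   %end;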

This could be considered a short form of the statement. In fact, you might find it a bit unsatisfying--
don't you want it to "%do" something? It can be written that way, and must be, if there is a semicolon
in the text you want included. Below is an alternate way of doing the same thing.
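A sketch of the same test in the %do form:

   %if &i < 4 %then %do; , %end;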

We can finish up the remaining repeated sections in a similar way. The completed macro looks like
this:

Now we have finished the first goal, which was to eliminate all repeated typing. Our second goal was
to make it possible to pass parameters that specify the range of weeks to report, and the salesman to
report on. The parameters are the three macro variables in parentheses after the macro name. Since
our original macro had a range of 1 to 4, it is a simple matter to go through the program and replace
every instance of a 1 with "&start" and every instance of 4 with "&stop", being careful to add an extra
period where necessary (these numbers don't occur in the program for any other reason than to refer
to the data sets and variables we want to use). The salesman's name only occurs one time at the end,
inside quotes. It is important that these quotes be double and not single in order for the macro
variable to be resolved. An example of calling the macro is included, with the corresponding output
shown below the program.

The process of detecting and correcting errors is more difficult when using macro language. One
reason is that there are now two levels of errors, those in the macro code, and those in the program
generated by the macro. For instance, if you delete one of the %if statements in the above program,
which makes the SQL code incorrect, and create the macro (without calling it), this is what you get in
the log:

No errors are reported, because there were no errors in the macro code. There are also no helpful
notes telling us the macro was successfully created. There is nothing but a copy of the lines that were
submitted. Now suppose we call the macro:

SAS reports the error, but it is not related back to the line numbers in the macro program! This is
because the macro is generating the code in the background, and the lines it generates are not copied
to the log. It does not identify the exact location where the error occurred. It would be very hard to
figure out from this limited information where exactly our error is. For this reason, it is important to
try to write macros in small steps, and test them often.
There are two tools available for debugging. The first is the symbolgen system option (put it in an
options statement, use nosymbolgen to turn it off). This will cause macro variables and their values to
be printed to the log when the macro is called. This can be helpful to find out if your macro variables
are resolving correctly, but once again, it may not be easy to connect these messages to the code that
you wrote. Here is part of the log that came from calling the macro with symbolgen turned on.

The second tool is the %put statement. This is much like the regular put statement, in that it writes
messages to the log. By careful planning, we can make %put statements tell us what is going on in
the program. Here, a %put statement is inserted after each %do loop just to tell us it has finished.
This, in combination with the symbolgen option, gives a pretty good indication when the error
occurred--after the third %do loop, and after the macro variable i had a value of 1. In other words, it
occurs because the comma is missing after saleswk1--not saleswk3, as the error message suggests!
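Such a %put statement might look like this (the wording is arbitrary):

   %put NOTE: finished the select-list loop, i is now &i;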

Using symbolgen may give you more information than you need or want, so you can also use macro
variables in your own put statements to report the values you want to see.
Each company that implements the SQL standard can add its own enhancements. SAS has added a
useful feature that allows SQL to assign query results to macro variables. The syntax uses the
keyword "into" together with a macro variable name with a colon in front of it. Here we show how the
means of two variables are stored in two macro variables, which are then resolved in proc plot to
create reference lines in the graph.
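A hedged sketch of the technique, with made-up data set and variable names; the two means land in macro variables meany and meanx, which then feed the vref= and href= options in proc plot.

   proc sql noprint;
      select mean(y), mean(x)
         into :meany, :meanx
      from plotdata;
   quit;

   proc plot data=plotdata;
      plot y*x / vref=&meany href=&meanx;
   run;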

SAS has a number of built-in or automatic macro variables. One example is "sysdate". The following
title statement will include the current date in the title statement.
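For example:

   title "Used car report produced &sysdate";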

Exercises:
1. Suppose you get a monthly sales report with the salesman's name and total sales for the month.
The reports are loaded into SAS data sets with names like Month1, Month2, etc. You want to generate
quarterly summary data sets that include each salesman's name, sales for the three months, and total
sales for the quarter. Since this is not a one-time project, but an ongoing task that will presumably
continue for years to come, you want to automate the process as much as possible. The initial SAS
program creates six months' worth of data and includes SQL code that will create the first quarter's
summary. As you have seen, it is best to create macro code in small steps, and run the program
frequently to make sure it works as you add more components. I will try to guide you through the
steps here, but you only need to turn in the log and output for the final product, NOT every step listed
below.
a) Run the program as given to create the data sets. You can then comment out the data steps or
delete them, as they do not need to run again (until you start a new session, unless you save them to
a permanent data set).
b) At the top of the program, define a macro variable called "q" for the quarter number and set it
equal to 1. The idea is that this is the only thing the "user" should have to change when running the
program for a new quarter. We will need to generate the corresponding month numbers from q.
Define macro variables m1, m2, and m3, setting them equal to a month value that is generated from
the quarter number.
c) Define a macro around the existing SQL code, with a %macro, a %mend, and then a statement to
call the macro.
d) Now start substituting macro variables into the SQL code in each place where a quarter number or a
month number appears. Make sure your program runs correctly before adding any %do loops.
e) Now "compact" the code using %do loops each time a sequence of similar terms appears in the SQL
code.
f) When you have everything running correctly, add a proc print statement to print out the table that
is created by the SQL command, and submit the log and output.

2. In this problem we will create a macro that will take parameters for a data set name and two
variable names, then produce a plot (using proc plot) based on the parameters that are passed, and
will also automatically generate a vertical reference line at the mean of the vertical variable.
a) For purposes of this exercise, use stavwood.txt as the source data (this file has tabs in it). The
variables in this data set are group, y, x1, x2, x3. Begin by getting the next two steps to work
outside of a macro.
b) Use proc SQL to get the mean for your vertical reference line into a macro variable.
c) Now write a plot step and use the vref option with the macro variable to create a reference line.
d) When you have all this working, put it inside a macro. Note, the data step that reads stavwood.txt
is not part of the macro. The macro should have parameters for data set name, vertical axis variable,
and horizontal axis variable, in that order. Now you must change the program inside the macro so that
all references, and I do mean ALL, to these three things are replaced by macro variables. (Include the
log from compiling your completed macro in your homework submission.)
e) Call the macro, putting in your dataset name, y for the vertical axis variable, and one of the x
variables from stavwood.txt for parameters. Make sure the plot is correct and there are no errors in
the log. (Submit this output with your results. Include the log from calling your macro in your
homework submission one time. )
f) Call the macro again, reversing your vertical and horizontal variables. What happens? (Submit this
output with your homework.)
g) Call the macro again, with another combination of variables. (Submit this output with your
homework.)

Lesson 26-27: SAS Graph

SAS/Graph is the high-resolution graphics package for producing plots and charts in SAS. Many basic
commands are similar to those we have learned for proc plot and proc chart. In fact, you can often
insert a "g" in front of "plot" and "chart" and end up with a workable result, using the same syntax.
However, there are many more options and capabilities available. We will just explore a few of them
here.
SAS/Graph is a separate module or package in SAS, like SAS/STAT. In the Online Documentation, you
will find an entry called "SAS/GRAPH Reference" in the first set of branches. After opening that, you
can click on "SAS/GRAPH Procedures," followed by the name of the procedure you want to use.
We begin with proc gplot. You can get a passable plot by typing commands similar to those in proc
plot: Note that the plot is displayed in a new window, a "graph" window. This window behaves
differently than the output window in one important way--it does not scroll down automatically when a
new plot is created. This is easy to forget! Don't be fooled when it looks like your results haven't
changed!
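A minimal sketch, assuming a data set with variables y, x, and group (the names are placeholders):

   proc gplot data=mydata;
      plot y*x=group;
   run;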

Note that we have two series because of the "=group" in the plot statement. By default, proc gplot
assigns different colors to each value of group and prints a legend.
There are many things that can be done to customize this graph. The most common commands are
summarized in SASGraphCommands.doc.
We can customize the symbols and colors used for the two groups. To do this we write symbol
statements, which are global in effect, so we usually place them above the gplot step (although they
work inside as well). The symbols are numbered and the symbol statements written accordingly, much
like title statements. Like title statements, the symbol definitions remain in effect until you change
them. Unlike title statements, a change to a higher number symbol does not affect the lower
numbered ones. Also, individual commands inside symbol statements do not get cleared or reset by
leaving them out of a subsequent symbol statement.
The value= command determines the shape of the symbol, and the color= command, well, you
know. These can be abbreviated v= and c=. Many common shapes and capital letters can be used for
the value. Most color names that you would think of will work too.
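For example (shapes and colors chosen arbitrarily):

   symbol1 v=dot    c=blue;
   symbol2 v=square c=red;
   /* resubmit the gplot step to see the new symbols */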

If you want to connect the dots, use one of the interpolation methods, abbreviated as "interpol" or
just "i". The methods are join, spline, and regression. "Join" connects the points with straight lines.
"Spline" uses a polynomial function for a smooth fit. However, if the points vary too much, there can
be wild peaks and valleys. Both these methods connect the points in the order they occur in the data
set. With the spline method, you can add the "s" option, which stands for sort. It looks like "splines"
but it means it will sort by the x variable so that the points are connected from left to right.
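For example:

   symbol1 v=dot    c=blue i=join;      /* straight line segments           */
   symbol2 v=square c=red  i=splines;   /* smooth curve, sorted by x first  */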

The regression method has several parts to its command. It starts with r. The second character is
either l for linear, q for quadratic, or c for cubic. Then, you can have either cli or clm (confidence limits
of prediction or confidence limits for the mean) followed by a confidence level like 80, 90, or 95. The
regression equation is also printed in a note in the log.
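For example, a linear fit with 95% confidence limits for the mean might be requested as:

   symbol1 v=dot c=blue i=rlclm95;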

A height= (h=) command controls the size of the value symbol, and a line= (l=) command selects a
line style.

Now wouldn't it be nice if the symbols didn't spill over the frame of the graph? So glad you asked...
The appearance of the axes is controlled by global axis statements similar to the symbol statements,
except that they are not automatically applied. The axis definition must be assigned to an axis using
the vaxis or haxis option in the plot statement. Here we see two of the options in the axis definition
demonstrated. The order= option controls the spacing and range of the values displayed on the axis.
For categorical data, a list of category names can be given in the parentheses. The label= option
controls the appearance of the axis label. The options inside the parentheses apply to the text that
follows, so some characteristics can be changed in the middle of a word. Similar options can also be
used in title statements, which will control the appearance of the titles in graphs. Also shown here is a
legend statement, and the corresponding reference in the plot statement.
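A hedged sketch of the pattern; the ranges, labels, and text are illustrative.

   axis1 order=(0 to 50 by 10)
         label=(h=1.5 'Response' h=1 ' (units)');
   axis2 label=('Dose');
   legend1 label=('Treatment group:');

   proc gplot data=mydata;
      plot y*x=group / vaxis=axis1 haxis=axis2 legend=legend1;
   run;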

At this time it is appropriate to add a note about sizing graphs. The graphs that SAS produces can be
resized by dragging the borders like other windows objects. However, this shrinks or expands
everything in the picture (like it should), so if you make the graph smaller you might no longer be able
to read the text. Although there are options that can specify the size of the graph, it is worthwhile to
experiment with another method. The initial size of the graph is determined (by default) by the size of
the SAS window (the application window, not the graph window). Resizing your SAS window can go a
long way toward giving you the results you want. Then, once a graph is produced, you should try
resizing it (the graph window this time) by small amounts in both directions. This can affect the
appearance considerably, through small adjustments in spacing between objects, and even in the
displayed font. When the graphs above were produced, for example, the font for the numbers on the
axes (tick mark labels) was thin and hard to read. A small adjustment in the size of the graph changed
it to what you see here.
Here is another type of plot, called a bubble plot. This is a three-dimensional plot, because the size of
the bubble is determined by the variable on the right side of the equal sign. In this case there are only
two values for group, so you only see two sizes.
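For example:

   proc gplot data=mydata;
      bubble y*x=group;
   run;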

SAS has more nice examples in the documentation. See some under "SAS/GRAPH Reference,
SAS/GRAPH Procedures, The GPLOT Procedure." To find more information about symbol and axis
statements, and the like, see "SAS/GRAPH Reference, SAS/GRAPH Concepts, SAS/GRAPH
Statements." It is really worthwhile to browse through this documentation to get an idea of what
SAS/GRAPH can do.
Next we turn to proc gchart, the high-resolution version of proc chart. Again, we can begin with simple
commands such as those we have learned in earlier lessons for proc chart. Here is a bar graph with
continuous data.
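For example, with a continuous numeric variable x (the data set name is a placeholder):

   proc gchart data=mydata;
      vbar x;
   run;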

In this case, x was treated as a continuous variable, and SAS used midpoints of 7 bin ranges that it
chose according to some default rules. You can use the levels= option to specify how many bins you
want SAS to create, or you can use the midpoints= option to list the midpoints you want. You can list
the numbers in parentheses or use "(a to b by c)" notation. If your values are discrete, use the
discrete option, as shown below. If the chart variable is character, there will be a bar for each value,
and the discrete option is not used.
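For example:

   proc gchart data=mydata;
      vbar x / discrete;                  /* one bar per distinct value */
      vbar x / midpoints=0 to 50 by 10;   /* or name the midpoints      */
      vbar x / levels=5;                  /* or just ask for 5 bins     */
   run;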

Like proc plot, you can also use a group option.

With a subgroup option, the colors of the bars and a crosshatch pattern can be controlled using
pattern statements. The value= or v= option uses "L" for left slanting, "R" for right slanting, and "X"
for crosshatch, followed by a number for the style.
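For example (colors and styles are arbitrary):

   pattern1 v=L1 c=blue;     /* left-slanting hatch  */
   pattern2 v=X3 c=red;      /* crosshatch, style 3  */

   proc gchart data=mydata;
      vbar x / discrete subgroup=group;
   run;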

You can make block charts:



Try out the hbar3d and vbar3d statements.



And, of course, there are pie charts.



The explode= option separates the listed slices away from the pie for emphasis. The angle= option
turns the pie.
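A sketch with placeholder names, exploding the slice for the value 7 and rotating the pie 90 degrees:

   proc gchart data=mydata;
      pie x / discrete explode=7 angle=90;
   run;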

Exercises:
Download the data program to get started. Produce a graph using proc gplot for each of the four
problems below.
1. First graph y*x and z*x on the same axes, using default settings (symbols and axes).
2. Customize the graph by defining some nice symbols and modifying the axes as you think
appropriate (don't add interpolating lines at this time). Add an appropriate title too, perhaps with
some nice formatting or color options.
3. Use the line interpolation method, with two different line styles for y and z.
4. Do a plot of only y*x with a linear regression line and clm90 option.
To get a little practice with proc gchart, simulate 200 tosses of a pair of dice and calculate the sums.
Create a chart using proc gchart for each of the five problems below. Explore options such as coloring,
patterns, etc. as you wish.
1. Create a frequency histogram (use the discrete option) for the first die.
2. Create a frequency histogram for the sum of the dice.
3. Make a pie chart for the first die.
4. Make a pie chart for the sum of dice with an "exploded" view of "7".
5. Create a side-by-side bar chart for the two dice, with 1's grouped, 2's grouped, etc.
For number 5, the data needs to be organized differently. You can do the other problems first, using
the same data, then do one of two things: You can generate new data that has a die number for one
variable and the die toss result for the other, or you can figure out how to rearrange the data you have
so that it is in that form (this would be better practice). In any case, you need to end up with
something like:
Die X
1 4
2 3
1 1
2 5 etc.
Optional challenge for geometry and craft fans: Write a data step that creates points on a circle,
ordered in such a way that when using proc gplot with i=join, the points will be connected like this
string art:

(http://www.mathcats.com/crafts/stringart.html)

Lesson 31: Introduction to IML

SAS IML is a programming language for working with matrices. IML stands for "Interactive Matrix
Language." The language is invoked, or started, by issuing a "proc iml;" command. IML is
interactive. You can keep submitting statements, one after the other, and IML will execute them. No
"run" statement is necessary. IML stops when another step (data or proc) starts, or a "quit;"
command is given.
Incidentally, the Learning Edition does not have IML.
We can define a matrix in IML with an assignment statement and a list of elements in curly brackets,
with rows separated by commas. Just like any other SAS code, the placement of the text on the line
doesn't matter, so you can string them out on one line or organize them neatly in rows and columns.
In order to see what the matrix looks like, you can use the print command.
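A minimal example:
proc iml;
x={1 2 3,
   4 5 6};        *a 2x3 matrix, rows separated by commas;
y={1 0, 0 1};     *the same kind of definition written on one line;
print x y;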

The standard matrix operations, like addition, subtraction, and multiplication, are given by the usual
operators, "+", "-", and "*". The division operator can be used with scalars, and if used with two
conformable matrices it will do element-wise division; a double asterisk ("**") is for exponents, as
usual. However, there are many more operators in IML. The number sign ("#") is element-wise
multiplication, and a double number sign ("##") is element-wise exponentiation. Double vertical bars
("||") will concatenate matrices side-by-side, while double slashes ("//") will concatenate them vertically.
There are a number of standard functions that are commonly used in IML. The transpose of X is given
by "X`", which is X followed by a back-quote character (to the left of "1" on the keyboard). It can also
be found by "t(X)" which is the function notation equivalent. The inverse of X is given by "inv(X)".
Since there are many functions, the best thing to do is refer to the documentation. In the Online
Documentation, IML has its own entry in the main tree. You can go to the link
http://support.sas.com/onlinedoc/913/docMainpage.jsp, then find the branch that says "SAS/IML
User's Guide." Click the plus sign next to it, and a long list of interesting branches opens up. Near the
bottom, we find "Language Reference." Open that up, and you will find, among other things,
"Operators" and "Statements, Functions, and Subroutines." In these two sections you will find
information about the operators and functions. There is no need to repeat all this here; make use of
the documentation as necessary.
Exercises:
1. Explore the available functions and give examples of the use of five different functions.

Lesson 33: Headings, Do Loops, and Sampling

In reference to the assignment in Lesson 32, in the documentation, look at the section called "Working
with Matrices." Then, under "Using Assignment Statements," find "Matrix Generating Functions." Here
you can see how to use the J matrix function to create a new matrix.
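For reference, here are a few quick examples of the J function, whose arguments are J(nrow, ncol, value), with the fill value defaulting to 1:
ones=J(3,1);       *a 3x1 column of 1s;
zeros=J(2,4,0);    *a 2x4 matrix of 0s;
sevens=J(3,3,7);   *a 3x3 matrix of 7s;
print ones zeros sevens;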
More options using the reset command:
reset autoname;
Autoname gives headings for the rows and columns of matrices that are printed. Default labels are
Row1, Col1, etc. You can define a vector of headings, such as head={"Mon" "Tue" "Wed"}; Then in a
print statement you can put
print x[colname=head, rowname=rowh];
where rowh is another vector of headings for the rows.
The mattrib command provides a more sophisticated way to assign row and column names.
mattrib x rowname=(rowh) colname=(head);
print x;
This association of the row and column headings will last as long as the iml step runs, so you don't
have to keep specifying them in the print statement.
The range assignment like
a=1:10;
b=5:2;
actually creates a vector of integers specified by the range. We saw that in subscripting we can use
variables like a and b to specify the rows and columns to be selected. However, any vector will work,
and the numbers need not be in order. For example, you could have
a={5, 3, 7, 1};
Using this variable to select rows in a subscript notation would pull the 5th, 3rd, 7th, and 1st rows, in
that order.
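A small sketch of this kind of subscripting:
x={11 12, 21 22, 31 32, 41 42, 51 52, 61 62, 71 72};
a={5, 3, 7, 1};
y=x[a,];           *rows 5, 3, 7, and 1 of x, in that order, with all columns;
print y;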
In addition, once a mattrib statement has been used to define row and column headings, those
headings can also be used in the subscript specification instead of the numbers.
You could even make the row and column headings part of a new matrix. However, if they are
character vectors (as usual), the matrix must be a character matrix. Look at the following:
x2=("Row"||head)//(rowh||x);
Visualize the results or try an example. The Char function may be used to convert a numeric matrix to
character form for this purpose.
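A sketch of that idea, using char to do the conversion (the width argument of 6 is arbitrary):
x={1 2, 3 4};
head={"c1" "c2"};
rowh={"r1", "r2"};
x2=("Row"||head)//(rowh||char(x,6));   *char(x,6) turns the numbers into character strings of width 6;
print x2;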
We bring in the usedcars data set for the next example. If work.usedcars doesn't exist, run a program
to load the usedcars.txt file into a SAS data set. Usedcars has both character and numeric data.
use usedcars;
read all into UC;
print UC;
This will bring in only the numeric variables (there are three).
read all var _num_ into UC;
is equivalent, because it specifies numeric variables, which is the default. You can include this to make
your program more readable.
To create a character matrix for the character variables, write:
read all var _char_ into UC2;
Now we'll learn something about do loops. We begin with the J function to create a column of 1's and
the following do loop:
Y=J(34,1);
do i=1 to 34;
Y[i,1]=uniform(0);
end;
Note that the editor turns the "do" red, but it is not an error; it is just a quirk of the editor's syntax coloring. Now do
the following:
Yr=rank(Y);
UCsample=UC[Yr[1:10,1],];
Look carefully at what is happening here. Yr contains the rank values of the random numbers in the Y
vector. I want to choose 10 randomly sampled rows from UC. I can get the indices from the first 10
entries of Yr. Remember they don't have to be in order. So that is what "Yr[1:10,1]" gives. It is in
essence a list of 10 index numbers for the rows I want to select. Now, I use that list or vector to subscript
the UC matrix, selecting those rows and all the columns, and assign the result to UCsample. Notice that Yr is actually a
column vector, but the subscript notation accepts either row or column vectors for arguments.
Now here is an alternative way to get the random numbers.
Y=uniform(0);
do i=1 to 33;
Y=Y//uniform(0);
end;
This works by starting with a random scalar, then vertically concatenating 33 more random numbers to
it, building up the Y vector that way.

Lesson 34: SAS Data Sets and Matrices


Let's say we have the dataset called "usedcars" in the work or user library. This time, we are going to
select the first 10 observations into two matrices. We can write:
proc iml;
reset printall;
use usedcars;
read point (1:10) into UC;
read point (1:10) var _char_ into UCC;
Read the documentation in the IML section about working with SAS data sets for details.
Now we want to go backwards, sending the contents of matrices back into a SAS data set.
create ucs from uc;
append;
use ucs;
list all;
The create statement only creates an empty data set. The append statement adds the data from the
matrix to the dataset.
To specify the variable names for the data set, use
create ucs from uc[colname={"year" "miles" "price"}];
This method with the from clause will only work with one matrix and therefore one type of data at a
time. The other method is to use create with a var clause. In order to do this, each variable should be
in a vector.
year=uc[,1];
miles=uc[,2];
price=uc[,3];
make=ucc[,1];
model=ucc[,2];
color=ucc[,3];
stock=ucc[,4];
create uccs var{year miles price make model color stock};
append;
use uccs;
list all;
To delete a data set so you can replace it, use
call delete("ucs");
Another useful command allows sorting a data set from within IML.
sort data=uccs by miles;
We can also sort a matrix by any column.
call sort(uc, 1); *sort by the first column (year);
Modules
Modules are basically subroutines. A module can just be a structure that groups a set of commands
executed at once, or it can be a subroutine that passes parameters, or it can be a function, which is
used in an assignment statement and returns one (matrix) value.

Lesson 35: Modules

Consider matrices
X={1 1, 2 2, 3 3};
Y={"A", "B", "C"};
Z={1 2 3, 2 3 4, 3 5 9};
X1=X[,1];
X2=X1+5;
Side note: the following commands give the same result:
J
To create data sets that have both numeric and character variables, name vectors that have the same
name as the variables in your data set and put them in a create statement this way:
create mix var{X1 X2 Y};
append;
show contents;
list all;
The var list should consist of vectors. If you put in a matrix it will list all the elements in a column.
If you want to recreate this data set (change it) you need to delete it first, using
call delete("mix");
Modules provide a means of saving and reusing code. They can take three forms: a simple module
without parameters, a subroutine module that passes parameters, and a function.
Simple module. The name is XPX. X and XPRIMEX are global variables. This type of module depends
on having the correctly named matrices available in the program, and having the program accept the
matrices it creates as global variables.
start xpx;
xprimex=x`*x;
print xprimex "Inside Module";
finish;
That just creates the module; nothing runs yet. So then run it and show that XPRIMEX exists outside the
module.
run xpx;
print xprimex "Outside Module";
Errors inside modules can do strange things. A module may be created even if it has errors. If this
happens and you run it, it might go into pause mode. This means that it is waiting for your input. In
this case you can submit commands and it continues to wait. To get out of the module, you can issue
a "stop" command. You can also use a "stop" command inside the module in programming
statements, such as in if-then conditionals. If the condition is met, the module will stop, but IML keeps
running. There is also an "abort" command that you can use in programming statements which will
exit out of IML under program control. You can also pause a module with programming statements so
that it waits for your input. If you put a "quit" statement inside a module, IML will exit immediately
and the module will not be created.
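As an illustration of a stop inside an if-then (the module name and the singularity check are just examples, not from the notes):
start safeinv(A);
   if det(A)=0 then do;
      print "A is singular, stopping the module";
      stop;                 *the module stops but IML keeps running;
   end;
   Ainv=inv(A);
   print Ainv;
finish;
s={1 2, 2 4};               *a singular matrix;
n={1 2, 3 4};               *a nonsingular matrix;
run safeinv(s);             *prints the message and stops;
run safeinv(n);             *prints the inverse;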
Subroutine module. This type of module accepts parameters when it is called and can pass parameters
back. Put parameters next to the module name in parentheses. The parameters will have local symbol
table values, even if they have the same names as variables in the global symbol table. The second
print command will not work.
start xpx2(X);
xprimex=x`*x;
print xprimex "Inside Module";
finish;
run xpx2(Z);
print xprimex "Outside Module";
You can pass parameters back out: This time the second print works, because the value of xprimex
has been passed back out to the global variable Y. Here the global Z value is passed into the module's
local symbol table where it becomes X.
start xpx2(X,xprimex);
xprimex=x`*x;
print xprimex "Inside Module";
finish;
run xpx2(Z,Y);
print Y "Outside Module";
Function module. This kind of module returns a single value and is used in assignment statements like
any function. The print statements inside the module would not normally be there; they are just to show
what happens inside the module.
start upsquare(X);
print X "Incoming value of X";
n=min(nrow(X),ncol(X));
X=X[1:n,1:n];
print X "Processed value of X";
return(X);
finish;
Z=upsquare(X);
print Z;
Storing modules and matrices. If you want to save all your matrices and modules, a simple store
command will do it. It will, by default, go to the current user library (work if you haven't changed it).
There is some difficulty in keeping track of when the library goes into effect, so the best thing to do is
to have a permanent library defined, then use
reset storage=library.imlstor;
store;
where library is your library name. You can give it a different name, but IMLSTOR is the default name
of the IML catalog, which is a single file that holds all the matrices and modules. You can then call
everything back into memory with a load command.
load;
But you can store and load individual matrices and modules too. For matrices just give the name, for
modules, say "module=name".
store module=XPX X Y;
load X;
Exercises:
1. Write a subroutine without parameters that finds the average value of each column in a matrix,
then subtracts that value from each element in the corresponding column (this is called "centering" in
regression).
2. Write a subroutine module that takes two parameters, a constant and a square matrix, adds the
constant value to every element of the diagonal of the matrix, and passes the result back in a new
parameter (this is related to "ridge regression").
3. Write a function module that will return one of three identically sized input matrices that has the
largest determinant (you decide what to do about ties).

Lesson 36: First Simulation


Setting a user libname while IML is running may not change the default library. It is not entirely clear
to me how this works. You can define the libname user before starting IML, and then the default
module storage will go there. You can save matrices by saying "store matrixname;" and you can bring
them back with "load matrixname;" Modules need the syntax "store module=modulename;" You can
also use the load and store commands without any options (names) and you will load or store all
matrices and modules that are available.
IMLstore is the default name of the catalog in which matrices and modules are stored. A catalog is a
single file that stores multiple objects. If you don't have a user library defined this will go into work.
Otherwise it will save it in the specified user library.
There is another way to do this though. You can use an option in the reset command, like
libname stat510 "C:\Stat510";
reset storage=stat510.imlstore;
You can use other names instead of imlstore but it is probably best to keep the default name.
Monte Carlo simulations are done by generating data from a theoretical distribution, while bootstrap
methods use actual data but resample from the existing data to calculate various statistics. Both
methods work by generating thousands of samples and then studying the statistics that come from
those samples.
Starting with a Monte Carlo simulation: we set up some parameters for the simulation.
proc iml;
sims=10; *this is for the outer loop that says how many times the simulation is repeated;
ssize=10; *this is the size of each simulated sample;
mean=20;
std=3; *these are parameters for a normal distribution which we are going to simulate;
Next, we set up the basic loops. We will have an outer loop that repeats the samples, and an inner
loop that takes one sample of size ssize.
do i=1 to sims;
do k=1 to ssize;
end;
end; *sims;

My sample is going to be stored in a vector called sample. With each simulation loop we need to
clear out the old sample, so add a free command just inside the simulation loop. Normal random
variables are generated by taking the standard normal random function, normal(seed), multiplying it
by the desired standard deviation, and then adding the desired mean. Here we create a sample by
using a loop which appends a new random number to the vector with each iteration. It is not
necessary to initialize the matrix.
do i=1 to sims;
free sample; *clear sample vector before starting sample procedure;
do k=1 to ssize;
sample=sample//(normal(0)*std+mean);
end;
print sample; *for test only, remove later;
end; *sims;
Let's study the distribution of x-bar (the sample mean). That is, if you do repeated sampling, what is
the average (mean) of all the x-bars, and what is the standard deviation (or variance) of the x-bars?
This is a basic question in the study of statistics. If you have a statistic, that is, a number calculated
from a sample, usually intended to estimate something, what are its properties? More specifically,
what is its distribution? Is it an unbiased estimate of the parameter you want to estimate? Does it
have a small enough variance to be useful? Is its variance smaller than other candidate estimators?
Sometimes there is nice theory to answer these questions, but in other cases we either don't have a
solution to the needed equations, or we do not have data that meets the assumptions of the theory.
This is when simulation comes in handy.
What we want to do now is calculate xbar for each sample, then accumulate them in a vector. There
will be an xbar for each sample (each loop of sims). Then we need to calculate the mean of all the
xbars and the standard deviation of all the xbars. For the standard deviation, we will implement the
formula
sx = sqrt( (∑x² - n*xbar²) / (n-1) )
free xbar; *clear xbar vector before running simulation again;
do i=1 to sims;
free sample; *clear sample vector before starting sample procedure;
do k=1 to ssize;
sample=sample//(normal(0)*std+mean);
end;
xbar=xbar//sample[:,];
end; *sims;
print xbar; *for test only, remove later;
meanxbar=xbar[:,];
sxbar=sqrt((xbar[##,]-sims*meanxbar**2)/(sims-1));
print meanxbar sxbar;
truesxbar=std/sqrt(ssize); *theoretical value of sxbar;
print mean truesxbar;
A better way to do the sampling is to use the matrix version of the normal function. You can take out
the sampling loop and replace it with simpler commands as follows. The "free sample" command is
also not needed this way.
free xbar; *clear xbar vector;
do i=1 to sims;
sample=J(ssize,1,0);
sample=normal(sample)*std+mean;
xbar=xbar//sample[:,];
end; *sims;
meanxbar=xbar[:,];
sxbar=sqrt((xbar[##,]-sims*meanxbar**2)/(sims-1));
print meanxbar sxbar;
truesxbar=std/sqrt(ssize); *theoretical value of sxbar;
print mean truesxbar;
Now change the sims to 10,000 and see what results you get.
Using Simulation to evaluate alpha and beta (significance and power)
Alpha=P(Type I Error)=P(Reject Ho|Ho is True)
Beta=P(Type II Error)=P(Do not Reject Ho|Ho is False)
Power=1-Beta=P(Reject Ho|Ho is False)

Lesson 37: Simulations for Power and Significance


Simulations help us understand how statistics work. We now see exactly what is meant by xbar having
a distribution, as we have calculated a bunch of xbars and found their mean and standard deviation.
One of the purposes of simulation is to check whether theory matches reality. So we have also compared the
theoretical values of parameters with the simulated values to see if they agree.
We will now continue to examine issues with hypothesis tests. Continuing with the previous program,
we will now add a hypothesis test. For convenience we will test whether the mean is zero. Any value
could be used. So we have
Ho: mu=0
Ha: mu<>0.
We used xbar for the name of the vector that collected all the xbars in the simulation, so we need a
different name to calculate the mean of the current sample for the t statistic. We will call it smean.
The t statistic for the test of mu=0 is
t=smean/(s/sqrt(ssize)) where
s=sqrt((sample[##,]-ssize*smean**2)/(ssize-1)).
Now the p-value for the test is the probability in both tails: beyond the t value on one side and beyond
its opposite on the other side. IML has a probt function which gives the left-tailed probability of any t
statistic, whether that statistic falls on the left or the right. So to get the correct probability we can take
probt(-abs(t),ssize-1)*2. Taking minus the absolute value makes sure we are using the left tail to
calculate the probability, and the factor of 2 accounts for both tails.
The t statistic has degrees of freedom, which will be ssize-1 in our notation.
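Putting these pieces together for a single sample, the calculation might look like this, using sample, smean, and ssize as they are defined in the program below:
s=sqrt((sample[##,]-ssize*smean**2)/(ssize-1));   *sample standard deviation;
t=smean/(s/sqrt(ssize));                          *t statistic for Ho: mu=0;
p=probt(-abs(t),ssize-1)*2;                       *two-tailed p-value;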
proc iml;
sims=10;
ssize=10;
mean=0; *set the mean to 0 for now;
std=3;
alpha=.05; *provide an alpha value for the decision;
free xbar; *clear xbar vector;
free pvec; *like xbar, clear before restarting simulation;
do i=1 to sims;
sample=J(ssize,1,0);
sample=normal(sample)*std+mean;
smean=sample[:,]; *new name for sample mean;
p=probt(-abs(smean)*sqrt(ssize)/sqrt((sample[##,]-ssize*smean**2)/(ssize-1)),(ssize-1))*2; *calculate p-value;
pvec=pvec//p; *accumulate the p-values in a vector;
xbar=xbar//smean;
end; *sims;

numrej=sum(pvec<alpha); *summarize number of rejections;


pctrej=numrej/sims*100; *and do percents;
print numrej pctrej; *and print;
The pvec vector stores the pvalues during the simulation, then at the end we count up the rejections in
the numrej variable. The sum function adds up the elements of a matrix. The expression pvec<alpha
creates a matrix of ones and zeros, one for every element where the comparison is true.
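A tiny example of this counting trick:
pvec={.01, .20, .03, .60};
alpha=.05;
flags=pvec<alpha;       *a column of 1s and 0s;
numrej=sum(flags);      *numrej is 2 here;
print flags numrej;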

When the program is all working, change the sims to a larger value such as 1000 or 10,000. The
rejections should be about 5%, since that is the alpha value--the probability of rejecting when Ho is
true.
Now consider changing the mean. Well, let's try 3. I got about 36% rejections. So the true mean is
not zero, thus the null hypothesis is false. We have rejected 36% of the time, meaning our beta is .64
and our power is .36. But if you try other means you will see that the power varies.
This is a good time to think about what it means for a null hypothesis to be true or false. Suppose we
put the mean at .0001. That is not zero, so the null hypothesis is false, right? Or is it close enough to
be true? I ran the simulation this way and got no rejections. Now if we say the null hypothesis is true,
that means they were all correct decisions. But if we say the null hypothesis was false, then they are
all type II errors and the power of the test is 0. It is important to realize here that the standard
deviation also plays a role. If you changed the standard deviation to .0001 you would get a very
different result!

Lesson 38: Power and Regression


Here is our program from last time again.
proc iml;
sims=10;
ssize=10;
mean=0; *set the mean to 0 for now;
std=3;
alpha=.05; *provide an alpha value for the decision;
free xbar; *clear xbar vector;
free pvec; *like xbar, clear before restarting simulation;
do i=1 to sims;
sample=J(ssize,1,0);
sample=normal(sample)*std+mean;
smean=sample[:,]; *new name for sample mean;
p=probt(-abs(smean)*sqrt(ssize)/sqrt((sample[##,]-ssize*smean**2)/(ssize-1)),(ssize-1))*2; *calculate p-value;
pvec=pvec//p; *accumulate the p-values in a vector;
xbar=xbar//smean;
end; *sims;

numrej=sum(pvec<alpha); *summarize number of rejections;


pctrej=numrej/sims*100; *and do percents;
print numrej pctrej; *and print;
We can try some different sample sizes and see how they affect power. The larger the sample size, the
more likely you are to detect a false null hypothesis.
Now it is nice to have a graph to display power under different conditions. We can add another outer
loop to our simulation so that we can repeat it with different means. We will keep track of the percent
rejects for different means in another matrix and not print them individually, but make a graph.
proc iml;
sims=10;
ssize=10;
std=3; *remove mean from this list;
alpha=.05;
free xy; *this matrix stores the result of each simulation;
do mean=-5 to 5; *set up a loop that changes the mean;
free xbar;
free pvec;
do i=1 to sims;
sample=J(ssize,1,0);
sample=normal(sample)*std+mean;
smean=sample[:,];
p=probt(-abs(smean)*sqrt(ssize)/sqrt((sample[##,]-ssize*smean**2)/(ssize-1)),(ssize-1))*2;
pvec=pvec//p;
xbar=xbar//smean;
end; *sims;

numrej=sum(pvec<alpha);
pctrej=numrej/sims*100; *stop printing this;
xy=xy//(mean||pctrej); *create a matrix with columns for the means and pctrejs;
end; *end of the loop for means;
call pgraf(xy); *graphs a two-column matrix;
Next we will try to simulate regression. A data point in regression is assumed to be the result of a
linear equation plus an error term which has a mean of zero and a constant variance. To simulate
regression data, then, we need a set of x's together with the parameters (coefficients) in the equation, and
also a standard deviation for the errors. We can then generate y values.
The solution for the parameter estimates (referred to as the b vector) is found by matrix calculations
based on a matrix called X which, in the simple linear case, consists of a column of 1's and a column of
the x values. (In multiple regression there are more columns of independent variables.) The solution
is given by b = (X`X)^(-1)X`Y, or inv(X`*X)*X`*Y in IML notation. So the procedure here is to first
simulate a regression sample, then calculate the b vector, and analyze the distribution of the results.
proc iml;
*simulating regression;
*note that at the beginning of regression we
assume the x values are fixed, there is a
linear relationship between x and the mean
of y, and errors have a homogeneous normal
distribution with mean 0.;
sims=10000;
ssize=10;
b0=20;
b1=5;
std=3;
xvals=(1:ssize)` ; *this is the fixed set of x;
x=J(ssize,1)||xvals; *the X matrix in simple linear regression is ;
*a column of 1s for the intercept and the x values;
xpxi=inv(x`*x); *"x prime x" matrix;
h=xpxi*x`; *useful in later calculations;
free bsave; *this matrix will hold the parameter estimates;
do i=1 to sims;
errors=J(ssize,1,0);
errors=normal(errors)*std;
y=b0+b1*xvals+errors; *note that xvals and errors are vectors;
b=h*y;
bsave=bsave//b`;
end;
bmean=bsave[:,]; *This calculates means of both columns;
btruemn=b0||b1; *Want a vector to compare to bmean;
btruecov=std**2*xpxi; *standard formula for covariance matrix of b;
*calculate variances and covariances from simulation to compare;
bvar=(bsave[##,]-sims*bmean##2)/(sims-1);
bcov=((bsave[,1]#bsave[,2])[+,]-sims*bmean[,1]*bmean[,2])/(sims-1); *just cov(bo,b1);
bcovar=(bvar[1,1]||bcov)//(bcov||bvar[1,2]); *variance-covariance matrix;
bstd=sqrt(bvar);
print bmean btruemn bstd;
print bcovar btruecov;

quit;

Exercise:
Modify the regression simulation to do a one-way ANOVA with three levels. There is actually not much
to change because the matrix solutions will be the same. What changes is the X matrix and the way
you simulate the data. You will have three means for three groups or levels. The distribution of errors
is considered the same for all (simplest assumptions again). Your y vector will then consist of the
group means plus the errors. The X matrix will consist of three columns which are the values of the
indicator variables for the groups. It looks something like
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
with one row per observation and a 1 in the column for that observation's group.

Lesson 39: Binomial Confidence Intervals


Consider a binomial distribution with parameters n and p. We want to study confidence intervals and
coverage. Here is a way to simulate binomial data:
%let n=10;
%let p=.5;
%let sims=10;
data one;
do i=1 to &sims;
x=ranbin(0,&n,&p); *parameters are seed, n, p;
output;
end;
proc print;
run;
The usual method of calculating a confidence interval uses the normal approximation. This can
perform poorly, especially if p is not close to .5 and n is small. The formula is
phat ± z(alpha/2)*sqrt(phat*qhat/n), where qhat = 1 - phat.
Now, coverage means the proportion or percent of confidence intervals that actually contain or capture
the true parameter (p). So, ideally, if we have a 95% confidence interval, 95% of confidence intervals
should contain the true parameter. However, this may not be true if the assumptions are not met or
the approximation is poor.
In our simulation we will then calculate a 95% CI for each sample being simulated.
%let n=10;
%let p=.5;
%let sims=10;
%let z=1.96; *z-value for a 95% CI;
data one;
do i=1 to &sims;
x=ranbin(0,&n,&p);
phat=x/&n;
lb=phat-&z*sqrt(phat*(1-phat)/&n); *upper and lower bounds of CI;
ub=phat+&z*sqrt(phat*(1-phat)/&n);
output;
end;
proc print;
run;
To find coverage we need to keep track of how many times the CI includes the true mean. Now, we
could store the results of the simulation in a data set and then analyze it, but this is a simple question.
We can actually count the correct results as the data step runs. We don't need to save all the results.
data one(keep=cover coverage);
do i=1 to &sims;
x=ranbin(0,&n,&p);
phat=x/&n;
lb=phat-&z*sqrt(phat*(1-phat)/&n);
ub=phat+&z*sqrt(phat*(1-phat)/&n);
if &p>lb and &p<ub then cover+1; *count up the intervals that cover;
coverage=cover/&sims; *remove the output statement, only need the final result;
end;
Run this with p=.5 and n=10 with 1000 or 10,000 simulations. I got a coverage of 89%. This means
we do not have a 95% CI, but only an 89% CI in this situation. Now try increasing the sample size (n)
to 20. I got 95.7%. So with larger sample sizes we are getting better coverage. Increasing the
sample size beyond that doesn't seem to improve it. Now try setting p to .1. It seems in this case you
need a sample of at least 100 to get close to 95% coverage.
However, that is not the whole story. There are other problems with these confidence intervals. Look
at this sample of 10 confidence intervals produced with a p of .1 and an n of 20. Notice that one
interval is 0 to 0 so it automatically misses. But most of the examples have negative numbers. This
does not make sense, because 0<p<1. If p=.9, you might get the same phenomenon at the other
end, with values greater than 1. Some people just solve this problem by setting the lower bound to 0
or the upper bound to 1. It is still somewhat unsatisfactory.
-0.031481 0.23148
0.000000 0.00000
-0.006493 0.30649
-0.006493 0.30649
0.024692 0.37531
-0.031481 0.23148
-0.045519 0.14552
-0.031481 0.23148
-0.006493 0.30649
-0.031481 0.23148
There is an alternate formula for the confidence interval (the Agresti-Coull "plus four" interval). The technique consists
of adding two successes and two failures to the results. In other words, the estimator is
ptilde=(x+2)/(n+4). We will add this formula to the program and compare the results.
data one(keep=cover coverage covert coveraget);
do i=1 to &sims;
x=ranbin(0,&n,&p);
phat=x/&n;
lb=phat-&z*sqrt(phat*(1-phat)/&n);
ub=phat+&z*sqrt(phat*(1-phat)/&n);
if &p>lb and &p<ub then cover+1;
coverage=cover/&sims;
ptilde=(x+2)/(&n+4);
lbt=ptilde-&z*sqrt(ptilde*(1-ptilde)/(&n+4));
ubt=ptilde+&z*sqrt(ptilde*(1-ptilde)/(&n+4));
if &p>lbt and &p<ubt then covert+1;
coveraget=covert/&sims;
end;
Here are the alternate confidence intervals for the same run as that above. We didn't get any 0 to 0
intervals, and we only got one negative number.
0.03337 0.34758
-0.02220 0.21268
0.06769 0.40850
0.06769 0.40850
0.10498 0.46645
0.03337 0.34758
0.00286 0.28286
0.03337 0.34758
0.06769 0.40850
0.03337 0.34758

For n=10 and p=.1, the old method has a coverage of about 65%, while the new method has a
coverage of about 92%. The new method gives about 95% with only n=20, as compared to n=100
for the old method.

Lesson 40: Bootstrap for Model Selection Frequencies

libname s "c:\stat510";
options nonotes;
%let ds=s.manp; *name of dataset to be analyzed;
%let n=25; *number of observations in source dataset;
*Reserved variables that should not be in the original data: i, idr;
*Reserved dataset names should not be in work: temp, rands, sub;
%macro boot;
*This data step reads the source data and adds an idr column;
data temp;
set &ds;
idr+1;
%do j=1 %to 100;
*This data step creates random numbers for bootstrap sample;
data rands(drop=i);
do i=1 to &n;
idr=int(uniform(0)*&n)+1;
output;
end;
*This sql step creates the bootstrap sample;
proc sql;
create table sub as
select temp.* from temp, rands where temp.idr=rands.idr;
*Here is where the analysis goes. Depending on the task, you
need to send output to an output data set, then pull it into
a summary data set (probably via sql). Finally, the summary
data set must be summarized to generate the desired bootstrap
statistics;
*Example using proc reg and adjusted r-square selection to evaluate
model selection frequencies;
proc reg data=sub outest=est noprint;
model y=x1-x7 /selection=adjrsq best=1;
%if &j=1 %then %do;
*Creates the summary data set on the first iteration;
data summ;
set est;
%end;
%else %do;
*Adds to the summary data set on subsequent iterations;
proc sql;
insert into summ select * from est;
%end;
%end;
*process summary data set;
data summ2 (keep=modl);
set summ;
length modl $21.;
if x1 ne . then modl=" x1-";
if x2 ne . then modl=trim(modl)||"x2-";
if x3 ne . then modl=trim(modl)||"x3-";
if x4 ne . then modl=trim(modl)||"x4-";
if x5 ne . then modl=trim(modl)||"x5-";
if x6 ne . then modl=trim(modl)||"x6-";
if x7 ne . then modl=trim(modl)||"x7-";

proc freq data=summ2;


tables modl;
run;
%mend boot;

%boot;
Lesson 41: Bootstrap II

Difference between regression and anova simulation


Download the stateinfo data set.
Use proc univariate to examine the distribution of the area. It is not normal due in large part to
outliers.
This data will be used to demonstrate how we can use bootstrapping to estimate a population
parameter, and then in addition, to use simulation to compare the bootstrap estimate to a "normal"
estimate in terms of its statistical performance.
We will consider the 50 state values to be the entire population. We will be sampling from this
population, and studying the behavior of our estimates for the mean in terms of their ability to predict
the population mean. We will use the area variable to do this.
Because this data is highly skewed with serious outliers, the measure of central tendency that should
be used is the median. However, for the purpose of this example, we will focus on estimating the
mean. We can find the population mean (mu) from the proc univariate output; it is 75894.1.
What is a confidence interval?
The next part of the process is a bit confusing because we have different levels of sampling. We are
going to begin by simulating taking a sample (without replacement) from the population. In this part
we are simulating what happens when we really take a sample. The process is similar to what we
studied the first time we did it in IML. We can use a uniform random variable assigned to each
observation and then sort them and take the top 20 or whatever our sample size is. From this sample
we can calculate a traditional x-bar and confidence interval based on normal theory.
Then we go one step further, and take this sample and do a bootstrap on it, which means we will
resample from it, with replacement, and from all these samples, we find the percentiles corresponding
to the confidence level we want. (We will use 90% because it is easier to get P5 and P95.) These will
be the bootstrap lower and upper bounds. We then repeat the sampling process (sample from the
population again) and find a new P5 and P95. Do this many times and see what the coverage is, as well
as the variability. Does it perform better than the normal-theory interval?
libname s "c:\stat510";
proc print data=s.stateinfo;
run;
options ls=80 nonotes;
*treat data set as population data. Examine distributions, particularly area.;
proc univariate data=s.stateinfo normal plot;
var area pop hipt;
run;

*note non-normality, mainly due to Alaska.;


proc univariate data=s.stateinfo normal plot;
var area;
where numenter < 49;
run;

*We will study the efficiency of the bootstrap technique for building
confidence intervals. Copy the mean from univariate output.;

*Let us use normal techniques for building confidence intervals and


see what the true coverage is.;
%let trumean=75894.1;
%let n=20;
%let sims=10;
%let ds=s.stateinfo;
%cistates; *submit the macro definition below before running this call;
%macro CIstates;
%do j=1 %to &sims; *This is the simulation loop;
data temp (keep=state area rand);
set &ds;
rand=uniform(0);
*This data step creates random numbers for simulation sample;
proc sort data=temp out=sub1(drop=rand);
by rand;
data sub;
set sub1(obs=&n);
*proc print;run;
*This sql step calculates the normal-theory confidence interval from the sample;
proc sql;
create table est as
select mean(area)-tinv(.975,(&n-1))*std(area)/sqrt(&n) as LB,
mean(area)+tinv(.975,(&n-1))*std(area)/sqrt(&n) as UB from sub;
run;
%if &j=1 %then %do;
*Creates the summary data set on the first iteration;
data summ;
set est;
%end;
%else %do;
*Adds to the summary data set on subsequent iterations;
proc sql;
insert into summ select * from est;
%end;
%end; *end of simulation loop;
*process summary data set;
data summ2;
set summ;
cover=0;
if lb<&trumean and ub>&trumean then cover=1;
proc print data=summ2(obs=10);
run;
proc freq data=summ2;
tables cover/nocum;
proc means data=summ2 mean std cv;
var lb ub;
run;
%mend CIstates;

%macro CIboot;
data temp (keep=state area rand);
set &ds;
rand=uniform(0);
*This data step creates random numbers for simulation sample;
proc sort data=temp out=sub1(drop=rand);
by rand;
data sub;
set sub1(obs=&n);
idr+1;
run;
%do j=1 %to &sims; *This is the simulation loop;
*This data step creates random numbers for bootstrap sample;
data rands(drop=i);
do i=1 to &n;
idr=int(uniform(0)*&n)+1;
output;
end;
*This sql step creates the bootstrap sample;
proc sql;
* create table sub as ;
select sub.* from sub, rands where sub.idr=rands.idr;
quit;
*left off here;
%if &j=1 %then %do;
*Creates the summary data set on the first iteration;
data summ;
set est;
%end;
%else %do;
*Adds to the summary data set on subsequent iterations;
proc sql;
insert into summ select * from est;
%end;
%end; *end of simulation loop;
*process summary data set;
data summ2 (keep=modl);
set summ;

proc freq data=summ2;


tables modl;
run;
%mend CIboot;

Lesson 42: Bootstrap III


We should be using the t distribution to calculate normal-theory confidence intervals. Use tinv(p, df), but be
careful about the tail probability, which is handled differently in different software. SAS's tinv uses the
left-tail probability (Excel's TINV uses two tails).
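For instance, in a data step (df=19 corresponds to the n=20 samples used here):
data _null_;
   t95=tinv(.975,19);   *critical value for a 95% two-sided CI, about 2.093;
   t90=tinv(.95,19);    *critical value for a 90% two-sided CI, about 1.729;
   put t95= t90=;
run;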
Continue building program from last time.

libname s "c:\stat510";
proc print data=s.stateinfo;
run;
options ls=80 nonotes;
*treat data set as population data. Examine distributions, particularly area.;
proc univariate data=s.stateinfo normal plot;
var area pop hipt;
run;
*note non-normality, mainly due to Alaska.;
proc univariate data=s.stateinfo normal plot;
var area;
where numenter < 49;
run;
*We will study the efficiency of the bootstrap technique for building
confidence intervals. Copy the mean from univariate output.;
*Let us use normal techniques for building confidence intervals and
see what the true coverage is.;
%let trumean=75894.1;
%let n=20;
%let sims=100;
%let ds=s.stateinfo;
%cistates; *submit the macro definition below before running this call;
%macro CIstates;
%do j=1 %to &sims; *This is the simulation loop;
data temp (keep=state area rand);
set &ds;
rand=uniform(0);
*This data step creates random numbers for simulation sample;
proc sort data=temp out=sub1(drop=rand);
by rand;
data sub;
set sub1(obs=&n);
*proc print;run;
*This sql step calculates the normal-theory confidence interval from the sample;
proc sql;
create table est as
select mean(area)-tinv(.95,(&n-1))*std(area)/sqrt(&n) as LB,
mean(area)+tinv(.95,(&n-1))*std(area)/sqrt(&n) as UB from sub;
run;
%if &j=1 %then %do;
*Creates the summary data set on the first iteration;
data summ;
set est;
%end;
%else %do;
*Adds to the summary data set on subsequent iterations;
proc sql;
insert into summ select * from est;
%end;
%end; *end of simulation loop;
*process summary data set;
data summ2;
set summ;
cover=0;
if lb<&trumean and ub>&trumean then cover=1;
proc print data=summ2(obs=10);
run;
proc freq data=summ2;
tables cover/nocum;
proc means data=summ2 mean std cv;
var lb ub;
run;
%mend CIstates;
%let trumean=75894.1;
%let n=20;
%let sims=100;
%let boots=100;
%let ds=s.stateinfo;
%ciboot; *submit the macro definition below before running this call;
%macro CIboot;
%do k=1 %to &sims; *This is the simulation loop;
data temp (keep=state area rand);
set &ds;
rand=uniform(0);
*This data step creates random numbers for simulation sample;
proc sort data=temp out=sub1(drop=rand);
by rand;
data sub;
set sub1(obs=&n);
idr+1;
run;
%do j=1 %to &boots; *This is the bootstrap loop;
*This data step creates random numbers for bootstrap sample;
data rands(drop=i);
do i=1 to &n;
idr=int(uniform(0)*&n)+1;
output;
end;
*This sql step creates the bootstrap sample and calculates xbar;
%if &j=1 %then %do;
proc sql;
create table summ as
select mean(area)as xbar from sub, rands where sub.idr=rands.idr;
quit;
%end;
%else %do;
proc sql;
insert into summ
select mean(area)as xbar from sub, rands where sub.idr=rands.idr;
quit;
%end;
%end; *end of bootstrap loop;
proc means data=summ noprint;
var xbar;
output out=summ2 p5=lb p95=ub;
run;
%if &k=1 %then %do;
data summ3;
set summ2;
run;
%end;
%else %do;
proc sql;
insert into summ3
select * from summ2;
quit;
%end;
%end; *simulation loop;
data summ4;
set summ3;
cover=0;
if lb<&trumean and ub>&trumean then cover=1;
proc print data=summ4(obs=10);
proc freq data=summ4;
tables cover/nocum;
proc means data=summ4 mean std cv;
var lb ub;
run;
%mend CIboot;
