

Data preparation is the first thing you do as a data analyst, and you spend a lot of
time on it well before you try to build predictive models with the data.
In essence, data preparation is about processing data to get it ready for all kinds
of analysis. Most industry data collection is driven by front-end business processes,
not by the needs of predictive models. At some point or other, these processes
introduce errors into the data.
There can be many reasons [not necessarily errors] for which we'd need to
pre-process our data and change it for the better:

Missing data
Potentially incorrect data
Need for changing form of the data

We'll discuss various reasons and methods to achieve our pre-processing goals
going forward.

Handling Missing Values and Outliers

You'll find that the treatment of missing values and of outliers can at times be
very similar. The reason is that both kinds of observations cannot be used as they
are, because their information is either missing or misleading.
Treatment of missing values:

Removing observation with missing values

This is the most common method in the industry, because missing values are
generally a very small chunk of the data that you deal with. However, you need to
keep the following things in mind while removing observations because of
missing data:

If observations with missing values are significant chunk of the data then you
should not drop all observations with missing values
If the variable which had missing values has entered in your model, you need to
plan what to do when you encounter missing values in the unseen data while
model has been put in production.

Imputing [filling up] missing values with mean/median/mode of the respective variable

We don't need to get into the details of this.

Imputing with business logic

Many times we know what a missing value might mean in the context of the business
process. For example, if the account balance is missing for a bank account, it might
mean that the account balance is zero.
Treatment of Outliers:

Removing observations with outliers

There are two issues with including outliers in predictive analysis:

Because of outliers, the ranges of the predictor variables get inflated
artificially. The model that you get might not be applicable across that range.
Some outliers have high leverage in the modelling process. In the
presence of such observations you'll get a model which is not a good fit for the
general population [data].

If you are preparing data for predictive modelling, you need to remove outliers.
However, if the variable with outliers is present in the model, you need to figure out
what to do when you encounter outlier values in unseen data once the model has
been put in production.


In some cases it might make sense to impute outlying values with upper and lower
limits when they exceed either of these values. Imputing with lower limit is called
flooring and imputing with upper limit is called capping.
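
Flooring and capping can be sketched in a data step as below. This is only an illustration: the limits 10000 and 60000 and the names cars_capped and invoice_capped are made up for this example; in practice the limits would come from your own analysis of the variable.

```sas
/* Sketch only: the limits below are hypothetical, not recommended values */
data cars_capped;
  set sashelp.cars;
  invoice_capped = invoice;
  if not missing(invoice) then do;
    if invoice < 10000 then invoice_capped = 10000;       /* flooring */
    else if invoice > 60000 then invoice_capped = 60000;  /* capping  */
  end;
run;
```

The original variable is kept untouched here so that the raw and the treated values can be compared.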

Imputing with business logic

Many times we know what an outlier value might mean in the context of the
business process.

Need for changing form of the data

Transforming and extracting information from the existing data
Consider a simple transaction date and time column for an eCommerce website. A
simple column containing dates will not be of much use but a lot of information can
be extracted from this simple looking data. E.g. : Information regarding gaps
between transactions, number of transactions happening every week or day or
month etc.
Collapsing and Summarising Data:
Many times we need to collapse data based on some grouping variables [this is
more or less the same as what we discussed in univariate statistics]. E.g. finding
a monthly summary from daily transaction data. In addition to the tools
which we learned in the Univariate Statistics module, we will learn a few new things
in the "to do with SAS" section.

Transposing Data
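This is one of the very useful procedures we'll learn here. An example of long data
is the dataset long1 created in the Proc transpose section below, which has one row
per family and year.

Sometimes it makes sense to convert this kind of data into a wide format. The
dataset wide1 created in the same section is the same data in wide format.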




Since SAS processes data row by row in many procedures as well as in data step code,
this kind of transformation is needed quite often. We'll learn how
to achieve it with Proc Transpose.
Formatting Data Columns, Creating Reports
In addition to other tools we'll also learn very useful procedures for creating all
kinds of reports and user defined data format using Proc Report and Proc Format

Data Preparation with SAS

In the coming section we'll learn many tools, SAS functions and utility procedures to
carry out the data preparation tasks that we have discussed so far, and then some more.
We'll start with finding answers to a few simple questions based on the data
"bank_transactions" using tools that we learned in the Univariate Statistics module.
Later we'll see how the same can be achieved in a much simpler and faster manner.
libname dp "/folders/myfolders/Datasets/Data Prep";

Q: find category of highest transaction in debit/credit for each month

A: We can sort the data by year, month, debit/credit and then amount in descending
order. Then within each group we can find the observation with the max amount.

proc sort data=dp.bank_transactions;
by year month dc descending amount;
run;

proc means data=dp.bank_transactions max;
var amount;
by year month dc;
run;

Q: total transaction for debit/credit each month

A: We can again use a combination of proc sort and proc means to find this out with
the "sum" option.
proc sort data=dp.bank_transactions;
by year month dc;
run;

proc means data=dp.bank_transactions sum;
var amount;
by year month dc;
run;

This works out alright, but as we have seen before, getting the output of proc means
into an output dataset is not a straightforward task. Lets learn about "first." and
"last."; these are temporary variables created at the back end when a by statement is
used in data step code. [Keep in mind that the "by" statement can be used only after
sorting your data.] Lets create the data that we'll be using to learn this:
data example;
input grps section $ score;
datalines;
1 a 10
1 a 20
1 b 30
1 b 40
2 a 50
2 a 60
2 b 0
2 b -10
;
run;

The dataset that we have created is already sorted, hence we can simply use the "by"
statement without sorting it first. When we use a "by" statement, SAS creates the
temporary variables "first.variable" and "last.variable", which take the values "1"
and "0" for each observation depending on the groups formed by the variables used in
the "by" statement. Lets look at the example given below to understand this better:
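data example;
set example;
by grps;
/* copy the temporary variables so that we can see them in the dataset */
first_grps=first.grps;
last_grps=last.grps;
run;

data example1;
set example;
by grps section;
first_section=first.section;
last_section=last.section;
run;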

In the first program we used "by grps". The variable "grps" creates two groups in
the data, one for the value "1" and another for the value "2". The variable
"first.grps" takes the value "1" for the first observation in each group and "0" for
the others; "last.grps" takes the value "1" for the last observation in each group
and "0" for the others.
In the second program we used "by grps section". This makes more groups in the
data, and first. and last. take the values "1" and "0" accordingly.
We don't really need to store these first. and last. variables in order to use them;
in the programs above we stored them just for demonstration. Lets use them to solve
a problem similar to the one we solved for the bank_transactions data. Lets get the
top score for each section.
proc sort data=example;
by grps section descending score;
run;

data top_example;
set example;
by grps section;
/* scores are sorted in descending order, so the first row is the top score */
if first.section;
run;

Get the total score for each section:

data total_scores(drop = score);
set example;
by grps section;
/* reset at the start of each section, then accumulate */
if first.section then total_score=0;
total_score+score;
if last.section then output;
run;

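In a similar fashion, we can solve the original problems that we solved for the
dataset bank_transactions.

proc sort data=dp.bank_transactions;
by year month dc descending amount;
run;

/* highest transaction: first row of each group after the descending sort */
data bt_summary(drop=day);
set dp.bank_transactions;
by year month dc;
if first.dc then output;
run;

/* total transaction: total_amount is a name chosen here for the running sum */
data bt_summary_total(drop= amount day category);
set dp.bank_transactions;
by year month dc;
if first.dc then total_amount=0;
total_amount+amount;
if last.dc then output;
run;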

Numeric Functions
Before we start to learn about SAS functions, lets learn a way to "not" create a
dataset every time we just want to see what a function does. A handy way is to name
the outgoing dataset "_null_"; this tells SAS not to create any dataset in the data
step program. But we do need something which will show us the result of the
function that we just used. The "put" statement comes to the rescue. The put statement
prints whatever we ask it to in the log. Remember: not in the result window but in the
log window. Lets look at a few numeric functions available in the SAS system:
data _null_;
put x;
put y;
put z;
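
As a small illustration of this pattern, here is a sketch that uses three of the numeric functions named in this section. The particular functions and input values are chosen only for this example.

```sas
data _null_;
  x = sqrt(16);          /* square root: 4 */
  y = exp(0);            /* e raised to 0: 1 */
  z = mean(10, 20, 60);  /* mean of its arguments: 30 */
  /* "put x=;" writes the name along with the value, e.g. x=4 */
  put x=;
  put y=;
  put z=;
run;
```

Nothing appears in the result window; the three values are written to the log.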

There are several such numeric functions. A longer list can be found in the SAS
documentation; a quick list that comes to mind is this: log, exp, sqrt, mean, median,
sum, n, nmiss. These functions do what their names suggest. That is not really an
exhaustive list either. In fact you can find almost all the mathematical formulas that
you use in the SAS function list if you look in the documentation. We'll not be going
through all of them.
One important thing, however, is to understand that data processing in SAS happens
row by row, not column by column. Lets create a data set and understand how these
functions work row by row, not column by column.
data func;
input x y z;
datalines;
10 20 30
1 2 3
5.4 6.7 9.33
100 200 0
;
run;

Now lets apply some numerical functions and see what they do.
data func;
set func;
s1=sum(x);
s2=sum(x,y,z);
run;

You will notice that the variable "s1" above does not contain the sum of the entire
column x. In fact it contains exactly the same values as x. Why? Because
these functions work only on rows, not on columns. In any one row there is
only one value of x to be summed, and the result is just x.
On the other hand, "s2" is the sum of the values of x, y and z in the same row.
Note: you must be wondering why we need a function for sum when we can use
the algebraic sign "+" for the same purpose. Well, there is a small difference. When
the function sum encounters a missing value while performing addition, it ignores it,
whereas if that happens while using the "+" operator, you'll get a missing value as
the result. Lets see an example:
data _null_;
put x;
put y;
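
The difference can be sketched as below. The values of a and b are made up for this illustration.

```sas
data _null_;
  a = 10;
  b = .;           /* a missing value */
  x = sum(a, b);   /* sum ignores the missing value: 10 */
  y = a + b;       /* "+" propagates the missing value: . */
  put x=;
  put y=;
run;
```

SAS also writes a note to the log saying that missing values were generated as a result of the "+" operation.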

String Functions
We saw that most of the numeric functions are simply named after their mathematical
counterparts. These names readily make sense and tell us what the functions are for.
The same is not the case for string functions, i.e. functions which are used to
process character variables. We'll talk about a few important character functions in
detail with examples.
The scan function takes a string as its first input. Imagine a scenario where this
input string is an address whose elements, such as home number, street, city etc.,
are separated by "/". The third input to scan is this "delimiter" which separates the
different elements of the string. The second input is the position of the element
which you want to extract from the string. For example, we have this address:
"1502/Panch Mahal/Malad/Mumbai"
And we want to extract the suburb name from this address, which is the third element
if we consider "/" to be the delimiter in the string. Lets see:
data _null_;
address="1502/Panch Mahal/Malad/Mumbai";
suburb=scan(address,3,"/");
put suburb;
run;
Explore Yourself:
Can we use multiple delimiters with scan?

The function substr can be used to extract a substring from a larger string if we
know the starting position and the length of the said substring in the larger input
string. The second argument is the starting position and the third is the number of
characters to extract. Keep in mind that counting starts with one, not zero as seen in
other programming languages. Here are a few examples:
data _null_;
put port;

data _null_;
put port;
put port1;
Explore Yourself:
What happens if we leave out the third argument (the length) in the function substr?
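
A minimal sketch of substr is given below. The input string is made up for this illustration; only the variable name port follows the examples above.

```sas
data _null_;
  /* start at position 8 and take 4 characters: Port */
  port = substr("Mumbai Port Trust", 8, 4);
  put port=;
run;
```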

trim , strip , || ,catx,compress
The functions named above and the operator || are used to remove white spaces
[trim, strip, compress] from input strings in various ways and to combine strings
[||, catx]. We'll learn through some examples:
data _null_;
put z;
put m;

You can see that operator || [this is double pipe symbol] simply combines strings.
Lets look at white space removing functions and peculiarities associated with them.
data _null_;
x=trim(" Lalit ");
y=trim(" Sachan ");
put x_l;
put y_l;
put z;

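You can see that in the above example none of the spaces get removed. This is a
peculiar behavior of the function trim. It removes only trailing spaces, and SAS
character variables have a fixed length, so when the trimmed value is stored in a
variable it gets padded with blanks again. The function works only if you use it
directly in the expression where the value is needed: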
data _null_;
x=" Lalit ";
y=" Sachan ";
put z;

Now lets look at how strip behaves. We are using the length function to check whether
the trim/strip functions are working, in addition to printing them in the log using "put".
data _null_;
x=strip(" Lalit ");
y=strip(" Sachan ");
put z;

As opposed to the trim function, in the above example strip does remove the leading
spaces. Lets see how it behaves when used directly during new variable creation.
data _null_;
x=" Lalit ";
y=" Sachan ";
put z;
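
The behavior of trim and strip when used directly in an expression can be compared side by side with a sketch like the one below. The variable names z_plain, z_trim and z_strip, and the square brackets added to make the spaces visible, are choices made only for this illustration.

```sas
data _null_;
  x = "  Lalit  ";
  y = "  Sachan  ";
  z_plain = "[" || x || y || "]";                /* nothing removed */
  z_trim  = "[" || trim(x) || trim(y) || "]";    /* trailing spaces removed */
  z_strip = "[" || strip(x) || strip(y) || "]";  /* leading and trailing spaces removed */
  put z_plain=;
  put z_trim=;
  put z_strip=;
run;
```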

In this case it removes all the leading and trailing spaces [but not the ones in
between].
The compress function removes all spaces from the string, including the ones which
are in between.
data _null_;
x=" Lalit Sachan ";
z=compress(x);
put z;
run;

The catx function concatenates strings after removing leading and trailing spaces
from them. The first argument, however, is the delimiter which will be used while
combining the strings. If any of the strings to be combined is simply white space,
it is ignored. Here is an example to make this clearer. Notice how the white space
is simply ignored while creating y. In both cases "$" has been used as the delimiter.
data _null_;
x=catx("$"," 45 "," ytfy ","asdf ");
y=catx("$"," xd ", " ","dr ");
put x;
put y;
run;
Explore Yourself:
Find out what functions "upcase" and "lowcase" do? Come up with a
functioning example.

The find function is used to find the starting position of a smaller substring in a
larger input string. Remember that counting starts with one from the beginning of the
string. The first argument is the larger string in which we want to search. The
second argument is the string which we are looking for. The third argument is the
position in the larger string at which the search should start. If that number is
positive, the search is done from left to right; if it is negative, the search starts
at that position and is done from right to left. In both cases the returned value is
the starting position of the smaller string counted from the beginning of the larger
string.
If the third argument is left out, the search starts at the beginning of the string
and is done left to right. Also note that if there are multiple occurrences of the
smaller string, the function returns the starting position of the occurrence which is
encountered first, given the starting position and direction of the search.
Given below are a few examples:
data _null_;
put z;
put m;
put k;
put a;
put b;
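
The effect of the starting position and the direction of the search can be sketched as below. The input string and the variable names are made up for this illustration; "el" occurs in it at positions 6 and 17.

```sas
data _null_;
  x  = "she sells sea shells";
  p1 = find(x, "el");        /* left to right from the beginning: 6 */
  p2 = find(x, "el", 7);     /* left to right from position 7: 17 */
  p3 = find(x, "el", -20);   /* right to left from position 20: 17 */
  p4 = find(x, "el", -10);   /* right to left from position 10: 6 */
  put p1= p2= p3= p4=;
run;
```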

Search here by default is case sensitive as can be seen in the example below. "s" is
not found because the letter "S" is in caps in the larger string.
data _null_;
put y;

If you want your search to be case insensitive, you need to use the modifier "i". The
first and second arguments are meant for the string to be searched in and the string
to be searched for. Beyond that, "i" is a modifier which makes your search case
insensitive.
data _null_;
put m;
put z;
put n;
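
A minimal sketch of the case sensitive and case insensitive searches is given below. The input string is made up for this illustration.

```sas
data _null_;
  name = "Lalit Sachan";
  y = find(name, "s");        /* case sensitive: no lowercase "s", so 0 */
  m = find(name, "s", "i");   /* case insensitive: finds "S" at position 7 */
  put y= m=;
run;
```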
Explore Yourself:
What does the identifier "t" do in the function "find"?

The tranwrd function is used to replace occurrences of a substring in a larger input
string. In the example given below we are replacing all hyphens with "/". The first
argument is the string where we want to do the replacements, the second argument is
what we want to replace and the third is what we want to replace it with.
data _null_;
address="1203-Some Tower-powai/Mumbai";
proper_add=tranwrd(address,"-","/");
put proper_add;
run;
Here is an exercise. Run the code given below to create the dataset.:
data Add;
length address $40;
input address $;

Once that is done. Create a column in the dataset which contains city
names extracted from these address. Do that using whatever functions
you think are going to be appropriate for the process.
Exercise Solution:

data add(drop=a1 a2 z);

set add;

Utility Functions and Procedures

In addition to numeric and string functions, there are many more utility functions
and procedures in SAS which enable us to do tasks other than simply extracting or
transforming numeric or categorical variables.
The put function is used to apply a specific format while creating a new variable.
Remember that it cannot be used to change the format of existing variables.
data temp;


/* In the data set temp above, x is essentially a string, as can be
confirmed by looking at its type. Now we can apply a date format on
this to create another variable which contains the same values but as dates. */
data temp;
set temp;
format y mmddyy10.;
put y;
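
As a small sketch of the put function, the program below turns a numeric date into a character value using a format. The date and the variable names d and d_char are made up for this illustration.

```sas
data _null_;
  d = '15AUG2015'd;             /* a SAS date constant, stored as a number */
  d_char = put(d, mmddyy10.);   /* character value 08/15/2015 */
  put d_char=;
run;
```

The result of put is always a character value, whatever the type of its first argument.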

It often happens that a variable which is supposed to be numeric comes out as a
character variable when the data is imported, due to the presence of
some character values. We can use the input function to convert this variable into a
numeric one by applying the informat "8.". Lets see an example of doing this:
data temp;
input some $;

If you look at type of variable "some" in the data temp, it is character. Lets convert
that to numeric variable.
data temp;
set temp;
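
The conversion can be sketched as below. The dataset names temp_chars and temp_num, the variable some_num and the data values are all made up for this illustration.

```sas
data temp_chars;
  input some $;
  datalines;
12
45
abc
7
;
run;

data temp_num;
  set temp_chars;
  /* "??" suppresses the log messages for values such as "abc",
     which cannot be read as numbers and become missing */
  some_num = input(some, ?? 8.);
run;
```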

smallest , largest
The functions min and max always give the smallest and largest value; however, at
times we might need the nth largest or smallest value among many. For that we can use
the smallest or largest functions. The first argument to these functions is the value
of "n". The example given below gets the 3rd largest and 3rd smallest values from the
data respectively.
data _null_;
put x;

put y;
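
A minimal sketch of both functions is given below. The list of values is made up for this illustration.

```sas
data _null_;
  x = largest(3, 10, 40, 20, 50, 30, 60);    /* 3rd largest: 40 */
  y = smallest(3, 10, 40, 20, 50, 30, 60);   /* 3rd smallest: 30 */
  put x= y=;
run;
```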

Since by default SAS processes data row by row, there is no direct method to access
previous observations in a data step. For doing so we have to use the lag function,
which is designed to do specifically this:
data temp;
input A $ B C;
datalines;
truck 10 1
truck 20 2
truck 30 3
car 40 4
car 50 5
car 60 6
;
run;

data temp;
set temp;
D=lag(B);
run;

You can see that the new variable "D" simply takes the previous values of variable
"B". In other words, it is equivalent to column "B" with one lag. You can apply the
lag function with multiple lags too by using the function lagn. Following is an
example with lag3.
data temp;
set temp;
/* D3 is a name chosen here for the lagged variable */
D3=lag3(B);
run;

However, this gets tricky if you use the function lag inside a condition. In that case
the lag function returns only those values which it gets to see within the condition
block. Each call of lag keeps its own queue of past values, and the queue is updated
only when that call is executed. Here is an example. Try to understand this, and if it
doesn't make sense ask for a detailed explanation in the class:
proc sort data=temp;
by A;
run;

data temp;
set temp;
by A;
if first.A then D=lag(B);
else D=lag(C);
run;

The round function is used to round off numeric values. The first argument is the
value being rounded off and the second argument is the rounding unit.
data _null_;
put z;
put y;

In the above example, the second input is .001, which means x will be rounded off to
the 3rd digit after the decimal. You can think of the process like this: first x is
divided by .001, rounded off to the nearest integer and then multiplied by .001.
So x/.001 = 123455.67; rounded off to the nearest integer this becomes 123456;
this gets multiplied by .001 and becomes 123.456.
Lets take a few more examples:
data _null_;
put m;
put z;
put y;

Consider m=round(x,10). First x gets divided by 10, which becomes 12.345567; then
it gets rounded off to the nearest integer, which is 12; then it gets multiplied by
10 and becomes 120, which is the final value of m.
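
The two calculations worked out above can be checked with a sketch like this one. The value of x is the one implied by the arithmetic above (x/.001 = 123455.67).

```sas
data _null_;
  x = 123.45567;
  z = round(x, .001);   /* 123.456 */
  m = round(x, 10);     /* 120 */
  put z= m=;
run;
```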
Explore Yourself:
Do the above process for y and z also and see whether the final
values match your calculations.

Proc Rank
Proc rank is used to make bins in your data. You pick a numeric variable by
which you want to make the bins. For example, in the data set sashelp.cars
we want to make bins by the variable invoice. What happens is that the data is
sorted by the variable invoice and then, starting from the top, equal numbers of
observations are put into each bin.
proc rank data=sashelp.cars out=car_rank groups=10;
var invoice;
ranks basket;
run;

groups=10 tells proc rank that there are going to be 10 bins/groups in the data.
"ranks basket" names the variable containing the group/bin number as "basket". Bin
numbering starts with 0.
Proc transpose
This is used to convert your data from long to wide or wide to long, as discussed
before. Lets create the long data first:
data long1 ;
input famid year faminc ;
cards ;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 98 77000
;
run;

The following program uses proc transpose to convert the long format data into wide:
proc transpose data=long1 out=wide1 prefix=year_;
by famid ;
id year;
var faminc;
run;

by statement: makes rows based on how many unique values the variable specified
in the by statement has
id statement: makes columns based on how many unique values the variable specified
in the id statement has
var statement: fills the values of the variable specified in the var statement into
the resulting cells of the transposed dataset. If some cells don't have corresponding
values in the incoming dataset, they are assigned missing values, such as the cell
corresponding to year 97 and famid 3 in the above example.
The next question that might be bothering you is what happens if there is
more than one variable to be filled in. You simply get multiple rows for
each value of the variable in the "by" statement. For example, in the example given
below you get 2 rows for each famid.

data long2;
input famid year faminc spend ;
cards ;
1 96 40000 38000
1 97 40500 39000
1 98 41000 40000
2 96 45000 42000
2 97 45400 43000
2 98 45800 44000
3 96 75000 70000
3 97 76000 71000
3 98 77000 72000
;
run ;

proc transpose data=long2 out=wides ;
by famid;
id year;
var faminc spend;
run;

Proc Format
Proc format is used to create user defined formats. It does not require any input
from a dataset, and the created format can be applied to any variable in any dataset.
An example is given below. Also, it does not change the underlying values of the
variable; it only changes how they are displayed.
proc format;
value $jc 'one'='Management'
'two'='Trainees';
value Grade 0-32="F";
run;

The "value" statement is the one which essentially creates the format for you. If the
format is going to be applied to character values then the format name starts
with a "$" sign; otherwise the name starts as usual. Naming constraints for formats
are the same as for variable names. In the value statements given above we created the
format $jc. If we apply it to a categorical variable and the value is 'one', then the
displayed value will be "Management", and "Trainees" if the value is 'two'. If the
value does not match either 'one' or 'two' then it will be displayed as is.
For the numeric format Grade, if the numeric variable to which it is applied
is in the range 0-32 then "F" will be displayed; if a value does not match
any of the given ranges then a * will be displayed in its place. Lets see an example
of these formats being applied to the data set temp. To emphasize that the underlying
values don't change, I have also created a numeric variable in the same data step.

data temp;
input jobs $ marks;
datalines;
one 10
two 75
one 34
two 59
abc 79
one 49
one 56
two 90
abc 20
;
run;

data temp;
set temp;
format jobs $jc.;
format marks grade.;
run;
Proc SQL
This is the implementation of the SQL language within SAS. All of the tasks which
we'll see here can be achieved with whatever we have learned so far. SQL queries
are, however, at times easier to read and write. But do not use them with large
datasets; they might not be as fast as their data step counterparts.
You will see that SQL queries read very much like English. They are mostly used to
subset, summarize and pre-process data. There are no predictive modeling
procedures in the SQL framework.
We'll see that all SQL queries are just select statements. These select statements
have incremental capabilities, which we'll see starting with the simplest form, where
you select all the observations from the incoming dataset. All SQL queries are going
to be in a block starting with "proc sql" and closed with "quit". The result of the
selection is displayed in the result window. If we want to put the result of the
selection in a data set, we can simply add "create table table_name as" in front of
the select statement. Lets see some examples.
proc sql;
select * from sashelp.cars;
quit;

All observations from sashelp.cars are displayed in the result window.

proc sql ;
create table lalit as select * from sashelp.cars;
quit;

This time nothing is displayed; instead a table named "lalit" is created in the work
library [you can supply a libref for it to be created in some other location] with all
the observations. From here onwards we'll not use create table; whenever you want to
do that, simply add that part in front of the select statement.
If you do not want to select all columns of the data, you can restrict the selection
by mentioning the variable names separated by commas.
proc sql;
select name,nhits from sashelp.baseball;
quit;

This controls the number of variables/columns which you are selecting from the
dataset. Now what if I want to restrict the number of observations? There are many
ways to do it.
proc sql inobs=10;
select name from sashelp.baseball;
select make from sashelp.cars;
quit;

Using inobs/outobs on the proc sql statement restricts the number of incoming/outgoing
observations for all the select statements in that block. If we want to restrict the
number of obs for each select statement separately, we can do the following.
proc sql;
select name from sashelp.baseball(obs=10);
select make from sashelp.cars(obs=20);
quit;

There is also an option called outobs. Outobs specifies the number of observations
which go out. In the current example it works the same as inobs, but when you are
processing data it behaves differently.
proc sql outobs=10;
select name from sashelp.baseball;
quit;

As we saw in the data step, just restricting the number of observations is not enough.
We need some way to filter observations conditionally. We can achieve that by using
"where" with the select statement as follows:
proc sql;
select invoice,drivetrain from sashelp.cars
where origin="Asia";
quit;

We can write multiple conditions as well by combining them with the and/or operators.
proc sql;
create table temp as select invoice,origin,drivetrain,type,mpg_city
from sashelp.cars
where origin="USA" and type="Sedan" and mpg_city>15;
quit;

Remember that you don't necessarily need to select the variable on which you apply
the condition. The next requirement is to sort the data; for that we add "order
by" to our select statement.
proc sql;
select invoice,origin from sashelp.cars order by invoice;
quit;

The default sorting order is ascending. If you want to sort in descending order
then you'll have to use the keyword desc as given below:
proc sql;
select invoice,origin from sashelp.cars order by invoice desc;
quit;

You can order by multiple variables as well:

proc sql;
select origin,msrp from sashelp.cars order by origin,msrp desc ;
quit;

Next is to group observations and get aggregated/summary statistics such as mean,
std etc., which are defined for a group of values rather than for individual
observations.
proc sql ;
select origin,drivetrain,mean(msrp) as msrp_avg from sashelp.cars
group by origin,drivetrain;
quit;

Here the summary operation [such as calculating the mean in the above example] is
carried out on the groups created by "group by". Here are a few more examples, one
of which includes order by as well.
proc sql ;
select origin, std(msrp) as price_std from sashelp.cars
group by origin;
quit;

proc sql ;
select make, std(msrp) as price_var from sashelp.cars
group by make order by price_var;
quit;

Now suppose we want to put a condition on the new variable which is created
[price_var]. Lets see if a simple where condition works:
proc sql ;
select make, std(msrp) as price_var from sashelp.cars
where price_var>10000 group by make order by price_var;
quit;

The above code throws an error:

ERROR: The following columns were not found in the contributing tables: price_var.

To apply conditions on variables which are created in SQL queries we need to use
"having":
proc sql ;
select make, std(msrp) as price_var from sashelp.cars
group by make having(price_var>10000) order by price_var ;
quit;

The sequence in which you should write these is: where > group by > having > order
by. Next we'll see how to get data from multiple tables.
libname dp "/folders/myfolders/Datasets/Data Prep";

The key is to give names to the tables, which can then be used to reference each table
while extracting columns from it. We'll try to solve the following case, which
involves getting data from multiple tables.
case: datasets gaming1,2,3 contain information on customers of a gaming
company which provides online platform for playing team games such as

we want to get those customers ids which play DOTA on mac os in solo
sessions with free license type and their average time per session is
more than 40 minutes

Lets first list what information stored where:

gaming1=gamer_id, game name, atps
gaming2= gamer_id , os , license
gaming3= gamer_id, session_type, netspeed

We'll give names to the tables in the select statement itself. I have written the
following select statement on multiple lines for better readability.
proc sql;
select a.gamer_id

from dp.gaming1 as a,
dp.gaming2 as b,
dp.gaming3 as c

where
b.os="mac" and
a._game_name="dota" and

a.atps>40 and
c.session_type="solo" and
b.license="free" and

a.gamer_id=b.gamer_id and
a.gamer_id=c.gamer_id ;
quit;


The part "a.gamer_id=b.gamer_id and a.gamer_id=c.gamer_id" is a must for setting
up the correspondence between observations of the multiple tables. If you don't do
that you'll get a cross product of the observations, as shown below:
data s1;
input id a $;
datalines;
1 q
2 a
3 z
;
run;

data s2;
input id b $;
datalines;
1 p
2 l
3 m
;
run;

proc sql;
select a,b from s1,s2;
quit;

Now if we add the where condition that sets up the correspondence, we'll get the
desired result:
proc sql;
select a,b from s1,s2
where s1.id=s2.id;
quit;
Explore Yourself:
* How to join/merge tables using SQL
* What do distinct, count do when used with SQL queries

We'll conclude here. In case of any doubts regarding the content of this study
material, please post on the QA forum in the LMS.

Prepared By: Lalit Sachan
Contact: lalit.sachan@edvancer.in
