Page 1 of 19
Minsk, February 2000
Table of Contents
A brief introduction
Starting EpiInfo
Introduction to the Analysis program
Browsing the data
First look at the data
One-way frequency tables
Entering Commands and Variable names using Menus
Displaying Categorical Variables
Statistics on Continuous Variables
Displaying Continuous Variables
Saving variables and leaving EpiInfo
Analysis of Quantitative variables
Introduction
Using the Means Command
Linear Regression
Analysis of Categorical Data
Introduction
Using the Tables command
Using the Statcalc program for stratified analysis
Introduction
Linear trend in proportions
Reading a data file into Epiinfo
Creating a questionnaire (qes) File
Variable types
Creating a .rec file
Importing the data
A brief introduction
Epiinfo is a multipurpose computer package designed for use by epidemiological researchers. It
contains smaller programs for use with Survey Design (Epiaid), Questionnaire Design and Report
Writing (Eped), Data Entry (Enter), Data Checking (Check), Data Analysis (Analysis), Simple
Statistics (Statcalc), Importing and Exporting files (Import, Export). There is also a separate
package for mapping (EpiMap).
The package is made available by WHO and CDC as public domain software and can be
downloaded (free of charge) from http://www.cdc.gov/epo/epi/epiinfo.htm .
Starting EpiInfo
Epiinfo is a DOS-based program that uses pull-down menus. It is operated mainly from the
keyboard, although a mouse can be used. The cursor can be moved up and down the menus (using
the up arrow and the down arrow) to see the descriptions of the programs. Note that on a colour
display an alternative way of moving up and down is to press the highlighted letter for the
program you require.
Position the cursor bar on Analysis and read the description on the right hand side. Press ENTER
to select Analysis. The screen goes blank for a few seconds and then the Analysis screen appears.
The screen is split into two – the upper window is headed Output and the lower, smaller, window
is headed Commands. The cursor is in the Commands window against the EPI> prompt. At the
top of the screen are two lines giving the status information.
This indicates that we have not yet specified the name of the dataset to be analysed, and hints
how to do it. It also states the amount of free memory.
In order to load a dataset for use, we use the read command. For example, if the file is called
itpexamp we type
read itpexamp
The full name of an Epiinfo file ends with .rec, so the actual name of the file is
itpexamp.rec, but Epiinfo allows the extension to be omitted.
Note that you should enter the whole path as well as the file name, for example
a:\itpexamp.rec
The name of the file and the number of records appear at the top of the screen, indicating that the
file has been found and read. We also see that all records have been selected (as so far we have not
specified any criteria for selecting or rejecting records).
Browsing the data
You can browse the data by pressing F4. As you pass through the different columns, you can see
what type of variables they contain at the top of the screen.
If you press F4 while browsing, Full screen mode is selected; this shows a single record in its entirety.
Pressing F5 starts Split mode, which is a combination of both modes: browse in the top window
and Full screen in the bottom.
Note that although we entered browse by pressing F4 in Analysis, we could also have
typed browse at the prompt.
First look at the data
When starting to look at any new set of data, one of the first steps is to check that the values of
the variables are sensible and that they correspond to the codes defined in the coding schedule or
other documentation about the data. For the categorical variables, we might do one-way tables to
check that only the specified codes occur and to check for missing values; for example, in the sex
field there should only be the values 1 and 2. For continuous variables, we need to obtain
summary statistics (mean, standard deviation, minimum, maximum) and to check that these are
what we expect.
One-way frequency tables
We start by producing one-way tables for the categorical variables. At the prompt type
tables sex
This shows that there are 45 males (1’s) and 35 females (2’s), together with percentages, the total
number of records (80) and summary statistics (sum, mean and standard deviation). Ignore for
now the Student’s t-distribution.
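As a cross-check on the figures quoted above, the same one-way table can be sketched outside Epiinfo. The Python below is only an illustration, not Epiinfo output: the helper name one_way_table is hypothetical, and the sex column is rebuilt from the two counts given in the text (45 ones and 35 twos).

```python
from collections import Counter


def one_way_table(values):
    """Return ({code: (count, percent)}, total, sum of codes, mean of codes)."""
    counts = Counter(values)
    total = len(values)
    table = {code: (n, 100.0 * n / total) for code, n in sorted(counts.items())}
    return table, total, sum(values), sum(values) / total


# Rebuild the sex column from the counts quoted in the text:
# 45 records coded 1 (male) and 35 records coded 2 (female).
sex = [1] * 45 + [2] * 35
table, total, code_sum, code_mean = one_way_table(sex)

for code, (n, pct) in table.items():
    print(f"sex={code}: {n:3d}  {pct:5.2f}%")
print(f"total={total}  sum={code_sum}  mean={code_mean:.4f}")
```

Note that the mean of a coded variable such as sex only reflects the coding (1 and 2); it is useful as a check, not as a summary of the data.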
Exercise:
Repeat the tables command for the observed ages (observeage). (Note that in many cases
age would have a large number of possible values, so a frequency table might be large and
unwieldy and other commands would be used; however, here we have a small range of ages
and the table can be quite useful.)
What percentage of the children are 13 years or younger ?
Entering Commands and Variable names using Menus
We have used the tables command by typing it at the command prompt. It is also possible to enter
commands by selecting them from a list of commands; similarly, it is possible to select the
variable names from a list of variables.
Suppose, for example, that we want to construct a table of sex. F2 is the Commands key, which
brings up a list of possible commands. The tables command is in the General section; by
highlighting it and pressing ENTER, the command is ‘pasted’ into the command line. Now press
the Variables function key, F3, and a list of the variables will be shown. Highlighting sex and
pressing ENTER will paste it onto the command line, which can now be entered, giving the same
results as when we typed in the command by hand.
If you want to pick more than one variable in this way, as will be the case when we do two-way
tables, you can tag groups of variables using the plus (+) and minus (-) signs, i.e. select sex and
press + and then select observeage and press +. You will see that these two variables have
been tagged (marked) by a small sign; press ENTER and both of them will appear in the
command line. This also works for more than 2 variables.
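Displaying Categorical Variables
Categorical variables can also be displayed graphically. For a pie chart of sex, type
pie sex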
A pie chart should appear on the screen, showing the percentage of males and females. A bar
chart can be produced using the command bar
bar sex
Statistics on Continuous Variables
Exercise:
Type the following:
means height
The output is the same as for the tables command – a frequency table followed by summary
statistics. Because there are so many different values for height, the table is much longer. For
continuous variables a frequency table is not much use – except for checking for suspicious
values. The means command (unlike the tables command) allows us to suppress the
frequency table and print only the summary statistics. This is achieved as follows
means height /n
The full specification of the means command includes a grouping variable, but for now we are
dealing with all the data together. At this stage we do not want to subdivide the data into groups,
for example males and females separately. We need a way of forming one group with all the
records in it. This is done as follows:
let groupall = 1
This creates a new variable groupall which has the value 1 for every record. Thus to group the
data by groupall will cause all the records to be included in one group. If you browse the data
you can see the new variable. We can now use the means command with groupall as the
grouping variable, for example
means height groupall /n
This produces an entirely different output – no frequency table and a total of 11 statistics.
Exercise:
What are the mean and standard deviation of the WEIGHTs of the 80 children ?
What are the median and interquartile range (75th percentile – 25th percentile) of the weights of
the 80 children ?
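The summary statistics asked for here can also be computed by hand or in any general-purpose language. The sketch below uses Python's standard statistics module on a short list of hypothetical weights (they are not the itpexamp data), so it illustrates the definitions rather than answering the exercise. Packages differ in how they interpolate percentiles, so quartiles computed this way may not agree exactly with those printed by Epiinfo.

```python
import statistics

# Hypothetical weights in kg, for illustration only (not the itpexamp data).
weights = [31, 35, 38, 40, 42, 45, 47, 50, 53, 58]

mean = statistics.mean(weights)
sd = statistics.stdev(weights)      # sample standard deviation (n - 1 denominator)
median = statistics.median(weights)

# quantiles(n=4) returns the three quartile cut points (25th, 50th, 75th percentiles),
# using Python's default 'exclusive' interpolation method.
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1                       # interquartile range

print(f"mean={mean:.2f}  sd={sd:.2f}")
print(f"median={median}  25th={q1}  75th={q3}  IQR={iqr}")
```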
Displaying Continuous Variables
Neither bar charts nor pie charts are sensible ways of displaying continuous variables with a lot of
different values. Try one of the commands on height and you will see that the result is not very
useful.
Bar charts have individual separated bars and are used to display categorical variables for which
the order of the categories is irrelevant. Histograms are used for continuous variables.
Usually a continuous variable is grouped before the histogram is drawn. However, if the variable
has a relatively small number of distinct values a histogram can sometimes give a good
representation of the distribution.
histogram observeage
Exercise:
histogram height
If there are a lot of different values, then the resulting histogram can be less useful. We might
want to group the variable. To group the height variable we need to create a new variable,
which we will call htgp, with a grouping interval of 10cm. To form the groups we use
the let statement to divide height by 10 and assign the result to htgp. Because we want the
new variable to have integer values rather than exact values with decimal places, we use the div
operator (this is the way that Epiinfo does integer division – the traditional / will give the exact
answer):
let htgp = height div 10
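The difference between integer division and ordinary division is easy to see on a few numbers. The Python sketch below is an analogy only: Python's // operator plays the role of div for positive values, and the heights are hypothetical, not taken from itpexamp.

```python
from collections import Counter

# Hypothetical heights in cm, for illustration only (not the itpexamp data).
heights = [118, 124, 131, 135, 139, 142, 147, 150, 156, 163]

# Integer division by 10 keeps only the whole number of tens: 118 -> 11, 124 -> 12, ...
htgp = [h // 10 for h in heights]

# Ordinary division keeps the decimal part, so every value stays distinct.
exact = [h / 10 for h in heights]

# Frequency distribution of the grouped variable, labelled with each 10 cm interval.
group_counts = Counter(htgp)
for group in sorted(group_counts):
    lower = group * 10
    print(f"{lower}-{lower + 9} cm: {group_counts[group]}")
```

Each value of the grouped variable stands for a 10cm interval, so group 13 covers heights from 130 to 139.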
Before looking at the histogram, see the effect of the let statement by getting a frequency
distribution for htgp
tables htgp
You will see that 8 height groups have been created. Now type
histogram htgp
Exercise:
Are the heights Normally distributed ?
Repeat this process for weights. Create a new variable called wtgp again using the let
command and the div operator, choosing a sensible grouping interval.
Saving variables and leaving EpiInfo
If you have created new variables, you might want to save them for use the next time you use
EpiInfo. You could rewrite the original data file, but it is recommended that you save to a new
file. To do this you first need to route the output, designating a file to which the new
dataset (including both the old and new variables) will be saved. If we wanted to save our new
dataset to a file called itpnew.rec we would type
route itpnew.rec
Again, it is important that you put in the full path for the file, e.g. a:\itpnew.rec
Then, to write the data to that file, type
write recfile
To leave EpiInfo, press F10 to leave Analysis and return to the main EpiInfo menu, and then
press F10 (or select Quit) to leave EpiInfo.
Analysis of Quantitative variables
Introduction
Here, we are going to use Epiinfo to analyse data in the form of continuous variables, i.e.
quantitative variables measured on a continuous scale. We shall use the means command to
compare continuous variables classified by categorical variables; we shall also see how two
continuous variables can be compared using scatter and regress.
Using the Means Command
One of the advantages of using statistical packages is that it is easy to examine the data visually
before proceeding to formal statistical analysis. This is one way of checking whether the
assumptions made in the analysis are reasonable. We can examine a scatter plot of the data.
Exercise
Execute the above command. Note that the first variable is put on the x-axis and the second on
the y-axis.
Mean height for sex = 1
Mean height for sex = 2
These graphs are not really suitable for presenting the data, since it is difficult to discern the
distribution of height where the points are crowded together. An alternative graphical
presentation to illustrate the variation in height according to sex is to use histograms.
Exercise:
Type the following:
This produces a histogram of all the values of height. Epiinfo allows us to use subgroups of
the data with the select command.
Type:
select sex=1
histogram htgp
This shows a histogram of height for males only. Note that the result of the select command is
shown at the top left of the screen as Criteria: sex=1
Exercise:
Plot a histogram for the heights of females. Is there any difference ?
Recall that we used the means command to derive summary statistics for a variable. Remind
yourself of the reason for having to create a new variable with a single value to get the means
output for a single variable
let groupall = 1
means weight groupall /n
Make sure that you understand the output. Now calculate the statistics for each sex separately,
using sex as the grouping variable in place of groupall.
The first part is the same summary as you have already seen, but subdivided by each level of sex.
Exercise
Do these results compare with what you guessed when you looked at the scatter plot ?
Linear Regression
If we are examining the relationship between two continuous variables, such as height and
weight, we might start by drawing a scatter diagram before proceeding to formal statistical tests.
Exercise :
Does there appear to be a straight line relationship between the two variables ? If so, guess the
best straight line. Now estimate the slope of the best straight line as follows:
Pick two points, A and B, towards the ends of your line (A at the bottom, B at the top). Write
down the values of height and weight for each point.
At point A: height = ______ weight = ______
At point B: height = ______ weight = ______
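The slope of a hand-drawn line is the change in height divided by the change in weight between the two points, and the intercept follows from either point. The sketch below shows that arithmetic in Python; the function name line_through and the two readings are hypothetical, chosen only to illustrate the calculation.

```python
def line_through(point_a, point_b):
    """Return (slope, intercept) of the line height = intercept + slope * weight
    passing through two (weight, height) points."""
    (w_a, h_a), (w_b, h_b) = point_a, point_b
    if w_a == w_b:
        raise ValueError("the two points need different weights")
    slope = (h_b - h_a) / (w_b - w_a)   # rise in height per unit of weight
    intercept = h_a - slope * w_a       # height where the line crosses weight = 0
    return slope, intercept


# Hypothetical readings from a hand-drawn line: (weight, height).
point_a = (30.0, 130.0)
point_b = (60.0, 166.0)
slope, intercept = line_through(point_a, point_b)
print(f"height = {intercept:.1f} + {slope:.2f} x weight")
```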
We can use Epiinfo to perform linear regression using the regress command. To use this to
perform a linear regression of height on weight type:
Note that the regress command requires the dependent (response) variable, which goes on the
y-axis, first and then the independent (explanatory) variable, which goes on the x-axis. We are
given the correlation, together with 95% confidence limits.
Exercise:
What does this correlation tell you about the relationship between height and weight ?
The program then gives us output which tests the null hypothesis that the slope of the line is equal
to zero (i.e. no relationship between the two variables). The next part of the output is the
estimated regression coefficients. These are estimates of the parameters α and β in the formula:
height = α + β × weight
The estimate of the parameter β is labeled as the coefficient for variable weight and is the
slope of the line. The estimate of the parameter α is labeled as the Y-intercept, i.e. the value of
height when weight=0.
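To see where the slope, the Y-intercept and the correlation come from, the least-squares formulas can be written out directly. The Python below is a minimal sketch under stated assumptions: the function name simple_regression and the data are hypothetical (not the itpexamp values), and it does not reproduce the confidence limits or the significance test that Epiinfo prints.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.
    Returns (intercept, slope, r), where r is the Pearson correlation."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take more than one value")
    slope = sxy / sxx                       # coefficient for the explanatory variable
    intercept = mean_y - slope * mean_x     # Y-intercept
    r = sxy / math.sqrt(sxx * syy)          # correlation coefficient
    return intercept, slope, r


# Hypothetical data, for illustration only.
weight = [28, 32, 35, 40, 44, 50, 55]
height = [128, 133, 138, 142, 150, 155, 162]
alpha, beta, r = simple_regression(weight, height)
print(f"height = {alpha:.2f} + {beta:.3f} x weight   (r = {r:.3f})")
```

The fitted line always passes through the point (mean weight, mean height), which is a quick check on any hand calculation.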
Exercise:
What is the equation of the fitted line ? How does it compare to the estimate that you previously
calculated ?
Using the equation of the Epiinfo fitted line, calculate the following
the predicted value of height for weight = 120
the predicted value of height for weight = 170
How do these values compare with what you would have got using your original equation ?
Note the value of the y-intercept, which applies to the absurd case that weight=0. This apparently
ridiculous result arises because the relationship between height and weight is not linear over the
entire range of the data, although it does look to be a reasonable approximation over the range we
are examining. One of the reasons for checking the data graphically is to check whether the
relationship might be linear, or whether a curve might be a better description.
Epiinfo will plot the regression line on the graph for you.
Analysis of Categorical Data
Introduction
In this session, we aim to use Epiinfo for the analysis of categorical data. In particular we will
construct and interpret two-way tables of categorical data, test the association between 2
categorical variables using the chi-squared test, analyse the association between two binary
variables in the presence of one or more confounding variables and test for a linear trend in
proportions.
Using the Tables command
Since these are now two categorical variables, we can look at their association using a two-way
table, using the tables command.
Exercise:
The program prints out the required 2x2 table. Examine the table and decide which part of the
output indicates whether there is an association between the two variables.
It would be easier to see the measure of association if the table had percentages on it. We can
request these using the set command
set percents=on
Now repeat the tables command.
This time the table appears with row and column percentages in the cells. The row
percentages are printed first, with an arrow beside them pointing to the denominator on the right.
The column percentages are printed underneath the row percentages. However, the table has a
rather muddled appearance and you have to look at it quite carefully to see what the percentages
and cell counts are. It would be better if there were more space between the four cells, or if there
were lines between them. This can be achieved using the lines option of the set command.
set lines=on
We now concentrate on the statistics provided with the table. We are given, with confidence
limits, the odds ratio and relative risk, together with chi-squared statistics with and without
continuity correction (“Yates corrected”). There is also a “Mantel-Haenszel” chi-squared
statistic, which is really for stratified analyses and can be ignored for the time being.
Exercise:
Check that you can calculate
(i) the odds ratio
(ii) the relative risk
(iii) the chi-square statistics, with and without continuity correction
What is the response variable ?
Which is the explanatory variable ?
What do you conclude about the association between the two variables ?
Note that Epiinfo assumes that the response variable is the column variable in the table, the one
listed second in the tables command.
Using the Statcalc program for stratified analysis
Introduction
Often when we are investigating the relationship between two variables, we want to take into
account the effect of other variables that have associations with both the response and explanatory
variables we are interested in. Epiinfo can be used to allow for such confounding variables, using
the tables command and also a separate module called statcalc.
To illustrate these methods, we use another dataset, on the use of bed nets and the presence of
enlarged spleens in two villages in Africa.
                       Village A                      Village B
                    Spleen enlarged                Spleen enlarged
                yes         no    Total        yes         no    Total
With nets       12 (50%)    12     24          15 (22%)    52     67
Without nets    42 (59%)    29     71           4 (25%)    12     16
Total           54 (57%)    41     95          19 (23%)    64     83

                 Both villages combined
                    Spleen enlarged
                yes         no    Total
With nets       27 (30%)    64     91
Without nets    46 (53%)    41     87
Total           73 (41%)   105    178
A stratified analysis is necessary here, because village is a confounding factor – being related
both to the response variable (enlarged spleen) and the explanatory variable (bednet use).
We can conduct this analysis using the statcalc module of Epiinfo. We start this module from
the Epiinfo menu (after exiting Analysis by pressing the F10 key).
You are given the choice of three options – choose the first option, Tables (2x2, 2xn). You are then
faced with the traditional, “Exposure by Disease” table:
                  Disease
                  +     -
Exposure    +
            -
Note that, once again, the disease (response variable) must be the column variable and exposure
the row variable.
Exercise:
We now have to enter the data for the two villages combined, entering cell counts only and not
totals, as follows
Type 27 and press ENTER. Notice that the cursor automatically goes to the next cell
Type 64 and press ENTER
Type 46 and press ENTER
Type 41 and press ENTER
If you have entered the four cell counts correctly, press F4 to request the analysis of the table. A
set of statistics, similar to those produced from the tables command used earlier is given.
Exercise:
What are the values of the
(i) Relative risk ?
(ii) Yates corrected chi-squared test?
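The statistics Statcalc reports for a 2x2 table can be checked against the textbook formulas. The Python below is a sketch, not Statcalc's own code: the function name two_by_two is hypothetical, it uses the combined-villages counts from the table above (27, 64, 46, 41), and small differences from the screen output may arise from rounding or from the exact variant of each statistic that Statcalc prints.

```python
def two_by_two(a, b, c, d):
    """Statistics for a 2x2 table laid out as

                     disease +   disease -
        exposure +       a           b
        exposure -       c           d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d          # exposed and unexposed totals
    col1, col2 = a + c, b + d          # diseased and non-diseased totals
    denom = row1 * row2 * col1 * col2
    diff = abs(a * d - b * c)
    return {
        "odds_ratio": (a * d) / (b * c),
        "relative_risk": (a / row1) / (c / row2),
        # Pearson chi-squared, 1 degree of freedom, no continuity correction.
        "chi2": n * diff ** 2 / denom,
        # Yates continuity correction: shrink |ad - bc| by n/2 before squaring.
        "chi2_yates": n * max(diff - n / 2, 0) ** 2 / denom,
    }


# Both villages combined: with nets 27 yes / 64 no, without nets 46 yes / 41 no.
combined = two_by_two(27, 64, 46, 41)
for name, value in combined.items():
    print(f"{name}: {value:.3f}")
```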
Now we have to enter the data separately for the two villages. Press the ENTER key twice to
return to the blank table.
Exercise:
Enter the data for the first table (village A), press F4 to get the analysis. For village A,
What are the values of the
(iii) Relative risk ?
(iv) Yates corrected chi-squared test?
To enter the data for village B, press the F2 key and proceed as before. For village B,
What are the values of the
(v) Relative risk ?
(vi) Yates corrected chi-squared test?
(ix) The Mantel-Haenszel summary chi-square ?
What are your conclusions about the relationship between bednet use and presence of an
enlarged spleen ?
What was the effect of controlling for the confounding variable, village ?
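The effect of stratifying by village can be illustrated with the usual Mantel-Haenszel formulas applied to the two village tables above. This is a sketch under stated assumptions: the function name mantel_haenszel is hypothetical, the summary chi-squared is computed without a continuity correction, and Statcalc may display a slightly different variant of each statistic.

```python
def mantel_haenszel(strata):
    """Mantel-Haenszel summary statistics over 2x2 tables given as (a, b, c, d):
    a, b = exposed row (disease yes, no); c, d = unexposed row (disease yes, no)."""
    or_num = or_den = rr_num = rr_den = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        or_num += a * d / n
        or_den += b * c / n
        rr_num += a * (c + d) / n
        rr_den += c * (a + b) / n
        observed += a
        # Expected value and variance of cell a if exposure and disease are unrelated.
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return {
        "odds_ratio": or_num / or_den,
        "relative_risk": rr_num / rr_den,
        "chi2": (observed - expected) ** 2 / variance,  # no continuity correction
    }


# Cell counts from the village tables: (with nets yes, with nets no,
# without nets yes, without nets no), where yes/no refers to enlarged spleen.
village_a = (12, 12, 42, 29)
village_b = (15, 52, 4, 12)
summary = mantel_haenszel([village_a, village_b])
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Comparing these summary values with the crude values from the combined table shows how much of the apparent association is due to the difference between the villages.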
Linear trend in proportions
Note that there is an increasing trend in the proportion of women with early age at menarche as
skinfold thickness increases.
Exercise:
The usual 2x3 chi-square test can be carried out within Statcalc by selecting the (2x2, 2xn) option,
as before. The table has 2 rows and 3 columns, but Epiinfo requires the data to be entered as 2
columns and 3 rows. Starting with the blank table, enter the data (cell counts only, no totals) for
the ‘Small’ group (from column 1 of the table), remembering to press ENTER after each number
has been entered. Now enter the cell counts for the ‘Intermediate’ group (column 2 from the table
above). Now just continue typing to enter a third row of numbers for the ‘Large’ group. After the
third row, press F4 to get the analysis. Statistics for the table are displayed.
Now we will carry out a test for trend in proportions. To do this we have to assign a numerical
score to each column (the skinfold group from the table), namely 1, 2 and 3.
In order to perform a chi-squared test for trend, return to the Statcalc menu by pressing F10.
Notice that the third option is chi-squared test for trend.
controls refers to the number of women with age at menarche 12+ years;
Press ENTER after each entry. After entering the last number, press the F4 key to calculate the
statistics.
Exercise:
(i) What is the value of the chi-squared test for trend ?
(ii) How many degrees of freedom are there ?
(iii) What is the P-value ?
Note that the odds ratios are given, using the ‘Small’ category as the baseline. This confirms the
initial observation that there is an increasing proportion of women with early age at menarche.
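The calculation behind a chi-squared test for trend can be sketched as follows. This uses the Mantel extension form of the test (variance with N - 1 in the denominator), which is an assumption about the variant Statcalc applies; the function names and the three-group counts are hypothetical and are not the menarche data.

```python
import math


def chi2_trend(cases, controls, scores):
    """Chi-squared test for linear trend in proportions (1 degree of freedom),
    using the Mantel extension variance. One entry per exposure group."""
    totals = [a + b for a, b in zip(cases, controls)]
    n_all = sum(totals)
    n_cases = sum(cases)
    sum_ax = sum(a * x for a, x in zip(cases, scores))
    sum_nx = sum(n * x for n, x in zip(totals, scores))
    sum_nx2 = sum(n * x * x for n, x in zip(totals, scores))
    # Observed minus expected score total among the cases.
    t = sum_ax - n_cases * sum_nx / n_all
    var = (n_cases * (n_all - n_cases) * (n_all * sum_nx2 - sum_nx ** 2)
           / (n_all ** 2 * (n_all - 1)))
    return t * t / var


def p_value_1df(chi2):
    """Upper tail probability of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for three ordered groups scored 1, 2 and 3.
cases = [10, 18, 27]
controls = [40, 32, 23]
stat = chi2_trend(cases, controls, scores=[1, 2, 3])
print(f"chi-squared for trend = {stat:.2f}, P = {p_value_1df(stat):.4f}")
```

Whatever the number of groups, the trend test has a single degree of freedom, because it tests one quantity: the slope of the proportions across the scores.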
Exercise:
Perform a similar analysis on the following data, which describes 128 children aged under 12
years who were followed up during the malaria season to record which of them experienced
clinical attacks of malaria. The results by age group were:
(i) Calculate the percentage of children in each age group who contracted malaria
(ii) Conduct a significance test to assess if there is any evidence of age-related variation
in malarial morbidity
(iii) Carry out a test for trend in proportions
(iv) What are your conclusions ?
Reading a data file into Epiinfo
Creating a questionnaire (qes) File
The layout of the file consists of names for all the variables in your data file and some
information to define them. An example of a file layout is given in itpexamp.qes
PERSONCOD ##
SEX #
OBSERVEAGE ##
THYROIDILLIS #
THYRMEDICAMIS #
THYRFAMILYIS #
IODSALTIS #
SEAPRODUCTIS #
OBSERVEGORMON #
OBSERVEIOD #
OBSERVEVITAMINE #
HEIGHT ###
WEIGHT ##
RIGHTSHIRINA #.##
RIGHTTOLSHINA #.##
RIGHTDLINA #.##
LEFTSHIRINA #.##
LEFTTOLSHINA #.##
LEFTDLIN #.##
IODURINE #.##
Next to the variable names are special characters that define the type and length of the variable.
The first variable in this file is the person code (PERSONCOD), which is a number that can take
up to two digits, i.e. it can be from 0 to 99. The ## characters imply that the variable is of
numeric type and is made up of 2 digits.
Variable types
There are four basic variable types allowed within EpiInfo. Once the variable type is defined,
EpiInfo will only allow data of that type to be entered for that variable. The variable types are as
follows:
(i) Numeric variables are defined using the # character. The number of characters defines
the length, so that #### defines a variable with 4 digits. A decimal point can also be
included so that ###.## can hold numbers between –99.99 and 999.99
(ii) Text variables are defined using the _ (underline/underscore) character, with the number
of characters defining the length. Alternatively, a text variable can be defined as upper
case, using the ‘A’ character between less-than and greater-than signs, e.g. <A>.
(iii) Date variables are defined as a date format between less-than and greater-than signs, e.g.
<DD/MM/YY> defines a variable in the form day, month, year.
(iv) Logical or yes/no variables are defined as <Y>.
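The rules above can be summarised as a small classifier. The Python below is only an illustration of the four patterns listed here; it is not part of Epiinfo, the function name describe_field is hypothetical, and it recognises just the forms shown in this handout (real .qes files may use others).

```python
import re


def describe_field(spec):
    """Classify a field specification as (kind, width, decimals).
    Covers only the patterns described in the text: #, _, <A>, <Y> and date formats."""
    if re.fullmatch(r"#+(\.#+)?", spec):
        decimals = len(spec.partition(".")[2])   # number of # after the decimal point
        return ("numeric", len(spec), decimals)
    if re.fullmatch(r"_+", spec):
        return ("text", len(spec), 0)
    if spec == "<A>":
        return ("upper-case text", 1, 0)
    if spec == "<Y>":
        return ("logical", 1, 0)
    if re.fullmatch(r"<[DMY/]+>", spec):
        return ("date", len(spec) - 2, 0)        # width excludes the < and > signs
    raise ValueError(f"unrecognised field specification: {spec!r}")


# A few lines taken from the itpexamp.qes layout shown earlier.
layout = """PERSONCOD ##
SEX #
HEIGHT ###
IODURINE #.##"""

fields = {}
for line in layout.splitlines():
    name, spec = line.rsplit(None, 1)            # variable name, then its specification
    fields[name] = describe_field(spec)

for name, (kind, width, decimals) in fields.items():
    print(f"{name}: {kind}, width {width}, {decimals} decimal places")
```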
Creating a .rec file
Open the Enter program from the menu. You are asked for the name of a .rec file, but since it
doesn’t exist yet you must put in the name you want to give it. You now want to create new data
from a .qes file, which is choice 2; you will then be asked to enter the filename of your .qes file.
You will now see the questionnaire (file layout) that you have created, with spaces highlighted for
each of the variables. We need go no further within Enter as we have now created a .rec file into
which we can load our pre-prepared data.
Importing the data
In the first space enter the name of the .rec file that contains the file layout. The second filename
required is the name of the file that contains the data, for example itpexamp.csv, which is
comma separated. In the case of comma separated data, choose the ‘delim’ option for delimited
(or separated by something). The other option, ‘fixed’, is for when the data is in a very rigid
structure, which is unlikely to be the case if the data has been extracted from a common
spreadsheet or statistical package, such as Excel or S-Plus.
The data will now have been loaded into the .rec file and is ready for use in Analysis.