Copyright
Software copyright 1991-2014 by Frontline Systems, Inc.
User Guide copyright 2014 by Frontline Systems, Inc.
Analytic Solver Platform: Portions copyright 1989 by Optimal Methods, Inc.; portions copyright 2002 by Masakazu
Muramatsu. LP/QP Solver: Portions copyright 2000-2010 by International Business Machines Corp. and others.
Neither the Software nor this User Guide may be copied, photocopied, reproduced, translated, or reduced to any
electronic medium or machine-readable form without the express written consent of Frontline Systems, Inc.,
except as permitted by the Software License agreement below.
Trademarks
Analytic Solver Platform, Risk Solver Platform, Premium Solver Platform, Premium Solver Pro, Risk Solver
Pro, Risk Solver Engine, Solver SDK Platform and Solver SDK Pro are trademarks of Frontline Systems, Inc.
Windows and Excel are trademarks of Microsoft Corp. Gurobi is a trademark of Gurobi Optimization, Inc.
KNITRO is a trademark of Ziena Optimization, Inc. MOSEK is a trademark of MOSEK ApS. OptQuest is a
trademark of OptTek Systems, Inc. XpressMP is a trademark of FICO, Inc.
Acknowledgements
Thanks to Dan Fylstra and the Frontline Systems development team for a 20-year cumulative effort to build the
best possible optimization and simulation software for Microsoft Excel. Thanks to Frontline's customers who
have built many thousands of successful applications, and have given us many suggestions for improvements.
Analytic Solver Platform and Risk Solver Platform have benefited from reviews, critiques, and suggestions from
several risk analysis experts:
Sam Savage (Stanford Univ. and AnalyCorp Inc.) for Probability Management concepts including SIPs,
SLURPs, DISTs, and Certified Distributions.
Sam Sugiyama (EC Risk USA & Europe LLC) for evaluation of advanced distributions, correlations, and
alternate parameters for continuous distributions.
Savvakis C. Savvides for global bounds, censor bounds, base case values, the Normal Skewed distribution
and new risk measures.
How to Order
Contact Frontline Systems, Inc., P.O. Box 4288, Incline Village, NV 89450.
Tel (775) 831-0300 Fax (775) 831-0314 Email info@solver.com Web http://www.solver.com
Table of Contents
Start Here: Data Mining Essentials in V2014 ................................................... 14
XLMiner Overview ............................................................................................ 19
Using Help, Licensing and Product Subsets .................................................... 33
Introduction ...................................................................................................................... 33
Working with Licenses in V2014 ..................................................................................... 33
Using the License File Solver.lic ........................................................................ 33
License Codes and Internet Activation ............................................................... 33
Running Subset Products in V2014 .................................................................................. 34
Using the Welcome Screen ............................................................................................... 36
Using the XLMiner Help Text .......................................................................................... 36
Introduction to XLMiner ................................................................................... 39
Introduction ...................................................................................................................... 39
Ribbon Overview .............................................................................................................. 39
XLMiner Help Ribbon Icon .............................................................................................. 40
Change Product .................................................................................................. 40
License Code ...................................................................................................... 41
Examples ............................................................................................................ 43
Help Text ............................................................................................................ 45
Check for Updates .............................................................................................. 46
About XLMiner .................................................................................................. 46
Common Dialog Options .................................................................................................. 47
Frontline Solvers V2014
3
Worksheet ........................................................................................................... 47
Data Range ......................................................................................................... 47
# Rows, # Columns ............................................................................................ 47
First row contains headers .................................................................................. 48
Variables in the data source ................................................................................ 48
Input variables .................................................................................................... 48
Help .................................................................................................................... 48
Reset ................................................................................................................... 48
OK ...................................................................................................................... 48
Cancel ................................................................................................................. 48
Help Window ..................................................................................................... 49
References ........................................................................................................................ 49
Introduction ...................................................................................................................... 50
Sampling from a Worksheet ............................................................................................. 51
Example: Sampling from a Worksheet using Simple Random Sampling ......... 51
Example: Sampling from a Worksheet using Sampling with Replacement ...... 54
Example: Sampling from a Worksheet using Stratified Random Sampling ...... 55
Sample from Worksheet Options ...................................................................................... 60
Data Range ......................................................................................................... 61
First row contains headers .................................................................................. 61
Variables............................................................................................................. 61
Sample With replacement ................................................................................... 62
Set Seed .............................................................................................................. 62
Desired sample size ............................................................................................ 62
Simple random sampling .................................................................................... 62
Stratified random sampling ................................................................................ 62
Stratum Variable ................................................................................................. 62
Proportionate to stratum size .............................................................................. 62
Equal from each stratum ..................................................................................... 62
Equal from each stratum, #records = smallest stratum size ................................ 63
Sampling from a Database ................................................................................................ 63
Introduction ...................................................................................................................... 66
Bar Chart ............................................................................................................ 66
Box Whisker Plot ............................................................................................... 66
Histogram ........................................................................................................... 68
Line Chart ........................................................................................................... 68
Parallel Coordinates............................................................................................ 69
Scatterplot........................................................................................................... 69
Scatterplot Matrix ............................................................................................... 69
Variable Plot ....................................................................................................... 70
Bar Chart Example ........................................................................................................... 70
Box Whisker Plot Example............................................................................................... 75
Histogram Example .......................................................................................................... 80
Line Chart Example .......................................................................................................... 85
Parallel Coordinates Chart Example ................................................................................. 87
Scatterplot Example .......................................................................................... 91
Scatterplot Matrix Plot Example ....................................................................................... 95
Variable Plot Example ...................................................................................................... 97
Common Chart Options .................................................................................................... 99
k-Means Clustering ........................................................................................... 151
Hierarchical Clustering ..................................................................................... 160
Smoothing Techniques ...................................................................................... 202
Logistic Regression ........................................................................................... 256
Association Rules .............................................................................................. 398
Obtaining a License
Use Help > License Code on the XLMiner Ribbon. The license manager in
V2014 allows users to obtain and activate a license over the Internet. V9.5 and
earlier license codes in your Solver.lic license file will be ignored in V2014.
See the chapter "Using Help, Licensing and Product Subsets" for details.
period for Use on any one computer shall be ten (10) minutes, but may be longer depending on the
Software function used and the size and complexity of the model.
Other License Restrictions: The Software includes license control features that may write encoded
information about the license type and term to the PC or LS hard disk; Licensee agrees that it will not
attempt to alter or circumvent such license control features. This License does not grant to Licensee the
right to make copies of the Software or otherwise enable use of the Software in any manner other than as
described above, by any persons or on any computers except as described above, or by any entity other than
Licensee. Licensee acknowledges that the Software and its structure, organization, and source code
constitute valuable Intellectual Property of Frontline and/or its suppliers and Licensee agrees that it shall
not, nor shall it permit, assist or encourage any third party to: (a) copy, modify, adapt, alter, translate or
create derivative works from the Software; (b) merge the Software into any other software or use the
Software to develop any application or program having the same primary function as the Software; (c)
sublicense, distribute, sell, use for service bureau use, lease, rent, loan, or otherwise transfer the Software;
(d) "share" use of the Software with anyone else; (e) make the Software available over the Internet, a
company or institutional intranet, or any similar networking technology, except as explicitly provided in the
case of a Flexible Use License; (f) reverse compile, reverse engineer, decompile, disassemble, or otherwise
attempt to derive the source code for the Software; or (g) otherwise exercise any rights in or to the
Software, except as permitted in this Section.
U.S. Government: The Software is provided with RESTRICTED RIGHTS. Use, duplication, or
disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the
Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 or subparagraphs (c)(1)
and (2) of the Commercial Computer Software - Restricted Rights at 48 CFR 52.227-19, as applicable.
Contractor/manufacturer is Frontline Systems, Inc., P.O. Box 4288, Incline Village, NV 89450.
2. ANNUAL SUPPORT.
Limited warranty: If Licensee purchases an "Annual Support Contract" from Frontline, then Frontline
warrants, during the term of such Annual Support Contract ("Support Term"), that the Software covered by
the Annual Support Contract will perform substantially as described in the User Guide published by
Frontline in connection with the Software, as such may be amended from time to time, when it is properly
used as described in the User Guide, provided, however, that Frontline does not warrant that the Software
will be error-free in all circumstances. During the Support Term, Frontline shall make reasonable
commercial efforts to correct, or devise workarounds for, any Software errors (failures to perform as so
described) reported by Licensee, and to timely provide such corrections or workarounds to Licensee.
Disclaimer of Other Warranties: IF THE SOFTWARE IS COVERED BY AN ANNUAL SUPPORT
CONTRACT, THE LIMITED WARRANTY IN THIS SECTION 2 SHALL CONSTITUTE
FRONTLINE'S ENTIRE LIABILITY IN CONTRACT, TORT AND OTHERWISE, AND LICENSEE'S
EXCLUSIVE REMEDY UNDER THIS LIMITED WARRANTY. IF THE SOFTWARE IS NOT
COVERED BY A VALID ANNUAL SUPPORT CONTRACT, OR IF LICENSEE PERMITS THE
ANNUAL SUPPORT CONTRACT ASSOCIATED WITH THE SOFTWARE TO EXPIRE, THE
DISCLAIMERS SET FORTH IN SECTION 3 SHALL APPLY.
3. WARRANTY DISCLAIMER.
EXCEPT AS PROVIDED IN SECTION 2 ABOVE, THE SOFTWARE IS PROVIDED "AS IS" AND
"WHERE IS" WITHOUT WARRANTY OF ANY KIND; FRONTLINE AND, WITHOUT EXCEPTION,
ITS SUPPLIERS DISCLAIM ALL WARRANTIES, EITHER EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, WITH RESPECT TO THE
SOFTWARE OR ANY WARRANTIES ARISING FROM COURSE OF DEALING OR COURSE OF
PERFORMANCE AND THE SAME ARE HEREBY EXPRESSLY DISCLAIMED TO THE MAXIMUM
EXTENT PERMITTED BY APPLICABLE LAW. WITHOUT LIMITING THE FOREGOING,
FRONTLINE DOES NOT REPRESENT, WARRANT OR GUARANTEE THAT THE SOFTWARE
the exclusive property of Frontline and/or its licensors. All rights in and to the Software and Frontline's
other Intellectual Property not expressly granted to Licensee in this License are reserved by Frontline. For
the Large-Scale LP/QP Solver only: Source code is available, as part of an open source project, for
portions of the Software; please contact Frontline for information if you want to obtain this source code.
Amendments: This License constitutes the complete and exclusive agreement between the parties relating
to the subject matter hereof. It supersedes all other proposals, understandings and all other agreements, oral
and written, between the parties relating to this subject matter, including any purchase order of Licensee,
any of its preprinted terms, or any terms and conditions attached to such purchase order.
Compliance with Laws: Licensee will not export or re-export the Software without all required United
States and foreign government licenses.
Assignment: This License may be assigned to any entity that succeeds by operation of law to Licensee or
that purchases all or substantially all of Licensee's assets (the "Successor"), provided that Frontline is
notified of the transfer, and that Successor agrees to all terms and conditions of this License.
Governing Law: Any controversy, claim or dispute arising out of or relating to this License, shall be
governed by the laws of the State of Nevada, other than such laws, rules, regulations and case law that
would result in the application of the laws of a jurisdiction other than the State of Nevada.
XLMiner Overview
Analytic Solver Platform and XLMiner Overview
This Guide shows you how to use XLMiner, Frontline Systems' data mining
product, which combines data analysis, time series analysis, classification,
and prediction capabilities. XLMiner is included in Frontline Systems'
Analytic Solver Platform or can be purchased as a stand-alone license.
XLMiner
XLMiner can also be purchased as a stand-alone product. A stand-alone license
for XLMiner includes all of the data analysis, time series data capabilities,
classification and prediction features available in XLMiner but does not support
optimization or simulation.
Analytic Solver Platform's XLMiner component offers over 30 different
methods for analyzing a dataset in order to forecast future events. The XLMiner
ribbon is broken up into four different segments as shown in the screenshot
below.
You can use the Data Analysis group of buttons to draw a sample of data
from a spreadsheet, external SQL database, or from PowerPivot, explore
your data, both visually and through methods like cluster analysis, and
transform your data with methods like Principal Components, Missing
Value imputation, Binning continuous data, and Transforming categorical
data.
Use the Time Series group of buttons for time series forecasting, using both
Exponential Smoothing (including Holt-Winters) and ARIMA (AutoRegressive Integrated Moving Average) models, the two most popular time
series forecasting methods from classical statistics. These methods forecast
a single data series forward in time.
The Data Mining group of buttons give you access to a broad range of
methods for prediction, classification, and affinity analysis, from both
classical statistics and data mining. These methods use multiple input
variables to predict an outcome variable, or classify the outcome into one of
several categories.
Use the Predict button to build prediction models using Multiple Linear
Regression (with variable subset selection and diagnostics), k-Nearest
Neighbors, Regression Trees, and Neural Networks.
Use the Associate button to perform affinity analysis ("what goes with
what", or market basket analysis) using Association Rules.
If forecasting and data mining are new for you, don't worry: you can learn a lot
about them by consulting our extensive in-product Help. Click Help > Help
Text on the XLMiner tab, or click Help > Help Text > Forecasting/Data
Mining on the Analytic Solver Platform tab (these open the same Help file).
If you'd like to learn more and get started as a data scientist, consult the
excellent book Data Mining for Business Intelligence, which was written by the
XLMiner designers and early academic users. You'll be able to run all the
XLMiner examples and exercises in Analytic Solver Platform.
Data Analysis
XLMiner includes several different methods for data analysis, including
Sampling from either a Worksheet or Database, Charting with 8 different types
of available charts, Transformation techniques which handle missing data,
binning continuous data, creating dummy variables and transforming categorical
data, and using Principal Components Analysis to reduce and eliminate
superfluous or redundant variables; along with two different types of Clustering
techniques, k-Means and Hierarchical.
Click the Sample icon to take a representative sample from a database included
in either an Excel workbook or an Oracle, SQL Server, or MS-Access database.
Users can choose to sample with or without replacement using simple or
stratified random sampling.
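The difference between these sampling schemes is easy to see outside the product. The following Python sketch is an illustration only (the function names are our own, not XLMiner's internal code); it draws a simple random sample, with or without replacement, and an equal-size sample from each stratum:

```python
import random
from collections import defaultdict

def simple_random_sample(records, n, replace=False, seed=None):
    """Draw n records, with or without replacement."""
    rng = random.Random(seed)
    if replace:
        return [rng.choice(records) for _ in range(n)]
    return rng.sample(records, n)

def stratified_sample(records, stratum_key, per_stratum, seed=None):
    """Draw up to per_stratum records from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_key(rec)].append(rec)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

data = [{"id": i, "region": "north" if i % 2 else "south"} for i in range(20)]
print(len(simple_random_sample(data, 5, seed=1)))                      # 5
print(len(stratified_sample(data, lambda r: r["region"], 3, seed=1)))  # 6
```

Proportionate stratified sampling works the same way, except that each stratum's sample size is set in proportion to the stratum's share of the dataset.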
Click the Explore icon to create one or more charts of your data. XLMiner
includes 8 different types of charts to choose from, including: bar charts, line
charts, scatterplots, boxplots, histograms, parallel coordinates charts, scatterplot
matrix charts or variable charts. Click this icon to edit or view previously
created charts as well.
Click the Transformation icon when data manipulation is required. In most large
databases or datasets, a portion of variables are bound to be missing some data.
XLMiner includes routines for dealing with these missing values by allowing a
user to either delete the full record or apply a value of his/her choice. XLMiner
also includes a routine for binning continuous data for use with prediction and
classification methods that do not support continuous data. Continuous
variables can be binned using several different user-specified options.
Non-numeric data can be transformed using dummy variables with up to 30
distinct values. If more than 30 categories exist for a single variable, use the
Reduce Categories routine to decrease the number of categories to 30. Finally,
use Principal Components Analysis to remove highly correlated or superfluous
variables from large databases.
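Binning and dummy-variable creation are both simple transformations. The sketch below illustrates the two ideas in plain Python (the function names and bin edges are invented for the example; this is not XLMiner's implementation):

```python
def bin_value(x, edges, labels):
    """Assign x to the bin whose upper edge it falls under (last label is open-ended)."""
    for edge, label in zip(edges, labels):
        if x <= edge:
            return label
    return labels[-1]

def dummy_encode(values):
    """Map a categorical column to 0/1 dummy columns, one per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

incomes = [21, 35, 48, 52, 77, 90]
print([bin_value(x, edges=[40, 70], labels=["low", "mid", "high"]) for x in incomes])
# ['low', 'low', 'mid', 'mid', 'high', 'high']

regions = ["north", "south", "south"]
print(dummy_encode(regions)[0])  # {'is_north': 1, 'is_south': 0}
```

A variable with k categories yields k dummy columns here, which is why routines like Reduce Categories cap the category count before encoding.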
Click the Cluster icon to gain access to two different types of clustering
techniques: k-Means clustering and hierarchical clustering. Both methods
allow insight into a database or dataset by performing a cluster analysis. This
type of analysis can be used to obtain the degree of similarity (or dissimilarity)
between the individual objects being clustered.
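To see the k-Means idea concretely, here is a minimal sketch (an illustration under simplifying assumptions, not XLMiner's implementation): points are repeatedly assigned to the nearest center, and each center is then moved to the mean of its assigned points.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means for 2-D points: assign to nearest center, then re-center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the nearest center by squared Euclidean distance.
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0])**2 + (p[1] - centers[c][1])**2)
            clusters[i].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
                   if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0), (8.5, 9.5)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

With two well-separated groups of points, the algorithm settles on one center per group; hierarchical clustering instead builds a tree of merges without fixing k in advance.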
Typically, when using a time series dataset, the data is first partitioned into
training and validation sets. Click the Partition icon within the Time Series
ribbon segment to utilize the Time Series Data Partitioning routine. XLMiner
features two techniques for exploring trends in a dataset, ACF (Autocorrelation
function) and PACF (Partial autocorrelation function). These techniques help the
user to explore various patterns in the data which can be used in the creation of
the model. After the data is analyzed, a model can be fit to the data using
XLMiner's ARIMA method. All three of these methods can be found under the
ARIMA icon.
Data collected over time is likely to show some form of random variation.
"Smoothing techniques" can be used to reduce or cancel the effect of these
variations. These techniques, when properly applied, will smooth out the
random variation in the time series data to reveal any underlying trends that may
exist.
Click the Smoothing icon to gain access to XLMiner's four different smoothing
techniques: Exponential, Moving Average, Double Exponential, and
Holt-Winters. The first two techniques, Exponential and Moving Average, are
relatively simple smoothing techniques and should not be performed on datasets
involving seasonality. The last two techniques are more advanced techniques
which can be used on datasets involving seasonality.
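The two simple techniques can be stated in a few lines. The sketch below (illustrative only; the series values are made up and this is not XLMiner's implementation) shows a moving average and simple exponential smoothing, where each smoothed level is alpha times the new observation plus (1 - alpha) times the previous level:

```python
def moving_average(series, window):
    """Smooth by averaging the most recent `window` observations."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def exponential_smoothing(series, alpha):
    """Level_t = alpha * y_t + (1 - alpha) * Level_{t-1}."""
    level = series[0]
    smoothed = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

demand = [10, 12, 11, 13, 12, 14]
print(moving_average(demand, window=3))          # [11.0, 12.0, 12.0, 13.0]
print(exponential_smoothing(demand, alpha=0.5))  # [10, 11.0, 11.0, 12.0, 12.0, 13.0]
```

Neither method carries a seasonal term, which is why they are unsuited to seasonal data; Double Exponential adds a trend component, and Holt-Winters adds both trend and seasonal components.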
Data Mining
The Data Mining section of the Analytic Solver Platform or XLMiner ribbon
contains four icons: Partition, Classify, Predict, and Associate. Click the
Partition icon to partition your data into training, validation, and if desired, test
sets. Click the Classify icon to select one of six different classification methods.
Click the Predict icon to select one of four different prediction methods. Click
the Associate icon to recognize associations or correlations among variables in
the dataset.
XLMiner supports six different methods for predicting the class of an outcome
variable (classification) and four different methods for predicting the actual
value of an outcome variable (prediction). Classification can be described as
categorizing a set of observations into predefined classes in order to determine
the class of an observation based on a set of variables. A prediction method can
be described as a technique performed on a database either to predict the
response variable value based on a predictor variable or to study the relationship
between the response variable and the predictor variables, for example,
determining the relationship between the crime rate of a city or neighborhood
and demographic factors such as population, education, and the male-to-female
ratio.
One very important issue when fitting a model is how well the newly created
model will behave when applied to new data. To address this issue, the dataset
can be divided into multiple partitions before a classification or prediction
algorithm is applied: a training partition used to create the model, a validation
partition to test the performance of the model and, if desired, a third test
partition. Partitioning is performed randomly, to protect against a biased
partition, according to proportions specified by the user or according to rules
concerning the dataset type. For example, when creating a time series forecast,
data is partitioned by chronological order.
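A random partition by user-specified proportions can be sketched as follows (an illustration only; the proportions and function name are invented for the example, and this is not XLMiner's partitioning code):

```python
import random

def partition(records, train=0.6, valid=0.3, seed=42):
    """Randomly split records into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)          # random order protects against a biased split
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

rows = list(range(100))
train_set, valid_set, test_set = partition(rows)
print(len(train_set), len(valid_set), len(test_set))  # 60 30 10
```

The remainder after the training and validation shares becomes the test set; for time series data the shuffle step would be dropped so the split stays chronological.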
The six different classification methods are:
Discriminant Analysis - Constructs a set of linear functions of the
predictor variables and uses these functions to predict the class of a
new observation with an unknown class. Common uses of this method
include: classifying loan, credit card or insurance applicants into low
or high risk categories, classifying student applications for college
entrance, classifying cancer patients into clinical studies, etc.
Logistic Regression - A variant of ordinary regression which is used
to predict the response variable, or the output variable, when the
response variable is a dichotomous variable (a variable that takes only
two values such as yes/no, success/failure, survive/die, etc.).
k-Nearest Neighbors - This classification method divides a training
dataset into groups of k observations using a Euclidean distance
measure to determine similarity between neighbors. These
classification groups are used to assign categories to each member of
the validation set.
Classification Tree - Also known as Decision Trees, this classification
method is a good choice when the goal is to generate easily understood and
explained rules that can be translated into SQL or a query language.
cheese, etc. This information is useful in planning store layouts (placing items
optimally with respect to each other), cross-selling promotions, coupon offers,
etc.
Tools
The Tools section of the Analytic Solver Platform ribbon contains two icons:
Score and Help. Click the Score icon to score new data in a database or
worksheet with any of the Classification or Prediction algorithms. This facility
matches the input variables to the database (or worksheet) fields and then
performs the scoring on the database (or worksheet).
XLMiner also supports the scoring of Test Data. When XLMiner calculates
prediction or classification results, internal values and coefficients are generated
and used in the computations. XLMiner saves these values to an additional
output sheet, termed the Stored Model Sheet, which uses the worksheet name
XX_Stored_N, where XX is the initials of the classification or prediction
method and N is the number of generated stored sheets. This sheet is used when
scoring the test data. Note: In previous versions of XLMiner, this utility was a
separate add-on application named XLMCalc. Starting in XLMiner V12.5, this
utility is included free of charge.
Click the Help icon to enter a new license or activation code, open an example
dataset (over 25 example datasets are provided and most are used in the
examples throughout this guide), open the online help, open this guide, or check
for updates. See the XLMiner Help Ribbon Icon section in the Introduction to
XLMiner chapter for more information on this menu.
Next, you'll briefly see the standard Windows Installer dialog. Then a
dialog box like the one shown below should appear:
Next, the Setup program will ask if you accept Frontline's software license
agreement. You must click "I accept" and then Next in order to proceed.
The Setup program then displays a dialog box like the one shown below,
where you can select or confirm the folder to which files will be copied
(normally C:\Program Files\Frontline Systems\Analytic Solver Platform, or
if you're installing Analytic Solver Platform for 32-bit Excel on 64-bit
Windows, C:\Program Files (x86)\Frontline Systems\Analytic Solver
Platform). Click Next to proceed.
If you have an existing license, or you've just activated a license for full
Analytic Solver Platform, the Setup program will give you the option to run
the XLMiner software as a subset product instead of the full Analytic
Solver Platform.
Click Next to proceed. You'll see a dialog confirming that the preliminary
steps are complete, and the installation is ready to begin:
After you click Install, the Analytic Solver Platform files will be installed,
and the program file RSPAddin.xll will be registered as a COM add-in
(which may take some time). A progress dialog appears, as shown below;
be patient, since this process takes longer than it has in previous Solver
Platform releases.
When the installation is complete, you'll see a dialog box like the one
below. Click Finish to exit the installation wizard.
You can manage add-ins by selecting the type of add-in from the dropdown
list at the bottom of this dialog. For example, if you select COM Add-ins
from the dropdown list and click the Go button, the dialog shown below
appears.
If you uncheck the box next to Analytic Solver Platform Addin and click
OK, you will deactivate the Analytic Solver Platform COM add-in, which
will remove the Analytic Solver Platform tab from the Ribbon, and also
remove the PSI functions for optimization from the Excel 2013 Function
Wizard.
Excel 2003
In earlier versions of Excel, COM add-ins and other add-ins are managed in
separate dialogs, and the COM Add-In dialog is available only if you
display a toolbar which is hidden by default. To display this toolbar:
1.
2.
3.
4.
Once you have done this, you can click COM Add-Ins on the toolbar to see
a list of the available add-ins in the COM Add-Ins dialog box, as shown
above.
If you uncheck the box next to Analytic Solver Platform Addin and click
OK, you will deactivate the Analytic Solver Platform COM add-in, which
will remove Analytic Solver Platform from the main menu bar, and also
remove the PSI functions for optimization from the Insert Function dialog.
You have two options to obtain and activate a license, using this dialog:
1.
2.
Even easier, and available 24x7 if you have Internet access on this PC:
If you have a license Activation Code from Frontline Systems, you can
copy and paste it into the upper edit box in this dialog. When you
click OK, Analytic Solver Platform contacts Frontline's license server
over the Internet, sends the Lock Code and receives your license code
automatically. You'll see a message confirming the license activation,
or reporting any errors.
the Analytic Solver Platform Ribbon. A dialog like the one below will
appear.
In this dialog, you can select the subset product you want, and click OK.
The change to a new product takes effect immediately: You'll see the
subset product name instead of Analytic Solver Platform as a tab on the
Ribbon, and a subset of the Ribbon options.
XLMiner
XLMiner includes only the data mining and predictive capabilities of
Analytic Solver Platform. No optimization or simulation capabilities are
included in the XLMiner subset.
This screen appears automatically only when you click the Analytic Solver
Platform tab on the Ribbon in Excel 2013, 2010 or 2007, or use the
Analytic Solver Platform menu in Excel 2003, and then only if you are
using a trial license. You can display the Welcome Screen manually by
choosing Help > Welcome Screen from the Analytic Solver Platform
Ribbon. You can control whether the screen appears automatically by
selecting or clearing the check box in the lower left corner, "Show this
dialog at first use."
This Help file contains significant information about the features and
capabilities of XLMiner, all at your fingertips. Each topic covered
includes an introduction to the feature, an explanation of the dialogs
involved, and an example using one of the example datasets. These
example datasets can be found on the XLMiner Ribbon under Help --
Examples.
Introduction to XLMiner
Introduction
XLMiner is a comprehensive data mining add-in for Excel. Data mining
is a discovery-driven data analysis technology used for identifying patterns
and relationships in data sets. With overwhelming amounts of data now
available from transaction systems and external data sources, organizations
are presented with increasing opportunities to understand their data and gain
insights into it. Data mining is still an emerging field, and is a convergence
of fields like statistics, machine learning, and artificial intelligence.
Often, there may be more than one approach to a problem. XLMiner is a
tool belt to help you get started quickly, offering a variety of methods to
analyze your data. It has extensive coverage of statistical and machine
learning techniques for classification, prediction, affinity analysis, and
data exploration and reduction.
Ribbon Overview
To bring up the XLMiner ribbon, click XLMiner on the Excel ribbon.
Click the XLMiner menu item in Excel 2003 to open the XLMiner menu.
This menu is arranged a bit differently than the Excel 2007 / 2010 / 2013
ribbon, but all features discussed in this guide can be used in Excel 2003.
Note: Menu items appear differently in Excel 2003 but all method dialogs
are identical to dialogs in later versions of Excel.
Change Product
Selecting Change Product on the XLMiner Help menu will bring up the
Change Product dialog shown below.
If you have a permanent license code for the Analytic Solver Platform, then
you can change to any subset and see that subset on the Ribbon. For
example, if XLMiner is selected, only XLMiner will appear on the ribbon
even if a license for Analytic Solver Platform is in place.
License Code
Selecting License Code from the XLMiner Help menu brings up the Enter
License or Activation Code dialog shown below. The top portion of this
dialog will always contain the currently licensed product along with the
product version number.
Enter the activation code into the Activation Code field, leaving the License
Code field blank. Then click OK. At this point, your permanent license
should be activated and no further steps are needed.
If you encounter problems connecting to our license server, then you will
need to enter the complete license directly into this dialog. Clicking Email
Lock Code will create and send an email to info@solver.com that includes
the Lock Code displayed on this dialog. Our license manager will generate
a license based on this lock code and email the permanent license code back
to you. (Make sure to click Allow on the dialog below.)
Copy and paste the entire contents of the license code into the License Code
field as shown on the dialog below, then click OK. At this point, your
permanent license should be activated and no further steps are needed.
Frontline Solvers V2014
Examples
Clicking this menu item will open a browser pointing to C:\Program
Files\Frontline Systems\Analytic Solver Platform\Datasets. See the table
below for a description of each example dataset.
Model                  Used in Example          Notes
Airpass                Time Series
All Temperature                                 Temperature dataset.
Apparel
Arma
Associations           Association Rules
AssociationsItemList
Binning Example
Boston Housing
Boxplot                Box Plot
Catalog Multi
Daily Rate
Dataset.mdb
Demo.mdb                                        A synthetic database.
Digits                 Discriminant Analysis
DistMatrix             Hierarchical Clustering
Durable Goods
Examples
Flying Fitness
Income                 Time Series
Iris
Irisfacto
Monthly Rate
Retail Trade
Sampling
Scoring
Utilities
Wine
Wine Partition
Help Text
Clicking Help Text opens the online Help file. This file contains extensive
information pertaining to XLMiner's features and capabilities, all at your
fingertips!
About XLMiner
Clicking this menu item will open the About XLMiner dialog as shown
below.
Worksheet
The active worksheet appears in this field
Data Range
The range of the dataset appears in this field
# Rows, # Columns
The number of rows and columns in the dataset appear in these two fields,
respectively.
Input variables
Variables listed in this field will be included in the output. Select the
desired Variables in the data source then click the > button to shift
variables to the Input variables field.
Help
Click this command button to open the XLMiner Help text file.
Reset
Click this command button to reset the options for the selected method.
OK
Click this command button to initiate the desired method and produce the
output report.
Cancel
Click this command button to close the open dialog without saving any
options or creating an output report.
Help Window
Click this command button to open the Help Text for the selected method.
References
See below for a list of references cited when compiling this guide.
Books
Shmueli, Galit, Nitin R. Patel, and Peter C. Bruce. Data Mining for Business
Intelligence. Wiley, New Jersey (2010).
To start, click a cell within the data, say A2, and click Data Utilities -- Sample
from Worksheet.
In this example, the default option, Simple Random Sampling, will be used.
Select all variables under Variables, click > to include them in the sample
data, then click OK.
Again, select all variables in the Variables section and click > to include each
in the sample data. Check Sample with replacement and enter 300 for Desired
sample size. Since we are sampling with replacement, XLMiner can
generate a sample with a larger number of records than the dataset contains.
Click OK.
A portion of the output is shown below.
The output indicates "True" for Sampling with replacement, which is why the
desired sample size may be greater than the number of records in the input
data. Looking closely, one can see that the second and third entries are the
same record, record #3.
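Conceptually, sampling with replacement amounts to repeated independent draws where every record stays in the pool after being drawn. The Python sketch below is only an illustration of the idea, not XLMiner's implementation; the 200-record dataset is hypothetical, and only the default seed of 12345 comes from the Set Seed option described later in this chapter.

```python
import random

# Hypothetical dataset of 200 record ids (XLMiner works on worksheet rows).
records = list(range(1, 201))
random.seed(12345)  # XLMiner's default seed is also 12345

# With replacement, the requested sample size may exceed the dataset
# size, and the same record can be drawn more than once.
sample = random.choices(records, k=300)
```

Because 300 draws are taken from only 200 records, the sample is guaranteed to contain duplicates, just as the example output shows for record #3.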
In order to maintain the proportions of the strata, XLMiner has increased the
sample size. This is apparent in the entry for #records actually sampled. Under
the Stratum wise details heading, XLMiner has listed all the stratum values v8
assumes with #records in input data for each stratum. On this basis, XLMiner
calculated the percentage representation of that value in the dataset and
maintained it in the sample. This is evident in the entries since #records in
sampled data has the same proportion as in the dataset. XLMiner has added a
Row Id to each record before sampling. The output is displayed after sorting the
sample on these Row Ids.
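The proportionate allocation described above can be sketched as follows. This is a conceptual illustration only: the records and the three stratum values are made up, and XLMiner's exact rounding of per-stratum counts may differ.

```python
import random
from collections import defaultdict

random.seed(12345)

# Hypothetical records: (row_id, stratum_value) pairs standing in for a
# stratum variable like v8 in the example.
data = [(i, random.choice(["a", "b", "c"])) for i in range(1, 101)]

strata = defaultdict(list)
for row_id, value in data:
    strata[value].append(row_id)

# Proportionate allocation: each stratum contributes the same share of
# the sample that it holds in the input data.
desired = 40
sample = []
for value, rows in strata.items():
    k = round(desired * len(rows) / len(data))
    sample.extend(random.sample(rows, k))
```

Per-stratum rounding can nudge the total up or down slightly, which is one reason the "#records actually sampled" entry may differ from the requested size.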
Let's see what happens to our output when we select a different option for
Stratified Sampling.
Click back to the data worksheet, click a cell within the data, say A2, and click
Data Utilities -- Sample from Worksheet.
Select all variables under Variables, click > to include them in the sample data,
and then click OK. Select Stratified random sampling. Choose v8 as the
Stratum variable. The #strata is displayed automatically. Select Equal from
each stratum, please specify #records.
Enter the #records. Remember, this number should not be greater than the
smallest stratum size. In this case the smallest stratum size is 8. (Note: The
smallest stratum size appears automatically in a box next to the option, Equal
from each stratum, # records = smallest stratum size.). Enter 7, which is less
than the limit of 8, and then click OK.
As you can see in the output, the number of records in the sampled data is 56 or
7 records per stratum for 8 strata.
If a sample with an equal number of records for each stratum but of bigger size
is desired, use the same options above for sampling with replacement.
Check Sample with replacement. Enter 20 for Equal from each stratum, please
specify #records. Though the smallest stratum size is 8 in this dataset, we can
acquire more records for our sample since we are Sampling with replacement.
Keeping all other options the same, the output is as follows.
Since the output sample has 20 records per stratum, the #records in sampled
data is 160 (20 records per stratum for 8 strata).
Data Range
Either type the address directly into this field, or use the reference button to
enter the data range from the worksheet. If the cell pointer (active cell) is
already somewhere in the data range, XLMiner automatically picks up the
contiguous data range surrounding the active cell. After the data range is
selected, XLMiner displays the number of records in the selected range.
Variables
This list box contains the names of the variables in the selected data range. If the
first row of the range contains the variable names, then these names appear in
this list box. If the first row of the dataset does not contain the headers, then
XLMiner lists the variable names using its default naming convention. In this
case the first column is named Var1; the second column is named Var2 and so
on. To select a variable for sampling, select the variable, then click the ">"
button. Use the CTRL key to select multiple variables.
Set Seed
Enter the desired random number seed here. The default seed is 12345.
Stratum Variable
Select the variable to be used for stratified random sampling by clicking the
down arrow and selecting the desired variable. Note that XLMiner allows only
those variables which have fewer than 30 distinct values. As the user selects the
variable name, XLMiner displays the #Strata that variable contains in a box to
the left and the smallest stratum size in a box in front of the option Equal from
each stratum, #records = smallest stratum size.
Click the down arrow next to Data Source and select MS-Access, and then click
Connect to a database.
Since this database is not password protected, simply click OK. The following
dialog will appear.
Select all the fields from Fields in table and click > to move all fields to Selected
fields.
Click OK. A portion of the output is below.
Refer to the examples above for Sampling from a Worksheet. You can sample
from a database using all the methods described in this chapter.
Bar Chart
The bar chart is one of the easiest plots to create and understand, and one of
the most effective. Its best application is comparing an individual statistic
(e.g., mean or count) across a group of variables. The bar height represents the
statistic, while the bars represent the different groups. An example of a bar chart
is shown below.
Histogram
A histogram, or frequency histogram, is a bar graph that depicts the range
and scale of the observations on the x axis and the number of data points (or
frequency) in the various intervals on the y axis. These graphs are
popular among statisticians. Although they do not show the
exact values of the data points, they give a very good idea of the spread and
shape of the data.
Consider the percentages below from a college final exam.
82.5, 78.3, 76.2, 81.2, 72.3, 73.2, 76.3, 77.3, 78.2, 78.5, 75.6, 79.2, 78.3, 80.2,
76.4, 77.9, 75.8, 76.5, 77.3, 78.2
One can immediately see the value of a histogram by taking a quick glance at
the graph below. This plot quickly and efficiently illustrates the shape and size
of the dataset above. Note: XLMiner determines the number and size of the
intervals when drawing the histogram.
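The grouping a histogram performs can be reproduced by hand. The sketch below bins the twenty exam percentages into equal-width intervals; the start of 72 and width of 2 are illustrative choices, since XLMiner picks the number and size of the intervals itself.

```python
scores = [82.5, 78.3, 76.2, 81.2, 72.3, 73.2, 76.3, 77.3, 78.2, 78.5,
          75.6, 79.2, 78.3, 80.2, 76.4, 77.9, 75.8, 76.5, 77.3, 78.2]

# Equal-width binning: each score falls in the interval [lo, lo + width).
start, width = 72, 2
counts = {}
for s in scores:
    lo = start + width * int((s - start) // width)
    counts[lo] = counts.get(lo, 0) + 1

# A text rendering of the bar heights, one row per interval.
for lo in sorted(counts):
    print(f"{lo}-{lo + width}: {'*' * counts[lo]}")
```

Most scores cluster in the 76-80 range, which is exactly the spread-and-shape information a histogram conveys at a glance.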
Line Chart
A line chart is best suited for time series datasets. In the example below, the line
chart plots the number of airline passengers from January 1949 to December
1960. (The X axis is the number of months starting with January 1949 as 1.)
Parallel Coordinates
A Parallel Coordinates plot consists of N vertical axes, where N is the
number of variables selected for the plot. A line is drawn
connecting each observation's values on the different variables (the different
axes), creating a multivariate profile. These graphs can be useful for
prediction and possible data binning. In addition, they can expose
clusters, outliers, and variable overlap. Axes can be reordered by simply
dragging an axis to the desired location. An example of
a Parallel Coordinates plot is shown below.
Scatterplot
One of the most common, most effective, and easiest plots to create is the scatterplot.
These graphs are used to compare the relationships between two variables and
are useful in identifying clusters and variable overlap.
Scatterplot Matrix
A Matrix plot combines several scatterplots into one panel, enabling the user to
see pairwise relationships between variables. Given a set of variables Var1,
Var2, Var3, ..., VarN, the matrix plot contains all the pairwise scatterplots of
the variables on a single page in a matrix format. The names of the variables are
on the diagonal. In other words, if there are k variables, there will be k rows
and k columns in the matrix, and the plot in the ith row and jth column will be
Var i versus Var j.
The axes titles and the values of the variables appear at the edge of the
respective row or column. The comparison of the variables and their interactions
with one another can be studied easily and with a simple glance, which is why
matrix plots are becoming increasingly common in general-purpose statistical
software. An example is shown below.
Variable Plot
XLMiner's Variables graph simply plots the distribution of each selected
variable. See below for an example.
Click Next.
On the Y Axis Selection Dialog, select MEDV, and then click Next.
Select CHAS on the X-Axis Selection dialog, then click Finish. Click Next to
set Panel and Color options. These options can always be set in the upper right
hand corner of the plot.
Click the right pointing arrow next to Count of MEDV and select MEDV from
the menu. When the second menu appears (below the first selection of MEDV)
select Mean.
This bar chart includes a categorical variable, CAT.MEDV, on the y-axis. This
variable is 0 if MEDV is less than 30 (MEDV < 30); otherwise the variable
value is 1. A user can quickly see that the majority of houses are located far
away from the Charles River.
Uncheck the 1 under the CHAS filter to view only homes located far away from
the Charles River.
To change the variable on the X-axis, simply click the down arrow and select
the desired variable from the menu.
To add a 2nd Bar Chart simply click the Bar Chart icon at the top of the Chart
Wizard.
A second chart is added to the Chart Wizard dialog. Click the X in the upper
right corner of a plot to remove it from the window. Color by and Panel by
options are always available in the upper right hand corner of each plot.
The top graph shows the count of all records in each category. Since each
category includes the same number of observations, all bars are set to the same
height.
Please see the Common Chart Options section (below) for a complete
description of each icon on the chart title bar.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
To save the chart for later viewing, click Save. To delete the chart, click
Discard; to cancel the save and return to the chart, click Cancel. For this
example, enter BarChart for the chart name, then click Save. The chart will
close. To reopen the chart, click Explore -- Existing Charts -- BarChart.
On the Y Axis Selection dialog, select Y1, and then click Next.
Select X-Var on the X-Axis Selection dialog, then click Finish. Click Next to
set Panel and Color options. These options can always be set in the upper right
hand corner of the plot.
Uncheck class 4 under the X-Var filter to remove this class from the plot.
The dotted line denotes the mean of 22.49; the solid line denotes the median of
23.22. The box reaches from the 25th percentile of 9.07 to the 75th percentile of
37.87. The lower whisker (or lower bound) reaches to -47.343, and the upper
whisker (or upper bound) reaches to 61.454.
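The statistics behind a box plot can be computed directly. The sketch below uses a small made-up sample and one common whisker convention (1.5 times the interquartile range beyond the box); XLMiner's exact percentile and whisker rules are not documented here, so treat this only as a guide to what the lines and box edges represent.

```python
from statistics import quantiles

values = [2.1, 4.5, 4.7, 5.0, 5.3, 5.9, 6.4, 7.2, 7.8, 12.9]  # made-up data

# The 25th, 50th, and 75th percentiles form the box and the median line.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

# A common convention places the whisker bounds 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
```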
To select a different variable on the y-axis, click the right pointing arrow and
select the desired variable from the menu.
To change the variable on the X-axis, select the down arrow next to X-Var and
select the desired variable.
To add a 2nd boxplot, click the BoxPlot icon on the top of the Chart Wizard
dialog.
A second chart is added to the Chart Wizard dialog. Click the X in the upper
right corner of a plot to remove it from the window. Color by and Panel by
options are always available in the upper right hand corner of each plot.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
Please see the Common Chart Options section (below) for a complete
description of each icon on the chart title bar.
To save the chart for later viewing, click Save. To delete the chart, click
Discard; to cancel the save and return to the chart, click Cancel. For this
example, enter BoxPlot for the chart name, then click Save. The chart will
close. To reopen the chart, click Explore -- Existing Charts -- BoxPlot.
Histogram Example
The example below illustrates the use of XLMiner's Chart Wizard in drawing a
histogram of the Boston_Housing.xlsx dataset. Click Help -- Examples on the
XLMiner ribbon to open the example dataset, Boston_Housing.xlsx. Select a
cell within the dataset, say A2, and then click Explore -- Chart Wizard on the
XLMiner ribbon. The following dialog appears.
The data has been divided into 14 different bins, or intervals. Unselect the
variables CRIM and ZN under Filters. Notice the graph did not change. This
is because removing these variables is, in effect, removing a column from the
dataset. Since we are currently not interested in these columns, the plot is not
affected. However, now uncheck 0 under the CHAS variable.
Notice the number of bins has been reduced to 7. This is because removing
the 0 class from the CHAS variable is, in effect, removing rows from the
dataset, which does affect the INDUS variable in the plot.
To change the variables included in the plot, simply click the Histogram icon on
the title bar of the Chart Wizard.
Select DIS for the X-Axis, then click Next to choose color and panel options.
At this point, you could also click Finish to draw the histogram. Color and
panel options can be chosen at any time.
Select CAT. MEDV for Color By, then click Finish to draw the histogram.
The two histograms are drawn in the same window. Click the X in the upper
right corner of a plot to remove it from the window. Color by and Panel by
options are always available in the upper right hand corner of each plot.
Please see the section Common Chart Options (below) for a complete
description of each icon on the chart title bar.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
To save the chart for later viewing, click Save. To delete the chart, click
Discard; to cancel the save and return to the chart, click Cancel. For this
example, type Histogram for the chart name, then click Save. The chart will
close. To reopen the chart, click Explore -- Existing Charts -- Histogram.
Select Observation#, then select Finish. Click Next to choose Panel and Color
options. Both can be selected or changed in the upper right hand corner of the
plot.
The y-axis plots the number of passengers and the x-axis plots the month
number (starting with 1 for January 1949). The plot shows that as the months
progress, the number of airline passengers increases.
Please see the section Common Chart Options (below) for a complete
description of each icon on the chart title bar.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
To save the chart for later viewing, click Save. To delete the chart, click
Discard; to cancel the save and return to the chart, click Cancel. For this
example, type LineChart for the chart name, then click Save. The chart will
close. To reopen the chart, click Explore -- Existing Charts -- LineChart.
Select all variables except MEDV. (The CAT.MEDV variable is used in place
of the MEDV variable. CAT.MEDV is a categorical variable where a 1 denotes a
MEDV value larger than 30.)
Click Finish to draw the plot.
Leaving the Chart Wizard window open, click back to the Data worksheet
(within the Boston_Housing workbook), then click Explore -- Chart Wizard to
open a 2nd instance of the Chart Wizard. Select Parallel Coordinates on the
first Chart Wizard dialog and then select all variables except MEDV on the
Variable Selection dialog. When the 2nd plot is drawn, unselect the 0 class for
the CAT.MEDV variable.
The first characteristic that is evident is that there are more houses with a value
of 0 for CAT.MEDV (median value of owner-occupied homes < 30,000) than
with a value of 1 (median value of owner-occupied homes > 30,000). In
addition, the more expensive houses (CAT.MEDV = 1) have lower CRIM (per
capita crime rate by town) and LSTAT (% lower status of the population) values
and higher RM (average number of rooms per dwelling) values.
Select the 0 CAT.MEDV chart and select 1 under CAT.MEDV. Then
select CAT.MEDV for Color By. The chart now displays both classes of
CAT.MEDV (0 and 1) on the same chart, with each class given a different
color: blue for 0 and yellow for 1.
To remove a variable from the matrix, unselect the desired variable under
Filters. To add a variable to the matrix, select the desired variable under Filters.
Please see the section Common Chart Options (below) for a complete
description of each icon on the chart title bar.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
To save the chart for later viewing, click Save. To delete the chart, click
Discard, to cancel the save and return to the chart, click Cancel. For this
example, type Parallel for the chart name, then click Save. The chart will close.
To reopen the chart, click Explore Existing Charts Parallel.
ScatterPlot Example
The example below illustrates the use of XLMiner's Chart Wizard in drawing a
scatterplot using the Boston_Housing.xlsx dataset. Click Help -- Examples on
the XLMiner ribbon to open the example dataset, Boston_Housing.xlsx. Select
a cell within the dataset, say A2, and then click Explore -- Chart Wizard on the
XLMiner ribbon. The following dialog appears.
Select MEDV from the X-Axis Selection dialog, then click Finish.
Select Color by: CHAS (Charles River dummy variable = 1 if tract bounds
river; 0 otherwise) and Panel by: CAT.MEDV (median value of owner-occupied
homes in $1000's > 30).
This new graph illustrates that most houses that border the river are higher
priced homes.
To remove a variable from the matrix, unselect the desired variable under
Filters. To add a variable to the matrix, select the desired variable under Filters.
Please see the section Common Chart Options (below) for a complete
description of each icon on the chart title bar.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
To save the chart for later viewing, click Save. To delete the chart, click
Discard; to cancel the save and return to the chart, click Cancel. For this
example, type Scatterplot for the chart name, then click Save. The chart will
close. To reopen the chart, click Explore -- Existing Charts -- Scatterplot.
Select INDUS, AGE, DIS, and RAD variables, then click Finish.
Histograms of the selected variables appear on the diagonal. Find the plot in the
second row (from the top) and third column (from the left) of the matrix.
This plot indicates a pairwise relationship between the variables AGE and DIS.
The Y-axis for this plot can be found at the 2nd row, 1st column.
The X-axis for this plot can be found at the last row, 3rd column.
To remove a variable from the matrix, unselect the desired variable under
Filters. To add a variable to the matrix, select the desired variable under Filters.
Please see the section Common Chart Options (below) for a complete
description of each icon on the chart title bar.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
To save the chart for later viewing, click Save. To delete the chart, click
Discard; to cancel the save and return to the chart, click Cancel. For this
example, type ScatterplotMatrix for the chart name, then click Save. The chart
will close. To reopen the chart, click Explore -- Existing Charts --
ScatterplotMatrix.
All variables are selected by default. Click Finish to draw the chart.
The distributions of each variable are shown in bar chart form. To remove a
variable from the matrix, unselect the desired variable under Filters. To add a
variable to the matrix, select the desired variable under Filters.
Please see the section Common Chart Options (below) for a complete
description of each icon on the chart title bar.
To exit the graph, click the red X in the upper right hand corner of the Chart
Wizard window.
To save the chart for later viewing, click Save. To delete the chart, click
Discard; to cancel the save and return to the chart, click Cancel. For this
example, type Variables for the chart name, then click Save. The chart will
close. To reopen the chart, click Explore -- Existing Charts -- Variables.
The first icon (starting from the left) is the Print icon.
Click this icon to see a preview of the chart before it is printed and to change
printer and page settings.
Click the 2nd icon, the Copy icon, to copy the chart to the clipboard for pasting
into a new or existing document.
Click the 3rd option, the Chart Options icon, to change chart settings such as
legend and axis titles, to add labels, or to change chart colors or borders.
(Several charts do not support all tabs and options.)
Click the Legend tab to display the chart legend, legend position, and to add a
chart title.
Click the Labels tab to change or add either a header or footer to the chart. Use
this tab to select the position of the header/footer (center, left, or right), the font,
and the backplane style and color.
Click the Colors tab to change the colors used in the chart.
Click the Axes tab to change the X and Y Axis titles, placement and font. (The
Formatting menu is enabled only for Variable Plot, Histogram, and Scatterplot
Charts.)
Click OK to accept the changes or Cancel to disregard the changes and return to
the chart window.
Select a cell in the dataset, say A2, and click Transform -- Missing Data
Handling on the XLMiner ribbon to open the Missing Data Handling dialog.
As you can see, No Treatment is currently being applied to each variable.
As you can see, XLMiner has added a Row Id to every record (the highlighted
column). This is useful when the dataset does not contain a column for record
identification. This added Row Id makes it easier to find which records were
deleted or changed as per the instructions in the dialog. In this example, no
treatments were applied.
If Overwrite Existing Worksheet is selected in the Missing Data Handling
dialog, XLMiner will overwrite the existing data with the treatment option
specified. Note: you must save the workbook in order for these changes to be
saved.
Click the Ex2 worksheet tab. This dataset is similar to the dataset on the Ex1
worksheet in that it contains empty cells (cells B6 and D10), cells
containing invalid formulas (B13, C8 & D4), cells containing non-numeric
characters (C2), etc. In this example we will see how the missing values can be
replaced by the column mean and median.
To start, select cell A2 and click Transform -- Missing Data Handling on the
XLMiner ribbon to open the Missing Data Handling dialog.
Select Variable_1 in the Variables field, then click the down arrow next to No
Treatment in the section under How do you want to handle missing values for
the selected variable(s) and select Mean.
Click Apply this option to selected variable(s). Now select Variable_3 in the
Variables field, click the down arrow again under How do you want to handle
missing values for the selected variable(s), and select Median. Then click
Apply this option to selected variable(s). Click OK.
As you can see, in the Variable_1 column, invalid or missing values have been
replaced with the mean calculated from the remaining values in the column.
(12.34, 34, 44, -433, 43, 34, 6743, 3, 4 & 3). The cells containing missing
or invalid values in the Variable_3 column have been replaced by the
median of the remaining values in that column (12, 33, 44, 66, 33, 66, 22, 88, 55
& 79). The invalid data for Variable_2 remains, since No Treatment was
selected for this variable.
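Mean and median imputation are simple to reproduce. The sketch below applies them to the valid values quoted above for Variable_1 and Variable_3; it is a conceptual illustration, not XLMiner's code.

```python
from statistics import mean, median

# The valid (non-missing) values from the example's columns.
var1_valid = [12.34, 34, 44, -433, 43, 34, 6743, 3, 4, 3]
var3_valid = [12, 33, 44, 66, 33, 66, 22, 88, 55, 79]

# Each missing or invalid cell is filled with one statistic computed
# from the remaining values in its column.
var1_fill = mean(var1_valid)    # 648.734
var3_fill = median(var3_valid)  # 49.5
```

Note how sensitive the mean is to the outlier 6743; this is one reason the median is often the safer imputation choice for skewed columns.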
Click the Ex3 worksheet tab. In this dataset, Variable_3 has been replaced with
date values.
As shown above, the missing values in the Variable_2 column have been
replaced by the mode of the valid values, even though, in this instance, the data
is non-numeric. (Remember, the mode is the most frequently occurring value in
the Variable_2 column.)
In the Variable_3 column, the third and ninth records contained missing values.
As you can see, they have been replaced by the mode for that column, "2 Feb
01".
Click the Ex4 worksheet tab. Again, this dataset contains missing and invalid
data for all three variables.
Select cell A2 and click Transform -- Missing Data Handling on the XLMiner
ribbon to open the Missing Data Handling dialog. In this example, we will
demonstrate XLMiner's ability to replace missing values with User Specified
Values.
Select Variable_1, then click the down arrow next to No Treatment under How
do you want to handle missing values for the selected variable(s), then select
User specified value. In the field that appears directly to the right of User
specified value, enter 100, then click Apply this option to selected variable(s).
Repeat these steps for Variable_2. Then click OK.
As you can see, the missing values for Variable_1 and Variable_2 have been
replaced by 100, while the values for Variable_3 remain untouched.
Click the Ex5 worksheet tab. In this dataset, the value -999 appears in all three
columns. This example will illustrate XLMiner's ability to detect a given value
and replace it with a user specified value.
Select cell A2 and click Transform -- Missing Data Handling on the XLMiner
ribbon to open the Missing Data Handling dialog.
Select Missing values are represented by this value and enter -999 in the field
that appears directly to the right of the option. Select Variable_1 in the
Variables field and instruct XLMiner to replace the contents of the cells
containing the value -999 with the mean of the remaining values in the column.
Next, select Variable_2 in the Variables field and instruct XLMiner to replace
the contents of the cells containing -999 in this column with "zzz". Finally,
select Variable_3 in the Variables field and instruct XLMiner to replace the
contents of the cells containing -999 in this column with the mode of the
remaining values in the column.
Note that in the Variable_1 column, the specified missing value code (-999) is
replaced by the mean of the column. In the Variable_2 column, the missing
values have been replaced by the user-specified value "zzz", and in the
Variable_3 column by the mode of the column.
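Treating a sentinel such as -999 as a missing value code amounts to filtering it out before computing the replacement statistic. A conceptual sketch follows; the three short columns are made up, and XLMiner of course operates on the worksheet itself.

```python
from statistics import mean, mode

MISSING = -999  # the user-specified missing value code from the example

var1 = [10, -999, 30, 20]       # made-up columns containing the sentinel
var2 = ["a", "b", -999, "c"]
var3 = [5, 5, -999, 7]

def impute(column, fill):
    """Replace every sentinel with fill() applied to the valid values."""
    valid = [v for v in column if v != MISSING]
    replacement = fill(valid)
    return [replacement if v == MISSING else v for v in column]

var1 = impute(var1, mean)             # sentinel -> column mean
var2 = impute(var2, lambda v: "zzz")  # sentinel -> user-specified value
var3 = impute(var3, mode)             # sentinel -> column mode
```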
Let's take a look at one more dataset, Ex6, in Examples.xls.
Select cell A2 and click Transform -- Missing Data Handling to open the
Missing Data Handling dialog then apply the following procedures to the
indicated columns.
A. Select Missing values are represented by this value and enter 33 in
the field that appears directly to the right of the option.
B. Select Delete record for Variable_1's treatment.
C. Select Mode for Variable_2's treatment.
D. Specify the value 9999 for missing/invalid values for Variable_3.
E. Click OK.
As shown above, records 7 and 12 have been deleted since Delete Record was
chosen for the treatment of missing values for Variable_1. In the Variable_2
column, the missing values have been replaced by the mode as indicated in the
Missing Data Handling dialog (shown above), except for record 7, which was
deleted. It is important to note that "Delete record" takes priority over any
other instruction in the Missing Data Handling feature.
In the Variable_3 column, we instructed XLMiner to treat 33 as a missing value.
As a result, 33 and the additional missing values in this column (D4 and D10)
were replaced by the user-specified value of 9999. Note: The value of
Variable_3 for record 12 was 33, which should have been replaced by 9999.
However, since Variable_1 contained a missing value for this record, the
instruction "Delete record" was executed first.
Variables
Each variable and its selected treatment option are listed here.
Reset
Resets treatment to No Treatment for all variables listed in the Variables field.
Also, deselects the Overwrite Existing Worksheet option if selected.
OK
Click to run the Missing Data Handling feature of XLMiner.
Select a cell in the dataset, say A2, and click Transform -- Bin Continuous
Data on the XLMiner ribbon to open the Bin Continuous Data dialog shown
below.
Select x3 in the Variables field. The options are immediately activated. Under
Value in the binned variable is, enter 10 for Start and 3 for Interval, then click
Apply this option to the selected variable. The variable x3 will appear in the
field labeled Name of binned variable.
The next example pins the value of the variable to the mean of the bin rather
than the rank of the bin.
Click back to Sheet1 and select cell A2, then click Transform -- Bin
Continuous Data. Select Mean of the bin, rather than Rank of the bin, for
Value in the binned variable. Leaving all remaining options at their defaults,
click Apply this option to the selected variable then click OK.
In the output, the Binned_x3 variable equals the mean of all the x3 values
assigned to that bin. Let's take the first record as an example. Recall, from the
previous example, the values from Bin 13: 148, 150, 151, 164. The mean of
these values is 153.25 ((148 + 150 + 151 + 164) / 4), which is the value of the
Binned_x3 variable for the first record.
Similarly, if we were to select the Median of the bin option, the Binned_x3
variable would equal the median of all x3 variables assigned to each bin.
The next example explores the Equal interval option.
Click back to Sheet1 and select any cell in the dataset, say, A2, then click
Transform -- Bin Continuous Data on the XLMiner ribbon. Select x3 in the
Variables field, enter 4 for #bins for the variable, select Equal interval under
Bins to be made with, enter 12 for Start and 3 for Interval under Value in the
binned variable is, then click Apply this option to the selected variable.
XLMiner calculates the interval as (Maximum value for the x3 variable -
Minimum value for the x3 variable) / #bins specified by the user, or in this
instance (252 - 96) / 4, which equals 39. This means that the bins will be
assigned x3 values in accordance with the following rules.
Bin 12: Values 96 <= x < 136
Bin 15: Values 136 <= x < 174
Bin 18: Values 174 <= x <= 213
Bin 21: Values 214 <= x <= 252
In the first record, x3 has a value of 151. As a result, this record has been
assigned to Bin 15 since 151 lies in the interval of Bin 15.
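The interval rule above can be sketched in a few lines of Python. This is a hypothetical illustration of equal-interval binning, not XLMiner's actual code; the bin labels start at 12 and step by 3 to mirror the Start and Interval settings in this example.

```python
def equal_interval_bin(value, vmin, vmax, n_bins, start=12, step=3):
    """Assign a value to an equal-interval bin and return its label.

    Interval width = (max - min) / n_bins; labels begin at `start`
    and increase by `step`, mirroring the dialog settings above.
    """
    width = (vmax - vmin) / n_bins                         # (252 - 96) / 4 = 39
    index = min(int((value - vmin) // width), n_bins - 1)  # clamp the max value into the last bin
    return start + step * index

# The record with x3 = 151 falls in the second bin, labeled 15.
print(equal_interval_bin(151, 96, 252, 4))  # 15
```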
Click back to Sheet1 and select any cell in the dataset, say, A2, then click
Transform -- Bin Continuous Data on the XLMiner ribbon. Select x3 in the
Variables field, enter 4 for #bins for the variable, select Equal interval under
Bins to be made with, select Mid Value for Value in the binned variable is, then
click Apply this option to the selected variable.
As shown in the output above, XLMiner created 4 bins with intervals from 90 to
130 (Bin 1), 130 - 170 (Bin 2), 170 - 210 (Bin 3), and 210 - 253 (Bin 4). The
value of the binned variable is the midpoint of each interval: 110 for Bin 1, 150
for Bin 2, 190 for Bin 3 and 210 for Bin 4. In the first record, x3's value is 151.
Since this value lies in the interval for Bin 2 (130 - 170), the mid value of this
interval is reported for the Binned_x3 variable, 150. In the last record, x3's
value is 174. Since this value lies in the interval for Bin 3 (170 - 210), the mid
value of this interval is reported for the Binned_x3 variable, 190.
Equal count
When this option is selected, the binning procedure will assign an equal number
of records to each bin. Note: The number of records in a bin may not be exactly
equal due to factors such as border values or the number of records not being
evenly divisible by the number of bins. The options for Value of the
binned variable for this process are Rank, Mean, and Median. See below for
explanations of each.
Equal interval
When this option is selected, the binning procedure will assign records to bins if
the record's value falls in the interval of the bin. Bin intervals are calculated by
subtracting the Minimum variable value from the Maximum variable value and
dividing by the number of bins ((Max Value - Min Value) / # bins). The
options for Value of the binned variable for this process are Rank and Mid value.
See below for explanations of each.
Mid Value
When the Equal Interval option is selected, this option is enabled. The mid
value of the interval will be displayed on the output report for the assigned bin.
Select Species_name in the Variables field and then > to move the variable to
the Variables to be factored field.
Click OK and view the output on the CategoryVar1 worksheet (inserted directly
after Sheet1.)
Select Species_name in the Variables field and click > to move the variable to
the Variables to be factored field. Keep the default option of Assign numbers
1,2,3....
Click OK and view the results on the CategoryVar2 worksheet which is inserted
directly to the right of the Sheet1 and CategoryVar1 worksheets.
XLMiner has sorted the values of the Species_name variable alphabetically and
then assigned values of 1, 2 or 3 to each record depending on the species type.
(Starting from 1 because we selected Assign numbers 1,2,3.... To have XLMiner
start from 0, select the option Assign numbers 0,1,2... on the Create Category
Scores dialog.) A variable, Species_name_ord, is created to store these assigned
numbers. Again, XLMiner has converted this dataset to an entirely numeric
dataset.
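The alphabetical sort-and-number behavior described above can be sketched in Python (a hypothetical illustration, not XLMiner's code):

```python
def category_scores(values, start=1):
    """Map each distinct category to an integer, assigned in sorted
    (alphabetical) order, starting from `start` (1 here; 0 if the
    'Assign numbers 0,1,2...' option is chosen)."""
    codes = {cat: start + i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

species = ["setosa", "virginica", "setosa", "versicolor"]
print(category_scores(species))  # [1, 3, 1, 2]
```

The resulting list plays the role of the Species_name_ord column: every record now carries a numeric code in place of the text category.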
Open the Iris.xls example dataset by clicking Help -- Examples on the XLMiner
ribbon. Select a cell within the dataset, say cell A2, then click Transform --
Transform Categorical Data -- Reduce Categories to open the XLMiner
Reduce Categories dialog.
Select Petal_length as the variable, then select the Manually radio button under
the Limit to 30 categories heading. In the Categories in selected variable box on
the right, all unique values of this variable are listed. Select all categories with
values less than 2, choose 1 for Category Number (under Assign Category ID),
then click Apply. Repeat these steps for categories with values from 3 to 3.9
and apply a Category Number of 2. Continue repeating these steps until
categories from 4 through 4.9 are assigned a Category Number of 3, categories
from 5 through 5.9 are assigned a Category Number of 4, and categories from
6 through 6.9 are assigned a Category Number of 5.
Note: XLMiner is limited to 30 categories. If you pick By Frequency,
XLMiner assigns category numbers 1 through 29 to the most frequent 29 unique
values; and category number 30 to all other unique values. If you pick Manually,
XLMiner lets you map unique values to categories. You can pick multiple
unique values and map them to a single new category.
In the output, XLMiner has assigned new categories, as shown in the column
Petal_length CatNo, based on the choices made in the Reduce Categories
dialog.
As you can see on the ReduceCat2 worksheet, XLMiner has classified the
Petal_length variable using 30 different categories. The values of 1.4 and 1.5
appear the most frequently (13 occurrences each) and have thus been labeled as
categories 1 and 2, respectively. The values of 4.5 and 5.1 each occur 8 times
and have thus been assigned to categories 3 and 4, respectively. The values of
1.3 and 1.6 occur seven times and, as a result, have been assigned to categories 5
and 6, respectively. Incremental category numbers are assigned to each
decreasing block of values until the 29th category is assigned. All remaining
values are then lumped into the last category, 30.
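The By Frequency behavior can be sketched in Python. This is a hypothetical illustration; in particular, how XLMiner breaks ties between equally frequent values (such as 1.4 and 1.5 above) may differ from this sketch.

```python
from collections import Counter

def reduce_by_frequency(values, limit=30):
    """Assign category numbers 1..limit-1 to the most frequent distinct
    values; every remaining distinct value shares category `limit`."""
    ranked = [v for v, _ in Counter(values).most_common(limit - 1)]
    mapping = {v: i + 1 for i, v in enumerate(ranked)}
    return [mapping.get(v, limit) for v in values]

# With limit=3: 'a' (3 occurrences) -> 1, 'b' (2) -> 2, all others -> 3.
print(reduce_by_frequency(["a", "a", "a", "b", "b", "c", "d"], limit=3))
```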
Data Range
Either type the cell address directly into this field or, using the reference button,
select the required data range from the worksheet. If the cell pointer (active cell)
is already somewhere in the data range, XLMiner automatically picks up the
contiguous data range surrounding the active cell. When the data range is
selected, XLMiner displays the number of records in the selected range.
Variables
This list box contains the names of the variables in the selected data range. To
select a variable, simply click to highlight, then click the > button. Use the
CTRL key to select multiple variables.
Options
The user can specify the number with which to start categorization: 0 or 1.
Select the appropriate option.
Category Number
After manually selecting values from the list box, pick the category number to
assign. Click Apply to apply this mapping, or Reset to start over.
Consider a data matrix X with 4 rows and 3 columns:
X11 X12 X13
X21 X22 X23
X31 X32 X33
X41 X42 X43
1. The first step in reducing the number of columns (variables) in the X matrix
using the Principal Components Analysis algorithm is to find the mean of
each column.
(X11 + X21 + X31 + X41)/4 = Mu1
(X12 + X22 + X32 + X42)/4 = Mu2
(X13 + X23 + X33 + X43)/4 = Mu3
2. Next, the algorithm subtracts the column mean (Mu) from each element,
obtaining a new, mean-centered matrix (call it X'), which also contains 4
rows and 3 columns.
X11 - Mu1 = X'11
X12 - Mu2 = X'12
X13 - Mu3 = X'13
X21 - Mu1 = X'21
X22 - Mu2 = X'22
X23 - Mu3 = X'23
X31 - Mu1 = X'31
X32 - Mu2 = X'32
X33 - Mu3 = X'33
X41 - Mu1 = X'41
X42 - Mu2 = X'42
X43 - Mu3 = X'43
3.
4.
5.
6.
7.
The original matrix X, which has 4 rows and 3 columns, will be multiplied by
the V matrix, containing 3 rows and 2 columns. This matrix multiplication
results in the new, reduced Y matrix, containing 4 rows and 2 columns,
where the coefficient vectors l1, l2, ... etc. are chosen such that they satisfy the
following conditions:
First Principal Component = Linear combination l1'X that maximizes Var(l1'X)
and || l1 || =1
Second Principal Component = Linear combination l2'X that maximizes
Var(l2'X) and || l2 || =1
and Cov(l1'X , l2'X) =0
jth Principal Component = Linear combination lj'X that maximizes Var(lj'X) and
|| lj || =1
and Cov(lk'X , lj'X) =0 for all k < j
These functions indicate that the principal components are those linear
combinations of the original variables which maximize the variance of the linear
combination and which have zero covariance (and hence zero correlation) with
the previous principal components.
It can be proved that there are exactly p such linear combinations. However,
typically, the first few principal components explain most of the variance in the
original data. As a result, instead of working with all the original variables X1,
X2, ..., Xp, you would typically first perform PCA and then use only the first two
or three principal components, say Y1 and Y2, in a subsequent analysis.
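The procedure above can be sketched with a short Python/NumPy function. This is a generic illustration of PCA via an eigen-decomposition, under the assumption that the correlation-matrix method standardizes each column before the decomposition; it is not XLMiner's actual implementation.

```python
import numpy as np

def pca_scores(X, k=2, use_correlation=True):
    """Project the rows of X onto the first k principal components."""
    Z = X - X.mean(axis=0)                  # subtract each column's mean (steps 1-2)
    if use_correlation:
        Z = Z / Z.std(axis=0, ddof=1)       # standardize columns for the correlation method
    C = np.cov(Z, rowvar=False)             # covariance (= correlation when standardized)
    vals, vecs = np.linalg.eigh(C)          # eigen-decomposition of the symmetric matrix
    order = np.argsort(vals)[::-1]          # order components by explained variance, descending
    return Z @ vecs[:, order[:k]]           # project the data onto the top k eigenvectors

X = np.array([[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 0.0], [4.0, 3.0, 2.0]])
Y = pca_scores(X, 2)   # 4 rows reduced from 3 columns to 2 uncorrelated components
```

By construction the resulting component columns are uncorrelated, and the first one carries the largest share of the variance, matching the conditions listed above.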
Select a cell within the dataset, say A2, then click Transform -- Principal
Components on the XLMiner ribbon to open the Principal Components dialog
shown below.
Select variables x1 to x8, then click the > command button to move them to the
Input variables field. (Perform error based clustering is not supported in the
PCA algorithm and is disabled). The figure below shows the first dialog box of
the Principal Components Analysis method. Click Next.
The top section of the PCA_Output1 spreadsheet simply displays the number of
principal components created (as selected in the Step 2 of 3 dialog above), the
number of records in the dataset and the method chosen, Correlation matrix
(also selected in the Step 2 of 3 dialog).
Click back to the Data worksheet, select any cell in the dataset, and then click
Transform -- Principal Components. Variables x1 through x8 are already selected,
so simply click Next on this dialog to advance to the Step 2 of 3 dialog.
As you can see from the output worksheet PCA_Output2, only the first two
components are included in the output file since these two components account
for over 50% of the variation.
After applying the Principal Components Analysis algorithm, users may proceed
to analyze their dataset by applying additional data mining algorithms featured
in XLMiner.
Principal Components
Select the number of principal components displayed in the output.
Fixed # of components
Specify a fixed number of components by selecting this option and entering an
integer value from 1 to n where n is the number of Input variables selected in
the Step 1 of 3 dialog.
Smallest #components explaining
Select this option to specify a percentage. XLMiner will calculate the minimum
number of principal components required to account for that percentage of
variance.
Method
To compute Principal Components, the data matrix is multiplied by a
transformation matrix. This option lets you specify how this transformation
matrix is calculated.
Use Covariance matrix
The covariance matrix is a square, symmetric matrix of size n x n (number of
variables by number of variables). The diagonal elements are variances and the
off-diagonals are covariances. The eigenvalues and eigenvectors of the
covariance matrix are computed and the transformation matrix is defined as the
transpose of this eigenvector matrix. If the covariance method is selected, the
dataset should first be normalized. One way to organize the data is to divide
each variable by its standard deviation. Normalizing gives all variables equal
importance in terms of variability. 1
Use Correlation matrix
An alternative method is to derive the transformation matrix on the eigenvectors
of the correlation matrix instead of the covariance matrix. The correlation
matrix is equivalent to a covariance matrix for the data where each variable has
been standardized to zero mean and unit variance. This method tends to
equalize the influence of each variable, inflating the influence of variables with
relatively small variance and reducing the influence of variables with high
variance.
1 Shmueli, Galit, Nitin R. Patel, and Peter C. Bruce. Data Mining for Business Intelligence. 2nd ed. New Jersey: Wiley, 2010.
k-Means Clustering
Introduction
Cluster Analysis, also called data segmentation, has a variety of goals which all
relate to grouping or segmenting a collection of objects (also called
observations, individuals, cases, or data rows) into subsets or "clusters". These
clusters are grouped in such a way that the observations included in each
cluster are more closely related to one another than objects assigned to different
clusters. The most important goal of cluster analysis is the notion of the degree
of similarity (or dissimilarity) between the individual objects being clustered.
There are two major methods of clustering -- hierarchical clustering and
k-means clustering. (See the Hierarchical Clustering chapter for information on
that type of clustering analysis.)
This chapter explains the k-Means Clustering algorithm. The goal of this
process is to divide the data into a set number of clusters (k) and to assign each
record to a cluster while minimizing the dispersion within each cluster. A
non-hierarchical approach to forming good clusters is to specify a desired number of
clusters, say, k, then assign each case (object) to one of k clusters so as to
minimize a measure of dispersion within the clusters. A very common measure
is the sum of distances or sum of squared Euclidean distances from the mean of
each cluster. The problem can be set up as an integer programming problem but
because solving integer programs with a large number of variables is time
consuming, clusters are often computed using a fast, heuristic method that
generally produces good (but not necessarily optimal) solutions. The k-means
algorithm is one such method.
Select a cell within the dataset, say A2, and then click Cluster -- k-Means
Clustering on the XLMiner ribbon. The following dialog box will appear.
Select all variables under Variables in data source except Type, then click the >
button to shift the selected variables to the Input variables field.
Click Next.
On the Step 3 of 3 dialog, select Show data summary (default) and Show
distances from each cluster center (default). Then click Finish.
The K-Means Clustering method will start with k initial clusters as specified by
the user. At each iteration, the records are assigned to the cluster with the closest
centroid, or center. After each iteration, the distance from each record to the
center of the cluster is calculated. These two steps (record assignment and
distance calculation) are repeated until reassigning a record would only result in
an increased distance value.
When the user specifies a random start, the algorithm generates the k cluster
centers randomly and fits the data points in those clusters. This process is
repeated for as many random starts as the user specifies. The output will be
based on the clusters that exhibit the best fit.
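The iterative procedure and the random-starts logic described above can be sketched in plain Python. This is a minimal version of the standard k-means (Lloyd's) heuristic with a fixed seed, not XLMiner's implementation; it runs a fixed number of iterations per start rather than testing convergence.

```python
import random

def kmeans(points, k, n_starts=5, n_iter=20, seed=12345):
    """k-means with several random starts; keep the clustering whose total
    within-cluster squared distance (the 'best start') is lowest."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centers = rng.sample(points, k)        # random initial cluster centers
        for _ in range(n_iter):
            # assign each record to the cluster with the closest centroid
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                clusters[j].append(p)
            # recompute each centroid as the mean of its cluster
            centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers) for p in points)
        if best is None or sse < best[0]:
            best = (sse, centers)              # remember the best start
    return best                                # (sum of squared distances, centers)
```

For example, clustering the six points (0,0), (0,1), (1,0), (10,10), (10,11), (11,10) with k = 2 recovers the two obvious groups, with centroids near (0.33, 0.33) and (10.33, 10.33).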
The worksheet, KM_Output1 is inserted immediately after the Description
worksheet. In the top section of the output worksheet, the options that were
selected are listed.
In the middle section of the output worksheet, XLMiner has calculated the sum
of the squared distances and has determined the start with the lowest Sum of
Square Distance as the Best Start (#3). After the Best Start is determined,
XLMiner generates the remaining output using the Best Start as the starting
point.
In the bottom portion of the output worksheet, XLMiner has listed the "cluster
centers" (shown below). The upper box shows the variable values at the cluster
centers. As you can see, the first cluster has the highest average
Nonflavanoid_Phenol and Ash_Alcalinity, and a very high Magnesium average.
Compare this cluster to Cluster 5 which has the highest average Proline,
Flavanoids, and Color_Intensity, and a very high Malic_Acid average.
The lower box shows the distance between the cluster centers. From the values
in this table, we can discern that cluster 5 is very different from cluster 6 due to
the high distance value of 1,484.51 and cluster 7 is very close to cluster 3 (low
distance value of 31.27). It is possible that these two clusters could be merged
into one cluster.
Clustering Method
At this time, XLMiner supports standard clustering only.
# Clusters
Enter the number of final clusters (k) to be formed here. The number of clusters
should be at least 2 and at most the number of observations in the data range.
This value should be based on your knowledge of the data and the number of
projected clusters. It is a good idea to repeat the procedure with several
different k values.
# Iterations
Enter the number of times the program will perform the clustering algorithm.
The configuration of clusters (and how good a job they do of separating the
data) may differ from one starting partition to another. The algorithm will
complete the specified number of iterations and select the cluster configuration
that minimizes the distance measure.
Options
If Fixed start is selected, XLMiner will start building the model with a single
fixed starting point. If Random starts is selected, the algorithm will start at any
random point.
If Random starts is selected, two additional options are enabled: No. of Starts
and Seed. Enter the desired number of starts for No. of Starts. XLMiner will
complete the desired number of starts and generate the output for the best
cluster set. To enter a fixed seed, select Fixed and then enter an integer value.
Hierarchical Clustering
Introduction
Cluster Analysis, also called data segmentation, has a variety of goals. All
relate to grouping or segmenting a collection of objects (also called
observations, individuals, cases, or data rows) into subsets or "clusters", such
that those within each cluster are more closely related to one another than
objects assigned to different clusters. The most important goal of cluster
analysis is the notion of degree of similarity (or dissimilarity) between the
individual objects being clustered. There are two major methods of clustering --
hierarchical clustering and k-means clustering. (See the k-means clustering
chapter for information on this type of clustering analysis.)
In hierarchical clustering the data are not partitioned into a particular cluster in
a single step. Instead, a series of partitions takes place, which may run from a
single cluster containing all objects to n clusters each containing a single object.
Hierarchical Clustering is subdivided into agglomerative methods, which
proceed by a series of fusions of the n objects into groups, and divisive
methods, which separate n objects successively into finer groupings. The
hierarchical clustering technique employed by XLMiner is an Agglomerative
technique. Hierarchical clustering may be represented by a two dimensional
diagram known as a dendrogram which illustrates the fusions or divisions made
at each successive stage of analysis. An example of such a dendrogram is given
below:
Agglomerative methods
An agglomerative hierarchical clustering procedure produces a series of
partitions of the data, Pn, Pn-1, ..., P1. The first, Pn, consists of n single-object
'clusters'; the last, P1, consists of a single group containing all n cases.
At each particular stage the method joins the two clusters which are closest
together (most similar). (At the first stage, this amounts to joining together the
two objects that are closest together, since at the initial stage each cluster has
one object.)
In complete linkage clustering, the distance between two
groups is defined as the distance between the most distant pair of objects, one
from each group.
In the complete linkage method, D(r,s) is computed as
D(r,s) = Max { d(i,j) : Where object i is in cluster r and object j is cluster s }
Here the distance between every possible object pair (i,j) is computed, where
object i is in cluster r and object j is in cluster s and the maximum value of these
distances is said to be the distance between clusters r and s. In other words, the
distance between two clusters is given by the value of the longest link between
the clusters.
At each stage of hierarchical clustering, the clusters r and s, for which D(r,s) is
minimum, are merged.
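The D(r,s) formula translates directly into Python. This is a generic sketch of the complete-linkage distance using Euclidean distance between objects, not XLMiner's code.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def complete_linkage(cluster_r, cluster_s):
    """Distance between two clusters = the longest link, i.e. the maximum
    distance over all pairs (i, j) with i in cluster r and j in cluster s."""
    return max(dist(i, j) for i in cluster_r for j in cluster_s)

# The longest link here is (0,2)-(5,0), so D(r,s) = sqrt(29).
print(complete_linkage([(0, 0), (0, 2)], [(3, 0), (5, 0)]))
```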
The measure is illustrated in the figure below:
For example, the loss of information that would result from treating the ten
scores as one group with a mean of 2.5 is represented by the error sum of
squares (ESS), given by
ESS One group = (2 - 2.5)^2 + (6 - 2.5)^2 + ... + (0 - 2.5)^2 = 50.5
On the other hand, if the 10 objects are classified according to their scores into
four sets,
{0,0,0}, {2,2,2,2}, {5}, {6,6}
the ESS can be evaluated as the sum of four separate error sums of squares:
ESS Four groups = ESSgroup1 + ESSgroup2 + ESSgroup3 + ESSgroup4 = 0.0
Clustering the 10 scores into 4 clusters results in no loss of information.
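The arithmetic above can be checked with a tiny, generic Python helper (not XLMiner's code):

```python
def ess(group):
    """Error sum of squares: squared deviations from the group mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)

scores = [2, 6, 6, 5, 2, 2, 2, 0, 0, 0]   # the ten scores, mean = 2.5
print(ess(scores))                                           # 50.5 as one group
print(sum(ess(g) for g in [[0, 0, 0], [2, 2, 2, 2], [5], [6, 6]]))  # 0.0 as four groups
```

Each of the four groups is internally constant, so every group's ESS is zero, which is why the four-cluster split loses no information.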
Select any cell in the dataset, for example A2, then click Cluster --
Hierarchical Clustering to bring up the Hierarchical Clustering dialog.
Select variables x1 through x8 in the Variables field, then click > to move the
selected variables to the Selected variables field.
At the top of the dialog, select Normalize input data. When this option is
selected, XLMiner will normalize the data by subtracting the variable's mean
from each observation and dividing by the standard deviation. Normalizing the
data is important to ensure that the distance measure accords equal weight to
each variable -- without normalization, the variable with the largest scale will
dominate the measure.
This output details the history of the cluster formation. Initially, each individual
case is considered its own cluster (single member in each cluster). Since there
are 21 records, XLMiner begins the method with # clusters = # cases. At stage
1, above, clusters (i.e. cases) 10 and 13 were found to be closer together than
any other two clusters (i.e. cases), so they are joined together in a cluster called
Cluster 10. At this point there is one cluster with two cases (cases 10 and 13),
and 19 additional clusters that still have just one case in each. At stage 2,
clusters 7 and 12 are found to be closer together than any other two clusters, so
they are joined together into cluster 7.
This process continues until there is just one cluster. At various stages of the
clustering process, there are different numbers of clusters. A graph called a
dendrogram illustrates these steps.
In the above dendrogram, the Sub Cluster IDs are listed along the x-axis (in an
order convenient for showing the cluster structure). The y-axis measures
inter-cluster distance. Consider cases 10 and 13 -- they have an inter-cluster distance
of 1.51. No other cases have a smaller inter-cluster distance, so 10 and 13 are
joined into one cluster, indicated by the horizontal line linking them. Next, we
see that cases 7 and 12 have the next smallest inter-cluster distance, so they are
joined into one cluster. The next smallest inter-cluster distance is between
clusters 14 and 19 and so on.
If we draw a horizontal line through the diagram at any level on the y-axis (the
distance measure), the vertical cluster lines that intersect the horizontal line
indicate clusters whose members are at least that close to each other. If we draw
a horizontal line at distance = 2.3, for example, we see that there are 11 clusters.
In addition, we can see that a case can belong to multiple clusters, depending on
where we draw the line (i.e. how close we require the cluster members to be to
each other).
For purposes of assigning cases to clusters, we must specify the number of
clusters in advance. In this example, we specified a limit of 4.
If the number of training rows exceeds 30 then the dendrogram also displays
Cluster Legends.
The HC_Clusters1 output worksheet includes the following table.
This table displays the assignment of each record to the four clusters.
This next example illustrates Hierarchical Clustering when the data represents the
distance between the ith and jth records. (When applied to raw data, Hierarchical
clustering converts the data into the distance matrix format before proceeding
with the clustering algorithm. Providing the distance measures in the data
requires one less step for the Hierarchical clustering algorithm.)
Select a cell in the database, say A2, click Help -- Examples on the XLMiner
ribbon and open the file DistMatrix.xls. Then click Cluster -- Hierarchical
Clustering to open the following dialog.
Click the down arrow next to Data Type and select Distance Matrix.
Again, select Average group linkage as the Clustering method. Then click
Next.
Select Draw dendrogram (default) and Show cluster membership (default)
and enter 4 for # Clusters.
2. The algorithm makes only one pass through the dataset. As a result, records
that are assigned erroneously will not be reassigned later in the process.
3.
4.
Data Type
The Hierarchical clustering method can be used on raw data as well as the data
in Distance Matrix format. Choose the appropriate option to fit your dataset.
Similarity Measures
Hierarchical clustering uses the Euclidean distance as the similarity
measure for working on raw numeric data. When the data is binary, the
remaining two options, Jaccard's coefficient and the Matching coefficient, are
enabled.
Suppose we have binary values for all the xij's. See the table below for
individuals i and j.
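Assuming the usual definitions (a = number of positions where both records are 1, d = number where both are 0, and b, c the mismatches), the two coefficients can be sketched as:

```python
def binary_similarity(x, y):
    """Jaccard and simple matching coefficients for two binary vectors.

    Jaccard  = a / (a + b + c)  -- ignores 0-0 agreements
    Matching = (a + d) / n      -- counts all agreements
    """
    a = sum(1 for i, j in zip(x, y) if i == 1 and j == 1)  # 1-1 matches
    d = sum(1 for i, j in zip(x, y) if i == 0 and j == 0)  # 0-0 matches
    n = len(x)
    return a / (n - d), (a + d) / n

print(binary_similarity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.5, 0.6)
```

This is a generic illustration of the two standard coefficients, not XLMiner's implementation.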
Clustering Method
See the introduction to this chapter for descriptions of each method.
Draw Dendrogram
Select this option to have XLMiner create a dendrogram to illustrate the
clustering process.
# Clusters
Recall that the agglomerative method of hierarchical clustering continues to
form clusters until only one cluster is left. This option lets you stop the process
at a given number of clusters.
Autocorrelation (ACF)
Autocorrelation (ACF) is the correlation between neighboring observations in a
time series. When determining if an autocorrelation exists, the original time
series is compared to the lagged series. This lagged series is simply the
original series moved one time period forward (xn vs. xn+1). Suppose there are 5
time-based observations: 10, 20, 30, 40, and 50. When lag = 1, the original
series is moved forward one time period. When lag = 2, the original series is
moved forward two time periods.
Day   Observed Value   Lag-1   Lag-2
1     10
2     20               10
3     30               20      10
4     40               30      20
5     50               40      30
rk = [ sum from t = k+1 to n of (Yt - Ybar)(Yt-k - Ybar) ] / [ sum from t = 1 to n of (Yt - Ybar)^2 ]
where k = 0, 1, 2, ..., n.
Here Yt is the Observed Value at time t, Ybar is the mean of the Observed
Values, and Yt-k is the value for Lag-k.
For example, using the values above, the autocorrelation for Lag-1 and Lag-2
can be calculated as follows.
Ybar = (10 + 20 + 30 + 40 + 50) / 5 = 30
r1 = ((20 - 30)(10 - 30) + (30 - 30)(20 - 30) + (40 - 30)(30 - 30) + (50 - 30)(40 - 30))
/ ((10 - 30)^2 + (20 - 30)^2 + (30 - 30)^2 + (40 - 30)^2 + (50 - 30)^2) = 0.4
r2 = ((30 - 30)(10 - 30) + (40 - 30)(20 - 30) + (50 - 30)(30 - 30))
/ ((10 - 30)^2 + (20 - 30)^2 + (30 - 30)^2 + (40 - 30)^2 + (50 - 30)^2) = -0.1
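These values can be reproduced with a short, generic Python function (not XLMiner's implementation):

```python
def acf(series, k):
    """Lag-k autocorrelation: the sum over t of (Y_t - mean)(Y_{t-k} - mean),
    divided by the total sum of squared deviations from the mean."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    den = sum((y - mean) ** 2 for y in series)
    return num / den

y = [10, 20, 30, 40, 50]
print(acf(y, 1))  # 0.4
print(acf(y, 2))  # -0.1
```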
The two red horizontal lines on the graph below delineate the Upper confidence
level (UCL) and the Lower confidence level (LCL). If the data is random, then
the plot should be within the UCL and LCL. If the plot exceeds either of these
two levels, as seen in the plot above, then it can be presumed that some
correlation exists in the data.
ARIMA
An ARIMA (autoregressive integrated moving-average) model is a
regression-type model that includes autocorrelation. The basic assumption in
estimating the ARIMA coefficients is that the data are stationary, that is, the
trend or seasonality does not affect the variance. This is generally not true of
real data. To obtain stationary data, XLMiner will first apply differencing:
ordinary, seasonal, or both.
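Differencing itself is simple to illustrate (a generic sketch, not XLMiner's code): each observation is replaced by its change from the observation `lag` periods earlier, which removes a linear trend (ordinary, lag = 1) or a repeating seasonal pattern (lag = season length).

```python
def difference(series, lag=1):
    """Ordinary differencing (lag = 1) or seasonal differencing
    (lag = season length): y'_t = y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

trend = [3, 5, 7, 9, 11]
print(difference(trend))  # [2, 2, 2, 2] -- a linear trend becomes constant
```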
After XLMiner fits the model, various results will be available. The quality of
the model can be evaluated by comparing the time plot of the actual values with
the forecasted values. If both curves are close, then it can be assumed that the
model is a good fit. The model should capture any trend and seasonality that
exist in the data. If the residuals are random, then the model can be assumed a
good fit.
However, if the residuals exhibit a trend, then the model should be refined.
Fitting an ARIMA model with parameters (0,1,1) will give the same results as
exponential smoothing. Fitting an ARIMA model with parameters (0,2,2) will
give the same results as double exponential smoothing.
Partitioning
To avoid overfitting the data and to be able to evaluate the predictive
performance of the model on new data, we must first partition the data into
training and validation sets, and possibly a test set, using XLMiner's time series
partitioning utility. After the data is partitioned, ACF, PACF, and ARIMA can
be applied to the dataset.
1. The data is first partitioned into two sets, with 60% of the data assigned to
the training set and 40% of the data assigned to the validation set.
2. Exploratory techniques are applied to both the training and validation sets.
If the results are in sync, then the model can be fit. If the ACF and PACF
plots are the same, then the same model can be used for both sets.
3.
4. When we fit a model using the ARIMA method, XLMiner displays the ACF
and PACF plots for the residuals. If these plots lie within the band of the
UCL and LCL, the residuals are random and the model is adequate.
5. If the residuals are not within the bands, then some correlation exists, and
the model should be improved.
First we must perform a partition on the data. Select a cell within the dataset,
such as A2, then click Partition within the Time Series group on the XLMiner
ribbon to open the following dialog.
Select Year under Variables and click > to define the variable as the Time
Variable. Select the remaining variables under Variables and click > to include
them in the partitioned data.
Select Specify #Records under Specify Partitioning Options to specify the
number of records assigned to the training and validation sets. Then, under
Specify #Records for Partitioning, enter 50 for the number of Training Set
records and 21 for the number of Validation Set records.
If Specify Percentages is selected under Specify Partitioning Options, XLMiner
will assign a percentage of records to each set according to the values entered by
the user or automatically entered by XLMiner under Specify Percentages for
Partitioning.
Note in the output above, the partitioning method is sequential (rather than
random). The first 50 observations have been assigned to the training set and
the remaining 21 observations have been assigned to the validation set.
Select a cell on the Data_PartitionTS1 worksheet, then click ARIMA –
Autocorrelations on the XLMiner ribbon to display the ACF dialog.
Select CA as the Selected variable, enter 10 for both ACF Parameters for
Training Data and ACF Parameters for Validation Data. Then select Plot ACF
chart.
Click OK. The worksheet ACF_Output1 will be inserted directly after the
Data_PartitionTS1 worksheet.
Both ACF functions show a definite pattern where the autocorrelation decreases
as the number of lags increases. Since the pattern does not repeat, it can be
assumed that no seasonality is included in the data. In addition, since both
charts exhibit a similar pattern, we can fit the same model to both the validation
and training sets.
Click back to the Data_PartitionTS1 worksheet and click ARIMA -- Partial
Autocorrelations to open the PACF dialog as shown below.
Select CA under Variables in input data, then click > to move the variable to
Selected variable. Enter 40 for Maximum Lag under PACF Parameters for
Training Data and 15 for Maximum Lag under PACF Parameters for Validation
Data. Select Plot PACF chart.
The PACF function shows a definite pattern, which means there is a trend in
the data. However, since the pattern does not repeat, we can conclude that the
data does not show any seasonality. Both plots exhibit similar patterns for
the validation and training sets. As a result, we can use the same model for
both sets.
The ARIMA model accepts three parameters: p, the number of autoregressive
terms; d, the number of non-seasonal differences; and q, the number of lagged
forecast errors (moving average terms).
Recall that the ACF plot showed no seasonality in the data: the
autocorrelation decreases steadily as the number of lags increases. This
suggests setting q = 0, since there appear to be no lagged errors. The PACF
plot displayed a large value for the first lag but minimal values for
successive lags. This suggests setting p = 1. With most datasets, setting
d = 1 is sufficient, or is at least a reasonable starting point.
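For readers who want to verify this reasoning outside of XLMiner, the sample autocorrelation values behind an ACF plot can be sketched in a few lines of Python. This is purely illustrative; the `acf` function below is not part of XLMiner, and XLMiner's own computation may differ in detail.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (illustrative sketch)."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series) / n  # lag-0 autocovariance
    out = []
    for k in range(max_lag + 1):
        ck = sum((series[t] - mean) * (series[t + k] - mean)
                 for t in range(n - k)) / n
        out.append(ck / c0)
    return out

# A slowly decaying ACF (with no repeating pattern) suggests trend without
# seasonality, pointing toward differencing (d >= 1) and q = 0.
trend = [float(i) for i in range(30)]
print(acf(trend, 3))
```

Running this on a pure linear trend shows autocorrelations that start at 1 and decline slowly, the signature described above.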
Click back to the Data_PartitionTS1 worksheet and click ARIMA – ARIMA Model
to bring up the ARIMA dialog shown below.
Select CA under Variables in input data then click > to move the variable to the
Selected Variable field. Under Nonseasonal Parameters set Autoregressive (p)
to 1, Difference (d) to 1 and Moving Average (q) to 0.
On this same worksheet, XLMiner has calculated the constant term and the AR1
term for our model, as seen above. These are the constant and φ1 terms of our
forecasting equation. See the following output of the Chi-square test.
Since the p-value is small, we can conclude that the model is a good fit.
Now open the worksheet ARIMA_Residuals1. This table plots the actual and
fitted values and the resulting residuals. As shown in the graph below, the
Actual and Forecasted values match up fairly well. The usefulness of the model
in forecasting will depend upon how close the actual and forecasted values are
in the Time plot of validation set which we will inspect later.
Let us take a look at the ACF and PACF plots for residuals.
Most of the correlations are within the UCL and LCL band. This indicates that
the residuals are random and uncorrelated, which is the first indication that
the model parameters are adequate for this data.
Open the sheet ARIMA_Output1. See the Forecast table.
The table shows the actual and forecasted values. The "Lower" and "Upper"
values represent the lower and upper bounds of the confidence interval: there
is a 95% chance that the actual value will fall within this range.
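As a rough illustration of how such bounds arise, a 95% interval around a point forecast is typically the forecast plus or minus about 1.96 standard errors. The sketch below assumes a normal approximation; XLMiner's exact standard-error formula is not shown in this guide, so treat the function as a hypothetical helper.

```python
def forecast_interval(forecast, stderr, z=1.96):
    """Approximate 95% lower/upper bounds around a point forecast
    (normal approximation; illustrative only)."""
    return forecast - z * stderr, forecast + z * stderr

lo, hi = forecast_interval(100.0, 10.0)
print(lo, hi)
```

A wider standard error produces a wider "Lower"/"Upper" band, which is why forecasts further into the future carry wider intervals.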
Let us take a look at the Time Plot below. It is plotted with the values in the
table above. It indicates how the model which we fitted using the Training data
performs on the validation data.
The actual and forecasted values are fairly close, which confirms that our
model is suitable for forecasting. To plot the values in the "Lower" and
"Upper" columns on the same chart (using the Excel chart facility), select the
graph and then click Design – Select Data to open the Select Data Source
dialog.
Click Add to open the Edit Series dialog. Enter Lower for Series name and
G52:G72 for the Series values.
Click OK and repeat these steps entering Upper for the Series name and
H52:H72 for the Series values. Then click OK on the Edit Series dialog and
OK again on the Select Data Source dialog to produce the graph below.
The plot shows that the Actual values lie well within the bands created by the
Upper and Lower values in the table. In fact, for the majority of the graph, the
Actual and Forecasted values are located in the center, or very close to the
center, of the two bands. As a result, it can be assumed that we have fitted an
adequate model.
Now let's fit a model to a dataset containing seasonality. Click XLMiner –
Help – Examples and open the Airpass.xlsx example dataset. This is the classic
Box & Jenkins dataset containing monthly totals of international airline
passengers from 1949 to 1960. Clearly, this dataset will contain some
seasonality, as air traffic increases each year during the summer and holiday
seasons. A portion of the dataset is shown below.
First, the data must be partitioned. Click Partition in the Time Series group
on the XLMiner ribbon. Select Month as the Time Variable and Passengers as the
Variable in the Partitioned Data. Select Specify #Records under both Specify
Partitioning Options and Specify #Records for Partitioning, then enter 120 for
the number of records in the Training set. XLMiner will automatically enter
the remaining number of records for the Validation set. Finally, click OK.
Both plots clearly show a repeating pattern, indicating that the data does
contain seasonality. Now let's create the PACF chart.
Click back to the Data_PartitionTS1 worksheet, then click ARIMA – Partial
Autocorrelations. Select Passengers as the Selected variable. Enter 40 for the
PACF Parameters for Training Data Maximum Lag and 20 for the PACF Parameters
for Validation Data Maximum Lag. Select Plot PACF chart, then click OK.
Both plots are similar and both show seasonality. As a result, it can be assumed
that the same model can be applied to both the Training and Validation sets.
Let's try fitting an ARIMA model with the parameters p = 1, d = 1, and q = 0,
i.e. (1,1,0)12. This means that we are applying a seasonal model with
period = 12. The choice of period depends on the nature of the data. In this
case we can make a fair guess that the number of passengers peaks every 12
months, around the holidays.
Click back to the Data_PartitionTS1 worksheet, then click ARIMA – ARIMA
Model. Select Passengers for the Selected variable, select Fit
Seasonal Model, and enter 12 for Period. Under Nonseasonal Parameters,
enter 1 for Autoregressive (p), 1 for Difference (d), and 0 for Moving Average
(q). Under Seasonal Parameters, enter 1 for Autoregressive (P), 1 for
Difference (D), and 0 for Moving Average (Q). Then click Advanced.
The small p-values are our first indication that our model is a good fit to the
data.
Scroll down to the Forecast table. This table holds the actual and forecasted
values as predicted by the model. The "Lower" and "Upper" values represent the
95% confidence interval in which the forecasted values lie.
The time plot below graphs the values in the Forecast table above and indicates
how well the model performs. The actual and forecasted values are fairly close,
though not as close as in the earlier example. Still, this is a second indication
that this model fits well.
Click the graph and select Design – Select Data. Click Add, then enter Lower
for Series name and select cells G59:G82 for Series values.
Then click OK. Repeat the steps above using Upper as the Series name and cells
H59:H82 as the Series values. Click OK on the Edit Series dialog and OK
again on the Select Data Source dialog. The updated graph is shown below.
The plot shows that the Actual values lie well inside the band created by the
Upper and Lower values of the 95% confidence interval. In fact, the Actual and
Forecasted values appear near the center of the range for the majority of the
graph. This is our third indication that our model is a good fit.
Now open the worksheet ARIMA_Residuals1. This table plots the actual and
fitted values and the resulting residuals. As you can see in the graph below, the
Actual and Forecasted values match up fairly well. This is yet another
indication that our model is performing well.
Scrolling down, we find the ACF and PACF plots. In both plots, most of the
residuals are within the UCL and LCL range which indicates that the residuals
are random and not correlated which, once more, suggests a good fit.
The options below appear on the Time Series Partition Data dialog.
Time variable
Select a time variable from the available variables and click the > button. If a
Time Variable is not selected, XLMiner will assign one to the partitioned data.
Selected Variable
Select a variable under Variables in input data.
Lags
Specify the number of desired lags here. XLMiner will display the ACF output
for all lags between 0 and the specified number.
Selected variable
The selected variable appears here.
Time Variable
Select the desired Time Variable by clicking the > button.
Period
Seasonality in a dataset appears as patterns at specific periods in the time series.
Enter 12 if the seasonality only appears once in a year. Enter 6 if the seasonality
appears twice in one year.
Nonseasonal Parameters
Enter the nonseasonal parameters here for Autoregressive (p), Difference (d),
and Moving Average (q).
Seasonal Parameters
Enter the seasonal parameters here for Autoregressive (P), Difference (D), and
Moving Average (Q).
Variance-covariance matrix
XLMiner will include the variance-covariance matrix in the output if this option
is selected.
Produce forecasts
If this option is selected, XLMiner will display the desired number of forecasts.
If the data has been partitioned, XLMiner will display the forecasts on the
validation data.
Smoothing Techniques
Introduction
Data collected over time is likely to show some form of random variation.
"Smoothing techniques" can be used to reduce or cancel the effect of these
variations. These techniques, when properly applied, will smooth out the
random variation in the time series data to reveal any underlying trends that may
exist.
XLMiner features four different smoothing techniques: Exponential, Moving
Average, Double Exponential, and Holt Winters. The first two techniques,
Exponential and Moving Average, are relatively simple smoothing techniques
and should not be performed on datasets involving seasonality. The last two
techniques are more advanced techniques which can be used on datasets
involving seasonality.
Exponential smoothing
Exponential smoothing is one of the more popular smoothing techniques due to
its flexibility, ease of calculation, and good performance. As in Moving
Average Smoothing, a simple average calculation is used. Exponential
Smoothing, however, assigns exponentially decreasing weights starting with the
most recent observations. In other words, newer observations are given
relatively more weight in the average calculation than older observations.
In this smoothing technique, a calculated value Si stands for the smoothed
observation corresponding to the original observation xi. The subscripts refer
to the time periods 1, 2, ..., n. The smoothing constant is denoted by a,
where 0 <= a <= 1. The value of this constant determines the weights assigned
to the observations.
For the first period: S1 = x1.
The smoothed series starts with the smoothed version of the second observation,
S2 = ax2 + (1-a)S1.
For the third period, S3 = ax3 + (1-a)S2, and so on.
For any time period i, the smoothed value Si is found by computing
Si = a*xi + (1-a)*Si-1. This formula can be rewritten as Ft = Ft-1 + a*Et-1,
where F is the forecast and E is the distance from the forecast to the actual
observed value. (E is otherwise known as the residual.)
Since the previous forecast and the previous forecast's residual are included
in the current period's forecast, if the previous period's forecast was too
high, the current period's forecast will be adjusted downward. Conversely, if
the previous period's forecast was too low, the current period's forecast will
be adjusted upward. The smoothing parameter, a, determines the magnitude of
the adjustment.
As with Moving Average Smoothing, Exponential Smoothing should only be
used when the dataset contains no seasonality. The forecast will be a constant
value which is the smoothed value of the last observation.
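The recursion described above is short enough to sketch directly. The Python function below is purely illustrative (XLMiner performs this calculation internally); the default alpha = 0.2 mirrors the dialog's default Level parameter.

```python
def exp_smooth(series, alpha=0.2):
    """Exponential smoothing: S1 = x1, Si = a*xi + (1-a)*S(i-1).
    alpha=0.2 mirrors XLMiner's default Level parameter."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

data = [10.0, 12.0, 11.0, 13.0]
print(exp_smooth(data, alpha=0.5))  # [10.0, 11.0, 11.0, 12.0]
```

Note how each smoothed value is a weighted blend of the newest observation and the previous smoothed value, which is exactly the downward/upward adjustment described above.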
Error measures (where xt is the actual value, Ft the forecasted value, and n
the number of observations):
Mean Absolute Percent Error:
MAPE = (100/n) Σ |(xt - Ft) / xt|, summed over t = 1, ..., n
Mean Absolute Deviation:
MAD = (1/n) Σ |xt - Ft|, summed over t = 1, ..., n
Mean Square Error:
MSE = (1/n) Σ (xt - Ft)^2, summed over t = 1, ..., n
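The three error measures can be computed directly from the actual and forecasted values. The Python sketch below is illustrative only; XLMiner reports these measures automatically in its output.

```python
def error_measures(actual, forecast):
    """MAPE, MAD and MSE, with x_t = actual and F_t = forecast."""
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    mape = 100.0 / n * sum(abs(e / a) for e, a in zip(errors, actual))
    mad = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return mape, mad, mse

mape, mad, mse = error_measures([100.0, 200.0], [110.0, 190.0])
print(mape, mad, mse)  # 7.5 10.0 100.0
```

MSE penalizes large errors more heavily than MAD, while MAPE expresses the error relative to the size of each actual value.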
Select Month as the Time Variable and Passengers as the Variables in the
partitioned data. Then click OK to partition the data into training and
validation sets.
Click the Data_PartitionTS1 worksheet, then click Smoothing – Exponential to
open the Exponential Smoothing dialog, as shown below.
Month has already been selected as the Time Variable. Select Passengers as
the Selected variable, and also select Give Forecast on validation.
Now let's take a look at an example that does not include seasonality. Open
the Income.xlsx example dataset. This dataset contains the average income of
tax payers by state. First, partition the dataset into training and validation
sets using Year as the Time Variable and CA as the Variable in the partitioned
data. Then click OK to accept the partitioning defaults and create the two
sets (Training and Validation). The worksheet Data_PartitionTS1 will be
inserted immediately following the Description worksheet.
Click the Data_PartitionTS1 worksheet, then click Smoothing – Exponential
from the XLMiner ribbon to open the Exponential Smoothing dialog.
If we instead select Optimize, then click OK, XLMiner will choose an Alpha of
1, which results in an MSE of 22,548.69 for the Training Set and an MSE of
193,113,481 for the Validation Set. Using the Optimize algorithm results in a
better model in this instance.
Select Month as the Time Variable and Passengers as the Variables in the
partitioned data. Then click OK to partition the data into training and
validation sets.
Click the Data_PartitionTS1 worksheet, then click Smoothing – Moving Average
to open the Moving Average Smoothing dialog, as shown below.
Click the down arrow next to Worksheet in the Data source section at the top of
the dialog and select Data_PartitionTS1. Month has already been selected as
the Time Variable. Select Passengers as the Selected variable. Since this
dataset is expected to include some seasonality (i.e. airline passenger numbers
increase during the holidays and summer months), the value for the Interval
parameter should be the length of one seasonal cycle, i.e. 12 months. As a
result, enter 12 for Interval.
Partitioning the data is optional. If you choose to not partition before running
the smoothing technique, then you will be given the option to specify the
number of desired forecasts on the Moving Average Smoothing dialog.
Now let's take a look at an example that does not include seasonality. Open the
Income.xlsx example dataset. This dataset contains the average income of tax
payers by state. First partition the dataset into training and validation sets using
Year as the Time Variable and CA as the Variables in the partitioned data.
Then click OK to accept the partitioning defaults and create the two sets
(Training and Validation). The worksheet, Data_PartitionTS1 will be inserted
immediately following the Description worksheet.
Click the Data_PartitionTS1 worksheet, then click Smoothing – Moving Average
from the XLMiner ribbon to open the Moving Average Smoothing dialog. Select
Year as the Time Variable and CA as the Selected variable.
Select Month as the Time Variable and Passengers as the Variables in the
partitioned data. Then click OK to partition the data into training and
validation sets. The Data_PartitionTS1 worksheet will be inserted immediately
after the Data worksheet.
Click the Data_PartitionTS1 worksheet, then click Smoothing – Double
Exponential to open the Double Exponential Smoothing dialog, as shown below.
Month has already been selected for the Time Variable. Select Passengers as
the Selected variable. Since the seasonality in this dataset appears every 12
months, enter 12 for Period; # Complete seasons is automatically filled with
the number 7. This example will use the defaults for the three parameters:
Alpha, Beta, and Gamma.
Values between 0 and 1 can be entered for each parameter. As with Exponential
Smoothing, values close to 1 will result in the most recent observations being
weighted more than earlier observations.
In the Multiplicative model, it is assumed that the values for the different
seasons differ by percentage amounts.
Now let's create a new model using the Additive model. This technique assumes
the values for the different seasons differ by a constant amount. Click back
to the Data_PartitionTS1 tab, then click Smoothing – Holt Winters – Additive
to open the Holt Winters Smoothing (Additive Model) dialog.
Month has already been selected for the Time Variable; select Passengers for
the Selected variable. Again, enter 12 for Period and select Give Forecast on
validation.
The last Holt Winters model should be used with time series that contain
seasonality but no trend. Click back to the Data_PartitionTS1 worksheet and
click Smoothing – Holt Winters – No Trend to open the Holt Winters Smoothing
(No Trend Model) dialog.
Month has already been selected as the Time Variable; select Passengers as the
Selected variable. Enter 12 for Period and select Give Forecast on validation.
Notice that the trend parameter is missing. Values for Alpha and Gamma can
range from 0 to 1. A value of 1 for each parameter will assign higher weights
to the most recent observations and lower weights to the earliest
observations. This example will accept the default values.
Taking into account all three methods, the best MSE for the validation set is
from the Additive model (3,247.49).
Time Variable
Select a variable associated with time from the Variables in input data list box.
Selected Variable
Select a variable to apply the smoothing technique.
Output Options
If applying this smoothing technique to raw data, rather than partitioned data,
the Output Options will contain two options, Give Forecast and #forecasts.
Give Forecast
If this option is selected, XLMiner will include a forecast on the output results.
#Forecasts
If Give Forecast is selected, enter the desired number of forecasts here.
Optimize
Select this option if you want XLMiner to select the Alpha Level that minimizes
the residual mean squared errors in the training and validation sets. Take care
when using this feature as this option can result in an over fitted model. This
option is not selected by default.
Level (Alpha)
Enter the smoothing parameter here. This parameter is used in the weighted
average calculation and can be from 0 to 1. A value of 1 or close to 1 will result
in the most recent observations being assigned the largest weights and the
earliest observations being assigned the smallest weights. A value of 0 or close
to 0 will result in the most recent observations being assigned the smallest
weights and the earliest observations being assigned the largest weights. The
default value is 0.2.
Interval
Enter the window width of the moving average here. This parameter accepts a
value from 2 up to N - 1 (where N is the number of observations in the
dataset). If a value of 5 is entered for the Interval, XLMiner will use the
average of the last five observations for the last smoothed point, i.e.
Ft = (Yt + Yt-1 + Yt-2 + Yt-3 + Yt-4) / 5. The default value is 2.
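The window calculation described above can be sketched as follows. The function is illustrative only; in particular, the first few points (with fewer than `interval` observations available) are averaged over what exists, which may differ from how XLMiner handles them.

```python
def moving_average(series, interval):
    """Smoothed point t = mean of the `interval` most recent observations,
    e.g. interval=5 gives Ft = (Yt + ... + Yt-4) / 5."""
    out = []
    for t in range(len(series)):
        window = series[max(0, t - interval + 1): t + 1]
        out.append(sum(window) / len(window))
    return out

print(moving_average([2.0, 4.0, 6.0, 8.0], 2))  # [2.0, 3.0, 5.0, 7.0]
```

A larger interval produces a smoother curve but reacts more slowly to changes in the data.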
Optimize
Select this option if you want XLMiner to select the Alpha and Beta values that
minimize the residual mean squared errors in the training and validation sets.
Take care when using this feature as this option can result in an over fitted
model. This option is not selected by default.
Level (Alpha)
Enter the smoothing parameter here. This parameter is used in the weighted
average calculation and can be from 0 to 1. A value of 1 or close to 1 will result
in the most recent observations being assigned the largest weights and the
earliest observations being assigned the smallest weights in the weighted
average calculation. A value of 0 or close to 0 will result in the most recent
observations being assigned the smallest weights and the earliest observations
being assigned the largest weights in the weighted average calculation. The
default is 0.2.
Trend (Beta)
The Double Exponential Smoothing technique includes an additional parameter,
Beta, to contend with trends in the data. This parameter is also used in the
weighted average calculation and can be from 0 to 1. A value of 1 or close to 1
will result in the most recent observations being assigned the largest weights and
the earliest observations being assigned the smallest weights in the weighted
average calculation. A value of 0 or close to 0 will result in the most recent
observations being assigned the smallest weights and the earliest observations
being assigned the largest weights in the weighted average calculation. The
default is 0.15.
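A common formulation of double exponential smoothing maintains a level term updated with Alpha and a trend term updated with Beta. The sketch below is one standard variant (Holt's method) and is illustrative only; XLMiner's exact update equations and initialization are not shown in this guide.

```python
def double_exp_smooth(series, alpha=0.2, beta=0.15):
    """Holt's double exponential smoothing sketch: a level term and a
    trend term. Defaults mirror XLMiner's Alpha and Beta defaults."""
    level, trend = series[0], series[1] - series[0]
    fitted = [level]
    for x in series[1:]:
        prev = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        fitted.append(level)
    return fitted

# On perfectly linear data the level tracks the series exactly.
print(double_exp_smooth([1.0, 2.0, 3.0, 4.0], alpha=0.5, beta=0.5))
```

The trend term is what lets the forecast continue rising or falling beyond the last observation, which simple exponential smoothing cannot do.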
Parameters
Enter the number of periods that make up one season. The value for # Complete
seasons will be automatically filled.
Level (Alpha)
Enter the smoothing parameter here. This parameter is used in the weighted
average calculation and can be from 0 to 1. A value of 1 or close to 1 will result
in the most recent observations being assigned the largest weights and the
earliest observations being assigned the smallest weights in the weighted
average calculation. A value of 0 or close to 0 will result in the most recent
observations being assigned the smallest weights and the earliest observations
being assigned the largest weights in the weighted average calculation. The
default is 0.2.
Trend (Beta)
The Holt Winters Smoothing also utilizes the Trend parameter, Beta, to contend
with trends in the data. This parameter is also used in the weighted average
calculation and can be from 0 to 1. A value of 1 or close to 1 will result in the
most recent observations being assigned the largest weights and the earliest
observations being assigned the smallest weights in the weighted average
calculation. A value of 0 or close to 0 will result in the most recent observations
being assigned the smallest weights and the earliest observations being assigned
the largest weights in the weighted average calculation. The default is 0.15.
This option is not included on the No Trend Model dialog.
Seasonal (Gamma)
The Holt Winters Smoothing technique utilizes an additional seasonal
parameter, Gamma, to manage the presence of seasonality in the data. This
parameter is also used in the weighted average calculation and can be from 0 to
1. A value of 1 or close to 1 will result in the most recent observations being
assigned the largest weights and the earliest observations being assigned the
smallest weights in the weighted average calculation. A value of 0 or close to 0
will result in the most recent observations being assigned the smallest weights
and the earliest observations being assigned the largest weights in the weighted
average calculation. The default is 0.05.
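Putting the three parameters together, a standard additive Holt Winters update can be sketched as follows. This is illustrative only: the initialization of the level, trend, and seasonal indices is simplified here, and XLMiner's exact scheme may differ. The defaults mirror the dialog's Alpha, Beta, and Gamma defaults.

```python
def holt_winters_additive(series, period, alpha=0.2, beta=0.15, gamma=0.05):
    """Additive Holt Winters sketch: level + trend + seasonal index.
    Requires at least two complete seasons of data."""
    level = sum(series[:period]) / period
    trend = (sum(series[period:2 * period]) - sum(series[:period])) / period ** 2
    season = [x - level for x in series[:period]]
    fitted = []
    for t, x in enumerate(series):
        s = season[t % period]
        fitted.append(level + trend + s)          # one-step-ahead fit
        prev = level
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        season[t % period] = gamma * (x - level) + (1 - gamma) * s
    return fitted

quarterly = [10.0, 12.0, 14.0, 11.0, 11.0, 13.0, 15.0, 12.0,
             12.0, 14.0, 16.0, 13.0]
print(holt_winters_additive(quarterly, 4))
```

In the additive model the seasonal index is added to the level; in the multiplicative model it would scale the level by a percentage instead.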
Give Forecast
XLMiner will generate a forecast if this option is selected.
#Forecasts
Enter the desired number of forecasts here.
Training Set
The training dataset is used to train or build a model. For example, in a linear
regression, the training dataset is used to fit the linear regression model, i.e. to
compute the regression coefficients. In a neural network model, the training
dataset is used to obtain the network weights. After fitting the model on the
training dataset, the performance of the model should be tested on the validation
dataset.
Validation Set
Once a model is built using the training dataset, the performance of the model
must be validated using new data. If the training data itself was utilized to
compute the accuracy of the model fit, the result would be an overly optimistic
estimate of the accuracy of the model. This is because the training or model
fitting process ensures that the accuracy of the model for the training data is as
high as possible -- the model is specifically suited to the training data. To obtain
a more realistic estimate of how the model would perform with unseen data, we
must set aside a part of the original data and not include this set in the training
process. This dataset is known as the validation dataset.
To validate the performance of the model, XLMiner measures the discrepancy
between the actual observed values and the predicted value of the observation.
This discrepancy is known as the error in prediction and is used to measure the
overall accuracy of the model.
Test Set
The validation dataset is often used to fine-tune models. For example, you might
try out neural network models with various architectures and test the accuracy of
each on the validation dataset to choose the best performer among the competing
architectures. In such a case, when a model is finally chosen, its accuracy with
the validation dataset is still an optimistic estimate of how it would perform with
unseen data. This is because the final model has come out as the winner among
the competing models based on the fact that its accuracy with the validation
dataset is highest. As a result, it is a good idea to set aside yet another portion of
data which is used neither in training nor in validation. This set is known as the
test dataset. The accuracy of the model on the test data gives a realistic estimate
of the performance of the model on completely unseen data.
XLMiner provides two methods of partitioning: Standard Partitioning and
Partitioning with Oversampling. XLMiner provides two approaches to standard
partitioning: random partitioning and user-defined partitioning.
Random Partitioning
In simple random sampling, every observation in the main dataset has equal
probability of being selected for the partition dataset. For example, if you
specify 60% for the training dataset, then 60% of the total observations are
randomly selected for the training dataset. In other words, each observation has
a 60% chance of being selected.
Random partitioning uses the system clock as a default to initialize the random
number seed. Alternatively, the random seed can be manually set which will
result in the same observations being chosen for the training/validation/test sets
each time a standard partition is created.
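The behavior just described can be sketched in Python. The function below is illustrative, not XLMiner's implementation; the example reuses the counts from the standard-partition walkthrough that follows (178 observations split roughly 60/40), and a fixed seed (XLMiner's default is 12345) makes the assignment repeatable.

```python
import random

def standard_partition(records, train_pct=0.6, seed=12345):
    """Random training/validation split; a fixed seed makes the same
    records land in the same set on every run."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_pct)
    return shuffled[:cut], shuffled[cut:]

train, valid = standard_partition(list(range(178)))
print(len(train), len(valid))  # 107 71
```

Omitting the seed (e.g. seeding from the clock) would give a different split on each run, which is why a manual seed is offered for reproducibility.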
Highlight all variables in the Variables listbox, then click > to include them in
the partitioned data. Then click OK to accept the remainder of the default
settings. Sixty percent of the observations will be assigned to the Training set
and forty percent of the observations will be assigned to the Validation set.
The worksheet Data_Partition1 will be inserted immediately after the
Description worksheet.
From the figure above, taken from the Data_Partition1 worksheet, 107
observations were assigned to the training set and 71 observations were assigned
to the validation set, or roughly 60% and 40% of the observations, respectively.
It is also possible for the user to specify which set each observation should
be assigned to. In column O, enter a t, v, or s for each record, as shown
below.
Then click Partition – Standard Partition on the XLMiner ribbon to open the
Standard Partition dialog.
Select Use Partition Variable in the Partitioning options section, select
Partition Variable in the Variables listbox, then click > next to Use Partition
Variable. XLMiner will use the values in the Partition Variable column to
create the training, validation, and test sets. Records with a "t" in the O
column will be designated as training records, records with a "v" will be
designated as validation records, and records with an "s" will be designated
as test records. Now highlight all remaining variables in the listbox and
click > to include them in the partitioned data.
All records assigned a "t" now belong to the training set, all records
assigned a "v" now belong to the validation set, and all records assigned an
"s" now belong to the test set.
First confirm that Data Range at the top of the dialog is showing as
$A$1:$V$58206. If not, simply click in the Data Range field and type the
correct range.
Select all variables in the Variables list box then click > to move all variables to
the Variables in the partitioned data listbox. Afterwards, highlight Target
dependent variable: buyer(yes = 1) in the Variables in the partitioned data
listbox then click the > immediately to the left of Output variable to designate
this variable as the output variable. Reminder: this output variable is limited to
two classes, e.g. 0/1 or yes/no.
Enter 50 for the Specify % validation data to be taken away as test data.
The output variable (Target dependent variable: buyer (yes = 1)) contains 576
successes, or 1s, which XLMiner allocates between the Training and
Validation sets as follows.
The percentage of success records in the original dataset is 0.9896%, or
576/58,204 (number of successes / total number of rows in the original
dataset). 50% was
specified for both % Success in Training data and % Validation data taken away
as test in the Partition with Oversampling dialog. As a result, XLMiner has
randomly allocated 50% of the successes (the 1s) to the training set and the
remaining 50% to the validation set. This means that there are 288 successes
in the training set and 288 successes in the validation set. To complete the
training set, XLMiner randomly selected 288 non-successes (0s). The training
set has 576 rows (288 1s + 288 0s).
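The training-set construction just described can be sketched as follows. The function is illustrative (not XLMiner's implementation): it takes half of the 576 successes and pads the training set with an equal number of randomly chosen non-successes.

```python
import random

def oversample_partition(records, is_success, pct_success_train=0.5, seed=12345):
    """Partition-with-oversampling sketch: put a share of the successes in
    the training set, plus an equal number of non-successes; everything
    else is left for the validation/test allocation."""
    rng = random.Random(seed)
    successes = [r for r in records if is_success(r)]
    failures = [r for r in records if not is_success(r)]
    rng.shuffle(successes)
    rng.shuffle(failures)
    n_train = int(len(successes) * pct_success_train)
    training = successes[:n_train] + failures[:n_train]
    rest = successes[n_train:] + failures[n_train:]
    return training, rest

# 58,204 rows with 576 successes, as in the worked example.
rows = [(i, 1 if i < 576 else 0) for i in range(58204)]
train, rest = oversample_partition(rows, lambda r: r[1] == 1)
print(len(train))  # 576
```

Balancing the training set this way lets a classifier see enough rare "success" cases to learn from them, while the validation set keeps the original class ratio.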
The output above shows that the % Success in the original dataset is 0.9896%.
XLMiner will maintain this percentage in the validation set as well by
allocating as many 0s as needed. Since 288 successes (1s) have already been
allocated to the validation set, 14,263 non-successes (0s) must be added to
the validation set to maintain this ratio.
Since we specified 50% of validation data should be taken as test data, XLMiner
has allocated 50% of the validation records to the test set. Each set contains
14,551 rows.
Set Seed
Random partitioning uses the system clock as a default to initialize the random
number seed. By default this option is selected to specify a non-negative seed
for random number generation for the partitioning. Setting this option will result
in the same records being assigned to the same set on successive runs. The
default seed entry is 12345.
Automatic
If Pick up rows randomly is selected under Partitioning options, this option will
be enabled. Select this option to accept the defaults of 60% and 40% for the
percentages of records to be included in the training and validation sets. This is
the default selection.
Specify percentages
If Pick up rows randomly is selected under Partitioning options, this option will
be enabled. Select this option to manually enter percentages for training set,
validation set and test sets. Records will be randomly allocated to each set
according to these percentages.
Set seed
Random partitioning uses the system clock as a default to initialize the random
number seed. By default this option is selected to specify a non-negative seed
for random number generation for the partition. Setting this option will result in
the same records being assigned to the same set on successive runs. The default
seed entry is 12345.
Output variable
Select the output variable from the variables listed in the Variables in the
partitioned data listbox.
#Classes
After the output variable is chosen, the number of classes (distinct values) for
the output variable will be displayed here. XLMiner supports only two classes
for this partitioning method.
Discriminant Analysis
Classification Method
Introduction
Discriminant analysis is a technique for classifying a set of observations into
predefined classes in order to determine the class of an observation based on a
set of variables. These variables are known as predictors or input variables. The
model is built based on a set of observations for which the classes are known.
This set of observations is sometimes referred to as the training set. Based on
the training set, the technique constructs a set of linear functions of the
predictors, known as discriminant functions, such that L = b1x1 + b2x2 + ... +
bnxn + c, where the b's are the discriminant coefficients, the x's are the input
variables or predictors, and c is a constant.
These discriminant functions are used to predict the class of a new observation
with an unknown class. For a k-class problem, k discriminant functions are
constructed. Given a new observation, all k discriminant functions are
evaluated and the observation is assigned to class i if the ith discriminant
function has the highest value.
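This classification rule can be sketched as follows. The coefficients here are made up for illustration; this is not XLMiner's implementation.

```python
# Minimal sketch: evaluate one linear discriminant function per class
# and assign the observation to the class with the highest value.
def classify(x, discriminants):
    """discriminants: list of (b, c) pairs, one per class, where b is a
    coefficient vector and c a constant, so L_i = sum(b_j * x_j) + c."""
    scores = [sum(bj * xj for bj, xj in zip(b, x)) + c
              for b, c in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

# Two hypothetical classes in two predictors:
funcs = [([1.0, -0.5], 0.2),   # class 0
         ([0.3,  2.0], -1.0)]  # class 1
print(classify([2.0, 1.5], funcs))  # 1 (class 1's function scores higher)
```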
Select the CAT. MEDV variable in the Variables in input data listbox then
click > to select as the Output variable. Afterwards, select CRIM, ZN, INDUS,
NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, & B in the Variables in input
data listbox then click > to move to the Input variables listbox. (CHAS,
LSTAT, & MEDV should remain in the Variables in input data listbox as shown
below.)
If the second option is selected, Use equal prior probabilities, XLMiner will
assume that all classes occur with equal probability.
The third option, User specified prior probabilities, is only available when the
output variable handles two classes. Select this option to manually enter the
desired class and probability value.
We will select this option, then enter 1 for Class and enter 0.7 for Probability.
XLMiner gives the option of specifying the cost of misclassification when there
are two classes; where the success class is judged as failure and the nonsuccess
as a success. XLMiner takes into consideration the relative costs of
misclassification, and attempts to fit a model that minimizes the total cost.
Leave these options at their defaults of 1. Click Next to advance to the 3rd
Discriminant Analysis dialog.
Click Finish to view the output. The output worksheets will be inserted at the
end of the workbook. The first output worksheet, DA_Output1, contains the
Output Navigator which can be used to navigate to various sections of the
output.
Click the Class Funs link to view the Classification Function table. In this
example, there are two functions, one for each class. Each record is assigned
to the class whose function produces the higher value.
Click the Training Lift Charts link to navigate to the Training Data Lift Charts.
In a Lift Chart, the x axis is the cumulative number of cases and the y axis is the
cumulative number of true positives. The red line originating from the origin
and connecting to the point (400, 65) is a reference line that represents the
expected number of CAT MEDV predictions if XLMiner simply selected
random cases, i.e., if no model were used. This reference line provides a
yardstick against which the user can compare the model's performance. From the
Lift Chart below we can infer that if we assigned 200 cases to class 1, 62 1s
would be included. If 200 cases were selected at random, we could expect about
33 1s (200 * 65/400 = 32.5).
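The reference-line arithmetic can be checked directly; the function name below is ours, introduced only for illustration.

```python
# Expected number of 1s when cases are selected at random (the lift
# chart's reference line): selected fraction times total successes.
def expected_random_hits(n_selected, total_hits, total_cases):
    return n_selected * total_hits / total_cases

print(expected_random_hits(200, 65, 400))  # 32.5
```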
Input variables
The variables to be included in the Discriminant Analysis algorithm are listed
here.
Weight Variables
This option is not used for Discriminant Analysis.
Output variable
The selected output variable is displayed here. XLMiner supports a maximum
of 30 classes in the output variable.
#Classes
This value is the number of classes in the output variable.
Misclassification Costs of
XLMiner allows the option of specifying the cost of misclassification when
there are two classes; where the success class is judged as a failure and the
nonsuccess as a success. XLMiner takes into consideration the relative costs of
misclassification, and attempts to fit a model that minimizes the total cost.
Canonical Scores
The values of the variables X1 and X2 for the ith observation are known as the
canonical scores for that observation. In this example, the pair of canonical
scores for each observation represents the observation in a two dimensional
space. The purpose of the canonical score is to separate the classes as much as
possible. Thus, when the observations are plotted with the canonical scores as
the coordinates, the observations belonging to the same class are grouped
together. When this option is selected for either the Training, Validation or Test
sets, XLMiner reports the scores of the first few observations.
See the Scoring chapter for more information on the options located in the Score
Test Data and Score New Data groups.
Logistic Regression
Introduction
Logistic regression is a variation of ordinary regression which is used when the
dependent (response) variable is a dichotomous variable. A dichotomous
variable takes only two values, which typically represent the occurrence or
non-occurrence of some outcome event and are usually coded as 0 or 1 (success).
The independent (input) variables are continuous, categorical, or both. An
example of such a variable in a medical study would be a patient's status
after five years: the patient either survives (1) or dies (0).
Unlike ordinary linear regression, logistic regression does not assume that the
relationship between the independent variables and the dependent variable is a
linear one. Nor does it assume that the dependent variable or the error terms are
distributed normally.
The form of the model is

log(p/(1-p)) = b0 + b1X1 + b2X2 + ... + bkXk

where p is the probability that Y = 1 and X1, X2, ..., Xk are the independent
variables (predictors). b0, b1, b2, ..., bk are known as the regression coefficients,
which have to be estimated from the data. Logistic regression estimates the
probability of a certain event occurring.
Logistic regression thus forms a predictor variable (log (p/(1-p)) which is a
linear combination of the explanatory variables. The values of this predictor
variable are then transformed into probabilities by a logistic function. Such a
function has the shape of an S. On the horizontal axis we have the values of the
predictor variable, and on the vertical axis we have the probabilities.
Logistic regression also produces Odds Ratios (O.R.) associated with each
predictor value. The "odds" of an event is defined as the probability of the
outcome event occurring divided by the probability of the event not occurring.
In general, the "odds ratio" is one set of odds divided by another. The odds ratio
for a predictor is defined as the relative amount by which the odds of the
outcome increase (O.R. greater than 1.0) or decrease (O.R. less than 1.0) when
the value of the predictor variable is increased by 1.0 units. In other words,
(odds for PV+1)/(odds for PV) where PV is the value of the predictor variable.
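A small sketch of the logistic transform and this odds-ratio property, using hypothetical coefficient values:

```python
import math

# The logistic (S-shaped) transform of the linear predictor
# log(p/(1-p)) = b0 + sum(bi * xi), as described above.
def prob(x, b0, b):
    z = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

b0, b = -1.0, [0.4]             # made-up coefficients
p1 = prob([2.0], b0, b)         # predictor value PV = 2
p2 = prob([3.0], b0, b)         # PV + 1
# The odds ratio (odds for PV+1)/(odds for PV) equals e^b for any PV:
print(round(odds(p2) / odds(p1), 6), round(math.exp(0.4), 6))
```

This illustrates why the odds ratio reported for a predictor is simply e raised to its coefficient.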
First, we partition the data using a standard partition with percentages of 70%
training and 30% validation. (For more information on how to partition a
dataset, please see the previous Data Mining Partitioning chapter.)
Choose the value that will be the indicator of Success by clicking the down
arrow next to Specify Success class (necessary). In this example, we will use
the default of 1.
Enter a value between 0 and 1 for Specify the initial cutoff probability for
success. If the Probability of success (probability of the output variable = 1) is
less than this value, then a 0 will be entered for the class value, otherwise a 1
will be entered for the class value. In this example, we will keep the default of
0.5.
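The cutoff rule just described can be sketched in one line; the probabilities below are invented for illustration.

```python
# A record whose predicted probability of success is below the cutoff is
# classified as 0; otherwise it is classified as 1.
def classify_by_cutoff(probs, cutoff=0.5):
    return [1 if p >= cutoff else 0 for p in probs]

print(classify_by_cutoff([0.91, 0.32, 0.50, 0.49]))  # [1, 0, 1, 0]
```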
Click Next to advance to Step 2 of 3 of the Logistic Regression algorithm.
Selecting Set confidence level for odds alters the level of confidence for the
confidence levels displayed in the results for the odds ratio.
Selecting Force constant term to zero omits the constant term in the regression.
For this example, select Set confidence level for odds leaving the percentage at
95%.
Keep the default of 50 for the Maximum # iterations. Estimating the coefficients
in the Logistic Regression algorithm requires an iterative non-linear
maximization procedure. You can specify a maximum number of iterations to
prevent the program from getting lost in very lengthy iterative loops. This
value must be an integer greater than 0 and less than or equal to 100
(1 <= value <= 100).
Keep the default of 1 for the Initial Marquardt overshoot factor. This overshoot
factor is used in the iterative non-linear maximization procedure. Reducing this
value speeds the operation by reducing the number of iterations required but
increases the chances that the maximization procedure will fail due to
overshoot. This value must be greater than 0 and less than or equal to 50
(0 < value <= 50).
At times, variables can be highly correlated with one another, which can result in
large standard errors for the affected coefficients. XLMiner will display
information useful in dealing with this problem if Perform Collinearity
diagnostics is selected. For this example, select Perform Collinearity
diagnostics and enter 2 for the Number of collinearity components. This option
can take on integer values from 2 to 15 (2 <= value <= 15).
Select Perform best subset selection. Often a subset of variables (rather than
all of the variables) performs the best job of classification. Selecting Perform
best subset selection enables the Best Subset options.
Using the spinner controls, specify 15 for the Maximum size of best subset. It's
possible that XLMiner could find a smaller subset of variables. This option can
take on values of 1 up to N where N is the number of input variables. The
default setting is 15.
Using the spinner controls, specify 15 for the Number of best subsets. XLMiner
can provide up to 20 different subsets. The default setting is 1.
XLMiner offers five different selection procedures for selecting the best subset
of variables.
Backward elimination in which variables are eliminated one at a time,
starting with the least significant.
Forward selection in which variables are added one at a time, starting
with the most significant.
Exhaustive search where searches of all combinations of variables are
performed to observe which combination has the best fit. (This option
can become quite time consuming depending on the number of input
variables.)
Sequential replacement in which variables are sequentially replaced
and replacements that improve performance are retained.
Stepwise selection is similar to Forward selection except that at each
stage, XLMiner considers dropping variables that are not statistically
significant.
Click OK to return to the Step 2 of 3 dialog. Click Next to advance to the Step 3
of 3 dialog.
Click Finish. The logistic regression output worksheets are inserted at the end
of the workbook. Use the Output Navigator on the first output worksheet,
LR_Output1.
A number of sections of output are available, including Classification of the
Training Data as shown below. Note that XLMiner has not, strictly speaking,
classified the data -- it has assigned a "predicted probability of success" to each
case. This is the predicted probability, based on the input (independent) variable
values for a case, that the output (dependent) variable for the case will be a "1".
Since the logistic regression procedure works not with the actual values of the
variable but with the logs of the odds ratios, this value is shown in the output
(the predicted probability of success is derived from it).
To classify each record as a "1" or a "0," we would simply assign a "1" to the
record if the predicted probability of success exceeds a certain value. In this
example the initial cutoff probability was set to 0.5 on the first Logistic
Regression dialog. A value of "0" will be assigned if the prediction probability
of success is less than the cutoff probability.
Since we selected Perform best subset selection on the Best Subset dialog,
XLMiner has produced the following output which displays the variables that
are included in the subsets. Since we selected 15 as the size of the subset, we are
shown the best subset of 1 variable (plus the constant), up to the best subset for
15 variables (plus the constant). This list comprises several different models
XLMiner generated using the Backward Elimination Selection Procedure as
chosen on the Best Subset dialog.
Refer to the Best Subset output above. In this section, every model includes a
constant term (since Force constant term to zero was not selected in Step 2 of 3)
and one or more variables as the additional coefficients. We can use any of these
models for further analysis by clicking on the respective link Choose Subset.
The choice of model depends on the calculated values of various error values
and the probability. RSS is the residual sum of squares, or the sum of squared
deviations between the predicted probability of success and the actual value (1
or 0). Cp is "Mallows Cp" and is a measure of the error in the best subset model,
relative to the error incorporating all variables. Adequate models are those for
which Cp is roughly equal to the number of parameters in the model (including
the constant), and/or Cp is at a minimum. "Probability" is a quasi hypothesis test
of the proposition that a given subset is acceptable; if Probability < .05 we can
rule out that subset.
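These criteria can be sketched numerically. The RSS definition follows the text above; the Mallows Cp formula used here is the common textbook form, which may differ in detail from XLMiner's internal computation, and the data are invented.

```python
# RSS: sum of squared deviations between predicted probability of
# success and the actual outcome (1 or 0).
def rss(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Common form of Mallows' Cp for a subset with n_params parameters,
# relative to the full model's mean squared error (an assumption about
# the exact formula XLMiner uses).
def mallows_cp(rss_subset, mse_full, n, n_params):
    return rss_subset / mse_full - n + 2 * n_params

y    = [1, 0, 1, 1, 0]
phat = [0.8, 0.3, 0.6, 0.9, 0.2]
print(round(rss(y, phat), 2))  # 0.34
```

An adequate subset, per the text, is one where Cp is roughly equal to the number of parameters (including the constant) or at a minimum.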
When hovering over Choose Subset, the mouse icon will change to a grabber
hand. If Choose Subset is clicked, XLMiner opens the Logistic Regression
Step 1 of 1 dialog displaying the input variables included in that particular
subset. Scroll down to the end of the table.
The considerations about RSS, Cp and Probability would lead us to believe that
the subsets with 10 or 11 coefficients are the best models in this example.
Model terms are shown in the Regression Model output shown below.
This table contains the coefficient, the standard error of the coefficient, the
p-value, and the odds ratio for each variable (which is simply e^x, where x is
the value of the coefficient), along with a confidence interval for the odds.
Summary statistics to the right (above) show the residual degrees of freedom
(#observations - #predictors), a standard deviation type measure for the model
(which typically has a chi-square distribution), the percentage of successes (1's)
in the training data, the number of iterations required to fit the model, and the
Multiple R-squared value.
The multiple R-squared value shown here is the R-squared value for a logistic
regression model, defined as R2 = (D0 - D)/D0,
where D is the Deviance based on the fitted model and D0 is the deviance based
on the null model. The null model is defined as the model containing no
predictor variables apart from the constant.
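This definition can be checked numerically with made-up data. Here deviance is taken as D = -2 × log-likelihood, the usual definition; the outcomes and fitted probabilities are invented.

```python
import math

# Binomial deviance: D = -2 * sum of log-likelihood contributions.
def deviance(y, p):
    return -2 * sum(math.log(pi) if yi == 1 else math.log(1 - pi)
                    for yi, pi in zip(y, p))

y = [1, 0, 1, 1, 0]
p_fit = [0.8, 0.3, 0.6, 0.9, 0.2]           # hypothetical fitted model
p_null = [sum(y) / len(y)] * len(y)          # constant-only (null) model
d, d0 = deviance(y, p_fit), deviance(y, p_null)
print(round((d0 - d) / d0, 3))               # R2 = (D0 - D)/D0
```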
Collinearity Diagnostics help assess whether two or more variables so closely
track one another as to provide essentially the same information.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted using the predicted output variable value (or predicted
probability of success in the logistic regression case). After sorting, the actual
outcome values of the output variable are cumulated and the lift curve is drawn
as the number of cases versus the cumulated value. The baseline (red line
connecting the origin to the end point of the blue line) is drawn as the number of
cases versus the average of actual output variable values multiplied by the
number of cases. The decilewise lift curve is drawn as the decile number versus
the cumulative actual output variable value divided by the decile's average
output variable value.
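The lift-curve construction described above can be sketched as follows, using made-up predicted probabilities and actual outcomes; this mirrors the described procedure, not XLMiner's code.

```python
# Sort records by predicted value (descending), then cumulate the actual
# outcomes: these cumulative sums are the lift curve's y values.
def lift_points(actual, predicted):
    order = sorted(range(len(actual)), key=lambda i: -predicted[i])
    curve, total = [], 0
    for i in order:
        total += actual[i]
        curve.append(total)
    return curve

# Baseline: number of cases times the average actual outcome value.
def baseline_points(actual):
    avg = sum(actual) / len(actual)
    return [avg * (i + 1) for i in range(len(actual))]

y    = [1, 0, 1, 0, 0, 1]
phat = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7]
print(lift_points(y, phat))      # [1, 2, 3, 3, 3, 3]
print(baseline_points(y)[-1])    # 3.0
```

The greater the area between the lift curve and this baseline, the better the model separates 1s from 0s.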
See the chapter on Stored Model Sheets for more information on the
LR_Stored_1 worksheet.
Input variables
Variables listed here will be utilized in the XLMiner output.
Weight variable
One major assumption of Logistic Regression is that each observation provides
equal information. XLMiner offers an opportunity to provide a Weight variable.
Using a Weight variable allows the user to allocate a weight to each record. A
record with a large weight will influence the model more than a record with a
smaller weight.
Output Variable
Select the variable whose outcome is to be predicted.
# Classes
Displays the number of classes in the Output variable.
Maximum # iterations
Estimating the coefficients in the Logistic Regression algorithm requires an
iterative non-linear maximization procedure. You can specify a maximum
number of iterations to prevent the program from getting lost in very lengthy
iterative loops. This value must be an integer greater than 0 or less than or equal
to 100 (1< value <= 100). The default value is 50.
Selection Procedure
XLMiner offers five different selection procedures for selecting the best subset
of variables.
Backward elimination in which variables are eliminated one at a time,
starting with the least significant.
Forward selection in which variables are added one at a time, starting
with the most significant.
Exhaustive search where searches of all combinations of variables are
performed to observe which combination has the best fit. (This option
can become quite time consuming depending on the number of input
variables.)
Sequential replacement in which variables are sequentially replaced
and replacements that improve performance are retained.
Stepwise selection is similar to Forward selection except that at each
stage, XLMiner considers dropping variables that are not statistically
significant. When this procedure is selected, the Stepwise selection
options FIN and FOUT are enabled.
In the stepwise selection procedure a statistic is calculated when
variables are added or eliminated. For a variable to come into the
regression, the statistics value must be greater than the value for FIN
(default = 3.84). For a variable to leave the regression, the statistics
value must be less than the value of FOUT (default = 2.71). The value
for FIN must be greater than the value for FOUT.
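One step of this enter/leave rule might look like the following sketch. The statistic values are invented and the function is ours, not part of XLMiner; in practice the statistics are recomputed after every change to the model.

```python
# FIN/FOUT rule: a candidate variable enters the regression if its
# statistic exceeds FIN; a variable already in the model leaves if its
# statistic falls below FOUT. (Defaults from the text.)
FIN, FOUT = 3.84, 2.71

def stepwise_step(in_model, candidates, stats):
    entered = [v for v in candidates if stats[v] > FIN]
    dropped = [v for v in in_model if stats[v] < FOUT]
    return entered, dropped

stats = {"CRIM": 5.2, "ZN": 1.1, "AGE": 2.0}   # hypothetical statistics
print(stepwise_step(["AGE"], ["CRIM", "ZN"], stats))  # (['CRIM'], ['AGE'])
```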
Residuals
When this option is selected, XLMiner will produce a two-column array of fitted
values and their residuals in the output.
k Nearest Neighbors
Classification Method
Introduction
In the k-nearest-neighbor classification method, the training dataset is used to
classify each member of a "target" dataset. The structure of the data is that there
is a classification (categorical) variable ("buyer," or "non-buyer," for example),
and a number of additional predictor variables (age, income, location, etc.).
1. For each row (case) in the target dataset (the set to be classified), the k
closest members (the k nearest neighbors) of the training dataset are located.
A Euclidean distance measure is used to calculate how close each member of
the training set is to the target row that is being examined.
2. The target row is then assigned the class that is most common among its k
nearest neighbors.
3. Repeat this procedure for the remaining rows (cases) in the target set.
4. XLMiner allows the user to select a maximum value for k and builds models
in parallel on all values of k up to the maximum specified value. Additional
scoring can be performed on the best of these models.
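The steps above can be sketched as a generic k-nearest-neighbor classifier (not XLMiner's implementation; the data are invented):

```python
import math
from collections import Counter

# Euclidean distance to every training row, then a majority vote among
# the k closest neighbors.
def knn_classify(train_rows, train_labels, target, k):
    nearest = sorted(range(len(train_rows)),
                     key=lambda i: math.dist(train_rows[i], target))
    votes = Counter(train_labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

rows   = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.2, 4.8]]
labels = ["buyer", "buyer", "non-buyer", "non-buyer"]
print(knn_classify(rows, labels, [1.1, 1.0], k=3))  # buyer
```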
First, we partition the data using a standard partition with percentages of 60%
training and 40% validation (the default settings for the Automatic choice).
For more information on how to partition a dataset, please see the previous Data
Mining Partitioning chapter.
Select Normalize input data. When this option is selected, XLMiner will
normalize the data by expressing the entire dataset in terms of standard
deviations. This is done so that the distance measure is not dominated by a large
magnitude variable. In this example, the values for Petal_width are between .1
and 2.5 while the values for Sepal_length are between 4.3 and 7.9. When the
data is normalized, the actual variable value (say, 4.3) is replaced by its
distance from that variable's mean, measured in standard deviations. This
option is not selected by default.
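The normalization described above is a standard z-score; a sketch with invented values:

```python
import math

# Replace each value by its distance from the column mean in
# standard-deviation units (the z-score).
def normalize(column):
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]

petal_width = [0.2, 1.3, 2.5, 1.8, 0.4]   # hypothetical values
z = normalize(petal_width)
print([round(v, 2) for v in z])
```

After this transform every variable has mean 0 and standard deviation 1, so no single large-magnitude variable dominates the distance measure.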
Enter 10 for Number of nearest neighbors (k). (This number is based on
standard practice from the literature.) This is the parameter k in the k-Nearest
Neighbor algorithm. The value of k should be between 1 and the total number
of observations (rows). Note that if k is chosen as the total number of
observations in the training set, then for any new observation, all the
observations in the training set become nearest neighbors. The default value for
this option is 1.
Select Score on best k between 1 and specified value under Scoring option.
When this option is selected, XLMiner will display the output for the best k
between 1 and the value entered for Number of nearest neighbors (k). If Score
on specified value of k as above is selected, the output will be displayed for the
specified value of k.
Select Detailed scoring and Summary report under both Score training data
and Score validation data. XLMiner will create detailed and summary reports
for both the training and validation sets.
For more information on the Score new data options, please see the Scoring
chapter.
Click Finish to view the output. Scroll down on the Output1 worksheet to view
the Validation error log.
The Validation error log for the different k's lists the % Errors for all values of k
for both the training and validation data sets. The k with the smallest % Error is
selected as the Best k. Scoring is performed later using this best value of k.
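Selecting the best k from such an error log is a simple minimization; the error percentages below are invented for illustration.

```python
# Pick the k whose validation %Error is smallest.
def best_k(error_log):
    """error_log: {k: validation %error}; return k with smallest error."""
    return min(error_log, key=error_log.get)

log = {1: 4.0, 2: 4.0, 3: 2.7, 4: 2.7, 5: 1.3, 6: 2.7, 7: 2.7}
print(best_k(log))  # 5
```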
A little further down on the Output1 worksheet is the Validation Data scoring
table.
This Summary report tallies the actual and predicted classifications. (Predicted
classifications were generated by applying the model to the validation data.)
Correct classification counts are along the diagonal from the upper left to the
lower right. In this example, there were three misclassification errors (3 cases
where Virginicas were misclassified as Versicolors).
Click the Valid. Score Detailed Rep. link on the Output Navigator to be
routed to the ValidScore1 worksheet.
This table shows the predicted class for each record, the percent of the nearest
neighbors belonging to that class and the actual class. The class with the highest
probability is highlighted in yellow. Mismatches between Predicted and Actual
class are highlighted in green.
Scroll down to view record 107 which is one of the three misclassified records.
The additional two misclassified records are 120 and 134.
Input variables
The variables selected as input variables appear here.
Weight variable
This option is not used in the k-Nearest Neighbors classification method.
Output variable
The variable to be classified is entered here.
Scoring Option
If Score on best k between 1 and specified value is selected, XLMiner will
display the output for the best k between 1 and the value entered for Number of
nearest neighbors (k).
If Score on specified value of k as above is selected, the output will be displayed
for the specified value of k.
Classification Tree
Classification Method
Introduction
Classification tree methods (also known as decision tree methods) are a good
choice when the data mining task is classification or prediction of outcomes and
the goal is to generate rules that can be easily understood, explained, and
translated into SQL or a natural query language.
A Classification tree labels records and assigns them to discrete classes. A
Classification tree can also provide a measure of confidence that the
classification is correct.
A Classification tree is built through a process known as binary recursive
partitioning. This is an iterative process of splitting the data into partitions, and
then splitting it up further on each of the branches.
Initially, a training set is created where the classification label (say, "purchaser"
or "non-purchaser") is known (pre-classified) for each record. In the next step,
the algorithm systematically assigns each record to one of two subsets on some
basis (for example, income > $75,000 or income <= $75,000). The objective is
to attain as homogeneous a set of labels (say, "purchaser" or "non-purchaser") as
possible in each partition. This splitting (or partitioning) is then applied to each
of the new partitions. The process continues until no more useful splits can be
found. The heart of the algorithm is the rule that determines the initial split
(see figure below).
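Binary recursive partitioning can be sketched as follows. Gini impurity is used here as the homogeneity measure; that is one common choice and an assumption about the criterion, and the data are invented.

```python
# Gini impurity of a set of 0/1 labels: 0 means perfectly homogeneous.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

# Try each candidate threshold on one variable and return the split
# that minimizes the weighted impurity of the two partitions.
def best_split(values, labels):
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left  = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if right and score < best[1]:
            best = (t, score)
    return best

income = [40, 60, 80, 90, 120]     # hypothetical predictor
buyer  = [0, 0, 1, 1, 1]
print(best_split(income, buyer))   # (60, 0.0): splitting at 60 is perfect
```

A full tree builder would apply `best_split` recursively to each resulting partition until no useful split remains.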
As explained above, the process starts with a training set consisting of pre-classified records (target field or dependent variable with a known class or label
such as "purchaser" or "non-purchaser"). The goal is to build a tree that
distinguishes among the classes. For simplicity, assume that there are only two
target classes and that each split is a binary partition. The splitting
criterion determines how the records are divided at each step.
The example dataset contains the following variables: CRIM, ZN, INDUS, CHAS,
NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT, and MEDV.
First, we partition the data into training and validation sets using the Standard
Data Partition defaults of 60% of the data randomly allocated to the Training Set
and 40% of the data randomly allocated to the Validation Set. For more
information on partitioning a dataset, see the Data Mining Partitioning chapter.
Select CAT. MEDV as the Output variable. Then select all remaining
variables except MEDV as Input variables. The MEDV variable is not
included in the Input since the CAT. MEDV variable is derived from the MEDV
variable.
Keep the default settings for Specify Success class and Specify initial cutoff
probability.
XLMiner provides the option to specify the maximum number of levels in the
tree. Set Maximum # levels to be displayed to 4 to indicate to XLMiner that
only four levels of the tree are desired.
Select Full tree (grown using training data) to grow a complete tree using
the training data.
Select Best pruned tree (pruned using validation data). Selecting this option
will result in a tree with the fewest number of nodes, subject to the constraint
that the error be kept below a specified level (minimum error rate plus the
standard error of that error rate).
Select Minimum error tree (pruned using validation data) to produce a tree
that yields the minimum classification error rate when tested on the validation
data.
To create a tree with a specified number of decision nodes select Tree with
specified number of decision nodes and enter the desired number of nodes.
Leave this option unselected for this example.
Select the three options under both Score training data and Score validation
data to produce an assessment of the performance of the tree in both sets.
Please see the Scoring chapter for information on the Score new data options.
Recall that the objective of this example is to classify each case as a 0 (low
median value) or a 1 (high median value). Consider the top decision node
(denoted by a circle). The label beneath this node indicates the variable
represented at this node (i.e., the variable selected for the first split), in this
case RM, the average number of rooms per dwelling. The value inside the node
indicates the split threshold. (Hover over the decision node to read the decision
rule.) If the RM value for a specific record is greater than 6.733 (RM > 6.733),
the record will be assigned to the right. If the RM value for the record is less
than or equal to 6.733, the value will be assigned to the left. 63 records
contained RM values greater than 6.733 while 241 records contained RM values
of less than or equal to 6.733. We can think of records with an RM value less
than or equal to 6.733 (RM <= 6.733) as tentatively classified as "0" (low
median value). Any record where RM > 6.733 can be tentatively classified as a
"1" (high median value).
The 241 records with RM values less than 6.733 are further split as we move
down the tree. The second split on this branch occurs with the LSTAT variable
(percent of the population that is of lower socioeconomic status). The LSTAT
values for 74 records (out of 241) fall below the split value of 9.535. These
records are tentatively classified as a 1 meaning these records have low
percentages of the population with lower socioeconomic status. The LSTAT
values for the remaining 167 records are greater than 9.535, and are tentatively
classified as "0".
A square node indicates a terminal node, after which there are no further splits.
For example, the 167 coming from the right of LSTAT are classified as 0's.
There are no further splits for this group. The path of their classification is: If
few rooms, and if a high percentage of the population is of lower socioeconomic
status, then classify as 0 (low median value).
The terminal nodes at the bottom of the tree displaying Sub Tree beneath
indicate that the full tree has not been drawn due to its size. The structure of the
full tree will be clear by reading the Full Tree Rules. Click the Full Tree Rules
link on the Output Navigator to open the Full Tree Rules table, shown below.
The first entry in this table shows a split on the RM variable with a split value of
6.733. The 304 total cases were split between nodes 1 (LeftChild column) and
2 (RightChild column).
Moving to NodeID1 we find that 241 cases were assigned to this node (from
node 0) which has a 0 value (Class column). From here, the 241 cases were
split on the LSTAT variable using a value of 9.535 between nodes 3 (LeftChild
column) and 4 (RightChild column).
Moving to NodeID3 we find that 74 cases were assigned to this node (from node
1) which has a 0 value. From here, the 74 cases were split on the DIS variable
using a value of 3.4351 between nodes 7 and 8.
Moving to NodeID7, we find that 18 cases were assigned to this node (from
node 3) which has a 0 value. From here, the 18 cases were split on the RAD
variable using a value of 7.5 between nodes 11 and 12.
Moving to NodeID11, we find that 12 cases were assigned to this node (from
node 7) which has a 0 value. From here, the 12 cases were split on the TAX
variable using a value of 207.4998 between nodes 15 and 16.
The user can follow this table in a similar fashion until a terminal node is
reached.
Click the Minimum Error Tree link on the Output Navigator to view the
Minimum Error Tree on the CT_MinErrTree1 worksheet.
The "minimum error tree" is the tree that yields a minimum classification error
rate when tested on the validation data. The misclassification (error) rate is
measured as the tree is pruned. The tree that produces the lowest error rate is
selected.
Click the Best Pruned Tree link on the Output Navigator to view the Best
Pruned Tree.
Note: The Best Pruned Tree is based on the validation data set, and is the
smallest tree whose misclassification rate is within one standard error of the
misclassification rate of the Minimum Error Tree. In this example the Best
Pruned tree and the Minimum Error Tree happen to be the same because the
#Decision Nodes for them is the same. (Please refer to the Prune Log).
However, you will often find that the Best Pruned Tree has fewer decision
nodes than the Minimum Error Tree.
Click the Train Log link in the Output Navigator to navigate to the Training
Log.
The training log, above, shows the misclassification (error) rate as each
additional node is added to the tree. Starting off at 0 nodes with the full data set,
all records would be classified as "low median value" (0).
Click the Train. Score Summary link to navigate to the Classification
Confusion Matrix.
The confusion matrix, above, displays counts for cases that were correctly and
incorrectly classified in the validation data set. The 1 in the lower left cell, for
example, indicates that there was 1 case that was classified as 1 that was actually
0.
Click the Training Lift Charts and Validation Lift Charts link to navigate to
the Lift Charts.
Lift charts are visual aids for measuring model performance. They consist of a
lift curve and a baseline. The greater the area between the lift curve and the
baseline, the better the model.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set. Then the data sets are sorted
using the predicted output variable value. After sorting, the actual outcome
values of the output variable are cumulated and the lift curve is drawn as the
number of cases versus the cumulated value. The baseline is drawn as the
number of cases versus the average of actual output variable values multiplied
by the number of cases. The decilewise lift curve is drawn as the decile number
versus the cumulative actual output variable value divided by the decile's
average output variable value.
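The construction of the lift curve, baseline, and decile-wise lift described above can be sketched as follows. The data and helper names here are hypothetical illustrations, not XLMiner functions, and the decile-wise lift is computed in its common form (each decile's mean actual value over the global mean).

```python
import numpy as np

def lift_points(actual, predicted):
    """Sort cases by predicted value (best first), cumulate the actual
    outcomes, and pair them with a baseline of k * mean(actual)."""
    order = np.argsort(-np.asarray(predicted, dtype=float))
    cum_actual = np.cumsum(np.asarray(actual, dtype=float)[order])
    baseline = np.arange(1, len(order) + 1) * np.mean(actual)
    return cum_actual, baseline

def decile_lift(actual, predicted):
    """Mean actual value within each predicted-sorted decile, divided by
    the overall mean actual value."""
    order = np.argsort(-np.asarray(predicted, dtype=float))
    deciles = np.array_split(np.asarray(actual, dtype=float)[order], 10)
    overall = float(np.mean(actual))
    return [float(d.mean()) / overall for d in deciles]
```

The greater the gap between `cum_actual` and `baseline`, the better the model separates the classes.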
XLMiner generates the CT_Stored_1 worksheet along with the other outputs.
Please refer to the Scoring chapter for details.
Input variables
Variables selected to be included in the output appear here.
Weight variable
This option is not used with the Classification Tree algorithm.
Output variable
The dependent variable or the variable to be classified appears here.
# Classes
Displays the number of classes in the Output variable.
Prune Tree
XLMiner will prune the tree using the validation set when Prune Tree is
selected. (Pruning the tree using the validation set will reduce the error from
over-fitting the tree using the training data.) This option is selected by default.
If no validation set exists, then this option is disabled.
Bayes Theorem
Let X be the data record (case) whose class label is unknown. Let H be some
hypothesis, such as "data record X belongs to a specified class C." For
classification, we want to determine P (H|X) -- the probability that the
hypothesis H holds, given the observed data record X.
P (H|X) is the posterior probability of H conditioned on X. For example, the
probability that a fruit is an apple, given the condition that it is red and round. In
contrast, P(H) is the prior probability, or a priori probability, of H. In this
example P(H) is the probability that any given data record is an apple, regardless
of how the data record looks. The posterior probability, P(H|X), is based on
more information (such as background knowledge) than the prior probability,
P(H), which is independent of X.
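Bayes' theorem combines the prior, the conditional probability of the evidence, and the total probability of the evidence into the posterior. A numeric sketch of the fruit example follows; all of the probabilities here are hypothetical.

```python
# Hypothetical probabilities for the "red and round apple" example
p_h = 0.20              # prior P(H): a randomly chosen fruit is an apple
p_x_given_h = 0.90      # P(X|H): an apple is red and round
p_x_given_not_h = 0.15  # P(X|not H): a non-apple is red and round

# Total probability of observing the evidence X ("red and round")
p_x = p_x_given_h * p_h + p_x_given_not_h * (1.0 - p_h)

# Bayes' theorem: posterior P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
```

With these numbers the posterior probability that a red, round fruit is an apple is 0.6, three times the prior of 0.2.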
First, we partition the data into training and validation sets using the Standard
Data Partition defaults of 60% of the data randomly allocated to the Training Set
and 40% of the data randomly allocated to the Validation Set. For more
information on partitioning a dataset, see the Data Mining Partitioning chapter.
Select Var2, Var3, Var4, Var5, and Var6 as Input variables and
TestRest/Var1 as the Output variable. The # Classes statistic will be
automatically updated with a value of 2 when the Output variable is selected.
This indicates that the Output variable, TestRest/Var1, contains two classes, 0
and 1.
Choose the value that will be the indicator of Success by clicking the down
arrow next to Specify Success class (necessary). In this example, we will use
the default of 1 indicating that a value of 1 will be specified as a success.
Enter a value between 0 and 1 for Specify the initial cutoff probability for
success. If the Probability of success (probability of the output variable = 1) is
less than this value, then a 0 will be entered for the class value, otherwise a 1
will be entered for the class value. In this example, we will keep the default of
0.5.
Select Detailed report, Summary report, and Lift charts under both Score
training data and Score validation data to obtain the complete output results for
this classification method.
For more information on the options for Score new data, please see the Scoring
chapter.
Click the Prior Class Pr link on the Output Navigator to view the Prior Class
Probabilities table on the NNB_Output1 worksheet. As shown, 54.17% of the
training data records belonged to the 1 class and 45.83% of the training data
records belong to the 0 class.
Click the Conditional Probabilities link to display the table below. This table
shows the probabilities for each case for each variable. For example, for Var2,
15.38% of the cases were classified as 0, 84.62% of the cases were classified
as 1, and no cases were classified as 2.
Click the Training Lift Chart and Validation Lift Chart links.
Lift charts are visual aids for measuring model performance. They consist of a
lift curve and a baseline. The greater the area between the lift curve and the
baseline, the better the model.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set. Then the data sets are sorted
using the predicted output variable value. After sorting, the actual outcome
values of the output variable are cumulated and the lift curve is drawn as the
number of cases versus the cumulated value. The baseline is drawn as the
number of cases versus the average of actual output variable values multiplied
by the number of cases. The decilewise lift curve is drawn as the decile number
versus the cumulative actual output variable value divided by the decile's
average output variable value.
Please see the Scoring chapter for information on the worksheet NNB_Stored_1.
Input variables
Variables selected to be included in the output appear here.
Weight variable
This option is not used with the Naïve Bayes algorithm.
Output variable
The dependent variable or the variable to be classified appears here.
# Classes
Displays the number of classes in the Output variable.
2.
A function (g) that sums the weights and maps the results to an output
(y).
Neurons are organized into layers: input, hidden and output. The input layer is
composed not of full neurons, but rather consists simply of the record's values
that are inputs to the next layer of neurons. The next layer is the hidden layer.
Several hidden layers can exist in one neural network. The final layer is the
output layer, where there is one node for each class. A single sweep forward
through the network results in the assignment of a value to each output node,
and the record is assigned to the class node with the highest value.
Feedforward, Back-Propagation
The feedforward, back-propagation architecture was developed in the early
1970s by several independent sources (Werbos; Parker; Rumelhart, Hinton, and
Williams). This independent co-development was the result of a proliferation of
articles and talks at various conferences, which stimulated the entire industry.
Currently, this synergistically developed back-propagation architecture is the
most popular, effective, and easy-to-learn model for complex, multi-layered
networks. Its greatest strength is in non-linear solutions to ill-defined problems.
The typical back-propagation network has an input layer, an output layer, and at
least one hidden layer. There is no theoretical limit on the number of hidden
layers but typically there are just one or two. Some studies have shown that the
total number of layers needed to solve problems of any complexity is 5 (one
input layer, three hidden layers and an output layer). Each layer is fully
connected to the succeeding layer.
As noted above, the training process normally uses some variant of the Delta
Rule, which starts with the calculated difference between the actual outputs and
the desired outputs. Using this error, connection weights are increased in
proportion to the error times a scaling factor for global accuracy. This
means that the inputs, the output, and the desired output all must be present at
the same processing element. The most complex part of this algorithm is
determining which input contributed the most to an incorrect output and how
the input must be modified to correct the error. (An inactive node would not
contribute to the error and would have no need to change its weights.) To solve
this problem, training inputs are applied to the input layer of the network, and
desired outputs are compared at the output layer. During the learning process, a
forward sweep is made through the network, and the output of each element is
computed layer by layer. The difference between the output of the final layer
and the desired output is back-propagated to the previous layer(s), usually
modified by the derivative of the transfer function. The connection weights are
normally adjusted using the Delta Rule. This process proceeds for the previous
layer(s) until the input layer is reached.
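The forward sweep and the Delta Rule adjustment described above can be sketched for a tiny network. The layer sizes, initial weights, training record, and step size below are arbitrary illustrations, not XLMiner's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    """Standard transfer function with range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative network: 2 inputs -> 2 hidden nodes -> 1 output node
W1 = rng.normal(size=(2, 2)) * 0.5    # input-to-hidden weights
W2 = rng.normal(size=(2, 1)) * 0.5    # hidden-to-output weights
x = np.array([0.0, 1.0])              # one training record
target = np.array([1.0])              # desired output
step = 0.1                            # step size for gradient descent

# Forward sweep: compute each layer's output in turn
h = logistic(x @ W1)
y = logistic(h @ W2)
err_before = abs(target - y).item()

# Back-propagate: the output error, modified by the derivative of the
# transfer function (logistic'(z) = y * (1 - y)), flows to the previous layer
delta_out = (target - y) * y * (1 - y)
delta_hidden = (delta_out @ W2.T) * h * (1 - h)

# Delta Rule: adjust each weight in proportion to the error times its input
W2 += step * np.outer(h, delta_out)
W1 += step * np.outer(x, delta_hidden)

err_after = abs(target - logistic(logistic(x @ W1) @ W2)).item()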
Then divide that result again by a scaling factor between five and ten. Larger
scaling factors are used for relatively less noisy data. If too many artificial
neurons are used the training set will be memorized, not generalized, and the
network will be useless on new data sets.
First, we partition the data into training and validation sets using a Standard
Data Partition with percentages of 80% of the data randomly allocated to the
Training Set and 20% of the data randomly allocated to the Validation Set. For
more information on partitioning a dataset, see the Data Mining Partitioning
chapter.
Select Type as the Output variable and the remaining variables as Input
Variables. Since the Output variable contains three classes (A, B, and C) to
denote the three different wineries, the options for Classes in the output variable
are disabled.
XLMiner also allows a Weight variable. This option can be used if the data
contains multiple cases (objects) sharing the same variable values. The weight
variable denotes the number of cases with those values.
This dialog contains the options to define the network architecture. Select
Normalize input data. Normalizing the data (subtracting the mean and
dividing by the standard deviation) is important to ensure that the distance
measure accords equal weight to each variable -- without normalization, the
variable with the largest scale would dominate the measure.
Frontline Solvers V2014
320
XLMiner provides two options for the Network Architecture -- Automatic and
Manual. The default network architecture is 'Automatic'. This option generates
several neural networks in the output sheet for various combinations of hidden
layers and nodes within each layer. The total number of the neural networks
generated using the 'Automatic' option currently is 100. Choose the Manual
option to specify the number of hidden layers and the number of nodes for one
neural network. Please see the example below for explanations of the various
fields to be specified when the "Manual' network architecture is chosen is as
follows. For this example, keep the default setting of Automatic. See the next
example for an illustration of how to use the Manual Network Architecture
setting.
Keep the default setting of 30 for # Epochs. An epoch is one sweep through all
records in the training set.
Keep the default setting of 0.1 for Step size for gradient descent. This is the
multiplying factor for the error correction during backpropagation; it is roughly
equivalent to the learning rate for the neural network. A low value produces
slow but steady learning, a high value produces rapid but erratic learning.
Values for the step size typically range from 0.1 to 0.9.
Keep the default setting of 0.6 for Weight change momentum. In each new
round of error correction, some memory of the prior correction is retained so
that an outlier that crops up does not spoil accumulated learning.
Keep the default setting of 0.01 for Error tolerance. The error in a particular
iteration is backpropagated only if it is greater than the error tolerance. Typically
error tolerance is a small value in the range from 0 to 1.
Keep the default setting of 0 for Weight decay. To prevent over-fitting of the
network on the training data, set a weight decay to penalize the weight in each
iteration. Each calculated weight will be multiplied by (1-decay).
XLMiner provides four options for cost functions -- Squared mirror, Cross
entropy, Maximum likelihood and Perceptron convergence. The user can select
the appropriate one. Keep the default selection, Squared error, for this example.
Nodes in the hidden layer receive input from the input layer. The output of the
hidden nodes is a weighted sum of the input values. This weighted sum is
computed with weights that are initially set at random values. As the network
learns these weights are adjusted. This weighted sum is used to compute the
hidden nodes output using a transfer function. Select Standard (the default
setting) to use a logistic function for the transfer function with a range of 0 and
1. This function has a squashing effect on very small or very large values but
is almost linear in the range where the value of the function is between 0.1 and
0.9.2 Select Symmetric to use the tanh function for the transfer function, the
range being -1 to 1. Keep the default selection, Standard, for this example. If
more than one hidden layer exists, this function is used for all layers.
As in the hidden layer output calculation (explained in the above paragraph), the
output layer is also computed using the same transfer function. Select Standard
(the default setting) to use a logistic function for the transfer function with a
range of 0 and 1. Select Symmetric to use the tanh function for the transfer
function, the range being -1 to 1. Keep the default selection, Standard, for this
example.
2Galit
Shmueli, Nitin R. Patel, and Peter C. Bruce, Data Mining for Business Intelligence (New Jersey: Wiley, 2010) 226.
The above error report gives the total number of errors and the % error in
classification produced by each network ID for the training and validation sets
separately. For example: Net 26 has 2 hidden layers each having one node in
each hidden layer. For this neural network, the percentage of errors in the
training data is 72.54% and the percentage of errors in the validation data is
75%.
XLMiner provides sorting of the error report according to increasing or
decreasing order of the %Error by clicking the up arrow next to % Error. Click
the upgrade arrow to sort in ascending order, and the downward arrow to sort in
descending order.
If you click a hyperlink for a particular Net ID (say Net 26) in the Error Report,
the following dialog appears. Here, the user can select the various options for
scoring data on Net ID 26. See the example below for more on this dialog and
the associated output.
This example will use the same dataset to illustrate the use of the Manual
Network Architecture selection.
Follow the steps above for the Step 1 of 3 dialog. On the Step 2 of 3 dialog,
keep the default setting of 1 for the # hidden layers option. Up to four hidden
layers can be specified for this option.
Keep the default setting of 25 for #Nodes. (Since # hidden layers is set to 1,
only the first text box is enabled.)
Frontline Solvers V2014
324
Select Detailed report and Summary report under both Score training data
and Score validation data.
For more information on the Score new data options, see the Scoring chapter.
Click the Training Epoch Log to display the Neural Network Classification Log.
XLMiner also provides intermediate information produced during the last pass
through the network
Scroll down on the Output1 worksheet to the Interlayer connections' weights
table.
Frontline Solvers V2014
326
Recall that a key element in a neural network is the weights for the connections
between nodes. In this example, we chose to have one hidden layer, and we also
chose to have 25 nodes in that layer. XLMiner's output contains a section that
contains the final values for the weights between the input layer and the hidden
layer, between hidden layers, and between the last hidden layer and the output
layer. This information is useful at viewing the insides of the neural network;
however, it is unlikely to be of use to the data analyst end-user. Displayed above
are the final connection weights between the input layer and the hidden layer for
our example.
Click the Training Epoch Log link on the Output Navigator to display the
following log.
During an epoch, each training record is fed forward in the network and
classified. The error is calculated and is back propagated for the weights
correction. Weights are continuously adjusted during the epoch. The
classification error is computed as the records pass through the network. It does
not report the classification error after the final weight adjustment. Scoring of
the training data is performed using the final weights so the training
classification error may not exactly match with the last epoch error in the Epoch
log.
See the Scoring chapter for information on Stored Model Sheets,
NNC_Stored_1.
The Step 2 of 3 dialog contains options to define the network architecture. For
this example, accept the default values. (Details on these choices are explained
in the above examples.)
The above error report gives the total number of errors, % Error, % Sensitivity
and % Specificity in the classification produced by each network ID for the
training and validation datasets separately. For example: Net 10 has one hidden
layer having 10 nodes. For this neural network, the percentage of errors in the
training data is 4.69% and the percentage of errors in the validation data is
3.96%. The percentage sensitivity is 76.92% and 84.21% for training data and
validation data respectively. The percentage specificity is 98.82% and 98.78%
for training data and validation data respectively.
Numerically, sensitivity is the number of true positive results (TP) divided by
the sum of true positive and false negative (FN) results,
i.e., sensitivity = TP/(TP + FN).
Numerically, specificity is the number of true negative results (TN) divided by
the sum of true negative and false positive (FP) results,
i.e., specificity = TN/(TN + FP).
XLMiner provides sorting of the error report according to increasing or
decreasing order of the %Error, %Sensitivity or %Specificity. This can be done
by clicking the up arrow next to %Error, %Sensitivity or %Specificity,
respectively. Click the upward arrow to sort in ascending order, and the
downward arrow to sort in descending order.
Input variables
Variables selected to be included in the output appear here.
Weight variable
This option is not used with the Neural Network Classification algorithm.
Output variable
The dependent variable or the variable to be classified appears here.
# Classes
Displays the number of classes in the Output variable.
Network Architecture
XLMiner provides two options for the Network Architecture -- Automatic and
Manual. The default network architecture is 'Automatic'. This option generates
several neural networks in the output sheet for various combinations of hidden
layers and nodes within each layer. The total number of the neural networks
generated using the 'Automatic' option currently is 100. Choose the Manual
option to specify the number of hidden layers and the number of nodes for one
neural network.
# Hidden Layers
When Manual is selected, this option is enabled. XLMiner supports up to 4
hidden layers.
Frontline Solvers V2014
332
# Nodes
When Manual is selected, this option is enabled. Enter the number of nodes per
layer here.
# Epochs
An epoch is one sweep through all records in the training set. The default
setting is 30.
Error tolerance
The error in a particular iteration is backpropagated only if it is greater than the
error tolerance. Typically error tolerance is a small value in the range from 0 to
1. The default setting is 0.01.
Weight decay
To prevent over-fitting of the network on the training data, set a weight decay to
penalize the weight in each iteration. Each calculated weight will be multiplied
by (1-decay). The default setting is 0.
Cost Function
XLMiner provides four options for the cost function -- Squared mirror, Cross
entropy, Maximum likelihood and Perceptron convergence. The user can select
the appropriate one. The default setting is Squared error.
range being -1 to 1. If more than one hidden layer exists, this function is used
for all layers. The default selection is Standard.
First, we partition the data into training and validation sets using the Standard
Data Partition defaults with percentages of 60% of the data randomly allocated
to the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Mining
Partitioning chapter.
Select MEDV as the Output variable and all remaining variables (except CAT.
MEDV) as Input variables. (The bottom portion of the dialog is not used with
prediction methods.)
Select Yes to proceed to the Best Subset dialog (see below). Click No to return
to the Multiple Linear Regression Step 1 of 2 dialog.
If Force constant term to zero is selected, there will be no constant term in the
equation. Leave this option unchecked for this example.
Select Fitted values. When this option is selected, the fitted values are
displayed in the output.
Select ANOVA table. When this option is selected, the ANOVA table is
displayed in the output.
Select Standardized under Residuals to display the Standardized Residuals in
the output. Standardized residuals are obtained by dividing the unstandardized
residuals by the respective standard deviations.
Frontline Solvers V2014
340
Select Studentized. When this option is selected the Studentized Residuals are
displayed in the output. Studentized residuals are computed by dividing the
unstandardized residuals by quantities related to the diagonal elements of the hat
matrix, using a common scale estimate computed without the ith case in the
model. These residuals have t - distributions with ( n-k-1) degrees of freedom.
As a result, any residual with absolute value exceeding 3 usually requires
attention.
Select Deleted. When this option is selected the Deleted Residuals are
displayed in the output. This residual is computed for the ith observation by first
fitting a model without the ith observation, then using this model to predict the ith
observation. Afterwards the difference is taken between the predicted
observation and the actual observation.
Select Cook's Distance. When this checkbox is selected the Cook's Distance
for each observation is displayed in the output. This is an overall measure of the
impact of the ith datapoint on the estimated regression coefficient. In linear
models Cook's Distance has, approximately, an F distribution with k and (n-k)
degrees of freedom.
Select DF fits. When this checkbox is selected the DF fits (change in the
regression fit) for each observation is displayed in the output. These reflect
coefficient changes as well as forecasting effects when an observation is deleted.
Select Covariance Ratios. When this checkbox is selected, the covariance
ratios are displayed in the output. This measure reflects the change in the
variance-covariance matrix of the estimated coefficients when the ith observation
is deleted.
Select Hat matrix Diagonal. When this checkbox is selected, the diagonal
elements of the hat matrix are displayed in the output. This measure is also
known as the leverage of the ith observation.
Select Perform Collinearity diagnostics. When this checkbox is selected, the
collinearity diagnostics are displayed in the output.
Frontline Solvers V2014
342
Click OK to return to the Step 2 of 2 dialog, then click Best subset (on the Step
2 of 2 dialog) to open the following dialog.
When you have a large number of predictors and would like to limit the model
to only the significant variables, select Perform best subset selection to select
Frontline Solvers V2014
343
the best subset. For this example, enter 13 (the default value) for the Maximum
size of best subsets (for a model with up to 13 variables). XLMiner accepts an
integer value of 1 up to N where N is the number of Input variables in the
model.
Enter 3 for Number of best subsets. XLMiner will first show the best, then the
next-best, etc., and will show this number of subsets for subsets of one variable,
subsets of two variables, etc., on up to subsets of the size you specified above.
XLMiner allows integer values up to 20.
Select Backward elimination for the Selection procedure.
XLMiner offers five different selection procedures for selecting the best subset
of variables.
Backward elimination in which variables are eliminated one at a time,
starting with the least significant.
Forward selection in which variables are added one at a time, starting
with the most significant.
Exhaustive search where searches of all combinations of variables are
performed to observe which combination has the best fit. (This option
can become quite time consuming depending on the number of input
variables.)
Sequential replacement in which variables are sequentially replaced
and replacements that improve performance are retained.
Stepwise selection is similar to Forward selection except that at each
stage, XLMiner considers dropping variables that are not statistically
significant. When this procedure is selected, the Stepwise selection
options FIN and FOUT are enabled. In the stepwise selection
procedure a statistic is calculated when variables are added or
eliminated. For a variable to come into the regression, the statistics
value must be greater than the value for FIN (default = 3.84). For a
variable to leave the regression, the statistics value must be less than
the value of FOUT (default = 2.71). The value for FIN must be greater
than the value for FOUT.
Click OK to return to the Step 2 of 2 dialog, then click Finish. Click the
Output1 worksheet to find the Output Navigator. Click any link here to display
the selected output.
Frontline Solvers V2014
344
Click the Train. Score Detailed Rep. link to open the Multiple Linear
Regression Prediction of Training Data table. Of primary interest in a datamining context will be the predicted and actual values for each record, along
with the residual (difference), shown here for the training data set:
XLMiner also displays The Total sum of squared errors summaries for both the
training and validation data sets on the Output1 worksheet. The total sum of
squared errors is the sum of the squared errors (deviations between predicted
and actual values) and the root mean square error (square root of the average
squared error). The average error is typically very small, because positive
prediction errors tend to be counterbalanced by negative ones.
Every model includes a constant term (since Force constant term to zero was
not selected on the Step 2 of 2 dialog) and one or more variables as the
additional coefficients. We can use any of these models for further analysis by
clicking on the respective link, "Choose Subset". The choice of model depends
on the calculated values of various error values and the probability. The error
values calculated are
When hovering over Choose Subset, the mouse icon will change to a grabber
hand. If Choose Subset is clicked, XLMiner opens the Multiple Linear
Regression Step 1 of 1 dialog displaying the input variables included in that
particular subset. Scroll down to the end of the table.
Compare the RSS value as the number of coefficients in the subset increases
from 11 to 12 (8923.724 down to 6978.134). The RSS for 12 coefficients is just
slightly higher than the RSS for 14 coefficients suggesting that a model with 12
coefficients may be sufficient to fit a regression. Click the Choose Subset link
next to the first model with 12 coefficients (RSS = 6978.134), the Multiple
Linear Regression The Step 1 of 2 dialog appears with these 12 variables
already selected as Input variables. The User can easily click Next to run a
Multiple Linear Regression on these variables.
Model terms are shown in the Regression Model output shown below along
with the Summary statistics
The Regression Model table contains the coefficient, the standard error of the
coefficient, the p-value and the Sum of Squared Error for each variable included
in the model. The Sum of Squared Errors is calculated as each variable is
introduced in the model beginning with the constant term and continuing with
each variable as it appears in the dataset.
Summary statistics (to the above right) show the residual degrees of freedom
(#observations - #predictors), the R-squared value, a standard deviation type
measure for the model (which typically has a chi-square distribution), and the
Residual Sum of Squares error.
The R-squared value shown here is the r-squared value for a logistic regression
model , defined as R2 = (D0-D)/D0 ,
where D is the Deviance based on the fitted model and D0 is the deviance based
on the null model. The null model is defined as the model containing no
predictor variables apart from the constant.
Click the Collinearity Diagnostics link to display the Collinearity Diagnostics
table. This table helps assess whether two or more variables so closely track one
another as to provide essentially the same information. As you can see the NOX
variable was ignored.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set (if one exists). Then the data
set(s) are sorted using the predicted output variable value. After sorting, the
actual outcome values of the output variable are cumulated and the lift curve is
drawn as the number of cases versus the cumulated value. The baseline (red line
connecting the origin to the end point of the blue line) is drawn as the number of
cases versus the average of actual output variable values multiplied by the
number of cases. The decilewise lift curve is drawn as the decile number versus
the cumulative actual output variable value divided by the decile's average
output variable value.
See the chapter on Stored Model Sheets for more information on the
MLR_Stored_1 worksheet.
Input variables
Variables listed here will be utilized in the XLMiner output.
Weight variable
One major assumption of Multiple Linear Regression is that each observation
provides equal information. XLMiner offers an opportunity to provide a Weight
variable. Using a Weight variable allows the user to allocate a weight to each
record. A record with a large weight will influence the model more than a
record with a smaller weight.
Output Variable
Select the variable whose outcome is to be predicted here.
Fitted values
When this option is selected, the fitted values are displayed in the output. This
option is not selected by default.
Anova table
When this option is selected, the ANOVA table is displayed in the output. This
option is not selected by default.
Standardized
Select this option under Residuals to display the Standardized Residuals in the
output. Standardized residuals are obtained by dividing the unstandardized
residuals by the respective standard deviations. This option is not selected by
default.
Unstandardized
Select this option under Residuals to display the Unstandardized Residuals in
the output. Unstandardized residuals are computed by the formula:
Unstandardized residual = Actual response - Predicted response. This option is
not selected by default.
Studentized
When this option is selected the Studentized Residuals are displayed in the
output. Studentized residuals are computed by dividing the unstandardized
residuals by quantities related to the diagonal elements of the hat matrix, using a
common scale estimate computed without the ith case in the model. These
residuals have t-distributions with (n - k - 1) degrees of freedom. As a result, any
residual with absolute value exceeding 3 usually requires attention. This option
is not selected by default.
Deleted
When this option is selected the Deleted Residuals are displayed in the output.
This residual is computed for the ith observation by first fitting a model without
the ith observation, then using this model to predict the ith observation.
Afterwards the difference is taken between the predicted observation and the
actual observation. This option is not selected by default.
DF fits
When this checkbox is selected, the DF fits (change in the regression fit) for each
observation are displayed in the output. These reflect coefficient changes as well
as changes in the fitted values when the ith observation is deleted. This option is
not selected by default.
Frontline Solvers V2014
353
Covariance Ratios
When this checkbox is selected, the covariance ratios are displayed in the
output. This measure reflects the change in the variance-covariance matrix of the
estimated coefficients when the ith observation is deleted. This option is not
selected by default.
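Two of the residual types described above can be sketched for simple linear regression. This is an illustrative computation on hypothetical data, not XLMiner's implementation; the deleted residual follows the leave-one-out recipe given for the Deleted option.

```python
# Hedged sketch of two residual types for simple OLS regression
# y = b0 + b1*x, on hypothetical data (not XLMiner's implementation).
def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1                      # (intercept, slope)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = ols(xs, ys)

# Unstandardized residual = actual response - predicted response.
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Deleted residual for observation i: refit the model without the ith
# observation, then use that model to predict the ith observation.
def deleted_residual(i):
    xs_i, ys_i = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    a0, a1 = ols(xs_i, ys_i)
    return ys[i] - (a0 + a1 * xs[i])

d0 = deleted_residual(0)
```

Ordinary least squares makes the unstandardized residuals sum to zero, which is a useful sanity check on the fit.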
Multicollinearity Criterion
When Perform Collinearity diagnostics is selected, Multicollinearity criterion is
enabled. Multicollinearity can be defined as the occurrence of two or more
input variables that share the same linear relationship with the outcome variable.
Enter a value between 0 and 1. The default setting is 0.05.
Selection Procedure
XLMiner offers five different selection procedures for selecting the best subset
of variables.
Backward elimination in which variables are eliminated one at a time,
starting with the least significant.
Forward selection in which variables are added one at a time, starting
with the most significant.
Exhaustive search where searches of all combinations of variables are
performed to observe which combination has the best fit. (This option
can become quite time consuming depending on the number of input
variables.)
Sequential replacement in which variables are sequentially replaced
and replacements that improve performance are retained.
Stepwise selection is similar to Forward selection except that at each
stage, XLMiner considers dropping variables that are not statistically
significant. When this procedure is selected, the Stepwise selection
options FIN and FOUT are enabled. In the stepwise selection
procedure a statistic is calculated when variables are added or
eliminated. For a variable to come into the regression, the statistics
value must be greater than the value for FIN (default = 3.84). For a
variable to leave the regression, the statistics value must be less than
the value of FOUT (default = 2.71). The value for FIN must be greater
than the value for FOUT.
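The FIN/FOUT entry and exit tests above reduce to two simple comparisons. The sketch below shows only that decision logic, with the documented defaults; the partial F statistic itself would come from the regression and is assumed here.

```python
# Hedged sketch of the stepwise entry/exit tests described above.
# 'f_stat' stands in for the statistic computed when a candidate
# variable is added to or removed from the current model.
FIN, FOUT = 3.84, 2.71      # XLMiner's documented defaults

def enters(f_stat):
    # A variable comes into the regression only if its statistic exceeds FIN.
    return f_stat > FIN

def leaves(f_stat):
    # A variable leaves the regression if its statistic falls below FOUT.
    return f_stat < FOUT

assert FIN > FOUT           # required by the procedure
```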
1. For each row (case) in the target data set (the set to be predicted), locate
the k closest members (the k nearest neighbors) of the training data set. A
Euclidean Distance measure is used to calculate how close each member of
the training set is to the target row that is being examined.
2. Find the weighted sum of the variable of interest for the k nearest neighbors
(the weights are the inverse of the distances).
3. Repeat this procedure for the remaining rows (cases) in the target set.
4. Additionally, XLMiner allows the user to select a maximum value for k,
builds models in parallel on all values of k (up to the maximum specified
value), and performs scoring on the best of these models.
Computing time increases as k increases, but the advantage is that higher values
of k provide smoothing that reduces vulnerability to noise in the training data.
Typically, k is on the order of tens rather than hundreds or thousands.
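The numbered steps above can be sketched directly: Euclidean distances to every training row, then an inverse-distance weighted average over the k nearest neighbors. The training rows and target below are hypothetical.

```python
import math

# Illustrative sketch of the k-nearest-neighbor prediction steps above,
# on hypothetical data (not XLMiner's implementation).
def knn_predict(target, train_x, train_y, k):
    dists = [math.dist(target, row) for row in train_x]       # Euclidean
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Weights are the inverse of the distances (guarded against zero).
    weights = [1.0 / max(dists[i], 1e-12) for i in nearest]
    return (sum(w * train_y[i] for w, i in zip(weights, nearest))
            / sum(weights))

train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [10.0, 12.0, 14.0, 40.0]
pred = knn_predict([0.1, 0.1], train_x, train_y, k=3)
```

Because the weights are inverse distances, the prediction is pulled strongly toward the closest neighbor's value.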
CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT, MEDV
A portion of the dataset is shown below. The last variable, CAT. MEDV, is a
discrete classification of the MEDV variable and will not be used in this
example.
First, we partition the data into training and validation sets using the Standard
Data Partition defaults with percentages of 60% of the data randomly allocated
to the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Mining
Partitioning chapter.
Select MEDV as the Output variable, and the remaining variables (except CAT.
MEDV) as Input variables. (The Weight variable and Class options are not
supported in this method and are disabled.)
Select Normalize Input data. When this option is selected, the input data is
normalized which means that all data is expressed in terms of standard
deviations. This option is available to ensure that the distance measure is not
dominated by variables with a large scale.
Enter 5 for the Number of Nearest Neighbors. This is the parameter k in the
k-nearest neighbor algorithm. The value of k should be between 1 and the total
number of observations (rows). Typically, this is chosen to be on the order of tens.
Select Score on best k between 1 and specified value for the Scoring option.
XLMiner will display the output for the best k between 1 and 5. If Score on
specified value of k as above is selected, the output will be displayed for the
specified value of k.
Select Detailed scoring, Summary report, and Lift charts under both Score
training data and Score validation data to show an assessment of the
performance in predicting the training data.
The options in the Score test data group are enabled only when a test partition is
available.
Please see the Scoring chapter for a complete discussion on the options under
Score New Data.
Click Finish. Worksheets containing the output of the method will be inserted
at the end of the workbook. The Output1 worksheet contains the Output
Navigator which allows easy access to all portions of the output.
Scroll down the Output1 worksheet to the Validation error log (shown below).
As per our specifications, XLMiner has calculated the RMS error for all values
of k and indicated the value of k with the smallest RMS error.
A little further down the page is the Summary Report, shown below. This report
summarizes the prediction error. The first number, the total sum of squared
errors, is the sum of the squared deviations (residuals) between the predicted and
actual values. The second is the square root of the average of the squared
residuals. The third is the average deviation. All these values are calculated for
the best k, i.e. k=2.
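The three summary-report statistics just described can be computed directly. The predicted and actual values below are hypothetical, chosen only to make the arithmetic visible.

```python
import math

# Illustrative computation of the three summary-report statistics
# described above, on hypothetical predicted/actual values.
predicted = [22.0, 18.5, 30.1, 15.0]
actual    = [21.0, 19.5, 28.1, 16.0]

residuals = [a - p for a, p in zip(actual, predicted)]
sse       = sum(r * r for r in residuals)      # total sum of squared errors
rmse      = math.sqrt(sse / len(residuals))    # root of average squared residual
avg_error = sum(residuals) / len(residuals)    # average deviation
```

Note how the average deviation is much smaller in magnitude than the RMS error: positive and negative residuals cancel unless squared first.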
Select the Valid. Score Detailed Rep. link in the Output Navigator to display
the Prediction of Validation Data table, shown below. This table displays the
predicted value, the actual value and the difference between them (the
residuals), for each record.
Click the Training Lift Charts and Validation Lift Charts links to display
each chart. The Lift charts (shown below) are visual aids for
measuring the model's performance. They consist of a lift curve and a baseline.
The greater the area between the lift curve and the baseline, the better the model.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set. Then the data sets are sorted
using the predicted output variable value (or predicted probability of success in
the logistic regression case). After sorting, the actual outcome values of the
output variable are cumulated and the lift curve is drawn as the number of cases
versus the cumulated value. The baseline is drawn as the number of cases
versus the average of actual output variable values multiplied by the number of
cases. The decilewise lift curve is drawn as the decile number versus the
cumulative actual output variable value divided by the decile's average output
variable value.
Input variables
Variables listed here will be utilized in the XLMiner output.
Output Variable
Select the variable whose outcome is to be predicted here.
Scoring Option
When Score on best k between 1 and specified value is selected, XLMiner will
display the output for the best k between 1 and the value entered for Number of
nearest neighbors (k). If Score on specified value of k as above is selected, the
output will be displayed for the specified value of k. The default setting is Score
on specified value of k.
Methodology
A Regression tree is built through a process known as binary recursive
partitioning. This is an iterative process that splits the data into partitions or
branches, and then continues splitting each partition into smaller groups as the
method moves up each branch.
Initially, all records in the training set (the pre-classified records that are used to
determine the structure of the tree) are grouped into the same partition. The
algorithm then begins allocating the data into the first two partitions or
branches, using every possible binary split on every field. The algorithm
selects the split that minimizes the sum of the squared deviations from the mean
in the two separate partitions. This splitting rule is then applied to each of the
new branches. This process continues until each node reaches a user-specified
minimum node size and becomes a terminal node. (If the sum of squared
deviations from the mean in a node is zero, then that node is considered a
terminal node even if it has not reached the minimum size.)
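The splitting rule described above (minimize the sum of squared deviations from the mean in the two partitions) can be sketched for a single field. The data below are hypothetical; XLMiner searches every possible binary split on every field.

```python
# Hedged sketch of the regression-tree splitting rule described above:
# choose the cut on one field that minimizes total within-partition
# squared deviation from the mean (hypothetical data).
def sq_dev(vals):
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(xs, ys):
    best = (None, float("inf"))
    for cut in sorted(set(xs))[:-1]:
        left  = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        score = sq_dev(left) + sq_dev(right)
        if score < best[1]:
            best = (cut, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.5, 5.2, 20.0, 21.0, 19.5]
cut, score = best_split(xs, ys)
```

With these values the low-y and high-y cases separate cleanly, so the chosen cut falls between the two groups.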
factor is specified as zero, then pruning simply finds the tree that performs
best on the validation data in terms of total terminal-node variance. Larger values of
the cost complexity factor result in smaller trees. Pruning is performed on a
last-in, first-out basis, meaning the last-grown node is the first subject to
elimination.
ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT, MEDV
A portion of the dataset is shown below. The last variable, CAT. MEDV, is a
discrete classification of the MEDV variable and will not be used in this
example.
First, we partition the data into training and validation sets using the Standard
Data Partition with percentages of 50% of the data randomly allocated to the
Training Set and 30% of the data randomly allocated to the Validation Set and
20% of the data randomly allocated to the Test Set (default settings for Specify
percentages). For more information on partitioning a dataset, see the Data
Mining Partitioning chapter.
Select MEDV as the Output variable, then select the remaining variables
(except CAT.MEDV) as the Input variables. (The Weight variable and the
Classes group are not used in the Regression Tree predictive method.)
Leave Normalize input data option unchecked. Normalizing the data only makes
a difference if linear combinations of the input variables are used for splitting.
Enter 100 for the Maximum #splits for input variables. This is the maximum
number of splits allowed for each input variable.
Enter 25 for the Minimum #records in a terminal node. The tree will continue to
grow until all terminal nodes reach this size.
Select Using Best prune tree for the Scoring option. The option, Maximum
#decision nodes is enabled when Using user specified tree is selected.
Select Pruned tree (pruned using validation data) to display the tree pruned
using the validation dataset.
Select Minimum error tree (pruned using validation data) to display the
minimum error tree, pruned using the validation dataset.
Select Detailed report, Summary report, and Lift charts under Score training
data, Score validation data, and Score test data to display each in the output.
See the Scoring chapter for details on scoring to a worksheet or database.
Click Finish. Worksheets containing the output of the method will be inserted
at the end of the workbook. The Output1 worksheet contains the Output
Navigator which allows easy access to all portions of the output.
Click the Valid. Score Detailed Rep. link to navigate to the Prediction of
Validation Data table.
The Prune log (shown below) shows the residual sum of squares (RSS) at each
stage of the tree for both the training and validation data sets. This is the sum of
the squared residuals (difference between predicted and actual). The prune log
shows that the validation RSS continues reducing as the tree continues to split.
The cost complexity is calculated at each step. The Cost Complexity Factor is
the parameter that governs how far back the tree should be pruned. XLMiner
chooses the number of decision nodes for the pruned tree and the minimum error
tree from the values of RSS and the Cost Complexity Factor. In the Prune log
shown below, Validation RSS continues to reduce until the number of Decision
Nodes increases from 0 to 5. At node 6, the RSS starts to increase. The
Minimum Error and Best Pruned Tree display 6 decision nodes each.
Click the Best Pruned Tree link to display the Best Pruned Tree (shown
below).
We can read this tree as follows. LSTAT (% of the population that is lower
status) is chosen as the first splitting variable; if this percentage is > 9.54 (95
cases), then LSTAT is again chosen for splitting. If LSTAT <= 14.98 (36
cases), then MEDV is predicted as $20.96. So the first rule is "If LSTAT > 9.54
AND LSTAT <= 14.98, then MEDV = $20.96."
If LSTAT <= 9.54%, then we move to RM (Average No. of rooms per dwelling)
as the next divider. If RM >7.141 (12 cases), MEDV for those cases is predicted
to be $40.69 ($40.69 is a terminal node). So the second rule is "If LSTAT <=
9.54 AND RM >7.141, then MEDV = $40.69."
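The two rules read off the tree above amount to nested comparisons. The function below is a hypothetical restatement of just those two rules, with thresholds taken from the example; branches not shown in the excerpt return None.

```python
# The two rules read from the Best Pruned Tree above, expressed as a
# hypothetical scoring function (thresholds from the example; branches
# not covered by the excerpt return None).
def predict_medv(lstat, rm):
    if lstat > 9.54:
        if lstat <= 14.98:
            return 20.96       # Rule 1
        return None            # further splits not shown in the excerpt
    if rm > 7.141:
        return 40.69           # Rule 2 (terminal node)
    return None                # further splits not shown in the excerpt

r1 = predict_medv(12.0, 6.0)   # falls under Rule 1
r2 = predict_medv(5.0, 7.5)    # falls under Rule 2
```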
The output also contains summary reports on both the training data and the
validation data. These reports contain the total sum of squared errors, the root
mean square error (RMS error, or the square root of the mean squared error),
and also the average error (which is much smaller, since errors fall roughly into
negative and positive errors and tend to cancel each other out unless squared
first.)
Select the Training Lift Charts and Validation Lift Charts links to navigate to
each. Lift charts are visual aids for measuring model performance. They consist
of a lift curve and a baseline. The larger the area between the lift curve and the
baseline, the better the model.
After the model is built using the training data set, the model is used to score on
the training data set and the validation data set. Then the data sets are sorted
using the predicted output variable value. After sorting, the actual outcome
values of the output variable are cumulated and the lift curve is drawn as the
number of cases versus the cumulated value. The baseline is drawn as the
number of cases versus the average of actual output variable values multiplied
by the number of cases. The decilewise lift curve is drawn as the decile number
versus the cumulative actual output variable value divided by the decile's
average output variable value.
For information on the RT_Stored_1 worksheet, please see the Scoring chapter.
Input variables
Variables listed here will be utilized in the XLMiner output.
Weight Variable
The Weight variable is not used in this method.
Output Variable
Select the variable whose outcome is to be predicted here.
Scoring option
Select the tree to be used for scoring. Select Using Full tree (the default) to use
the full grown tree for scoring. Select Using Best prune tree to use the best
pruned tree for scoring. Select Using minimum error tree to use the minimum
error tree for scoring. Select Using user specified tree to use a tree specified by
the user. The option, Maximum #decision nodes in the pruned tree, is enabled
when Using user specified tree is selected.
2. An input function (g) that sums the weights and maps the results to an
output function (y).
Neurons are organized into layers: input, hidden and output. The input layer is
composed not of full neurons, but simply of the values in a record that are inputs
to the next layer of neurons. The next layer is the hidden layer of which there
could be several. The final layer is the output layer, where there is one node for
each class. A single sweep forward through the network results in the
assignment of a value to each output node. The record is assigned to the class
node with the highest value.
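A single forward sweep through such a layered network can be sketched as below. This assumes one hidden layer and a sigmoid transfer function; the weights and inputs are hypothetical, and XLMiner's own architecture options are described later in this section.

```python
import math

# Hedged sketch of one forward sweep through the layered network
# described above (one hidden layer; sigmoid transfer assumed).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_w, output_w):
    # Each hidden neuron sums its weighted inputs (the function g), then
    # applies the transfer function to produce its output (the function y).
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_w]
    # Each output node sums the weighted hidden-layer outputs.
    return [sum(w * h for w, h in zip(ws, hidden)) for ws in output_w]

inputs   = [0.5, -1.0]
hidden_w = [[0.1, 0.4], [-0.3, 0.2]]   # weights into 2 hidden neurons
output_w = [[0.7, -0.5]]               # weights into 1 output node
out = forward(inputs, hidden_w, output_w)
```

For classification there would be one output node per class, and the record would be assigned to the class whose node carries the highest value.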
Note that some networks never learn. This could be because the input data do
not contain the specific information from which the desired output is derived.
Networks also will not converge if there is not enough data to enable complete
learning. Ideally, there should be enough data available to create a validation set.
Feedforward, Back-Propagation
The feedforward, back-propagation architecture was developed in the early
1970s by several independent sources (Werbos; Parker; Rumelhart, Hinton, and
Williams). This independent co-development was the result of a proliferation of
articles and talks at various conferences, which stimulated the entire industry.
Currently, this synergistically developed back-propagation architecture is the
most popular, effective, and easy-to-learn model for complex, multi-layered
networks. Its greatest strength is in non-linear solutions to ill-defined problems.
The typical back-propagation network has an input layer, an output layer, and at
least one hidden layer. Theoretically, there is no limit on the number of hidden
layers but typically there are just one or two. Some studies have shown that the
total number of layers needed to solve problems of any complexity is 5 (one
input layer, three hidden layers and an output layer). Each layer is fully
connected to the succeeding layer.
As noted above, the training process normally uses some variant of the Delta
Rule, which starts with the calculated difference between the actual outputs and
the desired outputs. Using this error, connection weights are increased in
proportion to the error times a scaling factor for global accuracy. This
means that the inputs, the output, and the desired output all must be present at
the same processing element. The most complex part of this algorithm is
determining which input contributed the most to an incorrect output and how to
modify the input to correct the error. (An inactive node would not contribute to
the error and would have no need to change its weights.) To solve this problem,
training inputs are applied to the input layer of the network, and desired outputs
are compared at the output layer. During the learning process, a forward sweep
is made through the network, and the output of each element is computed layer
by layer. The difference between the output of the final layer and the desired
output is back-propagated to the previous layer(s), usually modified by the
derivative of the transfer function. The connection weights are normally
adjusted using the Delta Rule. This process proceeds for the previous layer(s)
until the input layer is reached.
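The Delta Rule adjustment just described can be sketched for a single output element. This assumes a sigmoid transfer function and hypothetical inputs and weights; it shows one weight update, not the full layer-by-layer back-propagation.

```python
import math

# Hedged sketch of a Delta Rule weight adjustment for one output element
# (sigmoid transfer assumed; inputs and weights are hypothetical).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs  = [0.5, 0.8]
weights = [0.2, -0.1]
desired = 1.0
step    = 0.1                   # step size for gradient descent

out   = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
error = desired - out           # difference between desired and actual output
# Modify each weight in proportion to the error, the derivative of the
# transfer function, and the input on that connection.
delta   = error * out * (1.0 - out)
weights = [w + step * delta * x for w, x in zip(weights, inputs)]
```

One such update moves the output toward the desired value; repeating it over many epochs is what drives the error down.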
Rule Three: The amount of training data available sets an upper bound for the
number of processing elements in the hidden layer(s). To calculate this upper
bound, use the number of cases in the training data set and divide that number
by the sum of the number of nodes in the input and output layers in the network.
Then divide that result again by a scaling factor between five and ten. Larger
scaling factors are used for relatively less noisy data. If too many artificial
neurons are used the training set will be memorized, not generalized, and the
network will be useless on new data sets.
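Rule Three is a simple division. The figures below are hypothetical (a 506-row training set with 12 inputs and 1 output, and the least-noisy scaling factor of five), shown only to make the arithmetic concrete.

```python
# Worked example of the Rule Three upper bound described above,
# using hypothetical figures.
n_cases   = 506      # rows in the training set (hypothetical)
n_inputs  = 12       # nodes in the input layer
n_outputs = 1        # nodes in the output layer
scale     = 5        # scaling factor between five and ten (less noisy data)

# Divide the case count by the sum of input and output nodes,
# then divide again by the scaling factor.
upper_bound = n_cases / (n_inputs + n_outputs) / scale
```

Here the rule suggests at most about 7 or 8 hidden-layer processing elements; more would risk memorizing the training set rather than generalizing.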
ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, LSTAT, MEDV
First, we partition the data into training and validation sets using the Standard
Data Partition defaults with percentages of 60% of the data randomly allocated
to the Training Set and 40% of the data randomly allocated to the Validation
Set. For more information on partitioning a dataset, see the Data Mining
Partitioning chapter.
Select a cell on the newly created Data_Partition1 worksheet, then click Predict
Neural Network on the XLMiner ribbon. The following dialog appears.
Select MEDV as the Output variable and the remaining variables as Input
Variables (except the CAT.MEDV variable). (The option, Classes in the Output
Variable, is disabled as this feature is not applicable for prediction algorithms.)
The Weight variable and the Success options are not used in this method and are
therefore not enabled.
This dialog contains the options to define the network architecture. Keep
Normalize input data unselected in this example. Normalizing the data
(subtracting the mean and dividing by the standard deviation) is important to
ensure that the distance measure accords equal weight to each variable;
without normalization, the variable with the largest scale would dominate the
measure.
Keep the # Hidden Layers at the default value of 1. Click the up arrow to
increase the number of hidden layers; XLMiner supports up to 4 hidden layers.
Enter 25 for the # Nodes for the Hidden layer.
Enter 500 for # Epochs. An epoch is one sweep through all records in the
training set.
Keep the default setting of 0.1 for Step size for gradient descent. This is the
multiplying factor for the error correction during backpropagation; it is roughly
equivalent to the learning rate for the neural network. A low value produces
slow but steady learning, a high value produces rapid but erratic learning.
Values for the step size typically range from 0.1 to 0.9.
Keep the default setting of 0.6 for Weight change momentum. In each new
round of error correction, some memory of the prior correction is retained so
that an outlier does not spoil accumulated learning.
Keep the default setting of 0.01 for Error tolerance. The error in a particular
iteration is backpropagated only if it is greater than the error tolerance. Typically
error tolerance is a small value in the range from 0 to 1.
Keep the default setting of 0 for Weight decay. To prevent over-fitting of the
network on the training data, set a weight decay to penalize the weight in each
iteration. Each calculated weight will be multiplied by 1-decay.
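The three options above (step size, momentum, and weight decay) combine into a single weight-update expression. The gradient and prior correction below are hypothetical; this is a sketch of the update form, not XLMiner's exact internals.

```python
# Hedged sketch combining the options above: a weight update with step
# size, momentum on the prior correction, and weight decay
# (hypothetical gradient and prior-correction values).
step, momentum, decay = 0.1, 0.6, 0.01

weight      = 0.50
gradient    = -0.2     # hypothetical error gradient for this weight
prev_change = 0.05     # memory of the prior round's correction

change = -step * gradient + momentum * prev_change
weight = (weight + change) * (1.0 - decay)   # weight multiplied by 1-decay
```

The momentum term keeps a single outlier record from spoiling accumulated learning, while the decay term shrinks every weight slightly each iteration to discourage over-fitting.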
Select Detailed report, Summary report, and Lift charts under Score training
data and Score validation data to show an assessment of the performance of the
network in predicting the output variable. The output is displayed according to
the user's specifications: Detailed, Summary, and/or Lift charts.
If a test dataset exists, the options under Score test data will be enabled. Select
Detailed report, Summary report, and Lift charts under Score test data to show
an assessment of the performance on the test dataset in predicting the output
variable. The output is displayed according to the user's specifications:
Detailed, Summary, and/or Lift charts.
See the Scoring chapter for information on options under Score new data.
Click Finish to initiate the output. Worksheets containing the output of the
model will be inserted at the end of the workbook. The Output Navigator
appears on the NNP_Output1 worksheet. Click any link to easily view the
results.
The Data, Variables, and Parameters Options sections reflect the user inputs.
Click the Valid. Score Detailed Rep. link on the Output Navigator to
navigate to the Prediction of Validation Data table on the NNP_ValidScore1
worksheet. This table displays the Actual and Predicted values for the
validation dataset.
XLMiner also provides intermediate information produced during the last pass
through the network. Scroll down on the Output1 worksheet to the Interlayer
connections' weights table.
Recall that a key element in a neural network is the weights for the connections
between nodes. In this example, we chose to have one hidden layer, and we also
chose to have 25 nodes in that layer. XLMiner's output contains a section that
contains the final values for the weights between the input layer and the hidden
layer, between hidden layers, and between the last hidden layer and the output
layer. This information is useful for viewing the inner workings of the neural
network; however, it is unlikely to be of use to the data analyst end-user. Displayed
above are the final connection weights between the input layer and the hidden
layer for our example.
Click the Training Epoch Log link on the Output Navigator to display the
following log.
During an epoch, each training record is fed forward in the network and
classified. The error is calculated and is back propagated for the weights
correction. As a result, weights are continuously adjusted during the epoch. The
classification error is computed as the records pass through the network. It does
not report the classification error after the final weight adjustment. Scoring of
the training data is performed using the final weights so the training
classification error may not exactly match with the last epoch error in the Epoch
log.
See the Scoring chapter for information on the Stored Model Sheet,
NNP_Stored_1.
Input variables
Variables listed here will be utilized in the XLMiner output.
Weight Variable
The Weight variable is not used in this method.
Output Variable
Select the variable whose outcome is to be predicted here.
# Hidden Layers
Click the up and down arrows until the desired number of hidden layers appears.
The default setting is 1. XLMiner supports up to 4 hidden layers.
# Nodes
Enter the number of nodes per layer here. The first field is for the first hidden
layer, the second field is for the second hidden layer, etc.
# Epochs
An epoch is one sweep through all records in the training set. The default
setting is 30.
Error tolerance
The error in a particular iteration is backpropagated only if it is greater than the
error tolerance. Typically error tolerance is a small value in the range from 0 to
1. The default setting is 0.01.
Weight decay
To prevent over-fitting of the network on the training data, set a weight decay to
penalize the weight in each iteration. Each calculated weight will be multiplied
by (1-decay). The default setting is 0.
Association Rules
Introduction
The goal of association rule mining is to recognize associations and/or
correlations among large sets of data items. A typical and widely-used example
of association rule mining is the Market Basket Analysis. Most market basket
databases consist of a large number of transaction records where each record
lists all items purchased by a customer during a trip through the check-out line.
Data is easily and accurately collected through the bar-code scanners.
Supermarket managers are interested in determining what foods customers
purchase together, like, for instance, bread and milk, bacon and eggs, wine and
cheese, etc. This information is useful in planning store layouts (placing items
optimally with respect to each other), cross-selling promotions, coupon offers,
etc.
Association rules provide results in the form of "if-then" statements. These rules
are computed from the data and, unlike the if-then rules of logic, are
probabilistic in nature. The if portion of the statement is referred to as the
antecedent and the then portion of the statement is referred to as the
consequent.
In addition to the antecedent (the "if" part) and the consequent (the "then" part),
an association rule contains two numbers that express the degree of uncertainty
about the rule. In association analysis, the antecedent and consequent are sets of
items (called itemsets) that are disjoint, meaning they do not have any items in
common. The first number is called the support which is simply the number of
transactions that include all items in the antecedent and consequent. (The
support is sometimes expressed as a percentage of the total number of records in
the database.) The second number is known as the confidence which is the ratio
of the number of transactions that include all items in the consequent as well as
the antecedent (namely, the support) to the number of transactions that include
all items in the antecedent. For example, assume a supermarket database has
100,000 point-of-sale transactions, out of which 2,000 include both items A and
B and 800 of these include item C. The association rule "If A and B are
purchased then C is purchased on the same trip" has a support of 800
transactions (alternatively 0.8% = 800/100,000) and a confidence of 40%
(= 800/2,000). In other words, support is the probability that a randomly selected
transaction from the database will contain all items in the antecedent and the
consequent. Confidence is the conditional probability that a randomly selected
transaction will include all the items in the consequent given that the transaction
includes all the items in the antecedent.
Lift is one more parameter of interest in association analysis. Lift is the ratio
of Confidence to Expected Confidence. Expected Confidence is the confidence
that would be expected if buying A and B did not enhance the probability of
buying C: that is, the number of transactions that include the consequent divided
by the total number of transactions. Suppose the total number of transactions for
C is 5,000. Expected Confidence is then 5% (5,000/100,000), and Lift, the ratio
of Confidence to Expected Confidence, is 8 (40%/5%). Hence, Lift is a value
that provides information about the increase in probability of the "then"
(consequent) given the "if" (antecedent).
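The supermarket example above can be computed directly from its counts: 100,000 transactions, 2,000 containing both A and B, 800 of those also containing C, and 5,000 containing C.

```python
# The supermarket example above, computed directly from its counts.
total  = 100_000     # point-of-sale transactions
n_ab   = 2_000       # transactions containing both A and B (antecedent)
n_abc  = 800         # of those, transactions also containing C
n_c    = 5_000       # transactions containing C (consequent)

support    = n_abc / total           # 0.8% of all transactions
confidence = n_abc / n_ab            # 40%
expected   = n_c / total             # Expected Confidence: 5%
lift       = confidence / expected   # 40% / 5% = 8
```

A lift of 8 means the consequent is eight times as likely given the antecedent as it is across all transactions.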
A lift ratio larger than 1.0 implies that the relationship between the antecedent
and the consequent is more significant than would be expected if the two sets
were independent. The larger the lift ratio, the more significant the association.
Select a cell in the dataset, say, A2, then click Associate Association Rules to
open the Association Rule dialog, shown below.
Since the data contained in the Associations.xlsx dataset are all 0s and 1s,
select Data in binary matrix format for the Input data format. This option
should be selected if each column in the data represents a distinct item.
XLMiner will treat the data as a matrix of two entities -- zeros and non-zeros.
All non-zeros are treated as 1's. A 0 signifies that the item is absent in that
transaction and a 1 signifies the item is present. Select Data in item list format
when each row of data consists of item codes or names that are present in that
transaction. Enter 100 for the Minimum Support (# transactions). This option
specifies the minimum number of transactions in which a particular item-set
must appear to qualify for inclusion in an association rule. Enter 90 for
Minimum confidence. This option specifies the minimum confidence threshold
for rule generation. If A is the set of antecedents and C the set of consequents,
then only those rules A => C ("antecedent implies consequent") for which the
ratio (support of A U C) / (support of A) is at least this percentage will
qualify.
Rule 2 indicates that if an Italian cookbook and a Youth book are purchased,
then with 100% confidence a second cookbook will also be purchased. Support
(a) indicates that the rule has a support of 118 transactions, meaning that 118
people bought both an Italian cookbook and a Youth book. Support (c) indicates the
number of transactions involving the purchase of cookbooks. Support (a U c)
indicates the number of transactions where an Italian cookbook and a Youth
book as well as a second cookbook were purchased.
The Lift Ratio indicates how much more likely a transaction will be found
where an Italian cookbook and a Youth book are purchased, as compared to the
entire population of transactions. In other words, the Lift Ratio is the
confidence divided by support (c), where the latter is expressed as a percentage.
For Rule 2, with a confidence of 100%, support (c) is calculated as 862/2000 * 100
= 43.1. Consequently, the Lift Ratio is calculated as 100/43.1, or 2.320186.
Given a confidence of 100% and a lift ratio of 2.320186, this rule can be
considered useful.
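The Lift Ratio arithmetic can be reproduced directly (a sketch; the function name is ours, not XLMiner's):

```python
def lift_ratio(confidence_pct, consequent_support_count, n_transactions):
    """Lift = rule confidence divided by consequent support, as a percentage."""
    support_c_pct = consequent_support_count / n_transactions * 100
    return confidence_pct / support_c_pct

# Rule 2 from the text: confidence 100%, support(c) = 862 of 2000 transactions.
print(round(lift_ratio(100, 862, 2000), 6))  # 2.320186
```

A lift ratio above 1 means the rule finds the consequent more often than chance alone would.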
Scoring to a Database
Refer to the Discriminant Analysis example in the previous chapter,
Discriminant Analysis Classification Method, for instructions on advancing to
the Discriminant Analysis Step 3 of 3 dialog. This feature can be used with any
of the Classification or Prediction algorithms and can be found on the last dialog
for each method. In the Discriminant Analysis method, this feature is found on
the Step 3 of 3 dialog.
In the Score new data in group, select Database. The Scoring to Database
dialog opens.
The first step on this dialog is to select the Data source. Once the Data source
is selected, Connect to a database will be enabled.
If your database is a SQL Server database, select SQL Server for Data source,
then click Connect to a database; the following dialog will appear. Enter the
appropriate details, then click OK to connect to the database.
If your data source is an Oracle database, select Oracle as Data source, then
click Connect to a database; the following dialog will appear.
Enter the appropriate details and click OK to connect to the database.
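As an aside, the details these dialogs collect (driver, server, database, credentials) correspond to a standard ODBC connection string. The helper below is a hypothetical sketch of how such a string is assembled; XLMiner itself connects through its dialogs, and the driver, server, and database names here are placeholders:

```python
def build_conn_str(driver, server, database, uid=None, pwd=None):
    """Assemble an ODBC-style connection string from dialog-style inputs."""
    parts = [f"DRIVER={{{driver}}}", f"SERVER={server}", f"DATABASE={database}"]
    if uid:
        # Explicit credentials, as entered in the connection dialog.
        parts += [f"UID={uid}", f"PWD={pwd}"]
    else:
        # Fall back to Windows integrated authentication.
        parts.append("Trusted_Connection=yes")
    return ";".join(parts)

print(build_conn_str("SQL Server", "localhost", "scoring"))
# DRIVER={SQL Server};SERVER=localhost;DATABASE=scoring;Trusted_Connection=yes
```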
This example illustrates how to score to an MS-Access database. Select
MS-Access for the Data source, then click Connect to a database. The following
dialog appears.
Click OK to close the MS-Access database file dialog. The Scoring to Database
dialog re-appears. Select Boston_Housing for Select table/view. The dialog
will be populated with variables from the database, dataset.mdb, under Fields in
table and with variables from the Boston_Housing.xlsx workbook under
Variables in input data.
If Match the first 11 variables in the same sequence is clicked, the first 11
variables in Boston_Housing.xlsx will be matched with the first 11 variables in
the dataset.mdb database.
The first 11 variables in both the database and the dataset are now matched
under Variables in input data. The additional database fields remain under
Fields in table.
Note: The "11" in the Match the first 11 variables in the same sequence
command button title will change with the number of input variables.
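Positional matching of the kind this button performs can be sketched as follows; the field and variable names are illustrative:

```python
def match_in_sequence(fields_in_table, variables_in_input):
    """Pair database fields with input variables positionally, as the
    'Match the first N variables in the same sequence' button does."""
    n = min(len(fields_in_table), len(variables_in_input))
    matched = list(zip(variables_in_input[:n], fields_in_table[:n]))
    unmatched_fields = fields_in_table[n:]  # leftovers stay under Fields in table
    return matched, unmatched_fields

pairs, leftover = match_in_sequence(["CRIM", "ZN", "nfld"], ["CRIM", "ZN"])
print(pairs)     # [('CRIM', 'CRIM'), ('ZN', 'ZN')]
print(leftover)  # ['nfld']
```

Note that the pairing is purely positional: it is the user's responsibility to ensure both lists are in the same order.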
To manually map variables from the dataset to the database, select a variable
from the database in the Fields in table listbox, then select the variable to be
matched in the dataset in the Variables in input data listbox, then click Match.
For example, to match the CRIM variable in the database to the CRIM variable
in the dataset, select CRIM from the dataset.mdb database in the Fields in table
listbox, select CRIM from the Boston_Housing.xlsx dataset in the Variables in
input data listbox, then click Match CRIM <--> CRIM to match the two
variables.
Notice that CRIM has been removed from the Fields in table listbox and is now
listed next to CRIM in the Variables in input data listbox. Continue with these
steps to match the remaining 10 variables in the Boston_Housing.xlsx dataset.
The Output Field can be selected from the remaining database fields listed under
Fields in table or a new Output Field can be added. Note: An output field
must be a string. To select an output field from the remaining database fields,
select the field to be added as the output field, in this case, nfld, then click > to
the right of the Select output field radio button.
In this example, the field nfld is the only remaining database field that is a
string, so it is the only choice for the output field. To choose a different output
field, click the < command button to return the nfld field to the Fields in table
listbox.
To add a new field for the output, select the Add new field for output radio button,
then type a name for this new field, such as Output_Field. XLMiner will
create the new field in the dataset.mdb database.
After all the desired variables in the input data have been matched, OK will be
enabled; click it to return to the original Step 3 of 3 dialog. Notice that Database
is selected.
Click Next to advance to the Step 2 of 3 dialog. Then click Next on the Step 2
of 3 dialog to accept the defaults.
On the Step 3 of 3 dialog, select Detailed report in the Score new data in group.
In the dialog above, the variables listed under Variables in new data are from
Digits.xlsx and the variables listed under Variables in input data are from
Flying_Fitness.xlsx. Again, variables can be matched in three different ways:
by name, by sequence, or manually.
Notice y has been removed from the Variables in new data listbox and added to
the Variables in input data listbox.
To unmatch all matched variables, click Unmatch all. To unmatch only one set
of matched variables, select the matched variables in the Variables in input data
listbox, then select Unmatch.
Click OK to return to the Step 3 of 3 dialog. Notice Detailed report is now
selected in the Score new data in group and Canonical Scores has been enabled
within that same group.
Click Finish to review the output. Click the DA_NewScore1 worksheet to view
the output as shown below. All variables in the input data have been matched
with the variables in the new data. Instead of Var2, y is listed, instead of Var3,
x1 is listed, instead of Var4, x2 is listed, etc.
The material saved to the Stored Model Sheet varies with the classification or
prediction method used. Please see the table below for details.
Classification/Prediction Method
Naïve Bayes
k-Nearest Neighbors
Neural Networks
Discriminant Analysis
For example, assume the Multiple Linear Regression prediction method has just
finished. The Stored Model Sheet (MLR_Stored_1) will contain the regression
equation. When the Score Test Data utility is invoked, XLMiner will apply this
equation from the Stored Model Sheet to the test data.
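In pseudocode terms, applying a stored regression equation to a row of test data is just an intercept plus a coefficient-weighted sum of the matched inputs. The sketch below uses made-up coefficients, not XLMiner's actual storage format:

```python
def score_row(row, intercept, coefficients):
    """Prediction = intercept + sum of coefficient * matched input value."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)

# Illustrative stored model: two inputs with invented coefficients.
stored = {"intercept": 1.0, "coefficients": {"CRIM": -0.5, "ZN": 0.25}}
prediction = score_row({"CRIM": 2.0, "ZN": 4.0},
                       stored["intercept"], stored["coefficients"])
print(prediction)  # 1.0
```

This is why the new dataset must contain (at least) the original input variables: every coefficient stored with the model needs a matching value in the row being scored.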
Along with values required to generate the output, the Stored Model Sheet also
contains information associated with the input variables that were present in the
training data. The dataset on which the scoring will be performed should
contain at least these original Input variables. XLMiner offers a matching
utility that will match the Input variables in the training set to the variables in
the new dataset so the variable names are not required to be identical in both
data sets (training and test).
Click Next. XLMiner will open the Match variables Step 2 dialog which is
where the matching of the Input variables to the New Data variables will take
place.
XLMiner displays the list of variables on the Stored Model Sheet under
Variables in stored model and the variables in the new data under Variables in
new data.
Variables may be matched using three easy techniques: by name, by sequence
or manually.
If Match variable(s) with same name(s) is clicked, all similarly named
variables in the stored model sheet will be matched with similarly named variables
in the new dataset. However, since none of the variables in either list are named
similarly, no variables are matched.
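Name-based matching amounts to intersecting the two name lists; a minimal sketch with illustrative names:

```python
def match_by_name(stored_vars, new_vars):
    """Pair variables whose names appear in both lists, preserving stored order."""
    new_set = set(new_vars)
    return [(v, v) for v in stored_vars if v in new_set]

# The stored model uses *_Scr names, so nothing matches the raw names:
print(match_by_name(["CRIM_Scr", "ZN_Scr"], ["CRIM", "ZN"]))  # []
print(match_by_name(["CRIM", "ZN"], ["ZN", "TAX"]))           # [('ZN', 'ZN')]
```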
If Match variables in stored model in same sequence is clicked, the variables
in the stored model will be matched with the variables in the new data in the
order in which they appear in the two listboxes. For example, the variable CRIM
from the new dataset will be matched with the variable CRIM_Scr from the
stored model sheet, the variable ZN from the new data will be matched with the
variable ZN_Scr from the stored model sheet, and so on.
Since the stored model sheet only contains 13 variables while the new data
contains 15 variables, two variables will remain in the Variables in new data
listbox. Note: It is essential that the variables in the new data set appear in the
same sequence as the variables in the stored model when using this matching
technique.
To manually map variables from the stored model sheet to the new data set,
select a variable from the new data set in the Variables in new data listbox, then
select the variable to be matched in the stored model sheet in the Variables in
stored model listbox, then click Match. For example, to match the CRIM
variable in the new dataset to the CRIM_Scr variable in the stored model sheet,
select CRIM from the Variables in new data listbox, select CRIM_Scr from
the stored model sheet in the Variables in stored model listbox, then click
Match CRIM <--> CRIM to match the two variables.
Notice that CRIM has been removed from the Variables in new data listbox and
is now listed next to CRIM_Scr in the Variables in stored model listbox.
Continue with these steps to match the remaining 12 variables in the stored
model sheet.
Now let's apply these same steps to a stored model sheet created by the
Classification Tree classification method.
Click Score on the XLMiner ribbon. Confirm that Boston_Housing.xlsx is
selected for Workbook and Data is selected for Worksheet under Data to be
scored. Then, under Stored Model, select Scoring.xlsm for Workbook and
CT_Stored_1 for Worksheet.
Click Next to advance to the Step 2 dialog. Click Match variables in stored
model in same sequence.
Click the down arrow to select the desired scoring option, then click OK. (Since
this stored model sheet was created when only the Full Tree Rules option was
selected during the Classification Tree method, this is the only option.) The
worksheet CT_Score1 will be added at the end of the workbook. A portion of
the output is shown below.
Data to be Scored
Workbook: Select the open workbook containing the data to be scored here.
Worksheet: Select the worksheet, from the Workbook selection, containing the
data to be scored here.
Data Range: The dataset range will be prefilled here. If not prefilled, enter the
dataset range here.
First Row Contains Headers: This option is selected by default and indicates to
XLMiner to list variables in the Step 2 dialog by their column headings.
Stored Model
Workbook: Select the open workbook containing the Stored Model Sheet here.
Worksheet: Select the Stored Model worksheet, from the Workbook selection,
here.
XLMiner displays the list of variables on the Stored Model Sheet under
Variables in stored model and the variables in the new data under Variables in
new data.
Variables may be matched using three easy techniques: by name, by sequence,
or manually.
Match by Name
If Match variable(s) with same name(s) is clicked, all similarly named
variables in the stored model sheet will be matched with similarly named variables
in the new dataset. However, if none of the variables in either list are named
similarly, no variables will be matched.
Match by Sequence
If Match variables in stored model in same sequence is clicked, the variables
in the stored model will be matched with the variables in the new data in the
order in which they appear in the two listboxes. In the dialog above, the variable
CRIM from the new dataset will be matched with the variable CRIM_Scr from the
stored model sheet, the variable ZN from the new data will be matched with the
variable ZN_Scr from the stored model sheet, and so on.
Manual Match
To manually map variables from the stored model sheet to the new data set,
select a variable from the new data set in the Variables in new data listbox, then
select the variable to be matched in the stored model sheet in the Variables in
stored model listbox, then click Match. To match the CRIM variable in the new
dataset to the CRIM_Scr variable in the stored model sheet in the dialog above,
select CRIM from the Variables in new data listbox, select CRIM_Scr from
the stored model sheet in the Variables in stored model listbox, then click
Match CRIM <--> CRIM.