
Predictive Analytics PLG

Predictive Analytics
Use Case
Predictive analytics encompasses a variety of statistical techniques
and data mining solutions that make it possible to build predictive
models and to interact visually with the data in order to discover
hidden insights and relationships, thereby providing the basis for
making predictions about future events.
For example, predictive models help to answer the following questions:
Forecasting: How do historical sales, costs, key performance
metrics, and so on, translate to future performance? How do
predicted results compare with goals?
Key Influencers: What are the main influencers of customer
satisfaction, customer churn, employee turnover, and so on, that
impact success?
Trends: What are the historical and emerging trends? Are there sudden
step changes or unusual numeric values that impact the business?
Relationships: What are the correlations in the data? What are
the cross-sell and up-sell opportunities?
Anomalies: What anomalies might exist and, conversely, what
groupings or clusters might exist for specific analysis?

Solution Architecture
Predictive Analytics Process
Many statistical or data mining algorithms consist of two steps: model training and model execution. During model training, statistically
representative data is analyzed to discover hidden insights and relationships. In many cases the training data has to contain the outcome to be
predicted as a known fact (supervised learning). During model execution, the trained model is applied to new data to calculate the predicted outcome.
The key to making predictive functionality consumable by end users who have no statistical expertise is to establish a highly integrated end-to-end
business process for predictive analytics. In general, this process consists of three steps:
1. Implementation of a predictive functionality.
2. Model fitting and training.
3. Model execution and consumption of predictive results.
Step 1 is executed once. Step 2 is executed once for each business context that requires its own model. Step 3 is executed in real time whenever
the predictive functionality is accessed.
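The separation between model training and model execution can be illustrated with a minimal Python sketch. It is not taken from this guideline and does not use HANA or PAL; the function names, the threshold "model", and the sample data are assumptions chosen only to show that training produces a reusable artifact which execution later applies to new data.

```python
from statistics import mean


def train_model(samples):
    """Model training: derive a decision threshold from labeled data.

    samples is a list of (feature_value, outcome) pairs with outcome 0 or 1.
    The hypothetical 'model' is the midpoint between the two class means.
    """
    positives = [x for x, y in samples if y == 1]
    negatives = [x for x, y in samples if y == 0]
    if not positives or not negatives:
        raise ValueError("training data must contain both outcomes")
    pos_mean, neg_mean = mean(positives), mean(negatives)
    return {
        "threshold": (pos_mean + neg_mean) / 2,
        "positive_above": pos_mean > neg_mean,
    }


def execute_model(model, feature_value):
    """Model execution: apply the trained model to one new observation."""
    above = feature_value > model["threshold"]
    return 1 if above == model["positive_above"] else 0


# Training happens once per business context (illustrative data only).
training_data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
model = train_model(training_data)

# Execution happens on every access and only needs the stored model.
predictions = [execute_model(model, x) for x in (2.5, 7.5)]
```

Because execution depends only on the small `model` dictionary, the expensive training step does not have to be repeated for each prediction request.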

The Advanced Analyst provides the implementation(s) of a predictive functionality to enable the work of the Business Analyst. The
implementation is therefore usually not specific to a concrete data set. It consists of a predictive data model and the predictive process, i.e. the statistical
algorithms for model training and model execution. Providing the implementation requires analyzing the available business data in order to decide on
concrete algorithms, identify explanatory and target variables, and define the data transformations needed for data preparation, such as data
enrichment, cleansing, categorization, and aggregation.
The Business Analyst provides a trained model for a predictive functionality with respect to a certain business context. For instance, this could mean
providing a decision tree that predicts the buying probability of product X for customers in the European market. The available data set of, say,
German and Spanish customers is considered a representative training sample. To maximize predictive quality, the control parameters of the
algorithms and the selection of variables from the predictive data model have to be adjusted. For example, the maximum depth or width of the
tree can be restricted in order to balance over- and under-fitting. The predictive quality of each adjustment is assessed by validating the model,
which typically means training the model on a random subset of the training set, executing the trained model on the complementary set, and
measuring the error rate. The trained model obtained from the adjustment with the best quality is deployed for use by the end users.
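The validation procedure described above (train on a random subset, execute on the complementary set, measure the error rate, keep the best adjustment) can be sketched in Python as follows. This is an illustrative assumption, not the guideline's implementation: a simple k-nearest-neighbour classifier stands in for the decision tree, its parameter k plays the role of the control parameter, and the data is made up.

```python
import random


def knn_predict(train, x, k):
    """Majority vote (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def holdout_error(samples, k, train_fraction=0.7, seed=0):
    """Train on a random subset, execute on the complement, return the error rate."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split for every k
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    errors = sum(knn_predict(train, x, k) != label for x, label in test)
    return errors / len(test)


def select_best_k(samples, candidates):
    """Return the candidate control parameter with the lowest holdout error."""
    return min(candidates, key=lambda k: holdout_error(samples, k))


# Hypothetical data: label 1 for larger feature values, with two noisy samples.
samples = [(float(i), 1 if i >= 10 else 0) for i in range(20)]
samples[8] = (8.0, 1)
samples[12] = (12.0, 0)

candidates = (1, 3, 5)
best_k = select_best_k(samples, candidates)
best_error = holdout_error(samples, best_k)
```

Using the same seed for every candidate keeps the comparison fair, because each setting is evaluated on the same split. A single holdout split gives a noisy estimate; repeating the split with different seeds and averaging would be a natural refinement.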
The end user consumes the predictive functionality. For example, calculating the buying probability of customers with respect to a given product
and market definition (business context) allows the end user to select customer target groups for efficient marketing campaigns. Ideally, the
end user is supported with explanatory information on the result of model execution, for example a visualization of the decision tree model with
the decision path of the selected customers highlighted.

Architecture Overview
An application with a predictive analytics feature accesses, via OData, a HANA predictive application model that includes the predictive
functionalities. For example, this could be a KPI that calculates the product buying probability for customers.
One predictive functionality can be realized by different statistical and data mining algorithms, resulting in multiple predictive process
implementations. The HANA view of the predictive functionality controls which predictive process is used by checking an input parameter
that identifies the model to be executed. Each predictive process implementation contains a procedure for model execution. The statistical and data
mining algorithms used are PAL functions. The data being processed is provided by the predictive data model, which is based on the application
data models.
Model training could be done on the fly, but for performance reasons it is recommended to store the trained model in the database. The same
input parameter of the predictive functionality HANA view also controls which of the trained models of a predictive process is used for model
execution. The data processed by model training is provided by the same predictive data model that is used for model execution. The
trained models and their metadata are managed by a BO with OData and a UI on top.
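The dispatch described above, where one input parameter selects both the predictive process and the stored trained model, can be sketched in Python. This is only an analogy: the dictionaries stand in for the stored models and process implementations, and all identifiers, parameters, and values are hypothetical rather than HANA view or PAL names.

```python
# Hypothetical stand-in for trained models stored in the database.
TRAINED_MODELS = {
    "EU_PRODUCT_X_THRESHOLD": {
        "process": "threshold",
        "params": {"threshold": 40.0},
    },
    "EU_PRODUCT_X_LINEAR": {
        "process": "linear",
        "params": {"slope": 0.01, "intercept": 0.1},
    },
}


def execute_threshold(params, value):
    """Execution procedure of a hypothetical threshold-based process."""
    return 1.0 if value > params["threshold"] else 0.0


def execute_linear(params, value):
    """Execution procedure of a hypothetical linear process, clipped to [0, 1]."""
    raw = params["intercept"] + params["slope"] * value
    return min(1.0, max(0.0, raw))


# One execution procedure per predictive process implementation.
PROCESSES = {"threshold": execute_threshold, "linear": execute_linear}


def buying_probability(model_id, customer_value):
    """Single entry point: the model_id input parameter selects model and process."""
    entry = TRAINED_MODELS.get(model_id)
    if entry is None:
        raise KeyError(f"no trained model stored for {model_id!r}")
    return PROCESSES[entry["process"]](entry["params"], customer_value)


tree_like_result = buying_probability("EU_PRODUCT_X_THRESHOLD", 50.0)
linear_result = buying_probability("EU_PRODUCT_X_LINEAR", 50.0)
```

The caller sees one functionality with one extra parameter, while the choice of algorithm and trained model stays a matter of stored configuration.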

Rules
General Rules

[HPANW-PRED-1]
Predictive functionality shall be enabled to work
out-of-the-box, so that it can be run without the advanced
analyst's work.
Extensibility shall enable the advanced analyst's work.

Predictive Modeling

[HPANW-PRED-2]
Statistical and data mining algorithms shall be processed in
HANA.
Background: In predictive analytics, high data volumes are
often processed by complex calculations. The corresponding
statistical and data mining algorithms are therefore usually
very performance intensive.
Predictive Analysis Library (PAL) shall be used.
Background: With PAL, HANA provides an SAP application
function library that offers statistical and data mining
algorithms. The implementations are written in C++, are executed in
the database kernel, and can fully leverage the capabilities of
the HANA architecture. PAL is therefore best suited for high
performance.
For detailed information, see the PAL wiki:
https://wiki.wdf.sap.corp/wiki/display/TIPDNA/Predictive+Analysis+Library.
If PAL does not provide the required statistical or data mining
method, custom algorithms shall be implemented using
SQLScript or L. Each usage of L requires approval by the HANA
DB development team.
For detailed information, see the L wiki:
https://wiki.wdf.sap.corp/wiki/display/LLVM/LLVM_NewDB_L.
The PAL shall be accessed by repository objects created with
the Application Function Modeler (AFM).
Statistical libraries that are not natively implemented in HANA,
like the International Mathematics and Statistics Library (IMSL), shall be
avoided. R usage shall be requested as an exemption from the
PLG.
Background: From a lifecycle perspective, those libraries usually
increase TCO. Even if the data exchange between HANA and
those libraries is optimized, there is still a performance
drawback. Customers are, of course, free to use those
statistical libraries.
R-based procedures shall only be used as an optional part of
an SAP product.
Background: R provides a huge variety of
algorithms (more than 4000) and is very widespread.
However, R is open source, so any violation of its license
terms has to be strictly avoided.
For detailed information, see the R wiki:
https://wiki.wdf.sap.corp/wiki/display/ngdb/R-Project.

[HPANW-PRED-3]
Predictive functionality shall be implemented as a HANA view.
A performance issue has been raised as a central suite requirement for
TIP (ID: 367): When executing a JOIN with a
calculation view on the right side that internally calls a
procedure, it is not possible to restrict the data processed by
that procedure through the result set of the left side of the join.
Model training shall be implemented as a procedure, and the
result should be stored as a trained model for reuse by model
execution.
Background: See [HPANW-PRED-5].
The predictive data model shall be implemented as a HANA
view based on the application data models. It provides the
input data for the statistical and data mining algorithms.
Background: The HANA view interface best enables reuse
by external consumers.
The predictive data model consumed for model execution must
have exactly the same structure and semantics as the one used for
model training.
A predictive use case may involve different business contexts,
so that typically several different models have to be trained
and stored for the same predictive use case. Model metadata
has to make it possible to determine which model is to be
executed.
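The requirement that the data model has the same structure for training and for execution can be checked mechanically. The following Python sketch is an assumption for illustration: it treats a structure as a list of (column name, type name) pairs with made-up column names and reports the differences. It can only compare structure; identical semantics still have to be ensured by design.

```python
def structure_differences(training_schema, execution_schema):
    """Compare two schemas given as lists of (column_name, type_name) pairs.

    Returns a list of human-readable differences; an empty list means the
    column names and types match.
    """
    trained = dict(training_schema)
    executed = dict(execution_schema)
    differences = []
    for name, type_name in trained.items():
        if name not in executed:
            differences.append(f"missing column: {name}")
        elif executed[name] != type_name:
            differences.append(
                f"type mismatch for {name}: {type_name} vs {executed[name]}"
            )
    for name in executed:
        if name not in trained:
            differences.append(f"unexpected column: {name}")
    return differences


# Hypothetical structure recorded when the model was trained.
training_schema = [("CUSTOMER_ID", "NVARCHAR"), ("REVENUE", "DECIMAL"), ("REGION", "NVARCHAR")]
# Hypothetical structure offered at execution time.
execution_schema = [("CUSTOMER_ID", "NVARCHAR"), ("REVENUE", "DOUBLE"), ("SEGMENT", "NVARCHAR")]

problems = structure_differences(training_schema, execution_schema)
```

Running such a check before model execution turns a silent quality problem into an explicit error.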

Model Management
[HPANW-PRED-4]
Model metadata should include the algorithm control parameter
configuration, the selected attributes of the predictive data model,
the predictive data model parameter configuration, and the selection
criteria of the training data.
Background: This enables model re-training on current data.
Model validation shall be supported to estimate the predictive
quality of a trained model.
The preferred data format for importing, exporting, or storing a trained
model should be Predictive Model Markup Language (PMML).
Background: This enables exchange with third-party tools.
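The metadata items listed in this rule can be pictured as one record per trained model. The Python sketch below is a hypothetical illustration: the class, the field names, the example values, and the JSON serialization are assumptions, and the trained model itself (preferably PMML) would be stored separately.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ModelMetadata:
    """Hypothetical metadata record that allows re-training a model."""

    model_id: str
    algorithm_parameters: dict    # algorithm control parameter configuration
    selected_attributes: list     # attributes chosen from the predictive data model
    data_model_parameters: dict   # predictive data model parameter configuration
    training_selection: dict      # selection criteria of the training data


def to_json(metadata):
    """Serialize the metadata so it can be stored next to the trained model."""
    return json.dumps(asdict(metadata), sort_keys=True)


def from_json(text):
    """Restore a metadata record, for example to re-train on current data."""
    return ModelMetadata(**json.loads(text))


metadata = ModelMetadata(
    model_id="EU_PRODUCT_X",
    algorithm_parameters={"max_depth": 4},
    selected_attributes=["AGE", "REVENUE", "REGION"],
    data_model_parameters={"currency": "EUR"},
    training_selection={"countries": ["DE", "ES"], "from_year": 2012},
)
restored = from_json(to_json(metadata))
```

With all four groups of settings recorded, re-training amounts to running the same training procedure again with the stored configuration against current data.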

Performance

[HPANW-PRED-5]

To support real-time prediction, model execution is the most
critical part. Ideally a native implementation shall be used.
Background: In many cases, model execution is similar to
evaluating a formula, so even if no suitable
implementation is available in PAL, a native implementation on
HANA (with SQLScript or L) will require quite low TCD.
Model training is typically very expensive, but does not require
real-time results for many use cases.
Background: The training result is obtained from complex
calculations on a preferably large set of rich historical data,
and is usually not much influenced by the most recent data.
To optimize predictive quality while keeping prediction
run-time performance high, an important option is to use an
advanced state-of-the-art algorithm that is only available as a
non-native implementation (e.g. some R library) for model training,
and to develop the corresponding algorithm for model execution as a
native implementation in HANA.
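The remark that model execution often amounts to evaluating a formula can be made concrete with logistic regression, chosen here only as an example. In the Python sketch below, the coefficients are hypothetical values assumed to come from training in some external tool; execution then needs nothing but the formula p = 1 / (1 + exp(-z)) with z = intercept + sum of coefficient times feature value, which is the kind of logic that could also be written natively.

```python
import math


def score(coefficients, intercept, features):
    """Evaluate a logistic regression model for one observation.

    coefficients and features are dicts keyed by attribute name; the
    result is the predicted probability of the positive outcome.
    """
    z = intercept + sum(
        weight * features[name] for name, weight in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, assumed to be exported from external training.
coefficients = {"visits": 0.5, "returns": -1.0}
intercept = -1.0

neutral = score(coefficients, intercept, {"visits": 4, "returns": 1})
frequent_visitor = score(coefficients, intercept, {"visits": 8, "returns": 1})
```

Training such a model may need an iterative optimizer and a large data set, whereas scoring is a handful of arithmetic operations per record, which is why the two steps can reasonably live in different technologies.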

Further Information
Suite Guideline: https://wiki.wdf.sap.corp/wiki/display/NAnalytics/Predictive+Analytics.