Outline
Overview & Key Findings
Focus on CRM
Big Data
The Ascendance of R
Challenges in the Use of Analytics
Engagement & Job Satisfaction
Analytic Software
Other Findings
Appendix: Rexer Analytics
2013 Rexer Analytics
Vendors are included in this analysis.
6th
68 questions
[Chart: respondents by employment setting (labels include Academics, Corporate, Consultants); percentages could not be matched to settings]
Europe: Germany 8%, UK 5%, France 4%, Poland 3%
North America: USA 37%, Canada 3%
Key Findings
FOCUS ON CRM: In the past few years, there has been an increase among data miners in the already
substantial area of customer-focused analytics. Respondents are looking for a better understanding of
customers and seeking to improve the customer experience. This can be seen in their goals, analyses,
big data endeavors, and in the focus of their text mining.
BIG DATA: Many in the field are talking about the phenomenon of Big Data. There are clearly some
areas in which the volume and sources of data have grown. However, it is unclear how much Big Data
has impacted the typical data miner. While data miners believe that the size of their datasets has
increased over the past year, data from previous surveys indicate that the size of datasets has been
fairly consistent over time.
THE ASCENDANCE OF R: The proportion of data miners using R is rapidly growing, and since 2010, R
has been the most-used data mining tool. While R is frequently used along with other tools, an
increasing number of data miners also select R as their primary tool.
CHALLENGES IN THE USE OF ANALYTICS: Data miners continue to report challenges at each level
of the analytic process. Companies often are not using analytics to their fullest and have continuing
issues in the areas of deployment and performance measurement.
ENGAGEMENT & JOB SATISFACTION: The Data Miners in our survey are highly engaged with the
analytic community: consuming and producing content, entering competitions and searching for
education and growth within their jobs. All of these activities lead to high job satisfaction, which has been
increasing over time.
ANALYTIC SOFTWARE: Data miners are a diverse group who are looking for different things from their
data mining tools. Ease-of-use and cost are two distinguishing dimensions. Software packages vary in
their strengths and features. STATISTICA, KNIME, SAS JMP and IBM SPSS Modeler all receive high
satisfaction ratings.
Focus on CRM
[Chart: goals of analyses, 2011 vs. 2013; goal labels and their percentages were not recoverable]
Question: What were the goals of your analyses in the past year? (select all that apply) (Substantial changes noted in red)
[Chart: fields in which data mining is typically applied, 2013 vs. 2011. Fields: CRM/Marketing, Academic, Financial, Retail, Telecommunications, Insurance, Technology, Medical, Internet-based, Manufacturing, Government, Pharmaceutical. Percentages could not be reliably matched to fields.]
Question: In what fields do you TYPICALLY apply data mining? (Select all that apply)
[Chart: sources of data for large datasets. Labels include Text data, Geospatial data, RFID, and Audio data. Percentages could not be reliably matched to labels.]
Question: What are the sources of data for your large datasets? (select all that apply)
[Charts: text mining by survey year (2010, 2011, 2013) and by setting (Corporate, Consultants, Academic, NGO / Gov't). Legend: Current Text Miners, Plan to Start, No Plans to Start. Text sources include News articles. Percentages could not be matched to labels.]
Big Data
[Charts: dataset size. Labels: Same, Increased Somewhat, Increased a Lot, No Big Data Plan; survey years 2007 and 2009; size bands 1,001-10,000 records, 100,001-1,000,000 records, More than 100,000,000 records. Percentages could not be matched to labels.]
Question: What size data sets did you typically data mine in the past year?
[Chart labels: Pilot Program, Plan to Implement, Exploring]
[Chart labels: New sources, Volume alone, New types]
[Chart: sources of data for large datasets. Labels: Text data, Geospatial data, Mobile device data, Email, Image or video data, Audio data (3%), RFID. Other percentages could not be matched to labels.]
Question: What are the sources of data for your large datasets? (select all that apply)
[Charts: challenges and solutions, by number of respondents. Category labels: Time and Effort Required; Available Computing Power; Distributing or Parallel Processing; Data Storage; Model Performance; Pre-Processing and Data Checks; No Problems; Better Software, Algorithms, or Code; Sampling, Partitioning, or Reducing Dataset Size; Data Management; Upgrading or Replacing Hardware. Counts could not be matched to labels.]
The Ascendance of R
R Usage
70% of data miners report using R
[Chart: share of data miners using R by year, 2007 to 2013; axis 0% to 80%]
R is primary tool
#2: Dependability of software
The quality of the user interface was rated as significantly less important by primary R users than by
other data miners.
Interestingly, there was no difference in the stated importance of cost of tool between those using R as
their primary package and others. However, primary R users are more satisfied than other tool users
with the cost of their software (see page 33). They are also more satisfied with the variety of available
algorithms and the ability to modify algorithms to fine-tune analyses.
Challenges in the Use of Analytics
Only Corporate respondents are included in this analysis.
As in previous years, data miners report challenges at each level of the analytic process.
Companies often are not using analytics to their fullest and have continuing issues in the
areas of deployment and performance measurement. Only 16% of companies always use
analytics to address appropriate questions and 7% rarely or never do. Additionally, corporate
analytic sophistication is only considered high or very high by 38% of respondents.
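The two combined figures quoted above come from adding adjacent response categories. A minimal sketch of that arithmetic in Python, assuming the per-category shares printed on the accompanying survey charts; the dictionary and function names are illustrative and not part of the survey:

```python
# Response shares (percent) as printed on the survey charts.
# Names below are illustrative; only the numbers come from the report.
use_analytics = {
    "Never": 1,
    "Rarely": 6,
    "Sometimes": 34,
    "Usually": 43,
    "Always": 16,
}
# Only the two top sophistication categories are needed for the quoted figure.
sophistication_top = {"High": 24, "Very High": 14}


def combined_share(shares, categories):
    """Sum the percentage shares of the named response categories."""
    return sum(shares[c] for c in categories)


rarely_or_never = combined_share(use_analytics, ["Never", "Rarely"])
high_or_very_high = combined_share(sophistication_top, ["High", "Very High"])

print(f"Rarely or never use analytics: {rarely_or_never}%")
print(f"High or very high sophistication: {high_or_very_high}%")
```

The sums reproduce the 7% and 38% figures cited in the paragraph.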
Use Analytics: Never 1%, Rarely 6%, Sometimes 34%, Usually 43%, Always 16%
Question: When there are questions that can be addressed by analytics, how often does your company / organization use analytics to address them?
Sophistication: Very Low 3%, Low 14%, Moderate 40%, High 24%, Very High 14%
Frequency of Deployment
[Chart: how often analytic results are deployed, shown Overall and for Corporate, Consultants, Academic, and NGO / Gov't respondents. Response labels include Never, Rarely, Sometimes, and Always. Percentages could not be matched to labels.]
Question: How often are results of your analytics deployed and/or utilized?
Time to Deployment
[Chart labels: Minutes, Hours, Days, Weeks, Months, Year or More, Not Deployed]
Performance Measurement
Model Updating
[Chart labels: Never, Rarely, Sometimes, Most of the Time, Always; Daily or More Frequently, Weekly, Monthly, Quarterly, Annually. Percentages could not be matched to labels.]
Question: How frequently are models
typically updated in your organization?
Engagement & Job Satisfaction
[Chart: frequency of participation in the past year (Weekly or More, Monthly, A Few Times, Once) for each activity: Read journals; Read analytic newsletters / newsgroups; Read blogs; Attended webinars; Attended conferences; Presented at conferences; Submitted articles. Setting labels include Academic and NGO / Govt. Percentages could not be matched to labels.]
Question: How often in the past year have you participated in the following activities to stay informed and connect with other data miners?
[Chart: data mining / analytic competitions. Labels: Kaggle, KDD Cup, Conferences, Heritage Health Prize, CrowdAnalytix (2%), TunedIT, NITRD, Innocentive. Other percentages could not be matched to labels.]
Question: Which statement best describes your background and plans regarding data mining/analytic competitions? (Have competed and
plan to again, Have competed but do not plan to again, Never competed but plan to in the future, Never competed and do not plan to)
Vendors are included in this analysis.
Job Satisfaction
[Chart: job satisfaction (Very unsatisfied, Unsatisfied, Neutral, Satisfied, Very satisfied) for Corporate, Consultant, Academic, NGO / Govt, and Vendor respondents. Percentages could not be matched to labels.]
Number of Projects
[Charts: change in the number of data mining projects for Corporate, Consultants, and NGO / Gov't respondents. Scale labels: Decrease Substantially, Decrease Somewhat, No Change, Increase Somewhat, Increase Substantially; and Decreased Significantly, Decreased Slightly, Same, Increased Slightly, Increased Significantly. Percentages could not be matched to labels.]
Question: How will the number of data mining projects your organization
conducts this year compare to what has been typical in the past few years?
Vendors are included in this analysis.
Despite the high satisfaction rates, data miners are able to identify several ways their job
satisfaction can be increased (other than being paid more). The number one way:
greater appreciation by management or clients and greater autonomy while working on
analytic projects. Interesting projects, educational opportunities, and expansion of
analytics are also cited by a number of respondents as ways to enhance job satisfaction.
[Chart: ways to increase job satisfaction, by number of respondents; category labels were not recoverable]
Question: Other than being paid more, what one thing would increase your satisfaction with your job?
Analytic Software
Tool Selection
[Chart: five respondent segments: A 15%, B 18%, C 21%, D 11%, E 35%. Segment descriptions shown: "Ability to write one's own code is important"; "Ease-of-use & interface quality are important"; "Everything is important".]
[Table: segment profiles, one value per segment, in the order extracted]
Importance of cost: Very high, High, Moderate, Low / Moderate, Very high
Importance of ease-of-use: High, Low / Moderate, Moderate, High, Very high
Importance of user interface quality: High, Low, High, Very high, Very high
Importance of ability to write one's own code: Low, Very high, High, Low, High
Many experienced data miners: Less Likely, More Likely (other segments blank)
Primary tools (segment boundaries unclear): R (56%), SAS (10%), R (26%), SAS (19%), STATISTICA (31%), IBM Modeler (20%), Rapid Miner (12%), R (19%), STATISTICA (16%), KNIME (10%), Rapid Miner (10%)
Tool use (one group per segment): R (62%), Rapid Miner (50%), IBM Statistics (40%), IBM Modeler (36%), Weka (33%); R (90%), Weka (37%), SAS (33%), Matlab (31%); R (73%), SAS (43%), IBM Statistics (35%), Matlab (32%), SQL Server (32%), SAS-EM (32%); R (51%), IBM Statistics (38%), STATISTICA (37%), IBM Modeler (32%); R (73%), IBM Statistics (35%), Rapid Miner (34%), Weka (32%), SQL Server (30%), SAS (30%)
[Chart: tool use by setting (Corporate, Consultants, Academics, NGO / Govt); axis 0% to 80%. Tools shown:]
R
IBM SPSS Statistics
Rapid Miner
SAS
Weka
Matlab
Microsoft SQL Server
IBM SPSS Modeler
SAS Enterprise Miner
KNIME
STATISTICA
Mathematica
Minitab
SAS JMP
IBM Cognos
Oracle Advanced Analytics
C45 / C50 / See5
Orange
SAP
Salford Systems
TIBCO S+ / Spotfire Miner
KXEN
Question: What data mining / analytic tools did you use in the past year? (rate each as never, occasionally, or frequently)
Tool Satisfaction
Most data miners are happy with their analytic software. STATISTICA and KNIME have particularly high
satisfaction ratings (they also had the highest ratings in the 2011 survey). SAS JMP, IBM SPSS Modeler,
Rapid Miner and R also have high ratings. While people are more satisfied with their primary tools, the
patterns of primary and secondary tool satisfaction are generally similar. However, people choosing IBM
SPSS Statistics as their secondary tool give it high ratings, while people using SAS Enterprise Miner and
IBM SPSS Modeler as their secondary tools give these tools lower ratings.
Most people also report that they will continue using their primary tools. The highest continuation rate is
among people choosing KNIME as their primary tool: 85% report that they are extremely likely to
continue using it as their primary tool for the next 3 years. R and STATISTICA users also report especially
high continuation plans. Across all tools, when people say they are likely to switch primary tools, many are
choosing R (see page 16).
Satisfaction with Primary & Secondary Tools
[Chart: satisfaction ratings (Extremely Dissatisfied, Dissatisfied, Neutral, Satisfied, Extremely Satisfied) for STATISTICA, KNIME, SAS JMP, IBM SPSS Modeler, Rapid Miner, R, IBM SPSS Statistics, Oracle Advanced Analytics, KXEN, Weka, SAS, Matlab, SAS Enterprise Miner, Minitab, and Microsoft SQL Server. Percentages could not be matched to labels.]
Satisfaction question: Please rate your overall satisfaction with [insert name of previously identified software package].
[Table: satisfaction ratings for primary data mining packages on each factor, shaded from Lower Satisfaction to Higher Satisfaction. Column labels include KNIME, Rapid Miner, SAS, SAS Enterprise Miner, STATISTICA, and Weka; row labels include Cost of software. Ratings could not be matched to cells.]
Question: Rate how satisfied you are with the performance of your primary data mining package (identified earlier) on each of these factors.
Other Findings
Vendors are included in this analysis.
A variety of labels are used to describe analytic professionals. The most common
descriptors chosen by survey respondents are Data Scientist, Researcher, Data Analyst,
and Business Analyst.
[Chart: self-described job labels. Labels: Data Scientist, Researcher, Data Analyst, Business Analyst, Statistician, Data Miner, Predictive Modeler, Engineer, Computer Scientist, Software Developer, Other. Percentages could not be matched to labels.]
Algorithms
Regression, decision trees, and cluster analysis continue to form a triad of core algorithms for
most data miners. This has been consistent since the first Data Miner Survey in 2007.
The average respondent reports typically using 12 algorithms. People with more years of
experience use more algorithms, and consultants use more algorithms (13) than people
working in other settings (11).
The number of algorithms used varies by the labels people use to describe themselves, with
Data Miners (14) and Data Scientists (14) using the most, and Software Developers (9)
and Programmers (8) the fewest.
[Chart: frequency of use (legend: Often, Sometimes, Rarely) for each algorithm: Regression, Decision trees, Cluster analysis, Time series, Text mining, Ensemble models, Factor analysis, Neural nets, Random forests, Association rules, Bayesian, Support vector machines (SVM), Anomaly detection, Proprietary algorithms, Rule induction, Social network analysis, Uplift modeling, Survival analysis, Link analysis, Genetic algorithms, MARS. Percentages could not be matched to labels.]
Question: What algorithms / analytic methods do you TYPICALLY use? (Select all that apply)
Computing Environments
There have been notable increases over the past four years in the use of servers
(local or mainframe) and cloud computing for data mining. Meanwhile, processing
locally (on a desktop or laptop) has remained fairly constant.
Computing Environment
[Chart: local processing, server processing, and cloud computing by year, 2010 to 2013; axis 0% to 80%]
[Chart: operating systems. Windows 89%, Linux 37%; Unix and Mac OS are labeled 15% and 14%.]
Appendix: Rexer Analytics
Senior Staff
Karl Rexer, PhD
Paul Gearan
Heather Allen, PhD
Key Partners
IBM (SPSS)
Oracle
Bernett Research
Vlamis Software
Example Projects
Customer attrition analysis & prediction
Student retention analysis & prediction
Analytic CRM strategy
Fraud detection
Models to predict loan default
Customer segmentation
Sales forecasting
Market basket analysis
Product allocation optimization
CRM metric design & measurement
Predictive models for customer acquisition and cross-sell campaign targeting
Survey research (to understand customer needs & customer decision making)
[Chart: Rexer Analytics clients by year, 2002 through 2013 YTD. Client names are interleaved across years in the extraction and could not be reliably assigned to years.]