Escolar Documentos
Profissional Documentos
Cultura Documentos
Analysis in R
Lee E. Edlefsen
Chief Scientist
UserR! 2011
1
Introduction Revolution Confidential
Revolution R Enterprise 4
Scalability Revolution Confidential
Revolution R Enterprise 5
Keys to scalability Revolution Confidential
Revolution R Enterprise 7
Huge benefits to huge data Revolution Confidential
Revolution R Enterprise 8
Two huge problems: capacity and speed Revolution Confidential
9
We are currently incapable of analyzing a
lot of the data we have Revolution Confidential
Revolution R Enterprise 10
Requires software solutions Revolution Confidential
Revolution R Enterprise 11
High Performance Analytics: HPA Revolution Confidential
Revolution R Enterprise 12
Revolution’s approach to HPA and
scalability in the RevoScaleR package Revolution Confidential
User interface
Set of R functions (but mostly implemented in
C++)
Keep things as familiar as possible
Capacity and speed
Handle data in large “chunks” – blocks of rows
and columns
Use parallel external memory algorithms
Revolution R Enterprise 13
Some interface design goals Revolution Confidential
Revolution R Enterprise 15
Sample code for logit on a cluster Revolution Confidential
Revolution R Enterprise 16
Sample data transformation code Revolution Confidential
Revolution R Enterprise 18
The basis for a solution for capacity, speed,
distributed and streaming data – PEMA’s Revolution Confidential
20
Parallel algorithms are widely available
Revolution Confidential
21
Not much literature on PEMA’s Revolution Confidential
Revolution R Enterprise 22
Parallelizing external memory algorithms Revolution Confidential
23
Parallel Computations with
Synchronization and Communication
Revolution Confidential
24
Embarrassingly Parallel Computations
Revolution Confidential
25
Sequential External Memory Computations
Revolution Confidential
26
Parallel External Memory Computations
Revolution Confidential
27
Example external memory algorithm for the
mean of a variable Revolution Confidential
28
PEMA’s in RevoScaleR Revolution Confidential
Revolution R Enterprise 29
PEMA’s in RevoScaleR (2) Revolution Confidential
Revolution R Enterprise 35
Use of multiple cores per computer Revolution Confidential
Shared Memory
RevoScaleR
Compute
Data Node
Partition (RevoScaleR) • Portions of the data source are
made available to each compute
Compute node
Data Node
Partition • RevoScaleR on the master node
(RevoScaleR)
Master assigns a task to each compute
node
Node
Compute (RevoScaleR)
Data • Each compute node independently
Node processes its data, and returns it‟s
Partition (RevoScaleR) intermediate results back to the
master node
Compute
Data • master node aggregates all of the
Node intermediate results from each
Partition (RevoScaleR) compute node and produces the
final result
Data management capabilities Revolution Confidential
Revolution R Enterprise 41
Benchmarks using the airline data Revolution Confidential
Revolution R Enterprise 43
Scalability of RevoScaleR with Rows
Regression, 1 million - 1.1 billion rows, 443 betas
Revolution Confidential
(4 core laptop)
Revolution R Enterprise 44
Scalability of RevoScaleR with Nodes
Regression, 1 billion rows, 443 betas
Revolution Confidential
Revolution R Enterprise 45
Scalability of RevoScaleR with Nodes
Revolution Confidential
Logit, 123.5 million rows, 443 betas, 5 iterations
(1 to 5 nodes, 4 cores per node)
Revolution R Enterprise 46
Scalability of RevoScaleR with Nodes
Logit, 1 billion rows, 443 betas, 5 iterations
Revolution Confidential
Revolution R Enterprise 47
Comparative benchmarks Revolution Confidential
Lee Edlefsen
lee@revolutionanalytics.com
Revolution R Enterprise 49