Best practices
Adarsh Pannu
IBM Analytics Platform
DRAFT: This is work in progress. Please send comments to adarshrp@us.ibm.com
Standalone: Bundled with Spark, doesn't play well with other applications, fine for PoCs
Each mode has a similar logical architecture, although the physical details differ in terms of which
processes and threads are launched, and where.
Driver runs the main() function of the application. It can run outside the cluster (client mode) or
inside the cluster (cluster mode).
SparkContext is the main entry point for Spark functionality. It represents the
connection to a Spark cluster.
Executor is a JVM that runs tasks and keeps data in memory or disk storage across
them. Each application has its own executors spread across a cluster.
[Figure: Inside an executor: tasks running on RDD partitions (P1, P2, P3), cached RDD partitions from another RDD, and internal threads for shuffle, transport, GC, and other system work.]
[Figure: Standalone deployment: a Master process plus Worker processes on Machine 1 and Machine 2, each Worker launching Executors for Client 1 and Client 2.]
          Per Worker              Per Application     Per Executor
CPU       SPARK_WORKER_CORES      spark.cores.max     n/a
Memory    SPARK_WORKER_MEMORY     n/a                 spark.executor.memory
Standalone mode uses a FIFO scheduler. As applications launch, it will try to balance resource
consumption across the cluster. Strangely, cores are specified per application, yet memory is per
executor!
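As a rough illustration (the application name and values below are assumptions, not taken from this deck), these two knobs can be set on a SparkConf before the SparkContext is created:

    // Hedged sketch: standalone-mode resource settings (illustrative values only).
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("StandaloneDemo")              // hypothetical application name
      .set("spark.cores.max", "8")               // total cores granted to this application
      .set("spark.executor.memory", "4g")        // heap size of each executor
    val sc = new SparkContext(conf)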
[Figure: Spark on YARN: the Resource Manager runs on Machine 0; Node Managers on Machine 1 and Machine 2 launch Containers hosting the Spark Application Master and Executors. All boxes are JVMs; inter-process communication not shown.]
Spark Configuration
Spark has scores of configuration options:
For many options, defaults generally work alright
However, there are some critical knobs that should be carefully tuned
Several settings are cluster-manager specific. When running Spark on YARN, you must examine:
YARN-specific settings: scheduler type and queues
Spark-specific settings for YARN: # of executors, per-executor memory and cores, and more (see the sketch after this list)
Other general techniques will improve your applications on any cluster manager. For example:
Java object serialization schemes (Kryo vs Java)
Proper partitioning and parallelism levels
On-disk data formats (Parquet vs AVRO vs JSON vs ...)
And many more ... (to be covered elsewhere)
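For illustration only (the application name and numbers are assumptions, not recommendations), the YARN sizing and serialization knobs above might be set like this; the same values can instead be passed to spark-submit via --num-executors, --executor-memory and --executor-cores:

    // Hedged sketch: Spark-on-YARN sizing plus Kryo serialization (illustrative values).
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("YarnTuningDemo")                                            // hypothetical name
      .set("spark.executor.instances", "10")                                   // # of executors
      .set("spark.executor.memory", "4g")                                      // per-executor memory
      .set("spark.executor.cores", "2")                                        // per-executor cores
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // Kryo instead of Java serialization
    val sc = new SparkContext(conf)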
          Per Node Manager                         Per Executor (need to specify these)
CPU       yarn.nodemanager.resource.cpu-vcores     --executor-cores OR spark.executor.cores
Memory    yarn.nodemanager.resource.memory-mb      --executor-memory OR spark.executor.memory
Spark internally adds an overhead to spark.executor.memory to account for off-heap JVM usage:
overhead = MAX(384 MB, 10% of spark.executor.memory)
// As of Spark 1.4
YARN further adjusts the requested container size:
1. Ensures memory is a multiple of yarn.scheduler.minimum-allocation-mb. Despite its name, this
isn't merely a lower bound. CAUTION: Setting yarn.scheduler.minimum-allocation-mb too
high can over-allocate memory because of rounding up (see the worked example below).
2. Ensures the request size is bounded by yarn.scheduler.maximum-allocation-mb
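A worked example may help; the numbers are assumed for illustration (4 GB executors, 1 GB YARN minimum allocation), not taken from this deck:

    // Hedged sketch: sizing one container request (plain arithmetic, not a Spark API).
    val executorMemoryMb = 4096                                       // spark.executor.memory = 4g
    val overheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)   // = 409 MB
    val requestedMb = executorMemoryMb + overheadMb                   // = 4505 MB
    val minAllocMb  = 1024                                            // yarn.scheduler.minimum-allocation-mb
    // YARN rounds the request up to the next multiple of the minimum allocation.
    val containerMb = math.ceil(requestedMb.toDouble / minAllocMb).toInt * minAllocMb   // = 5120 MB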
Executor heap usage:
              Setting                          Default      Used for
Cache         spark.storage.memoryFraction     0.6 (60%)    Cached RDDs; relevant if .cache() or .persist() is called.
Shuffle       spark.shuffle.memoryFraction     0.2 (20%)    Shuffles; increase this for shuffle-intensive applications wherein spills happen often.
App objects   (remaining heap)                 n/a          Everything else the application allocates.
Guideline: Stick with defaults, and check execution statistics to tweak settings.
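As a sketch only (the larger shuffle fraction below is an assumed example, not a recommendation), a shuffle-heavy Spark 1.x job might rebalance the two fractions like this:

    // Hedged sketch: shifting heap from cache to shuffle for a spill-prone job (Spark 1.x fractions).
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ShuffleHeavyDemo")                   // hypothetical name
      .set("spark.shuffle.memoryFraction", "0.4")       // grow shuffle space if spills are frequent
      .set("spark.storage.memoryFraction", "0.4")       // shrink cache space to compensate
    val sc = new SparkContext(conf)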
Spark tries to execute tasks on nodes such that there will be minimal data movement (data locality)
! Loss of data locality = suboptimal performance
These tasks are run on executors, which are (usually) launched when a SparkContext is spawned,
and well before Spark knows what data will be touched.
Your application can tell Spark which nodes hold the data (preferred locations). Using a simple
API, you can supply this information when instantiating a SparkContext.
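A heavily hedged sketch: this assumes the Spark 1.x SparkContext constructor overload that accepts preferred node location data (a Map[String, Set[SplitInfo]]); the host names are hypothetical and the empty split sets are placeholders, so check your Spark release's API before relying on it:

    // Hedged sketch: hinting preferred locations at SparkContext creation (Spark 1.x style).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.SplitInfo

    val conf = new SparkConf().setAppName("LocalityHintDemo")      // hypothetical name
    // Hypothetical hosts mapped to the input splits they hold (placeholders shown empty).
    val preferredLocations: Map[String, Set[SplitInfo]] = Map(
      "datanode1.example.com" -> Set.empty[SplitInfo],
      "datanode2.example.com" -> Set.empty[SplitInfo])
    val sc = new SparkContext(conf, preferredLocations)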
Prior to Release 1.3, Spark acquired all executors at application startup and held onto them for the
lifetime of an application.
Starting Release 1.3, Spark supports dynamic allocation of executors. This allows applications to
launch executors when more tasks are queued up, and release resources when the application is
idle.
Ideally suited for interactive applications that may see user down-time.
Major caveat: Spark may release executors holding cached RDDs! Ouch! So if your application uses
rdd.cache() or rdd.persist() to materialize expensive computations, you may not want to use dynamic
allocation for that application.
On the other hand, you could consider caching expensive computations in HDFS.
Setting                                                     Default               Description
spark.dynamicAllocation.enabled                             false                 Whether dynamic allocation is used at all.
spark.dynamicAllocation.minExecutors                        0                     Lower bound on the number of executors.
spark.dynamicAllocation.maxExecutors                        <Infinity>            Upper bound on the number of executors.
spark.dynamicAllocation.executorIdleTimeout                 600 secs (10 mins)    How long an executor may sit idle before it is released.
spark.dynamicAllocation.schedulerBacklogTimeout             5 secs                How long tasks must be backlogged before additional executors are requested.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout    (same as above)       Backlog timeout applied to subsequent executor requests.
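As a sketch under assumed values (the application name, bounds, and timeout below are illustrative), the settings listed above might be enabled like this; note that the external shuffle service also has to be enabled so shuffle files outlive released executors:

    // Hedged sketch: enabling dynamic allocation on Spark 1.3+ (illustrative values).
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("InteractiveDemo")                                 // hypothetical name
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "50")
      .set("spark.dynamicAllocation.executorIdleTimeout", "600")     // seconds
      .set("spark.shuffle.service.enabled", "true")                  // required when dynamic allocation is on
    val sc = new SparkContext(conf)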