Workstation Performance
Engineering, Dell, Inc.
frequency. For heavy computational workloads, it makes the most financial sense to spend money on the relevant components that undertake that work, like CPU and GPU. Also, multi-socket platforms can provide great performance improvements by reducing the amount of time a task requires to complete, so long as the software processing the work is able to scale in performance as processor count increases. If the application does not scale across the available processors, whether due to architectural or licensing limitations, the additional cost and complexity of the second socket may not be justified.

CPU

When choosing a CPU, first think about how much time is spent in computational workloads, where all available cores will be driven for long durations at high utilization. The more time spent in these usage types, the more of the workstation budget should be spent on maximizing core count. Begin by maximizing core count in a single socket, while considering budgetary requirements for other components. Then, if additional computational performance is desired, move to a platform with dual CPU sockets to further increase computational performance.
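The scaling caveat above can be made concrete with Amdahl's law (not named in the text, but the standard model for this effect): the speedup from adding cores is capped by the fraction of the work that stays serial. A minimal sketch, using hypothetical numbers rather than any measurement from this paper:

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Amdahl's law: ideal speedup when only part of the work
    can run in parallel and the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# A hypothetical renderer whose work is 95% parallel: doubling cores
# by adding a second 8-core socket helps, but far less than 2x.
one_socket = amdahl_speedup(0.95, 8)    # ~5.9x over a single core
two_sockets = amdahl_speedup(0.95, 16)  # ~9.1x over a single core
print(round(two_sockets / one_socket, 2))  # 1.54
```

If the application's serial fraction is large, or licensing blocks the extra cores entirely, the second socket buys very little, which is exactly the situation where its cost and complexity are hard to justify.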
Figure 4 CPU Impact on Graphics Performance

While it is best to maximize core counts for computational workloads, interactive usage models provide the best performance with the highest CPU frequency. This is because interactivity (as measured by frames per second) is often limited by the efficiency of a single core to feed the GPU with instructions and data. Most modern graphics programming interfaces can only feed data and instructions to the GPU using a single thread, despite the GPU driver being multi-threaded. As a result, performance benefits with increasing core count are negligible beyond four cores. The more time spent in interactive usage models, the more of the workstation budget should be used to increase the maximum CPU frequency.

Most Intel CPUs available in Precision workstations support a feature called Turbo. Turbo mode is when a CPU adjusts its frequency based on the workload distributed across its cores. When fewer cores are busy, the CPU runs at a higher frequency. The highest Turbo frequencies are possible when only a single core is active; the lowest Turbo frequencies are used when many, or even all, cores are active. This dynamic clocking allows interactive workloads to operate at peak Turbo frequencies, while computational workloads still operate above the nominal frequency of the CPU. This is important because comparing the nominal frequency of two CPUs (or their rated frequency, commonly quoted alongside the model name) isn't always representative of the frequency at which they will operate the majority of the time.

To compare CPUs more precisely, one should compare the Low Frequency Mode (LFM), High Frequency Mode (HFM), minimum Turbo frequency (all cores loaded), and maximum Turbo frequency (one core loaded). LFM is important to compare if you want more power efficiency at idle: if the CPU isn't doing any work, how important is it that it consume as little power as possible? HFM is important to compare if the CPU doesn't support Turbo. Minimum Turbo frequency is important to compare if the CPU will spend most of its time running computational workloads. And finally, maximum Turbo frequency is important to compare if the CPU will spend most of its time running interactive or otherwise single-threaded workloads.

When weighing whether to maximize CPU core count, maximize CPU frequency, or strike some blend between them, one should always seek the latest CPU microarchitecture and generation. Newer CPU generations typically come with either a process shrink (smaller transistors) or a new architecture. Newer architectures often bring greater performance at the same frequency, and this extends beyond just CPU performance. Because many applications spend time waiting on a single core to feed the GPU instructions and data, as the CPU's integer performance increases, so does graphics performance. What this means is that a CPU of the same frequency on a newer-generation architecture can provide higher frames per second with the same graphics card!

Graphics

In general, when it comes to graphics cards, the more you spend, the more speed you can buy. Speed in graphics is most commonly associated with real-time rendering performance, as measured in frames per second. The higher your frames per second in an application, the more fluid your interactions with the data model, and the more productive you can be. Computational capabilities aside, finding the right graphics solution for a workstation depends on the desired frames per second in the applications of most interest.

A good rule of thumb for graphics performance is to look for a card that is capable of delivering more than 30 frames per second in the most important applications, using data models and rendering modes most like those in your day-to-day use. While the persistence of vision
phenomenon suggests that 25 frames per second is the minimum required to maintain the illusion of smooth animation, more is always better. If a particular graphics card is able to deliver more than 100 frames per second in a particular rendering method using a specific model size and type, it is reasonable to assume that you can increase the complexity and/or size of the model and still be able to interact with that model without observable stuttering.

[Figure: Graphics Performance (frames per second)]

most, however, this is impractical, so it's important to use some other standard method of measurement as a proxy.

SPECviewperf (available at SPEC.org) is an excellent benchmark for comparing different workstation graphics cards because it measures the frames per second of several varied workloads using rendering methods that mirror those of many popular workstation applications. With this benchmark, anyone can view the detailed frames per second measurements of several different methods of
populate eight DIMMs of an 8 GB capacity each, or four DIMMs of a 16 GB capacity each, choose the option that populates more DIMM slots. The increase in available memory bandwidth will reduce the likelihood that memory bandwidth is the bottleneck to computational workloads, shifting the computational burden back to the CPU cores, frequency, and cache.

Choosing the right frequency is also important, and varies depending on the workload. In applications requiring maximum memory bandwidth, populating all available DIMM slots with the highest-frequency memory is important.

However, some applications require the lowest latency possible, irrespective of available bandwidth, and in that case you would want to populate all available DIMM slots with the lower-frequency memory. An example of this is random access to memory that is small enough to fit in the CPU cache, but where there is no way for the CPU to predict which memory location will be accessed next.

While memory bandwidth remains important, the lower latency of the slower memory speed can provide benefits to these random reads and writes. For example, in financial markets such as high-speed trading, the applications that monitor the current prices of investments like stocks and commodities require the lowest latency so they can react to market changes as quickly as possible. Because great fortunes can be gained and lost in fractions of a second, the lower latency of slower-frequency memory can provide advantages, even if overall memory bandwidth is lower.

Figure 6, below, illustrates this concept using SPECwpc computational workloads. A Precision T7610 platform with dual Xeon E5 CPUs produces significantly higher scores as the number of DIMMs increases, all else being equal. What this translates to in the real world is lower times to complete jobs like rendering, finite element method (FEM) analysis, computational fluid dynamics, and similar. It should be noted that the benefits are specific to computational workloads; the benefits of populating more DIMM slots for interactive workloads are difficult to measure, as most graphics workloads fit into graphics memory and most graphics cards have dedicated memory on the card.

[Figure 6: SPECwpc improvements with increased DIMM slot population. Memory bandwidth comparison, percent of baseline, for 4 x 8GB 1866 MHz, 8 x 2GB 1600 MHz, and 16 x 8GB 1866 MHz configurations across SPECwpc workloads (Prod. Dev., Life Sci., Energy; Rodinia, WPCcfd, Lammps, FFTW).]

Lastly, when the integrity of the data used in individual computations is paramount to the end result, Error Checking & Correction (ECC) memory should be used. For example, when iterating across a large dataset where the outputs of computations are continually provided as inputs into another sequence of computations, one mistake missed in early computations can have a dramatic impact on the final outcome.

One unique feature of Dell Precision Workstations is the incorporation of Dell's exclusive, patented Reliable Memory Technology (RMT). RMT uses the Error Correcting Code (ECC) data to identify specific locations in the DIMM where errors are occurring, if they occur. RMT can identify an error at a particular memory bit location and automatically mask that location in memory to ensure that a healthy location is used for subsequent reads and writes. The net effect of this is that, rather than replacing an entire DIMM as ECC errors become overwhelming, RMT masks only the small regions of concern and allows the DIMM to operate normally, increasing system availability and reliability, and extending the usable life of the memory by allowing the user to continue to work using a
DIMM that would require immediate replacement on other systems.

Storage

There are a wide variety of storage performance considerations, and they depend completely on the usage model. For instance, is the data on the network or stored locally? If so, how frequently are updates committed to the network resource? If not, how much capacity is required locally? Is redundancy required on the local storage? All of these factors are important to determining the right storage components for the workstation, and due to this complexity the subject deserves much greater attention than can be given here.

For simplicity, we'll assume the data is stored locally on the workstation and not concern ourselves with network bandwidth, frequency of updates, or check-in/check-out procedures. A single user of the workstation will have a blend of three common local storage use cases:

- Office Productivity: reading and writing small files with occasional large file transfers
- Interactive Workstation: opening

because it is faster to access than from the rotating media in the drive. As long as the files in use are relatively small, data is kept on the flash memory, resulting in faster performance.

Figure 7, below, is an illustration of the typical scaling one might expect across the various usage models and drive types. SATA and serial-attached-SCSI (SAS) drives are architecturally very similar, so their scaling is uniform across the workloads and depends primarily on disk interface type, rotational speed, and on-board memory. Hybrid drive performance can vary greatly depending on the workload: the more deterministic and repetitive the workload, the better the hybrid performs. However, hybrids are limited by the size of their flash cache; thus, for computational workloads that iterate across a large dataset in a non-deterministic way, hybrids provide less benefit than in office productivity usages. SATA and PCIe SSDs provide significant benefits over rotational drives, most notably in random reads. SATA SSDs are limited by their interface type, so for maximum throughput in interactive and computational workloads a PCIe SSD provides significant improvements.

[Figure 7: Single-Drive Performance, relative to baseline]
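The interface limit mentioned above for SATA SSDs can be made concrete: SATA III's 6 Gb/s line rate uses 8b/10b encoding (10 bits on the wire per data byte), capping usable throughput near 600 MB/s, while PCIe 3.0 delivers far more per lane. The figures below are general interface arithmetic, not measurements from this paper, and the x4 lane count is just a common SSD configuration:

```python
def sata3_ceiling_mb_s() -> float:
    """SATA III: 6 Gb/s line rate, 8b/10b encoding
    (10 bits on the wire for every data byte)."""
    return 6e9 / 10 / 1e6

def pcie3_ceiling_mb_s(lanes: int) -> float:
    """PCIe 3.0: 8 GT/s per lane, 128b/130b encoding."""
    per_lane_bytes = 8e9 * (128 / 130) / 8
    return per_lane_bytes * lanes / 1e6

print(sata3_ceiling_mb_s())          # 600.0
print(round(pcie3_ceiling_mb_s(4)))  # 3938
```

Real drives sit below these ceilings, but the gap explains why a fast SSD behind a SATA port cannot show its full potential, while the same flash behind a PCIe link can.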
this option will be the best performing short of a multi-drive RAID 0.

RAID arrays enable the creation of a large virtual drive that spans one or more physical (or logical) drives. Depending on the RAID type, new features such as redundancy (having more than one copy of the data simultaneously) and greater performance are possible. If redundancy is as important as performance, or more so, a RAID array such as RAID 1, 10, or 5 would be a better choice. Then, it becomes a decision between the available (matching) drives to build the array. Moving to a RAID can increase storage costs considerably, making it prohibitive to include high-performing drives in the array. One way to mitigate this cost while maintaining high performance is to use an SSD boot drive with the operating system and applications on it, while building a RAID array out of lower-cost rotational disk drives to store the larger datasets.

For the computational workstation wherein significantly large datasets are used, the only option for this type of usage may be a RAID array composed of large drives. By combining smaller-capacity drives into a single large volume, the application can use all of this capacity as if it were a single large drive. Multiple drives in RAID 0 will maximize performance and capacity, but this provides no redundancy. Multiple drives in RAID 1 provide redundancy but don't maximize performance or capacity. RAID 10 increases performance and capacity and adds redundancy, but is the most costly in terms of the number of drives required.

If the cost of RAID 10 is prohibitive, consider the option to upgrade to a discrete RAID controller with on-board memory and moving to RAID 5. You're likely to see higher capacity, and the addition of a discrete RAID controller may mean higher performance in office productivity and interactive workstation usages, not to mention benefits to computational workstation usage types.

Figure 8, below, illustrates the scaling of performance across various usage models with different RAID types. The scaling here assumes a full hardware RAID controller card, such that parity computation and the full RAID protocol stack are offloaded from the CPU. All other variables of the system configuration are kept constant. Each bar is also a mean across a number of different measurements that include reads, writes, random, sequential, and variations of these. As you can see, the two-drive RAID 0 performs well, but offers no redundancy. RAID 1 offers redundancy, and the presence of two drives does improve read performance, but there is a slight penalty in write operations whenever the writes must be committed to the disks. RAID 5 sees great benefit with three drives, and the hardware RAID controller unburdens the CPU from parity calculation, lessening the impact of those operations on write performance. It's important to avoid RAID 5 if using a software RAID controller, because the CPU will be taxed by parity computation and performance will in many cases be much lower than with other RAID types. RAID 10 offers an excellent blend of performance and redundancy, though it requires four drives to build.
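The capacity and redundancy trade-offs above reduce to simple arithmetic: RAID 0 keeps all capacity with no protection, RAID 1 and 10 mirror away half of it, and RAID 5 gives up one drive's worth to parity. A minimal sketch, assuming matching drives (the drive counts and the 2 TB size are hypothetical):

```python
def raid_usable_capacity(level: str, n_drives: int, drive_tb: float) -> float:
    """Usable capacity in TB for common RAID levels with matching drives."""
    if level == "0":                       # striping: all capacity, no redundancy
        return n_drives * drive_tb
    if level == "1":                       # mirroring: half the capacity
        return n_drives * drive_tb / 2
    if level == "5":                       # one drive's worth lost to parity
        return (n_drives - 1) * drive_tb
    if level == "10":                      # striped mirrors: half the capacity
        return n_drives * drive_tb / 2
    raise ValueError(f"unsupported RAID level: {level}")

# Four hypothetical 2 TB drives:
print(raid_usable_capacity("0", 4, 2.0))   # 8.0 TB, no failures tolerated
print(raid_usable_capacity("5", 4, 2.0))   # 6.0 TB, survives one drive failure
print(raid_usable_capacity("10", 4, 2.0))  # 4.0 TB, survives one failure per mirror
```

This is why RAID 10 is called the most costly in drives: for the same usable capacity as a four-drive RAID 5 you would need six drives in RAID 10.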
Figure 8 Hardware RAID Performance

Application Certification

One of the key differentiators of professional workstations, as compared with conventional PCs, is the platform certifications to run specific professional workstation applications. Considering the complexity of the software environment, one can imagine the incredible number of variations that might exist in operating system versions, application versions, and hardware, firmware, and driver versions. All of these variables can have an effect on application stability and performance. Workstation certification addresses this complexity by carefully documenting which configurations of a system were determined to be compatible with the specific application version of interest. The latest certification list is available on Dell.com/Workstations. This reduces the risk to the user when considering a new workstation or an upgrade to an existing workstation, as they can be confident before purchasing that the specific workstation has been certified by the software vendor of their desired application.

Conclusion

To find the right workstation configuration, first identify the primary and any secondary or tertiary usage models.

For interactive usage models, focus on maximizing CPU frequency, followed by the class of graphics. For individual CPU models, choose the latest architecture and look primarily at the peak Turbo frequency. Of the available CPU models, determine the best frequency for the price. Then look to graphics and compare frames per second using industry-standard benchmarks such as SPECviewperf, focusing on the applications and/or rendering modes that are most important to your usage. Judge which of the available GPU models offers the best frames per second for the price. Then look to memory and maximize memory bandwidth at the capacity desired. And finally, look to storage, where a single SSD might address all the interactive usage model needs, unless capacity or redundancy requires a RAID array, or spending limits dictate a single rotational disk drive.

For computational usage models, focus on maximizing core count, followed by CPU frequency. For individual CPU models, choose the latest architecture and look primarily at the lowest Turbo frequency (which reflects the lowest frequency the CPU will Turbo up to under heavy load). Look for the best core count per dollar, and if the workstation will spend more than half of its life in computational work, consider upgrading to a dual-socket workstation. If the application supports GPU compute, consider upgrading the GPU to models with more compute cores, as the performance per dollar in GPU upgrades will often be higher than for the CPU (here again based on the percentage increase in core count). Upgrade memory by populating as many slots as possible first; except for a limited set of applications which are highly sensitive to latency, it is always best to upgrade to the fastest memory speed for the maximum possible computational throughput. Finally, consider the storage requirements primarily in terms of the capacity and bandwidth required by the application, which are often much larger and higher than with other usage models.

Optimal performance for a particular usage model can be achieved by identifying the factors that are most important to that usage: those having the highest impact on performance. Combining those selections in a workstation certified for use with the key applications of that usage model will ensure that the user has the best possible experience at any price.
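The configuration order walked through in this conclusion can be restated compactly. The structure below is merely a sketch summarizing the text's priorities (the dictionary form and function name are my own, not any Dell tool):

```python
# Component upgrade order per usage model, restating the conclusion above.
PRIORITIES = {
    "interactive": [
        "CPU frequency (peak Turbo)",
        "graphics (frames per second in key applications)",
        "memory bandwidth at the desired capacity",
        "storage (single SSD, or RAID if capacity/redundancy demands)",
    ],
    "computational": [
        "CPU core count",
        "CPU frequency (minimum Turbo, all cores loaded)",
        "GPU compute cores (if the application supports GPU compute)",
        "memory (populate all slots, fastest speed)",
        "storage capacity and bandwidth",
    ],
}

def priorities(usage_model: str) -> list:
    """Return the spending priority order for a usage model."""
    return PRIORITIES[usage_model]

print(priorities("interactive")[0])  # CPU frequency (peak Turbo)
```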