
WHITEPAPER

BIG DATA VISUALIZATION WITH DATASHADER
Dr. James Bednar, Open Source Tech Lead & Solutions Architect, Continuum Analytics
August 2016
In This Whitepaper
Data science is about using data to provide insight and evidence that can lead business, government and academic
leaders to make better decisions. However, making sense of the large data sets now becoming ubiquitous is
difficult, and it is crucial to use appropriate tools that will drive smart decisions.

The beginning and end of nearly any problem in data science is a visualization—first, for understanding the shape
and structure of the raw data and, second, for communicating the final results to drive decision making. In either
case, the goal is to expose the essential properties of the data in a way that can be perceived and understood by
the human visual system.

Traditional visualization systems and techniques were designed in an era of data scarcity, but in today’s Big Data
world of an incredible abundance of information, understanding is the key commodity. Older approaches focused
on rendering individual data points faithfully, which was appropriate for the small data sets previously available.
However, when inappropriately applied to large data sets, these techniques suffer from systematic problems
like overplotting, oversaturation, undersaturation, undersampling and underutilized dynamic range, all of which
obscure the true properties of large data sets and lead to incorrect data-driven decisions. Fortunately, Anaconda
is here to help with datashading technology that is designed to solve these problems head-on.

In this paper, you’ll learn why Open Data Science is the foundation for modernizing data analytics, and:

• The complexity of visualizing large amounts of data
• How datashading helps tame this complexity
• The power of adding interactivity to your visualization

Visualization in the Era of Big Data: Getting It Right Is Not Always Easy

Some of the problems related to the abundance of data can be overcome simply by using more or better hardware. For instance, larger data sets can be processed in a given amount of time by increasing the amount of computer memory, CPU cores or network bandwidth. But other problems are much less tractable, such as what might be called the ‘points-per-pixel problem’—which is anything but trivially easy to solve and requires fundamentally different approaches.

The ‘points-per-pixel’ problem is having more data points than it is possible to represent as pixels on a computer monitor. If your data set has hundreds of millions or billions of data points—easily imaginable for Big Data—there are far more than can be displayed on a typical high-end 1920x1080 monitor with 2 million pixels, or even on a bleeding-edge 8K monitor, which can display only 33 million pixels. And yet data scientists must accurately convey, if not all the data, at least the shape or scope of the Big Data, despite these hard limitations.

Very small data sets do not have this problem. For a scatterplot with only ten or a hundred points, it is easy to display all points, and observers can instantly perceive an outlier off to the side of the data’s cluster. But as you increase the data set’s size or sampling density, you begin to experience difficulties. With as few as 500 data points, it is much more likely that there will be a large cluster of points that mostly overlap each other, known as ‘overplotting’, and obscure the structure of the data within the cluster. Also, as they grow, data sets can quickly approach the points-per-pixel problem, either overall or in specific dense clusters of data points.

Technical ‘solutions’ are frequently proposed to head off these issues, but too often these are misapplied. One example is downsampling, where the number of data points is algorithmically reduced, but which can result in missing important aspects of your data. Another approach is to make data points partially transparent, so that they add up rather than overplot. However, setting the amount of transparency correctly is difficult, error-prone and leaves unavoidable tradeoffs between visibility of isolated samples and overplotting of dense clusters. Neither approach properly addresses the key problem in visualization of large data sets: systematically and objectively displaying large amounts of data in a way that can be presented effectively to the human visual system.



Let’s take a deeper dive into five major ‘plotting pitfalls’ and how they are typically addressed, focusing on problems that are minor inconveniences with small data sets but very serious problems with larger ones:
1. Overplotting
2. Oversaturation
3. Undersampling
4. Undersaturation
5. Underutilized range

OVERPLOTTING. Let’s consider plotting some 2D data points that come from two separate categories, plotted as blue and red in A and B of Figure 1. When the two categories are overlaid, the appearance of the result can be very different, depending on which one is plotted first.

Plots C and D shown in the overplotting example are the same distribution of points, yet they give a very different impression of which category is more common, which can lead to incorrect decisions based on this data. Of course, both are equally common in this case.

The cause of this problem is simply occlusion. Occlusion of data by other data is called overplotting or overdrawing, and it occurs whenever a data point or curve is plotted on top of another data point or curve, obscuring it. Overplotting is a problem not just for scatterplots, but for curve plots, 3D surface plots, 3D bar graphs and any other plot type where data can be occluded. Overplotting is tricky to avoid, because it depends not only on the number of data points, but on how much they happen to overlap in a given data set, which is difficult to know before visualization. Even worse, the visualizations themselves can be highly misleading, as shown in C and D of Figure 1, so that even after visualization, it can be difficult to detect overplotting.

OVERSATURATION. You can reduce problems with overplotting by using transparency or opacity, via the alpha parameter provided to control opacity in most plotting programs. For example, if alpha is 0.1, full color saturation will be achieved only when 10 points overlap, which reduces the effects of plot ordering but can make it harder to see individual points.

In the example in Figure 2, C and D look very similar (as they should, since the distributions are identical), but there are still a few specific locations with oversaturation, a problem that will occur when more than 10 points overlap. The oversaturated points are located near the middle of the plot, but the only way to know whether they are there would be to plot both versions and compare, or to examine the pixel values to see if any have reached full saturation—a necessary, but not sufficient, condition for oversaturation. Locations where saturation has been reached have problems similar to overplotting, because only the last 10 points plotted will affect the final color, for an alpha of 0.1.

Figure 1. Overplotting (panels A–D)
Figure 2. Using Transparency to Avoid Overplotting (panels A–D)
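To see these two pitfalls for yourself, the following is a minimal illustrative sketch, not code from the paper, that reproduces overplotting with opaque points and the alpha-tuning tradeoff using an ordinary matplotlib scatterplot; the synthetic data and all parameter values here are arbitrary choices.

# Illustrative sketch, not from the paper: overplotting vs. alpha tuning
# for two overlapping categories in an ordinary matplotlib scatterplot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
blue = rng.normal(loc=(-1, 0), scale=1.0, size=(50_000, 2))
red  = rng.normal(loc=( 1, 0), scale=1.0, size=(50_000, 2))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Opaque points: whichever category is drawn last hides the other (overplotting).
ax1.scatter(*blue.T, s=3, color='blue')
ax1.scatter(*red.T,  s=3, color='red')
ax1.set_title('alpha=1: plot order dominates')

# Transparent points: order matters less, but the "right" alpha depends on
# how many points happen to overlap in this particular data set.
ax2.scatter(*blue.T, s=3, color='blue', alpha=0.1)
ax2.scatter(*red.T,  s=3, color='red',  alpha=0.1)
ax2.set_title('alpha=0.1: dense regions still saturate')

plt.show()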



Even worse, even if one has set the alpha value so as to approximately or usually avoid oversaturation, as in the previous plot, the correct value still depends on the data set. If there are more points overlapping in a particular region, a manually adjusted alpha setting that worked well for a previous data set will systematically misrepresent the new data set.

In the example in Figure 3, C and D again look qualitatively different, yet still represent the same distributions, just with more points. Since we are assuming that the goal of the visualization is to faithfully reveal the underlying distributions, having to tune visualization parameters manually based on the properties of the data set itself is a fundamental problem that wastes time and leads to errors in judgment that could be very costly.

To make it even more complicated, the correct alpha also depends on the dot size, because smaller dots have less overlap for the same data set. With smaller dots, as shown in Figure 4, C and D look more similar, as desired, but the color of the dots is now difficult to see in all cases, because the dots are too transparent for this size.

As you can see in Figure 4, it is very difficult to find settings for the dot size and alpha parameters that correctly reveal the data, even for relatively small and obvious data sets like these. With larger data sets of unknown content, it is often impossible to detect that such problems are occurring, leading to false conclusions based on inappropriately visualized data.

UNDERSAMPLING. With a single category instead of the multiple categories shown previously, oversaturation simply obscures spatial differences in density. For instance, 10, 20 and 2000 single-category points overlapping will all look the same visually, for alpha=0.1.

In Figure 5, on the next page, let’s first look at another example that has a sum of two normal distributions slightly offset from one another, but that no longer uses color to separate them into categories.

As shown in the examples in the previous sections, finding settings to avoid overplotting and oversaturation is difficult. The ‘small dots’ parameters used in A and B of the undersampling vs. overplotting example (size 0.1, full alpha) work fairly well for a sample of 600 points (A), but those parameters lead to serious overplotting issues for larger data sets, obscuring the shape and density of the distribution (B). Switching to 10-times-smaller dots with alpha 0.1 to allow overlap (‘tiny dots’) works well for the larger data set (D), but not at all for the 600-point data set (C). Clearly, not all of these settings are accurately conveying the underlying distribution, as they all appear quite different from one another, yet in each case they are plotting samples from the same distribution. Similar problems occur for the same size of data set but with greater or lesser levels of overlap between points, which varies with every new data set.

In any case, as data set size increases, at some point plotting a full scatterplot like any of these will become impractical with current plotting technology. At this point, people often simply subsample their data set, plotting 10,000 or perhaps 100,000 randomly selected data points. But, as Figure 5 panel A shows, the shape of an undersampled distribution can be very difficult or impossible to make out, leading to incorrect conclusions about the distribution. Such problems can occur even when taking very large numbers of samples and examining sparsely populated regions of the space, which will approximate panel A for some plot settings and panel C for others. The actual shape of the distribution is only visible if sufficient data points are available in that region and appropriate plot settings are used, as in D, but ensuring that both conditions are true is quite a difficult process of trial and error, making it very likely that important features of the data set will be missed.

Figure 3. Oversaturation Due to More Overlapping Points (panels A–D)
Figure 4. Reducing Oversaturation by Decreasing Dot Size (panels A–D)



Figure 5. Undersampling vs. Overplotting (panels A–D)
Figure 6. Binning into Heatmaps (panels A–C)

To avoid undersampling large data sets, researchers often use 2D histograms visualized as heatmaps, rather than scatterplots showing individual points. A heatmap has a fixed-size grid regardless of the data set size, so that it can make use of all the data. Heatmaps effectively approximate a probability density function over the specified space, with coarser heatmaps averaging out noise or irrelevant variations to reveal an underlying distribution, and finer heatmaps able to represent more details in the distribution, as long as the distribution is sufficiently and densely sampled.

In principle, the heatmap approach can entirely avoid the first three problems above:
1. Overplotting, since multiple data points sum arithmetically into the grid cell, without obscuring one another
2. Oversaturation, because the minimum and maximum counts observed can automatically be mapped to the two ends of a visible color range
3. Undersampling, since the resulting plot size is independent of the number of data points, allowing it to use an unbounded amount of incoming data

Let’s look at some heatmaps in Figure 6 with different numbers of bins for the same two-Gaussians distribution. As you can see, a too-coarse binning, like grid A, cannot represent this distribution faithfully, but with enough bins, like grid C, the heatmap will approximate a tiny-dot scatterplot like plot D in the undersampling example in Figure 5. For intermediate grid sizes like B, the heatmap can average out the effects of undersampling. Grid B is actually a more faithful representation of the distribution than C, given that we know this distribution is two offset 2D Gaussians, while C more faithfully represents the sampling—the individual points drawn from this distribution. Therefore, choosing a good binning grid size for a heatmap does take some expertise and knowledge of the goals of the visualization, and it is always useful to look at multiple binning-grid spacings for comparison. Still, the binning parameter is something meaningful at the data level (how coarse a view of the data is desired?) rather than just a plotting detail (what size and transparency should I use for the points?) that would otherwise need to be determined arbitrarily.

UNDERSATURATION. Heatmaps come with their own plotting pitfalls. One rarely appreciated issue common to both heatmaps and alpha-based scatterplots is undersaturation, where large numbers of data points can be missed entirely because they are spread over many different heatmap bins or many nearly transparent scatter points. To look at this problem, we can construct a data set combining multiple 2D Gaussians, each at a different location and with a different amount of spread (standard deviation):

LOCATION             (2,2)   (2,-2)   (-2,-2)   (-2,2)   (0,0)
STANDARD DEVIATION   0.01    0.1      0.5       1.0      2.0

Even though this is still a very simple data set, it has properties shared with many real-world data sets, namely that there are some areas of the space that will be very densely populated with points, while others are only sparsely populated. On the next page we’ll look at some scatterplots for this data in Figure 7.
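As a concrete illustration, here is a small sketch, hypothetical code rather than anything taken from the paper, that builds a five-Gaussian data set like the one in the table above and bins it into a fixed-size heatmap with NumPy; the grid size and plot ranges are arbitrary choices.

# Hypothetical sketch: construct the five-Gaussian data set described above
# and bin it into a 2D-histogram heatmap with a fixed number of bins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
specs = [((2, 2), 0.01), ((2, -2), 0.1), ((-2, -2), 0.5),
         ((-2, 2), 1.0), ((0, 0), 2.0)]               # (location, standard deviation)

parts = [rng.normal(loc=loc, scale=std, size=(10_000, 2)) for loc, std in specs]
df = pd.DataFrame(np.concatenate(parts), columns=['x', 'y'])   # 50,000 points total

# Fixed-size heatmap: counts per bin, independent of the number of data points.
counts, xedges, yedges = np.histogram2d(df['x'], df['y'],
                                        bins=100, range=[[-5, 5], [-5, 5]])
print(int(counts.max()))   # the densest bin holds thousands of points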

Which one of the plots in the undersaturation scatterplot figure shows the ‘real’ overall distribution that we know is there? None of them—at least not very well. In Figure 7, plot A, the cluster with the widest spread (standard deviation of 2.0) covers up everything else, completely obscuring the structure of this data set by overplotting. Plots B and C reveal the structure better, but they required hand-tuning, and neither one is particularly satisfactory.



In B, there are four clearly visible Gaussians, but all but the largest appear to have the same density of points per pixel, which we know is not the case from how the data set was constructed, plus the smallest is nearly invisible. In addition, each of the five Gaussians has the same number of data points (10,000), but the second largest looks like it has more than the ones with smaller spreads, and the narrowest one is likely to be overlooked altogether, which is the clearest example of oversaturation obscuring important features. Yet, if we try to combat the oversaturation by using transparency as in Figure 7, plot C, we now get a clear problem with undersaturation—the ‘very large spread’ Gaussian is now essentially invisible. Again, there are just as many data points in the widest-spread cluster as in each of the others, but we would never even know any of those points were there if we were only looking at C.

To put it in a real-world context, with plot settings like plot C, a large rural population spread over a wide region will entirely fail to show up on the visualization, compared to a densely populated area, and will entirely dominate the plot if using the plot settings in A, either of which would lead to a completely inappropriate decision if making a judgment about that real-world data. Similar problems occur for a heatmap view of the same data, as shown in Figure 8.

Here, the narrowly spread distributions lead to single pixels that have a very high count compared to the rest. If all the pixels’ counts are linearly ramped into the available color range, from zero to that high count value, then the wider-spread values are obscured, as in B, or entirely invisible, as in C.

To avoid undersaturation, you can add an offset to ensure that low-count, but nonzero, bins are mapped into a visible color, with the remaining intensity scale used to indicate differences in counts (Figure 9). Such a mapping entirely avoids undersaturation, since all pixels are either clearly zero, in the background color (white in this case), or a non-background color taken from the colormap. The widest-spread Gaussian is now clearly visible in all cases.

However, despite these plots avoiding overplotting, oversaturation, undersampling and undersaturation, the actual structure of this data is still not visible. In Figure 9, plot A, the problem is clearly too-coarse binning, but even B is somewhat too coarsely binned for this data, since the ‘very narrow spread’ and ‘narrow spread’ Gaussians show up identically, each mapping entirely into a single bin (the two black pixels). Plot C does not suffer from too-coarse binning, yet it still looks more like a plot of the ‘very large spread’ distribution alone, rather than a plot of these five distributions with different spreads, and it is thus still highly misleading, despite the correction for undersaturation.

UNDERUTILIZED RANGE. So, what is the problem in Figure 9, plot C? By construction, we’ve avoided the first four pitfalls: overplotting, oversaturation, undersampling and undersaturation. But the problem is now more subtle—differences in data point density are not visible between the five Gaussians, because all, or nearly all, pixels end up being mapped into either the bottom end of the visible range (light gray) or the top end (pure black, used only for the single pixel holding the ‘very narrow spread’ distribution). The rest of the visible colors in this gray colormap are unused, conveying no information to the viewer about the rich structure that we know this distribution contains. If the data were uniformly distributed over the range from minimum to maximum counts per pixel (0 to 10,000 in this case), then the plot would work well, but that’s not the case for this data set or for many real-world data sets.

So, let’s try transforming the data from its default linear representation of integer count values into something that reveals relative differences in count values by mapping them into visually distinct colors. A logarithmic transformation is one common choice, as shown on the next page in Figure 10.

Aha! We can now see the full structure of the data set, with all five Gaussians clearly visible in B and C and the relative spreads also clearly visible in C. However, we still have a problem. Unlike the solutions to the first four pitfalls, the choice of a logarithmic transformation to address the fifth problem was arbitrary and dependent on the specifics of this data set.
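To make the difference between these mappings concrete, here is an illustrative sketch, not from the paper, of three ways the bin counts from the earlier heatmap sketch could be mapped into a normalized gray range: a plain linear ramp, a linear ramp with an offset for nonzero bins, and a logarithmic transformation.

# Illustrative sketch, continuing the earlier heatmap example: three mappings
# from bin counts to a 0..1 gray scale, showing the tradeoffs discussed above.
import numpy as np

def linear_map(c):
    # Most bins end up near 0 when a few bins hold thousands of points
    # (the underutilized-range pitfall).
    return c / c.max()

def offset_map(c, floor=0.2):
    # Nonzero bins always get at least `floor`, so sparse regions stay visible
    # (avoids undersaturation); zero bins stay at 0 (background).
    out = np.zeros_like(c, dtype=float)
    nz = c > 0
    out[nz] = floor + (1 - floor) * c[nz] / c.max()
    return out

def log_map(c):
    # Compresses the huge range of counts, revealing relative densities.
    return np.log1p(c) / np.log1p(c.max())

gray_linear, gray_offset, gray_log = (f(counts) for f in (linear_map, offset_map, log_map))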

Figure 7. Undersaturation with Scatterplots (panels A–C)
Figure 8. Undersaturation with Heatmaps (panels A–C)
Figure 9. Avoiding Undersaturation Using an Offset (panels A–C)



The logarithmic transformation mainly works well because we happened to have used an approximately geometric progression of spread sizes when constructing the example. For large data sets with truly unknown structure, can we have a more principled approach to mapping the data set values into a visible range, one that will work across data sets?

Yes, if we think of the visualization problem in a different way. The underlying difficulty in plotting this data set, as for many real-world data sets, is that the values in each bin are numerically very different, ranging from 10,000 in the bin for the ‘very narrow spread’ Gaussian to 0 or 1 for single data points from the ‘very large spread’ Gaussian. Given the 256 gray levels available on a normal monitor, and the similarly limited human ability to detect differences in gray values, numerically mapping the data values into the visible range linearly is clearly not going to work well. But, given that we are already backing off from a direct numerical mapping in the above approaches for correcting undersaturation and for doing log transformations, what if we entirely abandon the numerical mapping approach, using the numbers only to form an ordering of the data values and plotting that rather than the magnitudes? Such an approach would be a rank-order plot, preserving relative order while discarding specific magnitudes. For 100 gray values, you can think of it as a percentile-based plot, with the lowest 1% of the data values mapping to the first visible gray value, the next 1% mapping to the next visible gray value, and so on up to the top 1% of the data values mapping to the highest gray value, 255 (black, in this case). The actual data values would be ignored in such plots, but their relative magnitudes would still determine how they map onto colors on the screen, preserving the structure of the distribution rather than the numerical values.

We can approximate such a rank-order or percentile encoding using the histogram equalization function from an image processing package, which makes sure that each gray level is used for about the same number of pixels in the plot, as shown in Figure 11.

Figure 11, plot C, the rank-order plotting example, now reveals the full structure that we know was in this data set, i.e. five Gaussians with different spreads, with no arbitrary parameter choices. The differences in counts between pixels are now very clearly visible, across the full and very wide range of counts in the original data.

Of course, we’ve lost the actual counts themselves, so we can no longer tell just how many data points are in the ‘very narrow spread’ pixel in this case. So, plot C is accurately conveying the structure, but additional information would need to be provided to show the actual counts, by adding a color key mapping from the visible gray values into the actual counts and/or by providing hover-value information. Interactive approaches also work well at this point, with the initial view showing where to investigate, at which point the numerical values can be examined in each area of interest; actually showing the full range of counts in a single plot will not work well, but in each local area it can be useful.

At this point, one could also consider explicitly highlighting hotspots so that they cannot be overlooked. In plots B and C in Figure 11, the two highest-density pixels are mapped to the two darkest pixel colors, and with many monitor settings chosen to make black text ‘look better,’ those values may not be clearly distinguishable from each other or from nearby gray values. Once the data is reliably and automatically mapped into a good range for display, making explicit adjustments—based on wanting to make hotspots particularly clear—can be done in a principled way that does not depend on the actual data distribution, for example by simply mapping the top few pixel values to a different color, highlighting the top few percentile ranges of the data.

If we step back a bit, we can see that by starting with plots of specific data points, we showed how typical visualization techniques will systematically misrepresent the distribution of those points. With Big Data, these problems are incredibly serious for businesses, because the visualization is often the only way that we can understand the properties of the data set, leading to potentially costly missed opportunities and incorrect decisions based on the data.

Figure 10. Dynamic Range with a Logarithmic Transformation (panels A–C)
Figure 11. Parameter-Free Visualization Using Rank Order Plotting (panels A–C)
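The following is a small NumPy sketch, illustrative rather than the paper’s own code, of such a rank-order or histogram-equalized mapping applied to the bin counts from the earlier heatmap sketch; datashader’s ‘eq_hist’ shading option provides a more carefully engineered version of this idea.

# Illustrative sketch: approximate histogram equalization (rank-order mapping)
# of heatmap bin counts into 256 gray levels; `counts` comes from the earlier
# np.histogram2d example. Ties are broken arbitrarily, which is fine for display.
import numpy as np

def eq_hist(counts, nlevels=256):
    """Map nonzero bins to gray levels so each level covers about the same number of bins."""
    out = np.zeros(counts.shape, dtype=np.uint8)     # 0 is reserved for empty bins
    nz = counts > 0
    ranks = counts[nz].argsort().argsort()           # rank order of the nonzero bins
    out[nz] = 1 + (ranks * (nlevels - 2)) // max(len(ranks) - 1, 1)
    return out                                        # values in 1..255 for nonzero bins

gray = eq_hist(counts)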



Visualizing Big Data Effectively

Fortunately, there is now an approach to Big Data visualization that provides an optimized interaction between the data and the human visual system, automatically avoiding all of the above plotting pitfalls. The approach, in which raw data is ultimately rendered into an image, is a three-part operation:
1. Synthesize
2. Rasterize
3. Transfer

SYNTHESIZE. The first step is to project or synthesize your data onto a scene. One starts with free-form data and then needs to make decisions as to how best to initially lay out that data on the monitor. An example might be a basic graph of price vs. sales for a product. In the past, this would be the final step of the visualization process, leading to any of the serious problems of visualizing Big Data that we discussed above. In our approach, however, this is only the first step; it is about making a decision about what to visualize, which will then be rendered automatically onto the screen in the subsequent steps.

RASTERIZE. The second step is rasterization, which can be thought of as replotting all of the data on a grid, so that each square of that grid serves as a finite subsection of the data space; within each square of this grid you then count the data points that fall there, or do other operations like averaging or measuring standard deviation. One square may contain no data, another square may contain two points, and others may contain many points. This step results in an ‘aggregate’ view of the data, binned into a fixed-size data structure.

TRANSFER. The final step is transfer, which really exploits how the human visual system works. In this step, the aggregates are transformed into squares of color, producing an image. The colormapping will represent the data that lies within that subsection of the grid and ought to be chosen carefully based on what we know about how our brains process colors. This step is easy to grasp intuitively, but doing it well requires introducing some sophisticated statistical operations that drive the most appropriate transformation of the data. Luckily, these steps can be automated so that they do not depend on human judgment about unknown data sets.

Despite the automation, it is important to emphasize that the data scientist should retain fine-grained control at each step in these three processes. If the plots are to be interpreted, there must be no ‘black boxes’ for any of the transformations—it should be clear both what processing is being done and how to change that processing to highlight specific aspects of the data that are needed for a decision.

By contrast, traditional plotting was, at best, a two-step black-box process, going from raw data to an image of a plot, with at most some highly indirect control available to the analyst, such as selecting transparency, dot size and a color scheme. Because those choices are not directly expressible in terms of the data set itself, they can only reveal the true picture of the underlying data after a process of manual adjustment that requires significant domain expertise and time for parameter adjustment for every plot.

Our solution provides several key advantages. Statistical transformations of data are now a first-class feature of the visualization—the data is processed according to a fully specified, rigorous criterion, not subject to human judgment. Algorithmic processing of intermediate stages in the visualization pipeline is used both to reduce time-consuming manual interventions and to reduce the likelihood of covering up data accidentally. In traditional approaches, these steps are done by trial and error; our approach automates them and also makes those automation parameters easily accessible for final tweaking. Rapid iteration of visual styles and configurations, as well as interactive selections and filtering, encourages open-minded data exploration, rather than the older approach of having to repeatedly adjust the plot before it will show any useful data at all.

All of these advantages are open for readjustment in an iterative process of tuning one’s models and deciding how best to display the data, in which the data scientist can control how data is best transformed and visualized at each step, starting from a first plot that already faithfully reveals the overall data set.

Datashader for Big Data Visualization

Anaconda provides all of the functionality described above with its open source and freely available datashader library. The datashader library can be used in conjunction with Bokeh, another free, open source library, to create richly interactive browser-based visualizations.

Figure 12. Stages of a Datashader Pipeline. Computational steps (top): Projection, Aggregation, Transformation, Colormapping, Embedding. Data structures (bottom): Data, Scene, Aggregate(s), Image, Plot.
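As a preview of what these stages look like in code, here is a minimal sketch using the current datashader API and the df built in the earlier five-Gaussian sketch; the image size, data ranges and colors are arbitrary choices, not the paper’s exact code.

# Minimal sketch of the datashader pipeline: projection -> aggregation ->
# colormapping. Assumes `df` is a pandas DataFrame with numeric 'x' and 'y'
# columns, e.g. the one built in the earlier five-Gaussian sketch.
import datashader as ds
import datashader.transfer_functions as tf

canvas = ds.Canvas(plot_width=600, plot_height=600,      # projection: output grid and data range
                   x_range=(-5, 5), y_range=(-5, 5))
agg = canvas.points(df, 'x', 'y', agg=ds.count())        # aggregation: count per grid cell
img = tf.shade(agg, cmap=['lightblue', 'darkblue'],      # colormapping: counts -> colors
               how='eq_hist')                            # rank-order mapping of counts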



Figure 13. Datashader Rendering of the Five-Gaussians Example

The datashader library overcomes all of the pitfalls above, both by automatically calculating appropriate parameters based on the data itself and by allowing interactive visualizations of truly large data sets with millions or billions of data points, so that their structure can be revealed. The above techniques can be applied ‘by hand’, but datashader lets you do this easily, by providing a high-performance and flexible modular visualization pipeline, making it simple to do automatic processing, such as auto-ranging and histogram equalization, to faithfully reveal the properties of the data.

The datashader library has been designed to expose the stages involved in generating a visualization. These stages can then be automated, configured, customized or replaced wherever appropriate for a data analysis task. The five main stages in a datashader pipeline are an elaboration of the three main stages above, after allowing for user control in between processing steps, as shown in Figure 12.

Figure 12 illustrates a datashader pipeline with computational steps listed across the top of the diagram, while the data structures, or objects, are listed along the bottom. Breaking up the computation into this set of stages is what gives datashader its power, because only the first couple of stages require the full data set, while the remaining stages use a fixed-size data structure regardless of the input data set, making it practical to work with even extremely large data sets.

To demonstrate, we’ll construct a synthetic data set made of the same five overlapping 2D normal distributions introduced in the undersaturation example shown previously in Figure 7:

LOCATION             (2,2)   (2,-2)   (-2,-2)   (-2,2)   (0,0)
STANDARD DEVIATION   0.01    0.1      0.5       1.0      2.0

Centered on each location shown are 10,000 randomly chosen points, drawn from a distribution with the indicated standard deviation. Datashader is able to faithfully reveal the overall shape of this 50,000-point distribution, without needing to adjust or tune any parameters, in only 15 milliseconds.

In Figure 13, you can see each of the five underlying distributions clearly; they have been manually labeled in the version on the right, for clarity.

The stages involved in these computations will be laid out one by one below, showing both how the steps are automated and how they can be customized by the user when desired.

PROJECTION. Datashader is designed to render data sets projected onto a 2D rectangular grid, eventually generating an image where each pixel corresponds to one cell in that grid. The projection stage includes several steps:
1. Select which variable you want to have on the x axis and which one for the y axis. If those variables are not already columns in your dataframe—for example, if you want to do a coordinate transformation—you’ll first need to create suitable columns mapping directly to x and y for use in the next step.
2. Choose a glyph, which determines how an incoming data point maps onto the chosen rectangular grid. There are three glyphs currently provided with the library:
   a. A Point glyph that maps the data point into the single closest grid cell
   b. A Line glyph that maps that point into every grid cell falling between this point and the next
   c. A Raster glyph that treats each point as a square in a regular grid covering a continuous space
3. Although new glyph types are somewhat difficult to create and rarely needed, you can design your own if desired, e.g. to shade a point onto a set of bins according to some kernel function or some uncertainty value.
4. Decide what size final image you want in pixels and what range of the data to plot, in whatever units x and y are stored, and create a canvas object to hold this information.

At this stage, no computation has actually been done—the glyph and canvas objects are purely declarative objects that record your preferences, which won’t actually be applied until the next stage. Thus, the projection stage is primarily conceptual: how do you want your data to be mapped when it is aggregated? The scene object suggested above is not actually constructed in memory, but conceptually corresponds to what other plotting packages would render directly to the screen at this stage.
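As a brief sketch of how these declarative glyph choices appear in code, reusing the canvas and df from the pipeline sketch above; the raster input placeholder stands in for an xarray DataArray of already-gridded data.

# Hypothetical sketch of the three glyph types on the same canvas object.
# `canvas` and `df` come from the earlier pipeline sketch; `regular_grid`
# is a placeholder for an xarray DataArray of gridded values.
import datashader as ds

point_agg  = canvas.points(df, 'x', 'y', agg=ds.count())   # Point glyph: nearest grid cell
line_agg   = canvas.line(df, 'x', 'y', agg=ds.count())     # Line glyph: cells between successive points
raster_agg = canvas.raster(regular_grid)                    # Raster glyph: resample gridded data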



AGGREGATION. Once a conceptual scene object has been specified, it can then be used to guide aggregating the data into a fixed-size grid. All of the aggregation options currently supported are implemented as incremental reduction operators. Using incremental operations means that we can efficiently process data sets in a single pass, which is particularly important for data sets larger than the memory available. Given an aggregate bin to update, typically corresponding to one eventual pixel, and a new data point, the reduction operator updates the state of the bin in some way. Data points are normally processed in batches for efficiency, but it is simplest to think about the operator as being applied per data point, and the mathematical result should be the same.

Figure 14 shows four examples using different aggregation functions.

TRANSFORMATION. Now that the data has been projected and aggregated into a gridded data structure, it can be processed in any way you like before converting it to an image, which will be described in the following section. At this stage, the data is still stored as bin data, not pixels, which makes a wide variety of operations and transformations simple to express. For instance, in Figure 15, instead of plotting all the data, we can easily find hotspots by plotting only those bins in the 99th percentile by count, or apply any NumPy ufunc to the bin values, whether or not it is meaningful.

COLORMAPPING. As you can see in Figures 13-15, the typical way to visualize an aggregate array is to map each array bin into a color for a corresponding pixel in an image. The examples map a scalar aggregate bin value into an RGB (color) triple and an alpha (opacity) value. By default, the colors are chosen from the colormap [‘lightblue’, ’darkblue’] (#ADD8E6 to #00008B), with intermediate colors chosen as a linear interpolation independently for the red, green and blue color channels (AD to 00 for the red channel, in this case). The alpha (opacity) value is set to 0 for empty bins and 1 for non-empty bins, allowing the page background to show through wherever there is no data. You can supply any colormap you like, as shown in Figure 16, including Bokeh palettes, matplotlib colormaps or a list of colors using the color names from ds.colors, integer triples or hexadecimal strings.
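The following is a brief hedged sketch of these aggregation and transformation options, reusing the canvas and df from the earlier sketches; the 'val' column is synthesized here purely for illustration, and the percentile filter mirrors the expression in the Figure 15 caption.

# Illustrative sketch: different reduction operators, then a bin-level
# transformation before colormapping. `canvas` and `df` come from earlier sketches.
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf

df = df.assign(val=np.hypot(df['x'], df['y']))                 # synthetic value column, for illustration

count_agg = canvas.points(df, 'x', 'y', agg=ds.count())        # how many points fall in each bin
mean_agg  = canvas.points(df, 'x', 'y', agg=ds.mean('val'))    # per-bin mean of a value column

# Transformations act on the aggregate (bins), not on the raw data:
hotspots = count_agg.where(count_agg >= np.percentile(count_agg, 99))   # 99th-percentile bins only
img = tf.shade(hotspots, cmap=['lightblue', 'darkblue'])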

Figure 14. Visualization of Various Aggregations Using Datashader: (A) count aggregation, (B) any aggregation, (C) mean y aggregation, (D) mean val aggregation
Figure 15. Single-Line Operations Using xarray/NumPy Functions: (A) agg.where(agg >= np.percentile(agg, 99)), (B) numpy.sin(agg)



EMBEDDING. In Figure 16, the stages all eventually lead to a raster image, displayed here as PNG images. However, these bare images do not show the data ranges, axis labels and so on, nor do they support the dynamic zooming and panning necessary to understand data sets across scales. To add these features, the datashader output can be embedded into plots in a variety of plotting programs, such as an interactive Bokeh plot, as illustrated in Figure 17.

On a live server, you can zoom and pan to explore each of the different regions of this data set. For instance, if you zoom in far enough on the blue dot, you’ll see that it does indeed include 10,000 points; they are just so close together that they show up as only a single tiny blue spot in the above plot. Such exploration is crucial for understanding data sets with rich structure across different scales, as in most real-world data.

To illustrate the power of visualizing rich structures at very large scale, we will take a look at two data-rich examples on the following pages.

Figure 16. Examples of Colormapping Using Datashader
Figure 17. Datashader Embedded in Interactive Bokeh Visualizations
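One hedged sketch of such embedding uses HoloViews, a companion open source library, with its Bokeh backend; this reflects a current workflow and may differ in detail from the Bokeh-embedding utilities available when this paper was written.

# Hedged sketch: embed a datashaded view in an interactive Bokeh plot via
# HoloViews, so axes, zooming and panning come for free and the image is
# re-aggregated at the new resolution on every zoom or pan.
import holoviews as hv
from holoviews.operation.datashader import datashade

hv.extension('bokeh')
points = hv.Points(df, kdims=['x', 'y'])    # df from the earlier five-Gaussian sketch
interactive_plot = datashade(points)        # display in a notebook, or save with hv.save()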



EXAMPLE 1: 2010 CENSUS DATA. The 2010 Census collected a variety of demographic information for all of the more than 300 million people in the United States. Here, we’ll focus on the subset of the data selected by the Cooper Center, who produced a map of the population density and the racial/ethnic makeup of the USA (http://www.coopercenter.org/demographics/Racial-Dot-Map). Each dot in this map corresponds to a specific person counted in the census, located approximately at their residence. To protect privacy, the precise locations have been randomized at the block level, so that the racial category can only be determined to within a rough geographic precision. In this map, we show the results of running novel analyses focusing on various aspects of the data, rendered dynamically as requested using the datashader library, rather than precomputed and pre-rendered as in the above URL link.

For instance, we can look at the population density by plotting the x,y locations of each person, using all the default plotting values, apart from selecting a more colorful colormap, in Figure 18.

Patterns relating to geography (like mountain ranges), infrastructure (like roads in the Midwest) and history (such as high population density along the East coast) are all clearly visible, and additional structures are interactively visible when zooming into any local region.

For this data set, we can add additional information by colorizing each pixel by the racial/ethnic category reported in the census data for that person, using a key of:
• Purple: Hispanic/Latino
• Cyan: Caucasian/White
• Green: African American/Black
• Red: Asian/Pacific Islander
• Yellow: Other, including Native American

Datashader will then merge all the categories present in each pixel to show the average racial/ethnic makeup of that pixel, showing clear levels of segregation at the national level, again using only the default parameter settings with no custom tuning or adjustment, as shown in Figure 19.

Here, “segregation” means only that persons of different races or ethnicities are grouped differently geographically, which could have a very wide variety of underlying historical, social or political causes.

Even greater levels of segregation are visible when zooming into any major population center, such as those shown in Figure 20. In these examples, we can see that Chicago’s and Manhattan’s historic Chinatown neighborhoods are clearly visible (colored in red), and other neighborhoods are very clearly segregated by race/ethnicity. Datashader supports interactive zooming all the way in to see individual data points, so that the amount of segregation can be seen very clearly at a local level, such as in Chicago’s Chinatown and nearby neighborhoods. Here, datashader has been told to automatically increase the size of each point when zooming in so far that data points become sparse, making individual points more visible.
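A hedged sketch of this kind of categorical aggregation follows; the DataFrame, its column names and the category labels are illustrative placeholders rather than the actual census preprocessing used for these figures.

# Hypothetical sketch of per-category aggregation for a census-style data set.
# Assumes census_df has Web Mercator 'easting'/'northing' columns and a
# pandas-categorical 'race' column whose labels match this color key.
import datashader as ds
import datashader.transfer_functions as tf

color_key = {'hispanic': 'purple', 'white': 'cyan', 'black': 'green',
             'asian': 'red', 'other': 'yellow'}

canvas = ds.Canvas(plot_width=900, plot_height=525)
agg = canvas.points(census_df, 'easting', 'northing',
                    agg=ds.count_cat('race'))     # one count per category per pixel
img = tf.shade(agg, color_key=color_key)          # blend category colors within each pixel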

Figure 18. Visualizing US Population Density with Datashader
Figure 19. Visualizing US Population by Race with Datashader
Figure 20. Race & Ethnicity with Datashader: zooming in to view race/ethnicity data in (A) Chicago, (B) NYC, (C) Los Angeles and (D) Chicago



EXAMPLE 2: NYC TAXI DATA SET. For this example, we’ll use part of the well-studied NYC taxi trip database, with the locations of all New York City taxicab pickups and dropoffs from January 2015. The data set contains 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts and times of day. First, let’s look at a scatterplot of the dropoff locations as it would be rendered by subsampling with Bokeh, in Figure 21.

Here, the location of Manhattan can be seen clearly, as can the rectangular Central Park area with few dropoffs, but there are serious overplotting issues that obscure any more detailed structure.

With the default settings of datashader, apart from the colormap, all of the data can be shown with no subsampling required, revealing much richer structure. In Figure 22, the entire street grid of the New York City area is now clearly visible, with increasing levels of detail available by zooming in to particular regions, without needing any specially tuned or adjusted parameters.

By analogy to the US census race data, you can also treat each hour of the day as a category and color each hour separately, revealing additional temporal patterns using the color key of:
• Red: 12 a.m. (midnight)
• Yellow: 4 a.m.
• Green: 8 a.m.
• Cyan: 12 p.m. (noon)
• Blue: 4 p.m.
• Purple: 8 p.m.

In Figure 23, there are definitely different regions of the city where pickups happen at specific times of day, with rich structure that can be revealed by zooming in to see local patterns and relate them to the underlying geographical map, as shown in Figure 24.
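By analogy with the categorical census sketch earlier, a hedged sketch of coloring pickups by hour of day might look like this; the DataFrame and its column names are hypothetical placeholders.

# Hypothetical sketch: treat pickup hour as a category and shade each hour with
# its own color. 'pickup_time' must be a datetime column; names are placeholders.
import datashader as ds
import datashader.transfer_functions as tf

taxi_df['hour'] = taxi_df['pickup_time'].dt.hour.astype('category')

canvas = ds.Canvas(plot_width=900, plot_height=600)
agg = canvas.points(taxi_df, 'pickup_x', 'pickup_y', agg=ds.count_cat('hour'))
img = tf.shade(agg)     # supply a 24-color key via color_key= to match the key above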

Figure 21. Plotting NYC Taxi Dropoffs with Bokeh
Figure 22. Plotting NYC Taxi Dropoffs with Datashader
Figure 23. NYC Taxi Pickup Times
Figure 24. Taxi Pickup Times Zoomed with Overlay



OPERATIONS IN VISUALIZATION. Once the data is in datashader, it becomes very simple to perform even quite sophisticated computations on the visualization, not just on the original data. For instance, we can easily plot all the locations in NYC where there are more pickups than dropoffs in shades of red, and all locations where there are more dropoffs than pickups in shades of blue, in Figure 25.

Figure 25. Visualizing Drop-Off Locations: dropoffs (blue) vs. pickups (red)

Plotted in this way, it is clear that pickups are much more likely along the main arteries—presumably where a taxi can be hailed successfully, while dropoffs are more likely along side streets. LaGuardia Airport (circled) also shows clearly segregated pickup and dropoff areas, with pickups being more widespread, presumably because those are on a lower level and thus have lower GPS accuracy due to occlusion of the satellites.

With datashader, building a plot like this is very simple, once the data has been aggregated. An aggregate is an xarray data structure (see xarray.pydata.org) and, if we create an aggregate named drops that contains the dropoff locations and one named picks that contains the pickup locations, then drops.where(drops>picks) will be a new aggregate holding all the areas with more dropoffs, and picks.where(picks>drops) will hold all those with more pickups. These can then be merged to make the plot above, in one line of datashader code. Making a plot like this in another plotting package would essentially require replicating the aggregation step of datashader, which would require far more code.
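A hedged sketch of that comparison is shown below, assuming pickup and dropoff coordinate columns in the taxi DataFrame; the column names and colormaps are placeholders.

# Illustrative sketch: where are there more dropoffs than pickups, and vice versa?
# Column names ('dropoff_x', 'dropoff_y', 'pickup_x', 'pickup_y') are placeholders.
import datashader as ds
import datashader.transfer_functions as tf

canvas = ds.Canvas(plot_width=900, plot_height=600)
drops = canvas.points(taxi_df, 'dropoff_x', 'dropoff_y', agg=ds.count())
picks = canvas.points(taxi_df, 'pickup_x',  'pickup_y',  agg=ds.count())

more_drops = tf.shade(drops.where(drops > picks), cmap=['lightblue', 'blue'])
more_picks = tf.shade(picks.where(picks > drops), cmap=['mistyrose', 'red'])
img = tf.stack(more_drops, more_picks)      # overlay the two shaded images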
Similarly, referring back to the US census data, it only takes one line of datashader code to filter the race/ethnicity data to show only those pixels containing at least one person of every category, as in Figure 26, plot A. The color then indicates the predominant race/ethnicity, but only for those areas—mainly major metropolitan areas—with all races and ethnicities included. Another single line of code will select only those areas where the number of African Americans/Blacks is larger than the number of Caucasians/Whites, as shown in Figure 26, plot B. Here, the predominantly African American/Black neighborhoods of major cities have been selected, along with many rural areas in the Southeast and a few largely Hispanic neighborhoods on the West Coast that nonetheless have more Blacks than Whites.
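A hedged sketch of those one-line filters, assuming the categorical aggregate agg and color_key from the census sketch above, whose extra dimension is named after the hypothetical 'race' column:

# Illustrative sketches of pixel-level (aggregate-level) filtering; the labels
# and the 'race' dimension name follow the hypothetical census sketch above.
import datashader.transfer_functions as tf

# (A) Keep only pixels containing at least one person of every category.
all_races = agg.where((agg > 0).all(dim='race'))

# (B) Keep only pixels where one category outnumbers another.
b_over_w = agg.where(agg.sel(race='black') > agg.sel(race='white'))

img_a = tf.shade(all_races, color_key=color_key)
img_b = tf.shade(b_over_w,  color_key=color_key)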

Alternatively, we can simply highlight the top 1% of the pixels by population density, in this case by using a color range with 100 shades of gray and then changing the top one to red, as in Figure 26, plot C.

Nearly any such query or operation that can be expressed at the level of pixels (locations) can be expressed similarly simply, providing a powerful counterpart to queries that are easy to perform at the raw data level or to filters based on criteria already provided as columns in the data set.

Figure 26. Filtering US Census Data: (A) US census data, only including pixels with every race/ethnicity included; (B) US census data, only including pixels where African Americans/Blacks outnumber Caucasians/Whites; (C) US population density, with the 1% most dense pixels colored in red



OTHER DATA TYPES. The previous examples focus on scatterplots, but datashader also supports line plots, trajectories and raster plots.

Line plots behave similarly to datashader scatterplots, avoiding the very serious overplotting and occlusion effects that happen for plots of multiple overlaid time-series curves, by ensuring that overlapping lines are combined in a principled way, as shown in Figure 27. With datashader, time series data with millions or billions of points can be plotted easily, with no downsampling required, allowing isolated anomalies to be detected easily and making it simple to zoom in to see lower-level substructure.

Trajectory plots (ordered GPS data coordinates) can similarly use all the data available, even for millions or billions of points, without downsampling and with no parameter tuning, revealing substructure at every level of detail, as in Figure 28. In Figure 28, using one million points, there is an overall synthetic random-walk trajectory, but a cyclic ‘wobble’ can be seen when zooming in partially, and small local noisy values can be seen when zooming in fully. These patterns could be very important if, for example, summing up total path length, and they are easily discoverable interactively with datashader, because the full data set is available, with no downsampling required.
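A hedged sketch of datashading a long time series with the Line glyph follows; the DataFrame and its 'time' and 'value' columns are hypothetical placeholders.

# Illustrative sketch: rasterize a very long curve with the Line glyph, so that
# overlapping line segments combine by count rather than overplotting.
# 'time' should be numeric (e.g. seconds, or a datetime converted to int64).
import datashader as ds
import datashader.transfer_functions as tf

canvas = ds.Canvas(plot_width=900, plot_height=300)
agg = canvas.line(ts_df, 'time', 'value', agg=ds.count())   # counts per pixel along the curve
img = tf.shade(agg, how='eq_hist')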

Figure 27. Multiple Overlapping Time Series Curves
Figure 28. Zooming in on the Data (zoom levels 0, 1 and 2)



Summary
In this paper, we have shown some of the major challenges in presenting Big Data visualizations, the failures of traditional approaches to overcome these challenges and how a new approach surmounts them. This new approach is a three-step process that optimizes the display of the data to fit how the human visual system works, employs statistical sophistication to ensure that data is transformed and scaled appropriately, encourages exploration of data with ease of iteration by providing defaults that reveal the data automatically, and allows full customization that lets data scientists adjust every step of the process between data and visualization.

We have also introduced the datashader library available with Anaconda, which supports all of this functionality. Datashader uses Python code to build visualizations and powers the plotting capabilities of Anaconda Mosaic, which explores, visualizes and transforms heterogeneous data and lets you make datashader plots out of the box, without the need for custom coding.

The serious limitations of traditional approaches to visualizing Big Data are no longer an issue. The datashader library is now available to usher in a new era of seeing the truth in your data, to help you make smart, data-driven decisions.

About Continuum Analytics

Continuum Analytics’ Anaconda is the leading open data science platform powered by Python. We put superpowers into the hands of people who are changing the world. Anaconda is trusted by leading businesses worldwide and across industries – financial services, government, health and life sciences, technology, retail & CPG, oil & gas – to solve the world’s most challenging problems. Anaconda helps data science teams discover, analyze, and collaborate by connecting their curiosity and experience with data. With Anaconda, teams manage open data science environments and harness the power of the latest open source analytic and technology innovations. Visit www.continuum.io.
