
Distributed Outlier Detection using Compressive Sensing

Ying Yan1, Jiaxing Zhang1, Bojun Huang1, Xuzhan Sun2, Jiaqi Mu3, Zheng Zhang4, Thomas Moscibroda1

1 Microsoft Research, 2 Peking University, 3 UIUC, 4 NYU Shanghai
1 {ying.yan, jiaxz, bojhuang, moscitho}@microsoft.com, 2 sunxuzhan@pku.edu.cn, 3 jiaqimu2@illinois.edu, 4 zz@nyu.edu

ABSTRACT

Computing outliers and related statistical aggregation functions from large-scale big data sources is a critical operation in many cloud computing scenarios, e.g. service quality assurance, fraud detection, or novelty discovery. Such problems commonly have to be solved in a distributed environment where each node only has a local slice of the entirety of the data. To process a query on the global data, each node must transmit its local slice of data, or an aggregated subset thereof, to a global aggregator node, which can then compute the desired statistical aggregation function. In this context, reducing the total communication cost is often critical to the overall efficiency.

In this paper, we show both theoretically and empirically that these communication costs can be significantly reduced for common distributed computing problems if we take advantage of the fact that production-level big data usually exhibits a form of sparse structure. Specifically, we devise a new aggregation paradigm for outlier detection and related queries. The paradigm leverages compressive sensing for data sketching in combination with outlier detection techniques. We further propose an algorithm that works even for non-sparse data that concentrates around an unknown value. In both cases, we show that the communication cost is reduced to the logarithm of the global data size. We incorporate our approach into Hadoop and evaluate it on real web-scale production data (distributed click-data logs). Our approach reduces data-shuffling IO by up to 99%, and end-to-end job duration by up to 40%, on many actual production queries.

1. INTRODUCTION

Large-scale data-intensive computation has become an indispensable part of industrial cloud computing, and building appropriate system support and algorithms for efficiently analysing massive amounts of data has become a key focus, both in industry and in the research community. For example, data volume within Microsoft almost doubles every year, and there are thousands of business-critical data analytics jobs running on hundreds of thousands of machines every day.

While each of these jobs is different, there are nevertheless a few common characteristics. First, instead of requiring the raw data records, many applications typically issue queries against aggregated values (e.g. sum...group by...). For example, our investigation in one of our own production clusters reveals that 90% of the 2,000 regular analytics jobs in this cluster issue queries against such aggregated values. Secondly, data analytics jobs over big data are often processed in a shared-nothing distributed system that scales to thousands of machines. In other words, the entirety of the data is distributed across a large set of nodes, each of which contains a slice of the data, and each node needs to transmit the necessary information about its data to some aggregator node, which then computes the desired function.

In this paper, we focus on a particularly important problem that falls within the above categorization: the distributed outlier problem. We explain the problem using a concrete scenario that arises in Microsoft Bing's web search service quality analysis engine. Here, to measure users' satisfaction with search query results, each search query is assigned either a positive (Success Click) or negative (Quick-Back Click) score, depending on the user's reaction to the shown search results. These scores are logged in search event logs at data centers across the globe, and stored on thousands of remote and geo-distributed machines. In order to determine the quality of a particular search query result, the task is now to aggregate these scores from all the various distributed search event logs. In a typical implementation of this aggregation (Figure 1), each node first locally (and partially) aggregates the scores (i.e., the values), grouping them by market and vertical segments (i.e., the keys). Then, all the nodes transmit their key-value pairs to a central node for final aggregation. Figure 1(a) shows a typical global aggregated result from the production logs. It can be seen that the result exhibits a sparse-like structure: while a majority of the values are concentrated around the mode b (1800 in the example), there is a small fraction of outliers (marked with circles) whose values diverge greatly from b. Data from different individual data centers may not be sparse when considered independently. For obvious reasons, finding such outliers across different markets and vertical segments is of critical importance to the analysis of our company's web search service quality.

This work was done when the authors were doing their internships at Microsoft Research.
This work was done when the author was working at Microsoft Research.

(a) Motivating Example   (b) Global Top-k vs. Outlier (k = 3)

Figure 1: Distributed outlier detection in web search service quality analysis

There are three challenges that make it difficult to find an efficient solution to the above problem. 1) We have found that the local outliers and mode are often very different from the global ones. Without transmitting all the data, the true global outliers can only be approximated, and finding a high-quality result is not guaranteed. As shown in Figure 1(a), the key k5 appears normal in the remote data centers; after aggregation, however, it is an obvious outlier compared to the other keys. 2) As new data arrives (recall that terabytes of new click-log data are generated every 10 minutes), the global outliers and mode naturally change over time. Any solution that cannot support incremental updates is therefore fundamentally unsuited. 3) The number of distributed data centers considered in the aggregation may also change. For example, when a new data center joins the network, or the number of data centers considered in the analysis changes, the global outliers and mode may change accordingly. We seek a solution that supports incremental addition and removal of the data centers involved in the aggregation.

Formally, the above problem can be captured as a k-outlier detection problem, which seeks the k values furthest away from the mode b.1 In the context of a large-scale production environment where data is located on many geographically distributed nodes, each node i holds a data slice that is an N-dimensional vector xi ∈ R^N (ordered according to a global key dictionary). In our concrete application, each entry in the vector xi corresponds to the log entry of one specific search query result stored at that particular node i. The global data is the aggregation of these slice vectors obtained by summing up the values of each entry, i.e., x = Σi xi. In our application, the global data is therefore the vector in which we sum up the log entries for each query stored across all the different nodes. Outlier detection seeks the top-k outliers of this global data vector x.

The commonly used baseline implementation for the distributed outlier problem is to first aggregate the slices MapReduce-style. The problem with this approach is that the outliers and mode of the slices on different nodes differ from each other, and also from the global outliers and mode after aggregation (see Figure 1). Therefore, in order to obtain a global result of reasonable accuracy, all data (or at least a large fraction of it) would have to be transmitted to the global aggregator node. This entails heavy communication across machines, accompanied by expensive disk I/O and computation (sorting, partitioning, and so on). In many practical scenarios, the aggregation communication cost dominates system performance. In one study on SCOPE clusters at Microsoft [42], for example, data-shuffling I/O caused by aggregation incurs 58.6% of cross-rack traffic. It is therefore not surprising that new techniques to reduce aggregation communication have been researched (data compression, partial aggregation techniques, optimizing the user-defined operators, etc. [23]).

In this paper, we take a fundamentally different approach. The key insight is that practical aggregated big data frequently exhibits sparse structure. The above search quality application is a typical example: most of the values concentrate around zero or some constant bias, yet users are mostly interested in the analytics determined by the values that diverge from the majority of the mass, such as queries for outliers, mode and top-k. Intuitively, such queries have the potential to reduce the communication required for aggregation by exploiting the underlying data's sparse structure. However, capitalizing on this potential is challenging, because different local data slices may not be equally sparse. In Figure 1, for example, the data x1, x2 and x3 do not have modes, while after aggregation the global x shows a clear mode. Thus, whether we can exploit sparsity for our problem depends on the existence of an efficient lossy compression algorithm that keeps high accuracy on the high-divergence values while being insensitive to the distribution difference between the local slices and the global data. Fortunately, we find that compressive sensing is a suitable choice for such a compression algorithm, due to its linearity in measurement (compression) as well as its focus and greediness on the significant components in recovery (decompression).

Compressive Sensing (CS) is an efficient technique for sampling high-dimensional data that has an underlying sparse structure. Since its emergence, CS has found several key applications, for example in the domain of photography [16], image reconstruction and facial recognition [39], or network traffic monitoring [43, 7]. In this paper, we show that applying CS can also yield significant benefits (both in theory and in practice) for a classic distributed computing problem. While we focus here on the specific distributed outlier problem at hand, our techniques may also be extended to solve similar aggregation queries (mean, top-k, percentile, ...) in large-scale distributed systems.

1 Note the difference between the k-outlier problem and the top-k problem: the keys with the most exceptional values are not necessarily the top-k (or absolute top-k) keys (cf. Figure 1(b)).

In this paper, we make the following contributions:

- We devise a new communication-efficient aggregation paradigm for the distributed outlier problem. The solution leverages compressive sensing techniques and exploits the underlying sparse structure of the data. Technically, the algorithm compresses the data by a random projection at each local node. This compressed sketch is transmitted to the aggregator node, which then uses a compressive sensing-based recovery algorithm to determine the mode and the outliers simultaneously (see Figure 2).

- We propose Biased OMP (BOMP), a new recovery algorithm that recovers both the outliers and the mode, where the mode can be any unknown value around which the values of the aggregate data concentrate. BOMP thus supports a much wider set of scenarios than existing compressive sensing-based recovery algorithms [34, 37], which can only be applied to sparse data concentrating at zero (sparse at zero).

- We show theoretically that our algorithm can find outliers with the aggregation communication reduced to the logarithm of the global data size. Specifically, if the data of size N satisfies an s-sparsity condition, our algorithm can recover the outliers with O(s^a log N) communication cost, where a is a small constant.

- We implement our algorithm in Hadoop. We show that for the web search quality application described above, our technique significantly outperforms existing algorithms for outlier detection, reducing data-shuffling IO by up to 99%. In our concrete production environment, this IO reduction results in a reduction of end-to-end job duration by up to 40%.

The remainder of the paper is organized as follows. Section 2 introduces preliminaries of outlier detection and compressive-sensing techniques. Section 3 formulates the problem and presents our algorithm for compressive sensing-based distributed outlier detection. Section 4 theoretically analyses the performance of BOMP. Section 5 describes how to implement our solution in Hadoop and discusses problems related to the system implementation. Experimental results are reported in Section 6. Section 7 discusses related work and Section 8 concludes the paper.

2. PRELIMINARIES

2.1 Distributed Outlier Detection

We first give a formal definition of distributed outlier detection. Suppose a data vector x = [x1, x2, ..., xN]^T ∈ R^N is distributed across L nodes such that each node l holds a partial slice of the data x_l = [x_l1, x_l2, ..., x_lN]^T ∈ R^N and Σ_l x_l = x. Now suppose an aggregator node wants to compute some statistics on x. In order to do so, each of the L nodes can send information about its local data slice to the global aggregator node.

One important data property that the aggregate data x may satisfy is sparsity:

Definition 1 (s-sparse). A vector x = [x1, ..., xN]^T ∈ R^N is called s-sparse if it has s non-zero items.

Typically, aggregated big data in the real world is not sparse in the strict sense of the above definition. However, it is sparse-like if we consider a more general data property of x. We study the outlier problem under this more general property. In its most basic version, we assume the existence of a dominating majority among the values of x. Formally, we define majority-dominated data as follows.

Definition 2 (majority-dominated). A vector x = [x1, ..., xN]^T ∈ R^N is majority-dominated if there exists b ∈ R such that the set O = {i : xi ≠ b} has |O| < N/2.

Given majority-dominated data x, we call the items in O the outliers. Note that the value of the majority b, if it exists, is unique by definition. The k-outlier problem is defined as finding a set O_k ⊆ O with min(k, |O|) elements that are furthest away from b. Formally, for any i ∈ O_k and j ∉ O_k, we have |xi - b| ≥ |xj - b|.

s-sparse data and majority-dominated data are simple notions of sparse data. In practice, data often exhibits weaker forms of sparse structure. In Figure 1, for example, values concentrate around a mode b, but they are not necessarily equal to exactly b. Due to this imprecision of concentration, the outliers and the mode do not have unique definitions on such data.

As discussed before, in many real-world applications the number of keys N can be very large, so the goal is to find distributed algorithms in which the total data volume transmitted by all nodes is sub-linear (logarithmic, if possible) in N. Also, due to the possibly large latency of data transmissions, we focus on single-round algorithms: every remote node transmits information derived from its local data to the aggregator in parallel, the aggregator receives all this information, and then locally computes the global result.
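The following is a minimal sketch (not from the paper) of the problem setting just defined, using the trivial transmit-everything baseline: every node ships its full slice, the aggregator sums the slices into x, takes the most frequent value as the mode b, and returns the min(k, |O|) keys furthest from b. The toy data and function names are illustrative only.

    import numpy as np
    from collections import Counter

    def k_outliers_exact(slices, k):
        """Transmit-everything baseline for the k-outlier problem on
        majority-dominated data: aggregate, find mode b, rank by |x_i - b|."""
        x = np.sum(slices, axis=0)                       # global data x = sum_l x_l
        b = Counter(x.tolist()).most_common(1)[0][0]     # majority value (mode)
        order = np.argsort(-np.abs(x - b))               # keys by divergence from b
        top = [int(i) for i in order if x[i] != b][:k]   # outliers only
        return top, b

    # toy example: 3 nodes, N = 8 keys; key 5 diverges after aggregation
    slices = [np.array([2, 2, 2, 2, 2, 4, 2, 2]),
              np.array([2, 2, 2, 2, 2, 4, 2, 2]),
              np.array([2, 2, 2, 2, 2, 4, 2, 2])]
    print(k_outliers_exact(slices, k=1))   # ([5], 6)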
2.2 Compressive Sensing and Recovery Algorithms

The basic ideas and mathematical foundations of compressive sensing were established in [8] [9] [15]. It is an efficient way to sample and compress data that has a sparse representation. Compressive sensing has two phases: i) sampling the data into a measurement, and ii) recovering the data from the sampled measurement. The sampling is done by linearly projecting a data vector x = [x1, x2, ..., xN]^T with a measurement matrix Φ = [φ1, φ2, ..., φN] into y = Φx. Due to the linearity of this sampling, compressive sensing is particularly well suited for data that is distributed across different locations or data that is frequently updated, scenarios in which traditional compression techniques often face challenges and may not be applicable. In practice, Φ is randomly generated, which has been shown to be near-optimal [9] [15].

Among recovery algorithms, the most widely used are Basis Pursuit (BP) [13] and (Orthogonal) Matching Pursuit (OMP) [30, 34, 37], which try to decompose the signal into a sparse representation via an over-complete dictionary. BP recovers the data x by solving the optimization problem min_x ||x||_1 subject to y = Φx, which is transformed into a linear programming problem. OMP is a greedy algorithm. Starting from a residual r = y and an empty matrix Ψ, OMP performs a sequence of iterations. In iteration k, OMP selects the column φ of Φ that has the largest inner product with the current residual r.

OMP appends this column to Ψ and records its position in Φ as I_k. Then y is projected onto the subspace spanned by Ψ to obtain the representation z = [z1, z2, ..., zk]^T, and the residual is updated as r = y - Ψz. When OMP finishes after some number of iterations, it obtains the recovered x from z by setting x_{I_k} = z_k and setting all other positions of x to zero.

Compared to BP, OMP has several advantages when used for the distributed k-outlier detection problem. First, OMP is simple to implement and faster than BP. Second, in the outlier problem we are inherently interested in the significant components rather than the entirety of the data. OMP naturally recovers the significant components during the first several iterations, after which we can stop the execution of OMP without significant loss of accuracy. Exploiting this feature of OMP is critically important when applying compressive sensing to big data problems in general, and to the outlier problem in particular, because the computational burden of recovery would otherwise pose an insurmountable obstacle.
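A compact sketch of the OMP loop described above is given below (it is not the authors' implementation; a least-squares solve stands in for the explicit projection step, and the problem sizes are arbitrary).

    import numpy as np

    def omp(Phi, y, n_iter):
        """Orthogonal Matching Pursuit: greedily pick the column of Phi most
        correlated with the residual, then re-fit y on all selected columns."""
        M, N = Phi.shape
        support, r = [], y.copy()
        x_hat, z = np.zeros(N), np.zeros(0)
        for _ in range(n_iter):
            j = int(np.argmax(np.abs(Phi.T @ r)))      # column with largest |<phi_j, r>|
            if j not in support:
                support.append(j)
            z, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)  # project y
            r = y - Phi[:, support] @ z                # update residual
            if np.linalg.norm(r) < 1e-10:
                break
        x_hat[support] = z
        return x_hat

    # recover a 5-sparse vector of length 200 from 80 random measurements;
    # with this many measurements, exact recovery typically succeeds
    rng = np.random.default_rng(0)
    N, M, s = 200, 80, 5
    x = np.zeros(N); x[rng.choice(N, s, replace=False)] = rng.normal(5, 1, s)
    Phi = rng.normal(0, 1 / np.sqrt(M), (M, N))
    print(np.allclose(omp(Phi, Phi @ x, s), x, atol=1e-6))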
3. ALGORITHM

In this section, we introduce our CS-based algorithm for the distributed k-outlier problem. We first show how we compress the distributed data, and then present the recovery algorithm that finds the outliers from the compressed data.

3.1 Distributed Compression

Each node contains up to N key-value pairs, and we want to obtain the k outliers of the globally aggregated values grouped by the keys. Our solution consists of three phases. As shown in Figure 2, the first phase is data compression: we vectorize the key-value pairs and apply a random projection locally at each node to obtain the local compression (or measurement). Then, the compressed data is transmitted to the aggregator, which directly computes a compressed aggregate from the different compressed inputs. Finally, the outliers and mode are recovered using the BOMP algorithm. We now describe each of these steps in detail. Please also refer to Figure 3 for an example of how to apply the procedure in the context of our web search quality application.

Figure 2: CS-based solution for distributed outlier and mode detection

Vectorization. Given a key space of size N, we can build a global key dictionary: the values on each node l are arranged by their key in a globally fixed order, forming a vector x_l = [x1, x2, ..., xN]^T. If a key does not exist on a node, we still keep an entry with value 0. Thus, all keys have the same fixed position in all local vectors x_l; i.e., by looking up the key dictionary with the position of a value, we can find its key. Distributed outlier detection then seeks the outliers of the aggregated data x = Σ_{l=1}^{L} x_l.

Local Compression. By consensus, each node randomly generates the same M x N measurement matrix Φ0 with columns φ1, φ2, ..., φN. Locally, each node l then measures its data x_l by y_l = Φ0 x_l, resulting in a vector of length M. The fraction M/N is the compression ratio, and we call y_l a local measurement.

Measurement Transmission. Each node now transmits its local measurement y_l to the aggregator. Assuming the system consists of L nodes, the communication cost is thus reduced from O(NL) to O(ML) compared to the trivial algorithm in which all data is transmitted to the aggregator.2

2 In practice, additional compression techniques can be applied to the data measurement for further data reduction.

Global Measurement. Once the aggregator has received all local measurements y_l, it aggregates them into a global measurement y = Σ_{l=1}^{L} y_l:

    y = Σ_{l=1}^{L} y_l = Σ_{l=1}^{L} Φ0 x_l = Φ0 Σ_{l=1}^{L} x_l = Φ0 x.    (1)

Notice that y exactly corresponds to the measurement of the aggregated data x = Σ_{l=1}^{L} x_l.
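The three communication phases can be sketched as follows (illustrative parameter values; the shared seed plays the role of the "consensus" measurement matrix, and the final check verifies equation (1)).

    import numpy as np

    def make_measurement_matrix(M, N, seed=42):
        """Every node derives the same M x N Gaussian Phi_0 from a shared seed."""
        return np.random.default_rng(seed).normal(0.0, 1.0 / np.sqrt(M), (M, N))

    def local_measurement(Phi0, x_l):
        """Local compression on each node: an M-length sketch of the slice."""
        return Phi0 @ x_l

    def aggregate(measurements):
        """Global measurement on the aggregator: by linearity, the sum of the
        sketches equals Phi0 applied to the sum of the local slices."""
        return np.sum(measurements, axis=0)

    rng = np.random.default_rng(1)
    N, M, L = 1000, 100, 4
    slices = [rng.poisson(3, N).astype(float) for _ in range(L)]
    Phi0 = make_measurement_matrix(M, N)
    y = aggregate([local_measurement(Phi0, x_l) for x_l in slices])
    print(np.allclose(y, Phi0 @ np.sum(slices, axis=0)))   # True: equation (1)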
3.2 BOMP Algorithm

In principle, we could now apply any recovery algorithm from the compressive sensing literature (e.g., OMP) to the global measurement y and use the recovered data to determine the outliers. However, when the values in x concentrate around an unknown non-zero bias b, the existing compressive sensing recovery algorithms are not applicable to this non-sparse data. To deal with this more general scenario, we therefore propose a novel recovery algorithm that uses the traditional OMP algorithm as a subroutine. We call the new algorithm Biased OMP (BOMP), because it can recover any set of values that concentrate around an unknown non-zero bias.

The BOMP algorithm is based on the following intuition. A data vector x whose values concentrate around b can be decomposed as x = b·1 + z, where z is near-sparse. By this decomposition, we reduce the recovery problem on a non-sparse x to an equivalent problem of recovering a near-sparse vector z together with a scalar b. Following this new representation of x, the measurement y = Φ0 x can be written as

    y = Φ0 (b·1 + z) = Σ_{i=1}^{N} φi · b + Φ0 z.

Carefully rearranging the terms, we obtain another matrix-vector multiplication form:

    y = [ (1/√N) Σ_{i=1}^{N} φi ,  Φ0 ] · [ √N·b ; z ].    (2)

The form in (2) can be considered as a measurement of an extended vector z~ = [√N·b, z^T]^T with an extended measurement matrix Ψ = [ (1/√N) Σ_{i=1}^{N} φi , Φ0 ].

In this way, we essentially give the vector y (which is originally a measurement of x) a new interpretation: it is also a measurement of z~. Attractively, z~ is sparse and has a one-to-one mapping with x. Thus, we can use any existing compressive sensing recovery algorithm, in particular OMP, to recover z~, and finally assemble the vector x from z~. Overall, BOMP reduces the recovery problem on x to a recovery problem on the near-sparse z~ by separating the bias b from x into an extra component that extends the vector to be recovered. This new recovery problem is solvable by the standard OMP algorithm. Notice that this new recovery problem has a different measurement matrix (the left term in (2)) and a different data vector dimension (N + 1) from the original ones.

Figure 3: Example of the CS-based approach
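A small numerical check of the reformulation in (2), with arbitrary values: setting ψ0 = (1/√N) Σ φi and z~ = [√N·b, z], the extended product Ψ z~ reproduces the original measurement Φ0 x exactly.

    import numpy as np

    rng = np.random.default_rng(7)
    N, M, b = 500, 80, 1800.0
    Phi0 = rng.normal(0.0, 1.0 / np.sqrt(M), (M, N))

    z = np.zeros(N); z[rng.choice(N, 10, replace=False)] = rng.normal(0, 400, 10)
    x = b + z                                      # data concentrated around bias b

    psi0 = Phi0.sum(axis=1) / np.sqrt(N)           # extra column: scaled column sum
    Psi = np.column_stack([psi0, Phi0])            # extended measurement matrix
    z_ext = np.concatenate([[np.sqrt(N) * b], z])  # extended vector [sqrt(N)*b, z]

    print(np.allclose(Phi0 @ x, Psi @ z_ext))      # True: the two measurements coincide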
Algorithm 1: BOMP
Input: Measurement matrix Φ0 = [φ1, φ2, ..., φN], measurement y = Φ0 x, iteration number R
Output: Outlier set O, mode b, recovered x
1: Extend the measurement matrix: Ψ ← [ψ0, Φ0], where ψ0 = (1/√N) Σ_{i=1}^{N} φi
2: z~ = [z0, z1, ..., zN]^T ← OMP recovery on y = Ψ z~ with R iterations
3: x ← [z1 + z0/√N, z2 + z0/√N, ..., zN + z0/√N]^T, b ← z0/√N, O ← {i | xi ≠ b}
4: return O, b, x

Our BOMP recovery algorithm for the compressive sensing-based outlier solution is given as Algorithm 1; its detailed workings are illustrated in Figure 3. Besides recovering the non-sparse data vector x, BOMP also estimates its mode b and returns the outlier set O. Given the measurement y = Φ0 x of the data vector x = [x1, x2, ..., xN]^T, where Φ0 = [φ1, φ2, ..., φN] is an M x N measurement matrix, BOMP first extends Φ0 to Ψ = [ψ0, φ1, ..., φN] by adding an extra column

    ψ0 = (1/√N) Σ_{i=1}^{N} φi.    (3)

Then, BOMP applies a standard OMP algorithm with R iterations to recover z~ = [z0, z1, ..., zN]^T from y = Ψ z~. Finally, we recover x through

    x = [z1 + z0/√N, z2 + z0/√N, ..., zN + z0/√N]^T.    (4)

BOMP takes b = z0/√N as the mode of x. Since the support size of z~ is upper-bounded by R, it follows from (4) that the recovered x has at most R - 1 components different from b; we take these to be the outlier set O. Finally, the algorithm selects the k elements of O that are furthest away from b. These are the detected k outliers.

4. THEORETICAL ANALYSIS

We theoretically analyze BOMP to determine its applicability and efficiency for the problem at hand. The key fundamental difference between BOMP and traditional OMP is that in OMP, all random entries of the measurement matrix are sampled independently. Such a matrix has been proved to satisfy the restricted isometry property (RIP) [5], which is a guarantee for the recovery accuracy. In BOMP, however, the first column of the extended measurement matrix is the (scaled) sum of all the other columns. This weak dependence becomes negligible as the number of columns grows, but it introduces substantial challenges to the theoretical analysis of BOMP.

In this analysis, we apply BOMP to a biased s-sparse vector, which has s outlier components differing from an unknown bias, with all other components equal to the bias. We prove that BOMP shows good recovery performance in Theorem 1: it requires M = A·s^a·log(N/δ) measurements to recover an N-dimensional biased s-sparse vector with probability 1 - δ.

Theorem 1. Suppose that x ∈ R^N is a biased s-sparse vector. Choose M = A·s^a·log(N/δ), where A and a are absolute constants. Draw a random M x N measurement matrix Φ0 whose entries are i.i.d. N(0, 1/M). BOMP can successfully recover x from the measurement y = Φ0 x with probability exceeding 1 - δ. The theorem holds under the Near-Isometric Transformation and Near-Independent Inner Product conjectures described below.

In the following, we give a detailed proof sketch of Theorem 1. Intuitively, OMP is a sequence of column selections (see Algorithm 2). In each iteration, the column of the measurement matrix with the largest inner product with the current residual r is selected. Then, the measurement vector y is projected onto the subspace spanned by the selected column set Ψs, and a new residual is calculated as y - proj(y, Ψs), where proj(y, Ψs) is the projection of y onto that subspace.

Algorithm 2: Column Selection in OMP [37]
Input: Matrix Ψ = [ψ0, φ1, ..., φN], measurement vector y
Output: Selected column set Ψs
1: Initialize Ψs as an empty set
2: r ← y
3: while ||r||_2 > 0 do
4:   ψ ← arg max_ψ |<ψ, r>|
5:   Ψs ← Ψs ∪ {ψ}
6:   r ← y - proj(y, Ψs)
7: end while
8: return Ψs
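The sketch below puts Algorithm 1 on top of the OMP column-selection loop of Algorithm 2. It is a simplified reading of the procedure, not the authors' MKL/JNI implementation; the stopping rule and thresholds are assumptions chosen for illustration.

    import numpy as np

    def omp(Psi, y, n_iter):
        """Algorithm 2 style loop: greedy column selection with a least-squares refit."""
        support, r, z = [], y.copy(), np.zeros(0)
        for _ in range(n_iter):
            j = int(np.argmax(np.abs(Psi.T @ r)))
            if j not in support:
                support.append(j)
            z, *_ = np.linalg.lstsq(Psi[:, support], y, rcond=None)
            r = y - Psi[:, support] @ z
            if np.linalg.norm(r) < 1e-9:
                break
        z_full = np.zeros(Psi.shape[1])
        z_full[support] = z
        return z_full

    def bomp(Phi0, y, R):
        """Algorithm 1 (BOMP): add the bias column psi0, recover z~ by OMP,
        then assemble x, the mode b and the outlier set O via equation (4)."""
        M, N = Phi0.shape
        psi0 = Phi0.sum(axis=1) / np.sqrt(N)
        Psi = np.column_stack([psi0, Phi0])
        z_tilde = omp(Psi, y, R)
        b = z_tilde[0] / np.sqrt(N)
        x_hat = z_tilde[1:] + b
        outliers = np.flatnonzero(np.abs(x_hat - b) > 1e-8)
        return x_hat, b, outliers

    # usage: x_hat, b, O = bomp(Phi0, y, R=5 * k)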
Without loss of generality, assume the s outliers occur in the first s dimensions of x. Accordingly, partition the extended measurement matrix Ψ into [Ψα, Ψβ], where Ψα includes the column vectors corresponding to the bias and the s outliers. It is easy to prove that, if the first s + 1 iterations of Algorithm 2 always select columns in Ψα, BOMP stops after s + 1 steps and x is exactly recovered.

In each iteration of Algorithm 2, statement 4 selects a new column by comparing inner products. Thus, to ensure BOMP's successful execution, it is sufficient that for any r in the subspace span(Ψα) spanned by the columns of Ψα,

    ρ(r) = max_{ψ ∈ Ψβ} |<ψ, r>| / max_{ψ ∈ Ψα} |<ψ, r>| < 1

in all of the s + 1 steps. For a single step,

    P{ρ(r) < 1} ≥ P{ max_{ψ ∈ Ψβ} |<ψ, r>| / ( ||Ψα^T r||_2 / √(s+1) ) < 1 }
               ≥ ( P{ |<ψ, r>| / ||Ψα^T r||_2 < 1/√(s+1) } )^{N-s}.    (5)

Due to the independence property, ψ in (5) can be any column of Ψβ. Thus, our goal is to evaluate

    ρ~(r) = |<ψ, r>| / ||Ψα^T r||_2.

To achieve a high success probability, we would like ρ~(r) to be as small as possible, that is, ||Ψα^T r||_2 should be much larger than |<ψ, r>|.

4.1 Near-Isometric Transformation

In ||Ψα^T r||_2, the vector r is linearly transformed by Ψα^T. If all the entries of Ψα are independent, a simple proof in [5, 37] shows that the norm of r is preserved with high probability:

    P{ ||Ψα^T r||_2 ≥ 0.5 ||r||_2 } ≥ 1 - e^{-cM}.    (6)

Now, the first column ψ0 of Ψα has a very weak dependence on the other s columns: the covariance matrix between ψ0 and any other column only has a non-zero diagonal 1/√N. When N is much larger than s, this dependence becomes smaller and smaller, and ultimately negligible. We therefore believe that the near-isometric transformation property in (6) holds in spite of the weak dependence, and formulate this intuition in the following conjecture.

Conjecture 1 (Near-Isometric Transformation). Let Ψα be an M x (s + 1) randomly generated matrix whose entries are all N(0, 1/M). The first column vector is weakly dependent on every other column with covariance matrix εI, where |ε| is sufficiently small; the other s column vectors are mutually independent. Then for any r ∈ span(Ψα), P{||Ψα^T r||_2 ≥ 0.5||r||_2} ≥ 1 - e^{-cM}, where c is an absolute constant.

We conducted extensive numerical experiments to verify the validity of Conjecture 1. We find that when s = 2, M > s and ε takes its largest value 1/√s, c is around 0.4. When M and s are larger than 10, which is the case in real applications, we observe that ||Ψα^T r||_2 ≥ 0.5||r||_2 always holds by a large margin.

4.2 Inner Product between Random Vectors

We now discuss the second major component of the proof, which concerns the inner product between random vectors. Let u = 0.5·r / ||Ψα^T r||_2. Then

    ρ~(r) = 2 |<ψ, u>|.

This is an inner product between two random vectors ψ and u. All entries of ψ are independent N(0, 1/M); by Conjecture 1, ||u||_2 ≤ 1 with probability P1 = 1 - e^{-cM}. If ψ and u were independent, it would follow from standard techniques (see [37]) that

    P{ |<ψ, u>| ≤ ε } ≥ 1 - e^{-ε² M/2}.

In our case, ψ is weakly dependent on u. The dependence is strongest when u = ψ0/||ψ0||_2; in that case, the covariance matrix only has a non-zero diagonal 1/√N. When N is sufficiently large, this dependence is so small that its influence on the inner product becomes negligible. We formalize this insight as follows.

Conjecture 2 (Near-Independent Inner Product). Suppose x and y are two M-dimensional random vectors, both with i.i.d. N(0, 1/M) entries, that are statistically dependent on each other with covariance E(xy^T) = εI. Normalize y as y' = y/||y||_2. If |ε| is sufficiently small, then P{|<x, y'>| ≤ ε} ≥ 1 - e^{-ε^{2a} M/2}, where a is an absolute constant.

Again, we conducted extensive experimental evaluations to verify Conjecture 2. When setting a = 1.1, we never observed any counter-examples; in fact, the condition was satisfied by a wide margin in all cases.

4.3 Success Probability

Using the two previous conjectures, we now have the ingredients to conclude the proof. Specifically, it holds that

    P{ ρ~(r) < 1/√(s+1) } = P{ |<ψ, u>| < 1/(2√(s+1)) }
                          ≥ ( 1 - e^{-(1/(2√(s+1)))^{2a} M/2} ) · P1
                          ≥ ( 1 - e^{-bM/s^a} ) ( 1 - e^{-cM} ),

where a and c are the constants defined in Conjectures 2 and 1, respectively, and b is a constant derived from a. Plugging this probability into (5), we get

    P{ ρ(r) < 1 } ≥ ( 1 - e^{-bM/s^a} )^{N-s} ( 1 - e^{-cM} )
                  ≥ ( 1 - (N - s) e^{-bM/s^a} ) ( 1 - (N - s) e^{-cM} )
                  ≥ 1 - N e^{-BM/s^a},

where B is a new constant. Applying the union bound over all s + 1 steps, we can lower-bound the overall success probability:

    P{E_succ} ≥ 1 - (s + 1)(1 - P{ρ(r) < 1}) ≥ 1 - (s + 1) N e^{-BM/s^a}.

Requiring P{E_succ} ≥ 1 - δ, the final conclusion is that with measurement size M = A·s^a·log(N/δ), the recovery failure probability of BOMP is at most δ, where A and a are absolute constants.
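As a small illustration of the kind of numerical verification described above, the following Monte Carlo sketch estimates how often the near-isometry event of Conjecture 1 holds. It assumes the strongest-dependence construction mentioned in the text (the first column proportional to the scaled sum of the others); the sample sizes are arbitrary.

    import numpy as np

    def near_isometry_rate(M, s, trials=2000, seed=3):
        """Empirical frequency of ||Psi_a^T r||_2 >= 0.5 ||r||_2 for random
        r in span(Psi_a), with a weakly dependent first column."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(trials):
            cols = rng.normal(0.0, 1.0 / np.sqrt(M), (M, s))   # s independent columns
            first = cols.sum(axis=1) / np.sqrt(s)               # dependent first column
            Psi_a = np.column_stack([first, cols])              # M x (s+1)
            r = Psi_a @ rng.normal(size=s + 1)                  # random r in span(Psi_a)
            hits += np.linalg.norm(Psi_a.T @ r) >= 0.5 * np.linalg.norm(r)
        return hits / trials

    # expected to be close to 1 for M, s of this size (cf. the discussion above)
    print(near_isometry_rate(M=64, s=10))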
5. HADOOP IMPLEMENTATION

We implement our solution on Hadoop, version 2.4.0. In this implementation, the mappers are responsible for collecting and aggregating the data for each key in the global key list of length N, and then compressing the partial results into a vector of length M using the CS-Mapper method introduced in Algorithm 3. The reducer receives the M-length vectors and then recovers the k outliers and the mode using the CS-Reducer method. The recovered results are output to HDFS. As illustrated in Algorithm 4, the recovery process calls the BOMP algorithm. In the implementation, we optimize the matrix computation in the recovery using QR factorization [6] with the Gram-Schmidt process, implemented with the Intel Math Kernel Library (MKL) in C and pre-built as shared objects; we use the Java Native Interface (JNI) to call these functions from the Hadoop code. Our solution greatly reduces the communication cost, which consists of the mapper output IO and the shuffling into the reducer. Although the recovery process introduces additional overhead, the end-to-end latency is still greatly improved. As the input file size grows, the saving in end-to-end time becomes more significant. This is because each mapper needs to process a larger number of file chunks, so our method yields its biggest savings in communication cost, and for larger input files the reducers would otherwise need to wait longer to receive all the outputs from the mappers. Our compressive sensing-based solution speeds up the mapping phase, which correspondingly reduces the reducers' waiting time. As such, the total latency is greatly reduced.

Algorithm 3: CS-Mapper
Input: Raw data, M, KeyList and seed
Output: Compressed data vector y of length M
1: N ← KeyList.length
2: Generate a random M x N measurement matrix Φ0 from seed
3: Load the raw data and partially aggregate it for each key in the global KeyList; let vector x hold the aggregated values ordered by KeyList
4: y ← Φ0 x
5: return y

Algorithm 4: CS-Reducer
Input: Compressed vector y, KeyList, k and seed
Output: Recovered x, outlier set O, mode b
1: N ← KeyList.length, M ← y.length
2: Generate a random M x N measurement matrix Φ0 from seed
3: return BOMP(Φ0, y, f(k))

One practical problem we encountered arises from floating-point precision when using the Gram-Schmidt process for QR factorization. The floating-point error becomes non-negligible compared to the values in the remaining uncompressed vector as the recovery iteration count increases, which may significantly influence the recovery result. To reduce this effect, we terminate the recovery process once the residual stops decreasing, which turns out to be very effective and accurate in practice.

An important optimization to speed up the recovery process is to select a proper iteration count R. The choice of R is affected by the value of k and the data sparsity. In practice, as in Algorithm 4, we denote R as f(k); through parameter tuning, we find that R ∈ [2k, 5k] is good enough for both recovery accuracy and efficiency.

Finally, note that the recovery algorithm can be accelerated by 30x-40x [2, 20] using GPUs, which is our ongoing work.

6. PERFORMANCE EVALUATION

We begin by evaluating the performance of the CS-based distributed k-outlier detection algorithm BOMP. To have a thorough performance evaluation, we first conduct experiments on the result quality and effectiveness of our solution in Section 6.1. Then, we implement a compressive sensing-based Hadoop system to evaluate the efficiency of our technique in Section 6.2. We use both synthetic and real-world production workloads in the experiments.

6.1 Effectiveness of BOMP

The effectiveness metrics we focus on are communication cost and recovery quality. Given the true k-outliers O_T, which is a set of key-value pairs <O_T.Key, O_T.Value> with |O_T| = k, and a k-outlier estimate O_E with |O_E| = k, the error of the estimate is measured by the errors on keys and on values:

1. Error on Key (EK): EK = 1 - |O_T.Key ∩ O_E.Key| / k, EK ∈ [0, 1]. It is the error on the precision of the key set.

2. Error on Value (EV): Both O_T and O_E are ordered according to value. EV = ||O_T.Value - O_E.Value||_2 / ||O_T.Value||_2, EV ∈ [0, 1]. It is the relative L2 error on the ordered value set.

A good approximate k-outlier solution should have small EK and EV.

Our solution is independent of how the keys are distributed over the different nodes. Given a fixed number of nodes, the communication cost is determined only by the size M of the measurement matrix. The evaluations in this section therefore focus on the recovery process using the BOMP algorithm with different measurement matrix sizes M.

To evaluate the sensitivity of our solution to different data distributions and parameters, we use a synthetic data set in Section 6.1.1. In Section 6.1.2, we then show the effectiveness of our algorithm in terms of communication cost on a real search quality analysis production workload, and compare it with alternative approaches.

6.1.1 Effectiveness of BOMP on Synthetic Data

We use two synthetic data sets as the aggregated scores in this group of experiments. The first one is majority-dominated data containing N observations with a mode b: there are N - s observations with value b, and the remaining s observations are not equal to b. b is set to 5000. We change the sparsity of the data by varying the parameter s. The second one is a Power-Law distribution with skewness parameter α. Since it is a continuous heavy-tailed distribution, there is no pair of observations with the same value.
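For concreteness, a direct transcription of the two error metrics of Section 6.1 into code is shown below, assuming the true and estimated outlier sets are given as key-to-value mappings of size k; the sample keys and values are made up for illustration.

    import numpy as np

    def error_on_key(true_outliers, est_outliers):
        """EK = 1 - |true keys intersect estimated keys| / k."""
        k = len(true_outliers)
        return 1.0 - len(set(true_outliers) & set(est_outliers)) / k

    def error_on_value(true_outliers, est_outliers):
        """EV: relative L2 error between the two value lists, each ordered by value."""
        t = np.sort(np.fromiter(true_outliers.values(), dtype=float))
        e = np.sort(np.fromiter(est_outliers.values(), dtype=float))
        return np.linalg.norm(t - e) / np.linalg.norm(t)

    true_k = {"en-US/Web": 9120.0, "de-DE/Image": -3400.0, "ja-JP/Video": 7800.0}
    est_k  = {"en-US/Web": 9050.0, "de-DE/Image": -3390.0, "fr-FR/News": 6500.0}
    print(error_on_key(true_k, est_k), error_on_value(true_k, est_k))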
(a) Probability of Exact Recovery   (b) Value of Mode (Bias) in Each Iteration

Figure 4: BOMP on majority-dominated data

The mode can be considered as the peak of its density function. We show that our algorithm can still achieve good performance on Power-Law distributions. The length of the data N is set to 1K in these experiments.

We begin by examining the probability of exact recovery on majority-dominated data. In an exact recovery, EK = EV = 0 and the number of outliers k equals s. For each measurement matrix size M, we run 1000 tests, each time with a randomly generated measurement matrix, from which the recovery percentage can be estimated. The number of recovery iterations is min{M, s + 1}. We also test the standard OMP algorithm under the assumption that the mode b is known in advance. As Figure 4(a) shows, BOMP has similar performance to OMP even though BOMP does not need knowledge of b. However, if the mode is not known in advance (which is the more general case), OMP needs to transmit an additional 2s + 1 data items to obtain the mode. When s = 200, BOMP can thus save up to 40% of the communication cost.

Then, for each s, we choose M to be the size achieving 100% exact recovery and log the value of the bias b (which is the mode) in each iteration of the recovery. The first 300 iterations of BOMP are plotted in Figure 4(b). We can see that when the iteration number reaches the sparsity s + 1, b starts to stabilize, which is of great interest because it confirms in practice the theoretical findings of Theorem 1.

In the second experiment, we examine the recovery quality over Power-Law distributed data with α = 0.9 and 0.95, respectively, and N = 10K. The errors EK and EV with K = 5, 10 and 20 are illustrated in Figures 5 and 6. For each M, we run compression and recovery 100 times with randomly generated measurement matrices, and report the MAX, MIN and AVG EK and EV errors over the 100 runs. We can see that even for Power-Law distributed data, we can still correctly detect the k-outliers with a smaller M.

6.1.2 Effectiveness of CS-Based Solution on Real Production Data

In this group of experiments, we evaluate the performance of our solution on real production workloads used for analysing Bing web search service quality, as introduced in Section 1. We use three production workloads with the following template:

    SELECT Outlier K SUM(Score), G1...Gm
    FROM Log Streams PARAMS(StartDate, EndDate)
    WHERE Predicates
    GROUP BY G1...Gm;

There are more than 600 such aggregation jobs running in the production environment every day, consuming large amounts of communication and computation resources. We choose three representative queries with different types of scores in the experiments: advertisement click score, core-search click score, and answer click score. The GROUP-BY attributes include QueryDate, Market, Vertical, RequestURL, DataCentre and so on. The one-week log stream is 65TB in total and is merged from 8 geographically distributed data centers. The log contains information from 49 Markets and 62 Verticals. After being filtered by the predicates, the total numbers of keys N that the three queries are interested in are 10.4K, 9K and 10K, respectively.

The performance metric we evaluate is the communication cost, computed as Nt x St, where Nt is the number of tuples transmitted and St is the number of bytes used to represent a tuple. In our solution, the data on each node is represented as a one-dimensional vector of length N. Given L nodes and M measurements of SM bytes each, the total communication cost is L x M x SM. In our experiments, SM is 64 bits.

To the best of our knowledge, there is no existing distributed technique explicitly targeted at distributed outlier detection. In practice, all data is usually transmitted to the aggregator node, and we consider this basic approach as one of our baselines. Using this vectorized approach, the communication cost is L x N x Sv, where Sv is 64 bits. Alternatively, if the number of non-zero keys ni on each node is small, we can reduce the communication cost by shipping key-value pairs; the communication cost is then Σ_{i=1}^{L} ni x St, where St is the size of a keyid-value pair, 96 bits. Our experience with the production data has shown that when transmitting all data, the communication cost of the vectorized approach is much smaller than that of shipping keyid-value pairs. We call this baseline ALL.

As a second baseline, we also implement an approximate approach called K+Δ, based on the three-round framework of [10]. In the first round, each node samples g keys (the g key ids are the same across different nodes) and sends them to the aggregator. The aggregator then computes an average value b' from the aggregated values of the g groups. In the second round, the aggregator transmits the value b' to each remote node. In the last round, each remote node transmits back its k + Δ outliers, taking b' as the mode. Finally, the aggregator outputs the k-outliers and the mode b based on the aggregated results of the data received. The communication cost of shipping keyid-value pairs in the last round is thus L x (K + Δ) x St, where St is the size of a keyid-value pair, 96 bits.
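A back-of-the-envelope comparison of the transmission schemes described above can be written down directly from the stated formulas (64 bits per vector entry or measurement, 96 bits per keyid-value pair); the node count and the per-node non-zero counts below are made-up inputs, and the K+Δ figure counts only the final round, as in the formula above.

    def comm_costs(L, N, M, k, delta, nonzero_per_node,
                   s_v=8, s_m=8, s_t=12):
        """Communication cost in bytes for each scheme (8 B per vector entry or
        measurement, 12 B per keyid-value pair)."""
        return {
            "ALL (vectorized)":    L * N * s_v,
            "ALL (keyid-value)":   sum(nonzero_per_node) * s_t,
            "K+delta (last round)": L * (k + delta) * s_t,
            "CS sketch":           L * M * s_m,
        }

    costs = comm_costs(L=8, N=10_000, M=500, k=20, delta=100,
                       nonzero_per_node=[6_000] * 8)
    for name, c in costs.items():
        print(f"{name:22s} {c:>12,d} B")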
(a) k=5   (b) k=10   (c) k=20

Figure 5: Error on key for different k

(a) k=5   (b) k=10   (c) k=20

Figure 6: Error on value for different k

depends on how the values of different keys are distributed run the CS-based approach 100 times with different random
over different nodes. When the values are distributed with measurement matrices and report MAX, MIN and AVG
big standard deviations, the mode and outliers on each node errors on key and value. We see that when the communicate
are vastly different from the global ones and the estimation cost is only 1% of ALL, the CS-Based approach yields accurate
of outliers may have large errors. However, if the values key results with K = 5. The error on value under the
are uniformly distributed across the nodes, the estimation same setting is only 4%. When K grows bigger, we need
based on the local data can be a good approximation of the more measurements to get the result with the same accuracy
global distribution. We therefore compare against this simple compared to smaller K. However, even when K = 20, we
k + -approach in order to determine howon real datathis only need 15% of the communication cost to get accurate key
natural approach compares against our CS-based algorithm. set. In contrast, the K+ approach returns big errors on
We call this baseline K+. both key and value even with much larger communication
In the first group of experiments, we evaluate the accuracy cost.
of outlier detection with different communication cost. We The trend in the results of ads and answer click score
use three production workloads on finding k-outliers over data are similar to the result above. In fact, the CS-based
sum of core-search, ads and answer click score. The keys algorithm significantly outperforms the other approaches
are defined from grouping by QueryDate, Market, Vertical, consistently over all benchmarks and data workloads.
RequestURL and DataCentre attributes. We partition the The mode b in each iteration of the three production
data exactly as it is partitioned on the different Microsoft data sets are shown in Figure 9. The values on y-axis are
data centers. anonymized. We can see that b becomes stable after 300, 650
We compare the CS-Based algorithm with ALL and K+. and 610 iteration (M=500, 800 and 800, respectively), from
For ALL, we adopt the vectorized approach because the total which we can also derive the sparsity of production data.
communication cost using keyid-value pairs is more than 3
times larger. In K+, the number of keys g to be sampled in 6.2 Efficiency of BOMP on Hadoop
the first round for the mode estimation should be big enough,
since it is important to the estimation of mode, and thus the In this group of experiments, we want to evaluate the
key to computing the outliers. From our investigation, the performance of BOMP on MapReduce implementation. Since
value of the sampled mode is relative stable when g is bigger we can not find an existing Hadoop implementation on outlier
than (K + )/2. To this end, given a communication cost detection, we compare our solution with traditional Top-k
budget, we always choose g to be 50% of the communication computing on Hadoop. Our solution can be extended directly
cost in the evaluation in order to get a fair comparison to solve distributed Top-k problem when the datas mode
among the three solutions. The communication cost of our is 0. All experiments are conducted on Hadoop 2.4.0 with
CS-Based approach and the K+ approach is normalized with both synthetic (Power-Law distribution with = 1.5, N =
the transmitting ALL solution. 100Kto1M ) and product data set as introduced in Section
The evaluation results on core-search score data are shown 6.1.2. We change the datas mode to 0 by subtracting the
in Figures 7 and 8. Given a target communication cost, we model from all the data. All experiments are running on a 10-
node cluster with Intel Core Xeon (CPU E5-2665@2.40GHz),

11
100% 100% 100%
K+delta K+delta K+delta
90% BOMP Avg 90% BOMP Avg 90% BOMP Avg
BOMP Max BOMP Max BOMP Max
BOMP Min BOMP Min BOMP Min
80% 80% 80%

70% 70% 70%


Error on Key

Error on Key

Error on Key
60% 60% 60%

50% 50% 50%

40% 40% 40%

30% 30% 30%

20% 20% 20%

10% 10% 10%

0% 0% 0%
1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15%
Communication Cost Nomalized by Transmitting All Communication Cost Nomalized by Transmitting All Communication Cost Nomalized by Transmitting All
(a) k=5 (b) k=10 (c) k=20
Figure 7: Error on key over production data

260% 280%
240% K+delta K+delta K+delta
BOMP Avg 240% BOMP Avg 260% BOMP Avg
220% BOMP Max BOMP Max BOMP Max
BOMP Min 240% BOMP Min
220% BOMP Min
200% 220%
200%
180% 200%
Error on Value

Error on Value
Error on Value

180%
160% 180%
160%
140% 160%
140%
120% 140%
120%
120%
100%
100% 100%
80%
80% 80%
60%
60% 60%
40% 40% 40%
20% 20% 20%
0% 0% 0%
1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 1% 2% 3% 4% 5% 6% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10%
Communication Cost Nomalized by Transmitting All Communication Cost Nomalized by Transmitting All Communication Cost Nomalized by Transmitting All
(a) k=5 (b) k=10 (c) k=20
Figure 8: Error on value over production data

100GB of RAM and Cent OS 6.3 64bit. Network bandwidth from matrix operations when N is large. However, compared
is 1Gbps. to the IO and shuffling cost, this overhead shows less impact
Firstly, we set N = 100K. Compared to traditional Top- on the overall efficiency compared to traditional approach.
k computation on Hadoop, our solution can greatly save Thus, BOMP manages to effectively trade-off an increase
the mappers output IO cost and shuffling cost in reducer. of computation cost for a reduction of communication cost.
However, it should also bring additional recovery cost. The As shown in Figure 12, with different N , our solution shows
amount of savings as well as overhead is mainly determined better efficiency over the traditional top-k approach.
by the size of measurement matrix M . The input file size of
synthetic data is 600M, as shown in Figure 10(a), the end
to end time of our solution is smaller than the traditional
7. RELATED WORK
implementation when M < 1100. The breakdown latencies
of both mapper and reducer are shown in Figures 11(a)
7.1 Theory of Distributed Outlier Detection
and 11(d). When we increase the input file size to 600G, In the most basic version of distributed data statistics, the
the savings become more significant which is demonstrated partial data xl {0, 1}N is a binary vector. A well-known
in Figure 10(b). Interestingly, the savings on reducer as problem in this setting is the set disjointness problem [25] [36]
shown in Figure 11(e) is more significant. As discussed in [28] [1]. Suppose each of two nodes has a set of n elements
Section 6.2, the reducers waiting time can be further reduces S1 and S2 , the problem asks the two nodes to exchange as
if we can greatly save mapper time. BOMP can achieve few bits of information as possible
T to determine whether the
that. For production data, the input file size is 12G. The two sets are disjoint (i.e. S1 S2 = ). Razborov proved in
comparison result is illustrated in Figure 10(c). When M [36] that (n) bits are needed for any algorithm to solve set
grows bigger than 1600, the overhead of the recovery process disjointness with constant success probability. Alon et al. [1]
in the reducer is much bigger. As such, end to end time can extended this linear lower bound to the multi-node scenario,
not be saved. Optimistically, both EV and EK are below showing that for L nodes with sets |S1 | = = |SL | = n and
5% when M = 400 for = 1.5 and M = 300 for product n L4 , (n/L3 ) bits are necessary to know whether the
data, under which settings our solution is more efficient than sets S1 . . . SL are pairwise disjoint. These lower bounds also
traditional Hadoop implementation. apply to problems in more general settings with non-binary
To show scalability of our approach, and to examine the data, and they have been used as building blocks to prove
effects of key size N on the efficiency, we increase N from various other lower bounds.
100K to 5M with fixed input file size 10G in the following Numerous studies consider the communication complexity
group of experiments. Both Traditional top-k approaches to learn statistical features of distributed positive data. In
and BOMP with k = 5 are tested in each evaluation. As these works, the partial data xl NN is a vector of natural
N increases, the IO and shuffling cost of the traditional numbers that typically corresponds to some counters over the
approach increases, which slows down the query processing. set {1, . . . , N }. Various features have been considered:
The response time of our solutionBOMPalso increases with Kuhn et al. [27] consider the aggregation data x as a
the increase of N . This is because of the overhead introduced frequency histogram of a sorted list a1 a2 am

12
s=650
s=300
Value of Mode (Bias)

Value of Mode (Bias)

Value of Mode (Bias)


s=610

50 100 150 200 250 300 350 400 450 500 0 100 200 300 400 500 600 700 800 0 100 200 300 400 500 600 700 800
Number of Iteration Number of Iteration Number of Iteration
(a) Core Search Click Score Data (b) Ads Click Score Data (c) Answer Click Score Data
Figure 9: Mode in the recovery iterations

70 300
BOMP 240 BOMP
65 280
220 Traditional Top-K Traditional Top-K
60 260
55 Traditional Top-K 200 240
50 180 220
200
45 160
Time(s)

Time(s)

Time(s)
180
40 140
160
35 120 140
30
100 120
25
80 100
20 80
60
15 60
10 40
40
5 20 20 BOMP
0 0 0
100 200 300 400 500 600 700 800 900 1000 1100 1200 200 400 600 800 1000 1200 1400 1600 1800 2000 200 400 600 800 1000 1200 1400 1600 1800 2000
M M M
(a) =1.5 small (b) =1.5 big (c) Product
Figure 10: End-to-end time on Hadoop
PN
(where m = i=1 xi and aj {1 . . . N }), and study to be less than the number of non-zero items in x, and
the problem of determining the median element in this may break the linear lower bound when F0 is sublinear to
sorted list {aj }, and more generally of determining the N . The key idea of the algorithm is to randomly partion
kth smallest element ak for any 1 k m. all N values into two groups and sum up the values of
Alon et al. [1] study the problem of determining the keys assigned to the same group, so that the key with the
k-order frequency moment of the aggregation data Fk = largest value is more likely to stay in the group with larger
PN
i=1 (x i )k . In this seminal paper, the authors consider sum. Repeating this routine (in parallel) multiple times
the space complexity (i.e. how many bits of space are can amplify this probabilistic fact to make the largest
needed) to solve the problem in a streaming model (where value stand out quickly.
A special case of the frequency moment problem is to determine the sparsity of the aggregation data $x$, i.e. how many components of $x$ are non-zero ($F_0$). Again, the set disjointness lower bound implies a lower bound of $\Omega(N)$ for any algorithm that exactly solves the sparsity problem [35]. Efficient algorithms that compute $F_0$ in expectation were proposed in [21, 17].

Finding the largest value in $x$ is another problem closely related to the frequency moment problem, in that $\lim_{k \to \infty} \sqrt[k]{F_k} = \max_i x_i$. Kuhn et al. point out that the lower bound for the set disjointness problem, again, applies to this problem, leading to a randomized lower bound of $\Omega(N/(x_{\max})^5)$, where $x_{\max} = \max_i x_i$ [26]. On the other hand, this work also proposes a randomized algorithm that solves the problem with high probability using $O((F_2/x_{\max}^2) \log N)$ bits. Note that $F_2/x_{\max}^2 \le F_0$ is guaranteed to be less than the number of non-zero items in $x$, and may break the linear lower bound when $F_0$ is sublinear in $N$. The key idea of the algorithm is to randomly partition all $N$ values into two groups and sum up the values of the keys assigned to the same group, so that the key with the largest value is more likely to stay in the group with the larger sum. Repeating this routine (in parallel) multiple times amplifies this probabilistic fact and makes the largest value stand out quickly.
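To make this intuition concrete, below is a minimal single-machine sketch of the random-partition idea (our own illustration; the actual protocol in [26] is distributed and exchanges only the two group sums per trial, whereas the hypothetical helper likely_max_key simply tallies, per key, how often its group wins):

```python
import random

def likely_max_key(values, trials=64, seed=0):
    """Toy sketch: estimate the index of the largest value via random bipartitions.

    Each trial assigns every key to one of two groups uniformly at random and
    credits the keys of the group with the larger sum; the key holding the
    maximum tends to collect the most credits across trials.
    """
    rng = random.Random(seed)
    votes = [0] * len(values)
    for _ in range(trials):
        groups = [rng.randrange(2) for _ in values]      # random bipartition of the keys
        sums = [0.0, 0.0]
        for g, v in zip(groups, values):
            sums[g] += v                                  # per-group aggregated sum
        winner = 0 if sums[0] >= sums[1] else 1
        for i, g in enumerate(groups):
            if g == winner:
                votes[i] += 1                             # amplify over repeated trials
    return max(range(len(values)), key=votes.__getitem__)

print(likely_max_key([3, 1, 0, 12, 2, 0, 4]))             # 3, the index of the largest value
```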
The top-k problem is a generalization of the maximum problem in which the $k$ largest values in $x$ have to be identified. This problem has been extensively studied in the database community assuming a limited number of nodes [18, 31, 3, 32, 4, 19]. In particular, a seminal work by Fagin et al. [19] proposed the famous Threshold Algorithm (TA), which was also independently discovered in [32] and [4]. The TA algorithm runs in multiple rounds. The idea is to use the sum of the values ordered at the same rank in each node as a threshold. The algorithm stops when there are $k$ keys whose aggregated SUM values reach the threshold, because the threshold is an upper bound on the total value of any key it has not yet seen.
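For reference, a compact single-machine sketch of the TA stopping rule is given below (our own simplification: every key is assumed to appear in every node's list, and random access to any key's value at any node is emulated with a local dictionary):

```python
import heapq

def threshold_algorithm(node_lists, k):
    """Toy sketch of TA [19]: node_lists[i] holds (key, value) pairs sorted by
    value in descending order. Returns the k keys with the largest total value."""
    lookup = [dict(lst) for lst in node_lists]           # random access to each node's values
    totals = {}                                          # key -> aggregated sum seen so far
    for rows in zip(*node_lists):                        # scan all lists rank by rank
        threshold = sum(value for _, value in rows)      # sum of the values at this rank
        for key, _ in rows:
            if key not in totals:
                totals[key] = sum(tbl[key] for tbl in lookup)
        best = heapq.nlargest(k, totals.values())
        if len(best) == k and best[-1] >= threshold:     # k sums already reach the threshold
            break                                        # no unseen key can do better
    return heapq.nlargest(k, totals, key=totals.get)

nodes = [[('a', 9), ('b', 7), ('c', 1)],
         [('b', 8), ('a', 5), ('c', 2)]]
print(threshold_algorithm(nodes, 1))                     # ['b'] (total 15 beats 'a' with 14)
```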
Although it works well in many database scenarios, the threshold algorithm suffers from limited scalability with respect to the number of nodes, as it fundamentally runs in multiple rounds. Inspired by Fagin's work, Pei Cao and Zhe Wang [10] proposed the TPUT algorithm, which consists of three rounds: i) estimate a lower bound on the $k$th largest value, ii) prune keys using this lower bound, and iii) perform exact top-k refinement. Charikar et al. [11] proposed an approximation algorithm that finds $k$ keys with values of at least $(1-\epsilon)x_k$ with high probability under the streaming model, using $O((k + F_2/x_k^2) \log N)$ bits of memory. This algorithm can also be adapted to approximate the top-k results across distributed nodes with $O((k + F_2/x_k^2) \log N)$ bits.

Figure 11: Breakdown time on Hadoop. (Mapper and Reducer times in seconds versus M for BOMP and the traditional approach: (a) =1.5 small map, (b) =1.5 big map, (c) Product map, (d) =1.5 small reduce, (e) =1.5 big reduce, (f) Product reduce.)

Figure 12: Efficiency with different key size N. ((a) End-to-end, (b) Map, and (c) Reduce times in seconds for the traditional top-k approach and for BOMP with M=50 and M=100, as the number of keys N grows from 100K to 5M.)

Different from all the work above, our work extends the partial data $x_l \in \mathbb{R}^N$ to the real field; in particular, $x_{ij}$ can be negative. The top-k problem over the natural numbers can be seen as a special case of the k-outlier problem over the real field as considered in this paper. However, the fact that the partial data can be negative in the k-outlier problem invalidates key assumptions made in [19, 10], namely that the partial sum is a lower bound on the aggregated sum. This means that TA and TPUT, which were proposed for the top-k problem, cannot be easily adapted to the k-outlier problem. Besides, the top-k values are not necessarily the k-outlier values when negative values are involved, as the small example below illustrates.

The top-1 algorithm proposed in [26] may be adapted to solve the k-outlier problem with $O(k^2 \log N)$ messages with high probability. However, this would require multiple iterations and rounds (specifically, $k$ rounds) of communication. In contrast, in this paper we focus on a non-adaptive algorithm that finds all $k$ outliers in a single round.
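A toy illustration of this last point (our own example, not the paper's algorithm; purely for illustration we take the k outliers to be the entries that deviate most from the value around which the data concentrates, with the median standing in for that unknown value):

```python
import numpy as np

# With negative values, the k largest entries are not necessarily the k entries
# that deviate most from the value around which the data concentrates.
x = np.array([5.0, 5.2, 4.9, 5.1, 9.0, -3.0])                 # concentrates around 5
k = 2

top_k = {int(i) for i in np.argsort(x)[-k:]}                   # keys with the k largest values
bias = np.median(x)                                            # stand-in for the concentration value
outlier_k = {int(i) for i in np.argsort(np.abs(x - bias))[-k:]}  # keys deviating most from the bias

print(top_k, outlier_k)                                        # {1, 4} {4, 5}: the two sets differ
```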
7.2 Compressive Sensing and Data Sketches
Compressive Sensing (CS) is a breakthrough in the signal processing community. Its basic ideas and mathematical foundations have been established in [8, 9, 15, 37]. CS is an efficient approach to sampling and reconstructing sparse data, and it has found many applications (e.g., photography [16], image reconstruction and recognition [39], and network traffic monitoring [43]). In our work, we use CS to compress and recover sparse data vectors.
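A rough sketch of this sample-and-recover pattern is shown below (our own minimal illustration: a random Gaussian measurement matrix with plain orthogonal matching pursuit [34, 37]; it is not the BOMP variant or the measurement design used in our system, and the sizes N, M, k are arbitrary):

```python
import numpy as np

def omp(A, y, k):
    """Greedy recovery of a k-sparse x from y = A @ x (orthogonal matching pursuit)."""
    residual, support = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))       # column most correlated with residual
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef              # re-fit and update the residual
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
N, M, k = 1000, 60, 3                                    # key space, measurements, sparsity
x = np.zeros(N)
x[rng.choice(N, size=k, replace=False)] = [40.0, -25.0, 10.0]
A = rng.standard_normal((M, N)) / np.sqrt(M)             # shared random measurement matrix
y = A @ x                                                # only M numbers need to be shipped
print(np.allclose(omp(A, y, k), x, atol=1e-6))           # True w.h.p.: x recovered from 60 samples
```

In a distributed setting, the linearity of these measurements is what makes the pattern attractive: if all nodes draw the same matrix A from a shared seed, their M-dimensional measurement vectors can simply be summed at the aggregator before recovery.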
One technique for reducing communication cost is data compression. Lossless compression usually does not help much when each slice contains a large number of distinct non-zero real values. Alternatively, some existing approaches conduct lossy compression (also known as sketching) [12, 24, 22] on the slice in each node. However, these scattered slices could have different distributions or structures from each other. Traditional sketching techniques applied locally may therefore encounter the risk of losing substantial information that is necessary for accurately answering the query on the global data.

7.3 Communication-Efficient System Design
Communication-efficient distributed computation of statistics is also attractive for distributed system design, e.g. [41]. Many efforts have been made to optimize query processing in MapReduce-type distributed systems. [33, 38] propose techniques designed to reduce the Mappers' and Reducers' communication cost through job sharing. The authors of [29, 14, 40] apply sampling to reduce the processing cost of MapReduce jobs. Although these works are not tailored to outlier computation, they could be adapted and applied together with our solution to further speed up MapReduce jobs.

8. CONCLUSIONS
Production-level big data usually exhibits a sparse structure. It is attractive to leverage this sparsity to reduce the communication cost of various distributed aggregation algorithms, such as the detection of outliers, the mode, the top-k, and so on. However, the distribution skew among the data slices allocated in a shared-nothing distributed system impedes traditional techniques that are based on first compressing (sketching) local slices separately and then issuing the query against these aggregated sketches. In this paper, we make the case that compressive sensing-based compression and recovery techniques may be ideally suited to such a distributed context, especially in MapReduce-type systems. Specifically, we show empirically and theoretically that such an approach can be beneficial for the distributed outlier detection problem. More generally, we believe that there will be many more applications of compressive sensing in distributed computing.

9. REFERENCES
[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC '96, pages 20-29, New York, NY, USA, 1996. ACM.
[2] M. Andrecut. Fast GPU implementation of sparse signal recovery from random projections. Engineering Letters, 17(3):151-158, 2009.
[3] B. Babcock and C. Olston. Distributed top-k monitoring. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 28-39. ACM, 2003.
[4] W.-T. Balke and W. Kießling. Optimizing multi-feature queries for image databases. In VLDB, pages 10-14, 2000.
[5] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28(3):253-263, 2008.
[6] T. Blumensath and M. E. Davies. On the difference between orthogonal matching pursuit and orthogonal least squares, 2007.
[7] T. Bu, J. Cao, A. Chen, and P. P. Lee. A fast and compact method for unveiling significant patterns in high speed networks. In INFOCOM 2007, 26th IEEE International Conference on Computer Communications, pages 1893-1901. IEEE, 2007.
[8] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. Information Theory, IEEE Transactions on, 52(2):489-509, 2006.
[9] E. J. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? Information Theory, IEEE Transactions on, 52(12):5406-5425, 2006.
[10] P. Cao and Z. Wang. Efficient top-k query calculation in distributed networks. In S. Chaudhuri and S. Kutten, editors, PODC, pages 206-215. ACM, 2004.
[11] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Automata, Languages and Programming, pages 693-703. Springer, 2002.
[12] S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34-43, 1998.
[13] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
[14] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears. Online aggregation and continuous query support in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 1115-1118. ACM, 2010.
[15] D. L. Donoho. Compressed sensing. Information Theory, IEEE Transactions on, 52(4):1289-1306, 2006.
[16] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk. Single-pixel imaging via compressive sampling. Signal Processing Magazine, IEEE, 25(2):83-91, 2008.
[17] M. Durand and P. Flajolet. Loglog counting of large cardinalities. In Algorithms - ESA 2003, pages 605-617. Springer, 2003.
[18] R. Fagin. Combining fuzzy information from multiple systems. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 216-226. ACM, 1996.
[19] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In P. Buneman, editor, PODS. ACM, 2001.
[20] Y. Fang, L. Chen, J. Wu, and B. Huang. GPU implementation of orthogonal matching pursuit for compressive sensing. In Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems, ICPADS '11, pages 1044-1047, Washington, DC, USA, 2011. IEEE Computer Society.
[21] P. Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.
[22] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, pages 331-342, New York, NY, USA, 1998. ACM.
[23] Z. Guo, X. Fan, R. Chen, J. Zhang, H. Zhou, S. McDirmid, C. Liu, W. Lin, J. Zhou, and L. Zhou. Spotting code optimizations in data-parallel pipelines through PeriSCOPE. In OSDI, pages 121-133, 2012.
[24] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys (CSUR), 40(4):11, 2008.
[25] B. Kalyanasundaram and G. Schnitger. The probabilistic communication complexity of set intersection. SIAM J. Discret. Math., 5(4):545-557, Nov. 1992.
[26] F. Kuhn, T. Locher, and S. Schmid. Distributed computation of the mode. In Proceedings of the Twenty-Seventh ACM Symposium on Principles of Distributed Computing, pages 15-24. ACM, 2008.
[27] F. Kuhn, T. Locher, and R. Wattenhofer. Tight bounds for distributed selection. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 145-153. ACM, 2007.
[28] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, New York, NY, USA, 1997.
[29] N. Laptev, K. Zeng, and C. Zaniolo. Early accurate results for advanced analytics on MapReduce. Proceedings of the VLDB Endowment, 5(10):1028-1039, 2012.
[30] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12):3397-3415, 1993.
[31] A. Marian, N. Bruno, and L. Gravano. Evaluating top-k queries over web-accessible databases. ACM Transactions on Database Systems (TODS), 29(2):319-362, 2004.
[32] S. Nepal and M. Ramakrishna. Query processing issues in image (multimedia) databases. In Data Engineering, 1999. Proceedings., 15th International Conference on, pages 22-29. IEEE, 1999.
[33] T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: Sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment, 3(1-2):494-505, 2010.
[34] Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Signals, Systems and Computers, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on, pages 40-44. IEEE, 1993.
[35] B. Patt-Shamir. A note on efficient aggregate queries in sensor networks. Theoretical Computer Science, 370(1):254-264, 2007.
[36] A. A. Razborov. On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385-390, 1992.
[37] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12):4655-4666, 2007.
[38] G. Wang and C.-Y. Chan. Multi-query optimization in MapReduce framework. Proceedings of the VLDB Endowment, 7(3), 2013.
[39] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):210-227, 2009.
[40] Y. Yan, L. J. Chen, and Z. Zhang. Error-bounded sampling for analytics on big sparse data. PVLDB, 7(13):1508-1519, 2014.
[41] J. Zhang, Y. Yan, L. J. Chen, M. Wang, T. Moscibroda, and Z. Zhang. Impression Store: Compressive sensing-based storage for big data analytics. In 6th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '14), Philadelphia, PA, June 2014. USENIX Association.
[42] J. Zhang, H. Zhou, R. Chen, X. Fan, Z. Guo, H. Lin, J. Y. Li, W. Lin, J. Zhou, and L. Zhou. Optimizing data shuffling in data-parallel computation by understanding user-defined functions. In NSDI, volume 12, pages 22-22, 2012.
[43] Y. Zhang, M. Roughan, W. Willinger, and L. Qiu. Spatio-temporal compressive sensing and internet traffic matrices. In ACM SIGCOMM Computer Communication Review, volume 39, pages 267-278. ACM, 2009.

