
Hashing, random projections, and data streaming

Ioana Cosma

Statistical Laboratory
University of Cambridge

September 28, 2009

Ioana Cosma Hashing, random projections, and data streaming


Data streaming: definition and problem

Streaming data
A transiently observed sequence of data elements that arrive
unordered, with repetitions, and at a very high rate of transmission.

Problem
Estimation of summary statistics over streaming data with fast
processing of data elements and modest storage requirements:
1. cardinality, $l_\alpha$ distances (quasi-distances), entropy;
2. quantiles, histograms, and other measures of distributional
dissimilarity.

Consider a data stream $S_T$ of length $T$ with elements of the form

$$(i_t, d_t), \quad t = 1, \ldots, T,$$

where the item type $i_t$ belongs to a countable, possibly infinite
set $D = \{c_1, \ldots, c_N\}$, and the associated quantity $d_t$ is
either positive or negative.

The accumulation vector of $S_T$ at stage $T$ is $a_T = (a_1, a_2, \ldots)$,
where

$$a_j = \sum_{t=1}^{T} d_t \, I(i_t = c_j), \quad j = 1, \ldots, N,$$

is the cumulative quantity of elements of type $c_j$ at stage $T$.
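
As a minimal sketch of the definition above, the accumulation vector can be computed exactly with one counter per item type. The function name `accumulate` and the example stream are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def accumulate(stream):
    """Return the accumulation vector as a dict {item type: cumulative quantity}."""
    a = defaultdict(float)
    for item, d in stream:   # each element is a pair (i_t, d_t)
        a[item] += d         # a_j + d_t -> a_j whenever i_t = c_j
    return dict(a)

# Hypothetical stream: unordered, with repetitions and signed quantities.
stream = [("x", 3), ("y", 1), ("x", -1), ("z", 2), ("y", -1)]
a = accumulate(stream)
```

Note that an item type whose quantities cancel (here `"y"`) still occupies a counter, which is one reason exact storage grows with the number of distinct types seen.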

Computing summary statistics over streaming data


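1. The $l_\alpha$ norm, defined for $\alpha \geq 1$:

$$l_\alpha(a_T) = \Big( \sum_{j \in D} |a_j|^\alpha \Big)^{1/\alpha}.$$

The cardinality, that is, the number of item types with $a_j \neq 0$, is
$c := \lim_{\alpha \to 0} \sum_{j \in D} |a_j|^\alpha = \lim_{\alpha \to 0} l_\alpha(a_T)^\alpha$.

2. The $l_\alpha$ distance between streams with accumulation vectors
$a_{T_1}$ and $b_{T_2}$ over $D$, for $\alpha \geq 1$:

$$d_\alpha(a_{T_1}, b_{T_2}) = l_\alpha(a_{T_1} - b_{T_2}).$$

3. Let $p_j = a_j / \sum_{d \in D} a_d$. The empirical Shannon entropy of
$p = (p_1, p_2, \ldots)$:

$$H(p) = -\sum_{j \in D,\; p_j \neq 0} p_j \log p_j.$$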
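
For reference, the three summary statistics can be computed exactly from a stored accumulation vector. The sketch below assumes the vector is held as a Python dict; the function names are hypothetical, and the entropy function additionally assumes non-negative counts with a positive total.

```python
import math

def l_alpha(a, alpha):
    """l_alpha norm of an accumulation vector stored as {item type: quantity}."""
    return sum(abs(v) ** alpha for v in a.values()) ** (1.0 / alpha)

def cardinality(a):
    """Number of item types with a non-zero cumulative quantity."""
    return sum(1 for v in a.values() if v != 0)

def l_alpha_distance(a, b, alpha):
    """l_alpha norm of the difference of two accumulation vectors."""
    keys = set(a) | set(b)
    diff = {j: a.get(j, 0) - b.get(j, 0) for j in keys}
    return l_alpha(diff, alpha)

def shannon_entropy(a):
    """Empirical Shannon entropy; assumes non-negative counts, positive total."""
    total = sum(a.values())
    return -sum((v / total) * math.log(v / total) for v in a.values() if v != 0)

# Hypothetical accumulation vectors.
a = {"x": 2, "y": 0, "z": 2}
b = {"x": 1, "w": 3}
```

These exact versions need the whole vector in memory, which is the cost the sketching approach is designed to avoid.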

Computational complexity
Exact computation of functions of the accumulation vector
requires maintaining a counter $a_j$ for each $j \in D$ and updating it
by $a_j + d_t \mapsto a_j$ whenever $i_t = c_j$.
The associated storage cost of $O(c)$ is prohibitively large, so the
vector $(a_1, a_2, \ldots)$ cannot be held in main memory or
on disk for fast and efficient access.

Data sketching via random projections


Process the data stream in a one-pass algorithm that retains
sufficient information in the form of a low-dimensional vector
of random projections.
The algorithm is fast, has small space requirements, and
stores sufficient information for efficient estimation of these
summary statistics.

Random projections, hashing, and estimation

The stable distribution


$X \sim F$ is stable if for every integer $n > 0$ and $X_1, \ldots, X_n \sim F$ i.i.d., there
exist constants $a_n > 0$ and $b_n$ such that

$$X_1 + \ldots + X_n \overset{D}{=} a_n X + b_n.$$

The norming constant is $a_n = n^{1/\alpha}$, where $0 < \alpha \leq 2$ is the index
of stability. If $b_n = 0$, then $X$ is said to be strictly stable.

Theorem on linear combinations

For constants $a_1, a_2 > 0$, if $X_1, X_2 \sim F$ are strictly stable of index
$\alpha$ and independent, then

$$a_1 X_1 + a_2 X_2 \overset{D}{=} \big( |a_1|^\alpha + |a_2|^\alpha \big)^{1/\alpha} X, \quad X \sim F. \tag{1}$$

If, in addition, the distribution is symmetric, then (1) holds for all
$a_1, a_2 \in \mathbb{R}$.
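
A quick numerical check of (1) can be made for the standard Cauchy distribution, which is symmetric and strictly stable with $\alpha = 1$. The sketch below is an illustration under that assumption: it uses the fact that the median of $|sX|$ is $|s|$ for standard Cauchy $X$, so the sample median of $|a_1 X_1 + a_2 X_2|$ should be close to $|a_1| + |a_2|$. The function names and the chosen constants are hypothetical.

```python
import math
import random
import statistics

def standard_cauchy(rng):
    """Inverse-CDF draw from the standard Cauchy distribution (alpha = 1)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def median_abs_combination(a1, a2, n, seed=0):
    """Sample median of |a1*X1 + a2*X2| for independent standard Cauchy X1, X2."""
    rng = random.Random(seed)
    samples = [abs(a1 * standard_cauchy(rng) + a2 * standard_cauchy(rng))
               for _ in range(n)]
    return statistics.median(samples)

# Mixed signs are allowed because the Cauchy distribution is symmetric.
estimate = median_abs_combination(2.0, -3.0, n=20000)
expected = abs(2.0) + abs(-3.0)   # (|a1|^1 + |a2|^1)^(1/1)
```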

The method of random projections


Each element type $c_j$ in the data stream can be transformed to a
distinct random variable $R(c_j) \sim F$ to adequate approximation via
a pseudo-random number generator as follows:
1. Hash $c_j$ to an integer (or vector of integers) via a
deterministic hash function with low collision probability.
2. Use these integers to seed a pseudo-random number
generator.
3. Use the seeded generator to simulate a sequence of
independent random variables with distribution $F$; set $R(c_j)$ to
the value at a fixed position in the sequence.
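
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk: SHA-256 stands in for the hash function, Python's `random.Random` for the generator, the Chambers-Mallows-Stuck method for simulating a symmetric $\alpha$-stable variate, and the "fixed position" is taken to be the first draw. The names `symmetric_stable`, `projection`, and the `rep` parameter (an index for independent repetitions) are hypothetical.

```python
import hashlib
import math
import random

def symmetric_stable(rng, alpha):
    """Draw one symmetric alpha-stable variate (Chambers-Mallows-Stuck method)."""
    v = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    if alpha == 1.0:
        return math.tan(v)  # standard Cauchy case
    return (math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))

def projection(item, alpha, rep=0):
    """Map an item type to a reproducible pseudo-random variate R(item)."""
    # Step 1: deterministic hash of the item type (salted by the repetition index).
    digest = hashlib.sha256(f"{rep}:{item}".encode("utf-8")).digest()
    # Step 2: use the hash value to seed a pseudo-random number generator.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Step 3: take the draw at a fixed position (here, the first one).
    return symmetric_stable(rng, alpha)

r1 = projection("item-42", alpha=1.5)
r2 = projection("item-42", alpha=1.5)   # same item, same value
```

Because the value is regenerated from the item type each time it is seen, nothing needs to be stored per item type.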

Ioana Cosma Hashing, random projections, and data streaming


Random projections, hashing, and estimation

Processing and estimation


The projection is then accumulated online as

$$\sum_{t=1}^{T} d_t R(i_t) = \sum_{j=1}^{N} a_j R(c_j),$$

and the process is repeated independently a further $k - 1$ times to
form a $k$-dimensional data sketch.
If $R(c_j) \sim F$ is strictly stable and symmetric of index $\alpha$, then

$$\sum_{j=1}^{N} a_j R(c_j) \overset{D}{=} \Big( \sum_{j=1}^{N} |a_j|^\alpha \Big)^{1/\alpha} R,$$

where $R \sim F$. The problem reduces to that of estimating a
scale parameter in an observed sample of size $k$ from the
$\alpha$-stable distribution.
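
The online accumulation and the final scale estimation can be illustrated for the case $\alpha = 1$, where $F$ is the standard Cauchy distribution and the scale parameter equals the $l_1$ norm. The sketch below is an assumption-laden example rather than the estimator proposed in the talk: it uses the sample median of the absolute sketch entries, a simple choice that works because the median of $|sR|$ is $|s|$ for standard Cauchy $R$. The class `L1Sketch`, the helper `cauchy_projection`, and the example stream are hypothetical.

```python
import hashlib
import math
import random
import statistics

def cauchy_projection(item, rep):
    """Reproducible standard Cauchy variate for (item type, repetition index)."""
    digest = hashlib.sha256(f"{rep}:{item}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return math.tan(math.pi * (rng.random() - 0.5))

class L1Sketch:
    """k-dimensional sketch of a stream; storage does not grow with item types."""

    def __init__(self, k):
        self.k = k
        self.s = [0.0] * k

    def update(self, item, d):
        # One pass: add d_t * R_r(i_t) to each of the k projections.
        for r in range(self.k):
            self.s[r] += d * cauchy_projection(item, r)

    def estimate_l1(self):
        # Median of |s_r| estimates the Cauchy scale, i.e. the l_1 norm.
        return statistics.median(abs(x) for x in self.s)

# Hypothetical stream whose accumulation vector is x: 2, y: 0, z: 2, w: -6.
sk = L1Sketch(k=400)
for item, d in [("x", 3), ("y", 1), ("x", -1), ("z", 2), ("y", -1), ("w", -6)]:
    sk.update(item, d)
est = sk.estimate_l1()   # should be near the true l_1 norm of 10
```

The accuracy of the estimate is governed by the sketch size `k`, not by the length of the stream or the number of distinct item types.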
Contributions and future work

Contributions
We exploit properties of the $\alpha$-stable distribution and the idea
of data sketching via random projections to propose efficient
estimators of cardinality, $l_\alpha$ distances, and entropy over
streaming data.
The proposed algorithms have fast running times and small
storage requirements.
The resulting estimators are asymptotically efficient (or nearly so)
with good small-sample performance (shown via simulations),
recursively computable (for cardinality estimation), and have
tail bounds that decrease exponentially with $k$ (for
cardinality and entropy estimation, which yields space
complexity bounds on the size of the data sketch).

Future work
Shannon entropy as a characterisation of a data stream, applied
to developing convergence diagnostics for monitoring extensive
MCMC simulations in Bayesian inference problems.
Data sketching via random projections as a tool for dimension
reduction in computationally expensive simulation algorithms
such as particle filtering and particle MCMC.

References
http://www.statslab.cam.ac.uk/~ioana
