Ioana Cosma
Statistical Laboratory
University of Cambridge
Streaming data
A transiently observed sequence of data elements that arrive
unordered, with repetitions, and at a very high rate of transmission.
Problem
Estimation of summary statistics over streaming data with fast
processing of data elements and modest storage requirements:
1. cardinality, $l_\alpha$ distances (quasi-distances), entropy;
2. quantiles, histograms, and other measures of distributional
   dissimilarity.
$(i_t, d_t), \quad t = 1, \ldots, T,$
Computational complexity
Exact computation of functions of the accumulation vector
requires maintaining the counter $a_j$ for each $j \in D$, updating it
whenever $i_t = c_j$ by $a_j + d_t \mapsto a_j$.
The associated storage cost of $O(c)$ is prohibitively large, so the
vector $(a_1, a_2, \ldots)$ cannot be stored in main memory or
on disk for fast and efficient access.
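The exact approach described above can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `accumulate`, the toy stream, and the reading of cardinality as the number of elements with a nonzero accumulated value are all assumptions made for the example. It keeps one counter per distinct element, which is exactly the storage cost that becomes prohibitive on real streams.

```python
from collections import defaultdict


def accumulate(stream):
    """Exact accumulation vector: one counter a_j per distinct element c_j.

    `stream` is an iterable of (i_t, d_t) pairs. Storage grows with the
    number of distinct elements seen, which is the cost to be avoided.
    """
    counters = defaultdict(int)
    for item, delta in stream:
        counters[item] += delta  # a_j + d_t -> a_j whenever i_t = c_j
    return dict(counters)


# Hypothetical toy stream of (element, increment) pairs.
stream = [("a", 1), ("b", 2), ("a", 3), ("c", 1), ("b", -2)]
a = accumulate(stream)

# Summary statistics computed exactly from the stored counters.
# Assumption: cardinality counts elements whose accumulated value is nonzero.
cardinality = sum(1 for value in a.values() if value != 0)
l1_norm = sum(abs(value) for value in a.values())
```

Every statistic here is exact, but only because the whole dictionary is held in memory; the sketching methods discussed later trade that exactness for a fixed-size summary.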
If, in addition, the distribution is symmetric, then (1) holds for all
$a_1, a_2 \in \mathbb{R}$.
Ioana Cosma Hashing, random projections, and data streaming
Random projections, hashing, and estimation
$$\sum_{j=1}^{N} a_j R(c_j) \overset{D}{=} \left( \sum_{j=1}^{N} |a_j|^{\alpha} \right)^{1/\alpha} R,$$
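To make the identity concrete, here is a minimal sketch for the special case $\alpha = 1$, where the stable distribution is the standard Cauchy and each projection is distributed as the $l_1$ norm of the accumulation vector times a Cauchy variable. It uses the well-known median estimator for Cauchy projections, which is not necessarily one of the estimators proposed in this talk. The names `cauchy_for`, `sketch_stream`, and `estimate_l1`, the string-seeded generator standing in for a hash function, and the toy stream are all assumptions for illustration.

```python
import math
import random
import statistics


def cauchy_for(item, row, seed=0):
    """Standard Cauchy draw R_row(item), reproducible from (seed, row, item).

    Seeding a generator with a string plays the role of a hash function, so
    the same element always maps to the same random value without storing it.
    """
    rng = random.Random(f"{seed}|{row}|{item}")
    return math.tan(math.pi * (rng.random() - 0.5))


def sketch_stream(stream, k, seed=0):
    """Maintain k random projections of the accumulation vector."""
    sketch = [0.0] * k
    for item, delta in stream:
        for row in range(k):
            # Linear update: each projection gains d_t * R_row(i_t).
            sketch[row] += delta * cauchy_for(item, row, seed)
    return sketch


def estimate_l1(sketch):
    """Median of |projection|: the median of |Cauchy| is 1, so this
    estimates sum_j |a_j| (the alpha = 1 case of the identity)."""
    return statistics.median(abs(s) for s in sketch)


# Hypothetical toy stream; its accumulation vector is a=4, b=-2, c=5.
stream = [("a", 3), ("b", -2), ("a", 1), ("c", 5)]
true_l1 = 11
estimate = estimate_l1(sketch_stream(stream, k=500))
```

The storage is the $k$ numbers in `sketch`, independent of how many distinct elements appear; accuracy improves as $k$ grows.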
Contributions
We exploit properties of the α-stable distribution and the idea
of data sketching via random projections to propose efficient
estimators of cardinality, $l_\alpha$ distances, and entropy over
streaming data.
The proposed algorithms have fast running times and small
storage requirements.
The resulting estimators are asymptotically efficient (or nearly so),
with good small-sample performance (shown via simulations);
they are recursively computable (for cardinality estimation) and have
tail bounds that decrease exponentially with k (for
cardinality and entropy estimation), which yields space
complexity bounds on the size of the data sketch.
Future work
Shannon entropy as a characterisation of a data stream, applied
to developing convergence diagnostics for monitoring extensive
MCMC simulations in Bayesian inference problems.
Data sketching via random projections as a tool for dimension
reduction in computationally expensive simulation algorithms
such as particle filtering and particle MCMC.
References
http://www.statslab.cam.ac.uk/~ioana