
Sampling events with given probability

Alcides Viamontes Esquivel

January 6, 2010

Many computer simulation techniques require randomly choosing an event with a given probability from a set of (event, probability) pairs. This paper reviews an O(n) method and an O(log n) method for achieving that.

Keywords: random sampling with probability, random sampling, tree, intrusive container

    Event name    Probability weight wi
    e1            0.3
    e2            0.2
    e3            0.6
    e4            0.1
    e5            0.9
    e6            0.05

Table 1: A simple dataset. The first column portrays event names, and the second one the "probability weights". The actual probability pi is taken as the normalized value pi = wi / Σj wj.

1 The problem

Suppose that a simulation algorithm maintains a set of pairs E = {(e1, p1), (e2, p2), ..., (en, pn)}, where ei denotes a choice that should be taken with probability pi. Good-quality pseudo-random number generators are available in almost all programming environments. The question is then: how can a random number generator be used to sample events with a given probability?

Two cases should be accounted for. First, there is the case where the set E is built, an event is sampled, and then E is replaced by a whole new set E′. When this happens, construction of the set E has cost proportional to its size, so there is no point in looking for anything better than O(n) time.

The second scenario is as follows: an event is sampled and then E is modified in a discrete fashion, i.e., some pairs of E are added or removed, or the associated probabilities (the pi values) change. If the number m of pairs that change is not related to the size n of E, then it is possible to build an O(m × log n) algorithm for maintaining and sampling the set.

Section 2 details the rather trivial algorithm for drawing in O(n) time, and section 3 the one for drawing in O(log n) time.

2 The linear time algorithm

In both this and the following section we will use the dataset in table 1.

[Figure 1: Event probability is represented by sub-interval length. The weights of table 1 are laid end to end over the interval [0, 2.15): +0.3 = 0.3, +0.2 = 0.5, +0.6 = 1.1, +0.1 = 1.2, +0.9 = 2.1, +0.05 = 2.15.]

The linear time algorithm just lays out the events' probabilities as sub-intervals of a continuous segment, and samples a random number uniformly in that segment; see figure 1. The length of each sub-interval is proportional to the probability of the event it represents.

Sampling is a simple process. A random number x is generated with uniform probability in the interval [0, Σi wi), and the sub-interval ri = [ai, bi) is located such that ai ≤ x < bi. This process requires linearly scanning the segments of the interval up to the one containing x, so it has a worst-case cost proportional to the number of events/segments.

3 The log-time algorithm

This is a binary tree-based algorithm. As such, it has O(n) size complexity, O(m × log n) time complexity for updating the tree after m changes, and O(log n) complexity for sampling an event from the event space. We will confirm that in the following paragraphs. It is basically an adaptation of the segment technique of the previous section, but using cumulative marks in the tree nodes.

As figure 2 shows, each node in the tree contains both the probability weight of the event it represents and the cumulative weight of its entire subtree. The event id is used as the key for sorting the tree. A balancing policy has to be implemented, but there is well-established practice in the form of concrete tree types for that [1]. Figure 2 shows a red-black tree, but other binary tree kinds may be used.

[Figure 2: A red/black tree containing self-weights (sw) of events and the cumulative weights (tw) of the entire subtree. Root: id=2 (tw=2.15, sw=0.20), with children id=1 (tw=0.30, sw=0.30) and id=4 (tw=1.65, sw=0.10); id=4 has children id=3 (tw=0.60, sw=0.60) and id=5 (tw=0.95, sw=0.90); id=5 has child id=6 (tw=0.05, sw=0.05).]

The insertion algorithm is the same one typically used for the kind of tree chosen, with a small intrusive addition for marking nodes whose children change during the insertion process. Child switches are caused both by the insertion of the new node itself and by the usual re-balancing operations of the tree insertion algorithm. In steps, to insert a new element we need to do the following:

1. Insert the new node in the tree as usual for the binary tree kind chosen, with the extra step of marking all the nodes in the tree (including the newly inserted one) whose right/left children change. Those nodes will be denoted by S.

2. Build a superset K of S containing

also all the ancestors of nodes s ∈ S. The set K represents all the nodes whose cumulative weight has to be updated. The size of K is proportional to the height of the tree, and thus both building it and updating the cumulative probability weight values have O(log n) complexity.

3. Compute a topological sort Ksorted = κ1, κ2, ..., κ|K| of the set K such that leaf nodes come first (mathematically: for i < j, κi is never an ancestor of κj in the tree).

4. For each node in Ksorted, in that order, recompute the cumulative weight by summing the self-weight with the cumulative weights of the left and right subtrees.

Step 3 ensures that cumulative weights are updated in the right order.

Sampling from the given probability pairs is done following the tree structure. Algorithm 1 shows the detailed steps.

4 Implementation notes for C++

The main problem in implementing the algorithms of the previous section is the balanced tree logic: handling tree rotations for a given tree model might not be trivial. Fortunately, library implementations of many tree algorithms are available in quite a few languages. In C++, the standard library already includes a tree-based dictionary implementation, but it is not feasible to use it for this paper's purpose without changes to its source code, due to the need of tracking tree structure changes. However, there is another, very versatile header-only library which also includes binary tree algorithms; see [2].

It is straightforward to adapt a client data structure to be used with an intrusive tree. Further, since a tree is a particular kind of directed graph, it is also possible to adapt such a data structure to work with boost.graph algorithms, and then just use the topological sort there.

References

[1] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.

[2] Ion Gaztanaga and Olaf Krzikalla. Boost.Intrusive. Boost.

Data: root_node, pointer to the root of the red-black tree
Data: fact, a real number in [0, 1) from some (uniform) randomness source
begin
    s = fact * (root_node -> tree_prb_weight) ;
    current_node = root_node ;
    if current_node -> left is not null then
        a = current_node -> left -> tree_prb_weight ;
    else
        a = 0 ;
    b = a + current_node -> self_prb_weight ;
    while s < a or s ≥ b do
        if s < a then
            current_node = current_node -> left ;
        else
            s = s − b ;
            current_node = current_node -> right ;
        if current_node -> left is not null then
            a = current_node -> left -> tree_prb_weight ;
        else
            a = 0 ;
        b = a + current_node -> self_prb_weight ;
    return current_node -> id
end

Algorithm 1: Sampling with a given probability
