
Hierarchical Exploration for Accelerating Contextual Bandits

Yisong Yue yisongyue@cmu.edu


iLab, H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Sue Ann Hong sahong@cs.cmu.edu
Carlos Guestrin guestrin@cs.cmu.edu
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

Contextual bandit learning is an increasingly popular approach to optimizing recommender systems via user feedback, but can be slow to converge in practice due to the need for exploring a large feature space. In this paper, we propose a coarse-to-fine hierarchical approach for encoding prior knowledge that drastically reduces the amount of exploration required. Intuitively, user preferences can be reasonably embedded in a coarse low-dimensional feature space that can be explored efficiently, requiring exploration in the high-dimensional space only as necessary. We introduce a bandit algorithm that explores within this coarse-to-fine spectrum, and prove performance guarantees that depend on how well the coarse space captures the user's preferences. We demonstrate substantial improvement over conventional bandit algorithms through extensive simulation as well as a live user study in the setting of personalized news recommendation.

1. Introduction

User feedback (e.g., ratings and clicks) has become a crucial source of training data for optimizing recommender systems. When making recommendations, one must balance the needs for exploration (gathering informative feedback) and exploitation (maximizing estimated user utility). A common formalization of such a problem is the linear stochastic bandit problem (Li et al., 2010), which models user utility as a linear function of user and content features.

Unfortunately, conventional bandit algorithms can converge slowly with even moderately large feature spaces. For instance, the well-studied LinUCB algorithm (Dani et al., 2008; Abbasi-Yadkori et al., 2011) achieves a regret bound that is linear in the dimensionality of the feature space, which cannot be improved without further assumptions.¹

Intuitively, any bandit algorithm must make recommendations that cover the entire feature space in order to guarantee learning a reliable user model. Therefore, a common approach to dealing with slow convergence is dimensionality reduction based on prior knowledge, such as previously learned user profiles, by representing new users as linear combinations of "stereotypical users" (Li et al., 2010; Yue & Guestrin, 2011).

However, if a user deviates from stereotypical users, then a reduced space may not be expressive enough to adequately learn her preferences. The challenge lies in appropriately leveraging prior knowledge to reduce the cost of exploration for new users, while maintaining the representational power of the full feature space.

Our solution is a coarse-to-fine hierarchical approach for encoding prior knowledge. Intuitively, a coarse, low-rank subspace of the full feature space may be sufficient to accurately learn a stereotypical user's preferences. At the same time, this coarse-to-fine feature hierarchy allows exploration in the full space when a user is not perfectly modeled by the coarse space.

We propose an algorithm, CoFineUCB, that automatically balances exploration within the coarse-to-fine feature hierarchy. We prove regret bounds that depend on how well the user's preferences project onto the coarse subspace. We also present a simple and general method for constructing feature hierarchies using prior knowledge. We perform empirical validation through simulation as well as a live user study in personalized news recommendation, demonstrating that CoFineUCB can substantially outperform conventional methods utilizing only a single feature space.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

¹ The regret bound is information-theoretically optimal up to log factors (Dani et al., 2008).

2. The Learning Problem

We study the linear stochastic bandit problem (Abbasi-Yadkori et al., 2011), which formalizes a recommendation system as a bandit algorithm that iteratively performs actions and learns from rewards received per action. At each iteration t = 1, …, T, our algorithm interacts with the user as follows:

• The system recommends an item (i.e., performs an action) associated with feature vector x_t ∈ X_t ⊂ ℝ^D, which encodes content and user features.
• The user provides feedback (i.e., reward) ŷ_t.

[Figure 1. A visualization of a feature hierarchy, where w∗ denotes the user profile, and w̃∗ the projected user profile.]

Rewards ŷ_t are modeled as a linear function of actions x ∈ ℝ^D such that E[ŷ_t | x] = w∗⊤x, where the weight vector w∗ denotes the user's (unknown) preferences. We assume feedback to be independently sampled and bounded within [0, 1],² and that ‖x‖ ≤ 1 holds for all x. We quantify performance using the notion of regret, which compares the expected rewards of the selected actions versus the optimal expected rewards:

    R_T(w∗) = Σ_{t=1}^T w∗⊤x∗_t − w∗⊤x_t,    (1)

where x∗_t = argmax_{x ∈ X_t} w∗⊤x.³

We further suppose that user preferences are distributed according to some distribution W. We can then define the expected regret over W as

    R_T(W) = E_{w∗ ∼ W}[R_T(w∗)],    (2)

and the goal now for the bandit algorithm is to perform well with respect to W. We will present an approach for optimizing (2) given a collection of existing user profiles sampled i.i.d. from W.

3. Feature Hierarchies

To learn a reliable user model (i.e., a reliable estimate of w∗) from user feedback, bandit algorithms must make recommendations that explore the entire D-dimensional feature space. Conventional bandit algorithms such as LinUCB place uniform a priori importance on each dimension, which can be inefficient in practice, especially if additional structure can be assumed. We now motivate and formalize one such structure: the feature hierarchy.

For example, suppose that two of the D features correspond to interest in articles about baseball and cricket. Suppose also that our prior knowledge suggests that users are typically interested in one or the other, but rarely both. Then we can design a feature subspace where baseball and cricket topics project along opposite directions in a single dimension. A bandit algorithm leveraging this structure should, ideally, first explore at a coarse level to determine whether the user is more interested in articles about baseball or cricket.

We can formalize the different levels of exploration as a hierarchy that is composed of the full feature space and a subspace. We define a K-dimensional subspace using a matrix U ∈ ℝ^{D×K}, and denote the projection of action x ∈ ℝ^D into the subspace as

    x̃ ≡ U⊤x.

Likewise, we can write the user's preferences w∗ as

    w∗ = U w̃∗ + w∗_⊥,    (3)

where we call w∗_⊥ the residual, or orthogonal component, of w∗ w.r.t. U. Then,

    w∗⊤x = w̃∗⊤x̃ + w∗_⊥⊤x.

Figure 1 illustrates a feature hierarchy with a two-dimensional subspace. Here, w∗ projects well to the subspace, so we expect w∗⊤x ≈ w̃∗⊤x̃ (i.e., ‖w∗_⊥‖ is small). In such cases, a bandit algorithm can focus exploration on the subspace to achieve faster convergence.

3.1. Extension to Deeper Hierarchies

For the ℓ-th level, we define the projected w∗_ℓ via

    w∗_{ℓ−1} = U_ℓ w∗_ℓ + w∗_{ℓ,⊥}.

Then,

    w∗ = U_1(U_2(⋯(U_L w∗_L + w∗_{L,⊥})⋯ + w∗_{2,⊥}) + w∗_{1,⊥}.

² Our results also hold when each ŷ_t is independent with sub-Gaussian noise and mean w∗⊤x_t (see Appendix A).
³ Since the rewards are sampled independently, any guarantee on (1) translates into a high-probability guarantee on the regret of the observed feedback, Σ_{t=1}^T w∗⊤x∗_t − ŷ_t.
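The coarse-to-fine decomposition (3) of Section 3 and the corresponding utility split can be checked numerically. A minimal NumPy sketch (the dimensions and random vectors are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 10, 3

# Orthonormal subspace basis U (D x K) and an arbitrary user profile w*.
U, _ = np.linalg.qr(rng.standard_normal((D, K)))
w_star = rng.standard_normal(D)

# Coarse component w~* = U^T w* (valid because U has orthonormal columns),
# and the residual w*_perp orthogonal to the subspace.
w_tilde = U.T @ w_star
w_perp = w_star - U @ w_tilde

# Check w* = U w~* + w*_perp, and the utility decomposition
# w*^T x = w~*^T x~ + w*_perp^T x for an arbitrary action x.
x = rng.standard_normal(D)
x_tilde = U.T @ x
assert np.allclose(w_star, U @ w_tilde + w_perp)
assert np.isclose(w_star @ x, w_tilde @ x_tilde + w_perp @ x)
assert np.allclose(U.T @ w_perp, 0.0)  # residual lies outside span(U)
```

When ‖w∗_⊥‖ is small, most of the utility is explained by the K coarse coordinates, which is exactly the regime the hierarchy exploits.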

Algorithm 1 CoFineUCB
 1: input: λ, λ̃, U, c_t(·), c̃_t(·)
 2: for t = 1, …, T do
 3:   Define X_t ≡ [x_1, x_2, …, x_{t−1}]
 4:   Define X̃_t ≡ U⊤X_t
 5:   Define Y_t ≡ [ŷ_1, ŷ_2, …, ŷ_{t−1}]
 6:   M̃_t ← λ̃I_K + X̃_tX̃_t⊤
 7:   w̃_t ← M̃_t^{−1}X̃_tY_t⊤    // least squares on coarse level
 8:   M_t ← λI_D + X_tX_t⊤
 9:   w_t ← M_t^{−1}(X_tY_t⊤ + λUw̃_t)    // least squares on fine level
10:   Define μ_t(x) ≡ w_t⊤x
11:   x_t ← argmax_{x ∈ X_t} μ_t(x) + c_t(x) + c̃_t(x)    // play action with highest upper confidence bound
12:   Recommend x_t, observe reward ŷ_t
13: end for

For simplicity and practical relevance, we focus on two-level hierarchies.

4. Algorithm & Main Results

We now present a bandit algorithm that exploits feature hierarchies. Our algorithm, CoFineUCB, is an upper confidence bound algorithm that generalizes the well-studied LinUCB algorithm, and automatically trades off between exploring the coarse and full feature spaces. CoFineUCB is described in Algorithm 1. At each iteration t, CoFineUCB estimates the user's preferences in the subspace, w̃_t, as well as in the full feature space, w_t. Both estimates are solved via regularized least-squares regression. First, w̃_t is estimated via

    w̃_t = argmin_{w̃} Σ_{τ=1}^{t−1} (w̃⊤x̃_τ − ŷ_τ)² + λ̃‖w̃‖²,    (4)

where x̃_τ ≡ U⊤x_τ denotes the projected features of the action taken at time τ. Then w_t is estimated via

    w_t = argmin_w Σ_{τ=1}^{t−1} (w⊤x_τ − ŷ_τ)² + λ‖w − Uw̃_t‖²,    (5)

which regularizes w_t to the projection of w̃_t back into the full space. Both optimization problems have closed form solutions (Lines 7 & 9 in Algorithm 1).

CoFineUCB is an optimistic algorithm that chooses the action with the largest potential reward (given some target confidence). Selecting such an action requires computing confidence intervals around the mean estimate w_t. We maintain confidence intervals for both the full space and the subspace, denoted c_t(·) and c̃_t(·), respectively. Intuitively, a valid 1 − δ confidence interval should satisfy the property that

    |x⊤(w_t − w∗)| ≤ c_t(x) + c̃_t(x)    (6)

holds with probability at least 1 − δ. We will show that the following definitions of c_t(·) and c̃_t(·) yield a valid 1 − δ confidence interval:

    c̃_t(x) = α̃_t^{(v)} ‖U⊤M_t^{−1}x‖_{M̃_t^{−1}} + α̃_t^{(b)} ‖M̃_t^{−1}U⊤M_t^{−1}x‖,    (7)
    c_t(x) = α_t^{(v)} ‖x‖_{M_t^{−1}} + α_t^{(b)} ‖M_t^{−1}x‖,    (8)

where α̃_t^{(v)}, α̃_t^{(b)}, α_t^{(v)}, and α_t^{(b)} are coefficients that must be set properly (Lemma 1).

Broadly speaking, there are two types of uncertainty affecting an estimate, w_t⊤x, of the utility of x: variance and bias. In our setting, variance is due to the stochasticity of user feedback ŷ_t. Bias, on the other hand, is due to regularization when estimating w̃_t and w_t. Intuitively, as our algorithm receives more feedback, it becomes less uncertain (w.r.t. both bias and variance) of its estimates, w̃_t and w_t. This notion of uncertainty is captured via the inverse feature covariance matrices M̃_t and M_t (Lines 6 & 8 in Algorithm 1). Table 1 provides an interpretation of the four sources of uncertainty described in (7) and (8).

Lemma 1 below describes how to set the coefficients such that c_t(x) + c̃_t(x) is a valid 1 − δ confidence bound.

Lemma 1. Define S̃ = ‖w̃∗‖ and S_⊥ = ‖w∗_⊥‖, and let

    α_t^{(v)} = √( log( det(M_t)^{1/2} det(λI_D)^{−1/2} / δ ) ),
    α̃_t^{(v)} = λ √( log( det(M̃_t)^{1/2} det(λ̃I_K)^{−1/2} / δ ) ),
    α_t^{(b)} = √(2λ) S_⊥,
    α̃_t^{(b)} = √(λλ̃) S̃.

Then (6) is a valid 1 − δ confidence interval.

With the confidence intervals defined, we are now ready to present our main result on the regret bound.

Theorem 1. Define c̃_t(·) and c_t(·) as in (7), (8) and Lemma 1. For λ ≥ max_x ‖x‖² and λ̃ ≥ max_x ‖x̃‖², with probability 1 − δ, CoFineUCB achieves regret

    R_T(w∗) ≤ (β_T √D + β̃_T √K) √(2T log(1 + T)),

where

    β_T = √( D log((1 + T/λ)/δ) ) + √(2λ) S_⊥,    (9)
    β̃_T = √( K log((1 + T/λ̃)/δ) ) + √λ̃ S̃.    (10)

Lemma 1 and Theorem 1 are proved in Appendix A. Theorem 1 essentially bounds the regret as

    R_T(w∗) = O( ( √λ̃ ‖w̃∗‖ K + √(2λ) ‖w∗_⊥‖ D ) √T ),    (11)

ignoring log factors.
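Lines 6–9 of Algorithm 1 are two ridge regressions, the second shrunk toward the lifted coarse estimate Uw̃_t rather than toward zero. A minimal sketch of one estimation step (the synthetic data, dimensions, λ values, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, t = 20, 4, 500
lam = lam_tilde = 1.0

U, _ = np.linalg.qr(rng.standard_normal((D, K)))  # subspace basis, D x K
w_star = U @ rng.standard_normal(K)               # a user well modeled by the subspace

X = rng.standard_normal((D, t)) / np.sqrt(D)      # columns are past actions x_1..x_t
Y = w_star @ X + 0.01 * rng.standard_normal(t)    # noisy rewards y_1..y_t

X_tilde = U.T @ X                                        # Line 4: projected actions
M_tilde = lam_tilde * np.eye(K) + X_tilde @ X_tilde.T    # Line 6
w_tilde = np.linalg.solve(M_tilde, X_tilde @ Y)          # Line 7: coarse least squares
M = lam * np.eye(D) + X @ X.T                            # Line 8
w_t = np.linalg.solve(M, X @ Y + lam * (U @ w_tilde))    # Line 9: fine estimate,
                                                         # regularized toward U w_tilde
```

Setting the gradient of (5) to zero gives (X X⊤ + λI)w = XY⊤ + λUw̃_t, which is exactly the closed form on Line 9; with enough feedback the fine estimate approaches w∗, while early on it falls back toward the subspace estimate.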
Term                                      Interpretation
α_t^{(v)} ‖x‖_{M_t^{−1}}                  feedback variance in full space
α̃_t^{(v)} ‖U⊤M_t^{−1}x‖_{M̃_t^{−1}}      feedback variance in coarse space
α_t^{(b)} ‖M_t^{−1}x‖                     regularization bias in full space
α̃_t^{(b)} ‖M̃_t^{−1}U⊤M_t^{−1}x‖         regularization bias in coarse space

Table 1. Interpreting sources of uncertainty in (7), (8).

[Figure 2: illustration of confidence regions.] Figure 2. An example of confidence regions utilized by CoFineUCB and LinUCB. B denotes the ellipsoid confidence region used by LinUCB. CoFineUCB maintains two ellipsoid confidence regions, B̃ and B_⊥, for the subspace and full space, respectively. The joint confidence region of CoFineUCB is essentially the convolution of B̃ and B_⊥, B̃ ⊗ B_⊥, which can be much smaller than B.

In contrast, the conventional LinUCB algorithm only explores in the full feature space and achieves an analogous regret bound of

    R_T(w∗) = O( √λ ‖w∗‖ D √T ).    (12)

Comparing (11) with (12) suggests that, when K ≪ D and ‖w∗_⊥‖ is small, CoFineUCB suffers much less regret due to more efficient exploration. Depending on U, ‖w̃∗‖ can also be much smaller than ‖w∗‖. Section 5 describes an approach for computing such a U.

Intuitively, CoFineUCB enjoys a superior regret bound to LinUCB due to its use of tighter confidence regions. Figure 2 depicts a comparative example. LinUCB employs ellipsoid confidence regions. CoFineUCB utilizes confidence regions that are essentially the convolution of two smaller ellipsoids, which can be much smaller than the confidence regions of LinUCB.

5. Constructing Feature Hierarchies

We now show how to construct a subspace U using pre-existing user profiles W = {w∗_i}_{i=1}^N, where each profile is sampled independently from a common distribution, w∗_i ∼ W. In this setting, a reasonable objective is to find a U that minimizes an empirical estimate of the bound on R_T(W), which comprises ‖w̃‖ and ‖w_⊥‖. Our approach is outlined in Algorithm 2. We assume that finding a K-dimensional subspace with low residual norms ‖w_⊥‖ is straightforward. In our experiments, we simply use the top K singular vectors of W.

Algorithm 2 LearnU: learning projection matrix
1: input: W ∈ ℝ^{D×N}, K ∈ {1, …, D}
2: (A, Σ, B) ← SVD(W)
3: U_0 ← A_{1:K}    // top K singular vectors
4: Solve for Ω via (16) using U_0 and W
5: return: U_0 Ω^{1/2}

Given an orthonormal basis U_0 ∈ ℝ^{D×K}, one can choose U ∈ span(U_0) to minimize its total contribution to the regret bound in (11) over the users in W:

    argmin_{U ∈ span(U_0)} C̃ Σ_{w ∈ W} ‖w̃‖,    (13)

where w̃ ≡ (U⊤U)^{−1}U⊤w, and C̃ = max_x ‖U⊤x‖ constrains the magnitude of U.

It is difficult to optimize (13) directly, so we approximate it using a smooth formulation,⁴

    argmin_{U ∈ span(U_0) : ‖U‖²_Fro = K} Σ_{w ∈ W} ‖w̃‖²,    (14)

where we now constrain U via ‖U‖²_Fro = K.

We further restrict U to be U ≡ U_0 Ω^{1/2} for Ω ⪰ 0. Under this restriction, (14) is equivalent to

    argmin_{Ω : trace(Ω) = K} Σ_{w ∈ W} w̃_0⊤ Ω^{−1} w̃_0,    (15)

where w̃_0 ≡ (U_0⊤U_0)^{−1}U_0⊤w = U_0⊤w. This formulation is akin to multi-task structure learning, where W_0 would denote the various tasks and Ω denotes feature relationships common across tasks (Argyriou et al., 2007; Zhang & Yeung, 2010). One can show that (15) is convex and is minimized by

    Ω = K (W̃_0 W̃_0⊤)^{1/2} / trace( (W̃_0 W̃_0⊤)^{1/2} ),    (16)

where W̃_0 ≡ (U_0⊤U_0)^{−1}U_0⊤W = U_0⊤W. See Appendix B for a more detailed derivation.

⁴ One can also regularize by inserting an axis-aligned "ridge" into W (i.e., W ← [W, I_D]).

6. Experiments

We evaluate CoFineUCB via both simulations and a live user study in the personalized news recommendation domain. We first describe alternative methods, or baselines, for leveraging prior knowledge (pre-existing profiles W ∈ ℝ^{D×N}) that do not use a feature hierarchy. These baselines can conceptually be phrased as special cases of CoFineUCB. The key idea is to alter the feature space such that ‖w∗‖ in the new space is small.
Thus, running LinUCB in the altered feature space yields an improved bound on the regret (12), which is linear in ‖w∗‖.

6.1. Baseline Approaches

Mean-Regularized. One simple approach is to regularize to w̄ (e.g., the mean of W) when estimating w_t in LinUCB. The estimation problem can be written as

    w_t = argmin_w Σ_{τ=1}^{t−1} (w⊤x_τ − ŷ_τ)² + λ‖w − w̄‖².    (17)

Typically, ‖w∗ − w̄‖ < ‖w∗‖, implying lower regret.

Reshape. Another approach is to use LinUCB with a feature space "reshaped" via a transform U_D ∈ ℝ^{D×D}:

    w_t = argmin_w Σ_{τ=1}^{t−1} (w⊤U_D⊤x_τ − ŷ_τ)² + λ‖w‖².    (18)

As in the mean-regularization approach above, here we would like the representation of w∗ in the reshaped space to have a small norm. In our experiments, we use U_D = LearnU(W, D) (Algorithm 2).

We can incorporate such reshaping into CoFineUCB. We first project W into the space defined by U_D, denoted by Ŵ,⁵ then compute U via LearnU(Ŵ, K). During model estimation, we replace (5) with

    w_t = argmin_w Σ_{τ=1}^{t−1} (w⊤U_D⊤x_τ − ŷ_τ)² + λ‖w − Uw̃_t‖².

Incorporating reshaping into CoFineUCB can lead to a decrease in S_⊥ = ‖ŵ∗_⊥‖. We found the modification to be quite effective in practice; all our experiments in the following sections employ this variant of CoFineUCB.

⁵ Ŵ ≡ (U_D⊤U_D)^{−1}U_D⊤W.

SubspaceUCB. Finally, we can simply ignore the full space and only apply LinUCB in the subspace. While the method seems to perform well given a good subspace (as seen in (Li et al., 2010; Chapelle & Li, 2011; Yue & Guestrin, 2011), among others), it can yield linear regret if the residual of the user's preferences is strong, as we will see in the experiments.

6.2. Experimental Setting

We employ the submodular bandit extension of linear stochastic bandits (Yue & Guestrin, 2011) to model the news recommendation setting. Here, the algorithm must choose a set of L actions and receives rewards based on both the quality as well as the diversity of the actions chosen (L = 1 is the conventional bandit setting). Using this structured action space leads to a more realistic setting for content recommendation, since recommender systems often must recommend multiple items at a time. It is straightforward to extend CoFineUCB to the submodular bandit setting (see Appendix C).

6.3. Simulations

We performed simulation evaluations using data collected from a previous user study in personalized news recommendation by Yue & Guestrin (2011). The data includes featurized articles (D = 100) and N = 77 user profiles. We employed leave-one-out validation: for each user, the transformations U_D and U (K = 5) were trained using the remaining users' profiles. For each user, we ran 25 simulations (T = 10000). All algorithms used the same U and U_D projections, where applicable. We also compared with a variant of CoFineUCB, CoFineUCB-focus, which scales down exploration in the full space c_t by a factor of 0.25.

Figure 3(a) shows the cumulative regret of each algorithm averaged over all users when recommending one article per iteration (L = 1). All algorithms dramatically outperform Naive LinUCB, with the exception of Mean-Regularized, which performs almost identically. While Reshape shows good eventual convergence behavior, it incurs higher initial regret than the CoFineUCB algorithms and SubspaceUCB. The trends also hold when recommending multiple articles per iteration (L = 5), as seen in Figure 3(b).

The performance of the two variants of CoFineUCB and SubspaceUCB demonstrates the benefit of exploring in the subspace. However, Figure 3(c) reveals the critical shortfall of SubspaceUCB by comparing average cumulative regret for the ten users with the largest residual ‖w_⊥‖. For these atypical users, the subspace is not sufficient to adequately learn their preferences, resulting in linear regret for SubspaceUCB.

Figure 3(d) shows the behavior of CoFineUCB as we vary K. Larger subspaces require more exploration, which in general leads to increased regret.

Figure 3(e) shows the behavior of CoFineUCB as we vary the scaling of exploration in the full space c_t (CoFineUCB-focus is the special case where the scaling factor is 0.25). More conservative exploration in the full space tends to reduce regret. However, no exploration of the full space can lead to higher regret.

Synthetic Dataset. We used a 25-dimensional synthetic dataset to study the effect of mismatch between
[Figure 3: six panels of cumulative regret curves (log scale) over 10000 iterations, comparing Naive, Mean-Regularized, Reshape, CoFineUCB, SubspaceUCB, and CoFineUCB-focus.
(a) All users simulation (L = 1). (b) All users simulation (L = 5). (c) Atypical users simulation (L = 5).
(d) CoFineUCB over varying K. (e) CoFineUCB over varying c_t. (f) Subspace mismatch versus regret.]

Figure 3. (a)–(e) Cumulative regret results for news recommendation simulation. (f) Comparison over preference vectors with varying projection residuals using synthetic simulation.

w∗ and U. This dataset allows for a more systematic analysis by forcing every x and w∗ to have unit norm. For residual magnitude β ∈ [0, 1], we sampled w∗ uniformly in a 5-dimensional subspace with magnitude √(1 − β²), and uniformly in the remaining dimensions with magnitude β. Figure 3(f) shows that the regret of both SubspaceUCB and CoFineUCB-focus increases with the residual, with SubspaceUCB exhibiting a more dramatic increase, beyond that of even Naive LinUCB.

Comparison              #Users   Win/Tie/Lose   Gain/Day
CoFineUCB v. Naive      27       24 / 1 / 3     0.69
CoFineUCB v. Reshape    30       21 / 3 / 6     0.27

Table 2. User study comparing CoFineUCB with two baselines. All results satisfy 95% statistical confidence.
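The synthetic profiles above can be generated by sampling the two components separately and scaling them; a sketch of one way to do it (assuming, for illustration, that the coarse subspace spans the first 5 coordinates):

```python
import numpy as np

def sample_profile(beta, D=25, K=5, rng=None):
    """Unit-norm w* whose residual outside the first K coordinates has norm beta."""
    if rng is None:
        rng = np.random.default_rng()
    w = np.zeros(D)
    sub = rng.standard_normal(K)       # direction inside the subspace
    res = rng.standard_normal(D - K)   # direction in the residual space
    w[:K] = np.sqrt(1.0 - beta**2) * sub / np.linalg.norm(sub)
    w[K:] = beta * res / np.linalg.norm(res)
    return w

w = sample_profile(0.3)
assert np.isclose(np.linalg.norm(w), 1.0)       # unit norm overall
assert np.isclose(np.linalg.norm(w[5:]), 0.3)   # residual magnitude beta
```

Since the two components are orthogonal, ‖w∗‖² = (1 − β²) + β² = 1 for every β, so the residual magnitude is the only quantity varied across runs.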
6.4. User Study

Our user study design follows the study conducted in (Yue & Guestrin, 2011). We presented each user with ten articles per day over ten days from January 21, 2012 to February 8, 2012. Each day comprised approximately ten thousand articles. We represented articles using D = 100 features corresponding to topics learned via latent Dirichlet allocation (Blei et al., 2003). For each day, the articles shown were selected using an interleaving of two bandit algorithms. Each user was instructed to briefly skim each article and mark it as "interested in reading in detail" or "not interested".

We conducted the user study in two phases. Prior to the first phase, we conducted a preliminary study to collect preferences for constructing U (K = 5). In the first phase, we compared CoFineUCB with Naive. Afterwards, we took all the user profiles learned so far to estimate a reshaping of the full space U_D, and compared against Reshape. Due to the short duration of each session (T = 10), we did not expect a meaningful comparison between CoFineUCB and SubspaceUCB, so we omitted it (we expect both methods to perform equally well in early iterations, as seen in the simulation experiments). For each user session, we counted the total number of liked articles recommended by each algorithm. An algorithm wins a session if the user liked more of the articles it recommended.

Table 2 shows that over the two stages, about 80% of the users prefer CoFineUCB. We see a smaller gain against Reshape, a stronger baseline. On average, users liked over half an additional article per day from CoFineUCB over Naive, and about a quarter additional per day over Reshape. These results show that CoFineUCB is effective in reducing the amount of exploration required.

[Figure 4: four subspace dimensions rendered as word clouds, with per-dimension weight bars for CoFineUCB and Naive LinUCB.]

Figure 4. Each column of word clouds represents a dimension in the subspace. The bar lengths denote the magnitude in each dimension of preference vectors learned by CoFineUCB (blue) and Naive LinUCB (red). The rightmost column shows the norm of the residual w∗_⊥ of the weight vectors learned by CoFineUCB and Naive LinUCB.

Figure 4 shows a representation of four dimensions of U learned from user profiles. Each dimension is a combination of features, i.e., topics from LDA. In the top row, the i-th word cloud contains representative words from topics associated with high positive weights in the i-th column of U, and the bottom row those with high negative weights. Examining Figure 4 can reveal tendencies in the user preferences collected in our study; for example, the third column shows that users interested in Republican politics also tend to follow healthcare debates, but tend to be uninterested in videogaming. Figure 4 also shows a comparison of weights estimated by CoFineUCB and Naive LinUCB for one user. Since Naive LinUCB does not utilize the subspace, the weights it estimates tend to have a much higher residual norm, whereas CoFineUCB puts higher weight on the subspace dimensions.

7. Related Work

Optimizing recommender systems via user feedback has become increasingly popular in recent years (El-Arini et al., 2009; Li et al., 2010; 2011; Yue & Guestrin, 2011; Ahmed et al., 2012). Most prior work does not address the issue of exploration and often trains with pre-collected feedback, which may lead to a biased model.

The exploration-exploitation tradeoff inherent in learning from user feedback is naturally modeled as a contextual bandit problem (Langford & Zhang, 2007; Li et al., 2010; Slivkins, 2011; Chapelle & Li, 2011; Krause & Ong, 2011). In contrast to most prior work, we focus on principled approaches for encoding prior knowledge for accelerated bandit learning.

Our work builds upon a long line of research on linear stochastic bandits (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011). Although often practical, one limitation is the assumption of realizability; in other words, we assume that the true model of user behavior lies within our class.

The use of hierarchies in bandit learning is not new. For instance, the work of Pandey et al. (2007b;a) encodes prior knowledge by hierarchically clustering articles into a taxonomy. However, their setting is feature-free, which can make it difficult to generalize to new articles and users. In contrast, our approach makes use of readily available feature-based prior knowledge, such as the learned preferences of existing users.

Another related line of work is that of sparse linear bandits (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). The assumption there is that the true w∗ is sparse, and one can achieve regret bounds that depend on the sparsity of w∗. In contrast, we consider settings where user profiles are not necessarily sparse, but can be well-approximated by a low-rank subspace.

It may be possible to integrate our feature hierarchy approach with other bandit learning algorithms, such as Thompson Sampling (Chapelle & Li, 2011). Thompson Sampling is a probability matching algorithm that samples w_t from the posterior distribution. Using feature hierarchies, one can define a hierarchical sampling approach that first samples w̃_t in the subspace, and then samples w_t around w̃_t in the full space.

Our approach can be applied to many structured classes of bandit problems (e.g., (Streeter & Golovin, 2008; Cesa-Bianchi & Lugosi, 2009)), assuming that actions can be featurized and modeled linearly.
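The hierarchical sampling idea suggested above (draw w̃_t in the subspace, then draw w_t around Uw̃_t in the full space) could be sketched as follows; the Gaussian posteriors and scale parameters here are illustrative assumptions, not a construction from the paper:

```python
import numpy as np

def hierarchical_ts_draw(w_tilde_hat, M_tilde, U, s_coarse, s_fine, rng):
    """Sample w~_t from a subspace posterior, then w_t around U w~_t (illustrative)."""
    D = U.shape[0]
    # Hypothetical coarse-level posterior: Gaussian with precision M_tilde.
    cov_coarse = s_coarse**2 * np.linalg.inv(M_tilde)
    w_tilde = rng.multivariate_normal(w_tilde_hat, cov_coarse)
    # Fine-level draw centered on the lifted coarse sample.
    return rng.multivariate_normal(U @ w_tilde, s_fine**2 * np.eye(D))

rng = np.random.default_rng(4)
D, K = 10, 3
U, _ = np.linalg.qr(rng.standard_normal((D, K)))
w_t = hierarchical_ts_draw(np.zeros(K), np.eye(K), U, 1.0, 0.1, rng)
assert w_t.shape == (D,)
```

Making s_fine small relative to s_coarse would concentrate exploration near the subspace, mirroring the role of the coarse confidence term in CoFineUCB.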
For instance, our experiments demonstrated substantial improvements upon naive UCB algorithms for the linear submodular bandit problem (Yue & Guestrin, 2011).

The problem of learning a good subspace U is related to finding a good regularization structure for multi-task learning (Argyriou et al., 2007; Zhang & Yeung, 2010). Given a sample of user profiles (task weights), our goal is essentially to learn a regularization structure so that future users (tasks) are solved efficiently. However, the coarse subspace of our feature hierarchy was estimated using a relatively small number of imperfectly estimated existing user profiles. A more general problem would be to learn the feature hierarchy on the fly, as an online learning problem in itself.

8. Conclusion

We have presented a general approach to encoding prior knowledge for accelerating contextual bandit learning. In particular, our approach employs a coarse-to-fine feature hierarchy which dramatically reduces the amount of exploration required. We evaluated our approach in the setting of personalized news recommendation, where we showed significant improvements over existing approaches for encoding prior knowledge.

Acknowledgements. The authors thank the anonymous reviewers for their helpful comments. The authors also thank Khalid El-Arini for help with data collection and processing. This work was supported in part by ONR (PECASE) N000141010672, ONR Young Investigator Program N00014-08-1-0752, and by the Intel Science and Technology Center for Embedded Computing.

References

Abbasi-Yadkori, Yasin, Pál, David, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Neural Information Processing Systems (NIPS), 2011.

Abbasi-Yadkori, Yasin, Pál, David, and Szepesvári, Csaba. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Ahmed, Amr, Teo, Choon Hui, Vishwanathan, S.V.N., and Smola, Alexander. Fair and balanced: Learning to present news stories. In ACM Conference on Web Search and Data Mining (WSDM), 2012.

Argyriou, Andreas, Micchelli, Charles A., Pontil, Massimiliano, and Ying, Yiming. A spectral regularization framework for multi-task structure learning. In Neural Information Processing Systems (NIPS), 2007.

Blei, David, Ng, Andrew, and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR), 3:993–1022, 2003.

Carpentier, Alexandra and Munos, Rémi. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Cesa-Bianchi, Nicolò and Lugosi, Gábor. Combinatorial bandits. In Conference on Learning Theory (COLT), 2009.

Chapelle, Olivier and Li, Lihong. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems (NIPS), 2011.

Dani, Varsha, Hayes, Thomas, and Kakade, Sham. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory (COLT), 2008.

El-Arini, Khalid, Veda, Gaurav, Shahaf, Dafna, and Guestrin, Carlos. Turning down the noise in the blogosphere. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2009.

Krause, Andreas and Ong, Cheng Soon. Contextual Gaussian process bandit optimization. In Neural Information Processing Systems (NIPS), 2011.

Langford, John and Zhang, Tong. The epoch-greedy algorithm for contextual multi-armed bandits. In Neural Information Processing Systems (NIPS), 2007.

Li, Lei, Wang, Dingding, Li, Tao, Knox, Daniel, and Padmanabhan, Balaji. SCENE: A scalable two-stage personalized news recommendation system. In ACM Conference on Information Retrieval (SIGIR), 2011.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert. A contextual-bandit approach to personalized news article recommendation. In World Wide Web Conference (WWW), 2010.

Pandey, Sandeep, Agarwal, Deepak, Chakrabarti, Deepayan, and Josifovski, Vanja. Bandits for taxonomies: A model-based approach. In SIAM Conference on Data Mining (SDM), 2007a.

Pandey, Sandeep, Chakrabarti, Deepayan, and Agarwal, Deepak. Multi-armed bandit problems with dependent arms. In International Conference on Machine Learning (ICML), 2007b.

Rusmevichientong, Paat and Tsitsiklis, John. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Slivkins, Aleksandrs. Contextual bandits with similarity information. In Conference on Learning Theory (COLT), 2011.

Streeter, Matthew and Golovin, Daniel. An online algorithm for maximizing submodular functions. In Neural Information Processing Systems (NIPS), 2008.

Yue, Yisong and Guestrin, Carlos. Linear submodular bandits and their application to diversified retrieval. In Neural Information Processing Systems (NIPS), 2011.

Zhang, Yu and Yeung, Dit-Yan. A convex formulation for learning task relationships in multi-task learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
