Part of $Y$ is observed: $Y = W$, where $W = W_{\mathrm{user}} W_{\mathrm{movie}}$ (low rank).
Linear model with a matrix-valued input $X$:
$$y = \langle W, X \rangle + b$$
Problem formulation
Primal problem:
$$\min_{W \in \mathbb{R}^{R \times C}} \; \underbrace{f_\ell(\mathcal{A}(W)) + \varphi_\lambda(W)}_{=:\, f(W)}$$
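For concreteness, one instance covered by this formulation (an illustrative example, combining the matrix-completion setting above with the trace-norm regularizer used later) is squared-loss matrix completion:
$$\min_{W \in \mathbb{R}^{R \times C}} \; \frac{1}{2}\sum_{(i,j)\in\Omega}\bigl(Y_{ij} - W_{ij}\bigr)^2 + \lambda \lVert W \rVert_* ,$$
where $\mathcal{A}$ reads out the observed entries $\Omega$, $f_\ell$ is the squared loss, and $\varphi_\lambda(W) = \lambda\lVert W\rVert_*$.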
Outline
1 Introduction
Why learn low-rank matrices?
Existing algorithms
2 Proposed algorithm
Augmented Lagrangian ↔ Proximal minimization
Super-linear convergence
Generalizations
3 Experiments
Synthetic 10,000×10,000 matrix completion.
BCI classification problem (learning multiple matrices).
4 Summary
M-DAL algorithm
1. Initialize $W^0$. Choose $\eta_1 \le \eta_2 \le \cdots$
2. Iterate until relative duality gap $< \epsilon$:
$$\alpha^t := \mathop{\mathrm{argmin}}_{\alpha \in \mathbb{R}^m} \phi_t(\alpha) \quad \text{(inner objective)}$$
$$W^{t+1} = \mathrm{ST}_{\lambda\eta_t}\!\bigl(W^t + \eta_t \mathcal{A}^\top(\alpha^t)\bigr)$$
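To make the two steps concrete, here is a minimal NumPy sketch for squared-loss matrix completion, with $\mathrm{ST}_\tau$ denoting singular-value soft-thresholding. It is purely illustrative: the plain gradient-descent inner solver with a fixed iteration budget stands in for the Newton-based inner minimization and duality-gap stopping of the actual algorithm, and the helper names (`svt`, `mdal_matrix_completion`) are hypothetical.

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding ST_tau(M)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def mdal_matrix_completion(Y, mask, lam, etas, n_inner=50):
    """M-DAL sketch for  min_W 0.5*||P_Omega(Y - W)||^2 + lam*||W||_*.

    A(W) reads out the observed entries (mask); A^T(alpha) scatters
    alpha back onto those entries.  For the squared loss,
    f_l*(-alpha) = 0.5*||alpha||^2 - <alpha, y>, hence
    grad phi_t(alpha) = alpha - y + A(ST_{lam*eta}(W + eta*A^T(alpha))).
    """
    W = np.zeros_like(Y)
    y = Y[mask]
    alpha = np.zeros_like(y)
    for eta in etas:                      # eta_1 <= eta_2 <= ...
        step = 1.0 / (1.0 + eta)          # Lipschitz bound on grad phi_t
        for _ in range(n_inner):          # inner solver: gradient descent
            V = W.copy()
            V[mask] += eta * alpha        # W^t + eta * A^T(alpha)
            grad = alpha - y + svt(V, lam * eta)[mask]
            alpha -= step * grad
        V = W.copy()
        V[mask] += eta * alpha
        W = svt(V, lam * eta)             # outer update: W^{t+1}
    return W
```

An increasing schedule such as `etas = [1, 2, 4, 8, 16]` mirrors the choice $\eta_1 \le \eta_2 \le \cdots$ above.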
Definition
$W^*$: the unique minimizer of the objective $f(W)$.
$W^t$: sequence generated by the M-DAL algorithm with the inner minimization carried out to the precision
$$\lVert \nabla\phi_t(\alpha^t)\rVert \le \sqrt{\frac{\gamma}{\eta_t}}\, \lVert W^{t+1} - W^t\rVert_{\mathrm{fro}}$$
($1/\gamma$: Lipschitz constant of $\nabla f_\ell$).
Assumption
There is a constant $\sigma > 0$ such that
$$f(W^t) - f(W^*) \ge \sigma \lVert W^t - W^*\rVert_{\mathrm{fro}}^2 .$$
[Figure: $\varphi_\lambda^*(w)$ and its envelope $\Phi_\lambda^*(w)$; axis marks at $-\lambda$, $0$, $\lambda$.]
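Under this assumption and the inner-precision condition above, the cited analysis (Tomioka et al., arXiv:0911.4046) yields the contraction
$$\lVert W^{t+1} - W^*\rVert_{\mathrm{fro}} \;\le\; \frac{1}{\sqrt{1 + 2\sigma\eta_t}}\, \lVert W^t - W^*\rVert_{\mathrm{fro}},$$
so as $\eta_t \to \infty$ the contraction factor vanishes: super-linear convergence.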
Generalizations
1 Learning multiple matrices
$$\varphi_\lambda(W) = \lambda \sum_{k=1}^{K} \lVert W^{(k)}\rVert_* = \lambda \left\lVert \begin{pmatrix} W^{(1)} & & 0 \\ & \ddots & \\ 0 & & W^{(K)} \end{pmatrix} \right\rVert_*$$
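The second equality holds because the singular values of a block-diagonal matrix are exactly the union of the blocks' singular values, so both the norm and its proximal (soft-thresholding) step decouple over blocks. A small self-contained sketch of that decoupled update (hypothetical helper names, same `svt` as in the earlier sketch):

```python
import numpy as np

def svt(M, tau):
    """Soft-threshold the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def svt_blockdiag(blocks, tau):
    """ST_tau of blkdiag(W^(1), ..., W^(K)) == blockwise ST_tau,
    since the block-diagonal matrix's singular values are the
    union of the blocks' singular values."""
    return [svt(W, tau) for W in blocks]
```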
[Figure: synthetic 10,000×10,000 matrix completion. Left: $m$ = 1,200,000 observations (Rank = 10); right: $m$ = 2,400,000 (Rank = 20). Computation time (s) and estimated rank plotted against the regularization constant $\lambda$.]
Comparison
[Figure: relative duality gap ($10^0$ down to $10^{-6}$) vs. number of iterations (left) and time in seconds (right), comparing M-DAL (proposed), AG, PG, and IP (interior point). M-DAL exhibits super-linear convergence.]
[Figure: BCI classification. Number of iterations, time (s), and accuracy (%) plotted against the regularization constant.]
Summary
Background
Convex conjugate of $f$ (Fenchel dual):
$$f^*(V) = \sup_{W}\bigl(\langle W, V\rangle - f(W)\bigr)$$
Inf-convolution (envelope function):
$$\Phi_\lambda^*(V; W^t) = \Bigl(\varphi_\lambda + \tfrac{1}{2}\lVert \cdot - W^t\rVert^2\Bigr)^{\!*}(V) = \inf_{Y \in \mathbb{R}^{R\times C}} \Bigl( \varphi_\lambda^*(Y) + \tfrac{1}{2}\lVert W^t + V - Y\rVert^2 \Bigr) = \Phi_\lambda^*(W^t + V; 0) = \Phi_\lambda^*(W^t + V)$$
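A standard scalar example for intuition (the $\ell_1$ case rather than the trace norm): for $\varphi_\lambda(w) = \lambda|w|$, the conjugate $\varphi_\lambda^*$ is the indicator of $[-\lambda, \lambda]$, and
$$\Phi_\lambda^*(v) = \min_{|y| \le \lambda} \tfrac{1}{2}(v - y)^2 = \tfrac{1}{2}\bigl(\max(|v| - \lambda, 0)\bigr)^2 = \tfrac{1}{2}\,\mathrm{ST}_\lambda(v)^2,$$
which is differentiable with $\nabla\Phi_\lambda^*(v) = \mathrm{ST}_\lambda(v)$.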
[Figure: left, the nondifferentiable $\varphi_\lambda(x)$ and its envelope $\Phi_\lambda(x)$; right, $\varphi_\lambda^*(v)$ and the differentiable $\Phi_\lambda^*(v)$.]
$$W^{t+1} = \mathrm{ST}_{\lambda\eta_t}\!\bigl(W^t + \eta_t \mathcal{A}^\top(\alpha^t)\bigr),$$
where
$$\alpha^t = \mathop{\mathrm{argmin}}_{\alpha} \underbrace{ f_\ell^*(-\alpha) + \overbrace{\frac{1}{2\eta_t}\bigl\lVert \mathrm{ST}_{\lambda\eta_t}\bigl(W^t + \eta_t \mathcal{A}^\top(\alpha)\bigr)\bigr\rVert^2 }^{\frac{1}{\eta_t}\Phi^*_{\lambda\eta_t}(W^t + \eta_t \mathcal{A}^\top(\alpha))} }_{\phi_t(\alpha)}$$
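Differentiating (a step spelled out here for clarity; it follows from $\nabla\Phi^*_{\lambda\eta_t} = \mathrm{ST}_{\lambda\eta_t}$, as in the scalar example above):
$$\nabla\phi_t(\alpha) = -\nabla f_\ell^*(-\alpha) + \mathcal{A}\bigl(\mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t \mathcal{A}^\top(\alpha))\bigr).$$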
$\phi_t(\alpha)$ is differentiable:
$\phi_t(\alpha)$ and $\nabla\phi_t(\alpha)$ can be computed using only the singular values $\sigma_j(W^t_\alpha) \ge \lambda\eta_t$ (and the corresponding singular vectors)
⇒ no need to compute the full SVD of $W^t_\alpha := W^t + \eta_t \mathcal{A}^\top(\alpha)$.
But computation of $\nabla^2\phi_t(\alpha)$ requires the full SVD.
Consequence:
When $\mathcal{A}^\top(\alpha)$ is structured, computing the above SVD is cheap.
E.g., matrix completion: $\mathcal{A}^\top(\alpha)$ is sparse with only $m$ non-zeros.
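A sketch of this partial-SVD evaluation (assumptions: SciPy's `svds` for the leading singular triplets and a simple doubling heuristic for how many to compute; not the authors' code):

```python
import numpy as np
from scipy.sparse.linalg import svds

def st_norm_sq(W_alpha, tau, k0=10):
    """||ST_tau(W_alpha)||_fro^2 from a partial SVD: only singular
    values >= tau (and their vectors) survive the thresholding.

    Doubles the number of computed triplets k until the smallest one
    found drops below tau, so no singular value above tau is missed.
    """
    kmax = min(W_alpha.shape) - 1
    k = min(k0, kmax)
    while True:
        U, s, Vt = svds(W_alpha, k=k)      # k leading triplets
        if s.min() < tau or k == kmax:
            break
        k = min(2 * k, kmax)
    s_shr = np.maximum(s - tau, 0.0)       # soft-thresholded spectrum
    return float(np.sum(s_shr ** 2)), (U, s_shr, Vt)
```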
$$W^t_\alpha = W^t_L W^t_R + \eta_t \mathcal{A}^\top(\alpha) \quad \text{(low rank + sparse)}$$
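For instance, this structure gives cheap matrix-vector products, which is all an iterative SVD solver needs. An illustrative SciPy sketch (assuming `W_L`, `W_R` are the low-rank factors and `S` is the sparse matrix $\eta_t\mathcal{A}^\top(\alpha)$; names hypothetical):

```python
import numpy as np
from scipy.sparse import random as sprandom
from scipy.sparse.linalg import LinearOperator, svds

def lowrank_plus_sparse_operator(W_L, W_R, S):
    """Matrix-free operator for W_alpha = W_L @ W_R + S.
    Each matvec costs O((R + C) * rank + nnz(S)) instead of O(R * C)."""
    R, C = W_L.shape[0], W_R.shape[1]
    return LinearOperator(
        (R, C),
        matvec=lambda v: W_L @ (W_R @ v) + S @ v,
        rmatvec=lambda u: W_R.T @ (W_L.T @ u) + S.T @ u,
        dtype=W_L.dtype,
    )

# Usage: leading singular triplets without forming W_alpha densely.
rng = np.random.default_rng(0)
W_L, W_R = rng.standard_normal((500, 5)), rng.standard_normal((5, 400))
S = sprandom(500, 400, density=0.01, random_state=0, format="csr")
U, s, Vt = svds(lowrank_plus_sparse_operator(W_L, W_R, S), k=10)
```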
References
Abernethy et al. A new approach to collaborative filtering: Operator estimation with
spectral regularization. JMLR 10, 2009.
Argyriou et al. Multi-task feature learning. In NIPS 19, 2007.
Ji & Ye. An accelerated gradient method for trace norm minimization. In ICML 2009.
Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. on Control and
Optimization 14, 1976a.
Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in
convex programming. Math. of Oper. Res. 1, 1976b.
Srebro et al. Maximum-margin matrix factorization. In NIPS 17, 2005.
Tomioka & Aihara. Classifying matrices with a spectral regularization. In ICML 2007.
Tomioka et al. Super-linear convergence of dual augmented-Lagrangian algorithm for
sparsity regularized estimation. arXiv:0911.4046, 2010.
Tomioka & Sugiyama. Sparse learning with duality gap guarantee. In NIPS workshop OPT
2008 Optimization for Machine Learning, 2008.
Tomioka & Müller. A regularized discriminative framework for EEG analysis with
application to brain-computer interface. Neuroimage 49, 2010.
Wright. Differential equations for the analytic singular value decomposition of a matrix.
Numer. Math. 63, 1992.