Notes on an ML course (2014 SJTU), 2 - Matrices and analysis

Notes taken from the course by Zhihua Zhang, 2014 @ SJTU.
Matrices and matrix norms

Matrices

A square matrix $A$ is called:
- Hermitian or self-adjoint if $A = A^H$, i.e., $a_{ij} = \overline{a_{ji}}$ (for real matrices we say symmetric);
- skew-Hermitian or anti-Hermitian if $A = -A^H$;
- normal if $A^H A = A A^H$ (not equivalent to unitary!);
- orthogonal/unitary (酉) if $A^T A = I = A A^T$ / $A^H A = I = A A^H$ for $A \in \mathbb{C}^{n \times n}$.

For any matrix $A$, the products $A^H A$ and $A A^H$ are always Hermitian and positive semi-definite (p.s.d.).

Singular value decomposition (SVD): $A = U \Sigma V^H$; note the difference between the full and the thin SVD. The SVD exists for any matrix, while an eigenvalue decomposition may not. QR decomposition is also an eigenvalue-related method; there is a whole family of such methods (recap numerical mathematics).
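A quick numerical sketch of the thin SVD and of the p.s.d. property of $A^H A$ (assuming NumPy; the complex matrix here is just a random example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))

# Thin SVD: U is 5x3, s holds the 3 singular values, Vh is 3x3.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(s) @ Vh)

# A^H A is Hermitian and p.s.d.: its eigenvalues are real, non-negative,
# and equal to the squared singular values of A.
G = A.conj().T @ A
assert np.allclose(G, G.conj().T)
eigvals = np.linalg.eigvalsh(G)
assert np.all(eigvals >= -1e-10)
assert np.allclose(np.sort(eigvals), np.sort(s**2))
```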
Norms

A norm is called:
- unitarily invariant / isometrically invariant (ui) if $\|UTV\| = \|T\|$ for all unitary $U, V$;
- sub-multiplicative (sm) if $\|ST\| \le \|S\| \cdot \|T\|$;
- consistent with a vector norm $\|\cdot\|$ if $\|Ax\| \le \|A\| \cdot \|x\|$ for all $x$.

Operator norm: the induced $p$-norm. For $p = 2$ it is the spectral norm, which is also a Schatten norm.
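A small check (a sketch, assuming NumPy) that the induced 2-norm equals the largest singular value and is consistent and sub-multiplicative:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5))
x = rng.standard_normal(6)

spec = np.linalg.norm(A, 2)  # induced 2-norm (spectral norm)
assert np.isclose(spec, np.linalg.svd(A, compute_uv=False)[0])  # = sigma_max(A)

# Consistency with the vector 2-norm and sub-multiplicativity.
assert np.linalg.norm(A @ x) <= spec * np.linalg.norm(x) + 1e-12
assert np.linalg.norm(A @ B, 2) <= spec * np.linalg.norm(B, 2) + 1e-12
```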
Schatten (p) norms

Schatten norms are both ui and sm:
$$\|A\|_p = \left(\sum_{i=1}^{\min\{m,\,n\}} \sigma_i^p(A)\right)^{1/p}.$$
Nuclear norm ($p = 1$): $\|A\|_* = \sum_i \sigma_i(A)$.
Frobenius norm ($p = 2$): $\|A\|_F = \left(\sum_i \sigma_i^2(A)\right)^{1/2}$, and
$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 = \operatorname{tr}(A^H A) = \sum_i \sigma_i^2(A).$$
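A sketch (NumPy assumed) computing Schatten $p$-norms from the singular values; $p = 1$ gives the nuclear norm, and $p = 2$ matches the entrywise Frobenius formula:

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm: the l_p norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s**p) ** (1.0 / p)

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))

nuclear = schatten_norm(A, 1)                      # nuclear norm
frob = schatten_norm(A, 2)                         # Frobenius norm
assert np.isclose(frob, np.linalg.norm(A, 'fro'))  # entrywise formula agrees
assert np.isclose(frob**2, np.trace(A.T @ A))      # tr(A^T A) = sum_i sigma_i^2
```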
The Frobenius norm is invariant under rotation: for any unitary matrix $R$ (e.g. a rotation, a change of orthonormal basis, or the $U, V$ from the SVD),
$$\|AR\|_F^2 = \operatorname{tr}(R^T A^T A R) = \operatorname{tr}(R R^T A^T A) = \operatorname{tr}(A^T A) = \|A\|_F^2.$$
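A quick check of this invariance with a random orthogonal $R$ (sketch, NumPy assumed; $R$ is obtained from a QR factorization of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix

assert np.allclose(R.T @ R, np.eye(4))
assert np.isclose(np.linalg.norm(A @ R, 'fro'), np.linalg.norm(A, 'fro'))
assert np.isclose(np.linalg.norm(R @ A, 'fro'), np.linalg.norm(A, 'fro'))
```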
Derivatives

Directional derivative & Gâteaux differentiability

Let $f: \mathbf{E} \rightarrow \mathbb{R}$. The directional derivative of $f$ at $x$ in a direction $d \in \mathbf{E}$ is the limit
$$f'(x; d) = \lim_{t \searrow 0} \frac{f(x + t\, d) - f(x)}{t},$$
provided the limit exists. If $f'(x; d)$ is linear in $d$, i.e. $f'(x; d) = \langle a, d \rangle$ for some $a$ (an inner product), then we say $f$ is (Gâteaux) differentiable at $x$ with (Gâteaux) derivative $a$.
Example:
$f(X) = \log|X|$ (log-determinant); find $f'(X; Y)$, where $X, Y \in \mathscr{S}_{++}$.
Solution
$$\lim_{t \searrow 0} \frac{\log|X + tY| - \log|X|}{t} = \lim_{t \searrow 0} \frac{1}{t} \log|I + t X^{-1} Y|.$$
Eigendecompose $X^{-1}Y$: since $X, Y \in \mathscr{S}_{++}$, $X^{-1}Y$ is similar to the symmetric p.d. matrix $X^{-1/2} Y X^{-1/2}$, so $X^{-1}Y = Q \Lambda Q^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ holding the (real, positive) eigenvalues. Then
$$\log|I + t X^{-1} Y| = \log|Q (I + t\Lambda) Q^{-1}| = \log \prod_i (1 + t\lambda_i) = \sum_i \log(1 + t\lambda_i),$$
and (differentiating in $t$, by L'Hôpital)
$$f'(X; Y) = \lim_{t \searrow 0} \sum_i \frac{\lambda_i}{1 + t\lambda_i} = \sum_i \lambda_i = \operatorname{tr}(X^{-1}Y) = \langle X^{-1}, Y \rangle, \quad \text{i.e. } f'(X) = X^{-1}.$$
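A finite-difference sanity check of this result (a sketch, assuming NumPy; the s.p.d. matrices are built as $M M^T + I$):

```python
import numpy as np

rng = np.random.default_rng(4)

def random_spd(n):
    M = rng.standard_normal((n, n))
    return M @ M.T + np.eye(n)  # symmetric positive definite

X, Y = random_spd(4), random_spd(4)

_, logdet_X = np.linalg.slogdet(X)
t = 1e-6
_, logdet_Xt = np.linalg.slogdet(X + t * Y)

fd = (logdet_Xt - logdet_X) / t          # finite-difference directional derivative
exact = np.trace(np.linalg.solve(X, Y))  # tr(X^{-1} Y)
assert np.isclose(fd, exact, rtol=1e-4)
```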
Subgradient

(For more, see convex analysis.)

$\phi$ is said to be a subgradient of $f$ at $x_0$ if it satisfies $\langle \phi, x - x_0 \rangle \le f(x) - f(x_0)$ for all $x \in E$, where $E$ is a convex open set. The set of all subgradients at $x_0$ is called the subdifferential at $x_0$ and is denoted $\partial f(x_0)$.
- For a convex $f$, the subdifferential at any interior point is a nonempty convex compact set.
- A convex function $f: I \rightarrow \mathbb{R}$ is differentiable at $x_0$ if and only if the subdifferential consists of a single point, which is then the derivative at $x_0$.
- A point $x_0$ is a global minimum of a convex function $f$ iff $0 \in \partial f(x_0)$.
Example
$f(y) = \frac{1}{2}(y - a)^2 + \lambda|y|$; find the subdifferential at $y = 0$, and find $\displaystyle\inf_y f$.

Solution
$$\partial f(y) = (y - a) + \lambda\, \partial|y|.$$
A subgradient $\phi$ of $|\cdot|$ at $0$ must satisfy $\phi \cdot y \le |y|$ for all $y$, so $\partial|y|\big|_{y=0} = [-1, 1]$; for $y \neq 0$, $\partial|y| = \{\operatorname{sgn}(y)\}$. Hence
$$\partial f(y) = \begin{cases} [\,y - a - \lambda,\; y - a + \lambda\,], & y = 0,\\ \{\,y - a + \lambda \operatorname{sgn}(y)\,\}, & y \neq 0. \end{cases}$$
We note $f$ is convex (easy to verify). If $\hat y$ is the minimizer, then
$$0 \in \hat y - a + \lambda\, \partial|\hat y| \;\Rightarrow\; a - \hat y \in \lambda\, \partial|\hat y|.$$
- If $\hat y = 0$: $a \in [-\lambda, \lambda]$, i.e. $|a| \le \lambda$.
- If $\hat y \neq 0$: $a - \hat y = \lambda\, \operatorname{sign}(\hat y)$, which forces $\operatorname{sign}(\hat y) = \operatorname{sign}(a)$ and $|a| > \lambda$, so $\hat y = a - \lambda\, \operatorname{sign}(a)$.

Altogether $\hat y = \operatorname{sign}(a)\max(|a| - \lambda, 0)$: the soft-thresholding operator.
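A quick numerical cross-check of the closed form (a sketch, assuming NumPy and SciPy's `minimize_scalar`):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def soft_threshold(a, lam):
    """argmin_y 0.5*(y - a)^2 + lam*|y|, from the subgradient condition."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

lam = 0.7
for a in [-2.0, -0.3, 0.0, 0.5, 1.5]:
    f = lambda y, a=a: 0.5 * (y - a) ** 2 + lam * abs(y)
    y_num = minimize_scalar(f).x  # numerical minimizer
    assert abs(soft_threshold(a, lam) - y_num) < 1e-5
```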
February 21, 2018