Notes on an ML course (2014, SJTU), 2

2 - Matrices and analysis

Notes taken from the course by Zhihua Zhang, 2014 @ SJTU

Matrices and matrix norms

Matrices

  • Hermitian or self-adjoint if $A = A^H$, i.e., $a_{ij} = \overline{a_{ji}}$ (if $A$ is real, we say it is symmetric).
  • skew-Hermitian or anti-Hermitian if $A = -A^H$.
  • normal if $A^H A = A A^H$ (not equivalent to unitary!).
  • orthogonal / unitary:
    • A square matrix $A \in \mathbb{C}^{n \times n}$ is unitary if $A^H A = I = A A^H$ (orthogonal in the real case: $A^T A = I = A A^T$).
  • $A^H A$ and $A A^H$ are always Hermitian and positive semi-definite (p.s.d.).
  • Singular value decomposition (SVD):
    • $A = U \Sigma V^H$; note the difference between the full and thin SVD.
    • The SVD exists for any matrix, while the eigenvalue decomposition may not (it requires a square, diagonalizable matrix).
  • QR decomposition: also used in eigenvalue methods; there is a family of algorithms for computing it (recap numerical mathematics).
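
The properties above can be sanity-checked numerically. A minimal NumPy sketch (the matrix `A` is an arbitrary random example) verifying that $A^H A$ is Hermitian p.s.d. and contrasting the full vs. thin SVD:

```python
import numpy as np

# A hypothetical rectangular matrix for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# A^H A is Hermitian and positive semi-definite (A is real, so A^H = A^T).
G = A.T @ A
assert np.allclose(G, G.T)                       # Hermitian (here: symmetric)
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)   # p.s.d. up to round-off

# Full SVD: U is 5x5, Vh is 3x3; only min(m, n) = 3 singular values exist.
U, s, Vh = np.linalg.svd(A, full_matrices=True)
# Thin SVD: U shrinks to 5x3 but still reconstructs A exactly.
Ut, st, Vht = np.linalg.svd(A, full_matrices=False)
assert np.allclose(Ut @ np.diag(st) @ Vht, A)
```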

Norms

A norm is called:

  • unitarily invariant / isometrically invariant (ui) if $\|UTV\| = \|T\|$ for all unitary $U$, $V$;
  • sub-multiplicative (sm) if $\|ST\| \le \|S\| \cdot \|T\|$;
  • consistent with a vector norm $\|\cdot\|$ if $\|Ax\| \le \|A\| \cdot \|x\|$ for all $x$.
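
A quick numerical sketch of the three properties, using the spectral norm (NumPy's `ord=2`), which satisfies all of them; the random matrices are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 4))
T = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
# Random orthogonal (real unitary) matrices via QR.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))

norm = lambda M: np.linalg.norm(M, 2)
assert np.isclose(norm(U @ T @ V), norm(T))                           # ui
assert norm(S @ T) <= norm(S) * norm(T) + 1e-12                       # sm
assert np.linalg.norm(T @ x) <= norm(T) * np.linalg.norm(x) + 1e-12  # consistent
```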

Operator norm

induced $p$-norm: for $p = 2$ this is the spectral norm, $\|A\|_2 = \sigma_1(A)$ (the largest singular value); it is also a Schatten norm (the Schatten-$\infty$ norm).
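
A one-line check that the induced 2-norm equals the largest singular value (random `A` for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))
# Spectral norm == sigma_1(A), the first entry of the sorted singular values.
assert np.isclose(np.linalg.norm(A, 2), np.linalg.svd(A, compute_uv=False)[0])
```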

Schatten (p) norms

Schatten norms are both unitarily invariant (ui) and sub-multiplicative (sm):

$\|A\|_p = \left(\sum_{i=1}^{\min\{m,\,n\}} \sigma_i^p(A)\right)^{1/p}$

Nuclear Norm: p=1

$\|A\|_* = \sum_i \sigma_i(A)$
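
Schatten norms are easy to compute from the singular values; a sketch (random `A` as illustration), cross-checked against NumPy's built-in `'nuc'` and `'fro'` norms:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

def schatten(A, p):
    """Schatten p-norm: the l_p norm of the singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** p).sum() ** (1.0 / p)

assert np.isclose(schatten(A, 1), np.linalg.norm(A, 'nuc'))  # p=1: nuclear norm
assert np.isclose(schatten(A, 2), np.linalg.norm(A, 'fro'))  # p=2: Frobenius norm
```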

Frobenius Norm: p=2

$\|A\|_F = \left(\sum_i \sigma_i^2(A)\right)^{1/2}$

$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 = \operatorname{tr}(A^\dagger A) = \sum_i \sigma_i^2(A)$

invariant under rotation: for any unitary matrix $R$ (e.g., a rotation, a change of orthonormal basis, or the $U$, $V$ in the SVD),

$\|AR\|_F^2 = \operatorname{tr}(R^T A^T A R) = \operatorname{tr}(R R^T A^T A) = \operatorname{tr}(A^T A) = \|A\|_F^2$ (using the cyclic property of the trace).
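
The trace argument can be checked numerically; a sketch with an arbitrary random `A` and a random orthogonal `R`:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal R

# ||AR||_F == ||A||_F, and the underlying trace identity holds.
assert np.isclose(np.linalg.norm(A @ R, 'fro'), np.linalg.norm(A, 'fro'))
assert np.isclose(np.trace(R.T @ A.T @ A @ R), np.trace(A.T @ A))
```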

Derivatives

Directional derivative & Gâteaux differentiable

Let $f : \mathbf{E} \rightarrow \mathbb{R}$. The directional derivative of $f$ at $x$ in a direction $d \in \mathbf{E}$ is the limit:

$f'(x; d) = \lim_{t \searrow 0} \dfrac{f(x + t d) - f(x)}{t}$

if the limit exists. If $f'(x; d)$ is linear in $d$, i.e., $f'(x; d) = \langle a, d \rangle$ for some $a$ (an inner product), then we say $f$ is (Gâteaux) differentiable at $x$ with (Gâteaux) derivative $a$.

  • Example:

    • $f(X) = \log|X|$ (the log-determinant); find $f'(X; Y)$ where $X, Y \in \mathscr{S}_{++}$

    • Solution

    $\displaystyle\lim_{t \rightarrow 0} \frac{\log|X + tY| - \log|X|}{t} = \lim_{t \rightarrow 0} \frac{1}{t} \log|I + t X^{-1} Y|$

    do a spectral decomposition of $X^{-1}Y$: since $X, Y \in \mathscr{S}_{++}$, $X^{-1}Y$ is similar to the symmetric positive definite matrix $X^{-1/2} Y X^{-1/2}$, so it is diagonalizable with positive eigenvalues: $X^{-1}Y = Q \Lambda Q^{-1}$, where $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ holds the eigenvalues.

    $\Rightarrow \log|I + t X^{-1} Y| = \log|Q (I + t\Lambda) Q^{-1}|$

    $= \log \prod_i (1 + t\lambda_i) = \sum_i \log(1 + t\lambda_i)$

    $f'(X; Y) = \displaystyle\lim_{t \searrow 0} \sum_i \frac{\lambda_i}{1 + t\lambda_i} = \sum_i \lambda_i = \operatorname{tr}(X^{-1} Y) = \langle X^{-1}, Y^T \rangle = \langle X^{-1}, Y \rangle$, so $f' = X^{-1}$.
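
The result $f'(X; Y) = \operatorname{tr}(X^{-1} Y)$ can be checked with a small finite-difference step (illustration only; `X`, `Y` are arbitrary random symmetric positive definite matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))
X = B @ B.T + 4 * np.eye(4)   # symmetric positive definite
Y = C @ C.T + 4 * np.eye(4)

f = lambda M: np.log(np.linalg.det(M))   # f(X) = log|X|
t = 1e-6
fd = (f(X + t * Y) - f(X)) / t           # finite-difference directional derivative
# Compare against tr(X^{-1} Y), computed stably via a linear solve.
assert np.isclose(fd, np.trace(np.linalg.solve(X, Y)), rtol=1e-4)
```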

Subgradient

(for more, see convex analysis)

$\phi$ is said to be a subgradient of $f$ at $x_0$ if $\langle \phi, x - x_0 \rangle \le f(x) - f(x_0)$ for all $x \in E$, where $E$ is a convex open set. The set of all subgradients at $x_0$ is called the subdifferential at $x_0$ and is denoted $\partial f(x_0)$.

  • For a convex function, the subdifferential at any interior point is a nonempty convex compact set.

  • A convex function $f : I \rightarrow \mathbb{R}$ is differentiable at $x_0$ if and only if the subdifferential consists of a single point, which is then the derivative at $x_0$.

  • A point $x_0$ is a global minimum of a convex function $f$ iff $0 \in \partial f(x_0)$.

  • Example

    • $f(y) = \frac{1}{2}(y - a)^2 + \lambda |y|$: find the subdifferential at $y = 0$,
    • then find $\displaystyle\inf_y f$.
    • Solution

    $\partial f = (y - a) + \lambda \partial|y|$

    A subgradient $\phi$ of $|y|$ at $0$ must satisfy $\phi \cdot y \le |y|$ for all $y$, so $\partial|y|\big|_{y=0} = [-1, 1]$.

    At $y = 0$: $\partial f = -a + \lambda[-1, 1] = [-a - \lambda, -a + \lambda]$; for $y \neq 0$: $\partial f = \{y - a + \lambda\,\mathrm{sgn}(y)\}$.
    We note $f$ is convex (easy to verify), so if $\hat y$ is the minimizer, then $0 \in \hat y - a + \lambda \partial|\hat y| \Rightarrow a - \hat y \in \lambda \partial|\hat y|$

    if $\hat y = 0$: $a \in [-\lambda, \lambda]$, i.e., $\hat y = 0$ whenever $|a| \le \lambda$

    if $\hat y \neq 0$: $a - \hat y = \lambda\,\mathbf{sign}(\hat y)$, so $\hat y = a - \lambda\,\mathbf{sign}(a)$. Combining both cases: $\hat y = \mathbf{sign}(a) \max(|a| - \lambda, 0)$ (soft-thresholding).
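
The two cases combine into the soft-thresholding operator; a sketch verifying it against a brute-force grid search over candidate $y$ values (the test values of $a$ and $\lambda$ are arbitrary):

```python
import numpy as np

def soft_threshold(a, lam):
    """argmin_y 1/2 (y - a)^2 + lam * |y| = sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

lam = 0.5
for a in (-2.0, -0.3, 0.0, 0.4, 1.7):
    ys = np.linspace(-3, 3, 600001)              # fine grid of candidates
    obj = 0.5 * (ys - a) ** 2 + lam * np.abs(ys)
    y_grid = ys[np.argmin(obj)]                  # brute-force minimizer
    assert abs(soft_threshold(a, lam) - y_grid) < 1e-4
```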