Notes on the Elements of Statistical Learning, Ch4, 1

Notation

$x \in \mathcal{X}\subset{\mathbb{R}}^p$: feature vector; taking $N$ samples, we have $X = \{x_i^T\}_{i=1}^N$.

$\mathcal{G}= \{1,\dots,K\}$: the set of categories, where $g(x) \in \mathcal{G}$. Alternatively, we can denote the class of $x$ as $y$, where $y \in \{e_k\}$ and $e_k$ is the $K$-vector with $1$ in the $k$-th dimension and $0$ elsewhere.

$\delta_k(x)$: discriminant functions (methods of this kind classify to the $\arg\max$ over $k$). A discriminant function that is linear in $x$ results in a linear decision boundary. The rationale for using the $\arg\max$ comes from the categorical distribution, where $\mathbf{E}(y|x) = p$, the vector of class probabilities.

Recap

The 0-1 loss

In Ch. 2, with the 0-1 loss function [ $l(f,y) = 1 - \mathbf{bool}(f=y)$, where $\mathbf{bool}$ is the boolean (indicator) operator ], we obtain the Bayes classifier. Here we briefly review the relationship:

$$g(x) = k \Leftrightarrow y = e_k$$

Then $l(f,y) = y^T L f$, where both $y$ and the prediction $f$ are one-hot vectors and $L$ is the matrix with $0$ on its diagonal and $1$ elsewhere, i.e., $L = \mathbf{1}\mathbf{1}^T - I_K$.
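As a quick sanity check (a minimal numpy sketch, not from the book; all names are illustrative), the matrix form of the 0-1 loss can be verified with one-hot vectors:

```python
import numpy as np

K = 4                                # number of classes (arbitrary for the check)
L = np.ones((K, K)) - np.eye(K)      # L = 11^T - I_K: 0 on the diagonal, 1 elsewhere

def one_hot(k, K):
    """e_k: K-vector with 1 in the k-th dimension (1-indexed as in the notes)."""
    e = np.zeros(K)
    e[k - 1] = 1.0
    return e

# y^T L f is 0 when the prediction matches the label and 1 otherwise
for true_k in range(1, K + 1):
    for pred_k in range(1, K + 1):
        y, f = one_hot(true_k, K), one_hot(pred_k, K)
        assert y @ L @ f == (0.0 if true_k == pred_k else 1.0)
```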

The Bayes classifier:

$$\min_{f:\,x\rightarrow y}\mathbf{R}, \qquad \mathbf{R} = \int_{x,y} y^T L f(x)\; d\mathbf{P}(x,y)$$

since $f$ is unconstrained, we can minimize pointwise, $\forall x$:

$$\mathbf{R}_x = \int_{y|x} y^T L f \, d\mathbf{P}(y|x) = \left(\int_{y|x} y^T \, d\mathbf{P}(y|x)\right) L f = \mathbf{E}(y|x)^T\big(\mathbf{1}\mathbf{1}^T - I_K\big) f = 1 - \mathbf{E}(y|x)^T f,$$

where the last step uses $\mathbf{E}(y|x)^T\mathbf{1} = 1$ and $\mathbf{1}^T f = 1$. Thus we have, $\forall x$ in the support:

$$f = \arg\max_k \mathbf{E}\big(y_k \mid x\big) = \arg\max_k \mathbf{P}\big(y_k = 1 \mid x\big)$$

That is, we classify $x$, $\hat{y} = f(x)$, to the most probable class, or, in a neighborhood-based estimate, to the dominant class in $N(x)$, where $N(x)$ is some neighborhood of $x$.
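Both readings of the rule are easy to sketch in numpy (a minimal illustration, assuming the posteriors are available for the first, and using a plain majority vote over the $m$ nearest points as the neighborhood $N(x)$ for the second; all names are illustrative):

```python
import numpy as np

def bayes_classify(posteriors):
    """posteriors: (N, K) array with row i = [P(G=1|x_i), ..., P(G=K|x_i)].
    Returns the 1-indexed most probable class for each sample."""
    return np.argmax(posteriors, axis=1) + 1

def neighborhood_classify(x, X_train, g_train, m=15):
    """Dominant class among the m nearest training points (Euclidean N(x))."""
    dist = np.linalg.norm(X_train - x, axis=1)
    labels = g_train[np.argsort(dist)[:m]]
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]
```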

Logistic Regression

Assumption: with $K$ as the reference (comparison) class, for $j = 1, \dots, K-1$ we assume

$$\log\frac{\mathbf{P}(G=j\mid x)}{\mathbf{P}(G=K\mid x)}=\beta_j^T x$$

$\Rightarrow$

$$\mathbf{P}(G=j\mid x) = \frac{\exp (\beta_j^T x) }{1+\sum_{i=1}^{K-1} \exp(\beta_i^T x)}, \qquad \mathbf{P}(G=K\mid x) = \frac{1}{1+\sum_{i=1}^{K-1} \exp(\beta_i^T x)}$$

we do maximum likelihood:

$$\log L(\{\beta_j\}) = \sum_{i=1}^{N} \log \mathbf{P}(g_i \mid x_i)$$
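A minimal sketch of the model and its log-likelihood (illustrative names; `betas` stacks $\beta_1,\dots,\beta_{K-1}$ as rows, the reference class $K$ gets logit $0$, and `g` holds 1-indexed class labels):

```python
import numpy as np

def class_probabilities(X, betas):
    """X: (N, p) design matrix; betas: (K-1, p). Returns the (N, K) matrix of P(G=j|x)."""
    logits = np.hstack([X @ betas.T, np.zeros((X.shape[0], 1))])  # reference class: logit 0
    logits -= logits.max(axis=1, keepdims=True)                   # numerical stability only
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def log_likelihood(betas, X, g):
    """log L({beta_j}) = sum_i log P(g_i | x_i), with g in {1, ..., K}."""
    P = class_probabilities(X, betas)
    return np.sum(np.log(P[np.arange(len(g)), g - 1]))
```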

Interesting notes

Binary case

If $K = 2$, there is a single coefficient vector $\beta$ and the maximization problem reduces to:

$$\max_{\beta} \log L = \max_\beta \sum_{i=1}^N \Big[\, y_i \log \mathbf{P}(y_i=1\mid x_i) + (1-y_i)\log \mathbf{P}(y_i=0\mid x_i) \,\Big]$$

$$\max_\beta \log L = \max_\beta \sum_j \Big[ -y_j \log\big(1+\exp(\beta^T x_j)\big) + (1-y_j)\beta^T x_j + (y_j - 1)\log\big(1+\exp(\beta^T x_j)\big) \Big]$$

$$= \max_\beta \sum_j \Big[ (1-y_j)\beta^T x_j - \log\big(1+\exp(\beta^T x_j)\big) \Big]$$
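Note that this reduction holds with the coding where $y_j = 1$ marks the reference class $2$ (so $\mathbf{P}(y_j=1\mid x_j) = \mathbf{P}(G=2\mid x_j) = 1/(1+\exp(\beta^T x_j))$) and $y_j = 0$ marks class $1$. A quick numerical check of the two forms under that coding (a sketch with illustrative names and random data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)
y = rng.integers(0, 2, size=N)            # y = 0 -> class 1, y = 1 -> class 2 (reference)

eta = X @ beta                            # beta^T x_j
p1 = np.exp(eta) / (1 + np.exp(eta))      # P(G = 1 | x_j)

generic = np.sum(y * np.log(1 - p1) + (1 - y) * np.log(p1))    # y log P(y=1|x) + (1-y) log P(y=0|x)
simplified = np.sum((1 - y) * eta - np.log(1 + np.exp(eta)))   # the reduced form above
assert np.allclose(generic, simplified)
```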

$$\frac{\partial \log L}{\partial \beta} = \sum_j \left[ (1-y_j)\,x_j - x_j\frac{\exp(\beta^T x_j)}{1 + \exp(\beta^T x_j)} \right] = \sum_j x_j \big(1 - p_j(1) - y_j\big)$$

Let $p$ be the vector of conditional probabilities of class $1$, i.e., $p_j = p_j(1) = \mathbf{P}(G=1\mid x_j)$.

Then $\displaystyle\frac{\partial \log L}{\partial \beta} = X^T(\mathbf{1}-p-y)$,

and let $\displaystyle\frac{\partial^2 \log L}{\partial \beta\,\partial \beta^T} = H = - \sum_j \frac{x_jx_j^T\exp(\beta^Tx_j)}{\big(1+\exp(\beta^T x_j)\big)^2} = -X^TWX$

with $W=\mathbf{diag}\big[p_1(1-p_1),\, p_2(1-p_2),\, \dots,\, p_N(1-p_N)\big]$.

[ notice $H$ is negative semi-definite, and negative definite whenever $X$ has full column rank, so the log-likelihood is concave ]
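Putting the gradient and Hessian together gives a Newton-Raphson fit. A minimal sketch under the same coding as above ($y_j = 1$ marks the reference class $2$), with no intercept handling or safeguards; names are illustrative:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton's method for the binary log-likelihood above:
    grad = X^T (1 - p - y),  H = -X^T W X,  beta <- beta - H^{-1} grad."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        p = np.exp(eta) / (1 + np.exp(eta))        # p_j = P(G = 1 | x_j)
        grad = X.T @ (1 - p - y)
        W = np.diag(p * (1 - p))                   # W = diag[p_j (1 - p_j)]
        step = np.linalg.solve(X.T @ W @ X, grad)  # -H^{-1} grad = (X^T W X)^{-1} grad
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

Each Newton step can equivalently be written as the weighted least-squares update $\beta \leftarrow (X^TWX)^{-1}X^TWz$ with adjusted response $z = X\beta + W^{-1}(\mathbf{1}-p-y)$, i.e., iteratively reweighted least squares.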

Multiclass

...