Notes on the Elements of Statistical Learning, Ch2

(E of SL) Ch. 1–2 notes

Dots

$\min_{f} \mathbf{R}$

$\text{s.t.}\;\; f \in \mathcal{F} = \{f \mid f: X \mapsto Y\}$
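
A minimal numerical sketch of this setup (the data-generating process and the finite class of slope functions are my own illustration, not from the text): empirical risk minimization under squared loss over a small hypothesis class, with the risk replaced by its sample average.

```python
import numpy as np

# Toy illustration of min_f R subject to f in F, with R approximated
# by the empirical average of a squared loss. Data and hypothesis class
# are hypothetical choices for this sketch.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y = 2.0 * x + rng.normal(scale=0.3, size=500)   # true relation: y = 2x + noise

# Finite hypothesis class F = {x -> a*x : a on a grid}
candidates = {a: (lambda x, a=a: a * x) for a in np.linspace(-3, 3, 61)}

def empirical_risk(f):
    """Average squared loss on the sample (plug-in estimate of R)."""
    return np.mean((y - f(x)) ** 2)

best_a = min(candidates, key=lambda a: empirical_risk(candidates[a]))
print(f"argmin over F: a = {best_a:.2f}")   # close to the true slope 2.0
```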

Statistical decision theory

Concepts: loss, risk (expected prediction error, EPE)

The loss function

the squared loss

0-1 loss in classification

Remark

miscellanea

Mean squared error and residual sum of squares:
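
For reference (standard in-sample definitions; the note presumably refers to these quantities): $\displaystyle \mathrm{RSS} = \sum_{i=1}^{N}\big(y_i - \hat f(x_i)\big)^2$ and $\mathrm{MSE} = \mathrm{RSS}/N$; the population analogue of the MSE under squared loss is the risk $\mathbf{R}$.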

P.F

Ch1

P.F 1.1

$\displaystyle \mathbf{R} = \int_{y,x}(y-f(x))^2\;d\mathbf{P}(x,y)$; with $f$ restricted to the linear class $f(x)=\beta'x$, pointwise minimization cannot be used. Instead,
$\displaystyle \mathbf{R} = \int_{y,x}(y-\beta'x)^2\;d\mathbf{P}(x,y)$, which is convex in $\beta$:
$\displaystyle \partial \mathbf{R}/\partial \beta = -\int_{y,x}2x(y-x'\beta)\;d\mathbf{P}(x,y) = 0 \Rightarrow \mathbf{E}(xy) = \mathbf{E}(xx')\beta$
$\beta = \mathbf{E}(xx')^{-1}\mathbf{E}(xy)$

Note that $\mathbf{E}(xx')$ may not be invertible.
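
As a sanity check of the closed form, a short sketch with simulated data (the design, noise level, and coefficients are assumptions for illustration): replacing the population moments $\mathbf{E}(xx')$ and $\mathbf{E}(xy)$ by their sample averages reproduces the ordinary least-squares estimate.

```python
import numpy as np

# Sample-analogue check of beta = E(xx')^{-1} E(xy) (P.F 1.1).
# The simulated design and true coefficients are illustrative only.
rng = np.random.default_rng(1)
n, m = 10_000, 3
X = rng.normal(size=(n, m))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Replace population moments by sample moments: E(xx') ~ X'X/n, E(xy) ~ X'y/n
Exx = X.T @ X / n
Exy = X.T @ y / n
beta_hat = np.linalg.solve(Exx, Exy)      # assumes E(xx') is invertible

# The same estimate from a least-squares solver, for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)   # the two agree up to numerical error
```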

P.F 1.3

The Bayes classifier:
$\displaystyle \mathbf{R} = \int_{y,x}y'\mathbf{L}f(x)\;d\mathbf{P}(x,y)$, where $y$ and $f(x)$ are indicator (one-hot) vectors and $\mathbf{L} = \mathbf{1}\mathbf{1}' - \mathbf{I}_K$ encodes the 0-1 loss. Pointwise minimization, $\forall x$:
$\displaystyle \mathbf{R}_{x} = \int_{y|x} y'\mathbf{L}f\;d\mathbf{P}(y|x)$
$\displaystyle \;\;\; = \Big(\int_{y|x} y'\;d\mathbf{P}(y|x)\Big)\,\mathbf{L}\,f = \mathbf{E}(y|x)'\,(\mathbf{1}\mathbf{1}' - \mathbf{I}_K)\,f$
$\displaystyle \;\;\; = 1 - \mathbf{E}(y|x)'f$, since $\mathbf{E}(y|x)'\mathbf{1} = 1$ and $\mathbf{1}'f = 1$; thus $\forall x$ in the support:

$f = \arg\max_k \mathbf{E}(y_k \mid x) = \arg\max_k \mathbf{P}(y_k = 1 \mid x)$
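
A tiny pointwise check of this conclusion (the posterior probabilities are arbitrary illustrative values): under 0-1 loss, the conditional risk of deterministically predicting class $j$ at a given $x$ is $1 - \mathbf{P}(y_j = 1\mid x)$, so the risk-minimizing choice is the class with the largest posterior.

```python
import numpy as np

# Pointwise check of P.F 1.3: under 0-1 loss, predicting
# argmax_k P(y_k = 1 | x) minimizes the conditional risk R_x.
posteriors = np.array([0.2, 0.5, 0.3])   # P(y_k = 1 | x) for K = 3 classes

# Conditional risk of predicting class j at this x:
# R_x(j) = sum_{k != j} P(y_k = 1 | x) = 1 - P(y_j = 1 | x)
risks = 1.0 - posteriors
print(risks)                      # [0.8, 0.5, 0.7]
print(int(np.argmin(risks)))      # 1, i.e. the class with the largest posterior
assert np.argmin(risks) == np.argmax(posteriors)
```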

P.F. 1.4

Consider the regression approach to $K$-class classification with the squared $\mathcal{L}_2$ loss:
$\displaystyle \mathbf{R} = \int_{y,x}\|y-f(x)\|^2\; d\mathbf{P}(x,y)$
In this case the targets are indicator vectors $y \in \{\mathbf{e}_k\}_{k=1}^{K}$, while $f$ maps $\mathcal{X} \subseteq \mathbb{R}^m$ to $\mathbb{R}^K$ without restriction, so pointwise minimization applies:
$f = \mathbf{E}(y \mid x)$, and the result is exact: the $k$-th component of $f$ is $\mathbf{P}(y_k = 1 \mid x)$, so taking $\arg\max_k f_k(x)$ recovers the Bayes classifier.
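
A simulated sketch of this fact (the discrete $\mathcal{X}$ and the posterior table below are my own assumptions): on a finite $\mathcal{X}$ the unrestricted pointwise minimizer of the squared loss is the per-$x$ mean of the one-hot targets, which approximates $\mathbf{E}(y\mid x)$, and its argmax gives the Bayes class.

```python
import numpy as np

# Numerical sketch of P.F 1.4: with one-hot targets and squared loss,
# the unrestricted minimizer at each x is the conditional mean E(y|x),
# whose k-th entry is P(y_k = 1 | x). Data are simulated for illustration.
rng = np.random.default_rng(2)
K = 3
true_post = np.array([[0.7, 0.2, 0.1],   # P(y = k | x) for x = 0, 1, 2
                      [0.1, 0.6, 0.3],
                      [0.2, 0.2, 0.6]])

n = 30_000
x = rng.integers(0, 3, size=n)
labels = np.array([rng.choice(K, p=true_post[xi]) for xi in x])
Y = np.eye(K)[labels]                     # one-hot targets

# Per-x minimizer of the empirical squared loss is the group mean of Y,
# i.e. the empirical class proportions -> approximates E(y|x).
for xv in range(3):
    f_x = Y[x == xv].mean(axis=0)
    print(xv, np.round(f_x, 3), "argmax:", int(f_x.argmax()))
```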