Notes on the Elements of Statistical Learning, Ch2

(Elements of Statistical Learning) notes on Ch. 1–2

Key points

  • The pair $X \in \mathbf{R}^p,\ y \in \mathbf{R}^K$ ($K=1$ for a scalar response). SL studies a function $f$ as an estimate of $y$, i.e. $\hat{y} = f(x)$.
    • and thus you need a loss $l = l(f, y)$ and a risk $R = \mathbf{E}_{x,y}(l)$ to evaluate the estimated functions. Minimizing the risk under constraints on the space of $f$, $\mathcal{F}$, yields the "learned" estimate; the problem has the form:

$\min_{f} \mathbf{R}$

$\text{s.t.}\quad f \in \mathcal{F} = \{f \mid f : X \mapsto Y\}$

  • Three main classes in SL
    • Parametric: $f(x) = f(x \mid \theta)$. Since the model has an analytic form with a fixed group of parameters, the space of $f$ is restricted.
      • examples: linear regression, logistic, LDA/QDA
    • Non-parametric: local/neighborhood methods. This class depends heavily on the data, i.e. the observed samples.
    • Mixture of both.
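A toy contrast between the first two classes. This is my own sketch, not from the book: the data-generating model, the through-the-origin parameterization, and `knn_predict` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: y = 2x + noise.
n = 500
x = rng.uniform(0, 1, n)
y = 2 * x + rng.normal(0, 0.1, n)

# Parametric: restrict f to lines through the origin, f(x) = b*x,
# and estimate the single parameter b by least squares.
b_hat = (x @ y) / (x @ x)

# Non-parametric (neighborhood): k-nearest-neighbor average at a query point.
def knn_predict(x0, k=20):
    idx = np.argsort(np.abs(x - x0))[:k]  # indices of the k nearest samples
    return y[idx].mean()                  # local mean of the responses

x0 = 0.5
print(b_hat * x0, knn_predict(x0))  # both should land near 2 * 0.5 = 1
```

The parametric fit estimates one global parameter; the neighborhood method re-estimates locally from whatever samples fall near the query point.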

Statistical decision theory

Concepts: loss, risk (expected prediction error, EPE)

  • @true fact: the EPE notation used in "The Elements of Statistical Learning" is basically a frequentist's risk.

The loss function

the squared loss

  • if $f$ is unconstrained, $f \in \mathbb{C}_0(\mathbb{R}^m)$, then $f(x) = \mathbf{E}(y \mid x)\ \forall x$
  • if $f$ is restricted to the linear class $f(x) = x'\beta$, we have the estimate:
    • $\beta = \mathbf{E}(xx')^{-1}\mathbf{E}(xy)$
  • if $f$ is a neighborhood-type method:
    • k-nearest neighbors is akin to a simple-function approximation, with $f(x) = \mathbf{E}(y \mid X \in N(x))$
    • trees: at each leaf node you take the mean of the responses.
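A quick numerical check of the squared-loss claim above — my own sketch, not from the text. With no covariates, the sample analogue of $f(x) = \mathbf{E}(y \mid x)$ is a single constant, and a grid search shows the squared loss is minimized at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, 10_000)  # pretend all observations share the same x

# Mean squared loss of predicting the constant c for every observation.
grid = np.linspace(-1, 3, 401)
losses = [np.mean((y - c) ** 2) for c in grid]
c_star = grid[np.argmin(losses)]

# The minimizer sits at the sample mean (up to the 0.01 grid spacing).
print(c_star, y.mean())
```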

0-1 loss in classification

  • for the Bayes classifier when $y$ is discrete (categorical)
  • where:
    • $L = \mathbf{1}\mathbf{1}' - I_n$, we have:
    • $f(x) = e_k$, where $k = \arg\max_k \mathbf{P}(y_k = 1 \mid x)$
    • K-classification: equivalence of the $\mathcal{L}_2$ loss for regression and the 0-1 loss for the Bayes classifier
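The rule takes only a few lines. A hedged sketch: the posterior matrix `post` is made up for illustration, not taken from the book.

```python
import numpy as np

# Assumed posterior probabilities P(y_k = 1 | x_i) for 4 points, K = 3 classes.
post = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
    [0.5, 0.5, 0.0],
])

bayes = post.argmax(axis=1)  # f(x) = e_k with k = argmax_k P(y_k = 1 | x)

# Expected 0-1 loss at each point is 1 - max_k P(y_k = 1 | x);
# any other deterministic rule can only do worse pointwise.
bayes_risk = 1 - post.max(axis=1)
other_rule = np.zeros(len(post), dtype=int)  # e.g. always predict class 0
other_risk = 1 - post[np.arange(len(post)), other_rule]
print(bayes, bayes_risk.mean(), other_risk.mean())
```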

Remark

  • if $f$ is an additive model:
    • the optimal estimate for the additive model can use techniques such as k-nearest neighbors to approximate the univariate conditional expectations simultaneously for each coordinate function
      • P.F 1.4
  • if $l$ is the absolute loss:
    • $f(x) = \text{median}(y \mid x)$
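The same grid-search check as for the squared loss (again my own sketch): under the absolute loss the minimizing constant is the sample median, not the mean, which is easiest to see with skewed data.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(1.0, 10_000)  # skewed: mean near 1.0, median near 0.69

# Mean absolute loss of predicting the constant c for every observation.
grid = np.linspace(0, 3, 601)
abs_losses = [np.mean(np.abs(y - c)) for c in grid]
c_star = grid[np.argmin(abs_losses)]

print(c_star, np.median(y), y.mean())  # c_star tracks the median, not the mean
```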

miscellanea

Mean squared error and residual sum of squares:

P.F

Ch1

P.F 1.1

$\mathbf{R} = \int_{y,x}(y-f(x))^2\; d\mathbf{P}(x,y)$; with $f$ constrained to the linear class, pointwise minimization cannot be used:
$\mathbf{R} = \int_{y,x}(y-\beta'x)^2\; d\mathbf{P}(x,y)$, convex in $\beta$
$\partial \mathbf{R}/\partial \beta = -\int_{y,x} 2x(y-x'\beta)\; d\mathbf{P}(x,y) = 0 \Rightarrow \mathbf{E}(xy) = \mathbf{E}(xx')\beta$
$\beta = \mathbf{E}(xx')^{-1}\mathbf{E}(xy)$

$\mathbf{E}(xx')$ may not be invertible
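When the sample analogue $X'X$ of $\mathbf{E}(xx')$ is singular — e.g. a duplicated column — one common fix is to replace the inverse with the Moore–Penrose pseudoinverse, which picks the minimum-norm solution of the normal equations. A sketch of my own, not from the book:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1, rng.normal(size=n)])  # first two columns identical
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(0, 0.1, n)

XtX = X.T @ X                           # rank 2: not invertible
beta = np.linalg.pinv(XtX) @ X.T @ y    # minimum-norm normal-equation solution

# lstsq also returns the minimum-norm least-squares solution,
# so the two fits (and in particular their predictions) coincide.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ beta, X @ beta_lstsq))
```

Note that only the fitted values are identified here; the individual coefficients on the duplicated columns are not, and the pseudoinverse splits their (identifiable) sum between them.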

P.F 1.3

The Bayesian classifier:
$\mathbf{R} = \int_{y,x} y'Lf(x)\; d\mathbf{P}(x,y)$; pointwise minimization, $\forall x$:
$\mathbf{R}_x = \int_{y|x} y'L f\; d\mathbf{P}(y|x)$
$\quad = \int_{y|x} y'\, d\mathbf{P}(y|x) \cdot L \cdot f = \mathbf{E}(y|x)'(\mathbf{11}' - I_n)\, f$
$\quad = 1 - \mathbf{E}(y|x)'f$ (since $\mathbf{E}(y|x)'\mathbf{1} = 1$ and $\mathbf{1}'f = 1$ for one-hot $f$), thus we have $\forall x$ in the support:

$f(x) = e_k,\quad k = \arg\max_k \mathbf{E}(y_k \mid x) = \arg\max_k \mathbf{P}(y_k = 1 \mid x)$
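A numerical spot-check of this derivation (my own sketch): for a probability vector $p = \mathbf{E}(y \mid x)$ and one-hot $f = e_k$, the pointwise risk $p'(\mathbf{11}' - I)e_k$ equals $1 - p_k$, so it is minimized by the argmax class.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5
p = rng.dirichlet(np.ones(K))     # E(y|x): posterior class probabilities
L = np.ones((K, K)) - np.eye(K)   # 0-1 loss matrix, L = 11' - I

# Pointwise risk of predicting f = e_k, for each k.
risks = np.array([p @ L @ np.eye(K)[:, k] for k in range(K)])

print(np.allclose(risks, 1 - p))     # R_x = 1 - E(y|x)'f
print(risks.argmin() == p.argmax())  # minimized by the argmax class
```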

P.F. 1.4

Consider the regression model for K-classification using the $\mathcal{L}_2$ loss:
$\mathbf{R} = \int_{y,x}\|y-f(x)\|^2\; d\mathbf{P}(x,y)$
In this case the response takes values $y \in \{\mathbf{e}_k\}_{k=1}^{K}$, while $f$ maps $\mathcal{X} \subseteq \mathbb{R}^m$ to $\mathbb{R}^K$
$f = \mathbf{E}(y \mid x)$, i.e. coordinate-wise $f_k(x) = \mathbf{P}(y_k = 1 \mid x)$; the result is exact
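A small simulation of this last point — my own sketch, with an assumed two-level covariate and made-up class probabilities `true_p`: least squares on one-hot targets recovers the conditional class probabilities $\mathbf{E}(y_k \mid x) = \mathbf{P}(y_k = 1 \mid x)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 5_000, 3
x = rng.integers(0, 2, n)                    # a binary covariate
true_p = np.array([[0.6, 0.3, 0.1],          # P(y | x = 0)
                   [0.2, 0.2, 0.6]])         # P(y | x = 1)
labels = np.array([rng.choice(K, p=true_p[xi]) for xi in x])
Y = np.eye(K)[labels]                        # one-hot responses

# Saturated design: one indicator per level of x, so least squares
# reduces to groupwise means of the one-hot responses.
D = np.column_stack([x == 0, x == 1]).astype(float)
B, *_ = np.linalg.lstsq(D, Y, rcond=None)    # row j estimates P(y | x = j)
print(B.round(2))
```

Each fitted row is a probability vector (sums to one) and approaches the corresponding row of `true_p` as $n$ grows.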