Notes on the Elements of Statistical Learning, Ch4, 1
Notation
$x \in \mathcal{X} \subset \mathbb{R}^p$: the feature vector; taking $N$ samples, we have $\mathbf{X} = \{x_i^T\}_{i=1}^N$
$\mathcal{G} = \{1, \dots, K\}$: the set of categories, where $g(x) \in \mathcal{G}$. Alternatively, we can denote the class of $x$ as $y$, where $y \in \{e_k\}$ and $e_k$ is the $K$-vector with 1 in the $k$-th dimension and 0 elsewhere.
$\delta_k(x)$: discriminant function (methods of this kind classify to the $\arg\max$ over $k$). A discriminant function that is linear in $x$ results in a linear decision boundary. The rationale for using the $\arg\max$ comes from the categorical distribution (where $E(y \mid x) = p$, the vector of class probabilities).
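To make the notation concrete, here is a minimal numpy sketch; the coefficients `b0`, `B` and the test point are made-up placeholders. It encodes classes as one-hot vectors $e_k$ and classifies to the $\arg\max$ of linear discriminant functions $\delta_k(x) = \beta_{k0} + \beta_k^T x$.

```python
import numpy as np

K, p = 3, 2                       # K classes, p features

def one_hot(k, K):
    """Return e_k: the K-vector with 1 in dimension k (0-indexed), 0 elsewhere."""
    e = np.zeros(K)
    e[k] = 1.0
    return e

# Hypothetical linear discriminants delta_k(x) = b0[k] + B[k] @ x (placeholder values)
b0 = np.array([0.1, -0.2, 0.0])
B = np.array([[1.0, 0.5],
              [-0.5, 1.0],
              [0.2, -1.0]])

def classify(x):
    """Classify to argmax_k delta_k(x); return the label and its one-hot encoding."""
    deltas = b0 + B @ x
    k = int(np.argmax(deltas))
    return k, one_hot(k, K)

k_hat, y_hat = classify(np.array([0.3, -0.7]))
print(k_hat, y_hat)
```

Since each $\delta_k$ is affine in $x$, the boundary between classes $k$ and $l$ is $\{x : \delta_k(x) = \delta_l(x)\}$, a hyperplane, hence the linear decision boundary.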
Recap
The 0-1 loss
In Ch. 2, with the 0-1 loss $l(f, y) = 1 - \mathbf{1}(f = y)$, where $\mathbf{1}(\cdot)$ is the indicator function, the optimal rule is the Bayes classifier. Here we briefly review the relationship:
$g(x) = k \Leftrightarrow y = e_k$
Then $l(f, y) = y^T L f$, where $L$ is the matrix with 0 on its diagonal and 1 elsewhere, i.e., $L = \mathbf{1}\mathbf{1}^T - I_K$
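A quick numpy sanity check (with an arbitrary $K = 4$) that $y^T L f$ with $L = \mathbf{1}\mathbf{1}^T - I_K$ reproduces the 0-1 loss on one-hot vectors:

```python
import numpy as np

K = 4
L = np.ones((K, K)) - np.eye(K)   # 0 on the diagonal, 1 elsewhere

def e(k):
    """One-hot vector e_k."""
    v = np.zeros(K)
    v[k] = 1.0
    return v

print(e(2) @ L @ e(2))   # 0.0: loss is 0 when f = y
print(e(2) @ L @ e(3))   # 1.0: loss is 1 when f != y
```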
The Bayes classifier:
$$\min_{f:\, x \mapsto y} \; R = \int y^T L f(x) \, dP(x, y)$$
Since $f$ is unconstrained, we can minimize pointwise. For every $x$:
$$R_x = \int y^T L f \, dP(y \mid x) = \left( \int y^T \, dP(y \mid x) \right) L f = E(y \mid x)^T (\mathbf{1}\mathbf{1}^T - I_K) f = 1 - E(y \mid x)^T f,$$
using $E(y \mid x)^T \mathbf{1} = 1$ (the posterior probabilities sum to 1) and $\mathbf{1}^T f = 1$ ($f$ is one-hot). Thus, for all $x$ in the support:
$$f(x) = \arg\max_k E(y_k \mid x) = \arg\max_k P(y_k = 1 \mid x)$$
That is, we classify $x$ to the most probable class, $\hat{y} = f(x)$, or (as in nearest-neighbor methods) to the dominant class in $N(x)$, where $N(x)$ is some neighborhood of $x$.
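As a sketch of this rule, assuming we already have estimates of the posterior probabilities $P(y_k = 1 \mid x)$ (the posterior vector below is made up) or the labels of points falling in a neighborhood $N(x)$:

```python
import numpy as np
from collections import Counter

def bayes_classify(posterior):
    """Bayes rule: classify to argmax_k P(y_k = 1 | x)."""
    return int(np.argmax(posterior))

def neighborhood_classify(neighbor_labels):
    """Neighborhood approximation: the dominant class g(x_i) among points x_i in N(x)."""
    return Counter(neighbor_labels).most_common(1)[0][0]

# Made-up posterior for one x with K = 3 classes
print(bayes_classify(np.array([0.2, 0.5, 0.3])))   # -> 1
# Made-up labels of the points falling in N(x)
print(neighborhood_classify([1, 2, 1, 1, 0]))      # -> 1
```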