# Machine learning notes: Least square error in linear regression

Least square error is used as the cost function in linear regression. But why should one choose squared error instead of absolute error, or some other choice? There’s a simple argument showing that least square error is a reasonable and natural choice.

Assume the target variable and inputs are related as below:

$$y^i=\theta^Tx^i+\epsilon^i$$

where $$\epsilon^i\sim\mathcal{N}(0,\,\sigma^{2})$$

i.e.

$$p(\epsilon^i)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^i)^2}{2\sigma^2}\right)$$

implies that

$$p(y^i\mid x^i;\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^i-\theta^Tx^i)^2}{2\sigma^2}\right)$$
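
As a concrete illustration, here is a minimal sketch in Python/NumPy that simulates data under exactly this assumption (the values of `theta_true`, `sigma`, and the dimensions are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 200, 3                          # hypothetical sample size and feature count
theta_true = np.array([1.5, -2.0, 0.5])
sigma = 0.3                            # noise standard deviation

X = rng.normal(size=(m, n))            # design matrix; row i is x^i
eps = rng.normal(0.0, sigma, size=m)   # epsilon^i ~ N(0, sigma^2)
y = X @ theta_true + eps               # y^i = theta^T x^i + epsilon^i
```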

We would like to minimize the error by maximizing the likelihood. The likelihood function is

$$L(\theta)=L(\theta;X,\vec{y})=p(\vec{y}\mid X;\theta)$$
$$L(\theta)=\prod_{i=1}^{m}p(y^i\mid x^i;\theta)=\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^i-\theta^Tx^i)^2}{2\sigma^2}\right)$$

Maximizing the log likelihood function (the log turns the product into a sum):

$$l(\theta)$$
$$=\log L(\theta)$$
$$=\log\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^i-\theta^Tx^i)^2}{2\sigma^2}\right)$$
$$=\sum_{i=1}^{m}\log\left(\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^i-\theta^Tx^i)^2}{2\sigma^2}\right)\right)$$
$$=m\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m}(y^i-\theta^Tx^i)^2$$

which is equivalent to minimizing

$$\frac{1}{2}\sum_{i=1}^{m}(y^i-\theta^Tx^i)^2$$

which is also known as the least square function; note that $$\sigma^2$$ is irrelevant here.
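
To make this concrete, here is a small numerical check (reusing `X`, `y`, and `sigma` from the sketch above) that the sum of per-example Gaussian log-densities equals the closed form derived here, so maximizing $$l(\theta)$$ can only act through the squared-error term:

```python
def log_likelihood(theta, X, y, sigma):
    """Sum over i of log p(y^i | x^i; theta), each term a Gaussian log-density."""
    resid = y - X @ theta
    return np.sum(-np.log(np.sqrt(2 * np.pi) * sigma) - resid**2 / (2 * sigma**2))

def closed_form(theta, X, y, sigma):
    """m*log(1/(sqrt(2*pi)*sigma)) - (1/sigma^2)*(1/2)*sum of squared residuals."""
    resid = y - X @ theta
    return len(y) * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - (resid @ resid) / (2 * sigma**2)

theta0 = np.zeros(3)                   # any theta works for the identity check
assert np.isclose(log_likelihood(theta0, X, y, sigma),
                  closed_form(theta0, X, y, sigma))
```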

Note that the least-square method corresponds to maximum likelihood estimation. Hence, one can justify the least-square method under the natural assumption that $$\epsilon^i\sim\mathcal{N}(0,\,\sigma^{2})$$.
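
Continuing the sketch, the maximum-likelihood $$\theta$$ can therefore be computed as an ordinary least-squares fit; notice that $$\sigma$$ is not even an input:

```python
# Minimize (1/2) * sum (y^i - theta^T x^i)^2 directly with np.linalg.lstsq.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                       # close to theta_true = [1.5, -2.0, 0.5]
```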

*It’s my first time using KaTeX, and it’s tough writing mathematical equations in markdown files. I chose a simple example as practice. Here is the GitHub repo of KaTeX: https://github.com/Khan/KaTeX*