Machine learning notes: Sigmoid function, softmax function, and exponential family

The sigmoid and softmax functions are commonly used in machine learning. Like the least-squares error in linear regression, they are not arbitrary choices: they can be derived from a few basic assumptions using the general form of the exponential family. Some basic linear regression and classification algorithms can be derived from the same general form. Let's dig in and see where these mysterious functions come from.

Exponential family

The exponential family includes the Gaussian, binomial, multinomial, Poisson, Gamma, and many other distributions. Loosely speaking, a distribution belongs to the exponential family if it can be written in the general form:

$$p(x|\eta)=h(x)\exp(\eta^T T(x)-A(\eta))$$

where
$$\eta$$ is the canonical (natural) parameter

$$T(x)$$ is the sufficient statistic

$$A(\eta)$$ is the cumulant function (also called the log-partition function)

$$h(x)$$ is the base measure, which does not depend on $$\eta$$

The precise regularity conditions for the exponential family are more technical. They are discussed here: http://stats.stackexchange.com/questions/187533/exponential-family-regularity-conditions

Nice properties of the general form

The general form of the exponential family has several nice properties for constructing machine learning models.

1. Calculating moments
The first derivative of the cumulant function $$A(\eta)$$ with respect to $$\eta$$ is the mean of the sufficient statistic $$T(x)$$, and the second derivative is its variance. In other words, $$A(\eta)$$ plays the role of a cumulant generating function for exponential family distributions, which gives an alternative way to calculate the moments of a distribution. Computing moments directly, or through the moment generating function, requires evaluating an integral; with the cumulant function we only need to take derivatives, which is much simpler.
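As a quick numerical check of this property, the sketch below uses the Bernoulli cumulant function $$A(\eta)=log(1+e^\eta)$$ that is derived later in these notes. The helper names (`log_partition`, `derivative`, `second_derivative`) and the finite-difference step sizes are illustrative assumptions, not part of any library.

```python
import math

def log_partition(eta):
    """Bernoulli cumulant function A(eta) = log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

eta = 0.5                                  # arbitrary canonical parameter
pi = 1 / (1 + math.exp(-eta))              # Bernoulli mean for this eta

mean_numeric = derivative(log_partition, eta)         # should approximate pi
var_numeric = second_derivative(log_partition, eta)   # should approximate pi * (1 - pi)

print(mean_numeric, pi)
print(var_numeric, pi * (1 - pi))
```

The two printed pairs should agree to several decimal places, which is what the derivative property predicts for the Bernoulli case.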

2. Obtaining sufficient statistics
The sufficient statistic $$T(x)$$ can be obtained by inspection once the distribution is written in the general form. The intuitive explanation of sufficiency is: having observed $$T(x)$$, we can throw away $$x$$ for the purposes of inference with respect to the parameter $$\eta$$.
For example, $$T(x)=x$$ is the sufficient statistic for the Bernoulli distribution, and $$T(x)=[x,x^2]$$ is the sufficient statistic for the Gaussian distribution.

3. Obtaining a general formula for maximum likelihood estimation
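We can obtain a general formula for the maximum likelihood estimates of the parameters of exponential family distributions. For example, given $$N$$ independent samples, the maximum likelihood estimate of the mean parameter $$\mu=E[T(x)]$$ is the sample average of the sufficient statistic:
$$\mu_{ML}=\frac{1}{N}\sum^N_{n=1}T(x_n)$$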
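The formula can be sketched in a few lines of Python. The function name `mean_sufficient_stats` and the toy samples are assumptions made for illustration. Using the Gaussian sufficient statistic $$T(x)=[x,x^2]$$ mentioned above, the averaged statistics estimate the first and second moments, from which the usual mean and variance estimates follow.

```python
def mean_sufficient_stats(samples, T):
    """Average the sufficient statistic T(x) over all samples."""
    stats = [T(x) for x in samples]
    n = len(stats)
    dim = len(stats[0])
    return [sum(s[i] for s in stats) / n for i in range(dim)]

# Gaussian case: T(x) = (x, x^2)
samples = [1.0, 2.0, 3.0, 4.0]             # toy data, assumed for illustration
m1, m2 = mean_sufficient_stats(samples, lambda x: (x, x * x))
mu_ml = m1                                  # estimate of E[x]
var_ml = m2 - m1 ** 2                       # estimate of E[x^2] - E[x]^2

# Bernoulli case: T(x) = (x,), so the estimate is the fraction of ones
flips = [1, 0, 1, 1]
(pi_ml,) = mean_sufficient_stats(flips, lambda x: (x,))

print(mu_ml, var_ml, pi_ml)
```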

Transforming distributions into general form

It is easy to transform a distribution into the general form, and doing so gives insight into the distribution.

Consider the Bernoulli distribution:

$$p(x|\pi) =\pi^x(1-\pi)^{1-x} =\exp\left(\log\left(\frac{\pi}{1-\pi}\right)x+\log(1-\pi)\right)$$

where
$$\eta=\log\left(\frac{\pi}{1-\pi}\right)$$

$$T(x)=x$$

$$A(\eta)=-\log(1-\pi)=\log(1+e^\eta)$$

$$h(x)=1$$
Solving for $$\pi$$ in terms of $$\eta$$, we have:
$$\pi=\frac{1}{1+e^{-\eta}}$$
which is the sigmoid function.
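The relationship between the canonical parameter (the log-odds of $$\pi$$) and the sigmoid can be checked with a small sketch. The function names `logit` and `sigmoid` are chosen here for illustration, and the branch on the sign of `eta` is just one common way to avoid overflow in `math.exp`.

```python
import math

def logit(pi):
    """Canonical parameter of the Bernoulli: eta = log(pi / (1 - pi))."""
    return math.log(pi / (1 - pi))

def sigmoid(eta):
    """Inverse of logit: pi = 1 / (1 + e^(-eta)), written to avoid overflow."""
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    z = math.exp(eta)          # eta < 0, so z is in (0, 1) and cannot overflow
    return z / (1 + z)

pi = 0.3
eta = logit(pi)
print(eta, sigmoid(eta))       # sigmoid(logit(pi)) recovers pi up to rounding
```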

Similarly, we can transform the multinomial distribution and obtain:
$$\pi_k=\frac{e^{\eta_k}}{\sum^K_{j=1}e^{\eta_j}}$$
which is the softmax function.
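A minimal sketch of the softmax follows. The function name `softmax` and the max-subtraction trick are assumptions added for illustration: subtracting the largest $$\eta_j$$ before exponentiating leaves the result unchanged but avoids overflow.

```python
import math

def softmax(etas):
    """Map canonical parameters eta_1..eta_K to probabilities pi_1..pi_K."""
    m = max(etas)                               # shift for numerical stability
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [v / total for v in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))                        # probabilities summing to 1

# With two classes and the second eta fixed at 0, softmax reduces to the sigmoid.
two_class = softmax([2.0, 0.0])[0]
print(two_class, 1 / (1 + math.exp(-2.0)))
```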

We can also derive regression models using distributions in the exponential family. For example:

1. Assuming the data are corrupted by Gaussian noise, we can model the target with a Gaussian distribution, which gives rise to the least-squares error.
2. For two-class classification problems, we can model the label with a Bernoulli distribution, which gives rise to binary classification using the sigmoid function.
3. For multi-class classification, we can model the label with a multinomial distribution, which gives rise to multi-class classification using the softmax function.
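
To make the first item concrete, here is a short sketch of the argument, assuming a linear model with weights $$w$$, inputs $$x_n$$, targets $$y_n$$ and a fixed noise variance $$\sigma^2$$ (symbols introduced only for this illustration):

$$y_n=w^Tx_n+\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(0,\sigma^2)$$

$$\log p(y_1,\dots,y_N|w)=-\frac{1}{2\sigma^2}\sum^N_{n=1}(y_n-w^Tx_n)^2+\text{const}$$

Since the constant does not depend on $$w$$, maximising the log-likelihood over $$w$$ is the same as minimising the sum of squared errors $$\sum^N_{n=1}(y_n-w^Tx_n)^2$$.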

Moreover, other models such as Poisson regression can also be derived from the general form of the exponential family. This approach of plugging in a chosen distribution under the corresponding assumptions is called the generalised linear model (GLM).