My Note

Logistic regression

1. Concept

Logistic regression addresses binary (0/1) classification problems: it estimates the likelihood of a single event occurring.

You want to find the set of parameters $(b, w_1, w_2, \ldots, w_n)^*$ that best fits the given training data, so that the model returns a value estimating the probability of the result being 1 (the event happens).

```mermaid
flowchart LR
    A(["x in R^n"]) --> B(["z = w^T x + b"])
    C(["w in R^n"]) --> B
    D(["b"]) --> B
    B --> E(["a = sigmoid(z)"])
    E --> F(["L(a, y)"])
```
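The forward pass above can be sketched in NumPy; the values of `x`, `w`, and `b` below are made-up illustrative numbers, not from any real dataset:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# One feature vector x in R^n with its weights w and bias b (hypothetical values).
x = np.array([1.5, -0.3, 2.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.05

z = w @ x + b    # linear score, z = w^T x + b
a = sigmoid(z)   # a = estimated P(y = 1 | x)
```

The output `a` is the quantity fed into the loss $L(a, y)$ in the diagram.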



2. About the loss function

From the perspective of MLE (Maximum Likelihood Estimation), we naturally assume each of the $m$ labels $y$ follows $y \sim \mathrm{Ber}(\hat{y})$ (a Bernoulli distribution whose parameter is the model output $\hat{y} = a$). Its PMF can be written as a single expression:

\(\hat{y}^{y}(1-\hat{y})^{(1-y)}\)

Since the $m$ examples are independent, the likelihood is the product

\[L(\hat{y}) = \prod_{i = 1}^{m}(\hat{y}^{(i)})^{y^{(i)}}(1-\hat{y}^{(i)})^{1-y^{(i)}}\]

Take the logarithm (turning the product into a sum), negate, and average over the $m$ examples to get:

\(J(w) = -\frac{1}{m}\sum_{i = 1}^{m}\left[y^{(i)}\log{\hat{y}^{(i)}} + (1 - y^{(i)})\log{(1 - \hat{y}^{(i)})}\right]\)

which is something we want to minimize.
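This cross-entropy loss is a one-liner in NumPy. A minimal sketch, with made-up labels and predictions (the small `eps` guarding against `log(0)` is an implementation detail, not part of the formula):

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """J(w): average negative log-likelihood over the m examples."""
    eps = 1e-12  # avoids log(0) when a prediction saturates at 0 or 1
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

# Hypothetical labels and model outputs for m = 4 examples.
y = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8, 0.6])
loss = cross_entropy_loss(y_hat, y)
```

Confident, correct predictions (like 0.9 for a label of 1) contribute little to the loss; confident wrong ones are penalized heavily.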

3. Gradient descent

Then we use gradient descent to find the optimal $w \in \mathbb{R}^{n}$ (and $b$):

\(\frac{\partial J(w)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})x_j^{(i)}\), where $\hat{y}^{(i)} = \sigma(w^{T}x^{(i)} + b)$,

for each $w_j \in w$.

Then,

\(w_j := w_j - \alpha\frac{\partial J(w)}{\partial w_j}\)

Since $J(w)$ is convex, this converges to the global optimum for a suitably small learning rate $\alpha$.
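Putting the gradient and the update rule together, a minimal batch gradient descent sketch (the toy data, `alpha`, and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent on J(w, b). X has shape (m, n), y has shape (m,)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        y_hat = sigmoid(X @ w + b)       # predictions for all m examples
        grad_w = X.T @ (y_hat - y) / m   # dJ/dw_j = mean of (y_hat - y) * x_j
        grad_b = np.mean(y_hat - y)      # same derivation with x_j = 1
        w -= alpha * grad_w              # update rule: w_j := w_j - alpha * dJ/dw_j
        b -= alpha * grad_b
    return w, b

# Toy separable data: label is 1 exactly when the single feature is positive.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logistic(X, y)
```

On this toy data the learned weight is positive, so thresholding the predicted probability at 0.5 recovers the labels.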