Some math calculation tips

Derivatives of trigonometric functions

  1. $\frac{d}{d x} \arcsin x=\frac{1}{\sqrt{1-x^{2}}}$

Proof:
$$
\begin{aligned}
y &= \arcsin x \\
\sin y &= x \\
\cos y \, \frac{d y}{d x} &= 1 \\
\frac{d y}{d x} &= \frac{1}{\cos y} = \frac{1}{\sqrt{1-x^{2}}}
\end{aligned}
$$
(Since $y \in [-\pi/2, \pi/2]$, we have $\cos y = \sqrt{1-\sin^{2} y} = \sqrt{1-x^{2}} \geq 0$, so the positive root is the right one.)
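A quick numerical sanity check (a minimal sketch using only the standard library; the test points and the step size `h` are arbitrary choices):

```python
import math

# Central-difference check of d/dx arcsin(x) = 1/sqrt(1 - x^2)
h = 1e-6
for x in [0.0, 0.3, 0.7]:
    numeric = (math.asin(x + h) - math.asin(x - h)) / (2 * h)
    exact = 1 / math.sqrt(1 - x**2)
    print(f"x={x}: numeric={numeric:.6f}, exact={exact:.6f}")  # agree to ~6 digits
```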

Characteristic function of normal distribution

The characteristic function of the normal distribution with mean $\mu$ and variance $\sigma^2$ is
$$
\phi(t)=e^{i t \mu-\frac{1}{2} t^{2} \sigma^{2}}
$$
Proof:
$$
\begin{aligned}
\phi(t) &= E\left[e^{i t X}\right] \\
&= \int_{x \in \mathbb{R}} e^{i t x} \frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{(x-\mu)^{2}}{2 \sigma^{2}}} \mathrm{d} x \\
&= \frac{1}{\sqrt{2 \pi \sigma^{2}}} \int_{x \in \mathbb{R}} e^{i t x} e^{-\frac{(x-\mu)^{2}}{2 \sigma^{2}}} \mathrm{d} x \\
&= \frac{1}{\sqrt{2 \pi \sigma^{2}}} \int_{x \in \mathbb{R}} e^{i t x-\frac{(x-\mu)^{2}}{2 \sigma^{2}}} \mathrm{d} x
\end{aligned}
$$
By completing the square with $k=\mu+i t \sigma^{2}$, one can verify
$$
i t x-\frac{(x-\mu)^{2}}{2 \sigma^{2}}=-\frac{(x-k)^{2}-2 \mu i t \sigma^{2}+t^{2} \sigma^{4}}{2 \sigma^{2}}
$$
We can then simplify the integral above:
$$
\begin{aligned}
\int_{x \in \mathbb{R}} e^{i t x-\frac{(x-\mu)^{2}}{2 \sigma^{2}}} \mathrm{d} x &= \int_{x \in \mathbb{R}} e^{-\frac{(x-k)^{2}-2 \mu i t \sigma^{2}+t^{2} \sigma^{4}}{2 \sigma^{2}}} \mathrm{d} x \\
&= e^{\frac{2 \mu i t \sigma^{2}-t^{2} \sigma^{4}}{2 \sigma^{2}}} \int_{x \in \mathbb{R}} e^{-\frac{(x-k)^{2}}{2 \sigma^{2}}} \mathrm{d} x \\
&= e^{i \mu t-\frac{1}{2} t^{2} \sigma^{2}} \int_{x \in \mathbb{R}} e^{-\left(\frac{x-k}{\sqrt{2} \sigma}\right)^{2}} \mathrm{d} x
\end{aligned}
$$
Inside the integral is the kernel of another normal density, with complex mean $k$ (the shift of the integration contour is justified by Cauchy's theorem), so $\int_{x \in \mathbb{R}} e^{-\frac{(x-k)^{2}}{2 \sigma^{2}}} \mathrm{d} x=\sqrt{2 \pi \sigma^{2}}$. Matching this constant against the $\frac{1}{\sqrt{2 \pi \sigma^{2}}}$ in front gives the result.
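A Monte Carlo sanity check of the closed form (a sketch; $\mu$, $\sigma$, $t$, the seed, and the sample size are arbitrary):

```python
import numpy as np

# Estimate phi(t) = E[exp(itX)] from normal samples and compare with the formula
rng = np.random.default_rng(0)
mu, sigma, t = 1.0, 2.0, 0.5
x = rng.normal(mu, sigma, size=1_000_000)
empirical = np.mean(np.exp(1j * t * x))                    # sample average of e^{itX}
closed_form = np.exp(1j * t * mu - 0.5 * t**2 * sigma**2)  # e^{it*mu - t^2*sigma^2/2}
print(empirical, closed_form)  # agree to ~2-3 decimal places
```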

Jensen’s inequality

If $\varphi$ is a convex function, then
$$
\varphi(\mathrm{E}[X]) \leq \mathrm{E}[\varphi(X)]
$$
Proof:


A simple way to see it: take an extreme distribution for $X$, say the two-point distribution with $P(X=0)=1/2$ and $P(X=1)=1/2$. Because a convex function grows by larger and larger increments as $x$ increases, $\varphi(E[X])=\varphi(1/2)$ is clearly no larger than $\frac{1}{2}(\varphi(0)+\varphi(1))$; indeed, we can make $\varphi(1)$ enormous without violating convexity, which only inflates the right-hand side.
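The two-point example above, in code (a sketch; the convex function $\varphi(x)=e^{5x}$ stands in for a "rapidly growing" $\varphi$):

```python
import math

# P(X=0) = P(X=1) = 1/2, phi convex
phi = lambda x: math.exp(5 * x)
E_X = 0.5 * 0 + 0.5 * 1                  # E[X] = 1/2
E_phiX = 0.5 * phi(0) + 0.5 * phi(1)     # E[phi(X)]
print(phi(E_X), E_phiX)                  # phi(E[X]) ≈ 12.18 <= E[phi(X)] ≈ 74.71
```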

Joint and marginal distributions of $\bar{X}$ and $S^{2}$

Let $X_{1}, \ldots, X_{n}$ be i.i.d. random variables having a common distribution $P$, and let $X = (X_1,\ldots,X_n)$. The sample mean $\bar{X}$ and sample variance $S^2$ are two commonly used statistics. Can we find the joint or the marginal distributions of $\bar{X}$ and $S^2$?

First, $E \bar{X}=\mu$ and $E S^{2}=\sigma^{2}$.

To show $E S^{2}=\sigma^{2}$:
$$
\begin{aligned}
\mathbb{E}\left(\sum\left(X_{i}-\bar{X}\right)^{2}\right) &= \mathbb{E}\left(\sum X_{i}^{2}-2 \bar{X} \sum X_{i}+n \bar{X}^{2}\right)=\sum \mathbb{E}\left(X_{i}^{2}\right)-\mathbb{E}\left(n \bar{X}^{2}\right) \\
\sum \mathbb{E}\left(X_{i}^{2}\right)-\mathbb{E}\left(n \bar{X}^{2}\right) &= \sum \mathbb{E}\left(X_{i}^{2}\right)-n \mathbb{E}\left(\bar{X}^{2}\right)=n \sigma^{2}+n \mu^{2}-\sigma^{2}-n \mu^{2}
\end{aligned}
$$
Using $\mathbb{E}\left(\bar{X}^{2}\right)=\operatorname{Var}(\bar{X})+\mu^{2}=\sigma^{2}/n+\mu^{2}$, this simplifies to $(n-1) \sigma^{2}$.

So far, we have shown that $\mathbb{E}\left(\sum\left(X_{i}-\bar{X}\right)^{2}\right)=(n-1) \sigma^{2}$

$\mathbb{E}\left(S^{2}\right)=\mathbb{E}\left(\frac{\sum\left(X_{i}-\bar{X}\right)^{2}}{n-1}\right)=\frac{1}{n-1} \mathbb{E}\left(\sum\left(X_{i}-\bar{X}\right)^{2}\right)=\frac{(n-1) \sigma^{2}}{n-1}=\sigma^{2}$
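A simulation sketch of the unbiasedness (the parameters and seed are arbitrary; `ddof=1` is NumPy's $n-1$ divisor):

```python
import numpy as np

# Average of S^2 over many samples should be close to the true sigma^2 = 4
rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 2.0, 5, 200_000
samples = rng.normal(mu, sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)   # sample variance with the n-1 divisor
print(s2.mean(), sigma**2)         # both ≈ 4.0
```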

With a finite $E\left|X_{1}\right|^{3}$, we can obtain $E(\bar{X})^{3}$ and $\operatorname{Cov}\left(\bar{X}, S^{2}\right)$, and with a finite $E\left|X_{1}\right|^{4}$, we can obtain $\operatorname{Var}\left(S^{2}\right)$.

Second, let $Y_{i}=\left(X_{i}-\mu,\left(X_{i}-\mu\right)^{2}\right)$. ($Y_i$ is constructed this way so that the variance-covariance matrix below comes out in a clean form; if you instead write $Y_i = (X_i,(X_i-\mu)^2)$, the covariance computation is harder, although the final result is the same.) Then $Y_{1}, \dots, Y_{n}$ are i.i.d. random 2-vectors with $EY_1 = (0,\sigma^2)$ and variance-covariance matrix:
$$
\Sigma=\left(\begin{array}{cc}
\sigma^{2} & E\left(X_{1}-\mu\right)^{3} \\
E\left(X_{1}-\mu\right)^{3} & E\left(X_{1}-\mu\right)^{4}-\sigma^{4}
\end{array}\right)
$$
Note that $\bar{Y}=n^{-1} \sum_{i=1}^{n} Y_{i}=\left(\bar{X}-\mu, \tilde{S}^{2}\right)$, where $\tilde{S}^{2}=n^{-1} \sum_{i=1}^{n}\left(X_{i}-\mu\right)^{2}$.

Applying the bivariate CLT to the $Y_i$'s, we obtain that $\sqrt{n}\left(\bar{X}-\mu, \tilde{S}^{2}-\sigma^{2}\right) \rightarrow_{d} N_{2}(0, \Sigma)$.

Since $S^{2}=\frac{n}{n-1}\left[\tilde{S}^{2}-(\bar{X}-\mu)^{2}\right]$ and $\bar{X} \rightarrow_{a.s.} \mu$, we have $\sqrt{n}\left(S^{2}-\tilde{S}^{2}\right) \rightarrow_{p} 0$: indeed $\sqrt{n}(\bar{X}-\mu)^{2}=\left[\sqrt{n}(\bar{X}-\mu)\right](\bar{X}-\mu) \rightarrow_{d} 0$ and $\frac{n}{n-1} \rightarrow 1$. Slutsky's theorem then says $\sqrt{n}\left(S^{2}-\sigma^{2}\right)$ has the same limiting distribution as $\sqrt{n}\left(\tilde{S}^{2}-\sigma^{2}\right)$,

which leads to $\sqrt{n}\left(\bar{X}-\mu, S^{2}-\sigma^{2}\right) \rightarrow_{d} N_{2}(0, \Sigma)$, i.e., $S^{2}$ can replace $\tilde{S}^{2}$ in the limit.
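A simulation sketch of the joint limit, using Exp(1) data so that the off-diagonal of $\Sigma$ is nonzero: here $\mu=\sigma^{2}=1$, $E\left(X_{1}-\mu\right)^{3}=2$, $E\left(X_{1}-\mu\right)^{4}=9$, so $\Sigma=\left(\begin{smallmatrix}1 & 2 \\ 2 & 8\end{smallmatrix}\right)$ (the sample size, replications, and seed are arbitrary):

```python
import numpy as np

# Empirical covariance of sqrt(n)*(Xbar - mu, S^2 - sigma^2) for Exp(1) data
rng = np.random.default_rng(0)
n, reps = 500, 20_000
x = rng.exponential(1.0, size=(reps, n))
z = np.sqrt(n) * np.column_stack([x.mean(axis=1) - 1.0,
                                  x.var(axis=1, ddof=1) - 1.0])
print(np.cov(z.T))   # roughly [[1, 2], [2, 8]]
```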

Matrix derivatives (vector and matrix differentiation)

Vector-by-scalar: trivial

Scalar-by-vector:

For a scalar $y$ and a vector $\mathbf{x}=\left(x_{1}, \ldots, x_{n}\right)^{T}$,
$$
\frac{\partial y}{\partial \mathbf{x}}=\left(\frac{\partial y}{\partial x_{1}}, \ldots, \frac{\partial y}{\partial x_{n}}\right)
$$
(the gradient; whether it is written as a row or a column vector depends on the layout convention).

Vector-by-vector:

For $\mathbf{y}=\left(y_{1}, \ldots, y_{m}\right)^{T}$ and $\mathbf{x}=\left(x_{1}, \ldots, x_{n}\right)^{T}$,
$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)
$$
(the Jacobian matrix, in numerator layout).


Big O notation and small O notation

Big O notation: let $f(x) = 6 x^{4}-2 x^{3}+5$; then $f(x)=O\left(x^{4}\right)$ as $x \rightarrow \infty$.

Little o notation: $3n+4$ is $o(n^2)$, since $(3n+4)/n^{2} \rightarrow 0$.

Big Op notation

The notation $X_n = O_p(a_n)$ means that for any $\varepsilon > 0$ there exist a finite $M > 0$ and a finite $N$ such that $P\left(\left|X_{n} / a_{n}\right|>M\right)<\varepsilon$ for all $n>N$.
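A small simulation of the definition, with $X_{n}=\bar{X}_{n}-\mu$ for i.i.d. $N(0,1)$ data and $a_{n}=n^{-1/2}$, so that $X_{n}/a_{n}=\sqrt{n}\,\bar{X}_{n} \sim N(0,1)$ exactly (the choice $M=3$ and the sample sizes are arbitrary):

```python
import numpy as np

# P(|X_n / a_n| > M) stays below a small epsilon uniformly in n once M is large
rng = np.random.default_rng(0)
M, reps = 3.0, 10_000
for n in [10, 100, 1000]:
    ratio = np.sqrt(n) * rng.normal(size=(reps, n)).mean(axis=1)   # X_n / a_n
    print(n, np.mean(np.abs(ratio) > M))   # ≈ 0.003 for every n
```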

Least squares estimation: solving for $\beta$

Identities:

Scalar-by-vector identities: this link shows how to do matrix derivatives:

https://en.wikipedia.org/wiki/Matrix_calculus

The key scalar-by-vector identities (numerator layout):
$$
\frac{\partial \mathbf{a}^{T} \mathbf{x}}{\partial \mathbf{x}}=\mathbf{a}^{T}, \qquad \frac{\partial \mathbf{x}^{T} A \mathbf{x}}{\partial \mathbf{x}}=\mathbf{x}^{T}\left(A+A^{T}\right)
$$
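With these identities, a sketch of the usual derivation (assuming $X^{T} X$ is invertible): minimize the residual sum of squares
$$
L(\beta)=(y-X \beta)^{T}(y-X \beta), \qquad \frac{\partial L}{\partial \beta}=-2 y^{T} X+2 \beta^{T} X^{T} X=0 \;\Longrightarrow\; \hat{\beta}=\left(X^{T} X\right)^{-1} X^{T} y
$$
A quick NumPy check of the closed form against the built-in least-squares solver (the data below are made up):

```python
import numpy as np

# Closed-form (X^T X)^{-1} X^T y vs. numpy's least-squares solver
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(beta_hat)
print(np.linalg.lstsq(X, y, rcond=None)[0])    # same numbers
```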

The $\Gamma$ integral

First, the recurrence formula for the $\Gamma$ function is $\Gamma(x+1)=x \Gamma(x)$.

For a positive integer $n$,
$$
\Gamma(n+1)=n !
$$

$$
\Gamma(n+1)=\int_{0}^{\infty} \mathrm{e}^{-x} x^{n+1-1} \mathrm{d} x=\int_{0}^{\infty} \mathrm{e}^{-x} x^{n} \mathrm{d} x
$$
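Applying the recurrence repeatedly gives $\Gamma(n+1)=n \Gamma(n)=\cdots=n !\,\Gamma(1)=n !$, since $\Gamma(1)=\int_{0}^{\infty} e^{-x} \mathrm{d} x=1$. A quick check with the standard library:

```python
import math

# Gamma(n+1) equals n! for positive integers n
for n in range(1, 8):
    print(n, math.gamma(n + 1), math.factorial(n))   # the two columns agree
```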

Taylor expansion (泰勒展开)

The Taylor series of a real or complex-valued function $f(x)$ that is infinitely differentiable at a real or complex number $a$ is the power series
$$
f(a)+\frac{f^{\prime}(a)}{1 !}(x-a)+\frac{f^{\prime \prime}(a)}{2 !}(x-a)^{2}+\frac{f^{\prime \prime \prime}(a)}{3 !}(x-a)^{3}+\cdots
$$
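For example, with $f(x)=e^{x}$ and $a=0$, every derivative is $e^{0}=1$, so
$$
e^{x}=1+x+\frac{x^{2}}{2 !}+\frac{x^{3}}{3 !}+\cdots=\sum_{n=0}^{\infty} \frac{x^{n}}{n !}
$$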

Limit superior and inferior

A number $t$ is the limit superior of a sequence $a_n$ if the following two conditions are both satisfied:

  • For every $s<t$ we have $s<a_n$ for infinitely many n’s.
  • For every $s>t$, we have $s<a_n$ for only finitely many n’s.

A number $t$ is the limit inferior of a sequence $a_n$ if the following two conditions are both satisfied:

  • For every $s>t$ we have $s>a_n$ for infinitely many n’s.
  • For every $s<t$, we have $s>a_n$ for only finitely many n’s (possibly none).

limit superior of a function

lim sup tries to ignore what happens on any bounded interval $[0,y]$ and tells you the supremum once you look further and further away. In other words, it is the limit of the suprema as we look further away:
$$
\limsup _{x \rightarrow \infty} f(x)=\lim _{y \rightarrow \infty} \sup _{x \geq y} f(x)
$$
For example, $\limsup _{x \rightarrow \infty} \sin x=1$ even though $\lim _{x \rightarrow \infty} \sin x$ does not exist.

L1 loss function yields the median

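A short sketch for a distribution with CDF $F$: to minimize $g(c)=E|X-c|$, differentiate under the expectation:
$$
g^{\prime}(c)=P(X<c)-P(X>c)=2 F(c)-1=0 \;\Longrightarrow\; F(c)=\tfrac{1}{2},
$$
so the minimizer is the median. The empirical version, minimizing $\sum_{i}\left|x_{i}-c\right|$, likewise gives the sample median.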

proof of $E(E(X|Z,Y)\mid Y)= E(X|Y)$

Use the definition of conditional expectation: $E(X \mid Y)$ is the $\sigma(Y)$-measurable random variable satisfying $E\left[E(X \mid Y) 1_{A}\right]=E\left[X 1_{A}\right]$ for every $A \in \sigma(Y)$. The candidate $W=E(E(X \mid Z, Y) \mid Y)$ is $\sigma(Y)$-measurable, and for any $A \in \sigma(Y) \subseteq \sigma(Y, Z)$,
$$
E\left[W 1_{A}\right]=E\left[E(X \mid Z, Y) 1_{A}\right]=E\left[X 1_{A}\right],
$$
so $W=E(X \mid Y)$ a.s.

Intuitive explanation for dividing by n-1 when calculating standard deviation

The standard deviation calculated with a divisor of $n-1$ is a standard deviation calculated from the sample as an estimate of the standard deviation of the population from which the sample was drawn. Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation calculated using deviations from the sample mean underestimates the desired standard deviation of the population. Using $n-1$ instead of $n$ as the divisor corrects for that by making the result a little bit bigger.

Note that the correction has a larger proportional effect when $n$ is small than when it is large, which is what we want, because when $n$ is larger the sample mean is likely to be a good estimator of the population mean.

When the sample is the whole population, we use the standard deviation with $n$ as the divisor because the sample mean is then the population mean.

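A simulation sketch of both claims (the seed and sample sizes are arbitrary; `ddof=0` is the divisor-$n$ estimator, `ddof=1` the divisor-$(n-1)$ one):

```python
import numpy as np

# The divisor-n estimator underestimates sigma^2 = 1 by the factor (n-1)/n,
# so the correction matters most when n is small
rng = np.random.default_rng(0)
for n in [2, 5, 50]:
    x = rng.normal(size=(200_000, n))
    print(n, x.var(axis=1, ddof=0).mean(), x.var(axis=1, ddof=1).mean())
    # n=2: ≈ 0.50 vs 1.00;  n=5: ≈ 0.80 vs 1.00;  n=50: ≈ 0.98 vs 1.00
```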

Several important limits

The two classic ones:
$$
\lim _{x \rightarrow 0} \frac{\sin x}{x}=1, \qquad \lim _{x \rightarrow \infty}\left(1+\frac{1}{x}\right)^{x}=e
$$

Cauchy-Schwarz inequality

For random variables: $(E[X Y])^{2} \leq E\left[X^{2}\right] E\left[Y^{2}\right]$. More generally, in an inner product space, $|\langle u, v\rangle|^{2} \leq\langle u, u\rangle \cdot\langle v, v\rangle$.

Inverse of a matrix

For an invertible square matrix $A$, $A^{-1}=\frac{1}{\operatorname{det} A} \operatorname{adj}(A)$, where $\operatorname{adj}(A)$ is the adjugate. In the $2 \times 2$ case,
$$
\left(\begin{array}{ll}
a & b \\
c & d
\end{array}\right)^{-1}=\frac{1}{a d-b c}\left(\begin{array}{cc}
d & -b \\
-c & a
\end{array}\right)
$$